Editorial Jane C. Blake, Editor Helen L. Patterson, Associate Editor Kathleen M. Stetson, Associate Editor Circulation Catherine M. Phillips, Administrator Sherry L. Gonzalez Production Terri Autieri, Production Editor Anne S. Katzeff, Typographer Peter R. Woodbury, Illustrator Advisory Board Samuel H. Fuller, Chairman Richard W Beane Donald 2.Harbert Richard J. Hollingsworth Alan G. Nemeth Jeffrey H. Rudy Stan Smits Michael C. Thurk Gayn B. Winters The Digital TecbnicalJoutnnl is published quarterly by Digital Equipment Corporation, 146 Main Street ML01-3/B68, Maynard, Massachusetts 01754-2571. Subscriptions to the Journal are $40.00 for four issues and must be prepaid in U.S. funds. University and col- lege professors and Ph.D. students in the electrical engineering and science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Technical./ournalat the published-by address. lnquirks can also be sent electro~icallyto DTJ~CRL.DEC.COM.S~~~I~copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 1 Burlington Woods Drive. Burlington, ,MA 01830-4597 Digital employeesmay send subscription orders on the ENET to RDVAX:,JOURNAL or by interoffice mall to mallstop ML01-3/B68. Orders should include badge number, site location code, and address. All employees must adv~seof changes of address. Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address. Copyright O 1993 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved. The information in theJournal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in theJortrnnl.

Documentation Number EY-J886E-DP The following are trademarks of Digital Equipment Corporation: ACMS, ALL-IN-1,Alpha AXE the AXP logo, AXT: DEC, DEC 3000 AXE DEC 4000 AXI: DEC 6000 AXI: DEC 7000 AXp DEC 10000 AXl? DEC DBMS for OpenVMS, DEC Fortran, DEC OSWl AXE DEC Pascal, DEC RAL1.Y- DEC Rdb for OpenVMS, DECchip 21064, DECnet, DECnet for OpenVMS &XI: DECnet for OpenVMSVAX, DECnet/OSI, DECnet-VAX, DECstation, DECstation 5000, DECwindows, DECWORLD, Digital, the Digital logo, DNA, OpenVMS. OpenVMS AXI: OpenVMS RMS OpenVMS VAX, PDP-11, Q-, Thinwire, TURBOchannel, IJLTRM, VAX, VAX-11/780, VAX 4000, VAX 6000, VAX 7000, VW 8700, VrLv 8800, VAX 10000, VAX Fortran, VAX Pascal, VMS, and VMScluster. CMY-I is a registered trademark of Cray Research, Inc HP is a registered trademark of Hewlett-Packard Company. Cover Design IBM is a registered trademark of International Business Machines, Inc. The DECchip 21064, the fzrst implementation LSI Logic is a trademark of LSI Logic Corporation of Digitalk Alpha AXP computer architecture, is a registered trademark of Apple Computer, Inc. is the world'sfastest single-chip micropocessolr MIPS is a trademark of MIPS Computer Systems, Inc. Represented on our couer by the RrlP logo, the Motorola is a registered trademark of Motorola, Inc. DECchip takes its phce among sywzbols of other OSWl is a registered trademark of Open Software Foundation, Inc. dewicesfrom computing history, including PAL is a registered trademark of , Inc. the vacuum tube, apuncb card, sketches of SPEC, SPECfp, SPECint, and SPECmark are registered trademarks of the Standard Bubbagek Analytical Engine, a wheel from the Performance Evaluation Cooperative. ~aciculke,and an abacus. SPICE is a trademark of the University of California at Berkeley The cover was aksi~nedby Deborah Fulck oJ is a registered trademark of UNM System Laboratories, Inc. Digital's Corporate Human Factors Group ,with Windows and Windows NT are trademarks of Corporation. the help of Knzn Design. Book production was done by Quantic Communications, Inc Contents

17 Forauord Robert M. Supnik

Alpha AXP Architecture and Systems

19 Alpha AXP Architecture Richartl L. Sites 35 A 200-MHz 64-bitDual-issue CMOS Microprocessor Daniel W Dobberpuhl. Richard T. Witek. Randy Allrnon. Robert Anglin, Davitl Bertilcci. Sharon Britton, Lintla Chao, liobert A. Conrad, Daniel E. [)ever, Bruce Gieaeke, Soha M.N. H;lssoun, Gregory W. Hoeppnel; KathrynKuchler, Maureen Ladtl, Burton M. Lear): Liam Madden, EtlwardJ. McLellan, Derrick R. Meyer, James Montanaro, Donald A. Priore, Vidya Rajagop:~l;~n,Sridhar Sarnucl~~l;~, and Sribal:~nS;lnrIianam 51 The Alpha Demonstration Unit:A High-performance Multiprocessorfor Sopware and Chip Development Charles I? Thacker, David G. Conro!: ;lnd Lawrence C. Stewart 66 The Design of the DEC3000 AXP Systems, Two High-performance Todd A. I)i~tton,Daniel Eiref, Nlrgh R. K~rrth,JnmesJ.Reisert, ant1 Robin I,. Ste~vart 82 Design and Performance of lhe DEC 4000 AXP Departmental Server Computing Systems Barry A. M;~skas,Stephen E Shirron, and Nicholas A Wlrcliol 100 Technical Description of the DEC 7000 and DEC 10000 AXP Family Brian R. Allison and Catharine van Ingen 1 11 Porting OpenVMS from VAX to Alpha AXP Nancy P Kronenbrg, Thomas R. Benson, Wayne M. (hrtloza, Ravindran J;~jian~~atlian, ant1 BenjaminJ. Thomas Ill 12 1 The GEM Optimizing Compiler System Davitl S. Hlickstein, Peter W Craig, <:aroJine S. Davidson, R. Neil Riman,Jc, Kent D. <;lossop, Richartl H.Grove, Steven 0.I-lobbs, and William H. Noyce 137 Binary Translation Richard L. Sites, Anton Chernoff. Mattliew B. Kirk, M;~uriceP Marks, and Scott (;. Robinson 153 Porting Digital's Database Management Prodzrcts to the Alpha AXP Platforrn Jeffrey A. Coffler. Zia Mohatnecl, and I'eter M. Spiro 165 DECnet for OpercVMS AXP: A Case History James V. (:olombo, Pamela). Rickard, ant1 Paul Benoit 181 Using Simzrlation to Develop and Port SoJfu)are George A. 1):lrcy 111, llonaltl F, Rrcntlel; SrcphenJ. Morris, ;~ntl~Michacl V. IIcs

AIpha AXP Program Management 193 Enrollment Management, Managing the Alpha AXP Program Peter E Conklin I Editor's Introduction

req~~irements:~nd to allow :~pplicationflexibility The result of their design efforts is a microproces- sor that operates at speeds up to 200 MHz-the fastest commercially available chip in the industry Early implementations of this chip became part of a prototype system, the Alpha Demonstration Unit. As Chuck Thacker, Dave Conro): and Larry Stewart explain in their paper, the prototype servecl the overaII Alpha ,tYl' program by giving software devel- opers early access (ten months) to AXP-compliant hardware. Because of tlie architectural emphasis on Jane C. Blake multiple processors, prototype designers focused Editor on delivering a robust multiprocessing system. The ;iuthors tliscuss the significance of tlie choice of a backplane interconnect for a multiprocessor, corn- 'T'his special issue of tlie Digit~llTeclnrlical Jourf1crl pare different ;ipproaches to cache coherence, ancl presents the computer architecture that Digital describe the system modules and packaging. believes will become the universal platform for With constraints different from those of the pro- computing over the next 25 years. A signific;uit totype, the li;~rdwareprocloct projects are repre- ~liilestonein the comp;cny's history the i\lplia ASP sentetl here b), three different implementations: architecture arises out of Digital's extensive etigi- desktop, departmental, and data center systems. In neering experience antl puts into place a cohesive, the desktop area, the I>EC 3000 UPfamily of work- .I'lexible framework for high-perform:~nce 64-bit stations are balanced u~iiprocessorsyste~ns. Totlcl IlISC computing. This issue contains papers repre- Dutton, Dan Eiref, Hugh Kurth. Jim Reisert, and sentative of the scope of tlie program across Iiobin Stewart review tlie rlecision to replace the Digital's Engineering organization, including hartl- traditional common system bus with a crossbar \\?;Ire systems, an , compilers, system intercontiect constructetl of ASI(:S. This new binary tr;~nslators,network ant1 database software, interconnect allorved the designers to meet the and simulators. goals of Low memory latency, high memory band- The results of the engineering efforts cliscusscd width, ;~nd~>ii~ii~nal (:I'll-I/O memory contention in in these papers reflect three primary goals fix a cost-competitive manner. tlie Alpha AXI' architecture: high performance, The I>EC 4000 AXI' system is a tlepartmental longevit!; and e;~symigration from tlie 32-bit VAX server that iniplements the IEEE Futurebus+ stan- VMS computer line. Dick Sites, one of the chief tlard. Barry Maskas, Stephen Shirron, and Nick Alpha AXl' architects, 1i;rs written a definitive p;cper Wrcliol present tlie reasoning behind the system th;lt explains how key architectural decisions were architecture antl technology decisions that resulted made relative to the gonls. He reviews the similari- in the ;~chievementof optimized uniprocessor per- ties and differences between the UP;rrcIiitecti~re form;ince, tlu;rl-l>rocessorsymmetric multiprocess- and other RIS<: ;crchitectures, ant1 then presents ing, and baJ;uiced 1/0 tlirouglipi~t.lletnils of the details of-'the design, including data and instruction subsystems that make up this expanditble modular formats. In his conclusion, he projects evolutionary system are also provitled. c1i:lnges in the ;~rchitect~lreantl the resultilig per- 'The I)EC 7000 ant1 I>E<: 10000 systems are po\ver- formance increases of a thous;~~idfoldover tlie next fill niitl-rangc and ~ii;~inframeplatforms intended 25 years. for large commercial applications ancl clesigned to 'The first implementation of the Alpha t1XP arclii- utilize multiple hiti~regenerations of the DECchip. tecture i.4 tlie I>EC:chip21064 microprocessor, which Described by Brian Allison ancl (:;~tl~arincvan can execute up to 400 million operations per Ingen. tlie heart of these systems is a high-perfor- secontl. I)an L)obberpuIil ancl members of the mancc interconnect that allows communic;~tions Alpha chip team offer an overview of the CMOS pro- between multiple processors, memory arrays, and cess tech~~olog)!the chip micro;~rchitect~~re,ant1 I/O subsystems. The ;~uthorsrevie131 e;ich of the the external interface. They tlieti detail the circuit modules and the I/() subsystem design, which implementation ancl explain the design choices includes interfaces for SMI and Futurebus. Notably, directed toward meeting architectural performance ;I 32-bit \'AX <:I'~I ~iiodulehas been designetl to tlie reqi~irements of the higli-performance system this design approach and provicle tletails of tlie interconnect. Users who wish to migrate from the individi~alporting efforts. VAX system to Alpha UPneetl only swap module The process of porting l)C<:nct-VKX to the boards. OpenV1\1S operating system is described by Jim Migration to Alpha tU(P from other architectures, Colombo, Pam Rickard, and Paul Benoit. They dis- in particular from VAX VMS, is one of the major goals cuss the DECnet features supported in the operat- set by the Alpha architects. Existing software- ing system, tlie softm~aretechniqi~es used, and the operating systems, languages, programs-must be importance of the clecisio~ito build common code adapted to run effectively on 64-bit RISC systems. A for the VAX and Alpha IU;P sjSsterns.The autliors paper by Nancy Kronenberg, Tom Benson, Wavne share details of the port ancl lessons learned that Cardoza, Ravintlran Jagannatlian, ant1 Ben Thomas can be applied to tiiture porting efforts. addresses the ch;~llengesof porting the OpenVMS Complementary to the previously mentionecl operating system-originally eleveloped specifi- prototype hartlware system are h)ur software simu- cally for 32-bit VAX systems-to Alpha AXP systems. lators that enabled engineers to tlevelop softw;ire To deal with the huge amount of code, the project for Alpha UPconcurrently with harclware develop- team developecl a compiler that treats VrLY ,assembly ment. Described by George Darcy Ron Brendel; langiiage (VLY MA<:RO-32) as a source language to be Steve Morris, and Mike Iles, the M;ln~iequinsi~nu- compiled. The authors also cliscuss the major arclii- lator was i~setlby the OpenvivlS group to boot tectural differences in the kernel, performance, and the entire operating system and clebug utilities; some future directions for tlie system. the ISP simulator was used by the l)EC oSF/l group The GEM compiler system is the technology with similar success. A major section of the paper Iligjtal is using to build state-of-the-art compiler focuses on the Alpha User-niotle Debugging Envi- products. <;E>I is describetl here by David ronment in which user-mode code being devel- Blickstein, Peter Craig, Caroline Davidson, Neil oped for Alpha .UP platforms c;un be compiled and Faiman, Kent <;lossop, Rich Grove, Steve Hobbs, executed as N11h;l ILYP code. and Bill Noyce. A significant achievement in the Tlie closing paper is an uni~si~alone for tlie clevelopment of this compiler is that a single opti- Jo~lrnnlbecause it addresses engineering manage- mizer is used for all languages and platforms. ment, not strictly technical issues. Peter Conklin Developers of compilers will find in-tlepth informa- offers insights into the reasons for the success of tion in the ;~i~tIiors'disc~~ssions of optin~izatio~i one of the largest engineering programs untler- techniques, code generation, compiler engineer- taken in the intlustry. He defines tlie enrollment ing, and future enhancements. management rnoclel used for tlie Alpha UPpro- Binary translation is another rne;ins of moving gram and explains key concepts, i~icludingthe cornplex software applications from one architec- program office ;~n(lproject "cusps." ture and operating system to another architecture The editors are very gratefill for the help of Hob and operating system. Two binary translators are Supnik, Vice I'resident and (:o~-porateConsultant. the subject of a paper by Dick Sites, Anton Chernoff, in planning this special issue ;inti for writing its M;~ttliewKirk, Maurice Marks, ~111clScott I

Brian R. Allison Hrian Allison is a senior consultant engineer for Digital's mid-range VAX/Alpha AXP systems groi~pant1 is the stem architect responsible for the coordination of the VAX a~lclDE<: 7000 ant1 10000 system definition and design. Prior to this work, he served as system architect for the Vr\X 6000 product. Brian holcls a B.S.E.E. and a B.S.C.S.from Worcester Polj~echnicInstitute (1977).

Randy Allmon After receiving a B.S. tlegrce in electrical engineering from the University of Cincinnati, Randy Allmon joined Digital in 1981. As a circuit designer in the Semiconductor Engineering Group, he has contribt~tedto the development of numerous high-performance <:Mas processors. Currently, Randy is responsible for the technical tlesign and manilgernent of :I next-genera- tion processor based on the Alpha AXP architecture. He is the coauthor of four high-performance processor papers given at ISSC<: and has one patent pentling.

Robert Anglin Robert Anglin received S.B.and S.M. degrees in electrical engi- neering in 1989 from the M;lssachusetts Institute of Technology. In the same year, he joined Digital's Semiconductor Engineering Group. where he has worked on the design of high-perform:~ncemicroprocessors. Robert is a niem- ber of Signla Xi. He is currently pursuing an M.B.A. degree at Harvard University.

Paul Benoit Paul Senoit is a ]?rincipnl software engineer in the Networks and Communications Group. He is the project/technical leader for tlie DECnet for OpenVMS AXP project; the team receivecl an Alpha Achievenient Award for early completion of project commitments. Previous to this, Paul led the DECnet-VAX Phase 1V effort. He holds an M.S.S.E. (1991) from Boston University and a H.S.C.S. (1986) from the llniversity of Lowell. Paul is a member of ACM anti IEEE Computer Society.

Thomas R. Benson A consulting engineer in the OpenVMS AXI' Group, ?i)m Benson was the project leader ;lnd princip;~ldesigner of tlie V4X ;Llr\CRO-32 com- piler. Prior to his Alpha AXP contributions, he led the VhlS UECwindows Fileview and Session Manager projects and bro~~ghtthe Xlib graphics library to the V,MS operating system. E;trljel; lie supported ;in optimizing compiler shell used by several V!\X compilers. Torn joined Digi1;il's VhX U;~sicproject in 1979, after receiving 13,s. ant1 ,M.S. degrees in computer science from Syracuse li~liversit)~,He has applied for four patents rel;ited to his Alpha AXP work. David Bertucci David Bertircci received ;I B.S.E.E.degree in 1982 from Wayne State University ant1 an M.S.E.E. degree in 1988 from Michigan State University. He joined Digital's Semiconductor Engineering Group in 1989 and worlzed on advanced (:MOS microprocessor design. Currently, he is employed at , Inc.

David S. Blickstein Principal software engineer David Blicksteill has worked on optimizations for the GEM compiler system since the project began in 1985. During that time, he designed various optimization techniques, including induc- tion variables, loop unrolling, code motions, common subexpressions, base binding, and binary shadowing. Prior to this, David worked on Digital's PDP-11 and VAX API. implement:~tionsand led the VAX-11 PL/I project. He received a B.A. (1980) in mathematics from Rutgers College, Rutgers University, and holds one patent on side effects analysis and anotller on incluction v~riableanalysis.

Ronald F. Brender Ron Brender is a senior consultant software engineer, contribilting to the GEM compiler back-end project in the Software Development Technologies Group. He has worked on con1,pilers and program- ming langui~gedefinition for Alpha UP,VAX, PDP-11, and PIIP-10 systems, inclutl- ing Acl;~,FOII'TIWN and HI.ISS. A member of various standards committees since the mid-1970s, Ron is now responsible for Vi\X and Alpha AXP calling standards. He joined Digital in 1970, after receiving a P1i.D. in computer and communica- tion sciences at the University of Michigan.

Sharon Britton Sharon Britton received a B.S.E.E. degree from Boston Ilniversity in 1983 and an M S.E E degree from the Massachusetts Institute of Technology in 1990. She joined Digital in 1983 to work on tlie design and clevel- opment of 80186-based controllers for read-only and write-once optical disk drives. Sharon's graduate research involved tlie development of an integrated content adtlressable memory system with error detection c;~pabilityCurrently a member of the Semicontluctor Engineering Group, she is involvetl in the design ant1 implementation of high-performance CMOS microprocessors.

Wayne M. Cardoza Wayne Cardoza is a senior consultant engineer in the OpenVMS AXP Group. Sincc joining Digital in 1979, he has worlied in various areas of the OpenVMs kernel. \Vdyne was also one of tlie architects of PRISM, an earlier Digital RJSC architecture; lie holcls several patents for this work. More recently, Wlyne participated in the design of the Alpha AXP architecture and was a member of the initial design team for the OpenVMS port. Before coming to Digital, WAyne was employetl by Bell Laboratories. Wayne received a B.S.E.E. from Southeastern Massachusetts University ancl an M.S.E.E. from MIT. Linda Chao Lintl;~Chao received ;I 1% S E E degree from the i\il;tssacht~setts Institute of Technology in 1987. Since joining Digital in the Semiconductor Engineering Group/Atlvancecl llevelopment in 1987, Linda has been engaged in the design of microprocessors basetl on the Vi\X and Alpha AX1' architecti~res. She is currently pursuing master's degrees in electrical engineering and nianage- nient through the MIT Leaclers for illanufirct~~ring1'rogr;rm.

Anton Chernoff A~itonChernof'f is ;I member of tlie technic:rl st;rffat Digital Equi~mentCorporation. working in the Alpha t\XP Migration Tools Group. He joinecl Iligital in 1991, but also worked at 1)igital between 1973 ant1 1981 as proj- ect le;~der;rnd developer of the In'-11 ant1 IISTS/E oper;rting systems. Anton spent 1982 through 1991 ;rt L.iant Soft\v;lre Corporation as a senior consulting engineer in compiler ant1 deb~~ggertlevelolxnent.

Jeffrey A. Coffler A prjncipal software engineer in tlie 1)at;lbase Syst-enis Engineering Gro~lp.Jeff Cofl'ler led the effort to port I>IIMS to the Alpha AX1' plat- form. Prior to this, Jeff worked 011 the DBVS and Rdb b;~ckup/restorefacility ancl on new DBMS fe;itures and ni;rintenance. He is currently working on the project to port Rclb for OpenVMS to operating systems such as Winclo\vs N'f ant1 OSB1. He has ;~lsocontributed to the RSTS/E operating system, WPS-I'LIJS porting, ;rntl workflow nianagernent projects. Jeff joined Digital in 1984 ancl holds a R S<: S (1981) I'roni Californi;~State Ijniversity ;~tNorthridge.

James V. Colombo Project/technical leatler James Colonibo is currently responsil,le for tlie next rele:rse of DE(:net/OSI for OpenVMS for the Vr\X :~nd Alpha AXP computing environments. Prior to this, he lecl the port of DECnet-VAS Phase Iv to the OpenV,\lS AXI' oper;~tingsystem; the team received an Alpha Achievement Aw;~rtlfor earl! completion of the project. Jim also let1 the DE<:net for OS/2 V I .O and v;rrious PIITH\VORKS procluct efforts. Before conling to Digital in 1983, Jirn worked at Prinle <;ornputer, Inc. and Computer Devices, Inc. He holtls ;I 1%SC S fro111Host011 I!niversity ant1 is ;I rnember of lEEE.

Peter F. Conklin Peter Conklin is tlirector of Alpha t\XP Systems Develop- ment. Since joining Digital in 1969, lie h;ls held engineering nian;rgement posi- tions in large ant1 small systems and terminals groups, direct hardware :rntl softw:rre engineering, product management, base product m;~rllS layerecl prod- uct set. Peter received an A\.[\. in n1athem;rtics from Marvard University in 1963. Robert A. Conrad Robert Conrad receivecl a H.S.degree in electrical ant1 com- puter engineering from tlie 111iiversity of Cincinnati in 1984 ancl nn M.S. degree in electrical and computer engineering from tlie University of ~Mii~~achu~ett~in 1992. In 1981 he joined Digital's Semicontluctor Engineering Groilp, where he worked ;IS a co-op student in the Architectilrally Focused Logic Group. Since 1984 Rob has been engaged in the research and tlevelop~lientof VLSl micro- processors, including thc MicroVAX CPU, a 50-MHz RISC CPU, ;lncl most recently the DECchip 21064 microprocessor.

David G. Conroy Dave Conroy receivetl a 13.A.Sc. degree in electrical engi- neering from the University of Waterloo, Canada, in 1977 After working briefly in industrial automation, Dave movetl to tlie in 1980. He cofoundecl the mark Williams Company and built a successh~lcopy of the Iliul>; operating systerr~.In 1983 he joined Digital to work on the DECtalk speech synthesis system and related products. In 1987 he became a member of Digital's Semiconductor Engineering Group, where and has been involved with system- level aspects of RISC microprocessors.

Peter W. Craig Peter Craig is a principal software engineer in tlie Software Development Technologies Group. He is currently responsible for the design ant1 implementation of a tlependence analyzer for use in future compiler procl- ucts. Peter was a project leader for the VAX Code Generator used in the VrLY C and VAX PIJI compilers, and prior to this. he cleveloped CPlr perform;~ncesimulatioli

I software in the VAX Architecture Group. He received a R.S.E.E. (magna cum I;lucle, 1982) from the University of Con~~ectici~t;~ndjoined Digital in 1983.

George A. Darcy I11 As a senior software engineer in the Alpha Migration Tools Group, George Darcy has worked on the Mannequin Alph:~)\XI' sin~ulator, the VEST' binary translator, and the Translateel 11nageEnvironment ('l'lE) run-time library. 111 his ten years at Digital, he has also developed a virtual disk driver for the OpenVMS V5.0 SMP operating system, soltware behavioral models of a high- end VAX processor, and various simulatio~iand <:AD software tools. George receivetl a R.S.C.E. (cum laude, 1984) from Boston Universit): where he was an Engineering Merit Scholar ancl a member of Tau Beta Pi.

Caroline S. Davidson Since joining Digital in 1981, Caroline I);~vidsonhas contributed to several software projects, primarily relatetl to code generation. Currently a principal software engineer, she is working on the (;EM compiler generator project and is responsible for tlie areas of lifetimes, storage ;II location, and entry-exit calls. Caroline is also a project leatler for the cocle generation effort. ller prior miork involved the VAX FOIITRAN for IJL'I'KIX, VAX Code Generator, ancl FORTRAN 1V software products. Caroline has a H.S.<:.S. from the State University of New York at Stony Brook. Daniel E. Dever 1);ln Dever receiveti a I3 S E.E. degree in 1988 from the University of Cincinn:iti. He joined Digit;il's Semiconductor Engineering Group in 1988, where he worked on the tlesign ant1 logic verific;~tionof CMOS V!\X tnicroprocessors. Since 1990 he has been involved in the design of R1S<; arcliitec- ture microprocessors, including the floating-point unit of the DECchip 21064 ~~iicroproceorIhln is currently involvetl in the design of integer arithmetic logic for the next-generation processor b:lsetl 011 the Alpli;~t\Xll architecture.

Daniel W. DobberpuN Dan Dobberpulil received a I3.S.E.E. degree from tlie [Jniversity of Illinois in 1067 Subsequent to positions with the Department of Defense and Ge11er;tl Company, he joined Digital's Sernicontluctor Engineering Group in 1976. Since that time, he has been active in the design of four generations of microprocessors, including the first single-chip PDP-11 and the first single-chip VAX. Most recently, 1)an was the project leader for the first \rl.SI implementation of IXgital's new 64-bit Alpha tD(P computing architecture. He is co;~ut hor of the text, The Desi'yz arld ArznI)!sisof VLSJ Cir-ctlits.

Todd A. Dutton A priticipal hartlw;ire engineer, Totld Dutton was responsible for the overall design integration and timing verification of the DEC 3000 AXl' Motlel 500. Prior to this, he led a team in developing vector processor hardware in the Advanced VAS Development Group. Todd joinetl Digital in 1987 Pre- viously, he was employed at Xumerix Corporation and at Signal Processing Systems. Inc. Totltl 1i;ls a 1%.S.degree in computer science from the ~Massachusetts Institute of Technology and was electetl to Tau Beta Pi. He lioltls a patent on vec- tor processor technology and has published two papers on vector processors.

Daniel Eiref Dan Eiref joined Digital in 1987 after receiving B.S. and MS. degrees in electric;il engineering from <:olumbia University. At Columbia he was electetl to Tau Ret;~Pi ant1 WAS awartled tlie Steven Abbey O~~tstandingStudent- r~thletei4m~artl. He is currently attentling Harvard Business School. A principal hardware engineer, Dan was responsil>le for the design of tlie memory and clock systems of the l>E(: 3000 AXP Model 500. He also designetl the 's SI.I<:E and rU)l)K ASI<:s. Prior to this project, he worketl as an ECL hardware tlesigner in the Advancerl VAX Development Group.

R. Neil Faiman, Jr. Neil Fai~n;lnis ;I consultant softw:ire engineer in the Software Development Technologies <;roi~p.He was tlie primary architect of the (;EM intermediate I;lnguage and a project leader for the <;EM compiler optimizer. Prior to this work. he led the BLISS conipilrr project. Neil c;lme to Digital in 1983 from MDSI (now SchIu~~~berger/Applicon).He has B.S. (1974) and k1.S. (1975) tlegrees in computer science, both from ~Micl~iganState Ilniversity Neil is a mem- ber of Tau Beta I'i and A<:M, and ;In affiliate member of the IEEE Computer Society. Bruce Gieseke Bruce Gieseke received a B.S. degree in electrical engineering from the University of Cincinnati in 1984. and an M.S. degree in electrical e~lgi- neering from North Carolina State IJniversity in 1985. In 1986 he joined Digital's Sen~iconcluctorEngineering Group, where he has been e~lgagetlin the imple- mentation and circuit clesign of RIS<: microprocessors.

Kent D. Glossop Kent Glossop is ;I princip;il engineer in the Softw;ue Development ?'ethnologies G~OLIP.Since 1987 he h;~sworked on the GEM com- piler syste~n,focusing on code gencr;ition and instruction-level tr;insformations. Prior to this, Kent was the project leader for a release of the VAX PWI. compiler and contributed to version 1 of the VAX Performance and Coverage Arxalyzer. Kent joinecl Digital in 1983 after receiving a B.S. in conipilter science from the University of Michigan. He is a meniber of IEEE.

Richard B. Grove Senior consultant software engineer Rich Grove joined Digital in 1971 ant1 is currently in the Software Develol>ment Technologies Group. He has led the GEM compiler project since the effort bcg;~nin 1985, con- tributing to the code generation phases. Prior to this work, Rich was the project leader for the PDP-11 ancl \(AX FORTRlN compilers, \worked on VAX Ada Vl, and was a member of the ANSI X3.13 FOK'I'RAN Committee. He is presently a member of the clesign team for Alpha AXI' c;illing standarcls and architecture. Rich has H.S. ancl M.S. degrees in mathematics from C~rnegie-MellonUniversity.

Soha M.N. Hassoun Soha Hassoun received a H.S.E.E. degree from South Dakota State University in 1986, ancl all S.M.E.E. degree from the Massachusetts Institute of Technology in 1988. From August 1988 to August 1991 she was employecl at Digital as a custom tlesign engineer in the Semiconductor Engineering Group. She contributetl to the design of tlie flo;tting-point unit of the I)E<:chip 21064 processor. Soha was the recipient of a Digital Minorit)r ;in(I Women's Scholarship in I991 and is pursuing a 1'h.l). degree ;~tthe IJniversity of \Vashington, Seattle, (:oniputer Systems Engineering Uepartme~?t.

Steven 0.Hobbs A ~nemberof the Software Development Technoltigies Group, Stcven Hobbs is working on tlie GEM compiler project. In prior contribu- tions at Digital, he was the project leader for \'AX Pascal. the lead designer for the global optinlizer in VAX FORTIWN, ant1 a member of the Alpha AXP architecture design team. Steve received his A.R. (1069) in mathematics at Dartmouth College and while there, helped develop the original BASIC: time-sharing system. He has an M.A. (1972) in m;ithcmatics from the Univel-sit). of Michigan and has done additional graduate work in compiiter science at Carnegie-~MellonUniversity. Gregory W. Hoeppner Gregory Hoeppner graduatetl with tlistinction from I'urtlue IJniversity in 1979. His research topic was ion-implanted optical wave- guicles. In 1980 he worked at C;ener;~l Telephone ant1 Electronics Researc1.r L;~bor;itor!: \vilere he performecl basic properties rese;lrch on (;ails for fabrica- tion of submicrometer FETs. From 1981 to 1992 he helcl ;I number of positions at Digitill Eqi~ipmentCorporation's Nuclson, iLL\ site, including co-implementation 1e;lcler of 1)igit;il's DECchip 21061. He is currentl). emplo!recl ;is a senior engineel- at III.LI, Adv;~ncedWorkstation Division.

Michael V. Iles Michael Iles is a senior technology consultant at (he UK Alpha MI' Migration Centre. Since joining Digital in 1975, Mike has worketl in various fieltl positions, in Advanced \'AX development :IS iI microcotler, and for VMS engi- neering as a software engineer. He worked on the migration of OpenVMS VAX to the Alpha AXl) platform. designing and implemcntillg a user-mode simulation environment that became AuD. Mike has a B.s~.in electrical engineering (hon- ors. 1973) from City University. Lontlon. and holds a patent for tligital speech synthesis techniques. Hc has several patents pentling for AriI).

Ravindran Jaga~athan Ravintlran Jaga~inathanis :I principal software engi- neer in the OpenVMS l'erformance Group currently investig;~tingOpcnVMS ASI' ~nultiprocessingperformance. Since 1986, he has worked on perforrn;~nceanal- ysis ;III~ch;~r;lcteri~;~tio~i, ;~nd;~lgoritlim design jn the ;ire;is of OpenVMS ser- vices, S>IP, \ihXcluster sjwems, ;tncl host-b;~setlvolume sh;itlowing. Ravindrian receivecl a H.E. (honors, 1983) from the University of M;ltlras, Incli;~,and M.S. degrees (1986) in oper;~tionsrese:~rch and st;~tislicsant1 in computer ;rncl sys- tems engineering from Rcnsselaer Polytechnic Institute.

Matthew B. Kirk Matthew Kirk is ;I senior software engineer in the SE(;/I\I> kYI) ,Migration Tools Group, where he works on binary tl.;inslator de\~elopnirnr. testing, ant1 support. He joined Digital in 1986 ;lnd has also tlqsignccl ;ind clevel- oped automated architectural test software for pipelined VAX harclw:ire and the (3 computer interconnect. Matthew holds it 13,s. in computer scicnce (1986) from the University of Massacl~usetts.

Nancy P. Kronenberg Nancy Kro~ienbergjoined Digital in 1978 and has cleveloped VMS support for several vrV( systems. She designetl and wrote the VklS CI port clriver ;111dpart of the VkIscluster System Communications Services. In 1988, Nancy joined the team th;~tinvestigated ;~lternativesto the VAX ;irchitec- ture ant1 drafted the propos;il for the Alpha /\XI1 ;irchitecture and for porting the (.)pcn\/.LlSoperating sjrstem to it. N;lncy is a senior consulting software engineer ;inti technic;~lclirector for the OpenVhIS t\Xl1 <;SOLID. She holcls ;11i !\.1$, degree in physics froni <':orncllUniversity Kathryn Kuchler Kathryn Kuchler received a R.S. degree in electrical engi- neering from Cornell University in 1990. Upon graduation, she joined Digital's Semiconductor Engineering Group, where she worked on tlie first implementa- Ition of a RISC microprocessor basecl on the Alpha AXP architecture.

Hugh R. Kurth Hugh Kurth joined Digital in 1986 after receiving a R.S. degree in electrical engineering, computer engineering, and mathematics from Carnegie-Mellon University. At Carnegie-Mellon, he was elected to Eta Kappa Nu and was awarded the David Tuma Undergraduate Laboratory Project Award. A senior hardware engineer, Hugh designed the TCDS ASIC and SCSl subsystem for the DEC 5000 AXP Model 500. Prior to this work, lie designed floating-point hardware for two projects in the Advanced VAX Development Group.

Maureen Ladd Maureen Ladd received a B.S degree in computer engineering frorn tlie University of Illinois in 1986. She then joinecl tlie Semiconcliictor Engineering Group within Digital and worked on a 32-bit RISC microprocessor. Maureen received an M.S.E. degree in electrical engineering from the University of Michigan in 1990 through Digital's Graduate Engineering Education Program. Upon her return to Digital, she worked on the implementation of the first micro- processor based on tlie Alpha AXP architecture.

Burton M. Leary Mike Leary is currently a consulting engineer in the Semiconductor Engineering Group/Advanced Developn~entMemory Groi~p.He designed the instruction and data caches for the DECchip 21064 CPlJ and is cur- rently working on the design of advanced memory products. Milze joined Digital in 1980 after receiving a B.S.E.E.degree from the University of Massachusetts.

Liam Madden Liam Nladden joinecl Digital in 1984 and has since designed both ClSC and RISC microprocessors and contributed in the area of CMOS process development. He is currently a consultant engineer in Digital's CPU Advancecl Development Group and his interests include circuit design and CMOS tech- nology development. Prior to joining Digital, Liam designed industrial micro- controllers for ~Mahon and McPhillips, Ireland, and worked for Harris Semicontluctor. He received a B.S, clegree from University College Dublin in 1979 C and an M.E. degree from Cornell University in 1990. Maurice P. Marks Maurice Marks is a senior engineering manager in the Semiconcluctor Engineering Advanced Development Group. He currently man- ages the AXP Migration Tools Group and contributed to the design and imple- mentation of the translators. In Maurice's twenty years with Digital, he has led compiler, operating system, hardware and software tools, CAD, system, and chip projects. He holds B.Sc. and B.E. degrees from the University of New South Wales ant1 l~aspublishecl papcrs on transaction processing, software portabilic): and CAD technology. Maurice is a member of the Australian Computer Society.

Barry A. Maskas Barry Maskas is the project leader responsible for architec- ture, semiconcluctor technology, and development of the t>EC 4000 AXP system buses, processors, and memories. He is a consulting engineer with the Entry Systems Busi~~essC;roup. In previous work, he was responsible for the architec- ture and development of custom vLS1 peripheral chips for VAX 4000 and MicroVAX systems. Prior to that work, be was a codesigner of the MicroVAX 11 CPU and mem- ory modules. He joined Iligital in 1979, after receiving a R.S.E.E.from Pennsylvania State University. He holds three patents and has eleven patent applications.

Edward J. McLellan Etl McLellan is a principal engjneer in the Semi- contluctor Engineering Group. He has co~ltributedto the design of several pro- cessor chips. Ed joined Digital in 1980 after receiving a B.S. degree in computer and systems engineering from Rensselaer Polytechnic Institute, where he was elected to Eta Kappa Nu. He holds three patents in computer design and l~asone application pentling.

Derrick R. Meyer Dirk Meyer joined Digital's Senliconductor Engineering Group in 1986. He was initially involvetl in the design of the cache and memory systems for a chilled CMOS VAX processor. He 11% since been involved in the development of n~icroprocessorsbased on the Alpha AXP architecture. Prior to joining Digital, he was employed at Intel Corporation, where he was involved in the design of various CMOS microcontrollers, inclucling the 80C51 and 80~196. Dirk received a R.S. degree in compilter engineering from the University of Illinois in 1983.

Zia Mohamed Zia Mohamed has been a member of the Database Systems Group since joining Digital in 1989. He works in the area of query optimization for the DEC Rdb for OpenVMs products; his contributions involve cost-based optimization of database queries and algorithms for execution of optimized cluery plans He has tleveloped clynamic OI~optimization techniques, refine~netlt of cost-model, ant1 algorithms for better access plans for views. Zia holds a B.S. tlegree in electrical engineering from Bangalore Universit): India, and an M.S degree in cornpiitcl-science from Texas Tech Universit)! James Montanaro James Montanaro received B.S.E.E. and M.S.E.E. degrees from the Massachusetts Institute of Technology in 1980. He joined Digital Equipment Corporation in 1982. He was a circuit designer on the floating-point chip for the LSI 11/74 and a MicroVAX peripheral chip. He led the physical imple- mentation of the uPRrSM CI'IJ, a 70-MHz prototype RlSC CPU completed in 1988. James also led the pllysical implementation of the first CPU chip based on the Alpha AXP architecture and then contributed as a circuit designer for the DECchip 21064 CPU. He is currently with Apple Computer, Inc.

Stephen J. Morris Stephen Morris is a consultant software engineer in the Semiconductor Engineering Advanced Development Group. In addition to writ- ing the Alpha ISP simulator, he wrote the OpenVMS antl OSF PALcode for the Alpha AXP program. In previous work, Stephen designed the control sections of the instruction prefetch and translation look-aside buffer for an experimental Digital MS<: chip. He also worked on the MicroVAX chip team, doing console ant1 debug work, and in the RSTS/E operating system group. Stephen joined Digital after receiving a U i\ in biology from the University of Rochester in 1977.

William B. Noyce Senior consultant software engineer William Noyce is a member of the Software Development Technologies Group. He has developetl several GEM comp~leroptimizations, inch~dingthose that eliminate branches. In prior positions at Digital, Bill implementetl support for new disks and proces- sors on the RSl'S/E project, led the development of VAX DBMS V1 and VAX RdbNMS V1, and designed and implemented automatic parallel processing for VAX FORTIIAN/HPO. Bill received a B.A (1976) in mathematics from Dartmouth College, where he implemented enhancements to the time-sharing system.

Donald A. Priore After receiving an S.M. degree in electrical engineering and computer science from the Massacli~lsettsInstitute of Technology, Donald Priore joined Digital in 1984. Initially, he worked on device characterization, yield enhancement, and yield modeling of NMOS and CMOS processes in manu- facturing. Subsequently, he joined a CMOS design group, working first with low-temperature CMOS technology and later with conventional CMOS in high- performance microprocessor design. His interests include signal, clock, and - power integrity in the on-chip environment.

Vidya Rajagopalan Vidya Rajagopalan received a B.E degree in electronics engineering from Visvesvaraya Regional College of Engineering, Nagpur, India, in 1986, and an M.S. degree in electrical engineering from the University of Maryland in 1989. She was with Norsk Data India Ltd, from 1986 to 1987 as a systems design engineer. In 1989 she joined Digital's Semiconductor Engineer- ing Group and was a member of the design team of the DECchip 21064 RISc microprocessor. Vidya is currently involvecl in the design of high-performance microprocessors. James J. Reisert A senior hardware engineer.Jim Reisert designecl the TC ASIC for tlie DE(: 3000 AX[' Moclel 500. I'rior to this project work, he designed instruc- tion parsers/decotlers for two \RX imp1ement:itions. Jim holtls a patent for his tlesign of a method for replaying instructjons after a microtrap. Before joining I>igital in 1986. he received an S.B. in electrical engineering from tlie Massa- cliusetts institute of Technolog)! He is currently in charge of timing verific;ition h)r another AXP workstation.

Pamela J. Rickard Principal software engineer Pan1 Rickard is a member of the team porting DECnet/OSI for OpenVMS to tlie Alpha AS1' pl;rtform. As the ini- tial member of the DECnet for OpenVMS AXP porting team, Pam took responsi- bility for creating an effective team, ported NETDRIVER ant1 other MACRO-32 cocle, atitl debugged major portions of the portecl product. Si~lcejoining Digital in 1978, she has contributed to I.'hTHWORKS for OS/2 ant1 led the console, , and system test activities of the VAX-11/785 project. Pam receivecl a H.S. (1970) in mathematics and computer science from the IJniversity of 1)enver.

Scott G. Robinson Scott Robinson is a software engineering mirnager in the AXP Migration Tools Group. He contributed to the design ant1 implementation of the binary translators, particularly the \'AX tratislatecl iniagc environment. Scott has also developed iniplementations of DE<:net ilntl CAI)/<:ANl systems to design \%X processors. Prior to joining Digital in 1978, Scott worketl on ;I vi4riety of Digital hardware ant1 software implementations. He holds a B.S. in electrical engi- neering from the University of Arizona anel is a member of IEEE.

Sridhar Samudrala Sridliar Sa~~iutlralais ;I consulting li;rrtl.iv;~reengineer in the Semicontlirctor Engineering Group, where lie is currently working on a new (:I'[I chip. He joined Digital in 1977. Since t1i:rt time, lie h;rs \vorketl on tlie design and verification of PDPil1/23 chips, VAX 8200 micrococle tlevelopnient, ancl on the ;rrchitecture ;~ntldesign of floating-point chips. He holtls t\vo p:ttcnts :rnd has three patent app1ic;rtions pending, all on floating-point design. Sritlhar received ;III M.Sc. ('Tech) tlegree from Antlhra Universit); India, and ;III M.S E.E. tlegree fron'l the llniversity of Wisconsin.

Sribalan Santhanam Sri Santhanam receivccl a n.E. degree in e1ectric;il cngi- neering from Anna University, Maclr;~s,India, in 1987, anel an M.s.E. dcgrcc in co~il- lxtter xiecne crntl engineering from the University of Michigan in 1089. I Jpon gratluation, he joined Digital as a design engineer h)r the Semicontluctor Engineering Group, responsible for tlie full-custo11-1tlesign rind clevelopnient of high-perform;~nceCMOS VLSI processors. Sri worked on the design of the flo;ct- ing-point unit of the UECchip 21064 CPU. He is currently involveel in the tlesig~lof another high-performance microprocessor. Stephen F. Shirron Stephen Shirron is a consulting software engineer in the Entry Systems Business Group and is responsible for Open\/MS support of new systems. He contributed to man)' ;lre;is of the I>EC 4000, including PALcode, con- sole, and OpenVMs support. Stephen joined Digital in 1981 after completing B.S. ant1 MS. degrees (summa cum lautle) at <>~tholicUniversity In previous work, he tleveloped an interpreter for VAX/Smalltalk-80 and wrote the for the ItQDX3 disk controller. Stephen has two patent applications and has written a chapter in Smc~lltalk-80:Hits of lfistor:~~,Words oJAdi)ice.

Richard L. Sites Dick Sites is a senior consultant engineer in the Semicon- tluctor Engineering Group. where he is working on binary translators ;~ntltlie Npha AXP architecture. He joined Digital in 1980 and I~ascontributecl to v;lrious VAX implementations. Previously, he was employetl by IBhl, Hewlett-Packard, and Burroughs, and taught at the I.iniversit)' of C;~lifornia.Dick received a 13,s. in mathematics from MIT and a Ph.1). in computer science from Stanford University He also studied computer architecture at tlie Ilniversity of North Carolina. He Iiolds a number of patents on computer hnrclware and software.

Peter M. Spiro Peter Spiro. ;I consulting software engineer, is presently the technical director for the Rdl:, ant1 IIt3;MS software proclucts. Peter's current focus is database performance for Alpha AXl' systems and very large database issues. Peter joined Digital in 1985, after receiving 1M.s. degrees in forest science ant1 computer science from the Ilniversity of Wisconsin-Madison. He has four patents related to database journaling and recover): and he has authoretl two papers for earlier issues of the L)i

Lawrence C. Stewart Larry Stewart received an S.U. in electrical engineering from M1'1 in 1976, followed by M.S. (1977) and P1i.D. (1981) degrees from Stanfortl IJniversity, both in electrical engineering. His PI1.D. thesis work was on data com- pression of speech waveforms using trellis cotling. Upon graduation, he joinetl the Computer Science Lab :~tthe Scrox Palo Alto Resr:~rchCenter. In 1984 he joined Digital's Systems Research Center to work on the Firefly multiprocessor workstation. In 1985) he moved to Digitnl's Gunibridge Research Lab, where he is currently involved with projects relating to multimetlia ant1 AXP products.

Robin L. Stewart Robin Stewart joined I)igit;~lin 1986 after receiving a 1l.s.in electrical engineering from tlie University of \/ermont. She is in the process of obtaining an M.B.A.degree from Boston College. A senior technology (liartlware) engineer, Robin had responsibility for the teclinolog)~in the I>EC 3000 aPMoclel 500 workst:ttion. Prior to this project work, she was a com- l~onentengineer in Digital's Semiconductor Hi~sinessOrganization. Charles P. Thacker Chuck Thacker h;~sbeen with 1)igital.s Systems Research Center since 1983. Ikfore joiiiing Digitill, he was a senior rese;trch fellow at the Xerox I'alo Alto Rcsearch Center. His research interests inclutle computer archi- tecture, comprlter networking, ;inti computer-;~idedclesign. He holds several patents in the ;ires of computer organization :u~tlis coinventor of the local network. [ti 1984. Chuck was the recipient (with B. Lampson and R. T;~ylor)of the i\(:kl Software System Award. He received :In A.H. degree in physics from the ljniversity of California in 1967. He is a member ofACM and IEEE.

Benjamin J. Thomas 111 Benjamin Thom;~sjoined the OpenVMS AXIJ project in 1989 ;IS project leader for I/O subsystem design ;mcl porting. In this role, he has also contributetl to the I/O architecture of current ancl future AXP sgtenls. Ben joined Digital in 1982 ant1 has worked in the VklS groi~~since 1984. In prior work, he WASthe director of software engineel-ing ;it ;I microcomputer firm. Ben is ;I consulting engineer and has a H.S. (197%) in physics from tlie University of New Hampshire :ulid an M.S.C.S.(1990) from Worcester Polytechnic Institute.

Catharine van Ingen A consulting softw;lre engineer, <:atliarine van Ingcn was co-system ;~rchitectfor the \'A)(. and [>I':(:7000 proclucts. is cur- rentl!, on Ie;~vefrom Digital and is n7orking 011engineering document manage- ment in large heterogeneous systems. Hefore joining Digital in 1987. she worked on d;lta ncquisition systems for two I;irge physics tlctectot.~:it tlie Ferrni N;ltio~l;iI Acceler;~tor1.abor;ttory and Stanfortl Linc;ir Acceler;~torCenter. She holcls serf- era1 clegrees in civil engineeri~ig,inclutling a 13 s. ;untl ;In k1.S. from the University of

Nicholas A. Warchol Nick Warchol, a consulting engineer in the Entry Systems Ii~isiness<;soup, is the project le;itler responsible for I/O architecture and I/() rnoclule tlevelopment for the DE(: 4000 AXI' systems. In previous work, he contributeti to he development of VAX 4000 s)rstelils.He was also a tlesigner of the MicroV,\\S 3300 and 3400 processor moclules and the RQDXJ disk con- troller. Nick joined 1)igitnl in 1977 ;lfter receiving a I3.S.E.E. (cum 1;rude) from the New Jersey Institute of Technology. In 1984 he received an k1.S.E.E. from Worcester Polylechnic Institute. He h;~shmr patent :ipplications.

Richard T. Witek Rich Witek joined l>igit;~lin 1977 to work on DE(:net network nrchitecture during Phase 11. In 1982 lhe joined lligital's Semiconductor Engineering <;SOLI~~where he worked on <:,\I> rlevclopment, MicroVAX VLSr chil>s,ant1 ;I variety of internal MS<:projects. Rich \v;~s;I coclesigner of the ALpha .\XI' ;uchitectilre :rntl the principal micso;~rchitectof the DECchip 21064 CP11 chip. He recei\rctl a 13.,1. degree in computer sciencc from 11uror;a College. Rich is 1 'w!.1 currently- employed by Apple Computer. Inc. I Foreword

the proposed solution l~aclto provide strong conl- patibility with current products. After several months of study, tlie team concluded that only a new RISC architecture could meet the stated objec- tive of long-term competitiveness, and that only the existing VMS and UNlX environments coulcl meet the stated constraint of strong compatibility. Thus, the challenge posed by the task force was to design the most competitive MS<; systems that would run the current software environments. Key groups in Engineering responded to this challenge. A cross-functional team from hardware Robert M. Supnik and software defined the basic architecture. Corporate Consultant, Advanced development teams began work on the Vice Preside~zt knotty eiigineering problems: it1 the serniconduc- Tkch~icnlLlirectol; tor group, the specification ant1 design of a fast Engineering microprocessor, and the automatic translation of executable binary images; in the operating systems groups, on the porting of ULTRlX and of VMS (which was not portable!); and in the compiler group, on It all started with eight people in a conference superscalar code generation. In the fall of 1989, room.':' Alpha became an officially sanctioned advanced The time was the summer of 1988. Digital development pr~gram.~In the summer of 1990, it Equipment Corporation had just closed the best transitioned to product development. fiscal year in its history, with record revenues ancl From the original core in semiconductors, oper- profits. Digital's vA>; systems were the most widely ating systems, and compilers, work expantlecl i~secltimesharing systems in the intlustry and were throughout Engineering. The server and work- the "blue-ribbon standard" for mid-range comput- station hardware groups specified and started ing. Digital was tlie second-largest workstation ven- designing a family of systems, from desktop to clata dor. The company hat1 just introducetl the VA>( 6000 center. The networks group began porting DE<:net, system, its first exlxmdable multiprocessor, was TCP/IP, X.25, LAT, ant1 the many other network- deveJ.oping a true VAX mainframe, ancl had decided ing products. 'l'he layeretl software group inve~l- on a rapid thrust into lilSC workstations to capital- toried the existing portfolio of products and ize on that growing market. What could possibly go prioritized the ortier and i~nportanceof clelivei-J: wrong? The research group pitched in by designing an Nonetheless, senior managers and engineers saw experimental multiprocessor as a software devel- trouble ahead. Workstations hat1 displaced VAX WS opment testbed. from its original technical market. Networks of per- In parallel with the engineering work, market- sonal were replacing timesharing. ing, sales, and service teams worked closely with Application investment was moving to standarcl, business partners and customers to shape tlie tleliv- high-volume computers. Microprocessors had sur- erables and messages to meet external require- passed the performance of traditional mid-range ments. These teams briefed key customers ancl computers ant1 were closing in on mainframes. And partners early in the development process ant1 aclvances in RISC technology threatened to aggra- vate all of these trends. Accordingl): the Executive The Corona Borealis conferencr room in the LTNI f:~ciljryin Committee asketl Engineering to develop ;I long- Littletoi~.Mass. !,I4N1 was clloscn bec:~ilsc~t nins thc geogr;~phic term strategy for keeping Digital's systems cornpet- epicenter of the arc of Digital engineering k~cilitieson ,\lassa- itive. Engineering convened a task force to study chusetts Koc~te495, the Corona Borealis bec:~useit was the only conference room mith ~.c'iotlo.ivs. the problem. l'he task force looked at a wide range ofpotential +Mergoing through Illore than one nanlc change. Thc original solutions, from the application of atlvancetl pipe- study team was c;~lletlthe 'RIS(:y \'AX Tksk Forcc:"The ;~tlvancedtlcvclopment nlorli W;ISlabeled "EVhX:' Wl~e~irhc lining techniques in Vtm systems to the tleployment program w;kh al,provetl, the Executive Committee den~;~ntletl;I of a new architecture. A basic constraint was that nr~~tralcode tl;lrne, hence "Alph;~." incorporated their aclvice into the develol~ment The work of Engineering, M;~n~~hcturing.M;II-- prog~-:lm.Ongoing partner ant1 customer aclvisory keting, Sales, and Service c;rme together in Noveni- bo;~rtlsprovitlecl 1-egr1l;ir!kc.tlb;~ck on all aspects ber 1992 with the annoilncement of the Alpli;~&XI' of the 1xogr;Im ;uicl helped shape two critic;il systems family: seven s),stenls, three operating sys- extensions of the original concept: the open licens- tems, six languages, multiple networks, migration ing of Alpha technology. and the porting of tools, open licensing of technology, hartlwilre ancl Wi ndo\vs N1'. software partnersl>ips, and more than 2000 com- Taken together. the scope of the Engineering mitted applic;~tions.Totl;~; Alph:~ ASI' e~iibodies21 effort, the ~ieetlfor il1:lrketing. Field, and Service fi~ndamentalrepositioning of 1)igit:il Equipment involvement, ant1 the liigli degree of customer antl Corporation to be the technology alitl solutions bilsiness partner p;~rticipation,posed irniclue man- leader in twentjyfirst centrlry conipi~ting:;I com- agement challenges. Rather than organize a large- pan). dedicated to meeting customers' neetls n~ith scale hierarchical project, the company chose to the best computing, business, ancl service technol- manage Alpha as a clistributed program. A small ogy available. The tlelivery of Alpl~aAXI' recluired progr;lni te;m irsed enrollment m;~nagernentprac- the largest engineering progr;lm in 1)igit;il's histor): tices ancl strict operation;~ldiscipline to coortlillate spanning lllore tli;tn twenly Engineering groups and inspect activities ;Icross the cornp;lny. This net- worldwide. This issue of tlie lli~qit~ilExlnt?icnl worked appro;~chto n1;ln;lgernent g;ivc the program Jozrrr?al documents just ;I few of tlie lli~ntlredsof both flexibility ant1 resiliency in the face of rapitlly projects involved in bringing Alph:c to fruition; changing business and organizational conditions. filture issues will continue the story. Richard L. Sites I

Alpha AXP Architecture

The Alpha AXP 64-bit conzptiter. arcl~itect~ireis designed for. high pefonizance GIH~ lor~~evitl!Bewi~ise oftlnefoc~is or1 i~~~~ltipl~ilistr~~dior~iss~/e,the ar-chitectui~ does not contain jhcilities s~ichcis brc~~zch~lelc~~ slots, lyte zur3ites,arzclprecise arithnzetic e,vceptions. Brccluse of thefocus 012 multiple processors, the architecture does con- tail? a careful shared-memory rrzodel, atomic-tipdate pri~nitirei~zstluctio~zs, atzd r.elcixed read/ii~riteorcleri~zg. Tl~e first itrzplenzentation ofthe Alpha AXP arclgitec- ture is the zi~orld'sfastestsirzgle-chill ~~zicr~oprocessorThe DECtbip 21064 r.utzs 11zulti- ple operwti~~gsjsterns and r~i~zsnati~le-conzpiled progra~rzs that were tra~ulated fro111 the 1!4X arzd 11.flPSarchitect~ires.

Thus in all these cases the Romans did what all Architecture Distinct wise princes ought to [lo; namely, not only to look from Implementations to all present troilbles. I,ut ;~lsoto those in tlie From tlie beginning of the Alpha AX[' design, we fut~lre,;lgailist which they provided with the Lltniost prudence. distinguished the architecture from tlie implemen- -Niccolo Machial-i-lli. The Y~Yrrce tations, following the distinction made by the IL\M System/360 architects: Computer ;~rcliitectureis tlefinetl as cl~c;~tt[.ibutes Historical Context ant1 behaviol. of ;I computer as seen by :I mz~chine- The Alpha IUP architecture grew out of a srn;~lltask language I,rogr:unmer. This definition includes the force chartered in 1988 to explore ways to preserve instruction set, instruction formats. oper:~tion codes, addressing motles, and all registers ;~ntl the VAX VMS custon~erbase through the 1990s. This memory loc:rtions th;~tmay be tlirec[ly m;~nipu- group eventually came to the conclusion that ;I new laced by a machine-1:lnguageprogr;lninier. retlucecl instruction set computer (RISC) architec- 1mplement:itioll is tlefinccl as the actu:~lh;trdw;~re ture would be neecled before the turn of the cen- structure, logic design. and data-path org;~nization tury, primarily because 32-bit architectures will run of a particular embotliment oF the architecture.' out of address bits. Once we 11i;ltle the decision to Thus, the architecture is a doc~~mentthat pursue a new architecture, we shaped it to do describes the behavior of all possible irnplementa- much more than just preserve the VLI 15 .' customer tions; an implementation is typically a single com- base. puter chip2The architecture and software written This paper cliscusses the architecture from a to the architecture are intended to last several number of points of view. It begins by making the decades, while indivitlual implementations will distinction between ;lrcIiitecture and implementa- have much shorter lifetimes. The architecture must tion. The paper then states the overriding arclii- therefore carefi~llydescribe the behnvior that a tectural goals and cliscusses a number of key machine-langu:lgr programmer sees, but must not arcliitect~~raldecisions that were derived directly describe tlie nie;lns by which a p;lrticul;~rimple- from these goals. The key clecisions distinguish the mentation achieves that behavior. Alpha AXP architecture from other architectures. A similar appro;~chhas been used with much The remaining sections of the paper discuss the success in specifying the PDP-11 ant! VtlX klmilies of ;irchitecture in more (letail, From data and instruc- computers. An alternate approach is to design and tion formats through the detailed instruction set. build a fast RISC chip, then wait to sce if it is suc- The paper conclirtles with a discussion of the cessful in the marketplace. If so, successive imple- designed-in futi~regrowth of the architecture. An mentations are often forced to reproduce accidents Appenclis explains some of the key technical terms of the initial design, or to introduce slight software l~seclin this paper. These terms are highlightecl incompatibilities. This approach works, but with with ;In asterisk in tlie text. varying success. Alpha AXP Architecture and Systems

Architectural Goals Digital RIS(: clesign cnlletl I'RISM.~ We p1:icetl the When we st;irted the tletailetl clesign of the Alpha underpinnings for interrupt tlelivery ;uid reti~m, AXP architecture, we had ;I short list of goals: exceptions. context switchitig, memory manage- ment, ant1 error hantlling in a set of privileged 1. High perfi)rrnance sol'twarc subroutines c:~lledI'r\Lcotle. These sub- routines have control let1 entry points, run with interrupts ti~rnedoff, ;~nd11;lve access to real hard- 3. Capability to run both Vh1S ;lnd IJNIX oper;iting ware (implementation) registers. By inclueling tlif- s!stems ferent sets of PA1.code h)r different operating 4 Easy migr;ition from Vr\X ancl MIPS architectures systems, neither the harclware nor the operating system is burclened wit11 ;I b;itl interface m;~tch,anti Tliese goals directly influencecl our key decisions the arcl~itectureitself is not bi:isetl tow;ircl a partic- in clesigning tlie architecture. ular compi~tingstyle. In consiclering performance atlcl longevity, we To ru~iexisting VA~and MIPS binary images, we set a 15- to 25-year design horizon and tried to avoitl atloptetl the idea of bin;rry tr;msl~~tion,':'as tlescribed any clesign elements that we thought could hrco~iie in n cornpanion paper.^.^.'^ The co~iil,in:ttion of limitations during this ti~iie.In current arcliitec- Pr\I.cocle anel binary transl;ition gave us tlie luxury tures, ;I primary limitation is the 32-bit memory of designing ;I new architecture. Other tli;ln the h~n- address. Thus we adopted a fill1 64-bit architecture, clamental integer and floating-point tlata types. with :I mjnimal number of 52-bit operations for there ;Ire no specific VAX or M~PSfeatures carried backw:irtl compatibility. directly into tlie Alpha i\>;ll instruction-set architec- W'e also cotlsidered how implenietitation perfor- tilre for compatibility re;isons. mance shoi~ldscale over 25 years. During the p;ist 25 ye:irs, computers have become about 1,000 Key Design Decisions times faster. Therefore we focusetl our design deci- This section presents the design tlecisions that clis- sions on :~llowingAlpli;~ ASP system imple~nentn- tinguisli the Alp1-1;~AXI' arcl~itccturefrom others. tions to become 1,000 times kister over tlie coming 25 years. 111011s projections of future perforni;ince, we re;lsoned that raw clock rates woulcl improve by a factor of 10 over that time, ant1 that other tlesign Tlie Alpha AXP architecture is ;I tradition;il HIS<: dimensions would have to provide two niore kic- loacl/store ;irchitecture, All ti;it;i is JII~V~~between tors of 10. registers ;inti Iiletnory without cornpiitalion, ;lnd ;ill If the clock cannot be made fastel; then more computation is done bctwcen values it1 re&''ISteI-S. work milst be done per clock tick. We therefore Little-entlian adtlressing ;lnd both VAX and IEEE designed the Alpha AXP architectirre to encourage floating-point operations':' arc carried over from the ~nultipleinstruction issue" implementations that \biS ;ind %]IPS;~rchitecturcs.- We assumed th;it most will eventi~;illysust:iin ;~l)oi~tten new instructions irnpleoicntations woultl pipeline instrilctions, i.~., starting every clock cycle. This aggressive tech- they woulcl start execution of a second, thircl, etc. nique of starting multiple instructions distin- instruction before the execution of ;I first instruc- guishes tlie Alpha AXI' architecture froni many tion con~pletes.We ;~ssunirtlthat the implementa- other I

Digital lkcbnical Journal Vo1.4 llkr 4 Special Issue 1392 2 1 Alpha AXP Architecture and Systems

tlie suppression state 111irst be swed ant1 restored. ;I little more complicatetl anel a little slower. With There are ;~lsoclefinitiotlal problems with this nonreplicated hitlden state, it is difficult to issue ;ipproach: Are exceptions ever reported for sup- another byte store until the first one finishes. pressed instructions? What happens if the sup- Fin;~lly,the existence of a byte store instruction has pressed instruction suppresses a third instruction? led to programs and library routines for other NSC implementations with single-byte move ancl com- Byte Loud or Store Iristt-uctions The Alpha AXl' pare loops. String manipillatioa on Alpha AXP architectlrre has no byte load or store instructions implementations is up to eight times faster by pro- anel no iniplicit irnalignetl accesses. There also are cessing eight bytes ;it a time." no partial-register writes. The byte load/store Insteacl of inclueling byte loacl/store, we followed instructiolis and unaligned accesses found in some tlie RISC philosophy of exposing hidden computa- RIS<: architectures can be a perforniance bottle- tion as a sequence of many simple, fast instructions. neck. They require an extra byte sl~ifterin tlie In the Alpha UParchitecture, a byte load is clone as speetl-critical loatl and store paths, ant1 they force a an explicit load/shift sequence; a byte store as an hard choice in fast cache tlesign. The partial-regis- explicit load/modify/store sequence. We tuned the ter writes found in other RlSC architectures can also instruction set to keep these sequences short. The be a performance bottleneck because they require instructions in tliese sequences can be intermixed, masking ancl shifting in the filndamental operation scheduled, and issued as multiples with other com- of accessing a register. putation, as can tlie rest of the instructions jn the On a previous project involving a MII'S implen~en- architecture. Table I gives :I sunlmary of the Alpha t;~tion,we found the shifter for the loatl-left/lo;rd- fit' instruction set. right instri~ctions to he a direct cycle-time bottleneck. Also, the VAX 8700 i~~iplementation~tio~~A~'il%?/rietic E.vceplio7l.s The Alpha IU(P architec- (circa 1986) removed the byte shifter in the ture has no precise arithmetic exceptions. lo;~cl/storehardware in favor of a faster microcycle, Reporting an arithmetic exception (e.g,,overflow, with 2 cycles for ;I byte 1o;rcl and 6 c!.cles for :III unclerflow) precisely means that instri~ctions ~ln;rlignetl32-bit access. This decision irchieved ;I subsecluent to tlie one c;rusing the exception net performance gain. Our experience encour;~getl must not be executetl. This is straightforwarcl 11sto avoid byte load/store. in 21 slow implementation that runs a single instruc- An addition;il problem with byte stores is that an tion to completion before starting the next one, iliiple~ilenterniay easily choose only two of the but becomes substantially more difficult to do three design features: fast write-back cache, single- qi~icklyin a pipelined four-way issue implemen- bit error correction code (E<:<:), or byte stores. tation. There are standard techniques available Byte stores are straightforward in simple byte- for clelivering precise exceptions while run- parity write-through cache implementations. ning quickly (checking exponents, supl~ressing Except for tlie expensive design of four or five E<:<: register writes, exception silos and backout), but bits for every eight bits of clata, a byte store to ;I fast thesc techniqiles consume substantial design E<:<; write-back cache ilivolves time and can cost some performance. They appear not to scale well with wider multiple issue or 1. Reading a11 entire caclic word::' faster clocks. 2. Checking tlie ECC bits and correcting any single- Exceptional cases are just that-exceptional, or bit error mre, events. Based p;~rtlyon customer requests, we cleciclecl to eliipliasize the performance of normal 3. Moclifiing the byte operations at the expense of exceptional cases. 4. Calculating the new E<:<: bits Rather than an implicit exception ortlering between every pair of instructions, we atlopted the 5. Writing the entire cachc word Cr;iy-1 model of :~rithmeticexceptions-in which This reatl-rnotlifi-write sequence requires l~iclclen exceptions are reported eventirally-plus an seqire~icerli;~rtlware and hidden state to hold the explicit trap barrier (T1WR) instruction that c;111be c:rche word temporarily. The secluelicer tencls to used to make exception reporting as precise as slow clown ortlinary full-cirche-wortl stores. The desired.lo We also tlocumented ;I code-generation neetl for byte stores tends to ripple tl~roughout clesign that needs one trap b;rrrier per branch (at the memory subsystem tlesign, m;rking each piece most) to give precise reporting. Using TRAPB Alpha AXP At.c./7itectrrt.e

Table 1 Alpha AXP Architecture Instruction Set Summary I Load/Store, Byte Manipulation CMPLT Compare signed quadword < CMPLE Compare signed quadword 5 LDA Load address CMPULT Compare unsigned quadword < LDAH Load address high CMPULE Compare unsigned quadword 5 LDL Load sign-extended longword MULL Multiply longword LDQ Load quadword MULQ Multiply quadword LDQ-U Load unaligned quadword UMULH Multiply quadword high, unsigned LDL-L Load sign-extended SUBL Subtract longword longword, locked S4SUBL Subtract longword, scale by 4 LDQ-L Load quadword locked S8SUBL Subtract longword, scale by 8 STL-C Store longword, conditional SUBQ Subtract quadword STQ-C Store quadword, conditional S4SUBQ Subtract quadword, scale by 4 STL Store longword S8SUBQ Subtract quadword, scale by 8 STQ Store quadword AND AND logical STQ-U Store unaligned quadword BIS OR logical EXTBL Extract byte low XOR XOR logical EXTWL Extract word low BIC AND-NOT logical EXTLL Extract longword low ORNOT OR-NOT logical EXTQL Extract quadword low EQV XOR-NOT logical EXTWH Extract word high SLL Shift left, logical EXTLH Extract longword high SRL Shift right, logical EXTQH Extract quadword high SRA Shift right, arithmetic INSBL lnsert byte low CMOVEQ Conditional move if reg = 0 INSWL lnsert word low CMOVNE Conditional move if reg # 0 INSLL lnsert longword low CMOVLT Conditional move if reg < 0 INSQL lnsert quadword low CMOVLE Conditional move if reg 5 0 INSWH lnsert word high CMOVGT Conditional move if reg > 0 INSLH lnsert longword high CMOVGE Conditional move if reg 2 0 INSQH lnsert quadword high CMOVLBC Conditional move if reg low MSKBL Mask byte low bit clear MSKWL Mask word low CMOVLBS Conditional move if reg low MSKLL Mask longword low bit set MSKQL Mask quadword low CMPBGE Compare bytes, unsigned MSKWH Mask word high ZAP Clear selected bytes MSKLH Mask longword high ZAPNOT Clear unselected bytes MSKQH Mask quadword high lnteger Branch Floating Point Load/Store BEQ Branch if reg = 0 LDF Load F format (VAX single) BNE Branch if reg # 0 LDG Load G format (VAX double) BLT Branch if reg < 0 LDS Load S format (IEEE single) BLE Branch if reg I0 LDT Load T format (IEEE double) BGT Branch if reg > 0 STF Store F format (VAX single) BGE Branch if reg t 0 STG Store G format (VAX double) BLBC Branch if low bit clear STS Store S format (IEEE single) BLBS Branch if low bit set STT Store T format (IEEE double) BR Branch BSR Branch to subroutine AddressIConstant JMP JU~D LDA Load address JSR ~umpto subroutine LDAH Load address high RET Return from subroutine -JSR-COROUTINE Jump to subroutine, return lnteger Computation and Conditional Move Floating Point Branch ADDL Add longword S4ADDL Add longword, scale by 4 FBEQ FP branch if = 0 SBADDL Add longword, scale by 8 FBNE FP branch if # 0 ADDQ Add quadword FBLT FP branch if < 0 S4ADDQ Add quadword, scale by 4 FBLE FP branch if I0 S8ADDQ Add quadword, scale by 8 FBGT FP branch if > 0 CMPEQ Compare signed quadword = FBGE FP branch if 2 0

I I I conrinucd on ncx~page Alpha AXP Architecture and Systems

Table 1 Alpha AXP Architecture Instruction Set Summary (continued)

Floating Point Computation CVTGF Convert G to F format and Conditional Move (VAX double/single) CWQ Convert T format to quadword CPYS Copy sign (IEEE double) CPYSN Copy sign, negate CVTQS Convert quadword to S format CPYSE Copy sign and exponent (IEEE single) CVTQL Convert quadword to longword CWQT Convert quadword to T format CVTLQ Convert longword to quadword (IEEE double) FCMOVEQ FP conditional move if reg = 0 C~S Convert T to S format FCMOVNE FP conditional move if reg ;t 0 (IEEE doublelsingle) FCMOVLT FP conditional move if reg < 0 CVTST Convert S to T format FCMOVLE FP conditional move if reg 5 0 (IEEE singleldouble) FCMOVGT FP conditional move if reg > 0 Dlw Divide F format (VAX single) FCMOVGE FP conditional move if reg 2 0 DNG Divide G format (VAX double) MF-FPCR Move from FP control register D~VS D~videS format (IEEE single) MT-FPCR Move to FP control register DlVT Divide T format (IEEE double) ADDF Add F format (VAX single) MULF Multiply F format (VAX single) ADDG Add G format (VAX double) MULG Multiply G format WAX double) ADDS Add S format (IEEE single) MULS Multiply S format (IEEE single) ADDT Add T format (IEEE double) MULT Multiply T format (IEEE double) CMPGEQ Compare G format = SUBF Subtract F format (VAX single) (VAX double) SUBG Subtract G format (VAX double) CMPGLT Compare G format < SUBS Subtract S format (IEEE single) (VAX double) SUBT Subtract T format (IEEE double) CMPGLE Compare G format 5 (VAX double) Srstem CMPTEQ Compare T format = (IEEE double) CALCPAL Call privileged architecture CMPTLT Compare T format < library (IEEE double) TRAP8 Trap barrier (precise exception) CMPTLE Compare T format I FETCH Prefetch (cache) date hint (IEEE double) FFICH-M Prefetch (cache) data, CMPTUN Compare T format modify hint unordered (IEEE double) ME Memory barrier (serialize) CVTGQ Convert G format to quadword WMB Memory barrier (serialize) write (VAX double) RPCC Read process cycle counter CVTQF Convert quadword to F format RC Read and clear (VAX single) RS Read and set CVTQG Convert quadword to G format PALRESO PALcode reserved opcode 0 (VAX double) PALRES1 PALcode reserved opcode 1 CVTDG Convert D to G format PALRES2 PALcode reserved opcode 2 (VAX double/double) PALRES3 PALcode reserved opcode 3 CVTGD Convert G to D format PALRESQ PALcode reserved opcode 4 (VAX double/double)

instructions in the first Alpli;~AXP imp1emenl';ttion lowers ~,erformance 3 percent to 25 percent in real The Alpha ASP architecturc's sl~arecl-nlcmory floating-point progralms ;~iirlless than 1 percent ill ~~lultipl-oc~~si~~gmodcl is ;In integral p;irr of the integer progr;irns, but improves cycle time :lpprosi- clesign. It is 1101 the ;~tltl-onfouncl in other RlSc: mately 10 percent. arcl~itcctures. In co11tr;tst to aritll~l~eticexceptions, nlemor). 'I'hc untlerl!ing pri~iiitivcfor safe updating of nlanagenicnt exceptions, such :IS page faults. ;Ire a mi~ltilxocesor-s11:trctl memory location is a reportetl precisel!: This is not ;IS much of ;I I,urtlen sequcncc of RI:.r' instructions: loacl-locked. in-regis- on implemcnters as precise arithtnetic mceprions ter modify, store-conditional. test. If this seqilence a~o~~ltlbe. ;tntl I;lck of precise nlenior!r manapmcl1t completes with no it~terrupts,no exceptions, ;~iitl faults would be a severe burden on softw:tre no interfering write from anothc-r processor, then writers. the store-conditional stores the motlified result. Al)lna AXP A ~.clnitecl.~~i1.c~ ant1 the test indic;~tessuccess: an atomic update arcliitecture differs from other Rlsc architectures was in fact performed. by careli~llyspecifying a canonical form for 32-bit If anything goes wrong, the store-conditional tlat;~in 64-bit registers. A c;inonical form is ;I stan- does not store a result, and the test i~iclicatesfail- cl;lrclizetl choice of clata representation for retlun- ure. The program must then retry the sequence d;~ntly encoded values. Since 32-bit operations until it succeetls. We chose this primitive sequence assume canonical operands and give canonical (quite similar to tlie MIPS chip designs) results, very few explicit conversions between 32- because it can be implemented in a way that scales ;incl64-bit representations are needed. up with processor performance. 111the xbsence of The fi~ndamentalunit of data in the Al~lpli;~AXl' an interfering write, the entire secluence can be architecture is a 64-bit quadword." As shown in tlone in an on-chip write-back cache, ant1 hi~ndreds Figure I, quadwortls may reside in memory or regis- of chips can tlo noninterfering sequences simulta- ters. For backwartls compatibility, 32-bit long- neously. The sequence can also be ilsetl to achieve wortls:' may also be stored in memory. byte granularity" of writes in sharecl memory.(' There are three fi~ntlamentalclata types: integer, Tlie Alpha AX[' arcliitecture has no strict multi- IEEE floating point, and VAX floating point; each processor reatl/write ordering, whereby the is available in 32-bit and 64-bit form~.~-I~VAX floating- sequence of re;icls ant1 writes issuetl 1,y one proces- point values in memory have 16-bitwords swapped, sor in a mi~ltiprocessorconfiguration is tlelivered for compatibility with VAS (and PDP-11) formats. to :ill other processors in exactly tlie order issued. The VAX floating-point load and store instructions Strict order is simple, but has a problem similar to tlo word swapping" to give a common register that of byte stores. An implenienter may easily order. The 9-bit loat1 instructions expand vali~esto choose only two of the three tlesign llentures: 64-hit canonical form, anti tlie 32-bit store instruc- pipelined writes. bus retry, or strict reacl/write tions contract 64-bit values back to 32."AII register ordering. to-register operations are thus done on fill 1 64-bit If one processor starts ;i write to location A and a values in a common integer or floating-point for- write to location 13, then tliscovers tli;~tthe write to 111at. No partial-register I.~:ICISor writes ;ire tlone. i\ i\ 11;ls hiled (bus parity error. etc.) :11itl retries it suc- The cano~licalform of ;I 32-bit value in ;I 64-bit cessfully, then a second processor will observe the integer register has the most significant 33 bits all writes out of order: H, then A. equal to bit<31>. In essence, bit<31> is kept ;IS ;I Before Alpha AXI' implementations, many VAX "fat bit." This allows signet1 integer values to be implementations ;ivoicletl pipelinetl wrilcs to main used directly in 64-bit ;~ritIimeticand branches. memory, multibank c;~ches,write-buffer bypassing, This canonical for111 is maintained as ;I closed routing networks, crossbar memory interconnect, system (even for 32-bit data considered to be etc.. to preserve strict reatl/write ordering. Tlie "i~nsigned")by using a combination of 64-bit oper- Alpha AXl' architecture's shared-n~emory111otlel ates, 32-bit add/subtract/rnulti~>ly,and two-instruc- instead specifies 110 implicit orclering between tlie tion sequences for shifts. reads and writes issuecl on one processor, ;IS viewed 'T'he canonical form of a 32-bit value in a by a different processor. This programming model 64-bit floating-point register has the %bit exponent is an enabling technology for a wide variety of high- field expanded to 11 bits ant1 the 23-bit mantissa perfor~iiance i~iil)lement;~tiontechniq~~es. Strict field expanded to 52 bits. Except for IEEE tlenor- orclering can be hpecificd when neecletl by insertion mals.':' this allows single-precision floating-point of explicit memory barrier (Me) instructions, quite values to be used directly in double-precision arith- similar to the lB*l System/370 serinlizatio~itlesign.11 metic and branches. This canonical form is main- tained as a closecl slatem by using single-precision Data Representation instructions. and Processor St-ate Bytes nntl words (16-bit quantities) are not fund;^- This section tlescribes the fundamental Alpha AXP mental data types. Tl~e!. may be transferred tlata types ant1 their representation in memory and between memory ant1 registers with short I-egisters.It also tlescribes the con~pleteIl;~rdware sequences of jnstructioiis ;~nclmanipulatetl jn regis- register state for e;ich processor ant1 outlitles ters using normal arithmetic ;incl the byte-m;inipil- the adtlitio~ial state maintained by operating- lation instructions described in the Operate system-specific 1'AI.cotle routines. The Alpha AXP Instructions section.

DigiLnl TecbaiUrtJoumd Vd. 4 hb, 4 Specic~l/.s.srrc 1992 2 5 Alpha AXP Architecture and Systems

QUAbH6RD lNfffiER (MEMORY) QUADWORD INTEGER (REGISTER)

TEE€ T#.mlWG POINT (MEMORY) IEEE T-FLOATING POlNT (REGISTER) fiR

VAX G-FLOATING POlNT (MEMORY) ,-. mn VAX G-FLOATING POINT (REGISTER) - U OJ L '.=. '.. . . I .. W 1 M2 EXP MI M2 ~3 M4 FX Im war.I I r I 16 16 16 1 11 4 52

LONGWORD INTEGER (MEMORY) LONGWORD INTEGER (REGISTER) 0 -. SSSSSSSsssSsSSSSsSs ... S - - RX 1 3 1 32 1 31

IEEE S-FLOATING POlNT (MEMORY) IEEE S-FLOATING POlNT (REGISTER) 31 0 63 0

EXP MANTISSA SXX EXP MANTISSA 00000000000000000... 0 FX 18 23 1' 11 ' 52

VAX F-FLOATING POlNT (MEMORY) VAX F-FLOATING POlNT (REGISTER) 31 0 63 0

M2 S EXP WI SXX EXP M1 M2 00000000000000000... 0 FX

16 18 7 1' 11 ' 52

'rlie liartlware processor state, shown in Fig~rre2, Hardware implementations may optionally includes 9 integer registers RO..K?I of 64 bits each; inclutle ;I pair of state registers for menlory R31 is alw;~!lszero. There are ;ilso 32 floating-point prefetching (FETCN/FFI'(:I-I-hl instructions), ;inti an registers FO..P31 of 64 bits cacli; FS1 is alw:ys zero. optional interrupt flag for use only by tr;rnslatetl Writes to R31 ;rnd F31 ;Ire ignorecl. VAX OlxnVMS ASP progr;lms th;rt reprotluce com- A 64-bit (IIC) contains a long- plex instruction set computer (C1SC8) instruction word-aligned virtual byte atldress (i.e., tlie low 2 atomicity using a sequence of RISC instructions.(> bits of the I>(: ;Ire ;~lwayszero). The VAX arcliitectr~rr In acltli tion to the above hartlw;rre state, the privi- keeps tlie I><: in general I-egister 15, wlicre it is legctl :ircIiitecture library routines for the \;;irio~rs directly used for PC-rel;~ti\:ememory ;iddressing. In operatitlg systems implement ;itltlitional st;rte. This the Pclgh;~AX]' architecti~rc,lio\vever, code ;rntl tl:~t;~ state m;iy be m:~intainedby h;rrtlware or (I'AI.code) pages are usirally separated by 64 kilobytes (K11) or softw;ire, ;it the option of the implementer, and it more to allow separate memory protection, but tlie varies from one operating system to anotlier. 16-bit displ;lcement in lo;~d/storeinstructions c:ln- Typical I'i.\l.cotle state includes a processor st:ltrrs not span more than 64~11. (13s) worcl, kernel and user stack pointers, ;l process The hardware processor state includcs ;I lock flag control block base for context switching, a process- ant1 ;I locked physic;il atltlress for the lo:rtl- unique v;tlire for thre;ids. ;rntl ;I processor number locketl/storc-condiriornrl sccjirence. It also h;ts ;I for mi11tipl-ocessor disp;itchil~g.Atldition;il I'til.code floating-point control register containing the IEEE state may include a floating-point enable bit, inter- dynamic rountling motle."' rupt jxiority level, ant1 translation loolc-aside Alphci AXP Architecture

HARDWARE STATE

/ 4 d /

R30 (STACK POINTER) F30 R31 (ALWAYS ZERO) F31 (ALWAYS ZERO)

a IEEE FLOATING-POINT DYNAMIC ROUNDING MODE

I LOCKED PHYSICAL ADDRESS I 0 LOCK-FLAG

OPTIONAL HARDWARE STATE I PREFETCH STATE A I PREFETCH STATE B I 0 INTR-FLAG TYPICAL PALCODE STATE

I PS I I KERNEL STACK POINTER I USER STACK POINTER I PROCESS CONTROL BLOCK BASE I PROCESS-UNIQUE VALUE I WHO AM I (PROCESSOR NUMBER) I I] FLOATING-POINT ENABLE (FEN) 1INTERRUPT PRIORITY LEVEL

-

I-STREAM TRANSLATION BUFFER 1 D-STREAM TRANSLATION BUFFER 1 T

buffers for mapping instruction-stream nnd clata- two aligned accesses or report a fat;~lerror, depentl- stream virtual adclresses. All of this state is soft in ing on the operating system design). the sense t1i:lt it is defined only in relationship to illpha I\XP implementations allow clata to be the I'ALcotle routines for a specific operating accessed using either a little-e11di:tn"'view (byte 0 is systeni. In ;I tni~ltiprocessorimplementation, all of tlie low byte of an integer), or ;I big-entlian'. view the above state is replicated for each processor. (byte 0 is the high byte of an integer). As tlescribetl in the Load/Store Instructions section, there is a Mernory Access one-instruction bhs in tlie secluenceb for little- ant1 Alpha AX11 memory is byte atltlressetl, using the low- big-endian byte manipi~lation. est-numbered byte of a datum. Only aligned long- Virtual addresses are a full 64 bits; implenienta- wortls or clu:cclwords may be accessetl: an :~ligned tions may restrict ;~tlclressesto have some number longworcl is a fonr-byte clati~mwhose atltlress is a of identical high-ostler bits, but must always distin- niultiple of four; an aligned quadwortl is an eight- guish at Least 43 bits. Virtual ;~ddresses;Ire nx~pped byte datum m~lioseaddress is ;I multij,le of eight. in an operating-specific way to plij~sical;~dtlresses, Normal loatl or store instructions that specify an using fixed-size pages. Memory protection is clone unaligned :~cltlresstake a precise data alignment on a per-page basis. Atldrcss mapping errors (e.g.. trap to I?4Lcocle (which may clo the access irsing protection, page faults) take precise traps to

Digilnl Tecbnicnl Jortrnnl Vo1. 4 !\'a 4 Speci~~lIssue 1392 27 Alpl~aAXP Architecture and Systems

I'i\l.code. Each page ma!. :~lsobe m;irkccl to provicle The rem;~ining bits cont;~i~ifiltlction (opcode a fault 011each read, write, or instruction-fetch. extension). literal, or dispI;~cetnentfields. To mini- Virt11;il ;~ddressesmay be further qu;~lifietl bj- mize register file ports in fast implementations, R13 atltlress space numbers (ASAS).to allo\v multiple is never written, ;~nclR(: is never rcatl. disjoint addresses spaces. l'hc choice ofclisjoint or All the operate instructions are three-operand common mapping across all processes is clone on a register-to-register, c:llcul;rting RC = RA oper~lteRR. per-page basis. In integer operzttes, tlic opcode and a 7-bit firnction 'The virtual- to physical-atltlrt.ss mapping is tlonc field specify tlic exact operation. Integer operates 011:I per-page basis. Each i~iiplc~~ient;irio~im;iy 1i;ive may Iinve ;In 8-bit zero-extentlecl literal instead of ;I p;tge size of 8m, I~KH,-32Kll. or 64Kl1. ?'he 0 ih;~ RH. In floating-point operates, the opcode and ;in up1?er bound allows a linker to ;~lloc;iteblocks of 11-bit function fieltl specify the exact operation. memory with cliffering protection or ,\h\ proper- Tliere :ire no flo;iting-point literals. ties htr enough apart to work on all implementa- Memory format instructions are used for lo;tds. tions. The virtual- to p.hysic;~l-;ttltlrcshmapping can stores, ;und a few miscellaneous operations. Loads be m;in) to one, i.e.. sgnon!.ms :~rc;~llo\i~ecl. In ;I ant1 stores ;ire two-opcrancl instructions, specifying multiprocessor implement;ition, sIi;~reclmain rnern- a register KA ;uid a base-disp1;lcernent virtual byte ory 1oc;ltions haw tlie same pl1ysic;ll ;~tltlrc.sson all adtlress. l'hc effective atltlress calculation sign processors. Per-processor i~~i.sI~:irccl1oc;ttions ;ire extertcls the 16-bit clispl;~cementto 64 bits ancl :iclcls also ;~llowetl. the 64-bit RI$ b;ac rcgister (ignoring overflow). The Memory has longworcl gl-;~ni~l:irity:two proces- resulting virtu:~lbyte ntldress is mappecl to a pliysi- sors III;IJ'si1ni1Itaneo~is1y ~CCC.SS ;tclj:tccnL longwortls cal adclrrss. The ~nisceIl;~aeousinstructions makc without rnut~~alinterScrcnce. T1ie loatl-locketl/ other uses of the 11A, RH. ;ulcl displacement fields. store-co~~tlitiollalsequence disci~sx~l~>re\,io~~sl!. c;in Branch h)rm;lt instructions spec@ a single regis- be i~sedto achieve rnultiproccs~orbj,te granul;lrit!: ter It1 ant1 ;I signet1 PC:-relative lotigword displ:ice- Inpi~t/oi~tpi~tis memory m:~ppeti: some phys- rnent, The br;uicIi target c;tlculation shifts tlie 21-bit ical memory adclresses may refer to I/() device tlisplacemcnt left b)~2 bits to make it a longwortl registers whose access trigpcrs sirlc cl'l'ccts (such (not byte) tlisplacemetit. then sign extc~ldsit anel as tlie transfer of d;~t;l).Sicle effects on re;~clsare atltls it to tlie uptlatecl I>(:. Conditional br;inch cliscour;tgecl. instructions test register hi, ;ind unconclitional branches write tlie uptl;~tedPC to FU for subroutine Irtstraiction Formats linkage. The 1:lrge longwortl displacement allows a Four funclamental instruction format>-operate, range of+4Rill3, subst;~ntiallyreducing the need h)r menlor!;, branch, and <:h1.IL124L-are shown in branches arountl or to other branches. Figi~re3. All instructions :Ire 32 bits witlc ant1 reside The (:~r.I.-l%l.instruction has only a 6-bit opcotle ill memory at aligned longwortl ;icltlresses. E;tch ;ind :I 26-bit function fielcl. The hnction fieltl is ;I i~~structioncontains :1 &bit opcocle fieltl :mtl zero small integer specifying one of a few dozen privi- to three ?-bit register-number t'ieltls. Kh, Rl). ;tntl RC:. Iegecl ;ircl-litecture Iil)r;~rj~s~~broutines.

OPERATE FORMAT BRANCH FORMAT 31 31 26 21 0 LITERAL 1 FUNC. INTEGER. LITERAL OP RA DISPLACEMENT

INTEGER. REGISTER 6 5 21

RB FUNC. FLOATING POINT CALL-PAL FORMAT 31 26 0 6 5 5 11 5 I I 1 MEMORY FORMAT 1 Op 1 FUNCTION I 31 26 21 16 0

OP RA RB DISPLACEMENT 6 5 5 16

28 I hl. .# .Vo.-I .T/~c,c,ir/l/s.s/tc~ 1992 Digitnl Tec/~niralJour~rnl Alpha AXP Arclnitect~lr.c

OJemte Instructions placing it in the low end of a register with high 'I'here arc five gro~~psof register-to-register operate order zeros. A pair ol E>(Txl./EXt'xFI instructions can instructions: integer arithmetic, logical, byte- perform unalignecl loacls, pulling the two parts of manipulation, floating-point, and miscell;~neo~~s. an unalignecl cl:~tum out of two qlladwords anel All instructions operate on 64-bit qu;~dworcls placing the parts in result registers. A si~npleOR unless otherwise specified. operation can then co~nbinethe two parts into the full tlatum. Integer Arithinetic I~~structiorzsThe integer arith- The insert (INSxx) ;ind mask (MSKxx) instruc- metic instructions are add, subtract, multiply, and tions position new t1:1r;1:incl zero o~~toIt1 data in reg- compare. Acld. subtract, and multiply have variants isters for storing bytes. words, and i~naligtleddata. t11;lt enable ;~rithmeticoverflou~ tr;~ps. ?'liejr ;~lso If the Alpha AXI1 ;ircliitecture were a four-operantl 11;lve longword variants that check h)r 32-bit over- one, inserting and masking coulcl have been com- flow (instc:id of 64) ancl force the high 33 bits ofthe bined into ;I single instruction. result to all equal bit<31>. Add ancl subtract also The compare-byte instruction ;~llowscharacter- Ihave sc;~leclv;lri;rnts that shift the first operanel left string search and conipnre to be done eight bytes at by 2 or 3 bits (with no overflow cl~ecl~ing)to speed a time. The ZAP instructions ;~llowzeroing of arbi- trary patterns of bytes in a register. These instruc- 1113 simple subscripted address arithmetic. The tions allow very fast iml>leme~lt;~tionsof the C IIMII1.H instruction (from I'RISM) gives the higll 64 bits of an unsigned 128-bit product and may be language string routines, among other uses. used for dividing by a constant. rl'lie~.eis no integer clividc instruction; a software subroutine is usecl to Floatingpoilzt AritI~~~zeIicIi~struc'tior?~ Tlie float- tlivicle by a nonconstant. The compare instructions ing-point arithmetic instructions :ire aclcl, subtract, are signed or i~nsignedancl write a Hoole;~nresult (0 multiply, clivide. conipare, ;inti convert. The first or I) to the target register. four have v;iriants for IEEE ;~nd\b\X floating-point, a~idsingle- and double-precisio~iclat;~ t!.pes. ?'hey Logiccrl Itistr~lctiorls The logical instructions :Ire ~lsohave variants that en;ible combin;ttions of arith- AND, OR, and XOR, with the secontl opcs;lnd metic tral~sanclt1i;lt spccih the roi~nclingmode. optionally complemented (ANDNOT, ORNOT, Tlie single-precision instructions write canonical SOIINOT). The sliifts are shift left logic;il, shift right 64-bit results, but clo exponent checking and logical, ;lncl shift right arithmetic. The (,-bit shift rouncling to single-precision ranges. The compare count is given by RB or a literal. The conclition;~l instructions write ;I Boolean result (0 or nonzero) move instructions test RA (same tests as the branch- to the target register. Tlie convert instr11ctio1.1~ ing instructions) and conditionally move RIi to It(:. transfer between single nncl clouble, floating-point These can be llsetl to eliminate branches in short and integer, ancl two forms of Vt\X double (D-float sequences such as &llN(a,b). ;Inel G-bat). A conlbination of liarelware and soft- ware provides full IEi(li ;~rithmetic.Operations on Bj)te-~i~unip~~btionctonIt~structiorts The byte-manip- \%S reserved operancls;' clirty zeros,'.' IEEE denor- i11;ition instructions are usetl with the lo;~tl:~ncl mals, infinities, and not-;1-1ii1rnht.rs are done in store i~nalignedinstructions to manip11l;lte short softnlare. un;~lignedstrings of bytes. Long strings shoultl be There are also :I fc\v f1o:lting-point instr~~ctio~is manipulated in groups of eight (aligned quad- tlxit move tlata withoi~t;~pl,lying any interpretation worcls) whe~ieverpossible. The b\~e-m;u~iipi~l:ltion to it. These include a complete set of conditional instructions :Ire fi~nclamentallymaskecl shifts. They move instructions similar to the integer conditional differ born normal shifts by h;~vinga byte count moves. (0..7) instead of ;i bit count (0..63), ;lncl by zcroing some bytes of the result, based on tlic cl;~t;~size h4iscellar~eon.s I~zslruclio~zsThe miscellaneous given in the function field. instructions incl~~de:Inernory prefetching instruc- The extract (ESTxx) instructions extract part tions to help decre:~sememory latency a re:ld cycle of a I-, 2-, 4-, or 8-byte field from ;I qu;iclworcl counter instruction for perh)rrn;111cemeasurement, anel place the resulting bytes in a fielcl of zeros. A ;I trap barrier instruction for forcing precise arith- single ES'l'xL instruction can perform b!.te or worrl metic traps, ancl nlernor!. barrier instructions for loads, p~~llingthe datum out of a qi~aclwortl;11ic1 forcing multiprocessor rc:lcl/write ordering.

Dig ~I~IIT~CIJ~I~CL~I JOU~ILLCI 14~1.4 !YO. 4 .S/)L,CIO/ /.C.SIIO 1092 29 Alpha AXP Architecture a~ldSystems

conditional moves, the ;irchitecture contains hints The load ant1 store instructions only move data. to improve branching performance. They never apply an interpretation to tlie data and The integer contlitional branches test register RA therefore never take any data-tlepentlent traps. This for an opcotle-specified condition (>O >=O =O !=O design allows moving completely z~rbitrarybit pat- <=O routine call, return, CASE (or storing ;I byte (the first two and last two instruc- SWIT(:H) statements, and coroutine linkage. tions might issue sin~ultiineouslyon the first Alpha The architecture specifies three kinds of brancli- AXP implementation). Example 4 shows a sequence ing hints in instructions. The hints need not be for ;In explicit unalig~ietlIoatl quatlwortl (no data correct, but to tlie extent that they are, implementa- ;il ignment trap). tions may perform bster. 'l'lie integer load-lockecl ;lnd store-conditional The first for111 of llint is an architectetl static (1,I)Q-I., LDL-L, ST(>-C, STL-<:) instri~ctionsare branch precliction rule: forward conditional inclutled in the architecture to facilitate atomic branches :ire predicted not-taken, and backw;~rcl i11~tl;ites of multiprocessor-sh;~red tiAs ones taken. To the extent that compilers and hard- tlescribed above, they can be i~sed in short ware implenlenters follow this rule, programs c;~n sequences of RISC instructions to do atomic read- run more quicldy with little hardware cost. This modify-writes, Example 5 shows ;I sequence for hint does not eliminate the use of dynamic br:unch doing a multiprocessor test-;inti-set. Note that prediction in ;in implementation, but it may recluce ch;lnging the r.DQ-~J/Sl'Q-tl in Exr~niple 5 to tlie need to use it. ANI)/~.I>Q-L/ST(~-C/BEQgives a byte-store sequence The secontl for111 describes co~npi~tedjump tar- chat is s;ll'c to use with n~~~Itiproccssor-~l~:~recldata. gets. Unusetl instruction hits are defined to give the There :ire two related loxl :~dtlrcssinstructions. low bits of the most likely target, using the same tar- 1.1)~ calculates the effective atltlress and writes get calcul;~tionas unco~iditionalbranches. Tlie 14 it into R(;. LDAH first shifts the tlisp1;lcernent bits providetl are enough to specifi the instruction left 16 bits, then calculates the effective aclclress offset within ;I page, which is often enough to start ;inti writes it into RC. LDAM is inclilcletl to give a sirn- a fastest-level instruction-cache read many cycles pie way of creating most $2-bit constants jn a before the :~ctualtarget value is known. pair of instructions. (Because I.I>A sign-extends Tlie third for111 tlescrjbes sr~broutine;lot1 corou- the tlisplacement, some v;ilucs in the range tine returns. By marking each branch and jump as 000000007FFF800O .. 000000007FFFFI:FF require c:ill, return, or neither, the architecture provides three instructions.) Constants of 64 [?its :ire loadetl enough information to maintain a small stack of with I,DQ instructions. likely subroutine retilrn atldresses within an iniple- mentation. This implement;~tionstack can be usecl Blwrzclning Instructions to prefetch subroi~tinereturns quickly. 'The br;incli instructions include conditional The conditional move instructions (discussed branches, unco~lditio~l;~~br;lnches, and c;llculated previously in the L.ogic:~lInstructions section and jumps. In ;~cltlition to the previously described the Floating-point Arithmetic Instructions section) Alpha AXP Architecture

EXAMPLE 1: LOAD BYTE (UNSIGNED. LITTLE-ENDIAN) 7 60432 10 LDQ-U R2,O(R1) 1 ( BYTE 1 R2 76543210 EXTBL R2,Rl ,R2 r 0 IBYTE R2

EXAMPLE 2: LOAD BYTE (SIGNED. BIG-ENDIAN) 01034567 LDQ-U R2,O(R1) I ( BYTE ( R2

SUBQ R31 ,R1 ,R3 I -2 R3 01234567 EXTQH R2,R3,R2 BYTE 1 R2

EXAMPLE 3: STORE BYTE (LITTLE-ENDIAN) 76043210 LDQ-U R2,O(Rl) I I OLD I R2 76543210 INSBL RO,Rl,R3 1 NEW 1 R3

OR R2,R3,R2 I ( NEW ( R2 76543210 STQ-U R2,O(R1) I NEW I (o(~1)

EXAMPLE 4: EXPLICIT LOAD QUADWORD (UNALIGNED, LITTLE-ENDIAN) 7 6043210 LDQ-U R2,O(RI) LOW PART I R2 15 14 13 @ 11 10 9 8 LDQ-U R3,7(R1) HIGH PART R3 76543210 EXTQL R2,Rl ,R2 I LOW PART R2 76543210 EXTQH R3,Rl ,R3 I HIGH PART R3 76543210 OR R2,R3.R2 I HIGH PART ( LOW PART R2

EXAMPLE 5: MULTIPROCESSOR TEST-AND-SET

LDQ-L R2,O(R1) I FLAG R2

BNE R2,FLAG-SET FLAG R2

BEQ R2,CONTENTION I STORED? IR2

Figure 4 Load/Store Instrzlctio17s

Digital Tech~cicalJo~i~-lrnl Vol 4 iVo 4 J/I~LLCI/I.FSLIL' 1992 3 1 Alpha AXP Architecture and Systems

anrl the bratiching hints eliniinate some branches 1,000. liacli ;irchitected IYl'P; form:it also hi~sonc bit ant1 speetl 1111 tlie remaining ones without conipro- reservetl for hture expansion. mising multiple instruction issue. Sevcr:il other soft 1)ALcocle registers, such ;IS tlic 1'5 or ,iSN. that need only a few bits toclay ;Ire allo- Supmu ision cated ;I fi111 64 bits for future exp:insion. The ;ictions unclerpinni~igan operating system are Exception processing can limit perforrn;~ncr. perforrnetl in [?.\J.cotlesubroutines ant1 are a flexi- I'Al.code routines tleliver exceptions to an oper;lt- ble 1x11-tof the architecture. All ;isyncI~rotioiis ing s).stcln, so the dcsign can be gr;iclt~;~lly events, such as intet-rupts, exceptions, and m;~chine im]>rovecl. In hct. I?,\Lcotle routines for t1-1~.clat;~ errors, are rnediatecl by l%I,code routines. I'i\Lcotlc ;~lignmentIxtve bcen iniprovetl in the Open\ ITISAXP establishes the initial state of the machine before ;untl Ill:(: OSWI t\XP operating systcnis. Some cur- execution of the first software instruction. l'A1~c0tle rently specified softwarc exceptions (such ;is Ilrlil; routines 11iecli;iteall Accesses to physical liartlw:~rc clenornl;il ;irithmetic) could be mo\~etlinto 1'/2l,coclc resources, includitig physical main memory anel or hartl~v:ire. memory-rn;i1>1>etlI/() device registers. ?'here ;)I-e;I llumber of areas of instl-i~ction-set flesibilit!. designetl into the architecture. tour of This design allows irnple~iientersto craft ;i set of l't\Lcotle routines tliat closely match an oper;tting the (,-bit opcodes are no~iiinally reser\.ecl for s),stem tlesign, not onljr for trnclitioniil oper:iting ;itlcling intcgcr and floating-point alignctl oct:~- s),stems, but also for specialized environ~iientssuch \s;ort14(128-bit) load/store instructions I' Nine more ;is rc;il-time or bigblj. secure cotiiputing. As new 0-bit ol~codesremain for other cxpiinsion. Witliin co~~ipi~tingp:ir;~dignisare atlopred ;lntl new opcr;it- e;icli opcocle, the function firltl contains room for ing systems ;ire created, the Alpha r\XP architecture fi~~-tIicresp;tnsio~l. For csample. tlie sc;ilcd ;icltl/sirl~- may we1 l prove flexible enougli to accommotl;~le tract f~rnctionswere atldcd between l~rototypc then1 cfficie~~tly. chip iind protluct chip. The fnct th;it tlie function fielcls ;ire not f~11Iypolicecl is a mistitke. Future Changes Within the IiXE floating-point f~tnction ficltl, coclc points are nomin~llyreserveel for clo~~blc- The Alpli;~AXP ;~rchitecturewill surely ch:ulge estenclrtl" precision (128-bit) aritlimetic. Within during its lifetime. In addition to the I'~iI.cocle {:lie memory b:trricr instruction gr0~113,tbre~ COCIC' flexibility discussed above, explicit performance poi~itswere reserved for subset bal-ricrs. One of flexihilit)~and instruction-set flexibility exist in these h;is alre;idy been redcfi~inetlas ;I write-write the arcliitcct~~re. b;trricl: Architectural fields that are too sm;11l can limit Not all changes involve growth. l'herc ;*resubsct- perfo-ni;ince. The Alpha hXP architect~~rethere- ting rulcs clcfined for removing either one or both fore h;~smany fields tleliberatel!. sized for 1;iter (1I;I;l; ;uicl US) floating-point clat:i types. If both ;Ire exp;insion. removecl, the floating-point registers c:ln ;~lsoI>e i\lthougli initial implernentations ~1stonl!, 43 rerno\,etl. Tlie h.\l(>Yss P~\Lcocleroiltinc> :~ntl1<5/11(; bits of virtual ;~cldress.they check the reriiaining instructions are defined as option:il ;~ntlc;in be 21 bits, so tli;~tsoftware can run ~inrnoclifictl011 tleletecl nknthe trailsirion of tl-;insl;itcrl \ii\S coclc 1;lter implementations that use (up to) :ill 04 bits. is cornplcted. Other unneecled l)r\l.cocle rolltines Furthermore, ;~ltl~oughiniti;il implementations use c:m ;~lsobe removed e\;enti~;ill\: only 34 bits of pliysical atldress, the nrchilcctetl I>;lge t;tble entry (PTE) formats ancl p:~ge-size choices allow growth to 48 bits. By exp;lncling into SZ~VI~?~~~?~~ a 16-bit I'?'E fieltl tli;lt is not currently used by m;~p- FShcgoals rii;lt shaped tlie Alpha AXl' ;~rchitecture ping Ii;irtlw;~re,another 16 bits of physical adtlress clesigli have largely been realized. For high perfor- grcnvth can be acl~ieved,if ever neetletl. mance, the first irnp1ement;ttion (the I)lcc:cliil~ Initial implementations also use only SKI3 p;lges, 21004 11iicroprocessor) is listcd in the October 1992 but tIie clesign accommodates Ii~iiit~tlgrowth to (,'r~irilicss:soot2 of Recor.ds ;IS tlie world's cistest sill- 6,iKIi pages. Beyontl that, page t;tljle gr:in~~l:lrit)~ glc-chip microprocessor. It is too early to mc;isure liin~s:illow groups of 8, 64. or 513 p; to be longe\..ity but tlie fact that \\.c had clesigncd-in flcsi- rrc;ttc.cl ;is ;I single large page, th~iseSf'ectivcljr bility in pl:~ceatl1;it cliangecl tlurir~gclc\.clopment is cstcncling tlic. p:~gc-sjzerange b). :I ktctor of ovc,r ;it 1c;ist encour:iging. OpenVYlS ,\XI', [>kc:OSI/l ,\XI'. and Windows NT operating systems a11 run on IEEE not-a-~zuinber(NLLAV-A symbolic entity Alpha AXI-' implementations today. Programs from encoded in a floating-point format. The lEEE stan- the VAX ant1 ,\LIPS architectures transport easily to dard specifies some exceptional results (e.g., 0/0) Alpha AXIJ implenientatiolls ant1 run quickly Many to be NaNs. of the ideas in the Alpha AX11 design are now being Linear crddressilzg-A memory addressing tech- adopted by other architectures in the industry. nique in which all addresses form a single range, from 0 to the largest possible address. Subscript cal- Appendix culations can create any address in the entire range. Biiznry traizsl~~tioiz-A software technique to change an executable program written for one Little-endian rnenzory addressirzg-A view of architecture/operating-s~rstempair into an equiva- memory in which byte 0 of an operand contains the lent program for a different architecture/ol>erating- least significant bit of an integer. The terms little- system pair. endian and big-endian are borrowed from Gulliver's Tr~iz~elsin which religious wars were Rig-endiaiz nleinory addressing-A view of niem- waged over which end of an egg to break. ory jn which byte 0 of an operancl contains the Longz~~ord-A32-bit datum. most significant (sign) bit of an integer. Compare lit- tle-endian memory addressing. Mz~ltipleinstr~iction issz~e-A highperformance computer implementation technique of starting Byte-An 8-bit datum more than one instruction at once. An implements Byte granularity-The appearance that two pro- tion that starts (up to) two instructions at once is cessors can update adjacent bytes in memory with- called dual-issue; four instructions, quad-issue or out interfering with each other. four-way issue; etc. CLSC-Complex instruction set computer, charac- terized by variable-length instructions, a wide vari- Quadu~orcl-A 64-bit datum. ety of memory aclclressing motles, and instructions that combine one or more memory accesses with RISC-Reduced instruction set computer, charac- arithmetic. <:Is(: designs express computation as a terized by fixed-length instructions, simple Inem- few complex steps. ory acltlressing modes, and a strict decoupling of load/store memory access instructions from regis- IEEE clei~ormalizeclnunzber (denorma1)-A float- ter-to-register arithmetic instructions. NSC designs ing-point number with magnitude between zero express computation as many simple steps. and the smallest represelltable normalized number. Numbers in this range are typically not repre- Segmented addressi~zg-A memory addressing sentable in other floating-point arithmetic systems; technique in which acltlresses are broken into two such systems lnight s~gnalan untlerflow exception or more parts (segments). Subscript calc~~lations or force a result to zero instead. can only be done within a single segment, and elab- orate software techniques are needed to extend IEEE clouble-extended fornzat-A loosely specifed addressing beyond a single segment floating-point format with at least 64 significant bits of precision and at least 15 bits of exponent VAX dirty zero-A zero value represented with a witlth; typically implemented using a total of 80 or non-zero faction; must be converted to a true zero 128 bits. result.

IEEE clynntnic rounding mode-One of four differ- VAX flroating-point-A form of computer arith- ent rounding rules. metic specified by the VAX architecture manual.4 VAX arithmetic includes rules for reserved IL5k f%o~!tiizgll,oi~zt-Aform of computer arith- operands and dirty zeros. metic specified by lEEE standartl 754.l' IEEE arith- metic inclr~desrules for tlenormalized numbers, VAX reserved operand-A non-number that signals infinities, and not-a-numbers. It also specifies four an exception when used as an operand in VAX float- different modes for rounding results. ing-point arithmetic. IEEE i~zfinitjl-An operand with an arbitrarily large VAX ulord su~apping-The rearrangement needed magnitude. for the 16-bit pieces of a vi\X floating-point number

Digital Technical Journal %)I. 4 IVO. 4 Special Issue 1992 Alpha AXP Architecture and Systems to put the fields in a more ilsr~alorder; this is an arti- 8. There are two special-resource ;inornalies in krct of the l'l>IJ-l L 16-bit ;~rchitecture. the architecture that we were unable to avoid: the tleclicated state for the load-locked W~I*L/-)~16-bit tI:~ti~ni, instruction and the dynamic rountling-mode register required for f11ll IEEE conh)rm:~nce. Acknowledgments 9. This is borne out in a large customer's recent Hirntlretls of people have worked on the Alpha AX[' C string manipillation benchmark result, run- ;~rcliitectt~re,h;irdware, ancl software. M;my Alpli;~ ning 3 to 6 ti~iiesfaster than the custoniet-~s AXI' arcllitectural icleas came from the PRISM expectation (wl~icliwas basecl solely on clock design, most notably the PALcode idea. The archi- rate ratios). tecture work was clone in the rich envirolirnent of dozens ;inti later Iiundrecls of bright, thoi~glitfi~l, 10 C~-LIJ~-~Colrzputer Systenz KeJerence ~Vl~~~zuol, and outspoken professional peers. Ellen bat bout;^. For111 2240004 (Minneapolis Crny Rese:ircli, Dileep 13handarkar, Richard Brunnel; Wayne Inc., 1977). <:;irdoz;~.Dave Cutler, Daniel Dobberpuhl, Robert I I. IB,l-1 Systenz/.?70 Prilzciples of 0~)erwtion. (iiggi. Henry Grieb, Richard Grove, Robert Form C~22-7000-4(Armonk. M': Ilih4 Corpo- M:~lste;~d.Jr.. MicIi;~elHarvey, Nancy Kronenberg, ration. 1974): 28. R~ymonclLanza. Stepheti Morris. Willialn Noyce, Charles Nylanclec Ilave Orbits, Mary Payne. Audrey 12. Institute of Electrical and Electronics Engi- Reith, Robert Supnik. Benjamin Thomas, Catharine neers. "Binary Floating-point Arithlnetic for van Ingen, ant1 Rich Witek all contributed directl!, ,Microprocessor Systems,'' Standard Number to the written specification. Rich Witek is co-;rrclii- 11313-754 (New York, 1985). tect ;~ndis the other half of the term "we" usetl in 3. The careh~lreader will notice that Alpl~;~,\XI) this paper. implementations require a longword shifter in the loacl/store patli for 32-bit operands. References and Notes Although we briefly considered ;I design with no 32-bit operands, we decided to Iccep 52-bit 1. C;. Amdahl, G. Blaauw, and E Brooks, Jr., load/store support for good business reasons. "Architecture of the IBM System/S60," Ih'M Similarly, iUpha AXP implementatiolls require Jou1.7~rrloJKesearcl7 arzd Derlelopment, vol. 8, ;I worcl swapper in the load/store patli for VAX no. 2 (April 1967): 87-101. flo;~ting-pointoperands. We clecidetl to keep 2. R. Sites, ed., Alkbn Architectur-e Refererlce VAX floating-point support for good I)i~siness lLIr~rz~~~l(Burlington? ha: Digital Press, 1992). reasons. Depending on market needs, \'AX floating-point support can be removetl in the 3. 11. Conr;ltl et al., "A 50 MIPS (Peak) 32/(i4b h~ture. Nlicrolxocessor," ISSCC Digest cg ~~'c%?I~~L.cI/ Pz~I~ers(February 1989): 76-77 14. Many commercially successful arcliitcc- tilres have grown to double-witlt11 memory 4. It. Brunner, etl., I2.X Arclnitectnre Kt?jbr.er?cc. implementations in mid-life: the lohl 709 ,Wcln~~oISecond Etlition (Bedhrtl, MA: 1Iigit;ll series from 36 to 72 bits; the II%MSysteni/3GO Press, 1991). series from 32 to 64 bits: the Digital I'1)1'-11 series from 16 to 32 bits; ;rnd the 1)igit;ll 5. <;. Kane and J. Heinrich, ,llIP,F R1.K Ar'cbitec- VAX series from 32 to 64 bits. This trentl is ture (Englewoocl Cliffs, NJ: Prentice-Hall, likely to continue. 1992).

6. R. Sites, A. Chernoff, M. Kirk, M. Marks, atid S. Robinson, "Binary Translation," L)igitz~I Tcclnnic-a/Jourrznl, vol. 4. no. 4 (1992, this issue): 137-152.

7 The little-endian bias is very slight; both big- and little-entli;~nAlpha AXP systems and soft- wxre are in fact being built. Daniel W Dobberpuhl Robert A. Conrad Liam Madden Richard 1: Witek Daniel E. Deuer EdwardJ McLellun Randy Allmon Bruce Gieseke Derrick R. Meyer Robert Anglin Soha M.N Hassoun James Montanaro David Bertucci Gregory W Hoeppner Donald A. Priore Sharon Britton Kathryn Kuchler Vidya Rajagopalan Linda Chao Maureen Ladd Sridhar Samudralu Burton M. Leary Sribalan Santhanam

A 200-MHz 64-bit Dual-issue CMOS Microprocessor

A 400-1n~s/200-1~1FLOPS(peak) custom 64-bit VLSICPU chip is described. The chip is fabricated in a 0.75-pmCMOS technology utilizing three levels of metnlizntio~zand optimized for 3.3-Voperation.The die size is 168 mm X 13.9 mm and contains 1.68 lnilliolz transistors. The chi) includes sepamte 8KB instrtiction and dcit~rcacl?es and a fully pipelined floating-point unit tlgat can handle both IEEE and VRY standard floating-point data types. It is designed to execute two instructionsper cycle among scorebonrded integer;floating-point, address, and branch execution units. Pozoer dissipation is 30 W at 200-MHz operation.

A reduced instruction set computer (NSC)-style CMOS Process Technology microprocessor has been designed and tested that The chip is fabricated in a 0.75-pm, 3.3-V, n-well operates up to 200 megahertz (MHz). The chip CMOS process optimized for high-performance implements a new 64-bit architecture, designed to microprocessor design. Process characteristics are provide a huge linear address space and to be devoid shown in Table 1. The thin gate oxide and short of bottlenecks that would impede highly concur- transistor lengths result in the fast transistors rent implementations. Fully pipelined and capable required to operate at 200 MHz. There are no of issuing two instructions per clock cycle, this explicit bipolar devices in the process as the incre- implementation can execute up to 400 million oper- mental process complexity and cost were deemed ations per second. The chip includes an 8-kilobyte (JSB) I-cache, 8KB D-cache and two associated trans lation buffers, a four-entry, 32-byte-per-entry write buffer, a pipelined 64-bit integer execution unit Table 1 Process Description - - with a 32-entry register file, and a pipelined floating- Feature size 0.75 Krn point unit (FPU) with an additional 32 registers. The Channel length 0.5 pm pin interface includes integral support for an exter- Gate oxide 10.5 nm nal secondary cache. The package is a 431-pin pin 0.5 Vl-0.5 V grid array (PGA) with 140 pins dedicated to V,,/y, Ynl 4, (power supply voltage/grouncl). The chip is fabri- Power supply 3.3 v cated in a 0.75-micrometer (pm) n-well comple- Substrate P-epitaxial with n-well mentary metal-oxide semiconductor (CMOS) Salicide Cobalt-disilicide in diffusions process with three layers of metalization. The die and gates measures 16.8 millimeters (mm) x 13.9 mm and con- Buried contact Titanium nitride tains 1.68 million transistors. Power dissipation is Metal 1 0.75-pm AICu, 2.25-pm pitch (contacted) 30 watts (W) from a 3.3-volt (V) supply at 200 MHZ. Metal 2 0.75-~mAICu, 2.625-~m pitch (contacted) 0 IEEE. Reprinted, with permiasion, from thc /EEEJortr.i?alof Metal 3 2.0-pm AICu, 7.5-pm pitch SolidStute Circuils, volumc 27, number 11, pages 1555 to 1567, (contacted) November 1992.

Digital Tecbnicnl Jourrrnl Vol. 4 No. 4 Sl,ecinl lssi~e1992 3 5 Alpha AXP Architecture and Systems

too large in comparison to the benefits provided- I-CACHE principally more area-efficient large drivers such as clock and I/O. The metal structure is designed to support -F the high operating frequency of the chip. Metal 3 E-BOX - - F-BOX is very thick ancl has a relatively large pitch. It is important at these speeds to have a low-resis- - I-BOX tance metal layer available for power ant1 clock ttl t t l~ BIU distribution. It is also used for a small set of special - FRF signal wires such as the clata buses to the pins and the control wires for the two shifters. Metal 1 t A ancl metal 2 are maintained at close to their maxi- mum thickness by planarization and by filling metal 7 7

1 and metal 2 contacts with tungsten plugs. This A- A-BOX removes a potential weak spot in the electromi- gration characteristics of the process and allows more freedom in the design without compromising reliability.

Alpha AXP Architecture The computer architecture implemented is a 64-bit Figt~~-c?I CPU Cbip Block. Dingizlm load/store RISC architecture with 168 instructions, a11 32 bits wick.' Supportecl data types include E-box), the floating-point unit (FRF plus F-box), the 8-,16-, 32-, and 64-bit integers ancl both Digital and loacl/store unit (A-box), and the branch unit (dis- IEEE 32- and 64-bit floating-point formats. Each of tributecl). T'he bus interface unit (UIU), described in the two register files, integer and floating point, the next section, handles all communication contains 32 entries of (54 bits with one entry in each between the chip and external components. The being a hardwired zero. The program counter and microphotograph (Figure 2) shows the boundaries virtual address are 64 bits. Implementations can of the major functional units. The dual-issue rules subset the virtual atltlress size, but are required to are a direct consequence of the register file ports, check the fi1I1 64-bit adclress for sign extension. tl~efunctional units, ancl the I-cache interface. The This ensures that when later implementations integer register file (IM) has two read ports and one choose to support a larger virtual address, pro- write port dedicated to the integer unit, and two grams will still run and not find addresses that have read anrl one write port shared between the branch dirty bits in the previously "unused" bits. unit and the load/store unit. The floating-point reg- The architecture is designed to support high- ister file (FRF) has two read ports and one write speed multi-issue implementations. ?b this end the port dedicated to the floating unit, and one read architecture does not include condition codes, and one write port sharecl between the branch unit instructions with fixed source or destination regis- and the load/store unit. This leads to dual-issue ters, or byte writes of any kincl (byte operations are rules that are quite general: supported by extract ant1 merge instructions Any loacl/store in parallel with any operate within the CPU itself). Also there are no first-gener- ation artifacts that are optimized around tod;iy's An integer operate in parallel with a floating technology, which would represent a long-term lia- operate bility to the architecture. A floating operate and a floating branch

Chip Microarchitecture An integer operate and an integer branch The block diagram (Figure 1) shows the major func- tional blocks and their interconnecting buses, most except that integer store and floating operate and of which are 64 bits wicle. The chip iniplements floating store and integer operate are disallowecl as four functional units: the integer unit (IRF plus pairs.

36 W)1. 4 No. 4 S~ecidlssue1992 Digilal Tecbrrical Jozrrrzal A 200-MHz 64-bitDual-issue CMOS iMicro~~rocessor

CLOCK

Figure 2 ~klicrophotogrl,hof Chip

As shown in Figure 3a, the integer pipeline is subroutine return stack. The architecture provides 7 stages deep, where each stage is a 5-nanosecond a convention for compilers to predict branch deci- (ns) clock cycle. The first four stages are associated sions ancl destination adtlresses, inclutling those for with instruction fetching, dccotling, and score- register indirect jumps. The penalty for branch mis- board checking of operands. Pipeline stages 0 predict is four cj~cles. through 3 can be stalled. Beyond 3, however, all The floating-point unit is a fi~llypipelined 64-bit pipeline stages advance every cycle. Most arith- floating-point processor that supports both vioc metic and logic unit (ALU) operations complete in standard and IEEE stantlarcl data types and rouncling cycle 4, allowing single-cycle latency, with the modes. It can generate a 64-bit result every cycle shifter being the exception. Primary cache accesses for all operations except divide. As shown in Figi~re complete in cycle 6, so cache 1;ltency is three cycles. 3b, the floating-point pipeline is identical and The chip will do hits under misses to the primary mostly shared with the integer pipeline in stages 0 D-cache. through 3; however, the execution phase is three The I-stream is based on autonomous prefetch- cycles longer. All operations, 32- and 64-bit (except ing in cycles 0 and 1 with the final resolution of divide) have the same timing. Divide is handled by a I-cache hit not occurring until cycle 5. The nonpipelined, single bit per cycle, dedicated divide prefetcher includes a branch history table and a unit.

Digital Techtrical Jourt~al W1. 4 /\lo. 4 .S/)eciullssue 1992 37 Alpha AXPArchitecture and Systems

0 2 4 5 6 IF 10 A1 A2 W R

CACHE DECODE ALUl WRITE ACCESS 4 4 4 SWAP ISSUE ALU2 PREDICT RF READ

PCGEN ITB I-CACHE HITIMISS

VAGEN DTB D-CACHE HITIMISS

(0) Integer Unit Pipeline Tirning

0 1 2 9 4 5 6 7 8 9 IF SW 10 I1 F1 F2 F3 F4 F5 FWR '4 BY PASS I I CACHE DECODE ADD LID SHIFT ADDIRND FRF WRITE ACCESS 3X MULl MUL2 ADDJRND FRF WRITE SWAP ISSUE PREDICT RF READ

KEY: PC GEN GENERATE NEW PROGRAM COUNTER VALUE VA GEN GENERATE NEW VIRTUAL ADDRESS ITB INSTRUCTION TRANSLATION BUFFER DTB DATA TRANSLATION BUFFER

Figure 3 Pipeline Tinzi~zg

In cycle 4, the register file data is formatted to pages and 4 that map 4-megabyte (MB) pages. l'he fraction, exponent, antl sign. In the first-stage data tr;~nslationbuffer contains 32 entries that can :rclder; exponent difference is calculated and ;I 3 x map SKU, 64~~,512KB, or ~MBpages. ~n~lltiplicandis generated for multiplies. In aclcli- T11e CPLI supports performance measurement tion, a predictive leading 1 or 0 detector using with two counters that accumulate system events the input operands is initiated for use in resillt nor- on the chip such as dual-issue cycles and cache malization. In cycles 5 and 6, for add/subtract, misses or external events through two dedicatetl alignlnent or normalization shift and sticky-bit cal- pins that are s;lmpled at the selected system clock cul:ition are perfornietl. For both single- ant1 tlou- speetl. ble-precision mi~ltiplication,the multiply is done in ;I radix-8 pipelined array multiplier. In cycles 7 and External Interface 8, the final acldition and rounding are performed in The external interface (Figure 4) is designed to parallel and the final result is selected and driven directly support an off-chip backup cache that can back to the register file in cycle 9. With an ;tllowed range in size from 128KB to 8h4B and can be byp;~ssof the register write data, floating-point constructed from ordinary SRAMs. For most opera- latency is six cycles. tions, the CPU chip accesses the cache directly 'The <:I'IJ cont;rins all the hardware necessary to in ;I combinatorial loop by presenting an address support a tlemand paged virtual memory system. It antl waiting N CPlJ cycles for control, tag, ant1 tlat:~ inclutles two translation buffers to cache virtu;il-to- to appear, where N is a mode-programmablt:nibl num- physic;~laddress translation. The instruction trans- ber between 3 and 16 set at power-up time. For lation buffer contains 12 entries, 8 that map 8KR writes, both the total number of cycles and the adr-h<33:5> 4 RAM-cll sys-RAM-ctl -1 -;-1-I I I I SYSTEM DEPENDENT LOGIC 1 . I I

I RAM CPU MEMORY CHIP SYSTEM INTERFACE

osc<2> (400 MHz)

Fi'q~1r.e4 CPU Exterrzal I~zter;firce t1ur;ltion and position of the write signal are the 16-byte pin bus. In &-bit mode, it is four cycles. progranimable in units of CIJU cycles. This allows This yields a maxiniuni backup cache read bantl- the module designer to select the size and access width of 1.2 gigabytes per second ((;B/s) antl a write time of the SkL\ls to match the tlesirecl price/ bandwidth of 711MH/s. Internal cache lines can performance point. be invalidated at the rate of one line per cycle The interface is designed to allow all cache pol- using tlie dedicated invalidate address pins, icy clecisions to be controllecl by logic external to iAdr-11<12:5>. the c:l'ri chip. There are tliree control bits associ- In the event extern;~lintervention is requirecl, a i~teclwith e;~chbackup cache (B-c;~che)line: valid, request code is presented by the CDU chip to the shared, and dirty. The chip completes a B-cache external state machine in the time domain of the re;ld ;IS long as valid is true. A write is processetl by SYS-CLK as describecl pre\~iousl)! Figure 5 shows the <:l'lJ only if valid is true and sliarecl is false. the read miss timing where each cycle is a SYS-CLK When ;I write is performed, the dirty bit is set to cycle. The external transaction starts with the true. In ;ill other cases, the chip defers to an exter- address, the quadword within block and instruc- nal state machine to complete the transaction. This tion/data indication si~ppliedon the cW>Iask-11 stare m;~chineoperates synchronously with the pins, ant1 REAII-I3I.OCK function supplied on the S\'s_(;l.l

Digital Tecbricnl Jottrnnl Vof. 4 No. 4 Spccial fswr 1992 39 Alpha ,UP Architecture and Systems

sysCLKOut-h

adr-h

cWivlask-h X VALID X cReq-h 1

data h / ecc VALID VALID check-h

rdAck h cAck-h -

Tlie chip itllplements a novel set of fe;~turessup- One d ifIiculty inherent in this clocking methoel porting chip ant1 motlule test. When the chip is is the substant~i;tl lo;~tlon the clock node. 3.25 reset, the first action is to reael from ;I seriiil re;td- nanofaratl (nl;) in our design. This load anel the only memory (SKOM) into the I-cache via a private requirement for a fast clock eclge led us to take par- three-wire port. The <;Ptl is then enabled ancl the ticular care with clock routing and to do extensive program counter (PC) is forced to 0. 1'11~1swith only analysis on the resulting grid. Figure 6 shows thc three fi~nctionalcomponents (<:I'll chip, SlWkI, and distribution of clock 1o:cel among the major func- clock input), a system is able to begin executing tional units. The clock drives into a grid of vertical instructions. This initial set of instructions is r~secl metal 3 ;tnd horizont;tl metal 2. Most of the loading to write tlie bus control registers inside the <:I'U occurs in tllc integer ;lnd floating-point ini its that chip to set the cache timing and to test the chip and are fed from the more robust metal 3 lines. To nlotlule from the CPI out. After the SKoiLI loads tlie ensure the integrity of the clock grid across tlie I-c;tchc, the pins used for the sl

INT UNIT WRITE 1129 pF BUFFER

I-CACHE 373 pF TOTAL CLOCK LOAD = 3.2 nF D-CACHE 208 pF

803 pF

TOTAL BYPASS CAPACITANCE = 128 nF

Note: Total effective switching capacitance = 12.5 nF

Figure 6 Qock Load Distrib~ltiorz

Figure 7 CPU Clock Skew

The clock driver ;~ndpredriver represent about on-chip. This consists of thin oxide c;tpacit;~nce 40 percent of the total effective switching c;~paci- that is clistributed ;irountl the chip, primarily ulider tance cletermined by power mc:lsurement to be the data buses. In addition, there are horizont;tl 12.5 nF (wrorst case inclucling output pins). To metal 2 power and clock sliortj~igstraps adjacent manage the problem of di/dt on the chip power to the clock generator, and the thin oxide decoup- pins, explicit decoupling c;~pacit;tnceis provided ling cap under these lines supplies cb;~rge to Alpha AXP Architecture and Systems

the clock driver. di/dt for the driver alone is about fered versions of CLK. The resulting clock skew is 2 X 101' amperes per second. The total decoupling managed or eliminated with special latch structures. capacitance as extracted from the layout measures The latches used in the chip can be classifietl into 128 nE Thus the ratio of decoupling capacitance two categories: custom ant1 standard. The custom to switching cap is about 10: 1. With this capacitance latches were used to meet the unique needs of tlata- ratio, tlie decoupling cap could supply all the charge path structures ant1 the special constraints of criti- associated with a complete CI'U cycle with only a 10 cal paths. The standarcl latches miere used in the percent reduction in the on-chip supply voltage. design of noncritical control and in some data-path applications. These latches were designed prior to Latches the start of implementation ant1 were includetl in As previously described, the chip employs a single- the library of usable elements for logic synthesis. NI phase approach, with nearly all latches in the core synthesized logic used only this set of latches. of the chip receiving the clock node, CLK, directly. The standard latches are extensions of previously A representative example is illustrated in Figure 8. publisked work, and examples are shown in Fig- Notice that L1 and L2 are transparent latches ures 9 to 11.' To understand the operation of separated by random logic and are not simultane- these latches, refer to Figure 9a. When CLK is high, ously active; 1.1 is active when CLK is high and L2 Pl,Nl, and N3 function as an inverter cornplemcnt- is active when CLK is low. The minimum number of ing IN 1 to produce X. P2, N2, and N4 filnction as a delays between latches is zero and the maximum second inverter and complement X to produce number of delays is constrainecl only by the cycle OUT. Therefore, the structure passes IN1 to OUT. time and the details of any relevant critical patlis. W%en C1.K is low, N3 and lV4 are cut off. If INl, X, The bus interface unit, many tlatepath structures, and OU'T are initially high, low, and high respcc- and some critical paths deviate from this approach tively, a transition of IN1 falling pi~llsX high and use buffered versions and/or conditionally buf- through P1 causing P2 to cut off, which tristates OUT high. If IN1, X, and OUT are initially Low, high, and low respectively a transition of lnll rYsing CLK causes P1 to cut off, which tristates X high leaving out tristatetl low. In both cases, additional transi- tions of IN1 leave X tristated or driven high with OUT tristated to its initial value. Therefore, the LOGIC structure implements a latch that is transparent when CLK is high and opaque when CLK is low. Figure 9c sliows the dual circuit of tlie latch just dis- cussed; this structure implements a latch that is transparent when CLK is low and opaque when LOGIC CLK is high. Figures 9b and 9d depict latches with can output buffer used to protect the sonietiines dynamic node OUT and to drive large loatis. (h) Latching Scl?ema The design of the standard latches stressed three primary goals: flexibility, immunity to noise, and immunity to race-through. To achieve the desired flexibility, ;i variety of latches like those in Figures 9 to 11 it1 a variety of sizes were characterized for the implementors. Thus the designer could select a latch with an optional output buffer and an embecl- decl logic function that was sized appropriately to L1 OPAQUE L2 OPAQUE L2 TRANSPARENT L1 TRANSPARENT drive various loatls. Furthermore, it was clecided to allow zero delay between latches, complctely free- ing the designer from race-through considerations (6) Latch Timing when designing static logic with these latches. In the circuit methodology adopted for the imple- mentation, only one node, X (Figure 9a), poses

42 W)/.4 No. 4 .Y~L'c~cI/ISSLLL' 1992 Digitnl Technical Jour'rtrtl CLK CLK

IN1 OUT

CLK

CLK CLK IN, -+6G

OUT

(c) Noni~zvertingActiz~e-loz~~ Latcln (d) Inzm-ting Actiue-loLLI Latch

Figzcre 9 Basic Latches inordinate noise margin risk. As noted above, X the variety of available latches, and the zero clelay may be tristated high with OUT tristated low when goal between latches. The clock skew concern the latch is opaque. This maps into a dynamic node was actually the easiest to atlclress. If data propa- driving into a dynamic gate that is very sensitive to gates in a direction that opposes the propagation of noise that recluces the voltage onX, causing leakage the clock wave front, clock skew is functionally through P2, thereby destroying OtlT. This problem harmless ancl tends only to reduce the effective was addressecl by the addition of PS. This weak cycle time locally. minimizing this effect is of con- feedback device is sized to source enough current cern when designing the clock generator. If data to counter reasonable noise and hold P2 in cutoff. propagates in a direction similar to the propagation N5 plays an analogous role in Figure 9c. of the clock wave front, clock skew is a functional Race-through was the major f~~nctionalconcern concern. This was acltlressed by radially distrib- with the latch design. It is aggravatecl by clock skew, uting the clock from the center of the chip. Since

Digital TechnicalJotrrnal Vol. 4 iVo. 4 .S/>ecirrl lss~~e1992 4 3 Alpha &W Architecture and Systems

CLK CLK IN1 & OUT IN2 ,-ql xp5kq-1 IN1 X OUT ,I -

CLK I, N3 I, N4

N1 N2

CLn CLK ;;;-+ OUT

IN1 -1 IN1 P4 CLK T CLK - OUT OUT -

the clock wave front moves out radially from the 'l'hen m;iny simulations mrith varying CLK rise and clock driver toward the periphery of the die, it is fall times, temperatures, and power supply volt- not possil~lefor tlie data to overtake tlie clock if tlie ages were performed. The results sbowetl no clock network is properly designed. ;~ppreciableevidence of race-through for C1.K rise li) verify the remaining race-through concerns, ;I and fall times at or below 0.8 ns. With 1.0-ns rise mix-ancl-m;~tchapproach was taken. All reasonable ~undhll times, tlie latches showed signs of ti~ilure. combin;~tionsof I:~tclieswere cascatletl together To gi~;t~-;lnteefi~nctionality, CLK was specifiecl and and sirn~~latetl.The simulations were stressetl by tlt-signed to have an edge rate of less than 0.5 ns. eliminating all interconnect ant1 difilsion cap;~ci- This was not a serious constraint since other tancc ;inti by pushing each device into ;I comcr circilits in the chip required similar edge rates of of the process that emphasizetl race-through. the clock.

44 Vol. 4 A'o. 4 Sper'iot Issue 1992 Digilnl Tecl~~ricnlJorrrrrttl CLK

I N2

- IN1 -b X OUT --OUT - N3 CLK , -

N2

( 6) T~i~o-ii~p~~tArOK Aclir~e-high Lntch

CLK CLK

;;; OUT

IN1

CLK

OUT

(i.) Tzilo-ii.zp~~.tOR Actiue-LOZLJLatch (d) T~t~o-ilzp~itNOR Acti~~e-~OLULCI~CIJ

A litst design issue worth noting is the feedback tl~ro~~gllPI can drive X down by more th;m ;I tlevices, PI5 and P5, in Figures 10c, 10d, 112, ancl threshold even with weak feedback, thereby Ilb. Notice thar these devices l~vetheir gates tied destroying olrl'. To counter this phenomenon, P5 to C1.K instead of OrJT like the other latches. This c:lnnot be :I weak feedback device arid thcreforc difference is required to account for an effect not cannot he tietl to OUT if the latch is to h~nction present in the other latches. In these latches, a properly when <:LK is high. Note that taller stacks stack of tlevices is connected to node X, without ;iggr;tvate this problem because the devices passing through the clocked transistors P.3 or N3. become larger and there are Inore tlevices to partic- Referring to Figitre Ila, assume CLK is low, X is ipi~tein coupling. For this reason, st;~cl

Digirtrl Tecbrrictrl Jorrrirnl WlI. ,-I ,Vo. 4 Special lssrre 1992 4 5 Alpha AXP Architecture and Systems

since X will uncontlitionally go high with C1.K The longword carry select is built as a distributed falling, iuntl OIIT must be able to retain a stored ONE. cascode structure used to conibine tlie byte gen- So there is :I two-sidetl constraint: P5 must be large erate, kill, and propagate signals across the lower enough to counter coupling and small enough not 52-bit longworcl. It controls tlie final data selection to c;~userace-tlirougli. These trade-offs were ana- into the upper longworcl output latch ant1 is out of lyzed by simulation in a manner similar to the one tlie critical path. oi~tlined;ibove. The NMOS byte carry select switches are con- trolletl by a cascade of closest neighbor byte c;~rr)l 64-bitAd~Zer o~~ts.Data in the most significant byte of the upper longworcl is switchecl first by the carry-out dat;~of A tlifficult circuit problem mras the 64-bit adtler por- the next lower byte, byte 6, then by byte 5, ancl tion of the integer and floating-point ALUs. Unlike a finally byte 4. The switches direct the sum dxta previous Iiigh-speed design, we set a goal to from either the carry-in channel or the no-carry ;~cliievesingle-cycle latency in this unit.3 Figure 12 channel (Figure 13b). Sign extension is accom- h:is ;In organiz;~tionaldiagram of its structure. Every plished by clisabling the upper longwor-tl switcli p:~ththro~~gh the adder includes two latclies, allow- controls on longword operations and forcing the ing fully pipelined operation. The result latches are sign of tlie result into both data channels. shown explicitly in the diawani; honlevec the input latches ;ire somewhat implicit, taking advantage of the predischarge characteristics of tlie carry chains. 'flit. con~l>leteadder is a combination of three meth- To provide maximum flexibility in applic;itions, the ods for proclucing a binary adcl: a byte long carry external interface allows for several different ch:iin, 21 longwortl (32-bit) carry select, ant1 local modes of operation all using common on-chip cir- logal-ithmic carry select.' The carry select is built as cuitry. This includes choice of logic fri~nily(CMOS/ :I set of n-channel metal-oxide sernicontluctor transistor-transistor logic [TTL] or emitter-couplet1 (NMOS) switches that direct the data from byte logic [B<:L]) as well as bus width (64/128 bits), exter- c:irry chains. The 32-bit lo~igwortllool

40 W1. -i ,\i~,f .S/)ot irtl 1ss11c. 1992 Digital ~~chrricrrlJorrrnnI A 200-MHz 64-bitDuc~l-isstie CMOS Microp~ocessor

Digilal Techriicnl Jozrriinl k1. 4 No. 4 Specic11ISSLI~ 1992 47 Alpha AXP Architecture and Systems

GENERATE DEVICE VDD VDD VDD VDD VDD VDD VDD

-I

PREDISCHARGE -' PROPAGATE AND KILL DEVICE DEVICE

(a) Adder Carrj~Chain

SUM-OUT-ASSUMING-CARRY

I u1CO GETS CO

SUM-IN-ASSUMING-NO-CARRY

(6) Adder Cufry-selectSuu'tcl7e.s

Figure I.? Adclel- C~I.[)I

transistor to no more than 4 V: QG is a Ph4OS pull-up device that shares a common n-well with Q7 through Q10, which have responsibility for supply- ing the well with a positive bias voltage of either C;,, or the I/O pin potential, whichever is higher. Q3 through Q5 control the source of voltage for the gate of Q6-either the output of the inverter or the I/O pad if it moves above y),,. K1 and R2 provide 50-ohm series termination in either operating mode.

CLZC~~S 'I'he two internal caches are almost identical in con- TO PAD struction. Each stores up to 8KB of data (D-cache) or instri~ction(I-cache) with a cache block size of 32 bjTes. T'lie caches are direct mapped to realize Figure 14 Floating-zuell Driver a sjngle cycle access, ancl can be accessed using

48 kt)/. 4 /Vo. 4 Sfircictl I.c.sus 1392 Digitnl Tecbiricnl Jorrrrzal A 200-MHz 64-bit Dual-issue CMOS Microprocessor untranslated bits of the virtual address since the tor cell. A segmented word line is used to accom- page size is also 8KB. For a read, the address stored modate the banked design, with a global word line in the tag and a 64-bit quadword of data are implemented in third-level metal and a local word accessed from the caches and sent to either the line implemented in first-metal layer. The global for the D-cache or the word line feeds into local decoders that decode the instruction unit for the I-cache. A write-through lower two bits of the address to generate the local protocol is used for the D-cache. word lines. As shown in Figure 15, the word lines The D-cache incorporates a pending fill latch are enabled while the clock is high, and the sense that accumulates fill clata for a cache block while amplifiers are fired on the falling edge of the clock. the D-cache services other load/store requests. Once the pending fill latch is full, a11 entire cache Summary block can be written into the cache on the next A single chip microprocessor that implements a available cycle. The I-cache has a similar facility new 64-bit high-performance architecture has been called the stream buffer. On an [-cache miss, the described. By using a highly optimized design style I-box fetches the required cache block from mern- in conjunction with a high-performance 0.75-pm ory and loads it into the I-cache. In addition, the technology, operating speeds up to 200 MHz have I-box will prefetch the next cache block and place it been achieved in the stream buffer. The data is held in the stream The chip is superscalar degree 2 and has 7- and buffer ancl is written into the I-cache only if the data 10-stage pipelines for integer and floating-point is requested by the I-box. instructions. The chip includes primary instruction Each cache is organized into four banks to reduce and data caches, each 8KR in size. In each 5-ns power consumption and current transients during cycle, the chip can issue two instructions to two of precharge. Each array is approximately 1,024 cells four units, yielding a peak execution rate of 400 wide by 66 cells tall with the top two rows used mips and 200 MFLOPS. as redundant elements. A six-transistor, 98-pm* The chip is designed with a flexible external static RAM cell is used. The cell utilizes a local inter- interface providing integral support for a sec- connect layer that connects between polysilicon ondary cache constructed of ordinary SRAMs. The and active area, resulting in a 20-percent reduction interface is fi~llycompatible with virtually any in cell area comparetl to a conventional six-transis- multiprocessor write cache coherence scheme,

PIPELINE STAGE 1314151617181 CLK

DISPLACEMENT ADD

CACHE WORD-LINE

SENSE AMP ENABLE

CACHE DATA1 TAGS OUT

REGISTER FILE WRITE PORT

ALU BYPASS IN

Figure 15 D-cache Timing Diagram

Digital Techrrical Journal Yr)f. 4 No. 4 Spedallssrre 1992 49 Alpha AXP Architecture and Systems

and can accommodate a wide range of timing 3. R. Conrad et al., "A 50 MIPS (peak) 32/64b parameters. It can interface directly to standard l'TL Microprocessor," ISSCC Digest of Technical and CMOS as well as lOOK ECL technology. Papers (February 1989): 76-77

References 4. J. Sklansk, "Conditional-Sum Addition Logic," IRE Transactions on Electronic Computing, 1. Alpha Architecture Handbook (Maynard: Digital vol. EC-9 (1960): 226-231. Equipment Corporation, Order No. EC-~1689-10, 1992). 5. H. Lee et al., "AII Experimental IMb ChlOS SRAW 2. J. Yuan and C. Svensson, "High-Speed CMOS with Configurable Organization and Operation," Circuit Techniques," IEEEJotrrnal of Solid-State ISSCC Digest of Technical Papers (February Circuits, vol. 24, no. 1 (February 1989). 1988): 180-181.

50 Vol. 4 No 4 S/~acicrlfss~~e1992 Digital Tecbrricu~Jor~rira~ Charles I? Thacker David G. Conroy Lawrence C. Stewart

The Alpha Demonstration Unit: A High-pmformance Multiprocessor for Software and Chip Development

Digital's firs1 RISC systenz built usir?g tlge 64-bit Alpha iVTP architecture is the prototype knoz~lnas the A@ha dernorzstlwtion unit or ADU It consists of n Dackplalze corztairzi~zg14 slots, each of z~lhichcon hold a CPU modctle, a 641118storqe nzod~rle, or a ?nodz~lecontainirzg lziu 501MB/sI/O cbcrr~~zels.A new cache cobererzce protocol prouides eachp~.ocessorand I/O chatltzel with a consistent uieui of shared merlzorj! T/gir&y-five ADU systenzs ulere built zvitbin Digital to nccelerwte softtonre develop- 11zent and early chi11 Iesti~zg.

There is nothing more dif'ficiilt to tnkc in hantl, rapidly as possible. These systems dicl not require more perilous to contluct, or more uncertain in its the high levels of reliability and availability typical success, than to take the lead in the introtluction OF of Digital products, nor did they need to have low :I new order of things. -Niccolo ~Machiavelli,Ibe Prirrce cost, since only a few wo~~ltlbe built. They clicl need to be ready at the same time as the first chips, and Introducing a new, 64-bit computer architecture they had to be sufficiently robust that their pres- posed ;I number of challenges for Digital. In ence would accelerate the over;lll program. addition to developing the architecture and the Digital's Systems Research Center (SRC) in Palo first integratetl implementations, an enormous Alto, CA had had experience in builtling similar pro- amo~lntof software had to be moved fronl the VAX totype systems. SRC: had designed nntl built much of and MIPS (MIt'S Computer Systems, Inc.) architec- its computing e~luipment.~Being located in Silicon tures to the Alpha AXP architecture. Some software Valley, SRC coultl employ the services of a number was originally written in higher-level languages and of local medium-volume fabrication ant1 assembly coultl be recompiled with a few changes. Some companies without impetling the mainstream could be converted using binary translation too1s.l Digital engineering and manufacturing groups, A.11 software, however, was subject to testing and which were develoj>ing AXI' protluct systems. debugging. The project team was cleliber;~telykept small. It became clear jn the early stages of the program Two designers were locatetl at SRC. one was with the that builtling an Alpha demonstration unit (~[)li) Semiconductor Engineering Group's Aclvancetl would be of great benefit to software developers. Development Crroul> in Hudson, a\,and one Having a functioning hardware system would moti- was a member of Digital's Cambridge Research vate software developers and reduce the overall Laboratory in Cambridge, M.A. Although the project time to market considerably. Software clevelop- team was scparatecl both geographically ant1 organ- ment, even in the most disciplined organizations, zationall~ communication flowetl smoothly proceetls much more rapidly when real 11artlw;lre is because the individu;ils had collaborated on similar available for programmers. In addition, hardware projects it1 the past. The team used a commonset of engineers could exercise early implementations of design tools, and Digital's global network made it the processor on the ADU, since a part as complex possible to exchange design information between as the DE(:cliip 21064 CPU is difficult to test i~sing sites easily. As the project moved from the design conventional integrated circuit testers. phase to production of the systems, the group For these reasons, a project was started in early gren7,but at no point did the entire team exceed ten 1989 to build a number of prototype systems as people.

Digital Technical Jourd Vo1. 4 /\lo. 4 .Sl,ecu~lISSII~ 1992 51 Alpha A;r(p Architecture and Systems

Since multiprocessing capability is central to the Alpha AXP architecture, we decided that the ADU had to be a multiprocessor. We chose to implement a bus-based memory coherence protocol. A high- speed bus connects three types of modules: The CPU module contains one rnicroprocessor chip, its external cache, ancl an interface to the bus. A stor- age module contains two 32-megabyte (Me) inter- leaved banks of dynamic random-access memory (DRAM). The I/O module contains two 50MB per second (MB/s) 1/0 channels that are connected to one or two DECstation 5000 workstations, which provide disk and network I/O as well as a high- performance debugging environment. Most of the c logic, with the exception of the CPU chip, is emit- (a) CPUrModtlle ter-coupled logic (ECL), which we selected for its high speed and preclictable electrical characteris- tics. Modules plug into a 14-slot card cage. The card cage ancl power supplies are housed in a 0.5-meter r -. (m) by 1.1-m cabinet. A fully loaded cabinet tlissi- ;,T'.L~-a-..O. ,: -,-t-\r:r - pates approximately 4,000 watts and is cooled by ,, -,-LI.-IL. ;-*mm: : $ -!..;a1* - =:=?=,=!= ,;- -_-,.S:)B I...... ,-- .: forced air. Figures 1 and 2 are photographs of the -. .- . .. , - i @ .: - -I-- . system and the modules. :* ,;,t,:-.-.--.d=~ ,,c. -0001 - -=,b--2-*E;:b:h*h,- , In the remaining sections of this paper, we dis- : --*.-#-:-:-.-: -,-{- - .ao.o: ; =::!=:=!= cuss the backplane interconnect and cache coher- ence protocol used in the mu.We then describe the system modules and discuss the design choices. We also present some of the uses we have found for the ADU in addition to its original purpose as a soft- ware development vehicle We conclude with an assessment of the project and its impact on the overall Alpha LKP program. Storage lModule

Figure I The A//~haDernonstrzttio~z Unit

52 W>l. 4 No. 4 Specicll lssrce 1,992 Digital TechriictrlJournal The Alpha De~nonstrationUnit

Backplane Interconnect Table 1 Bus Signals The choice of a backplane interconnect has more Signal Name Pins Use impact on the overall design of a multiprocessor than any other decision. Complexit): cost, and per- -Data[63..00] 64 Data -ECCO[6..O] 7 ECC on Data131 ..00] formance are the factors that must be balanced to -ECCl[G..O] 7 ECC on Data163.-321 procluce a design that is adequate for the intended -pi01 1 Even Parity over use. Given the overall pilrpose of the project, we Data[31..00], ECCO[G..O] chose to minimize complexity and maximize per- -PI11 1 EvenParityover formance. System cost is important in a high-vol- Data[63..32], ECC1[6..0] ume product, but is not important when only a few B-shared Cache coherence systems are produced. B-dirty Cache coherence To minimize complexity, we chose a pipelined Retry 1 Storage module busy bus design in which all operations take place at Error 1 Data or address parity error fiietl times relative to the time at which a request is ArbRequest 8 Arbitration for the bus issued. To maximize performance, we defined the Clock 2 100 MHz differential clock operations so that two independent transactions Phase 1 50 MHz Reset 1 can be in progress at once, which h~llyutilizes the nTy peCl k 1 Module identification ~TYpe 1 Module identification bus. nld 4 Module slot number (0..13) We designed the bus to provide high banclwidth, set by backplane wiring which is suitable for a multiprocessor system, and to offer minimal latency. As the CPU cycle time becomes very small, 5 nanoseco~ltls(ns) for the of simplifying the design and improving perfor- I)E<:chip 21064 chip, the main memory latency mance at the expense of increased cost. The parity becomes an important component of system per- bits are provided to detect bus errors during formance. The ADU bus can supply 320N113/s of user address and data transfers. All modules generate data, but still is able to satisfy a cache read miss in and check bus parity. just 200 ns. The module identification signals are used only during system initialization. Each module type is assigned an 8-bit type code, and each backplane slot Bus Signals is wired to provide the slot number to the module it The i\l)lJ backplane bus uses E(:L 100K voltage lev- contains. Each module in the system reports its els. Fifty-ohm controlled-impetlance traces, termi- type code serially on the nType line during the 8 X nated at both ends, provide a well-characterized slot number nl)lpeClk cycles after the deassertion electrical environment, free from the reflections of system reset. A configuration process running and noise often present in high-speed systems. on the console processor toggles nTypeClk cycles Table 1 lists the signals that make LIPthe bus. The and observes the nType line to determine the type data portion consists of 64 data signals, 14 error of module in each backplane slot. correction cotle (KC) signals, ant1 2 parity bits. The The 100-megahertz (MHz) system clock is dis- ECC signals are stored in the memory modules, but tributed radially to each module from a clock gen- no checking or correction is clone by the memories. erator on the backplane. Constant-length wiring Instead. the ECC bits are generated and checked and a strictly specified fan-out path on each mod- only by the ultimate producers ant1 consumers of ule controls clock skew. Since a bus cycle takes two data, the I/O system and the CPU chip. Secondary clocks, the phase signal is used to identify the first caches, the bus, and main memory treat the ECC as clock period. uninterpreted data. This arrangement increases performance, since the memories do not have to check data before delivering it. The memory mod- Addressing ules would have been less expensive had we used The bus supports a physical address space of 64 an ECC code that protected a larger block. Since the gigabytes (23%~~).The resolution of a bus address crrr caches are large enough to require ECC and is a 32-byte cache block, which is the only unit since the CPLI requires ECC over 32-bit words, we of transfer supported; consequently, 31 address chose to combine the two correction mechanisms bits suffice. One-quarter of the address space is into one. This decision was consistent with our goal reservetl for control registers rather than storage.

Digilal Tecbnical Jottrnal k1. 4 /Vo. 4 .S/)rri~~llsscie 1992 Alpha AXP Architecture and Systems

Accesses to this region are treatetl specially: (;l'us monitors all bus operations and must request only clo not store data from this region in their caches, those cycles that it knows the target can accept. ant1 the target need not supply correct E<:<: bits. Initiators are ;~llowetlto arbitrate for a particular 'I"11e methocl used to select the target ~notluleof target nine or more cycles after that target has ;I bus operatjon is geographic. The initi;itor sencls started a read, or ten or more cycles after the target the target module's slot number with the :iddress has started a write. To arbitrate, an initiator asserts tluring a request cycle. In addition to the 4-bit slot the ArbReqilest line correspontling to its current number. the initiator supplies a 3-bit subizo~leidem priority. Priorities range from 0 (lowest) to 7 (high- tiJier with the address. Subnodes are the itnit of est). If a module is the highest priority requester memory interleaving. The 64~~storage moclule, (i.e.. no higher priority ArbRequest line than its for ex:~rnple,contains two intlepentlent 32MB sub- own is asserted). that motlule wins the arbitration, nodes that can operate concurrently and it transmits an ;iclclress ant1 a comma~lclin the 'The geographic selectio~iof the target means that next cycle. The winning module sets its priority to a particular subnode only needs to compare the zero, ant1 all initiators with priority less than the ini- requested slot and subnode bits with its own slot tial priority of the winner increment their priority and subnode numbers to decicle whether it is tlie regclrdless of u~hetl~erthcy made a request dztrirzg tiirget. This reduces the time reqiliretl li)r the deci- the ctr6itrclTion cycle. Initially, each initiator's prior- sion compared to a scheme in which the target ity is set to its slot number. Priorities are thus inspects the atldress fielcl, but it means that e;lch ini- distinct initially 21ncl remain so over time. This algo- tiator rnilst niaintain a mapping between physical rithm favors initiators that have not macle a recent addresses and slot and subnotle numbers. This map- request, since the priority of such an initiator ping is performed by a Iw;M in each initiator. For increases even if it tloes not make requests. If all ini- CPII modules, tlie RAM lookup tloes not reduce per- tiators make continuous requests, the algorithm formance, since the access is clone in parallel with provides rountl-robin servicing, but the implemen- the access of the module's secontlary cache. The tation is simpler than round robin. slot-mapping W\ls in each initiator are lontled at An arbitr;~tioncycle is followeel by a request system initialization time by the configuration pro- cycle. The initiator pl;ices ;in address, node and cess described previously. subllode numbers, ;ind a command on the bus. There are only three commands. A read command Bus Operation requests a 32-byte cache block from memory. The The timing of adtlresses and data is shown in Figure 3. target memory or a cache that contains a more All dat:~transfers take place at fixetl times relative recent copy supl'ljes the clata after a five-cycle to the start of an operation. Eight of tlie backplane delay A write comnialld transmits a 32-byte block slots can contain modules capable of initiating to memory, using the same cycles for the clata trans- requests. These slots are numbered from 0 to 7, but fer as the read cornm;it~d.Other caches may also are located at the center of the backplane to retluce take the block ant1 itpdate their contents. A victim tlie transit time between initiators and targets. writc is issued by a <;IYJ module when a block is A bus cycle starts when one of the initiators arbi- evicted from the seconclary cache. When such an trates lor the bus. The arbitration metliotl gu;iran- eviction occurs, nny other c:lches that contain the tces t11;lt no initiator can be st;~rvetl.Each initiator block are guaranteed to cont;~inthe same value, so

CYCLE 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

ARB REQUEST R1 DATA A1 B-SHARED, B-DIRTY S2 S4 ERROR El E3

This figure shows the contenls of the bus during four read cycles. If requests are made at full rate, the bus is fully occupied wilh addresses and data. Bshared and 0-dirty are sen1 in Ihe fifth cycle after the arbilration request. If any module detects a parily error during an address cycle, it asserls error two cycles later.

5 4 Vol 4 No. 1 S/)et iitl Issr~c,19%' Digilnl Techrricrtl JortrwrrI The AlJ!bcrDelnonstration Unit they need not participate in the transfer at all. The the blocks in other caches whenever the associated block is stored in memory, as in a nor~iialwrite. CPU performs a write. If no other cache shares a block, this write through is unnecessary ant1 is Cache Coherence not done. Wlien a secontlary caclie receives an update (i.e., it observes a write on the bus directed In ;I multiprocessor system with caches, it is essen- tial that writes clone by one processor be made to a block it contains), it has two options. It can invalidate tlie block and report to the writer that available to tlie other processors in the system in it has done so. If it is the only cache sharing the ;I timely fashion. A number of approaches to the cache coherence problem have appeared in the Jit- block, subsequent write-through operations will erature. These approaches fall into two categories, not occur. Alternatively, it cat1 accept the uptlate depending on the way in which they liancl Ie proces- and report that it did so, in which case the cache sor writes. Azt~nlidntiorzor oti!lzership protocols that performed the write-through operation con- tinues to send upclates whenever its CPU writes the require th:lt ;I processor's cache must acquire an exclusive copy of the block before the write can be block. clone.? If another cache contains a copy of the The actions taken by a cache that receives an block, that copy is invalidated. On the other hand, update are determined by whether the block is in ~rpd~zteprotocols maintain coherence by perform- the CPU's on-chip cache. The seco~idarycache con- ing write-through operations to other caches that tains a table that allows it to determine this without sh;~rethe block.2 Each caclie maintains enough interfering with the CI'IJ. If tlie block is in the on- state to determine whether any other cache shares chip caclie, tlie secondary c;~cheaccepts the the block. If the data is not present in another update and invalidates the block in the on-chip cache, then write through is unnecessary and is cache. If the block is not in the on-chip caclie, tlie not done. secondary cache block is invalidated. If tlie block is The two protocols have quite clifferent perfor- being actively shared, it will be reloaclecl by the CPIJ mances, depending on system activity. An update before the next upd;~tearrives, and the block will pl-orocol performs better than an invaliclation pro- continue to be sharetl. If not, the block will be inval- tocol in an application in which data is sharetl (and idated when the second update arrives. written) by multiple processors (e.g., a parallel algorithm executing on several processors). In an Implementation of the P~otocol inv;~lid;~tionprotocol, each time a processor writes The implementation of the coherence protocol is a location, the block is inrralidateci in all other not complex. The five possible states of a secondary caches that share it. All caches require an expensive cache block are shown in Figure 4.Initially, all miss to retrieve the bloclc when it is next refel= blocks in the cache are marked invalid. Misses in enccd. On the other hand, an update protocol per- the CPU's on-chip cache cause a bus read to be forms lx)orly in a system in which processes can issued if the block is not in the secondary cache. If migrate between processors. With migration, data the cache block is assigned to another memory loca- iippears in both caches, and each time a processor tion and is dirty (i.e., has been written since it was writes a location, a write-through operation read from memory), a victim write is issued to evict uptlates the other cache, even though its CI'U is no the block, then a read is issuetl. Other caches moni- longer interested in the block. Larger c;~cheswith tor operations on tlie bus and assert the block- long block lifetimes exacerbate this problem. shared (B-shared) signal if they contain the block. If a cache contains a dirty block ancl it observes Coherence Protocol a bus reacl, it asserts B-shared and B-dirty, and The coherence protocol usetl in the ADIJ is :I hybrid supplies the data. B-dirty inhibits the memory's of an update ancl an invalidation protocol, ant1 like delive~yof data. many hybritls, it combines the good features of The CPu's on-chip caclie uses a write-through both parents. The protocol depends on the fact that strategy. A CPU write to a shared block in the sec- the CPrJ chips contain an on-chip cache backed by ondary caclie initiates a bus write to update the a much larger secondary caclie that monitors all contents of other caches that share tlie block. bus operations. Initially, the seco~iclarycaches use Memory is written, so the block beconies clean. If ;in update protocol. Caches that contain shared another cache takes the uptlate, it asserts B-shared, d;rt;~perform a write-through operation to upclate and tlie initiator's state beconles Shared not (-)

Digital Ticbrrical Jorrrnnl Vol. -I No. 4 .S/)ec'irrlIssue 1992 5 5 Alpha AXP Architecture and Systems

C-READ C-READ C-WRITE

C-WRITE -SHARED * -SHARED -DIRTY 4 C-WRITE (-6-SHARED) DIRTY

C-WRITE (-6-SHARED) B-WRITE, INC; 6-READ 6-READ INVALID

SHARED + -DIRTY

C-WRITE (6-SHARED) C-READ C-READ 6-READ B-WRITE, INC B-READ

Transitions occur as a result of CPU reads and writes (C-read. C-write) and bus operations initiated by other caches or I10 controllers (6-read, 6-write). A C-read or C-write to an invalid block causes a 6-read; a C-write to a shared block causes a 6-write. The 6-shared response ind~catesthat some other cache contains the block. INC indicates that the block is in the CPU's on-chip cache.

Figrrre 4 Secondary Cache Line States

Dirty. If no other cache takes the update, either By running this program with sequences of sitnu- because it does not contain the block or because it lated memory requests, we were able to refine the decicles to invalidate it, then the B-shared signal is design rapidly and nleasure the performance of the not asserted, and the initiator's state becomes system before clesigning any logic. Most design -Shared -Dirty. The B-shared and B-dirty signals errors were discovered at this time, and prototype may be asserted by several moclules during cycle system debugging took much less time than usual. five of bus operations. The responses are ORed by the open-emitter ECL backplane drivers. More than System Modules one cache can contain a block with Shared = true, In this section, we describe the system modules but only one cache at a time can contain a block ant1 the packaging of the ADU. We discuss the wit11 Dirty = true. design choices made to produce the c:IjU module, Designing the bus interconnect and coherence storage modules, and I/O niodule on scheclule. We protocol was an experiment in specification. The also cliscuss applications of the ADU beyond its informal description required approximately 15 intended use as ;I vehicle for software development. pages of prose to describe the bus. The real specifi- cation was a multithreaded program that repre- CPU Module sented the various interfaces at a level of detail The ADU CPU module collsists of a single CPU chip, sufficient to describe every signal, but, when exe- a 256-kilobyte (KR) secondary cache, and an inter- cuted, simulatecl the components at a Iiigher level. face to the system bus. All CPU modules in the

56 Vol. 4 ,\'o. 4 .S/):[,ci'iolISSILP 1992 Digil~lTecbrricnl Jorrrnnl The Alphrr De~norzstmtiorrUnit

system are identical. The CPll motlules are not self- invalidation filter and selects between update and sufficient; they must be initialized by the console invalidation strategies. workstation before the Cl'u can he enabled. Tlie system bus interface w;~tcliesfor reads and The CPti module contains exterisive test access writes on the bus, and looks up e;icli ;idtlress in the logic that allows other bus agents to read and write secontlary cache. On read hits, it ;isserts B-shared most of the module's internal state. We irnple- on the bus, ;ltid, if the block is dirty in the sec- rnented this logic because we knew these motlules ond;u-y c;iche, it asserts B-dirty ;und supplies read would be used to debug <:Pll chips. Test access logic data to the bus. On write hits, it selects between the would help us determine the cause of a CPU chip invalidate and update strategies, modifies the con- malfunction and would make it possible for us to trol fieltl in the secondary cache tag store appropri- introduce errors into tlie secondary cache to test ately, and, if tlie update strategy is selected, it the error detection and correction capabilities of ;iccepts tlata from the system bus. the C;I'LJ chip. This logic was used to perform almost Unlike most bus devices, the <;Plr module's all initi;~lizationof the Cl'u nioclule ;~ndW;IS also system bus interface must accept a new ;td

BYPASS

DATA STORE REGISTERS

BUS DATA SYSTEM BUS WRITE CPU DATA LATCH

- BUS DATA , BUSTAG ADDRESS ADDRESS - I

Digital Technical Journrrrl Vd. 4 No. 4 Spec-id Lss~fe199.2 57 Alpha AXP Architecture and Systems

fi;~tllt-eof the sjatern bus liiakes reading and writing The bidirectional tlata b~~sof the CPU chip is con- the data store somewhat trichy. Figure 61shows verted into tlie unidirectional data buses used b). ;I write hit that has selected the update strategy the rest of the <:I'I! module by transparent cutoff immedi;~telyfollowed by a read hit that must supply latches. 'These latches, wliich are located in a ring d~tato the bus. High performance mandates surrounding the <:Pli, also convert the quasi-ECL lell- the use of clocked transceivers, which means the els generated by the Cl'U chip into true E<:L levels second;~rj~cache data store must reacl one cycle for the rest of the <:P'IJ module. These latches are ahcad of the bus ancl must write one cycle behind norn~allyIield open, so tlie CI3[! chip is, in effect, thc bus, resulting in a conflict in bus cycle 11. connected directly to the secondary cache tag and However, the bus transfers data in a fixed order, data RAMS. Control signals from the CPU cl~ip'sinte- so thc reat1 will always access quadword 0 of grated seconcl;rry c;lche controller are simply OKed the block, ant1 the write will always :~ccessquacl- into the appropri;~tesecontlary cache ft~\,1drivers. word 3 of the block. By implementing the data These latches are ;tlso ~lseclto pass data across store ;IS two 64-bit-wide banks, it is possible to lian- the two-clock-clom:~i~ibounclary Normally all tlle these back-to-back transactions without creat- latches are open. On reatls, logic in tlie CPU chip ing :uliy special cases, as shown in Figilre 6b. This clock clomain closes all the latches and sends a read example is typical of the style of design used in the reqi~estinto the bus clock domain. Logic in the bus t\1)11,wliich eliminates extra mechanisms wherever clock domain obtains the cli~ta,writes both the sec- possible. ontlary c;lcIie and the reacl latches, and sends an ?'hc <:1'lJ interface handles the arbitration for the acknowledgment back into the CPlJ chip clock seconclary cache and generates the necessary reads domain. Logic in the CPtJ chip clock tlotnain ant1 writes on the system bus when the CPI' sec- accepts the first half-block of the data, opens the ontl:~ryciiche misses. first read I;ltch, accepts the second half-line of the Thc (:PIJ chip is supplied with a clock that is not data, ancl opens all remaining latches. Writes are rel;~tetlto the system clock in frecl~lencyor phase. similar: Logic in the <:l'Ll chip clock dolnaiti writes This factor made it easier to use both the 100-MHz the first half-line into the write latch, makes the frequency of the DC227 prototype chip ant1 tlie second half-line valid (behintl the latch), and sentls 200-MHz frequency of tlie DECcliip 21064 CPIJ. It a write request into the bus clock tlomain. Logic in ;~lso;~llowed us to vary the operating freqilency the bus clock clomain accepts the first half-line of tluring (:PU chip debugging. However, the clata data, opens tlie write I;~tcli,accepts the secontl 1i;rlf- lxlses connecting the CPIJ chip to the rest of the block of clata, and scncls an acknowledgment back (:I'IJ motlule must cross a clock-domain boundary. into the <:Hichip clock tlo~iiain. Perhaps more significant, the secondary cache tag Logic in tlie <:IW chip clock tlomain controls all and data stores have two asynchronous sources of latches. Only two signals pass through synchroniz- control, since the <:PU chip contains an integrated ers: a single request signiil passes from the CPU chip secoticlary cache controllel: clock domain to the bus clock clom;iin, ant1 a single

CYCLE- ~ toihe secondary cache RAMS Caused by back-to-back cycles. In the marked WRITECYCLE WO W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 cycle, thecache writes the bus data RO R1 R2 R3 R4 R5 R6 R7 R8 R9 R1O READ CYCLE thal arrived in cycle W10. but it also CACHE W7 W8 W9 R8 R9 R10 {$O needs to read data lo supply it during f cycle R7.

CYCLE 14 15 Figure 6b shows how this conflict can be resolved by treating the cache as two independent banks (even and odd) WRITE CYCLE READ CYCLE R9 CACHE EVEN R10 CACHE ODD R10

Figzlr-r 6 CPU Timing acknowledge signal passes from the bus clock registers are processetl by the system bus inter- domain to the CPU cliip clocl

Digilrrl TechrricnlJorrrirrrl WJI 4 i\b. 4 Specinl lssr!c. I992 59 Alpha AXP Architecture and Systems

1-megabit (Mb) I&\,i chips, the capacity of each A memory cycle consists of a five-bus-cycle niotlule is 64MB. Figure 7 sllows the organization access period followed by four bus cycles of data of the storage modille. The module consists of transfer. Each data transfer cycle moves two 39-bit two independent subnodes, each with four banks longwords between the module and the backplane of storage. Control signals are pipelinetl through bus, for a total of 32 data bytes per memory cycle. the banks so that the module can deliver or accept This is the size of a CPU module cache block. A read a 64-bit data word (plus EC:<:) every 20 ns. With operation takes 10 bus cycles to complete, but a the exception of the DIb\IM interface sign~ls,all write requires 11 cycles. signals are ECL levels. The GO14 gallium arse- Since a data rate of 1 word every 20 ns is beyond nitle (<;aAs) driver chip improves performance the capabilities of even the fastest nibble-mode by allowing parallel termination of the DRAM RAMS, we needed an approach that did not require adclress lines. each to provide more than 1 bit per access.

...... BANK 3 1 I I 1 M BY 39-BIT 1M BY 39-BIT , n STORAGECARD STORAGECARD n

+------t--t--j------BANK-. .. . 2 - I 1 M BY 39-BIT 1 M BY 39-BIT A I STORAGE CARD

1 M BY 39-BIT STORAGE CARD

STORAGE CARD t A

1 M BY 39-BIT I M BY 39-BIT STORAGE CARD STORAGE CARD

STORAGE CARD

------J t ADDRESS I BANK ENABLES I 4 4 SUBNODE 1 SUBNODE 2 CONTROL CONTROL

TRANSCEIVER DATA + CONTROL DATA + ECC, ECC (39) ADDRESS (39) BACKPLANE

Figure 7 ADU Stor~~~qeModtlle

60 I'ol. 4 ,Vo. 4 Specir~llss~re I992 Digital TecbnicnlJorrrnnl The Alpha Demonstration Unit

We chose to pipeline the four banks of each sub- DRAM DRAM -.. DRAM node. Each of the four banks contributes only one

78-bit word to the block. The banks are started B Zo C sequentially, with a one-cycle delay between each bank. The high performance of the storage module is achieved by maintaining ECL levels and using ECL lOOK components wherever possible. The W\1 1/0 pin levels are converted to ECL levels by latching transceivers associated with each bank. VC: vh~--0 Fortunately, the timing of accesses to the two sub- nodes of a module makes it possible to share these transceivers between the same banks of the mod- Series term~nat~onresults In a half-amplilude signal at the first ule's two subnodes. RAM on the line until the signal reflects from point C The DRhM chips are packaged on small daughter cards that plug into connectors on both sides of the (a) Series Terrninal.io7z main array module. There are 2 daughter cards for each bank within a subnode, for a total of 16 daugh- ter cards per module. The DRAM address and con- trol lines are carried on controlled impedance DRIVER traces. Since each of the 39 DUMs on an address A line represents a capacitive load of approximately 8 picofarads, the loaded impedance of the line is about 30 ohms. VA: "h~p0 The usual approach to driving the address ancl control lines of a RAM array uses series termination, as shown in Figure 8a. This arrangement has the advantage that the driver current is reduced, since the load impedance seen by the driver (R, + Z, is Parallel termination saves one line lranslt time, but increases twice that of the loaded transmission line (Zo). driver current. Unfortunately, thc RpuM access time is increased, because the signal from the driver (5,)must propa- ( 6) Parallel Terniiuafioiz gate to the far end of the line, be reflected, and return to the driver before the first rwlM on the line sees a full-amplitude signal. Since the capacitive F~~LLI-e8 Aclclress Line Ter.r?zinntion loading added by the RAM pins lowers the signal propagation velocity in addition to reducing the address lines, and a clriver for one of the three IbiM impedance, the added delay can be a significant control signals (US, CAS, Write). To reduce the cur- fraction of the overall cycle time. rent switched by each chip, each address bit tlrives Since low latency was a primary design goal, we two output pins. One pin carries true data, ant1 the chose parallel termination of the 1ziiiM address ant1 other is complemented. The total cunent is there- control lines, as shown in Figure 8b. Each address fore constant. Each pin drives one of the two IwM line is terminated to +3 volts with a series resistor modules of a bank. A total of three GaAs chips (R,) of 33 ohms, slightly higher than the line is required per bank. In the present module, with impedance. In this configuration, each line's driver 1M- by 1-bit UMchips, only 10 of the 12 address must sink a current of almost 0.1 ampere. Since no drivers are usrcl, so the system can be easily commercial chip could meet this requirement at expanded to make use of 16M RhiMs. the needed speed, we com~ilissioneda semicustom The storage module contains only a small GaAs chip.' amount of control logic. This logic generates the As shown in Figure 9, each CaAs chip contains a control signals for the IWiLls and the various register for eight address bits, row/column address transceivers that route data from the backplane to nlultiplexing ant1 high current drivers for the UM each bank. This logic also generates the signals

Digital Tecbnicnl Jozrrnnl Vol. 4 1X;o. .I Specin1 lssiie 1992 6 1 Alpha &XP Architecture and Systems

to exercise new CPIJ chips and untested software. With this in mind, we org~nizedthe I/O system round a DECstation 5000 workstation as a front- end ant1 console processor. This reduced our work consitlerably, as all I/<)is clone by the workstation. A 'I'IlRt3Ochannel module connects the DECstation 5000 over a SOMB/s cable to the I/O module in the A1)Ii. We selectetl SOMU/s in orcler to support the simultaneous, peak-bandwitlth operation of two SCSI disk strings, an Ethernet, and a fiber dis- tributed data interface (FI)Ol) network adapter. The I/() module contains two of these channels, which ;tllows two DECstation 5000 worlU niemory, The memory address of a comm;lnd ring is placed into I I an I/() control register, ;mcl the associated doorbell SELlN DO SELOUT , I 1 is rung. The doorbell causes a hardware interrupt on the front-end DECstxtion 5000, which alerts the I/<) server process that action is needed. The I/o server reads the command ring from ADU memory and performs the requested I/(). 1/0 completion sta- tus is stored into ADU memot-); and an interrupt is sent to the requesting t\DI1 processor. In addition to its role as an I/O front-end proces- I+TCy~lr.e9 Add~.c.ssand Control Driver sor, the DECstation 5000 workstation acts as a cliag- nostic and console processor. When an ADU is powerecl on, diagnostic software is run from the ncccled to refresh the IU contains two 50Mn/s Once diagnostic software has run, the console I/o channels ant1 ;I loc;~l(:I'll subsystem. The I/O soft~vi~reis given control. This software is responsi- ch;innels connect to one or two DECstation 5000 ble for loading privileged architecture library (PAL) workstations, which act as I/() front-end proces- ;~ndoperating system software. Once the operating sors ;ud also provide console and tliagtlostic fi~nc- system is running, the workst;ition becomes an 1/0 tions. The local CPlJ subsystem is used to provitle server. interval timer and time-of-day clock services to ADll The presence of the I>E<:statjon5000 gave the processors. chip team and operating system developers a stable 'l'he original specification for the ADIJ I/O system place to stand while cliecking out their own com- required support only for serial line, small com- ponents. In addition, the complete diagnostic capa- puter s)-stems interface (S(:Sr) disk, and Ethernet bility ;lnd error checking coverage of the ADCJ I/o clevices. We knew that the ADU would be used hardware helped to isolate faults.

62 W1. 4 ,Ira. 4 SpecJnl lssrre 1392 Digilnl Tecbnicnl Jorrrnal The Alpha Den?orzsrrzrtion Unit

The central features of the I/o moclule, shown in block-transfer st;ite machine that 11;lncl les bus Figul-e 10, are two 1K- by 80-bit register files built transfers. from 5-1-1s ECL WMs. These memories are cycled Each I/O ch;~nnelcarries 32 bits of tl;~t~plus 7 bits every 10 ns to simulate dual-portecl memories at the of ECC in parallel on a SO-pair cable. Each dat:~word 20-11sbus cycle rate. One memory is used as a stag- also carries a 3-bit tag that specifies the destin;~tion ing RAM for block transfers from the I/O processors of the data. The cable is half-duplex. writ11 the tlirec- 1.0 ADlJ memory. The other memory is shared tion of data flow uncler the control of softw;lre on between use as commancl register space for the I/O the DECstation. Data arriving from the I)ECst;~tionis system and a staging WM for transfers from AI)II buffered in 1K E'[FOs. These FIFOs c;irry data across memory to the 110 system. the clock-domain boundary between the I/O On the bus side, the register files are connected system and the AULI nncl permit both 1/0 cIi;~nnels directly to the backpl;~nebus transceivers. On the to run at h~llspeed simultaneously. I/O side, the register files are connected to a shareel Each I/O channel interface also has a11 ;~ddress 40-11sbus that connects to the two I/O channels. counter and a slot-mapping IWM, which are loadecl The buses are time-slotted to eliminate the from the workstation. The slot-mapping function need for arbitration logic. As a consequence, the sets the correspondence between t\l)ll bus 1/0 module control logic is contained in a small addresses and the geographically addressed storage number of programmable array logic chips that and CPU moclules. The address ant1 slot m;~pfor implement the 1/0 channel controllers ant1 a each cllannel are connected to a common adclress

TO DECSTATION

- INTERFACE

SYSTEM TO BUS DECSTATION

(61) ADU I/O Module

OUTBOUNDn FlFO TURBO- 4-1 4yB-kINTERFACE CHANNEL INTERFACE

INBOUND FlFO

(I?) TURBOchan~zelI/O Mod~ile

Digital Technical Jotrrrrrrl 161 4 No. 4 Speci~~lIssue 1992 63 Alpha AXP Architecture and Systems bus. This bus bypasses the register files ancl directly usecl to hold disks. Power supplies are off-the-shelf clrives the backplane transceivers during bus units. Three supplies are requiretl to provide the address cycles. 4,000 watts consumed by a system containing a full The far end of tlie I/() cable connecls to a single- complement of 111otluIes.A stantl;lrd power condi- width TI'RBOchannel niotliile in the 1)EC:station tioner provides line filtering and distributes pri- 5000. This nodule cont;~insEC<: gener:ltion and mary ti(; to the power supplies. This unit ;tllows the checking logic, and FIFO clueues for buffering data system to operate on 110-volt tic in the United between the citblc antl tlie 1~ClRBOch;uinel.The FIFO States, or 220-\wit AC in Europe. queues also carry tlatn across the clock-domain Coolitig was the most difficult part of the packag- boundary between the I/O channel ;rml the ing effort. The use of ECL throughout the system nJRBOchannel motlules. meant that we had to provide an airflow of at least The I/<> module has a local <:PU subsystem con- 2.5 m/s over the modules. After st~ldyingseveral taining a 12-MHz Motorola 68302 processot; 128KH alternatives. we selected a reverse impeller blower of erasable programmable read-only memory usecl on Digital3 VAX 6000 series m;~cliines.Two of (EI'KOM). anel 128KI3 of RAV. The CPll subs!.stem these blowers provide the requireel airflow, while also includes an Ethernet interface, two serial generating much less acoustic noise than conven- ports, an SCSl interface, an Integrated Services tional fans. Digital Network (ISDN) interfidce. and auclio input Since blower tiiilure woulcl result in a catas- ant1 output ports. When in use, the loc;tl <;l'r r sub- trophic meltclown, airflow and temperature sen- system uses one of the 1/0channels otherwise i~vail- sors are provided. A small module containing a able for the connection of a DE(;st;~tiotl 5000. rnicrocontrolle~-monitors these parameters as well Although the local <:I't! on the I/O module is capltble as all power supply voltages. In the event of failure, of rilnning the ~LII[Al>rl 1/0 system, in pr;ictice \ve the ;iutonoruorls controller can shut down the ~1st.it for supplying interval timer and real-time power supplies. This module also generates the clock service for the Al>lJ. system clock. The VO morlule was somewhat overdesignetl h)r its original purpose of supplying disk, networl<,:tntl Conclusions console I/O service for ADlJ processors. This capa- Sometimes it is better to have twenty million bil ity was piit to use in mid-1991 when the tle~nand instruct~onbby Fritlay than twenty million instruc- tions per seconcl. -Wesley Cl;~rk for ADUs became so intense that we considered building adtlitional systems. Instead. by using the One liuntlred <:I'lJ antl storage modules and 35 I/O excess 1/0 resources, the slot-mapping featilres of modules haw been built. packaged as 35 ADU sys- the hardware, nncl the capabilities of PAI.code, we tems, :untl dcliverecl to software development groups throughout Digital. Not just laboratory were able to use ;I three-processor Al)u ;IS three independent virtual computers. Inclependet~t curiosities, these systems have become part of the copies of the console program shared the I/() h;ircl- mainstream AXP clevelopment environment. They ware through software locking and were allocateel are in regul:~rusc by compiler development groups, one CP'LJ ant1 one storxge motlule e;lch. operating system developers, and :~pplications lMultiprocessor xl>[ls now routinely run both groups. The AJ)II also provided a full-speed, ill-system OpenVi\fJS AXl' ;incl l>T:(: OSW1 AXP oper:iting sys- tems :it the same time. exerciser h)r the chips. By using the ti1)U. the chip tlevelopers were ;~bleto detect several subtle prob- lems in carly chip implementations. Packaging The /\I>[] project was quite successful. AIIU sys- Simplicity was the primary goal in tlie tlesign of the tems were in the hands of tlevelopers ;ipproximately Al)U package. Our short schedule dem;~ncletlthat ten months before the first product prototypes. we avoicl innovation and use standard p;irts wher- The systems esceedcd our initial expectations for ever possible. reliability. ;rntl provided a rugged, stable platform The WU's modules ant1 card cage are st;ind;ircl 9U for software clevelopment ant1 chip test. The proj- (280 millimeter by 367 tnillimeter) Euroc;~rtlcom- ect demonstratetl that a small team, with focused ponents, which ;Ire avail;tble from a nurnbcr of ven- objectives. can protluce systems of substatltial com- dors. The cabinet is ;I st;tndard Digital unit. ~~su;ill'~' plexity in a short time.

64 Vd. 4 No. 4 SpccZL(tlsuue 1992 Digital Technical Journal The AlJlna Denzo~zstratio?~Unit

Acknowledgments 2. C. Thacker, L. Stewart, and E. Satterthwaite, Jr., John Dillon designed the power control subsystem "Firefly: A Multiprocessor Workstation:' and the package. Steve Morris wrote the ADU con- IEEE Transactions on Conzputers, vol. 37, sole software. Andrew Paync contributed to ADU no. 8 (August 1988): 909-920. tliagnostics. Tom Levergood assisted with the physi- cal design of the VO modules. Herb Year): Scott 3. R. Katz, S. Eggers, D. Wood, C. Perkins, and Kreider, and Steve 1,loytl tlitl module debugging and R. Sheldon, "Implementing a Cache Consistency testing at Huclson ant1 at SIK:. Ted Equi handled proj- Protocol," in Proceeclirigs of the 12th Interna- ect logistics at Hutlson, ant1 Dick Parle was respon- tional S~~nzpositlmon Co~lzputerArchitecture sible for material accluibition and supervision of (IEEE, 1985). outsicle venclors at SRC. 4. J. Archibald and L. Baer, "Cache Coherence Pro- References tocols: Evaluation Using a Multiprocessor Simu- lation Model," ACM Transactions on Computer 1. R. Sites, A. Chernoff, M. Kirk, M. Marks, and Systems, vol. 4 (November 1986): 273-298. S. Robinson, "Binary Translation," Digital Rclnlzical Jo~lrnal,vol. 4, no. 4 (1992, this issue): 5. 1991 GaAs IC Data Book and Desigizer's Guide 137- 152. (GigaBit Logic, Newbury Park, CA, 1991): 2-39.

Digital Tecbnkal Journal Vol. 4 No. 4 Speciallssue 1992 65 ToddR Dtitton Daniel EireJ Hugh R. Kurth JamesJ RReisert Robin L. Stewart

The Design of the DEC3000 AXP Systems, Two High-performance Workstations

A fanzily of bigh-perJ-bnnance64-bit RISC zuorkst~~tionsalzd serzlers based 012 the izeuJDigital Alpha AXP ar~chitectureis described. The bar-du~areimplementatiot? uses the pozue~zzllnew DECcl?$21064 CPU and ernploys a sophisticated neLc sj~stenz interconnect structzrre to achieve the necessary high Da~tdwidthand lmu-latency cache, memory, and t/O 6tues. The melnory subsjlstem of the high-end DEC 3000 AXP Model 500prouides a 512KB secondary cache and up to I GB of tnemory The I/O subsyst@?zof the Model 500 has integral two-dime1.2sionalgrqbics,SC.51, ISDIV, and six TURBOchannel expansion slots.

The DEC 3000 AXP system family consists of both using a large number of high-performance I/O workstations and servers that are basecl on Digital's devices. Increased latency can also result from the Alpha IUP architecture.' The family includes the additional levels of buffering ancl system bus loaci- desktop (DEC 3000 A?(P Model 400) ancl desk-side ing common to traditional architectures. Many and rack-mounted (DEC 3000 AXP lMoclel 500) sjls- system Isi~sesalso exhibit multiplexing between terns. The available operating systems are the DEC address and data, leading to further performance OSF/1 AXP and the OperlViMS AX-' systems. All sys- degradation. tems use the DECchip 21064 microprocessoc2 To meet the goals of low memory latency, high Table 1 gives the specifications for the three DEC memory bandwidth, ancl minimal CPU-1/0 mcrnory 3000 AXP systems. contention in a cost-competitive manner, the The IIEC 3000 iucP systems are designed to be sig- designers implemented the DEC 3000 AXP system nificantly faster than all previous Digital work- architecture in an unusual way. They chose to build stations and to offer performance competitive with the system interconnect from inexpensive applicit- that of other reduced instruction set computer tion-specific integrated circuits (ASICs), as shown (RISC) workstations currently available. In general, in Figure 1. Tlie ASICs act as a crossbar between the NSC systems have larger code sizes ancl conse- CPU, memory, and VO buses. Addresses and data are quently require more instruction-stream bantl- switched independently by the crossbar. width than complex instruction set computer The system block diagram in Figure 2 shows the (CISC) systems. Further, 64-bit machines recluire system architectl~reof the DEC 3000 UPsystcrns. more data-stream bandwidth than 32-bit machines. The system crossbar in the center of the diagram is To complement the power of the DECchip 21064 composed of six ASlCs, consisting of the N)DR ASIC, microprocessor, the systems need a balanced the TURBOchannel (TC) ASIC, and four SLICE ASICs. system architecture, including a high-bandwidth, The ADDR ASIC switches addresses between low-latency memory system and an efficient, high- the CPU, the memory, and the TC ASIC. The four performance 1/0 subsystem. SLICE AS~CSswitch data between the CPU, the mem- Traditional workstation designs that use a com- ory, and the TC ASIC. The TC ASIC switches I/O mon system bus exhibit increased memory latency acldresses and clat;~between the ADDK ant1 SLlCL and reduced memory bandwidth due to system bus ASKSancl the TClRUOchannel bus. Connected to the contention. This is a special concern for clesigiis TURBOchannel bus are the various 1/0 controllers,

66 1f01. 4 1%. 4 spec.int /.YSII~1992 Digital Tecbrzical Juurtrnl The Design oftbe DEC 3000 AXP Systems, Two High-performance Workstations

Table 1 DEC 3000 AXP Family Specifications Desk-side Rack-mount Desktop Specifications Model 500 Model 500 Model 400

Height, inches 24.7 15.75 5 Width, inches 12.75 17.5 20 Depth, inches 29.7 27 16.75 Maximum DC power 480 480 295 output, watts Memory Standard, MB Maximum, MB Internal hard disk Standard, MB 1050 1050 426 Maximum, MB 4200 4200 21 00 Serial ports 2 2 2 ISDN port 1 1 1 SCSl ports* 2 2 2 Ethernet portst 2 2 2 TURBOchannel slots 6 6 3 Removable media* 2 2 1 Integral graphics accelerator Yes Yes No Audio Yes Yes Yes

Notes: ' One internal and one external. 1 AUI (th~ckwire) and 10Base-T (twisted pair). 5.25-inch half-height slots.

secondary cache, control logic, TURBOchannel interface and, in the Model 500, the two-dimen- CPU MEMORY sional graphics subsystem. It has connectors for the I/O module, four memory mother boards, a lights and switches module (LSM), three TURBOchannel options, and the power supply. Figure 3 shows the SYSTEM CACHE - I10 layout of the module. CROSSBAR CPU The DECchip 21064 microprocessor is the CPU of FieI Simple Crossbar the DEC 3000 AXP systems. On the Model 500. the CPU runs at 150 megahertz (MHz), and on the Model including the dual small computer systems inter- 400, it runs at 133 MHz. The processor is a super- face (SCSI) controller ASIC, the general t/O con- scalar, fully pipelined implementation of the Alpha troller ASIC, and the two-dimensional graphics AXP architect~re.~It contains two on-chip %kilo- accelerator ASIC (not present in DEC 3000 AXP byte (KB) direct-mapped caches, one for use as an Model 400 systems). In addition, six TURBOchannel instruction cache, the other as a data cache. Both option slots are available for expansion (three slots the integer and floating-point units are contained in DEC 3000 AXP Model 400 systems). on-chip. CPU Module B-cache Szlbsystem The DEC 3000 AXP systems are composed of two The system employs a second-level cache (B-cache) primary modules, the CPU module and the l/O mod- to help minimize the performance penalty of ule. The CI'U module contains the processor, misses and write throughs in the two relatively

Digflal Technical Journal 1/01, 4 No. 4 Special Issue 1992 67 I FOUR MEMORY MOTHERBOARDS

::OR. ADDRESS::OR. Fw ---- 1------EIGHT LONGWORDS .L 0 8 8 _C------....----_____..IIII -C ###I# ------..---- nn,,,4 Id 0 0 0 8 8 8 I I I I I, 0 0 CPU ADDRESS, TAG, AND 1: I I, I : TC CACHE PROBE BUSES CPU. 110, DMA I10 ADDRESS BUS 4 j OPTIONS j::' I P 1------_-..-, - BUFFERING I DECCHIP I ADDRASIC II J L 21064 CACHE I v - CPU I CPU I / BUFFERING I ONE LONGWORD - ,/ LONGWORDS1 - I FOUR SLICE ASlCS - L------JI SYSTEM CROSSBAR

ISDN AND AUDIO

Figure 2 System Block Diagram The Desi~nof the DEC 3000 AXP Systems, Tzvo HighQerformnfzce Workstutiofzs

MEMORY MEMORY CLOCK TC CONNECTORS CONNECTORS SUBSYSTEM CONNECTOR 0 I

I INTEGRAL GRAPHICS

DECCHIP 21 064 CPL 1 TURBOCHANNEL II nATA-DATU I W-IP

SECONDARY CACHE I

, "C SU"", ,,...,.LL I CONTROL LOGIC

SYSTEM RESET LOGIC I

110 MODULE TC ' TC ' CONNECTOR CONNECTOR 1 CONNECTOR 2

Fgure 3 CPU Module small 8KB primary caches of the DECchip 21064 The block size of the B-cache is 32 bytes, match- processor. The B-cache is a 512KB, direct-mapped, ing the block size of the primary caches. The cache write-back cache. A direct-mapped cache elimi- block allocation policy used is to allocate on both nates the logic needed to choose among the multi- read miss and write miss. Hardware keeps the cache ple sets of a set-associative cache, resulting in a coherent on direct memory access (DM) trans- faster cache cycle time. A write-back protocol was actions; DMA reads probe the cache and DMA writes selected because it reduces the amount of write update the B-cache (and invalidate the primary data traffic from the B-cache to main memory, leaving cache). more main memory bandwidth available for other The DEC 3000 AXP systems are designed to be memory transactions. uniprocessor systems, which simplifies the cache

Digital Technical Journal Vo1. 4 No. 4 Special Issue 1992 69 Alpha AXP Architecture and Systems controller design in a number of ways. For example, block. The Model 500 has a maximum cache read since no other CPU's cache can contain a copy of a bandwidth of 480 megabytes per second (MB/s). cache block, there is no need to implement cache The cache may be written by the CPU with an initial coherency constructs such ;IS a shared bit. Further, tag probe latency of five CPU cycles followed by up by loading the B-cache during the power-up to two write cycles of five CPU cycles each. The sequence and keeping it coherent during DIM by Model 500 has a cache write bandwidth of 320 MB/s. using an always-update protocol, cache blocks in When a CPU probe misses in the B-cache, or the B-cache are always guaranteed to be valitl. This when the CPli accesses the external lock register, method eliminates stale data problems without control of the cache is turned over to the external needing to use a valid bit. cache controller This logic controls filling the In addition to the cache memor): the subsystem cache with the required data from main memory, consists of the cache controller, the main memory handing the data to the CPlJ during reads, merging controller, and the protocol control logic for mem- CPU write data into the cache on writes, and main- ory access arbitration. A block diagram of the CPU taining the contents of the external cache tag ancl and B-cache subsystem is shown in Figure 4. tag control store. In addition, this logic maintains The B-cache is alternately controlled by the CPU the architecturally defined lock flag and locked and the external cache controller. When controlled physical address register, which can be used to by the CPU, the cache may be read by the CPIJ in five implement software semaphores and other con- CPU cycles. The cache data bus width is 16 bytes; structs normally requiring atomic read-mod@- therefore two reads are necessary to fill a cache write memory transactions.

CPU ADDRESS BUS > DMA CACHE INDEX

2:1 MUX

CACHE DATAIECC CACHE TAGIPARITY CACHE TAG CONTROU STORE STORE PARITY STORE

512KB 16K x 1 1 -BIT TAGS 16K x 2-BIT CONTROL DECCHIP 16K X 32-BYTE BLOCKS TAGS 21064 SYSTEM MICROPROCESSOR CROSSBAR CPUDATABUS > CPU TAG BUS > CPU TAG CONTROL BUS > CPUICACHE CONTROL LOGIC AND MEMORY CONTROL SEQUENCERS SIGNALS *

CPU STATUS SIGNALS CROSSBAR STATUS SIGNALS CPU CONTROL SIGNALS *a-CROSSBAR CONTROL SIGNALS CYCLE DECODER

I10 CONTROLLER STATUS SIGNALS

110 CONTROLLER CONTROL SIGNALS MAIN SEQUENCER *

Figure 4 CPCJ and R-cache Block Diapm

kl. 4 No. 4 Special Issue I992 Digigllal TecbrricalJouriral The Design of the DEC3000 AXP Systems, Two High-perfo~nnanceWorkstations

The control logic for the B-cache consists of two example, during the "read cacheable memory with interlocking state machines. These state machines victim block" flow, the n~ainsequencer reads the control arbitration and decoding of processor and victim block from the B-cache and stores it in the 110 subsystem requests. They also generate the con- SLICE ASKS in parallel with reading the new block trol signals needed to execute these requests to the from main memor)l. The same flow without a vic- CPII. B-cache, and main memory. tim block makes use of the main memory access The state machines prioritize and arbitrate time to update the tag store. The control flows for requests from various sources, including the CI-'U, writes to cacheable memory also take advantage of the 1/0 subsystem, and the memory refresh logic. this parallelism. A hlrther write optimization is Arbitration is done according to a fixed priority used when the cycle decoder determines that the First priority goes to DMA requests from the I/O sub- entire cache block will be written; in this case the system. Second priority goes to memory refresh data from memory is completely overwritten, and requests. Lowest in priority are requests made by therefore it is never fetched from memory. the CPU. The one exception to this scheme occurs Dmflows are entered upon request of the DM at the conclusion of a Dmtransaction. In this case, controller in the I/O control section. DIM control the first arbitration cycle following the DMA flows start by asserting a "hold request" to the CPU, changes the priority to memory refresh first, CPU causing the CPu to cease B-cache operations within request second, and DLMA last. This guarantees that a specified time, after which it asserts a "hold requests for CPll and memory refreshes are granted acknowledge" signal. It shoulcl be noted that the during heavy DiMA traffic. CPU will continue to execute instructions inter- The larger state machine, or main sequencer, nally until such time as it experiences a miss in one examines the commancl generated by the smaller of its internal caches, or it requires some other state machine, or cycle decoder, and initiates the external cycle. control flow necessary to perform that command. Each DM.% write to memory results in a probe of Fifteen unique flows are implemented by the main the B-cache for the DMtarget block, with a hit sequencer. They are resulting in the B-cache block being updated in par- Read cache;rblc memory with/without victim allel with main memory and the corresponding pri- block mary data cache block being invalidated. DMreads cause main memory to be read in parallel with Write cacheable memory with/without victim probes and reads of the R-cache. If a cache probe block hits, the B-cache data is used to fill the Dmread Write noncacheable memory (diagnostic use buffer in the SLICE ASICs; otherwise the main mem- only) ory data is used. In this manner, cache coherence is maintained. Full block write cacheable memory with/with- out victim block Memory System and System Crossbar Tag space write (cliagnostic use only) The DEC 3000 ltYP Model 400 and Motlel 500 archi- Programmed I/O read/write tecture supplants the traditional system bus with a system crossbar constructetl from ASICs. Tightly Load lockliit coupled to the crossbar is the system memot-).. Three Store conclitional hit types of ASICs-SLICE, ADDR, and TC-form the Memory refresh crossbar. SLICE and tDDR are cliscussetl next and TC is discussed in the I/O Subsystem Interface section.

When a cache miss occurs and the new cache SWCE ASICs block replaces a cache block that has been modi- The four SLICE ASICs are used strictly for data path; fiecl, as intlicatetl by the "dirty" status bit, the dis- together they form a 32-byte bus to main memory, a placed data is referred to as a "victim block" or 16-byte bus to the CPU and cache, and a 4-byte bus "victim data." to the TC ASIC. It is helpful to think of the SLICE The many variants of cacheable reads and writes ASICs as a train station for data with the data buses provide optimized flows that maximize the paral- as train tracks. Data can come and go on any track, lelism of cache accesses and memory accesses. For different tracks have different speeds and widths,

Digil~lTech~ricnl Jorrrrrcrl Vol. 4 IVO. 4 Special lsstre 1992 7 1 Alpha AXP Architecture and Systems and data can find temporary storage in the ASKS. read, victim write, ant1 DivU atltlresses to send to The SLICE ASlCs provide the systems with a location memory. A counter that increments DLMAaddresses to buffer Dm,I/O read, I/O write, ant1 victim data on long TIRROchannel DMAs :llso resides in AIIDR. while the data waits to trwel the next leg of its jour- AI)I>R provides a home to the memory configura- ney The use of the SLICE ASICs also eliminates one tion registers. At power-on time, the boot firmware to two levels of buffering between the dynamic ran- writes ant1 reacls memory space, determines tlie dom-access memories (DMMs) and the <:PLl, thus memory configuration, ant1 writes the configura- decreasing latency and improving bandwidth. tion registers. At run time, each memory address A key design decision was determining the width maps into a i~niquebank, regardless of the type and of tlie memory data bus. A conventional design order of the single in-line memory modules (SIMMs) would have matched the width of the memory bus installed. to the width of the cache bus (16 bytes). However, AI)I>R also provides a home for niiscellaneous to reduce the memory latency of the second half of functions such as tag parity checking, refresh the cache block (cache line size is 32 bytes), the counter. ;~nclthe lockecl physical address register. It system reads the entire cache block from memory generates the c;lche probe intles to check the cache at once using a 32-byte memory bils. Tliis technique tags for ;I hit or a miss on .DMA probes. eliniinates the additional latency from ;I seconcl page-mode read. Memory Mother Board and S1MlV.s Tlie DEC 3000 AXP Model 500 returns the entire ?'he memory system is composetl of memory block to tlie cache ant1 CPlJ with an average latency mother bo;lrcls (MMRs) th:~trise from the system of only 180 nanoseconds (ns) from the <;PU memory card. and SIMMS. This arrangement is a good solu- request. In contrast, a less aggressive prelimin;~ry tion to the problem of limited space on the system design using a system bus ant1 16-byte-wide mem- motlule. It ;~llowsfor a wide dicta bus and for good ory bus yielded an average memory latency of 320 sign:~lintegrity for short propagation times on the ns. The 32-byte n1emoi-y bus costs little more tli;ln a memory data bus. 16-byte bus-two low-cost ASlC:s, resistor p;~cks, As shown in Figure 5, an MMB module supports and some address fan-out parts. up to eight SlMMs at a time (four SIMMs in Motlel400 systems). A minimum of two SIMMs is required for ADDR ASIC each boarcl. A system always contains four MMUs. The ADDR ASIC is a crossbar for addresses. AJ)I)R Tlie MMHs act as a carrier for the SIlLlMs ant1 also con- sends addresses from the CPU to memory (CI'II tain tlrive~-sfor adclress and control signals. reads ant1 writes), from the c~rlto I/O (I/<) reacls A total ofX, 16, 24, or 32 SlMPls (m;~xim~~rnof I6 in and writes), ant1 from the I/O to CPri and meniory Morlel400 systems) can be pluggcd into tlie system. (DIM reads and writes). ADDR selects between <:1-'1J Slhh~lsm;ly be single- or double-sitled with LO DRhikIs

1 TO 8 DRAMS INSTALLED

I I I I I I I I I MEMORY MOTHER BOARD

I CACHE RAM I

CACHE DATA BUS MEMORY DATA BUS I I CPU 1 SLICE ASIC 1 I I

72 lid 4 ,\'(A 4 S/)ecirrl I.v.vrfr. 1332 Digital Tecb~icnlJorrrrtal The Design of the DEC 3000 AXP Systems, Two High-pwfofonnnnce Workstations

per side. Each side of a SIMM constitutes one-eighth DMA reads) or empty (for DMA writes), the CPU is of a bank. Eight SlMMs must be plugged in to com- allowed to use memory. When the I/O controller plete a bank; hence the 320-bit-wide data bus (4 bits detects that the buffer is too full or too empty, it per DWM by 10 DRAivls per SIMM by 8 SIMMs). One requests memory time to service the DMA buffer. megabit (Mb), 4Mb, and 16~bDRATMs are sup- At this time, further CPU requests are temporarily ported, and users are allowed to populate banks in ignored. This technique prevents the CPU from any order. In this way, the DEC 3000 AXP Model 500 being locked out of main memory, even cluri~~glong can support from 8MB to 1 gigabyte (GB) of mem- DMA transactions ant1 even though DMhas priority ory, and the DEC 3000 AXP Model 400 can support over CPU transactions. 8MB to 512MB of memory. The crossbar also permits sirnultaneous write Main memory is protected by a single-bit-correct, transactions from the CPU to main memory and tlouble-bit-detect error-correcting code (EX). In from the CPU to an I/O device. SLICE and ADDR ASICs addition, the arrangement of data bits allows the can buffer one I/O write transaction of up to 32 detection of any number of errors restricted to a bytes in size. Once the ASICs have accepted the data single DWM chip. ECC corrections for CPU trans- and address, the cache and crossbar are free to pro- actions are performed by the CPU, and corrections cess other CPU transactions, which can include for I/O transactions are done in the TC ASIC. cache and main memory reads and writes. If the CPU issues an I/O write while a previous write Memory is still pending in the ASIcs, the cache controller Transactions simply stalls. When data is stored in the B-cache by the CPU, it is not immediately sent to memory. Data is written to main memory only when a dirty block in the cache I.0 Subsystem Interface/ is replaced. Data tlestined for the cache is read from TURBOchannelASIC main memory only on cache misses Reads to main The I/O system is based on the TURBOchannel, a 32- memory, whether from the CPU or from DIM, bit high-performance, bidirectional, multiplexed always return 32 bytes. On CPU reads of main address and data bus developed by Digital for work- memory, data is returned to the cache and CPU in stations."he DEC 3000 AXP supports up to six two halves by the SLICE ASICs. Likewise when the plug-in options, as well as the integral smart frame B-cache control writes victim data to main mem- buffer (SFB) graphics ASIC, the I/O controller ory, two reads are made of the cache, but only one (IOCTL) ASIC, and the TURBOchannel dual SCSI write is made to main memory. (TCDS) ASIC. The TURBOchannel bus is synchronous On Dhc~writes, 4 bytes of clata arrive from the and requires only five control signals in each direc- TURBOchannel interface ASIC each cycle and are tion between the system and the option cards. storetl in the SLICE ASICs. The SLICE ASICs can buffer The system interfaces to the TURsOchannel bus up to 128 bytes of data prior to writing the data to by a data-path Tc ASIC ancl control logic contained main memory using page-mode writes, 32 bytes at a in a number of progra~nmablearray logic devices time. To maintain cache/memory coherence, data is (PALS). The TC ASIC: completes the system crossbar also provided to the cache RAMS so that it may be by passing addresses between the TURBOchannel written in the case of a cache hit. On DlMA reads, up bus and the address ASIC, and passing data between to 128 bytes of data are read page mode out of main the TURROchannel bus and the SLICE ASICs. memory and buffered in the SLICE ASICs. Data flows Furthermore, the TC ASIC checks and generates par- out to the TC ASIC and the TURBOchannel bus at the ity on the TURBOchannel, and checks, corrects, and rate of 4 bytes per cycle (100MB/s). In the event of a generates ECC on the data bus to the SLICE ASICs. cache hit, data is taken preferentially from the Parity checking of TURBOchannel data is optional cache. and is enabled on a per-option basis through a con- The crossbar employs a techniclue that permits figuration register in the TC ASIC. Finally, the TC simultaneous transactions from C1'U to main menl- ASIC contains a number of counters for tracking ory and DMA. The TURBOchannel bus supports Dm Divh progress, as well as configuration and error transactions of up to 512 bytes in length. Once the registers. All control logic was implemented in PALS DIM starts, the system must be able to provide or to minimize the impact to the project schedule of receive data without any gaps. However, while the any design changes. The TURBOchannel interface Dmbuffer in the SLICE ASICs is sufficiently fill1 (for block diagram in shown in Figure 6.

Digital Technical Jouml 1/01. 4 r\b. 4 Speciullssue 1992 73 Alpha AXP Architecture and Systems

TURBOCHANNEL

SLICE ASlCS

DMA REQUEST . CONTROL

STATE MACHINES TC AND DECODE LOGIC - OPTIONS .

Figure 6- TURBOcbaiznel Interface Block Diagram

There are two types of TURI30channel opera- recpest to the system. The DEC 3000 AXP architec- tions: the system initiates I/O reads and writes, and ture employs an arbitration scheme using rotating the options initiate Dhfi reads and writes. On an priority that prevents any option from being locked I/O operation, the system sends the I/o address out. After being granted the bus, the option sup- from the ADDR ASIC to the TC ASIC, ant1 from there plies a DMaddress on the TURBOchannel bus. This to the TURBOchannel. For I/O reads, the option address routes through the TC ASIC and onto the returns data on the TLJRBOchannel. This data passes address ASIC. In the case of a DLMAwrite, data imme- through the "I'C ASIC ancl over the bus to the SLICE diately follows the address on the ru~ochannel. ASlCs. The system inclucles some special hardware This data passes through the T<: ASIC and onto the for byte rnasking of [/O read data. This hardware is clata bus to the SLICE buffers. used to provide support for VIMEbus adapters. DMA reads are more complicated than writes For 1/0 writes, the system sends data from because the TURROchannel bus does not transmit the SLICE ASlCs across the data bus to the TC ASIC. ahead of time the number of bytes of data to be read The TC ASIC: then sends it to the option over the from memory. Instead, it continues to assert its TIIRROchannel. The DEC 3000 AX-' workstation read request signal for as long as it is requesting supports a block write extension to the original data. The SLICE buffers begin to fill up with D,W TURBOchannel protocol. In this mode, the system data, and only when they can guarantee that there supplies a single address followed by multiple will be no gaps in the DlMA will the data transfer consecutive data transfers for improved I/O write start. The TC ASIC receives the read data from the performance. This extension is also config~~rable SLICE ASlCs and sends it onto the TURBOchannel to on a per-option basis through the TC configuration the requesting option. register. Virtual DM allows the system to map non- The TURBOchannel protocol specifies that before contiguous regions of physical address space into any option can use the bus for D&U,it must issue a contiguous regions of virtual address space. This

74 Vol. 4 No. 4 S'pecicrl Jssue 1992 Digital Tecbtricnl Jozrrtinl The Design of the DEC 3000 AXP Systems, Two Highperformance Workstations

method allows TURBOchannel options to transfer interface (AUI) Ethernet, 10Base-T Ethernet, large blocks of DMA data without knowledge of how Integrated Services Digital Network (ISDN), alter- that data is mapped in the physical address space in nate console/serial printer, mouse/keyboard, com- main memory Virtual DIMA enhances operating munications, internal and external SCSI, three system performance because the memory mapping TURBOchannel options, and audio module port. is performed before the transfer of DMA data. The various I/O controllers interface to the The DEC 3000 AXP workstation supports virtual TURBOchannel through one of three ASICs. These DMA through the use of a scatter/gather (SG) map, ASICs are the smart frame buffer (SFB) on the CPU which acts as a translation buffer. SG mapping is module and the TURBOchannel dual SCSl (TCDS) enabled on a per-option basis through the configura- ASIC and the I/O controller (IOCTL) ASIC on the I/o tion register in the TC ASIC. The SG map is organized module. as 32K 24-bit entries. Each entry contains a 17-bit physical page number (PPN), parity, and valid bit. Software sets up the map through 1/0 space reads VIRTUAL DMA BYTE ADDRESS FROM TURBOCHANNEL and writes. DMA byte address bits [27:13] index the 33 28 27 1312 0 SG map, which produces a 17-bit PPN (bits [29:13]) to UNUSED VIRTUAL PAGE BYTE append to the virtual DMA byte address bits [12:0]. The resulting 30-bit physical DMA byte address can then address all IGB of the possible system address SG MAP space. An SG map is shown in Figure 7 29 + 13 12 + 0 PPN I/O Subsystem BYTE PHYSICAL DMA BYTE ADDRESS TO MEMORY SYSTEM Most of the I/O subsystem is implemented on its own module. This I/O module, shown in Figure 8, contains the connectors for attachment unit Figure 7 Scattec/Gather Mafifiing

Mousv TPlC \ EXTERNAL SCSl COMMUNICATIONS KEYBOARD ISDM AUI CONNECTOR

I \ I \ \ I \ INTERNAL SCSl TC CPU MODULE TC REAL-TIME FLASH TC AUDIO CONNECTOR 5 CONNECTOR CONNECTOR 4 CLOCK MEMORY CONNECTOR 3

Figure 8 VOModule

Digital Technical Journal Vo1. 4 No. 4 Speciallss~te1992 75 Alpha AXP Architecture and Systems

I/O Module-IOCTL ASIC signal-to-noise ratio ranged from 24 decibels with a A key I/O subsystem design tlecision was to reduce minim;~lsignal input to 58 decibels with a high- time-to-market by elinii~iating unnecessary new level sign;ll inpiit. harclware and software development. I'o support most of the I/<) functionalit): the designers chose I/O Module-TCDS ASIC the lOCTL ASIC developed for the I>E<:st;~tion5000 Although the IO<:TL ASIC contains an interface Motlel 240. to a SC:SI controller, the DE<: 3000 AXP s)Istems Tlie lO<:TLASIC provides ;in interface to a 16-bit, implement their SCSI interface using the TCDS general-purpose I/O bus, which supports the fol- ASIC:. This design has several advantages. First, the lowing devices: two Z85C30 serial communi- l'<:I>S ASIC: supports two SCSI ports rather than cations controllers (S<;Cs), an &\.I[) 7990 Loc;~l;tre;i the one supportetl by the IOCTI.. ASIC, permitting netnlork controller for Ethernet (ILANCE), a I);~ll:is separate internal and external SCSI chains. Second, semicontluctor DS1287 real-time clock. an ~,\ll> this design eliminates contention between the 79C30A ISDN data controller (IIIC), a S<:SI coo- Ethernet controller ant1 the SCSI controller for the troller. and an AVlD 27C020 256~~er;ls:ible pro- 10(:'1'1.. 1x1s. 'T'hird, tlie T<:l>S t\SIC supports much grarnrn;ible re;~d-onlymemory (El)l<~)x~). longer 'I'1JReoch;lnnel I>,W\ bursts (64-byte bursts The SCCs implement the keyboard, mouse, alter- rather than 16-byte bursts). Finally, tlie resulting nate console/printe~-.and cornmunic;itions ports. ASIC: design is i~sedto implement a dual SCSl The mouse and keyboanl tlo not use I>Yl~lr\.The alter- -l'lJi~l$OclianneIoption module. nate console/printer and the communic;itions port The 'I3c:1)S i\Slc: inlplenients two separate SCSl do use Dkw. ports using two NCK 5SC94 advanced SCSI con- The 1.ANCE implements the Ethernet interface, trollers (A~<:S). The TcDs allows both controllers to wliich connects to the loc;il area network (LAN) h;lve I>lLli\ transfers in progress si~~~i~lta~~eo~~slj~. tliroiigh either the Arrl (thickwire) or lOH;~se-~l 'L'CIIS TIJRHOchan~iel DIMA transactions are (twisted-pair interconnect [I'I'IC]) connectors. Soft- aligned 64-byte blocks. Starting Dim addresses that ware controls which one of these interfaces is :Ire not aligned to these bounclaries begin with a enabled. sm;rller IIMA tr;ins;rction. This technique aligns tlie 'The real-ti~iieclock provitles time-of-year (TOY') atlclress so that succeeding transactions are aligned reference ant1 50 IsjTes of nonvol;~tile Iwhl. A 64-byte blocks. Large, alig~letltransactions increase lithium battery supplies power in the event of both ~l~~~ochanneland memory access efficiency. system power-off or failure. The TCns ASIC: and the AS(;s provicle odd parity The IDC implelilents both an ISDN interk~ceant1 protection on major data paths. This protection telepl~one-qualityautlio. The autlio connects to the inclutles 8-bit parity on the 16-bit bus between the autlio interface module (AIM), which provides the TCOS and the AS<:s, 32-bit parity on TCDS DIMA buffer audio I/O in the Model 500. Audio I/<> in the Model entries, and 32-bit parity on TIJRROchannel trans- 400 is on its f/O motlule. :ictions, both I/O ;ind IIILIA. Tlie i\lV on the iModel 500 supports ;~uclioinput through either a X-inch minijack for microphone input, a 4-pin modular jack (8lJ) connector for use Grapbics of a telephone handset, or ;In RCA-style phonogr;lph The gr;rphics subsystem on the Model 500 sys- jack used as a line-in inp~~tOutput is provitletl by tem c:irtl provides integral 8-plane graphics with the i~lJconnector ;IS well as by a %inch stereo- hardware enhancements for improved frame buf- pl~onicjack. The stereophonic jack accepts only a fer performance. These enklncements increase stereophonic plug. If monophonic headphones are the performance of stipple, line drawing, ant1 copy usctl, ;I mono-to-stereopl~onicatl;ipter is requiretl. operations. The grapl~icssystem consists of an SFB On the Nlotlel400, audio input ;lnd oi~tpl~tis imple- ASIC:, 2MH video lUM, ant1 the Brooktree Bt459 mented using a 4-pin 81~connector. IL4MI)A<: chip for sourcing the 8-plane RGB data. Analysis of the complete audio system in a Motlel The user can select either a 66-Hz or a 72-Hz moni- 500 shows a frequency response of '145 I-Jz to 3,500 tor refresh rate through a switch on the back of the Hz, with typical distortion in the 0.8 percent to 1.9 workstation. The graphics subsystem can draw percent range for the microphone ;ind 0.4 percent 6I5K two-dimensional vectors per second and can to 1.5 percent for the telephone handset. 'Che perform copy operations at 31.8MH/s.

76 Ibl. .-I 1Vo. .Fpcc.ic~lLSSII~ 1992 Digital Techtrical Joun~al The Design of the DEC 3000 AXP Sj/sle?ns,Tzijo High-perfimance Workstations

The graphics subsystem is available separately as margin. The PECL clocks are converted to transistor- the T~~sochannelHX graphics option card. In acldi- transistor logic (nL) in the last stage of the clock tion, high-performance two-dimensional and three- fan-out tree. dimensional graphics accelerators are available Two stages of system clock fan-out are used as through the TURBOchannel bus for all systems. shown in Figure 9. Two MClOOElll ECL clock buffer chips (PECL input and output) provide 18 lorn.- Clock System skew differential copies of the clock. Seventeen The input clock circuitry to the DECchip 21064 CPU @1~100~641ECL-to-TTL converters (PECL input, TTL contains a differential 300-MHz oscillator (266 MHz output) are distributed throughoi~tthe system and for the Model 400), which drives an alternating cur- I/O boards to provide more than 100 clock lines. All rent (AC) decoupling circuit and the CPU chip. The clock lines are length matched to reduce skew, and CPU chip divides down the input clock frequency PECL wires are separated from TTL. Worst-case by a factor of two ancl operates internally at 150 SPICE simulation indicates a skew between typical MHz. The DEC 3000 AXP Model 500 is capable of sup- components such as PALS to be 1.5 11s.Actual skews porting a 200-MHz CPU with a ~~O-MHZoscillator. measisurecl in the lab are approximately 0.5 ns. The entire system, with the exception of some To give designers maximum flexibility, four I/O devices, runs synchronously. The master system phases of the system clock are generated, one every clock is generated by the <:PU chip at a frequency of 10 ns. Delay lines are used to generate an offset of 10 25 MHz (22 MHz for the Model 400), resillting in ns. By swapping the 11igI1 and low differential inputs system clock cycles of 40-ns duration. This master to selected ~~100~641com7erters, the 20- and 30- system clock is duplicated and distributed with ns delayed clocks are generated. The master system differential pseudo-emitter coupled logic (PECI.) clock is delayed using delay lines so that the even- to maintain minimum skew and to improve noise tual system clock is synchronous with the (:PrJ chip.

3.3-v rFd-T25 MHz FGEL,DELAY SYSCLK

300 MHZ TTL LEVELS -0 1 I CLK > PAL PECL 2 LEVELS 3- C ECL-TO-TTL 4 - CONVERTER 5 - > GA 6- 7- 0- 8- > FLOP ECL 13 0- PECL 1 - LEVELS BUFFER 5 2 - - CMOS-TO- CHIP ECL-TO-TTL -PECL CONVERTER CONVERSION -

0 1 - BUFFER 2- DELAYED CHIP 10-NS - CLK ECL-TO-TTL > PAL - DELAY CONVERTER LINES 2- I ib6-

Figure 9 Clock Distribution

Digital Tecbnical Journal Vo1. 4 No. 4 Special Issue 1992 77 Alpha AXP Architecture and Systems

implemented. TPD is defined as the total diameter The goal in choosing semiconductor devices was to of permissible movement from a theoretical exact select mature silicon technologies and then push location around the true position of the pads. This those technologies to the limit. Module- and chip- TPD requirement ensures proper positional accu- level signal integrity was verified by correlating racy between the solder paste stencil apertures and silicon bench characterization data to device simu- the surface-mount features. In addition, solder lation modules. CAD tools were used to perform mask between pads on the fine-pitch components worst-case module timing and signal integrity sim- is used to reduce manufacturing defects. ulation. This methodology minimized device costs, To reduce power and cost, the slower DEC 3000 reduced risks, and shortened time-to-market. AXP Model 400 design substitutes CMOS technology The nine ASICs in a DEC 3000 AXP workstation for the BiCMOS cache SRAMs and for many of the use six unique 1.0-micrometer complementary bipolar PU. metal-oxide semiconductor (CMOS) designs. (See Table 2.) Plastic quad flat packs (PQFP) are used as Pawer and Packaging the packaging technology to limit device cost. The following fixed disk drive options are currently Because the ASICs are 1/0 limited and the PQFPs do available. not have ground planes, the effects of simultaneous switching outputs (SSOs) were a concern. The RZ25 3.5-inch half-height 4261~~disk drive potential effects of ssos in CMOS output buffers ~~263.5-inch half-height 1050MB disk drive include corrupted data and undesirable oscil- lations. Simulation and bench characterization The following removable media options are also were used to quant@ the SSO effects, and in some available. cases SSOs were reduced by staggering output driver timing. RRD~~5.25-inch half-height 600~~CD-ROM drive Although ASICS were chosen for the data path, ~X263.5-inch half-height 2.8MB floppy disk drive PALS were used for control logic due to their greater flexibility and faster turnaround time. A total of 63 TZKlO 5.25-inch half-height 525MB QIC tape 20XX (5 ns) and 22V10 (10 ns) PALS with 57 different drive codes was used. Exhaustive system-level simula- TLZO~5.25-inch half-height 4000~0DAT drive tion and bench characterizations were performed to understand device behavior in the man)? differ- The Model 500 has a 480-watt output, off-line, ent loading scenarios. switching regulated power supply, which includes The CPU board technology proved moderately a capacitor-input, automatic voltage-selecting cir- difficult for system-level assembly due to the large cuit to permit worldwide operation without a volt- distance between the fine-pitch (25 mil) compo- age-select jumper for 120 or 240 volt (V) input. The nents. There are 19 fine-pitch components on the power supply provides five outputs to the load: 14- by 16-inch CPU board, with a maxinlunl distance +3.3 V, +5.1 V-CPU, +5.1 V-turbo, +12.1 V, and - 12.1 't: of 14 inches between any two devices. With this The power supply also provides power for three large distance, an aggressive, true positional diam- external fans. Temperature-sensing fan speed con- eter (TPD) tolerance requirement of 6 mils was trol is provided to reduce system noise. The power

Table 2 ASlCs Used on the DEC 3000 AXP Workstations Total Number Number of Number of Used Available Chip of Pins Pins Used Signal Pins Gates Gates

SFB 184 184 150 21.6K 54 K TC 184 182 144 12.1K 44K SLICE 184 184 153 11.2K 44 K ADDR 184 183 148 5.7K 44K TCDS 120 120 94 26.5K 68K IOCTL 160 160 126 11.2K 44K

78 Vol. 4 Ale 4 Special Issue 1992 Digital Techtricnl Journal The Design of the DEC 3000 AXP Systems, Two High~erforrnnnceWorkstc~tiorzs supply senses tachometer outputs from the fans, loaded into the instruction cache. 'These programs and when a fan fails, it shuts down and illuminates include power-up code for loading the real console, an indicator. a miniconsole, and five tliagnostic programs for testing memory and the graphics subsystem. Other tests are available by replacing the EPROM. These programs are of great value in the manufacturing The designers provided several debugging features, debug environment. including test points on the module, tristate out- Two flash EPROMs contain the console code for puts on ASICS and PALS, an on-board diagnostic the system. On power-up, code in the serial ROM ROM, and programmable console ROkI. Since the loads the console code into memory and begins module is composed almost exclusively of surface- executing it. Users can easily update the console mount devices, the designers specified as many vias ROMs (for example, to provide PAL code enhatice- as possible for use as test points. Consequently, all ments) through a special utility booted off a CD- wires on the board have test points, which allows RO&i connected to the system. Field service can for 100 percent short-circuit coverage and 94 per- update the console code in the system remotely cent open-circuit coverage. through the Ethernet. The DEC 3000 AXP workstation takes full advan- tage of the serial ROM port on the DECchip 21064 CPU. This port allows code to be directly loaded Conclusions into the instruction cache. During prototype devel- The primary goal of this project was to design a bal- opment, designers loaded special debug programs anced system that exhibited low memory latencj: into the C:PU through this port. However, the real high memory bandwidth, and mini~nal <;pU-I/O innovation is in also wiring this port to the output memory contention in a cost-effective manner. of a 64K by 8 EPROM on the module to provide 8 Table 3 gives the measured peformancc numbers programs that are individually selectable by moving for these characteristics. Except where noted, all a jumper on the module. On system reset, serial numbers are for sustained performance. Of particu- program data from the selected EPROM output is lar note are the numbers showing that the CPU

Table 3 System Performance - DEC 3000 AXP DEC 3000 AXP Model 500 Model 400 CPU speed 150 MHz 133 MHz B-cache size 51 2KB 51 2KB B-cache read bandwidth 480MB/s 426MBls B-cache write bandwidth 320MB/s 284MB/s Maximum main memory 1GB 51 2MB CPU memory latency (average) 32 bytes1180 ns 32 bytes1203 ns CPU memory read bandwith 114MBIs 101MBIs CPU read with victim write 16OMB/s 141MBls memory bandwidth TURBOchannel peak bandwidth 89MBIs I10 read bandwidth 8 bytes 12MBIs I10 write bandwidth 8 bytes 29MB/s Block I/O write bandwidth 32 bytes 59MB/s Block I/O write bandwidth 32 bytes with CPU I/O=47MB/s read and victim write memory bandwidth MEM=95MB/s DMA read bandwidth 51 2 bytes 81 MBIs 64 bytes 51 M B/s DMA write bandwidth 512 bytes 82MB/s 64 bytes 52MBIs 64-byte DMA write bandwith with DMA=52MB/s CPU reads from memory CPU=27MB/s

Digital Technical Journal V'l 4 No. 4 .5pecr~1llss~~e1992 79 Alpha AXP Architecture and Systems receives significant memory bandwidth even in the those mrho contributed to the design of original presence of heavy block VO and DIMA traffic. hardware: Dave Archer, Mark Baxter, John DeRosa, Another goal of the project was to offer per- Chris Gianos, Leon Hesch, Dave Laurello, Bob formance that is cotnpetitive with RISC worksta- McNarnara, Dick Miller, Rick Ruclman, Dave tions available from other vendors. The benchmark Senerchia, Petr Spacek, Bob Stewart, Ned Utzig. performance of any system derives from the inter- Debbie Vogt, and John Zurawski. The tight schedule dependent performance of the hardware, the oper- could not have been met without the special efforts ating system, and the compilers that generate the of the Power and Packaging, Console, Qualifi- application code. The benchmark perforn~ance cation, Proto Management, and Technology and should improve as each element matures. Table 4 Operating Systems Groups. The design team for the shows the performance of the DEC 3000 AXP sys- DEC 3000 AXP Model 400 project is also recognized: tems on a selected set of benchmarks as of the John Da): Jamie Pierce, Dennis Rainville, and Ken announcement dates of these products. Table 5 Warcl. The thorough device evaluations by Rob compares the performance of the DEC 3000 AXP Zahora contributed significantly to the success of Model 500 to the published performance of several the projects. We would also like to acknowleclge currently available competitive systems.-' the contributions by FXO personnel. The Electronic Storage Development Group was responsible for Acknowledgments the design of the nEC 3000 AXP Model 500 memory The DEC 3000 AXP Model 500 design was a team module. Significant efforts by the Maynard TIME, effort-more people were involved than can be Albuquerque, and Ayr Manufacturing Plants should acknowledged in this space. Recognition is due to be recognized for delivering quality hardware

Table 4 Benchmark Performance DEC 3000 AXP DEC 3000 AXP Model 400 Model 500 Clock (MHz) SPECmark89 Dhrystones V1.l (Dhrystones per second) V2.1 (Dhrystones per second) LINPACK 64-bit double precision 100 x 100 (MFLOPS)* 1000 x 1000 (MFLOPS) Xl 1PERF Two-dimensional vectors per second Two-dimensional pixels per second

Note: *Million floating-point operations per second

Table 5 Com~etitiveCom~arison DEC 3000 IBM RS6000 HP9000 Model 500 Model 580 Model 750 SPECmark89 121.5 126.2 86.6 Dhrystones V1.l (Dhrystones per second) 257.7K nla 133.7K V2.1 (Dhrystones per second) 281.2K nla 122.3K LINPACK 64-bit double precision 100 x 100 (MFLOPS) 26.4 38.1 23.7 1000 x 1000 (MFLOPS) 79.9 84.0 nla

80 Vol. 4 No. 4 Special Issue 192 Digital Technical Journal Tbe Design of the DEC -3000AXP S]lster?zs,TLPO High performance Workstations during the development ancl production phases; a 1992) 1555-1567 and Digital TecI7~zicalJo~i~~nal, special thanks to Jirn Ersfelcl for his significant vol 4, no 4 (1992, this issue) 35-50 efforts in this regard. 3. TI:RROchannel Specifications, Version 2C (Palo References Mto, CA: Digital Equipment Corporation, 1. R. Sites, etl., Alpha Architecture Reference TRI/ADD Program, Order NO. EK-TCDEV-DK-004, Manual (Burlington, MA: Digital Press, Order September 1991). No. EY-L520E-DP, 1992). 4. Alpha AXP Workstation Family Performance 2. D. Dobberpuhl CL al., "A 200-MHz 64-bit Dual- Brief-Openv>lS, Secontl Edition (Maynard: issue CMOS Microprocessor," 1EEE Jout,lza/ oJ Digital Ecluipment Corporation, Order No. Solid-Stc~teCircuits, vol. 27, no. 11 (November EB-N0102-51,November 20, 1992).

Digilul Techrrical Journal Vo1. 4 IVO. 4 Spccirtl Issue 1992 8 1 Barry A. Maskas Stephen F. Shindon Nicholas A. Warclgol

Design and Performance of the DEC 4000 AXP Departmental Server Computing Systems

DEC 4000 AXP systems demonstrate the highest performance and fu~zctionality for Digital's 4000 series of departmental server systems. DEC 4000 AXP sjatems are based on Digital's Alpha AXP architecttire and the IEEE's Futurebust profile B stczndard. They provide sy~nnzetricmultiprocessing perforwza~zcefor Open EVJSRYP and DEC OSF/I AXP operating systems in an office enz~ironment.The DEC 4000 AXP systems were designed to optimize the cost-perforinance ratio and to irzclticle zqgmdability and expa~zdability.The systems combine the DECchip 21064 nzicro- processol; submicron CIMU sea-of-gatestechrzolog7~ ClVIOS memory and I/Oper.~ph- erals technolog3,, a high-performancemultiprocessing backplane interconnect, and modular system design to supply the most advancedfunctionality forperformance- driven applications.

The goal of the departmental server project was to System Overview establish Digital's 4000 family as the inclustry's most The DEC 4000 AXP system provides supercomputer cost-effective and highest-performance depart- class performance at office system cost.?This com- mental server computing systems. To achieve this bination was achieved through architecture and goal, two clesign objectives were proposed for the technology selections that provide optimized DEC 4000 AXP server. First, migration was necessary tuniprocessor performance, low additional cost from the VAX architecture, which is based on a com- symmetric multiprocessing (SMP), and balanced plex instruction set computer (CISC), to the Alpha I/O throughput. High I/O throughput was accom- AXP architecture, which is basecl on a reduced plished through a combination of integrated con- instruction set computer (RIsc). Second, for expan- trollers and a bridge to Futurebus+ expansion I/(). sion I/o in an upgradable office environment enclo- The system uses a modular, expandable, and sure, migration was necessary from the Q-bus portable enclosure, as shown in Figure 1. With to the Futurebus+ I/O bus.' In addition, the new current technologies, the system supports up to system had to provide balance between processor 2 gigabytes (GB) of dynamic random-access nlem- performance and I/O performance. maintaining ory (DRAM), 24GB of fixed mass storage, and 16~~ customer investments in VAX and MIPS app1ic;ltions of removable mass storage. The DEC 4000 AXP through support of OpenvMS AXP and DEC OSF/1 system is partitioned into the following modular AXP operating systems was implicit in the archi- subsysten~s: tecture migration objective. Migration, porting, and upgrade paths of various applications were Enclosure (~~640box) defined. CPU module (DECchip 21064 processor) This paper focuses on the design of the DEC 4000 AXP hardware and firmware. It begins with a discus- 110 module sion of the system architecture and the selection of Memory modules the system technology. The paper then details the CPU, 1/0,memory and power subsystems. It con- Mass storage compartments and storage device cludes with a performance summary. assembly (brick)

82 Vil. 4 IVO.4 Special Issue 1992 Digital Techrrical JoztrtraI Design and Perfor~nanceoftbe DEC 4000 AXP Depclrtmeiztul Seri~crConzputirzg Sjntenzs

increased processing power provided by the DECchip 21064 CI'11. Although processing power was taking a revolutionary jump in performance with no cost increase, disk and main memory tech- nology were still on an evolutionary cost ancl per- formance curve. 'l'he nietrics that had been used for VA~systems were difficult, if not impossible, to meet through lineill- scaling within a fixed cost bracket. These metrics were based on VAX-11/780 units of performance (VtlPs); they give main mem- ory capacity in megabytes (MR)/VLIP, clisk-queuecl I/O (($0) completions in QlOIs/\~lTP,and disk data rate in MB/s/VUP. As an example, Table 1 gives the metrics for ;I VAX 4000 ivlodel 300 scaled lin- early to 125 wl's i~ndthen nonlinearly scaled for the DEC 4000 system implementation. Performance rnoc.leling of the DECchip 21064 <:PlJ suggested that 125 VllI's was a reasonable goal for the DEC 4000 AXP. Without an Alph;~AXP architecture custo~ner base, we did not know if these metrics woulcl scale linearly with the processor performance. The DECchip 21064 processor technology has the poten- tial for attracting new classes of compute-intensive applications that may make these metrics obsolete. We therefore chose a nonlinear extrapolation of the Figure I DEC 4000 AXP S~atemEnclos~tre metrics for our initial implementation. By trading off disk and memory capacity for I/O throughput Futurebus+ Expansion I/O, Futurebus+ con- performance, we kept within established cost ant1 troller module (FBE) performance goals. The implementation metrics Power supply modules - universal line front-encl were not limited by the architecture; further scal- unit (FEU) ing up of metrics was planned. Of the four metrics, - Power system controller (PSC) the disk capacity metric has the most growth - DC-DC converter unit 5.0 volt (V) (DC5) potential. To ensure conipliance with both the Alpha &?

Digital Technical Joumnl Vo1. 4 iVo. 4 Sprcirrl Issrbe 1992 83 Alpha AXP Architecture and Systems

ASYNCHRONOUS SERIAL LlNE (AUXILIARY WITH MODEM CONTROL) CONNECTION MODULE I ASYNCHRONOUS SERIAL- - LINE (CONSOLE LINE) - CONTROL PANEL I I ------I CPU SUBSYSTEM I I I I I I I I I FUTUREBUS+ 1 I10 EXPANSION 11 ;======2 1 I MASS STORAGE COMPARTMENT I I I I I 1 I DSSIISCSI 0 FIXED MEDIA 161 DEVICE 110 I I DSSIISCSI 1 FIXED ) MEDIA DEVICE I ic DSSllSCSl2 4 > DEVICE 0b; TO STORAGE EXPANSION DSSIISCSI 3 DEVICES I I1.8 . nh, DEVICE U ' I ' I I DSSIISCS14 REMOVABLE MEDIA > DE\JICE > I I I

1 / 1 q~EE::~A~A~~~F-+DEVICE ETHERNET PORT 1

ETHERNET PORT 0 TOETHERNET

r------I r------i I POWERSUBSYSTEM I COOLING SUBSYSTEM I I I

AC POWER CABLE 1 ------A

KEY: fl EXTERNAL PORT CONNECTION

Figure 2 DEC 4000 AXP S'lstem F~~~zctiouralPartitiotr

84 Vol 4 ,Vo. 4 Specin1 Issue 1992 Digital Technical Journal Design and Pevorrnance of the DEC 4000 AXP Departmental Server Comnpziting Systems

Table 1 Extrapolated VAX Metrics VAX 4000 Scaled Scaled Model 300 Linearly Nonlinearly Metrics to 125 VUPs for DEC 4000 AXP

Memory capacity 60 MBIVUP 7.5 GB 2 GB Disk capacity 1.65 GBIVUP 206 GB 100 GB Disk QIO rate 49 QlO/s/VUP 6,125 QlOIs >4,000 QlOls I10 data transfer rate 1.4 MBIslVUP 175 MBIs 210 MBIs

compartment where each brick has a dedicated 4. Provide scalable memory banclwidth, based on controller and power converter. Support for DSSI, protocol timing of 25 nanoseconds (ns) per SCSI, and high-speed 10MB/s SCSI provides maxi- cycle, which scales with improvements in DRAM mum flexibility in the storage compartment. The and static memory (SW~) access times. modular mass storage compartments enable user 5. Use module and connector technology consis- optimization for bulk storage, fast access, or both. tent with Futurebus+ specifications. The cost of SIMP was a key issue initially, since Digital's SMP systems were considerecl high-end sys- The cache coherence protocol of the system bus tems. Pulling high-end functionality into lower- is designed to support the Alpha t\XP architecture cost systems through architecture and technology and provide each CPU and the I/O bus with a consis- selection was managed by evaluation of perfor- tent view of shared memory. To satisfy the band- mance and cost through trial designs and software width and latency requirements of the processor's breadboarding. Several tlesigns of a CPU module instruction issue rate, the processor's second-level were proposed, including various organizations of cache size, 128-bit access width, and 32-byte block one or two DECchip 21064 CPUs per module inter- size were optimized to avoid bandwidth limits to faced to I/O and memory subsystems. Optimization performance. The block size and access width were of complexity, parts cost, performance, and power made consistent with the system bus, which satis- density resulted in a CPU module with one proces- fied the VO throughpilt metrics. Consitleration was sor that could operate in either of two CPU slots on given to support of a 64-byte block on the 128-bit- the centerplane. Consequently, a system bus had to wide bus. Such support would have resulted in a 17 be developed that coulcl be interfaced by proces- percent larger miss penalty ant1 higher average sors, memory, ant1 I/O subsystems in support of the memory access time for the CPU and I/(), more stor- shared-memory architecture. age and control complexity, and hence higher cost. As development of the DECchip 21064 processor Simplicity of the bus protocol was achieved by progressed, hardware engineers and chip designers limiting the number and variations of transactions established a prioritized list of design goals for the to four types-read, write, exchange, and null. The system bus as follows: exchange transaction enables the second-level cache of the CPU to exchange data, that is, to per- 1. Provide ;I low-latency response to the CPLl's form a victim write to memory at the same time as secontl-level cache-miss transactions and I/O the replacement read transaction. This avoided the moclule read transactions without pending coherence complexity associated with a lingering transactions. victim block after the replacement read transaction 2. Provide a low-cost shared-memory bus, based completed. on the cache coherence protocol, that woulcl To address the issue of bandwidth requirements facilitate upgrades to faster CPU modules. This o17ertime as faster processors become available, an provision implied a simple protocol, synchro- estimate of 40 percent bus utilization for each pro- nous timing, and the use of transistor-transistor cessor with a lMB second-level cache was obtained logic OTL) levels rather than special electrical from trace-based performance models. The utiliza- interfaces. tion was shown to be reduced by using a 4MB sec- 3. Provide I/O bandwidth enabling local VO to ond-level cache or by using larger caches on the operate at 25 megabytes per second (MB/s) and DECchip 21064 chip. This approach was reserved as the Futurebus+ to operate at 100MB/s. a means to support future CPU upgrades.

Digital Technical Journal Vol 4 iVo 4 Sj~ecr~~lIssue 1992 Alpha AXP Architecture and Systems

Figure 3 is a block diagram of the length-limited storage devices had to be protected. Tlie DECchip seven-slot synchronous system bus. To achieve 21064 CPU's data bus and secontl-level cache are tight motlule-to-modulc clock skew control for this lo~lgworderror detection and correction (EDC) pro- single-phase clock scheme, clocks are radially dis- tected. The system bus is longword parity pro- tributed from the CPU 1 module to the seven slots. tected. The memory subsystem has 280-bit-wide This avoided the added cost of a separate module EDC-protected memory arrays. The Futurebus+ is dedicated for radial clock clistribution, and enabled longword parity protected. the bus arbitration circuitry to be integrated onto the CPlJ 1 module. System Bus Clocking Arbitration of the two CPU modules ancl the I/O module for the system bus is centralized on the (;ptJ To establish the 25-11s bus cycle time, analog models 1 motlule. To satisfy the I/O motlule's latency of the interconnect were developed and analyzed requirements, the arbitration priority allows the for 5 0-V CMOS transceivers. Assuming an edge-to- I/O nloclule to interleave with each CPU module. In edge data transfer scheme, the modelers evaluated the absence of other requests, a module may utilize the timing from a driver transition to its settled sig- the system bus continnously. Shared-memory state nal, including clock input to driver delay, receiver evaluations from the bus adclresses during continu- setup time, and module-to-module clock skew. The ous bus utilization causes CPU "starvation" from cycle time and the data transfer width were com- the seconcl-level cache. To avoid CPU starvation bined to determine compliance with low latency from the second-level cache, the arbitration con- and bandwidth. Further analysis revealed that the troller creates one free cycle after three consecu- second-level cache access timing was critical for tive bus transactions. performing shared-memory state look~rpsfrom the bus. One solution to this problem was to store duplicate tag values of the second-level cache. This Technology Selection was evaluated and found to be too expensive to The primary force behind technology selection was implement. However, the stucly did show that a to realize the f~~llperformance potential of the duplicate tag store of the CPU's primary data cache DECchip 21064 microprocessor with a balanced I/O had a performance advantage and was affordable if subsystem, weighted by cost minimization, sched- implemented in the CPU module's bus interface unit ule goals, and operation in an office environment. (BIu) chips. SPICE analysis was used to evaluate various module To evaluate second-level cache access timing, and semiconductor technologies. A technology a survey of SRAM access times, density, availabil- tlemonstration module was designed ancl fabri- ity, and cost was taken. Results showed that a IMB catecl to correlate the SPICE models and to validate cache using 12-11s access time SWswas optimal. possible technology. Basetl on clemonstrations, the With a 12-11s access time SRAM, the critical timing project proceeded with analytical data supported could be managed through the design of the BlU by empirical data. chips. The SRAM survey also showed that a ~MB The 25-watt DECchip 21064 c:PU was designed in second-level cache could be planned as a follow-on a 3.3-V, 0.75-micrometer compleincntary metal- boost to performance, as SRAII,~prices declined. oxide semiconductor (CMOS) technology and was Trace-based performance simulations proved that packaged in a 43-pin pin grid array (PGA). The CPU these cache sizes satisfied performance goals of 125 was the only given technology in the system. The WPs. This clock rate required a bus stall mecha- power supply, air cooling, and logical and electrical nism to accommodate current DRAIIaccess times CPU chip interfacing aspects of the CPU module and in the memory subsystem, which will enable future system bus designs evolved from the DECchip 21064 enhancements as access times are reduced. specifications. System design attention focused on The system bus clocks are distributed as positive powering ancl cooling the Cr-'rJ chip. Compliance emitter-coupled level (PECL) differential signals; with power and cooling specifications was deter- four single-phase clocks are available to each slot. mined to be achievable through conventional volt- Each module receives, terminates, and capacitively age regulation and decoupling teclino1og)r and couples the clock signals into noninverting and conventional fan technology. inverting PECL-to-CMOSlevel converters to provide To adtlress system integrity and reliability four edges per 25-ns clock cycle. System bus hand- requirements, all data transfer interconnects and shake and data transfers occur from clock edge to

Vol. 4 No. 4 Special Issue 1992 Digital Technical Journrrl Fi'qure 3 DEC 4000 AXP Sj~stenzBus Alpha AXP Architecture and Systems

clock edge ancl utilize one of two system bus After examination of several I/O buses that satis- clocks. A custom clock chip was implemented to fied these criteria, the Futi~rebus+was selected. At provide process, voltage, temperature, and load the time of our investigation, however, the (PVTL) reg~llationto the pair of application-specific Futurebus+ specification was in development by integrated circuit (ASIC) chips that compose each the IEEE and a wide range of interest was evident BKl. The clock chip achieves module-to-module throughout Lhe industry By providing the right sup- skews of less than 1 ns. port to the Futurebus+ committee, Digital was in a Our search for a clock repeater chip that coulcl position to help stabilize and bring the specifica- minimize module-to-module skew and chip-to- tion to con-2lctjon. chip skew on a module, and yet directly drive high A DigiLal team represented the project's interests fan-out ASIC chips with CMOS-level clocks, led us on the IEEE P896.2 Spccification Committee and to Digital's Semiconcluctor Operations Group. Such proposecl standards as the DEC 4000 AXP system a chip was in design; however, it was tailored design evolvecl. This team achiewd its goal by help- for use at the DEC 6000 system bus frequency. ing the IEEE Committee define a profile that The Semiconductor Operations Group agreed to enabled the Futurebus+ to operate as a high-perfor- change the chip to accommoclate the DE(: 4000 MI' mance I/O expansion bus. To mitigate schedule system bus frequency. impact due to instability of the Futurebus+ specifi- cations, the I/O module's Futurebus+ interface was I/O Bus Technology architected to accommodate changes through a more discrete, rather than a highly integrated Because of technology obsolescelice, 1/0 buses implementation Compliance with thr I;uturebus+ have a 21-year life cycle divideel into 3 phases. specifications influenced most mechanical aspects During the first 7 years of acceptance, peripherals of the module compartment design, as is eviclent and applications are developed anel supportecl. from the centerplane, card cage, rnodu le construc- Sustained acceptance takes holcl in the next 7 years tion and size, and power supply voltage specifica- as peripherals and applications are enhanced. In tions and implementations. the last 7 years, a phase out or rnigmtion of periph- erals ant1 applications occurs. For the DEC 4000 AXP systems, our first priority was selection of an open Module Technology expansion I/O bus in the first third of its life cycle. 1Module technology was selected to maximize sig- In addition, we wanted to select an open IEEE stan- lyal density w-ithin the fewest layers with minimal dard bus that woulcl attract third-party developers crosstalk and to provide a uniform signal distribu- to provide I/<) solutions to customers. The follow- tion impedance for any module layer. Physical-to- ing prioritized criteria were established for the electrical modcling tools were used to create SPICE selection of a new 1/0 bus: models of connectors, chip packages, power planes, signal lincs of various lengths and 1. Open bus that is an accepted industry standard impedances (based 011 the nlodule construction in the beginning third of its life cycle technology), ant1 multiplc signal lines. Uecause the placement of components affects signal perfor- 2. Compatibility with Alpha hXP architecture mance anel quality and system performance (e.g., in 3. Minimum data rate of 100MB/s the sccond-lev(:' processor cache), moclule floor plans and trial layouts wcrr. cornpletecl. A moclule 4. Scalable features that are perfor~nance-exten- layout tool was used to ensure procli~cibilitycom- bible through architecture (e.g., bus width), pliance wit11 manufacn~ringstandarcls as well as sig- and/or thro~~ghtechnology improvements nal routing constraints. The moclule layout process (e.g., semiconductor device performance and was iterative. As sections of the module routing integration) were completed. SPICE moctels of the etch were extracteel. These ext~-acteclmoclcls were connectecl 5. Nlinimum 64-bit data path to SPIG models of chip elrivers and run. Analysis 6. Support of bridges to other I/O buses was completed and reqi~ireclchanges were imple- mented and a~lalyzedagain. The process continuecl 7. Minim;~l interoperability problenls between until the optimal specification conformance was devices from different venclors achieved for all signals.

88 Vnt. 4 >Vo.4 Spect'ul Issue I992 Dtgital Techrrical Journal Desiqlz and Perfor~nnnceof the DEC 4000 AXP Deprrt~nentalSeri~er Co~nputing Sjntenzs

Motlule size was estimated basecl on system h~nc- submicron CMOS or Hi(:i\.IOS technolog)! All the I/O tionnlity requirements and a stutly of tlie size and and power subsystem controller chips such as the power recli~irementsof that fi~nctionality.To simpli- SCSI ancl DSSl controllers, Ethernet controllers, fy the enclos~~redesign, module size specifications serial line interfaces, and analog-to-digital convert- are consistent with the Futurebus+ motlule specifi- ers were implemented in CMOS technology. cations. To achieve lower system costs, the proces- Speed or high drive is critical in radial clock dis- sor, memory, and I/o modules arc basecl on the tribution, Futurebus+ interfacing, or memory mod- same ten-layer controlled impedance construction. ule address and control signal fan-out. In these Chip engineers avoitled the specification of fine- special cases, lOOK ECL operated in positive motle pitch surface-mount chips when possible. Compo- (PECL) or BII'OLAR technology was ernployeel. nent choices and module layouts were completed with ;I view toward ma~iufactur;tbility.Cost 2111alysis Syst~n.Bus Protocol and Technology showetl th:~tmixetl, tlouble-sitletl surface-mount components and through-hole components had The cache coherence protocol for the shared-mem- insignificant aclded cost when fusecl tin-lead mod- ory system bus is based on a scheme in which each ule technology ancl wet-film solder-mask technol- cache that has a copy of the data from memory also ogy were i~setl.The required layer constri~ctionand has a copy of the information about it. All cache impetlances of 45, 70, and 100 ohms coulcl easily be controllers monitor or snoop on the bus to deter- achieved within cost goals through this technology. mine whether or not they have a copy of tlie shared Solder-mask over bare copper technology was also block. Hence tlie system bus protocol is referred to ev;~lu;~tetlto determine if fine-pitch surfi~ce-mount as a snooping protocol, and the system bus is components achieved higher yield througli the sol- referred to as a snooping bus:' der reflow process. This evaluation showed fused The 128-bit-wide synchronous system bus pro- tin-le;~tltechnology was better suited, based on vides a write update 5-state snooping protocol for defect densities, for the manufacturing process. write-back cache-coherent 32-byte block read and Consequently, all DEC 4000 UPmodules are imple- write transactions to system memory address space. mented with fused tin-lead module technology ;~nd Each module uses a 192-pin signal connector-the wet-film solder-mask technology same connector used by Futurebus+ modules. Each module interfaces between the system bus and its back port with two 299-pin PGA packages contain- Semiconductor Technology ing CpIOS ASIC chips, which implement the bus pro- As ;I result of a performance, cost, power, ancl mod- tocol. A total of 157 signals ant1 35 reference ule real est;ite study, CMOS technology was used connections implement the system bus in the 192- extensively. The custo~ii-tlesigneclI'VTI. clock chips pin connector (6 interrupt and error, 8 clock and were developed in 1.0-micrometer (:MOS technol- initialization, 128 command and address or data, 4 ogy to supply CMOS-level signals for driving directly parity, 11 protocol). All control/status registers into the HllJ chips. Each module's RIlJ used the same (CSRs) are visible from the bus to simplify the data 0.8-micrometer ASIC technology ant1 die size to paths as well as to support SNIP. closely manage clock skews. Each system bus mod- To sinipldy the snooping protocol, only full ~~le's1%11l is implemented by two iclentical chips block transactions are supported; masking or sub- operatetl in an even ant1 an odd slice rnotle. Chip block transactions occur in each module's BIU. clesignrrs invented a rnetliotl for accepting 5.0-V Transactions are described from the perspectives sign;ils to be driven into their 3.3-V biased IIECchip of a commandel; a responder, and a bystander. The 21064 (:1'IJ. Consequently, the selection ;~ndirnple- address space is partitioned into CSR space that can- ment;~tionof 5.0-V ASIC: technology were easier. not be cached, memory space that can be cached, ASI(: vendor selection was based on (1) perfor- and secondary VO space for the Futurebus+ and I/O mance of trial tlesigns and timing analysis of parity module devices. Seconclary 1/0 space is accessible and El)<: trees, (2) SPICE analysis of 1/0 drivers with through an 1/0 module mailbox transaction, which direct-drive input clock cells, and (3) a layout abil- pends or retries the system bus when access to very ity to si~pportwide clock trunks ant1 distributed slow I/O controller registers conflicts with direct clock buffering to effect low skews. memory access (Div1.A) traffic. 'This software- All memory chips on the CPr! moclule, memory assistecl procedure also provides masked byte read module, and I/() module were implemented in and write access to VO devices as well as a standard

Uigitrrl 7tcbnicnlJorrrnal ).%I. 4 IVO. 4 .Speciul Isare 1992 89 Alpha kW Architecture and Systems

software interface. The use of 32-bit peripheral cipates in the coherency policy by signs~lingslis~red I>I&~ devices :~voiclecltlie need to imple~nentharcl- status to a CPU reacl of :I block it lias bufksecl. The ware acltlress tr;insl;itors. Tlie software drivers pro- five states of the cache coherence protocol ;ire vide physical addresses; hence mapping registers given in Table 2. are not necess:lr): The cache coherence protocol etisures that only The I/O mod~lledrives two device-related inter- one CPU rnoclule can return a dirty response. The rupt signals that are received by both CPI. modules dirty response obligates the responding <:I'li mod- clue to SMI' requirements. One interrupt is associ- ule to supply the read clata to the bus, since the ;ited with tlie Fut~~rebus+,;lnd the other is associated memory copy is stale and the memory controller wit11 :ill tlie device control.lers local to the I/O mocl- aborts the return of tlie re:icl data 1311s writes ;ilw;lys ule. Tlie 1/0 niodule 1,rovicles a silo register of clear the dirty bit of the seferencecl c:~clieblock in Futurebus+ interrupt pointers and a device request both the comrnantler niodule ;~ncltlie module that register of local clevice interrupt requests. CPU 1 or takes the update. CPU 2 is the clesignated interrupt dispatcher mod- A CPlJ hiu two options when a bus transaction is ule. I'rivileged architecture library software sub- a write and the block is found to be valid in its routines, Icnown ;IS I'Al.cocle, run 01-1 the prin1a1-y cache. A CPLJ either invalitlates tlie block or accepts CPIJ module ancl re;id tlie device interrupt register the blocl< and upd:ites its copy, keeping tlie block or Futurebus+ interrupt register to cletermine valicl. This decision is basecl on the state of tlie pri- which local devices or which Futurebus+ device mary cache's duplicate tag slore iuid tlie state of the handlers are to be dispatched. second-level cache tag store. Acceptance oC the The enclosure, power, and cooling subsystems transactiol~into the second-level cache on a tag ;Ire capable of interrupting both processors when immecliate attention js required. A CP1J can obtain information fro111 s~~bsystenissliow~i in Figure 2 Table 2 Five States of the Cache through tlie serial control bus. The serial control Coherence Protocol bus enables highly reli;~ble conirnunications State Remarks between field replaceable subsystems. [luring -- - power-up, it is used to obtain configuration infor- 1 NOT VALlD Block is invalid. mation. It is also used as an errol--logging channel 2 VALlD Valid for read or write, this and as a me;lns to communicate between the CPlJ NOT SHARED cached block contains the only subsystem, power subsystem, and the OCP. Tlie NOT DIRTY copy of the block; the copy is identical to the memory copy. nonvolatile RAhl (NVKAM) chip iniplementecl 011 each module :lllovled the firni\vare to use software 3 VALlD Valid for read or write, this NOT SHARED cached block contains the switches to configure the system. Tlie software DIRTY only cached copy of the block. switches avoidecl the need for hardware switches Thecachedcopyhasbeen and jumpers, fielcl replaceable unit identification modified more recently than tags. and handwritten error logs. As ;i result, the the memory copy. h;irtlware system is fully configured tllrough 4 VALlD Block is valid for read or write, but a write must broadcast to firmw;lre, ;ititl fault information travels with tlie SHARED NOT DIRTY the bus. This block may be in fielcl repl:ice;lble unit. another cache, but the memory The five-state c;~cliccoherence protocol assumes copy is identical. that tlie processor's primary write-through cache is 5 VALlD Block is valid for read or write, maintained as ;I subset of the seconcl-levcl write- SHARED but a write must broadcast to back cache. The slr: on the CPrI module enforces Dl RTY the bus. This block may be in this subset policy to simplify the simulation verifi- another cache, but the contents have been modified more cation process. Wjtl-~outit, tlie number of verifica- recently than the memory copy. tion cases woulcl h;ivc been escessive, difficult This is a transitional state that to exlxess, :~ntlclifficult to simi~l:~teanel check for occurs when arbitrating for the correctness. Tlie l/O module implements an invali- bus to broadcast a write or date-on-write policy, such that a block it has read when an unshared dirty block is returned to a bus read from memory will be invalicl;lted and then re-read transaction. if a <:PI1 writes to the block. The I/O module parti- Design and Perfirrnai?ce of the DEC 4000 AXP Depnrt7neiztal Seriier Corlz,!!t~tingSystenzs m;ltcli is c;tlled conditional update. When the com- caches to a commander module. The controller mander is the I/O module, the write is accepted by a then signals the start of the adclress and commantl CPtr only if the block is valid. Depending on the cycle 0 (CA). The commander drives a valid address, state of the primary data cache tluplicate tag store, command, and parity (CAI)) in cycle I. A comman- two types of hit responses can be sent to an 1/0 der may stall in cycle 2 before supplying write data commander-I/O update always ancl VO conditional (W)in cycles 2 and 3. update. In the case of either I/O or CPU commander Read data (RD) is received in cycles 5 and 6. The writes, if the valid block is in the primary data addressed responder confirms the clata cycles by c:tclie, the block is invalidated. The two acceptance asserting the acknowledge signal two cycles later. modes of I/() writes by a CPIi are programmable The commancler checks for the acknowledgment because ilccepting writes uses approximately 50 and, regardless of the presence or absence, com- percent more second-level cache bandwidth than pletes the number of cycles specified for the trans- invalidating writes. action. Snooping protocol results are made To implement the cache coherence protocol, the available half way through cycle 3. As shown in <:l'tJmotlulc's second-level cache stores informa- Figure 5, the protocol timing from valid address to tion as shown in Figure 4 for each 32-byte cache response of two cyclcs is critical. A responder or block. bystander may stall any transaction in cycle 4 by Figure 5 shows the cycle timing and transaction asserting a stall signal in cycle 3. The bus stalls ;IS sequences of the system bus. Write transactions long as the stall signal is assertecl. Arbitration is occur in six clock cycles. Read, null, ant1 exch;u~ige overlappecl with the last cycle of a transaction, such transactions occur in seven clock cycles. A null that tristate conflict is avoiclcd. tr;tnsaction enables a commander to nullify the A 29-bit lock address register provides a lock active transaction request or to acquire the bus and mechanism per cache block to assist with software avoid resource contention, without motltying synchronization operations. The lock address regis- memory. The arbitration controller monitors the ter is managetl by each CP1J as it executes load from bits transaction type and follows the transactions, memory to register locket1 longword or quadwortl cycle by cycle, to know when to rearbitr~teand sig- (LDx-L) and store register to memory conditional nal a new atltlress and command cycle. Atlditional longword or quadwortl (STx-C) instruction^.^ The cycles can be added by stalling in cycle 2 or cycle 4. lock adtlress register holds an address and a valid Transactions begin when the arbitration controller bit, which are compared with all bus transaction grants the use of the CPrJ motlule's second-level addresses. The valid bit is cleared by bus writes to a

TAG Tp V S D Cp LWO CKO LWI CK1 LW2 CK2 LW3 CK3 LW4 CK4 LW5 CK5 LW6 CK6 LW7 CK7

TAG conslsts of 9 physical address blts with a 4MB second-level cache, or 11 physical address blts with a 1MB second-level cache.

TAG PARITY (TP) bit indicates even parity

VALID (V) b~tlndlcates whether or not th~sblock can be considered for a response lo the snoop transaction

SHARED (S) b11indicales whether or not this block may also be resident in another module's cache.

DIRTY (D) bit indicates whether or not this block has been modified by thls processor

CONTROL PARITY (CP) bit indicates even parity.

DATA (LW) b~tsorganized as two 128-b~t-widehalf blocks; each 128-bit block is composed of tour longwords.

CHECK (CKO through CK7) bits detect errors for each longword.

Figur'e 4 Second-level Cache Structure

Digital Tecbnicnl Jotrnrc11 W)l. 4 ~VO.4 S/)c.cial lss~lr19% Alpha AXP Architecture and Systems

WRITE CYCLE 6 0 1 2 3 4 5 0 I

ARBITRATE GRANT GRANT COMMAND C A CA ADDRESS CAD CAD ACKNOWLEDGE CA WDI WD2 SHARED CAD DATA WDI WD2

READ, NULL, EXCHANGE CYCLE 5 0 1 2 3 4 5 6 0

ARBITRATE GRANT GRANT COMMAND CA CA ADDRESS CAD CAD ACKNOWLEDGE CA WD1 WD2 SHAREDJDIRTY CAD DATA WDl WD2 RDI RD2

KEY:

CA COMMAND CAD ADDRESS WD WRITE DATA RD READ DATA

Figure 5 System Bus Tlwizs~ictio~zSequetzces

~llatcbingaddress or by CI-'lJ execution of STx-C 1. DE<:chip 21064 processor instructions. The register is loaded and validated by 2. lMB or 4bIB physically addressed write-back a CPll's LDx-L instruction. This hardware and soft- secontl-level cache ware construct, as ;i means of memory synchroniza- tion, statistically avoids the known problellls with 3. BIlJ chips containing write merge buffers, a exclusionary locking schemes. Exclusionar~7lock- duplicate tag store of the processor's 8-kilobyte ing schemes create resource deatllocks, transaction (KR) data cache for invalidate filtering and write ordering issues, and performance degraclation ;is update policy decisions, an arbitration con- side effects of the exclusion. This construct allows troller, a system bus interface, an address lock a processor to continue program execution while register, and CsRs harclware provitles the branch conditions. The lock 4. System bus and processor clock genera- f;tils only when it loses the race on a write collision tors, clock and voltage detectors, and clock to the locked block. tlistributors A bus transaction address that hits on a valid lock address register must return a snooping protocol 5. System bus reset control shared response, even if the block is not valid in the 6. 8KR serial KOkI for power-up software loading primary anti seconcl-level caches. The shared of the processor response forces writes to the block to be broadc:tst, and STx-C instructions to function correctly. The 7 Microcontroller (MC) with serial system bus NULL transaction is issued when a STx-C write is interface allcl serial line unit for communication abortetl clue to the failure of the Jock to avoici with the processor's serial line interface system memory modification. 8. i\WhM. chip on the serial control bus

CPU Module Sz~bsystems Sincc a (:PU nlodule bas to operate in either CPU 1 Each <:PU modiile consists of a number of subsys- or CPlJ 2 rnocle, the CPU 2 connector was designed telns as shown in Figure 3. The CPU module's sub- to provide an identification code that enables or dis- systems are ables the clock drivers and configures the CSRs'

Vo1. 4 iVo. 4 .Specic~lIss~te /992 Digitul TechnicalJounzul Design and Perfir.r?~unce(Jlhe 111: atldress space and CPU identification code. As a or write accesses to complete in four internal clock result, arbitration and other slot-depentlent func- cycles, calletl the four-tick loop timing of the sec- tions are enabled or disabled when power is applied. ond-level cache. To realize hoth optimizations, tlie A re1i;ibility sti~dyof a parity-protected second- CPlJ's synchronous interface is supportecl by an lcvel cache showed that the SWMs contributed 44.7 asynchronous interkrce in tlie DllJ. \fiarious timing percent of tlie failure rate. By implementing El)(: on relationships between tlie processor and the the data S~W&Iportion of the secontl-le\iel cache, a system bus are control lecl by programmable timing tenfold improvement in per processor me;ln time to controls in the IlllJ chips. failure was achieved. Consequently, six Slli\M chips To achieve the tight, four-tick timing of the sec- per processor were implemented to ensure high ond-level cache, clouble-sided surface-nlount tech- reliabilit)! nologjrm7asused to place the SIL4M chips physically The niultiplcxecl interface to tlie second-level close together. This minimized address wire length cache of the CPU module allows tlie processor chip and the number of tiiodule vias; hence tlie driver ant1 tlie system bus equal ant1 sh;~retlaccess to the was loaded effectively This c;~refulplacement was second-level cache. To achieve low-latency memory combined with the tlesign of a custom CMOS access, both the microprocessor and the system address fan-out buffer ;Inel ~iii~ltiplexerchip (<;I\H) bus operate the second-level caclie ;is fast ;is pos- to achieve hst propagation delays. The CAB chip sible based on their clocks. Hence the seconcl- was implementeel in the same <:MOS process ;IS tlie level cache is multiplexed, and ownership deh~ults DECchip 21064 (:I)[! to obt;lin the desired through- to the microprocessor. When the system bus put clelay. Combiliecl with 12-11s SWMchips, tlie <:At% requires access, ownership is transferred cluickly chip enabled optimiz;~tionol'the (;rrl's second-level with tlat;i Slii\l\l parallelism while the t;ig SlbiMs are caclie timing 21s well as the system bus snooping monitored. protocol response timing. Many of the CPLJ module subsyste~nsare found in the interkice gate array called tlie <;A chip. Two of I/O Module, Mass Storage, and these chips working in tandem implement the Expansion I/O Subsystems and tlie secontl-level caclie controller. Write merge The I/O moclule co~isistsof a local I/O subsystem buffers combine masked write data from the micro- that interfaces to the common I/() core ancl a bridge processor with the cache block as p;irt of :11i allo- to the Futi~rebus+for I/() options. By incorporating cate-on-write policy. Since the microprocessor has modularity into the tlesign, a bro;itl range of appli- write buffers that perform packing, fill1 block write cations could be supported. To satisfy the tlisk per- arouncl the second-level cache w;rs implementetl ;is formance and bulk storage ~iietricsgiven in Table 1, an exception to the allocate-on-nrrite policy. To mass storage was configured based on applications meet schedule atid cost goals with few personnel, requirements. Elst access times of 3.5-inch tlisks one complex gate array was designecl r;ither t11;ln and multiple spindles were selectetl for applica- sever;~llower-complexity gate arrays. Hence the tions with results in QlO/s. Tlie density of 5.25-inch data path and the control functions were parti- disks was selected for ;~pplicationsrequiring more tioned such that the microprocessor could operate storage space. As intlic;~tetlin T~ble1. the metrics of as an even or odtl member of ;I pair on the <:IJ[I 1 or greater than 4,000 QlO/s tletermined the perfor- the <:lJU 2 niodule. malice requirrrnents of tlie storage colilpartrnent. Tlie system bus clock design is somewhat intle- Each of the four disk storage com],artn~entsin the pentlent of the processor clock, but the r;ilige is system enclosi~recan holcl a till I-size 5.25-inch tlisk restricted clue to the implementation of the snoop- if cost-effective bulk storage is needed. If the neetl ing protocol timing, the multiplexed us;Ige of tlie is for the mnximuni numI>er of I/Os per secontl, secontl-level caclie, ant1 the CPrJ interk~ceIi;~ntl- each compartment can holtl 1113 to four 3.5-inch shake and data timing. Therefore, the system bus tlisks in a mini array. cycle time is optimized to provide the rn;~xin~um Co~ifigurationsof 5.5-inch disks in a brick enable performance regardless of the processor speed. optimization of throughput tlirough parallelism Likewise, tlie processor's cycle time is optimized to techniques such ;IS stripe sets and redundant arr;q7 provide maximum performance reg;~rdlessof the of inexpensive tlisks (1b111)) sets. Tlie brick con- bus speed. Considerable eff'ort resultetl in a second- figuration also enables hult tolerance, at the level caclic ;~cccsstime that enablecl tlie (:IJ~I'sreat1 expense of tIiroi~glil>i~t,13)' 11si11g sliatlow sets. With

Digital Technical Jorirtral Vol. 4 iVo. 4 Specicrl lss~lcI991 9.5 Npha AXP Architecture and Systems

this technique, each storage compartment is inter- 5. Console serial line unit (SLIJ) interface faced to the system through a separate built-in con- troller. The controller is capable of running in 6. Auxiliary SLU interface with modem control supp01-t either DSSI mode for high availability storage in cluster connections with other OpenVMS UPor 7 Time-of-year(TOY) clock, with battery backup VMS systems, or in SCSI mode for local disk storage available from many different vendors. For applica- 8. 8KB of electrically erasable memory (EEKOM) tions in which a disk volume is striped across multi- for console firmware support ple drives that are in different storage cavities, the 9. Serial control bus controller and 2 kilobits of benefit from the parallel seek operations of the m11 drives combines with the parallel data transfers provided by the multiple bus interfaces. The main 10. 64-bit-wide Futurebus+ bridge memory capacity of the system allows for disk 11. BIU, containing four hexwords of cache block caching or RAM disks to be created, and the process- buffering, two mailbox registers, and the ing power of the system can be applied to managing system bus interface the multiple disk drives as a RAID array. With cur- rent technology, maximum fixed storage is 8GB The instability of the Futurebus+ specifications with 5.25-inch disks ancl 24G~with 3.5-inch disks. If and the use of new, poorly specified controller the built-in storage system is inadequate, connec- chips presented a high design risk for a highly inte- tion to an external solution can occur through the grated implementation. Therefore the Futurebus+ Futurebus+. bridge and local VO control logic were imple- Tlie BlU is implemented by two 299-pin ASIC mented in programn~able logic to isolate the chips. The bridge to the Futurebus+ and the inter- high risk design areas from the ASIC development face to the IocaJ I/O devices are provided with sepa- process. rate interfaces to the system bus. Each interface contains two buffers that can each contain a hex- Memory Subsystem word of data. This allows for double buffering of I/O writes to memory for both interfaces and for the As shown in Figure 3, up to four menlory modules prefetching of read data by which the bridge can reside on the system bus. This modularity of improves throughput. These buffers also serve to the memory subsystem enabled maximum configu- merge byte and longword write transaction data ration flexibility. Based on tlle metrics listed in into a fill1 block for transfer over the system bus. In Table 1, 2GB of memory were expected to satisfy this case, the write to main memory is preceded by most applications requirements. Given this 2GB a read operation to merge modified and unmodi- design goal, the available DRAM technology, and the fied bytes within the block. module size, the total memory size was configured The Ethernet controllers and SCSI and DSSI for various applications. controllers can handle block transfers for most The memory connectors provide a unique slot operations, thus avoiding unnecessary merge trans- identification code to each BIU, which is used to actions. As shown in Figure 3, the I/O module inte- configure the CSRs' address space based on the slot grates the following: position. Memory modules are synchronous to the system bus and provide high-bantlwidth, low- 1. Four storage controllers that support SCSI, latency dynamic storage. Each memory module high-speecl SCSI, or DSSI for fivecl disk drives uses 4-bit-wide, 1- and 4-megabit-deep DRAW tech- and one SCSI controller for removable media nology in various configurations to provide 64~~, elrives 128MB, 256~8,or 512MB of storage on each module. 2. 128K.B of SRAM for disk-controller-loadable To satisfy memory performance goals, each microcode scripts memory module is capable of operating alone or in one of numerous cache block interleaving configu- 3. Two Ethernet controllers and their station rations with other memory modules with a reacl- address ROMs, with switch-selectable stream capability. A performance study of stream ThinWire or thick-wire interfaces buffers revealed an increase in performance from 4. 512KB flash erase programmable ROM memory-resident read-stream buffers. The strean1 (FEPROM) for console firmwdre buffers allow each memory module to reduce the

94 Vo1. 4 No. 4 Special Issue I992 Digital TechnicalJournal IlesiLynarzd PerorA~?zanceof the DEC 4000 UPDel,artw?e?ltul Server. Co~~zp~lti~zg.Sj(j_~'sterns

average read latency of ;I <:I)( or I/O module. 'Thus Tlie DE(: 4000 AXP system is a portable unit that more banclwidth is us;lble on a congested bus provides rear access and simplified install;~tionand because the anticipated read data in a detected maintenance. Tlie systelii is mounted on casters access sequence is prefetched. The stream buffer and fits easily into an open office environnient. orefetch activity is st;rtistically determined by bus Modular design allowed con~pliance with stan- ;~ctivity. dards, ease of manufacturing, and easy fielcl servic- Overall memory bandwidth is also improveel ing. Constructed of molded plastics, the chi~ssis through exchange trvis;lctions, which use victim is sectioned into a card cage, :I storage compart- writes with repl;~cernent reacl parallelism. Intel- ment, a base containing four 6-inch variable-speed ligent memory refresh is scheduled based on bus Dc fans and casters, an air plenum ;uicl baffle assem- iictivity and anticipatecl opportunities. Write trans- bly, front ant1 rear doors, ancl side panels. The actions are buffered from the bus before being writ- mass storage compartment supports 111, to I6 ten into the DbWs to ;~voiclstalling the bus. fixed-storage clevices and 4 removable storage Data integrity, memory reliability, and system devices. Expansion to storage enclosures is sup- ;~vailabilityare enh;lnced by the EDC circuitry Each ported for applications that recjuire speci;~lizecl memory niotlule consists of 2 or 4 banks with 70 storage subsystems. I)RrW chips each. This enables 256 data bits ant1 24 Feedback from fielcl service engineers proriiptecl El><:hits to be accessetl at once to provide low 11sto omit uselcss light-emitting devices (LEIIS) in latency for the system bus. A cost-benefit ;inalysis each subsystem, since access to most electronics is sl?owetl a savings of IIRAM chips when ED<: is in~ple- from the rear. As a result, the OCP was m;~deconi- nientcd on each memory module. The processor's n~olito all sul>systems through the serial control 32-bit EDC requires 7 check bits as opposed to the bus and made visible inside the front door of the 128-bit ED<:, which reqilires 12 check bits zinc1 uses enclosure. It provides DC on/off, hz~lt.;lncl restart ['ewer chips per bank. 'T'lie selected EDC code also switches, and eight LEDs, which inclic:~tefaults of provides better error detection capabilit-y of 4-bit- CPU, I/O, memory, and Futurebus+ nlodules. The wide chips than the processor's 32-bit EDC. fault lights are controlled either by a microcon- To improve performance, separate EDC logic troller on either <:I'(J module or by ;In interhce on is implemented on tlie write path and read path the 1/0 module. of each memory nlodule's RIU. The EDC logic Futurebus+ slot spacing was provided by the lEEE performs detection anel correction of all single- specification. The system bus slot spacing for each bit errors and most 2-bit, .)-bit, and 4-bit errors in module was cleterniined by h~nctionalrequire- tlie D&\1 array Tlie El>(:'s generate function can ments. For example, the CPU moclule requires 300 detect certain types of addressing failures associ- linear feet of air flow across the DE<:chip 21064 ated with the DRAM row and column address bits, microprocessor's 3-inch square heat sink, as seen in along with the bank's select address bits. Failures Figure 1, to ensure the 25-watt chip could be associated with these ;~dtlressingfields can be cooletl reliably. Since 4000 systems provide this detected, thus improving data integrity. Software same air flow across modules, cooling was not a errors can be scrubbed from liiemory by the <:PU major design obst;~cle.The rnotlule compartment's when requested through use of PALcode subrou- Futurebus+, systerri bus, and power subsystems can tines, which use tlie LUX-L and STx-C synchroniza- be seen in the enclosure back view of Figure 6. tion construct without having to suspend system All electronics in the enclosure, as shown in operations. Figure 7, are air cooled by four 6-inch fans in the base. Air is drawn into the enclosure grill at the top Enclosure and Paver Subsystems front, guided along a pleni~mancl baffle ;lssernbly The DEC 4000 AXP enclosure seen in Figure 1 is and down through the module compartment and c;~lledthe ~~640box and is 88.0 centimeters (cm) power supply comp;lrtrnent to the base. Air is also high, 50.6 cm wide, and 76.2 cm deep. It weighs 118 drawn tlirougli front door louvers and across the to 125 kilograms hlly configured. The enclosure is storage compartments and down to tlie base. designed to operate in an ol'fice environment from Electro~iicsconnected to the power subsystem 10 to 35 degrees Celsius. The power cord can con- monitor ambient and nodule compartment nect to a conventional will1 outlet which supplies exhaust teniperatures. Thus the fan speed c;ln be up to 20 amperes at either 120 V AC or 240 V AC. regulated based on temperatllre measurements,

Digilal Technical Joun1~111/01. 4 rVo. 4 Special 1.ssr1.e 1992 95 Alpha AXP Architecture and Systems

SYSTEM BUS are for fixecl storage bricks. A storage brick consists of a base plate and mounting hardware, disk drives, local disk converter (LDC). front bezel assembly, anel wiring harnesses. The LDC converts ;I tlis- tributed 48.0 V to 12.0-V and 5.0-V supplies :lnd a 5.0-V termination reference for the brick to ensure compliance with voltage regulation specific;~tlons I and termir~ationvoltage levels of current ;ul~clfi~tilre disks. The 20-ampere power subsysteni can cleliver 1,400 watts of DC power divided across 2.1 V, 3.3 V, 5.0 V, 12.0 V, ant1 48.0 V. The enclosure can cool 1,500 watts of power and can be configured as a master or a slave of AC power application. IJse of ;I universal FEU eliminates the neecl for selecting operating voltages of 120 V or 240 V A<;. The FHr converts the input AC into 385 V DC, which is clis- tributed to provide 48 V DC to two step-down I)(:- to-I)C converters. which work in parallel to share the lo;~dcurrent. The combined 48 V I)(: outpi~t from these converters is deliverecl to the rest of the system. Control of tlistributed power electronics is diffi- cult anel expensive with dedicated electronics. A cost-effective alternative was use ol a one-chip <:l\lOS microcontroller, surrounded with an array of sensor inputs through CMOS analog-to-digital con- verters, to provide PSC intelligence. I>ecisiori-mak- POWER ing ;ubility in the power subs!~stern enabled compliance witli voltage-sequencing specific;~tions anel fail-safe operation of the system. l'he micro- Figure 6 DEC 4000 AXP Enclosure Rear View controller can control each LDC and communicate witli the C:Pl.I and OCP over the serial control bus. It monitors over and under voltage, temperature, and reducing acou5tic noise in an air-conditioned office energy storage conditions in tlie n~oditleand stor- environment. age compartments. It also reports status ancl failure The centerplane assembly consists of a storage information either to the CPU or to a elisplay on the backplane, a module backplane, and an electroniag- PS(: rnocl~~le,whicli is visible insicle the enclosure netic shielcl. This implementation avoids depen- back door. clence on cable assemblies, which are unre1i;tble and d~fficultto install and repair. Defined connec- tors and module sizes allowed the enclosure devel- Firmware opment to proceecl unencumbered by moclule The prim;lry goal of the console interface is to specification changes. The shielded module com- bootstrap the operating system tliro~lgha process partment provides effective attenuation of signals c;ulled boot-block booting. The console inter- UP to 5 gigahertz. There are six Futurebus+ slots, face runs a nlinimal I/O device h;lncller routine four memory slots, two CPU slots, one 1/0 slot, and (boot primitive) to read a boot block fro111a device four central power module slots, which include the that has descriptors. The descriptors point to the FEU, PSC, DC5, and DC3 units logical block numbers where the prinlar); boot- The storage compartment contains six cavities, strap program can be found, and the console as seen in the enclosure front vlew of F~gure8 interface loads it into system memory To accom- Two cavities are for removable media, and four plish this task, t11e firmware must configure at~cl

96 Vo1. 4 i\b. 4 Special Issue 1992 Digital Technical Jouniol Design and Performance of the DEC 4000AXP Departmental Server Computing Systems

OCP STORAGE BRICK CENTERPLANE STORAGE BRICK FBE

9 :;;$qm!p'.:- ,rq 0% '4~ . 4 .. ' y,t; .uT , - ,,!, 7, -*-a..*>.- -..- fit,? .

VTERM PSC MEMORY CPU

Figure 7 DEC 4000 AXP Modular Electronics

test the whole system to ensure the boot process console memory, and then transferring control to can complcte without failures. To minimize the the console code. bootstrap time, a fast menlory test executes in the The FEPIiOM firmware consists of halt dispatch, time necessary to test the largest memory module, entry/exit, diagnostics, system restart, system boot- regardless of the number of memory modules. If strap, and console services functional blocks. fa~luresare detected after configuration is com- PALcode subroutines prov~dea layer of software pleted, the firmware must provide a means for diag- with common interfaces to upper levels of sofware. nosis, error isolation, and error logging to facilitate PALcode serves as a bridge between the hardware the repair process. The DEC 4000 AXP system pro- behavior and service requirements and the require- vides a new console command interface as well as ments of the operating system. The system takes integrated diagnostic exercisers in the loadable advantage of I'ALcode for hardware-level interrupt firmware. handling and return, security, implementation of The firmware resides on two separate entities, a special operating system kernel procedures such as block of serial ROM on the CPU moclule and a block queue management, dispatching to the operating of FEPROM on the I/O module. The serial ROM con- system's special calls, exception handling, DECchip tains software that is automatically loaded into the 21064 virtual instruction cache management, processor on power-up or reset. This software is virtual memory management, and secondary I/o responsible for initial configuration of the CpU operations. Through a combination of hardware- module, testing minimal module h~nctionality,mi- and software-dependent I'ALcode subroutines, tializing enough memory for the console, copying OpenVMS AXP, DEC OSF/l AXP, and future operating the contents of the FEPROM into this initialized systems can execute on this hardware arcliitecture

Digital Tecbnical Journal Vo1. 4 No. 4 Special Issue 1992 97 Alpha AXP Architecture and Systems

Ackmnuledgments Development of a new system requires contribu- tions from individuals throughout the corporation. The authors wish to acknowledge those who con- tributed to the key aspects of the DEC 4000 AXP system. Centerplanes: Henry Entnan, Jim Padgett; CPU: Nitin Godiwala, George Harris, Jeff Metzger, Eugene Smith, Kurt Thaller; Firmware: Dave Baird, Harold Buckingham, Marco Ciaffi, John DeNisco. Charlie Devane, Paul LaRochelle, Keven Peterson; Futurebus Exerciser: Philippe Klein, Kevin Ludlam, Dave Maruska; Futurebus+: Barbara Archinger, Ernie Crocker, Jim Duval, Sam Duncan, Bill Samaras; 110: Randy Hinrichs, Tom Hunt, Sub Pal, Prasad Paranjape, Chet Pawlowski, Paul Rotker, Russ Weaver; Management: Jesse Lipcon, Gary I? Lidington; Manufacturing: Mary Doddj: Al Lewis, Allan LyaII, Cher Nicholas; Marketing: Kami Ajgaonkar, Charks Monk, Pam Reid; Mechanical: Jeff Lewis, Dave Moore, Bryan Porter, Dave Simms; Memory: Paul Goodwin, Don Smelser, Dave Tatosian; Operations: Jeff Kerrigan; Operating Systems: AJ Beaverson, Peter Smith; Power: John Ardunio, Robert White; Simulation: Paul Kinzelman; Systems: Vince Asbridge, Mike Collins, Dave Conro): A1 Deluca, Roger Gagne, Tom Orr; Eric Piip; Thermal: Steve Cieluch, Sharad Shah. Figure 8 DEC 4000 AXP System Enclosure Front Vieru Refmence and Note Performance Summary 1. IEEE Standard for FzttztreDus+-Physical Layer The DEC 4000 AXP Model 610 system's performance and Profile Spec~iccltionlEEE Standard P896.2- numbers as of November 10, 1992 are given in Table 1991 (New York: The Institute of Electrical and 3. Its performance will continue to improve. Electronics Engineers, April 24, 1992).

Summary 2. Supercomputer performance as defined by the composite theoretical performance (CTP) rating r)EC 4000 IU(P systems demonstrate the highest of 397, with the or:<:cliip 21064 operated at 6.25 performance and functionality for Digital's 4000 ns, as established by the U.S. export regulations. series of departmental server systems. Based on Digital's Alpha i\xP architecture and the IEEE's 3 Inter-l7ztegr~1lc.dCircuit Ser3ial Bus Specvi- Futurebus+ profile B standard, the systems provide catior~(ILC Bus Specification). (Sunnjwale, CA: symmetric multiprocessing performance for Signetics Company, 1988). OpenVMS AXP and DEC OSF/l AXP operating systems in an office environment. The L)EC 4000 AXP systems 4. J. Hennessy and D. I'atterson, compute^^ were designed to optimize the cost-performance Architecture: A Quantitative Approach (San ratio ant1 to include upgradability and expanclabil- iMateo, ct\: Morgan Kailfm;inn Publishers, Inc., ity. The systems combine Digital's CMOS technol- 1990): 467-474. ogy, I/O peripherals teclmology, ahigh-performance multiprocessing backplane interconnect, ant1 mod- 5. R. Sites, ed., Alpha AX'P Sjjstenz Reference ular system design to supply the most advanced Manual, Version 5.0 (Maynard: Digital filnctionality for performance-driven applications. Equipment Corporation, 1992).

98 Vol d iVo 4 Ypccmllsc~~c~I992 Digital Technical Jorrrrral Table 3 CPU Performance Summary for the DEC 4000 AXP System Futurebus+ Performance Latency Bandwidth Peak 16 bytes/l 00 ns 16OMBls Read 16 bytedl 82 ns 88M B/s Write 16 bytes1133 ns 120MB/s Local Bus Performance Latency Bandwidth Peak 4 bytes180 ns 50MBIs Read 4 bytes11 60 ns 25MBIs Write 4 bytedl 60 ns 25MB/s System Bus Performance Latency Bandwidth Peak 16 bytes125 ns 640MB/s Read 32 bytes/l 75 ns 182MB/s Write 32 bytedl50 ns 21 3MB/s Exchange 64 bytedl75 ns 365MBIs Internal Cache Miss, Second-level Cache Hit (Four-tick) Performance Latency Bandwidth Read 16 bytes/25 ns 640MB/s Write 16 bytes125 ns 640MBIs CPU Second-level Cache Miss Performance Latency Bandwidth Read 32 bytes/275 ns 11 6MB/s Write 32 bytes1200 ns 16OMBIs Exchange 64 bytes/275 ns 232MB/s - -- DEC 4000 Model 610 SPECmark89 and SPECthruput89* Estimated CPU Performance Summary Integer (INT) Benchmarks Ratio Ratio Ratio GCC 61.58 1@ 54.80 2@ 50.78 ESPRESSO 82.91 [email protected] 28 78.33 LI 93.05 [email protected] 2@ 92.18 EQNTOTT 103.46 1@ 100.76 2@ 97.94 Floating-point (FP) Benchmarks SPICE2G6 72.58 [email protected] DODUC 113.81 1@ 113.53 NASA7 229.27 1@ 221.56 MATRIX300 1019.1 7 1@ 963.81 FPPPP 180.32 1@ 177.89 TOMCATV 128.70 1@ 123.25 SPECmark > 136.23 SPECthruput > [email protected] SPECint > 83.73 SPECintthruput > [email protected] SPECfp > 188.45 SPECfpthruput > [email protected]

LINPACK - double precision 100 X 100 36.8 MFLOPS LINPACK - double precision 1000 X 1000 78.4 MFLOPS Dhrystone 165.0 MIPS

Note: 'Version 1.0 OpenVMS AXP operating system, 160-MHz clocked DECchip 21064 microprocessor, 1ME3 second-level cache. Notice the 1.9 scaling of the second CPU.

Digital Techrricwl Journal &)I. 4 :Yo. 4 SpeciolIssrrr 192 99 Brian R Allison Catharine van Ingen

Technical Description of the DEC 7000 and DEC 10000 AXP Fame@

The DEC 7000 ar7d DEC 100OOprod~rcts~11.0 rizicl-lwrlye nlzd nzairIfrnlrze Alphcr AXP sjb-tei~z($fk/'illgs ~;'OI?I Digitcll Ey~~il~iliei~tCorpoiriti011. These r?zucbir7es ~1le1.e designed to lneet the rieeds of large colnnzerrial al7d scientific c,yl,licutiolzs arzd ther.@re ore /?igl?$erfon.nc~i?ce,expattcl~rble sj~sterns that can be e~rsib~~~pg~wded. The DEC' 7000 ~17d10000 ~j~ster'lzszltilize t/7e IlI:Cc/?if)21064 17zicroprocessoioperat- illg at speeds LIP to -300 MHz. The high-speed chips, large c~~cbes,~~lultip~~oce.ssor sj~ste~izai.c/~itc~ctui.e, hiyl~$eifornza17c DLIC~~~LII~C ir?te~.cot~~~ect, a~?dIGII;~C IIIPIIIOI:~~ cclprpacit~lcolnbiile to create ~~znh~iz~~~~e-LkIssper~fi,~~~i~a~~cezllith a cost ri17d sizej)i.e- zliousl)~att~.iDuted to ~/zid-ralzgesjlsten~s.

The tlesign of the DE<:7000 and 10000 systems pro- I)E<:chip 21064 whereas the I)E<: 10000 uses a 200- vides a high-end pl;~thrmnntl system environment MHz I)E<:chip 21064. for rn~~l.tiplegenerations of iilpha AXP chips. This A \/er!, important goal h)r the project that encom- platform, combined with a multiprocessor archi- passed the clevclopment of the I>E<: 7000 and 10000 tecture, yields a multidimensional upgrade capabil- systems was to provitle a si~llil:irpair of systems ity that will allow the s)rstem to meet users' needs b;isecl on a VAX microproccssoc A Vi\X niicroproces- Cor several years. System i~pgratlecan take place by sor, called hWAS+, W;IS designed to be pin com- ;ttlding processors, replacing existing processors patible with the DECchip 21064 (the Alpha AXP with next-generation processors, or both. 'l'his microprocess~r).~+ The S!~S~CIII'~V;IS designed to be upgrade capability ensures stability to the system somewhat microprocesso~-independent, and both in terms of the physical and fiscal ;tspects of the entl VAX and Alpha AXP versions of the systems were user's computing erivironment. i~iiplernentetl.The VAX prodr~cts(VAX 7000 ancl VAS The DEC 7000 and LIE<: 10000 systems are 10000) were introduced in July 1992 and can be the logical follow-on products of the highly suc- upgraded to DEC 7000 ;uld I)E(: 10000 systems by a cessf~~lVAX 6000 himily 'The new systems are capa- simple swap of CPrr modules. ble of supporting either VAX processors or Alpli;~ ASP processors. The c;tpabilit)- to upgrade from System Architecture ;I VAX processor to ;rn Alpha AXI' processor with- The 1>E<: 7000 system consists of <:PIl(s), memory out ch;inges to the system is essential for niini- an I/() port controller, ancl I/() ;td;~pters,as shown in rnal disruption of large commercial applications. 1:igure 1. The system is configurecl in a variety of Most fe;~tures of the VAS 6000 systems have w:~ys,depending on the size and f~~nctionof the been carried forward to the 1)E(: 7000 and DE(: system. A system backplane consists of nine slots 10000 proclucts, ant1 any tleficiencies have been and houses CPlJs, memory, and an I/(> port con- corrected. troller. The I/O port controller resides in a fixed The DEC 7000 :1nc1 I)E(: I0000 products :KC slot, ;ind CPlrs and memories occupy the remaining derived from the same system tlesign. The DE<: eight slots. The initial system offerings allow up to 6 10000 is a more Si111y configured system and (:r'rts. (Architecturally, the s!.steni may support i~p inclutles ;ln 1z+1 ~~~iinterruptiblepower system. to 16 CPUs.) Up to I4 gig:tbytes ((;I3) of memory can atlditional 1/0 subsystems. ;mcl 1/0 expansion cabi- bc supported if onlj. 1 (;I>[: niodule is present and nets. 'l'he DEC 7000 uses n 182-megahertz (MHz) ;ill remaining slots contain memory.

100 Vi1. 4 No. 4 Speciallssue 1992 Digital Technical Journal TecLw~icu[Description of the DEC 7000 and DEC IOOOOAXP Fcrmil]~

systeni. Etch width is 5 mils with 7.5-mil minimum spacing. Via sizes down to 15 mils are used. A tiiis- ALPHA AXP OR VAX 64,128,256,512MB PROCESSOR(S) ture of physical coniponent technologies is used with all large vl.sI (very large-scale integration) + f parts in 100-~iiilI'<;A packages. Most standard logic SYSTEM BUS 640MBlS utilizes 50-mil surface-mount technology Moclule 4 interconnect to the backplane is made through a 340/420-connection, four-row, 100-mil-spaced pin I 110 PORT CONTROLLER I and socket type connector. Forty-eight-volt power is distributed throughout the system; local regula- tion is provided on tlie niodule for specific voltages XMl ADAPTER I10 ADAPTER I10 ADAPTER US+ k'V'77required. XMI FUTUREBUS+ System Interconnect Note: All four I10 ports are ~dentical.Any combination of XMI, The heart of tlie I>E<: 7000 systeni is a high-perfor- Futurebust, or "custom" inlerfaces may be configured. mance system interconnect, called the LSR, which allows communications between multiple proces- Figure 1 DEC 7000 nncl LIE(: 10000 sors, memory arrays, ant1 I/O subsystems. It pro- System Architecture vides a low-latency, high-l~~ndwitlthdata path among all components. A common shared view of The 1/0subsystem consists of ;In I/O port con- menlory is maintained by means of the systeni intcr- troller and four I/() ports which have been ;~d;~pted connect and cache logic on processor modules. to the X1\41 or the Futurebus+. Tlie I/O ports are Three types of niodules ;Ire defined for the LSD. generic ;~ntlmay be adapted to other forms of inter- Processor modules, which contain the CPli chip, connect in the fiuture. The system b;~ckplane. cache subsystem, ;~ndconsole functions. The ini- power system, and up to two 1/0 backplanes are tial DEC 7000 design has the capacity for a maxi- housed in the system cabinet. Adtlitional 1/0 bi~ck- mum of six processor modules. planes (up to ;I system total of h)ur) may be config- i~redin expansion cabinets. Memory moclules, wllicb contain dynamic raii- dorn-access memory (I)Ib\ii\iLI) chips and a mem- ory controller. A system can contain up to seven Technology memory niotlules, e;~cliwith a capacity of 64 The I)E<: 7000 system is built prim;lrily of C:MOS megabytes (MIj) to 2(;13. (coniple~nentarymetal-oxide semiconductor) coni- I/O interface motll~les,which provide access to ponents. The IIECchip 21064 microprocessor is I/O buses and I/O adapters. Only a single I/O port built using Digital's 0.75-micrometer <:MOS-~pro- controller niotlule may reside in the system. Tlie cess. AJI modules utilize LSI I.ogic l.<:r\lOOK series 1/0 port controller niotlule can arbitrate at a gate arrays for tlie system bus interface and for higher priorit! than <:l-'tJ nocles to improve I/O on-bo;~rcllogic fi~nctions.Tlie LSl Logic LCA100K direct memory ;lccess (IIMA) latency and provide features up to 235K two-input NAN> gates. MI atomic DMA writes of tl;lt;~less than ;I c;~che motlules use the same custom r/o tlriver circuit block in size. within their respective gate ;Irr:.tys to drive ant1 receive the system bus. A custo~ii419-pin pin gritl Tlie LSB is a liniitecl-length, non-pended, pipe- array (P<;A) package was tleveloped to house all bus lined, synchrotious, 128-bit-wide bus with clistrib- interface gAte arrays. Unlike the VAX 6000 series, a i~tedarbitration. All transactions occur in ;I set of common bus driver part is not used in ortler to min- fixed cycles rel;~tiveto ;in ;isbitration cycle. Up to imize tlie nunlber of levels of buffering in the three transactions can be in the pipeline at a given system. time. enabling the fill1 capability of the bus to be Moclule technology is standard 10-I,],' ver construc- realized. Arbitration occurs on a dedicated set of tion with 4 signal layers, 4 power I;lyers, and top control signals :~ntlmay he overl:~ppedwith tlata ;rncl I,ottom cap layers. Double-side, surfi~ce-mount transfer. Data ;inti :~tldressare n~ultiplexedon tlie construction is used extensively throughout the same set of signals. 'The bus protocol supports Alpha AXP Architecture and Systems

write-back caches, ancl all memory transfers are 64 All I5R transactiotis consist of a single commantl I~j~tesin length. The cycle time of the bus is 20 cycle ant1 four dnt;~cycles. These five cycles appear nanoseconcls (ns), providing an overall clnta rate of in fixed cycles relative to the arbitration c)lcles. Up 8001MB per second ancl a utilized system bandwitlth to three transactions may be pipelined, as shown in of 640~1~per second. Figure 2. The T.SR transmits 40-bit physical addresses, pro- The LSR uses a distributed arbitration scheme. viding a physical address space of 1 terabyte. Given Ten request wires are driven by the cpus or the I/o the current rate of DRAM technology evolution, the module that wishes to use the bus. Eight request LSIj will have a usef~illife of 8 to 10 years before lines are ;~lloc;itetlto tlie eight potential CPU ~iiocl- physical atldress space is exhausted. A 40-bit physi- ules. The remaining two request lilies arc i~setlby cal address was chosen to minimize the data path the I/O controller moclule. All modules indepen- width in the processor bus control gate array dently monitor the request wires to determine A non-pencled pipelined bus was chosen instead whether a transaction has been requested, and if so. of :I tr;iditional pended bus to allow for si~nplenode which module wins tlie right to sentl a cornm;lncl interface clesigns. Transactions start ;cncl finish at cycle to start the transaction. precisely defined times. A "stall" function may be The arbitration scheme employs a least-recently- used if a given transaction cannot be completed used rotating priority algorithm for CplJ modules within the system timing constraints. The "stall" ancl a fixed high/low scheme for the I/O port con- function freezes the bus pipeline, maintaining the troller. The I/() port controller arbitrates using the order of all transactions. Consequently, nodes can highest ant1 lowest priority levels, arbitrating high be clesignetl with no queuing between tlie bus six times then low two times. This arrangement interface and local storage (D~ilsfor main mem- ensures that the 1/0 port controller can i~tilize ory or static RANIS [SFGuVs] for cache memory). The greater than 50 percent of tlie available system bus maintenance of strict bus transaction ortlering also bandwidth while still ensuring the CPUs some allevi;ites many potential lockout conclitions expe- access to the system bus. The I/() port controller rienced on petided buses. also uses its uniqi~e;~rbitration scheme to ensure Digital's previous mainframe systems have usetl a atomic reatl/modify/write sequences on the bus switch-based system interconnect instead of a bus. necessary for performing writes of less than a full This interconnect was typic;~llyrequiretl because naturally aligned 64-byte quantity. Tlie I/O port these systems were based on emitter coupled logic controller does the rend at its next scheduled prior- (E<:L) with only a small, single-level cache suh- ity ant1 then imrnet1i:itely follows up with the write system; therefore, high bandwitltli was required at highest priorit)! This scheme ensures that 110 between main memory and the processor. The other node can access the data between tlie read <;MOS clesign of the DEC 7000 allows a large (4MR) and the write. seconcl-level cache to complement the 16-kilobyte A1 comm;incl/acldress anel control/status register (KB) on-chip cache. Tlie large amount of c;iche (CSR) cycles are protected with parity Data cycles minimizes the need for memory bandwidth. A to and from niemory are protected with error cor- bus-based design was chosen over a switch-based rection code (E(;(:). Transmit check is used by all design to minimize memory latency minimize modules to verifj that what a given module is clesign complexit): and reduce system cost. asserting on the bus is ;~ctuallybeing seen on the

41 I+ BUS CYCLE TlME = 20 NS ARBITRATE COMMAND CONFIRMATION SHARElDlRTY DATA I+ BUS ACCESS TlME = 340 NS +I

Bus Data Rate = 16 bytes per 20 ns = 800MBls Utilized Bus Bandwidth = 16 bytes per 20 ns x 4 data cycles per 5 bus cycles = 640MBIs

lo:! 161. 4 ,Vo. 4 .Sprci~~lI.ss~~c19'11 Digital TechtricalJozrrnnl Teclmical Description of the DEC 7000 and DEC 20000 AXP Family bus. Transmit check allows the detection of bus col- networks. Designers solvecl this problem by adopt- lisiolis and faulty bus drivers or receivers. ing a distributed termination scheme with bus ter- The system interconnect is physically imple- minator networks present on all modules in the mented as a centerplane which is 350 millimeters backplane. (rnrn) wick ancl 500 mm high. There are four mod- ule connections on one sicle, ancl five on the other. Processor Module The centerplane-moclule connection is imple- The primary purpose of the processor module is to mented using a four-row pin and socket connector provide a large second-level cache to the processor with connections on a 100-mil grid. Modules are chip and to act as an interface to the system bus and 410 mm high and 340 mrn deep. This module size memory for missed cache references. The proces- was chosen to allow the maximum module size sor module in the DEC 7000 system was designed to within the constraints of an 865-mm-deep cabinet use either VAX or Alpha AXP chips. As noted above, and of the centerplane technology. Moclules are a common design is used in the implementation of spaced 011 65-mm centers ant1 are contained within the VAX ant1 DEC 7000 and 10000 systems, with the a box that provides customized air flow for each dif- only significant clifferences being the processor ferent module design. chip and the console/diagnostic cocle. Figure 3 is a The DEC 7000 was tlcsigned with a centerplane block diagram of the processor module. interconnect to solve the problem of bus length The processor module provides ;I 4MB external and to meet the need for wide moclule spacing cache, which is shared by the processor chip and that allows for the anticipated heat-dissipation the bus interface chips. The cache is organized as a requirements of future processor chips. With a single set (direct mapped), with a block ancl fill size centerplane, the number of module slots available of 64 bytes. The external cache conforms to a write- for a given length of bus increases by (72.2)-1 back, conditional update, cache coherency proto- where n is the number of slots available in a con- col. The processor on-chip clata cache is a proper ventional backplane. t\ centerplane configuration subset of the external cache and uses a write- leaves little space on the backplane for termination through protocol.*4

- CONTROL DECCHIP + * BUS INTERFACE 21 064 GATE ARRAYS PROCESSOR ADDRESS * LSB ADDRESS *-

BACKMAP D-CACHE D-CACHE

B-CACHE 0-CACHE TAG, VALID DATA TAG AND STATUS I-CACHE DATA, ECC BUFFER

F646 DATA. ECC W-BUFFER TAG, VALID, SHARED, DIRTY VICTIM

ROM ROM I WATCH rttH UART A SYSTEM I I I BUS

Figz~re3 Block Dicigram of the DEC 7000 Processor Module

Digital Technical Joz~mnl ti,/. 4 .Vo. 4 Sl~ecictlIssue 192 103 Alpha AXP Architecture and Systems

The structure of the cache is shown in Figure 4. tion, the current cache line is stored in a local victim Each cache line consists of 512 bits of data (with 112 buffer while the new data is read. After the new data bits of ECC), 12 bits of tag (with 1 parity bit), and 3 has been placed in the cache, the oltl clata is written status bits (with 1 parity bit). The 12 bits of tag data back to memory :IS ;I background operation. applied to a ~MBcache size sets a processor pliysi- A cluplicate set of cache tags (backniaps) are kept cal address capability of 16~;~.(This is a processor by the bus interface logic for both the external limitation, ant1 fi~tureprocessors will address larger cache and the internal processor chip Ikache. memory sizes.) The control bits contain informa- These backmaps are accessed by the bus interface tion that allows the c;~cheand mernory systems to logic 011 all bus references to determine the action maintain coherency The control bits are tlefinetl :IS necessary to maintain cache/memory coherency follows: On bus read requests, the processor bus inter- face references its external cache backnlap and sup- A valitl bit, indicating whether or not this line plies data from the on-board cache if a "dirt)"' copy contains valid data of the d;lt;l is present. On bus writes, a check is per- A shared bit, indicating whether or not this line for~netlto see if the data is present in the processor niny also be resident in another procesbor's on-chip D-c;tche. If the data line is present, the cache in the system i~pdateddata is ;~ccepted.If the data line is not pre- sent but is instead in the external cache, the line is A clirty bit, indicating whether or not this line inv;llitlated. This cache update policy is an attempt has been written to by this processor to minimize false sharing of tlata by only updating Upon detection of a cache read miss in the pro- on references to a cache line in the processor on- cessor on-chip cache, the processor accesses the chip cache, which is small 2nd should contain only extern;~lcache tag to see if the given block is resi- freshly referenced data. dent. The processor chip contains the tag compari- False sharing of data is a problem common to tor ancl status logic to cletcrrnitie ;I "hit." If the block multiprocessor systems running fully symmetric is resident in the external cache, the processor then operating systems. When a process is migrated cycles the external cache d;tt;i store twice, each time fi-on1 one processor to nothe her, dirty data often reading in 128 bits of data and 28 bits of ECC for a remains in the cache of the previoi~sprocessor. total of 32 bytes (internal processor cache block When the new processor requests that data, it size is 32 bytes). The external cache cycles at a rate becomes "shared," resulting in the need to update five tjnles the processor chip clock period (and at all copies by means of bus transactions on all subse- two times the period for the VAX variant). Upon the quent modifications of the tl:it;~.Since the process detection of a "miss," the processor chip informs has migrated, there is no need to maintain the state the bus interface chips by means of hanc1sh;lke sig- of the clat;~it1 the cache of the previous processor; nals and waits until the miss is serviced on the ISH. tloing so slows down execution of the process due IJpon a data write by the processor, the clata is to the bus tr;~ns;~ctionsrecluired to update. The written through to the external cache. If the data write-upd;tte policy described in the previous para- is already resident in the cache. it is updated and graph provides a means to estimate if "sl~ared"data conditionally bro;tdcast onto the system bus if is still in use by the previous processor and pro- marked as sh;lred. If the selectecl cache line cont21ins vides ;I means to flush it from the previous cache if a different valid tag, the current (old) cache line is it has not been recently referenced. written to memory and replaced by the new tag ancl The external cache is 128 bits wide with long- data. To improve performance during this opera- word E<:<: protection. The E<:<: scheme i~sedto

P IV ISI D 112-BIT TAG I P LONGWORD 3 ECC LONGWORD 2 ECC LONGWORD 1 ECC LONGWORD 0 ECC ECC LONGWORD 5 LONGWORD 4 ECC L DIRTY LONGWORD 7 ECC LONGWORD 6 ECC SHARED LONGWORD 11 ECC LONGWORD 10 ECC LONGWORD 9 ECC LONGWORD 8 ECC VALID LONGWORD 15 ECC LONGWORD 14 ECC LONGWORD 13 ECC LONGWORD 12 ECC PARITY

X 64K CACHE ENTRIES

104 14)l. 4 I\?). 4 ,S/)CL.~CI/ ISS~I~199.2 Digital Technical Joui-nnl Technical Description of the LlEC 7000 and DEC 10000 AXP Fanzilj) protect the external cache is identical to that used Data wrapping is a method used to provide a on the LSB, which allows flow-through ECC. The lower latency return of the data required by a read processor chip checks and corrects data for all pro- commantl. The bus contains an extra address bit cessor refills. The bus interface chips perform that indicates in which half of a 64-byte block the lookaside ECC checking for fault isolation purposes requested data lies. The memory colitroller returns but do not perform ECC correction. the half block containing the target clata first, allow- The processor module also provides system con- ing Faster resumption of processing. Data wrapping sole functions. The moclule includes universal has no benefit on write transactions but is clone to asynchronous receiver/transrnitters (IJARTs) for sirnplrfy the design of the system. communication with the console terminal and DEC 7000 memory modules are protected with a power subsystems, a time-of-year clock, and 896~~quadword ECC algorithm. The chosen ECC: imple- of flash read-only memories (]

Digital Tecb~ricalJonrnal Vol. 4 iVo. 4 Special I.ss~lc,1992 105 Alpha AXP Architecture and Systems

182 TO 200 MHZ ALPHA AXP CPU CHIP 4MB CACHE

m -r

-& yg Y...

POWER SUPPLY SYSTEM BUS INTERFACE

Figure -5 Processor Moduk Major CornJ~onentsHigIdighted modules provide the interface between the high- controller cable(s) will function to a maximum speed parallel ports and specific standard 1/0 buses. cable length of 3 meters. Th~slength allows 110 To date, interfaces have been designed for the XMI, expansion cabinets to be placed on e~therside of which is used as the I/O bus on the VAX 6000 and the main system cabinet. VAX 9000 systems, ancl for the Futurebus+, which is The aggregate bandwidth of the I/O port con- an IEEE standard high-performance bus definition. troller is 256~~per second. Each parallel port 1s The I/O port controller and specific bus adapter capable of operating at a maximum of 135MB per architecture was adopted to allow a flexible bus second for data flowing from the I/O subsystem to strategy that can evolve over time, as well as to memory and at 88MB per second for data flowing accommodate the pl~ysicalseparation of processor from memoly to the 1/O subsystem. and I/o subsystems necessary in an expandable The I/O port controller module with its four system with multiple I/O channels The 1/0 port parallel ports is a st;~ndardpart of all DEC, 7000

106 Vol. 4 No. 4 .Ypecial lssrte 1992 Digital Technical Journnl systems and resides in a tletlicated system back- Tlie disk controller receives the m;iilbox data plane slot. Various systeni configurations are avail- ant1 then generiites ;In request to reatl its able that contain between one ancl flour SMI I/() command packet from niemor)I. buses. 'The Futurebus+ subsystems will be available 'The r/o port controller re:~tlsthe specified corn- when Futurebus+ components become :~v;tilablein nivitl packet from Inenlory 64 bytes at a time the computer industry. and sends it back to the disk controller 32 bytes The I/O port controller provides :I "mailbox'. ;it a time. interface between the processor ancl I/O tlevices. A processor instruction cannot tlirectly access a regis- Tlie disk controller decodes the command packet, ter in ;In I/(>tlevice, as was possible on previous Vt\X reads the requested data from tlisk, and starts in~plementations.To use the "nlailbox" interface, a writing to system memory in 32-byte segments. processor creates a work descriptor packet in meni- The I/<) port controller buffers tlie 32-byte ory and then issues a conimantl to tlie I/() port con- writes from tlie disk controller into 64-byte seg- troller to execute the cotiimancl. Comniand ments ;rntl writes the data to system memory. completion is asynchronous and the processor may 'The tlisk controller sig11:ils an interrupt o~ltlie choose to tlo otlier work while the conimancl is exe- >(IL~[to indic;~tethat tlie requested operation is cuted. The "mailbox" interface between proces- complete, wliicli is received 1))' the 1/0 port coli- sors and I/() devices was createtl to allow relatively troller. The I/() port controller signals an inter- slow I/O devices to interface to ;I high-speed, non- pended systeni bus. If a processor were allowetl to rupt to the <:PII. access the I/() device tlirectly, the system bus u~oi~ld Console and Diagnostics be stalletl for I;irge portions of time. Clearly the mailbox communic:~tionsmethotl is Like m:lny previous VtiS systems, the DEC 7000 more con1plic;rted than traditiolyal direct access. systeni employs ;in enibedtletl console. The console Fortunately the mailbox is usetl only when a pro- function is performed by code run on the proces- sors within the system r:ither tli;~nbjr a dedicatecl, cessor neecls to directly access an I/O device. The detached front-end processor. I/O device can clirectly access main memory (or Unlike the strategy for previous VA); systems, a possibly a Ct'U cache) with all necessary buffering done by tlie I/() port controller. Most modern higli- unifietl console and tliagnostic strategy w;is performance 1/0 adapters use high-level, packet- adopted for the DE(: 7000 and 10000, VAX 7000 ant1 based protocols, which require very little direct 10000, ;mtl IIEC 4000 systems. A single code base access of the r/o adapter by the processor. not only provides tlie basic console functions but also extends diagnostic support for manufacturing A typical <:P1J-initiatedI/<) trans;iction to an intel- and fieltl firmware upgrade support. This i~nifietl ligent tlisk controller on an XNII bus to re;~tlfrom the disk would have the following steps. strategy has reduced the total development effort and promoted :I common "look and feel" across the The CPU jAaces ;I disk controller conim;~nd different systems. packet requesting ;I disk read into system The console development also differed from that nlenior)r. of previous Vt\x systems. The priniary implemen- tation 1;lnguage was (1, \vith only various architec- The CHI sets up an I/(> mailbox structure with a ture-specific code in Alpha AXP (or VAX) assembly conimantl to inform the disk controller that I:~nguage.Tlie console :~ntlprocessor cliagnostic there is a command packet in memory, writes a code was simulated prior to the arrival of hardware. register in the I/O port controller to inform it This simulation greatly simplifietl early hardware that there is a mailbox transaction to complete, tlel~i~g;the console h;ld basic functio~lalityafter a and then spins on ;I clone bit in tlie mailbox single debi~gsession. structure. At power-up, each processor ~ctsindependently The I/O port controller fetches the mailbox to execute processor-specific diagnostics ;uitl con- strLlctilre from memory gener:!tes ;in SMI write sole initializ~tion.The processors then select a con- cornrn;~nclto the disk controller, and sets the sole primary, which then proceetls to test and done bit in the mailbox structure. The CI'U sees configure the memory atid I/O subsystems. Tlie the assertion of tlie done bit and goes on to other console primary also retains control of tlie console work, terminal line; console second;lries communicate

Digital TechrricnlJorrrrral kl..4 No. 4 .S/)eciollssrlc I992 107 Alpha AXP Architecture and Systems

with the primary through memory-resident mes- which houses up to two additional 1/0 subsystems sages. After initialization, diagnostic or other con- and additional disk arrdys. Further mass storage sole tasks can be assigned to any processor in the expansion is possible with Digital's standard line of configuration. One benefit of this arrangement is mass storage cabinets connected by CI, DSSI, or S1 that system tliagnostics and exercisers can be run in iilterconnects. parallel. The DEC 7000 cabinetry has been designed for Like previous DECsystem consoles (that is, sys- easy system upgrade and servicing. The system tems based on MIPS Co. chips). the DEC 7000 con- backplane assenlbly, power system, and 1/0 subsys- sole provides a set of services, or callbacks, to the tems are modular and easily replaced by field per- operating system. These services can be used to sonnel. The process of future upgrades can be control automatic bootstrapping across operating accomplished more quickly and reliably through system crashes as well as primitive l/O servicvs the use of modular subassemblies. used by the operating system during bootstrap ancl As shown in Figure 6, the DEC 7000 main system system crash. The latter simplifies the operating cabinet contains a central air mover with logic system device support by providing simple assemblies above and below it. The air mover is a read/write functions common to all clevices. single motor with a large molded vane assembly A feature of the power of the console is the field and can pull air througl~both the upper and the firmware update utility. Field upgrade of all system lower logic assemblies. An air flow of approxi- firmware (console and I/O adapters) is accorn- mately 900 cubic feet per minute with velocities plished by the DEC 7000 firmware update i~tility up to 1800 linear feet per minute is maintained (LVU). LFll is really a dedicated console image which through the upper logic assembly, which contains is tlistributecl on CDROM. The system console is the processor ant1 memory subsystems. Although used to boot LFU, which is then usetl to update all not necessary for the DECchip 21064, this large system firnlware. volume of air movement was designed into the machine to allow upgrades through several genera- System Packaging tions of processor chips. Hy using standard air-cool- The DEC 7000 system cabinet is 1700 mm high ing techniques and customized module "boxes" by 800 mm wide by 865 nim deep. The cabinet that optimize local air flow, it is possible to cool houses the system backplane, up to two T/O subsys- processor chips of up to 70 watts in the DEC 7000 tems, and disk arrays or batteries for the system bat- system cabinet. tery-bacicup function. Expansion is possible by Above the air mover are the system backplane using one or twI/<) expander cabinets, each of and the modular power subsystenl. Below the air

REMOVABLE MEDIA

CPU OR MEMORY MODULES

N + 1 POWER SYSTEM

110 PORT CONTROLLER MODULE

COOLING SYSTEM

110 SUBSYSTEM (XMI)

MASS STORAGE

Figure 6 DEC 7000 1Wui11 System Cuhitzet, hv~zl(Left) and Rear (Right) Vieztls

108 161. 4 No. 4 Specirrl Issue 19.92 Digital Technical Joul-nu1 ./'tion of the DEC 7000 ctnd DEC 10000 AXP Fninily mover are four modular spaces for I/O bus back- niaintainetl ody to systeln memory, the DEC 7000 planes, disk drives, or batteries. keeps the entire system powered, including in-cabi- I/O, disk, ant1 battery subsystems occupy varying net mass storage. Depending on the system config- amounts of the four modular spaces. The XMI sub- uration, power is maintained for a minimum of 20 system occupies two spaces and is oriented front to minutes in an n+l power configuration. N+l back because of its rear-exit cabling scheme. The power with fill1 battery backup is standard on all Futurebus+ subsystem occupies a single rear space. DEC 10000 systems. Disk subsystems consisting of up to six 5.25-inch The DEC 7000 system employs a highly intelligent (DSSI or SCSl [small computer system interface]) or power subsystem with microprocessors in all 48- fourteen 3.5-inch (3.31 only) drives may occupy any volt regulators, which report status to processor of the modular spaces. Batteries for the uninter- modules by means of a serial interconnect. System ruptible power system occupy two modular software can therefore monitor a wide range of spaces, which may be oriented either front to back power system operating parameters, including (for SI\/lI-based systems) or side to side (for voltage output, AC inpi~t,efficient): and battery Futurebus+ systems). charge state. In a large configuration with optional The expander cabinet is identical to the main expander cabinets, the expander cabinet power sys- system cabinet, with two exceptions: disks may be tems also communicate with the systenl processors packaged in the area occupietl by the system back- to provide system-wide power status. plane, and there is no control panel. Up to two XMn or Futurebus+ subsystems may be placed in an expander cabinet. The UEC: 7000 and DEC 10000 systems are the fastest uniprocessor and multiprocessor, microprocessor- Power Subsystem based computer systems in the worlcl as of their The power subsystem of the DEC 7000 family has introduction date (10 November 1992) and as a highly modular, hierarchical design. The basic definecl by SPEC89 and SPEC92 benchmark data. For power system provides 48-volt direct current (VDC) compute-intensive benchmarks, the I>EC 10000 is to all subassemblies which in turn further regulate approximately 10 percent faster than the DEC 7000, to necessary voltages. Each module in the system basetl entirely on the difference in processor clock backplane contains on-board regu1;ltion. This fea- speed. ture will allow the system to easily evolve wit11 The base performance of the DEC: 7000 ant1 DEC changing voltage requirements as <;MOS technology 10000 systems is determined by the speed of the moves to lower voltages to reduce power consump- processor chip and is heavily influencetl by cache, tion. Voltage toler;~ncescan be tightly controllecl memory, and 1/0 subsystems. The design goal for since transmission clrops are negated; a precise the DEC 7000 and DEC 10000 systems was to extract voltage level can be set at the time of module manu- the maximum possible performance from the facture. The voltage and tolerance to a high-per- DECchip 21064 by providing an electrical and physi- formance CMOS processor must be very tightly cal environment capable of supporting 200-MHz controllecl in order to extract maximu~nperfor- processor operation as well as large c;tclies, a mance. %MI, Puturebiis+, ant1 disk subsystems large and fast memory s~tbsystem,and m~~ltiylet/O all regulate the 48 VDC to lower voltages at a subsys- subsystems. tem-wide level, not at the module level. While h111 system-level performance data is still The 48-lTDC modular power system consists of being collected, the very high speecl processor per- one to three pasallel regulators, each of which pro- formance measurecl on the SPEC benchmarlcs com- duces 2400 watts of power. A maximally config bined with the very high performance cache, used cabinet needs 110 more than two power memory, and I/O subsystems of the DEC 7000 ant1 regulators. An additional regulator can be config- DEC 10000 systems sho~~ldyield very im],ressive ured into the system to provide an ?z+l capability overall systenl performance. See Table 1. for higher availabilitj~. The power system also includes a battery Design Process standby function that provides 48 VDC throughout The DEC 7000 system was specified, designecl, ant1 the system in the event of an AC power failure. tested by a group of approximately 200 people in Unlike earlier VAX systems in which power was Boxboro, Massachusetts. The system design teani

Digital TechnicalJournal 1/01. 4 No. 4 SpectalIss~te19% 109 Alpha AXP Architecture and Systems

Table 1 DEC 7000 and DEC 10000 System (primarily multiprocessor VAX 6000 Model 500 sys- Performance Measurements tems) and used over 325,000 hours of CPtJ time DEC 7000 DEC ,0000 for simulations.

- -- -- SPECmark89 167.4 184.1 Conclusion SPECint89 95.1 104.5 SPECfp89 244.2 268.6 Tlle DEC 7000 and DEC 10000 systems are the sec- SPECint92 96.9 106.5 oncl generation of highly configurable and expand- SPECfp92 182.1 200.4 able systems produced by Digital Equipment SPECthroughput89 Corporation. These are the first systems expressly (4 CPUs) 604.4 654.6 designed to accommodate multiple-processor archi- LINPACK double-precision tecture types. As computer technology moves for- 100x100 (MFLOPS) 38.6 42.5 ward at an ever-increasing pace, this type of clesign 1000x1000 (MFLOPS) 102.1 111.6 will be cletnandetl by computer users and will be necessary to manage engineering costs. was responsible for all aspects of the design except The DEC 7000 and DEC 10000 system platform the DECchip 21064 microprocessor. will accommodate new VAX and Alpha AXP proces- Conceptual work on a system to follow the Vi\x sors for several years. Over that time, this platform 6000 family was started in early 1989, although at will span a performance range of greater than 50:l. that time design work was focused on iniplernenta- It will provide computer users with a stable system tions using VAX and MIPS R4000 processors. I11 the environment tl~atshould help minimize the changes latter part of 1989, the tlecision was made to pursue caused by the continued development of new pro- the Alpha AXP strategy, and earlier concepts were cessor chil?s. While this level of flexibility incurs reworked to incorporate much higher levels of per- additional initial engineering ant1 product costs, it formance to accommodate the proposed Alpha does provide a very cost-effective way to cleal with AXP chip. the inexorable forward march of technology In October-December 1989, a core team of approximately 10 engineers was assembled to Ackninuledgments firmly define system architecture and to produce The following engineers formed the system archi- specifications for all subassemblies. By July 1990 all tecture team of the project that protluced the DEC specifications were complete, and implementatjon 7000 and DEC 10000 and VAX 7000 and VAX 10000 was started. The first processor module was pow- products: Frank Romba, Reinhard Schumann, Mike ered up in June 1991, followed by a full system Callander, Steve Polzin, Kathy Harrington, Dave power-up in September 1991. The VMS operating Mayo, Catharine van Ingen, Vicky Triolo, Bob system was booted on a DEC 7000 system on Dickson, Dave O'Keefe, Jim Leahy, Hansel Collins, September 9, 1991, ancl OSF was booted in Jim Stegeman, Darrel Donaldson, Dave I-Iartwell, November 1991. Charlie Barker, Mark Stefanski. Brian Allison. A minimal DEC 7000 system includes 430,000 Various parts of this text originated within engi- gates of logic contained in gate arrays, whereas a neering specifications written by this team. minimal VAX 6000 Motlel200 includes 94,000 gates. Despite more than four times the gate count, References the clesign portion of the DEC 7000 program was completed in approximately 9 months as com- 1. Digital Technical./o~trnal,vol. 2, no 2, featuring pared to 12 months for the VAX 6000 program. This papers on the VA;\; 6000 Model 400 (Spring 1990). reduction in design time was achievable in part 2. G. lJNer et a].,"The ~VAXand NVAX+ High-perfor- because of the maturing of the engineering pop- mance VAX Microprocessors," Digital Teclwzical ulation (many of the DEC 7000 engineers had Journal, vol. 4, no. 3 (Summer 1992): 11 -23. worked on various VAX 6000 implementations), 3. D. Dobberpuhl et al., "A 200-MHz 64-bit Dual- as well as advances in design tool technology alld issue CMOS Microprocessor," Digital Technical the availability of significantly more powerh~l Jounzal, vol. 4,no. 4 (1992, this issue): 35-50. computers for design simulation. At its peak, the DEC 7000 program was consuming 1500 VAX units 4. A.J. Smith, "Cache Memories," Coflzpc~ting of performance, or WPs, of compute power S~rrvq1s,vol. 14, no. 3 (September 1982).

110 W1. 4 A1o. 4 .Fpecial lsslre 1991 Digital Technical Journal Nancy l? Kronenberg Thomms R. Benson Wayne M. Cardora RavindranJagannuthan BenjaminJ Thomas 111

Porting OpenVMS from VAX to Alpha AXP

The Open ViVlS operating system, developed by Digital for the VAX famnilj~of comnput- ers, zuas recently moved from the VAX to the Alpha AXP architecture. The Alpha AXP architect~lreis n new RISC architect~~reintroduced In/ Digital in 1992. This paper describes solutions to severalproblerns in porting the operating systeml in addition to perjonnance benefits nzeaszlred on one of the systems that implements this new architecture.

The VAX architecture is an example of complex priority levels. The Alpha AXP architecture also instruction set coniputing (CISC), whereas the defines a privileged architecture library (PAL) envi- Alpha AXP architecture is basecl on reduced instruc- ronment, which runs with interrupts disabled and tion set conlputing (RISC). The two architectures in the most privileged of the four modes (kernel). are very different.' ClSC architectures have perfor- PALcode is a set of Alpha ASP instructions that exe- mance disadvantages as compared to RISC architec- cutes in the PAL environment, implementing such ture~.~Digital ported the OpenVblS system to the basic system software functions as translation Alpha architecture mainly to deliver tlie perfor- buffer (TB) miss service. On OpenVMS AXP systems, mance advantages of RISC to OpenVMS appli- PALcode also implements some OpenVMs VAX fea- cations. This paper focuses on how Digital's tures such as software interrupts and asynchronous Open\lMS AXP operating system group dealt with traps (ASTs). The combination of hardware archi- the large volume of VAX assembly language and tecture assists and OpenVMS AXP PALcode made it with system kernel modifications that had VAX easier to port the operating system to the Alpha architecture dependencies. AXP architecture. The OpenVMS AXP group had two impor- The VAX architectrlre is 32-bit; it has 32 bits tant requirements in addition to tlie primary goal of virtual address space, 16 32-bit registers, and a of increasing performance: first, to make it easy comprehensive set of byte, word (16-bit), and long- to move existing users and applications from word (32-bit) instructions. The Alpha ASP archi- OpellVMs VAX to OpenVlMS UPsystems; second, to tecture is 64-bit, with 64 bits of virtual address deliver a high-quality first version of the product space, 64-bit registers (32 integer and 32 floating- as early as possible. These requirements led us to point), and instructions that load, store, and oper- adopt a fairly straightforward porting strategy with ate on 64-bit quantities. There are also longword minimal redesigns or rewrites. We view the first load, store, and operate instructions, and a canoni- version of the OpenVMS i\XP product as a begin- cal sign-extended form for a longword in a 64-bit ning, with other evolutiona1-y steps to follow. register. The Alpha ASP architecture was designed for The OpenVMS AXP system has anticipated evolu- high performance but also with software migration tion from 32-bit address space size to 64-bit address from the \AX to the Alpha t\Xp architecture in mind. space by changing to a page table format that sup- Included in the Alpha AXP architecture are some ports large address space. However, the OpellVMS \'/I>; features that ease the migration without com- software assumes that an address is the same size as promising hardware performance. VAX features a longword integer. The same assumption can exist in the Alpha IU(P architecture that are important in applications. Therefore, the first version of the to OpenVMS system software are: four protec- OpenVMS iUCP system supports %-bit address space tion mocles, per-page protection, and 32 interrupt only.

Digital Techrtical Journal Vo/ 4 No 4 .T/~e~lallssne 1992 111 Alpha AXP Architecture and Systems

Most of the OpenV,\j1s kernel is in Vt\X assembly Other directives allow the user to provide addi- language (VAX MA<:It0-32). Instead of rewriting the tional information about register state and code VAX h,h\CR(>-32code in nothe her language, we devel- flow to improve generated code. Another class of opecl a compiler. In adtlition, we requiretl inspec- directives instructs the compiler to preserve the tion and ~nanual~nodification of the Vt\S MA(:KO-32 VAX behavior with respect to gran11l;lrityof mem- code to deal with certain VAX' architecttlr;il tlepen- ory writes or atomicity of memory updates. The dencies. Parts of the ker~lelthat dependetl heavily Alpha t\XP architecture makes atomic updates and on the VAX ;~rcliitccturewere rewritten, but this gu;~ranteedwrite granularity sufficie~ltlycostly to was a small percentage of the total volume of \lilX performance that they shou Id be enabled only MACRO-32 source cocle. when required. These concepts are discusset1 in the section Major Architecti~r:ilDifferences in the Compiling VAX MACRO-32 Codefor the OpenV31S Kernel. Alpha AXP Architecture Simply statetl, tlie VAX >IACIlA<:llO-32as a source language to be conlpiled ;ant1 creates native OpenVMS AXP object files just ;is As mentioned earlier, the compiler must either ;I FORTRAN compiler might. This task is f;lr more transparently support VAX architecture-tlepe~itlent cornplex than a simple instruction-by-instr~~ctio~~ constructs or infor111the user that ;I source change translation because of fi~ndamentaldifferences in is necess;rrv. A good example of the latter case is the architectures, :~nclbecause source cocle fre- reliance on VAS )SH/RS13 (jump to subroutine and cluently contains assuml,tions about the \/AX archi- return) instruction return atltlress semantics. On tecture ancl the OpenV>lS Calling Stantlartl on VAX VAX systems, :I JSB instruction leaves the return systen1s.5.~The compiler must either transp;u-ently acldress on top of the stack, which is used by the convert these Vr\X depcntlencies to rlieir OpenVMS HS13 instruction to retur~i.~System subroutines AXP counterpans or inform the user that the source often take ;~tlvant;~geof this semantic in order to cvtle has to be cli;~~igetl. change the rettrrn adtlress. This level of stack con- trol is not ;~v;rilablein a compiletl language. In Source Code Annotation porting the OpenVMS system to the Alpha LYP We extentled the Vt\X MACRO-32 source I;~ngu;igeto architecture, we developed altern;~tivecotling prac- inclucle entr!.-point tleclarations and other tlirec- tices for this and many other notltr;rnsportal~le rives for the compiler's use, which provjcle more itlioms. information about the intentled behavior of the pro- The co~npilermust also account for the dif- gram. Entry-point decl;~rationswere introtlucecl to ferences in tlie Openvh4S Calling St:~ndartlon the allow declaration of: (1) tlie register semantics for VAS and Alpha AX]' architectures. Although trans- ;I routine when the tlel;~ultsare not approlxi;~te;lnd parent to high-level language progr;lniniers, these (2) the speci;~lizedsemantics of frameless subrou- differences are very significant in assembly lan- tines and exception routines to be dec1;irecl. guage progra~i~n~ing.~To operate correctly in a The differing register size between the VAX and mixed language environment, the code generated the Alpha ,\XP arcliitect~~resinfluencetl tlie clesign by the \(AX hlA<:RO-32 co~iipilermust conform to of tlie compiler. On tlie VAX, BWCRO-32 operates on the <)penVMS (blling Standard on the Alpha LYP 32-bit registers, :mtl in general, the compiletl code architecture. ~iiaintains32-bit sign-estendetl values in Alpha AXP On WX systems, a routine refers to its argurnents 64-bit registers. However, this code is now part by means of a11 argument pointer (AP) register, of a system th;~tuses true 64-bit values. As a result, which points to an argument list that was built in we designed the compiler to generate 64-bit regis- nien1or)l by tlie soutitie's caller. 011 Alpha AXP sys- ter saves of ;illy registers modified in a routine, tems, up to six routine arguments are passed to the :IS part of the "routine prologue cotle." Autoni;~tic c;~lletl routine in registers: any ;~dtlitionalargii- register preservation 11;sproven to be the safest ments ;ire p;~ssetlin stack locatiorls. Normally, the default but must be overriclden for routines that VAS Mt\(;RO-32 co~iipiler transparentl). converts return valires in registel-s, using appropri;~teentry- M-basecl references to their correct Alph:~AXP loca- point dec1ar:itions. tions ant1 converts the cotle that builtis the list to

112 Vol. 4 hb.4 Speclnl Issne 192 Digital Technical Jotrmnl Porting OpenVMSfrom VXX to Alpha AXP initialize the arguments correctly. In some cases, removal, and many local code generation optimiza- the compiler cannot convert all references to their tions for particular operand types. Peephole opti- new locations, so an emulated VAX argument list mization of local code sequences and low-level must be constructed from the arguments received instruction scheduling are performed by the back in the registers. This so-calletl "homing" of the argu- end of the compiler. ment list is required if the routine contains indexed In some instances, the programmer has knowl- references into the argument list or stores or p?.,sses . edge of register state or other code behavior that the address of an argument list element to another cannot be inferred from the source code alone. routine. Compiler directives are provided for specifying reg- The compiler identifies these coding practices ister alignment state, structure base address align- during its process of flow analysis, wliich is similar ment, and likely flow paths at branch points. to the analysis done by a standard high-level lan- Certain types of optimization typically per- guage optimizing compiler. The compiler builds a formed by a high-level language compiler cannot be flow graph for each routine and tracks stack depth, performed by the VAX WRO-32 compiler, because register use, condition code use, and loop depth assumptions made by the MACRO-32 programmer through all paths in the routine flow. This same cannot be broken. For example, the order of mem- information is requiretl for generating correct and ory reads may not be changed. efficient code. Major Architectural DtHerences Access to Alpha AXP in the OpenVMS Kernel Instructions and Registers This section concentrates on architectural changes In addition to providing migration of existing VA>; that affect synchronization, memory management, cocle. the VAX ,MACRO-32 compiler includes support and I/O. These are not the only architectural differ- for a subset of Alpha AXP instructions and PALcotle ences that cause significant changes in the kernel. calls and access to the 16 integer registers beyond However, they are the major differences and are those that map to the VAX register set. The instruc- representative of the effort involved in porting the tions supported either have no direct counterpart OpenVMS system to the Alpha AXP architecture. in the VAX architecture or are required for efficient For the most part, it was possible to isolate archi- operation on a full 64-bit register value. These tecture-dependent changes to a few major sub- "built-ins" were required because the OpenviMs systems. However, differences in the memory AXP system uses full 64-bit values for some opera- reference architecture had a pervasive impact on tions, such as manipul;~tionof 64-bit page table users of shared data and many common synchro- entries (PTEs). nization techniques. Other differences such as those related to memory management, context Optimization switching, or access to I/O devices were confined The compiler includes certain optimizations that mostly to the relevant subsystems. are particularly important for the Alpha AXP archi- tecture. For example, on a VAX system, a reference Effects of Architectural Dzffeel-ences to an external symbol would not be considered in Memory Subsystems expensive. On an Npha AX'P system, however, such The following differences between the VAX ancl a reference requlres a load from the linkage section Alpha AXP memory reference architectures affected to obtain the address of the symbol prior to loading synchronization:',3 the symbol's value (The linkage section is a data Load/store architecture rather than atomic mod- region used for resolving external references ') Ify instructions Multiple loads of this address from the linkage section may be reduced to a single load, or the Longword and quadword writes with no byte load may be moved out of a loop to reduce memory write instructions references. Read/write ordering not guaranteed Other optimizations include the elimination of memory reads on multiple safe references, regis- Load/store memory reference instructions are ter state tracking for optimal register-based mem- characteristic of most RlSC designs. However, the ory references, redun~lant register save/restore other differences are less typical. The reasons for all

Digital Technical Jorrrnnl Vvl. 4 No. 4 Special Issue 1992 113 Alpha AXP Architecture and Systems three rlifferences were hiirdware simplification and appropriate for several reasons. First, the problem opportilnities for increased hardware perfor- may be independent data items that happen to mance.' These differences result in significant share a longword. Synchronizing this sort of access changes in system software ant1 in many opportitni- in unrelated code paths is prone to error. Explicit ties for subtle errors, which can be detected only syncl1roniz;ition ma)! also have an unaccept;~ble under heavy loacl. Adapting to these architecti~ral performance impact. Finally, deadlocks are a possi- changes without greatly inilxrcting performance bility if one thread interri~pts nothe her. was one of the major c11;tllenges that faced the Ensuring that data items are in aligned long\vords group in porting the OpenVMS system to the Alpha or qu;itlwords both improves performance ant1 AXP architecture. eliminates interactions between unrelated data. A load/store architecture si~chas Alpha I-IS~'can- This technique is used wherever possible but can- not provide the atomic read-motlify-write instruc- not be used in two major cases. One case occurs tions present in the vt\X architecture. Tllus, when the replication fictor is too l;irge. Exp;incling instruction sequences are necessary for many oper- an array of thousantls of bytes to longwortls may ations performed by a single, atomic VAX instruc- simply not be acceptable. The other major problem tion, such as incrementing a memory location. The case is data structures that cannot be changed consequence is 21 need for increased awareness of because of external constr;lints. For example, tliita synchronization. Insteatl of depending on hartl- may represent a protocol message or a structure ware to prevent interference between multiple primarily resitlent on disk. Separate internnl and threads of execution on a single processor, explicit extem;il fonlis of such data strilctilres could exist, softw;u-e synchronization may be reqi~iretl. but the performance cost of continuous conver- Without this synchronization, the interleaving of sions may not be acceptable. indepentlent load-modify-store sequences to a sin- Often the easiest and highest-performance WAY gle tnemory location may result in overwritten to solve synchronization problems is to use stores and other incorrect results. sequences of load locked and store conditional The lack of byte writes i~iiposesaddition;ll syn- instructions. 'The load locl

114 Vol. 4 ,Ib,4 .Spc.cinl [safe 1992 Digilnl Tecb~ricnlJourrrnl Porting 0penVMSfrom VAX to Alpha AXP

processor will not necessarily observe memory space as closely ;IS possible. This decision recjuirecl operations in the ordcr in which they were issuecl creating a 2-gigtbyte (GU) process private address by another processor. Thus, many obvious synchro- region (i.e., VAX PO and PI) ant1 a 2(;B shared nization protocols will not work when writes to acldress region (i.e., VAX SO and S1) for each pro- the synchronization variable and to the data being cess. This is easily accomplished by giving each protected are observed to occur out of order. process a private level 1 page table (Ll m) that con- A memory barrier instruction is provided in the tains two entries for level 2 page tables (L2t'Ts). architecture to ensure ordering. However, the nega- One of these L2Ws is shared and implements tlie tive impact of this instruction on system perfor- sharecl system region, whereas the other is private mance requires that it be used only when and implements tlie process private address necessary. regions. Although the smallest allowed page size of As described in the previous section, we used 8 kilobytes (KB) results in an 8GR region for each various mechanisms to solve the synchronization level 2 page table, we use only 2GB of each region problems. Having multiple solutions allowecl us to to keep within our ~GB32-bit limit. As shown choose the one with the least performance impact in Figure 1, the L2PTs are chosen to place the for each case. In some cases, explicit synchroniza- base address of the shared system region at tion was already in place due to multiprocessor syn- FFFFFFFF80000000 (hexadecimal), the same as the chronization requirements. In other cases, we sign-extencled address of the top half of the \'AS expancled dat;~structures at a cost of modest arcliitecture's 32-bit address space. amounts of memory to avoid expensive synchro- Although changes were extensive in the memory nization when referencing clata. management subsystem, many were not conceptii- ally difficult. Once we dealt with tlie new page Privileged Arcbitectzlre Changes table structure, most changes were merely for alter- Unlike tlie pervasive architectural changes native page sizes, new page table entry formats, and describecl in the previous section, the privileged changes to associated data structures. We tlitl architecture clifferences had a more limited impact. clecide to keep the OpenVMS VAX concept of map- 'The primary remaining areas of change are the ping process page tables as a single array in shared new page table formats and the details of process system space for our initial implementation. contest switching. This section describes mem- Although not viable in the long term due to the ory managenlent as a representative example. potentially huge size of sparse process page tables, However, many limited changes still amount to this decision minimized changes to code that refer- modifying virtually every source niodule in the ences process page tables. Having process page OpenVMs kernel, even if only to add compiler tables visible in shared system space turned out to directives. be a significant complication in paging and in address space creation, but the schedule benefits derived from avoiding changes to other subs)~stcms Melnory Management Not surprisingly the mem- were considered worthwhile. We expect to ch~nge ory man;igement subsystem required the most to :I more general mechanism of self-mapping pro- change when moving from tlie VAX to the Alpha cess page tables in process space for a subsequent ASP architecture. Aside from the obvious 64-bit OpenVMS AXP release. addressing capability, two significant differences Retaining the VAX appearance of process page exist between the page table structures on the VAX tables allowed us to meet our goals of minimum and the Alpha AXP architectures. First, Alpha ASP change outside of tlie memory management subs)ls- does not have an architectural division between tenl. Unprivileged code is unaffected by the niem- shared ancl process private address space. Seconcl, ory management changes unless it is sensitive to the the Alpha AXP three-level page table structure new page size. Even privileged code is generally shown in Figure 1 allows the sharing of arbitrary unaffected unless it has knowledge of the length or subtrees of the page table structure and the effi- format of PTEs. cient creation of large, sparse address spaces. In ziddition, future Alpha AXP processors may have Opl.i?r?ixcdT~*acllzslatio~z Buffer Use Thus fiir, we larger page sizes. may have given the impression that architectural To meet our schedule goals, we decided initially changes always create problems for software. This to emulate the VAX architecture's 32-bit address was not universally true; some changes offered us

Digital Tcchrrical Jo~rrrrnl Vol 4 /\To. 4 Jpecrrrl lssrre 199.2 115 Alpha AXP Architecture and Systems

LEVEL 3 CODE OR PAGE TABLES DATA PAGES

PO SPACE LEVEL2 flyn+VIRTUALADDRESS 0 PAGE TABLES PAGE TABLE PROCESS- I BASE PRIVATE REGISTER m PI SPACE +SOME P1 SPACE VIRTUAL ADDRESS LEVEL 1 UNUSED I PAGETABLE (Ll PT)

SYSTEM SPACE L3PT LlPTE - +VIRTUAL ADDRESS SHARED L3PTE FFFFFFFF80000000 L2PT

4 UNUSED

L2PTE - SYSTEM '0 SPACE L3PT +SOME SYSTEM SPACE VIRTUAL ADDRESS L3PTE --n Figure I OpenVMS AXP Page Table Structure opportunities for significant gains. One such example, there is read-only code, read-only data, change was an Alpha AXP translation buffer feature and read-write data. Normally, a kernel image is called granularity hints. TBs are key to performance loaded virtually contiguously and relocated so that on any virtual memory system. Without them, it it can execute at any address. To take advantage of would be necessary to reference main memory granularity hints, kernel code and data are loaded in page tables to translate every virtual address to pieces and relocated to execute from discontigu- a physical address. However, there never seems to ous regions of memory. Only a very few TB entries be enough TB entries. The Alpha AXP architecture are actually used to map the entire nonpaged ker- allows a single TB entry to optionally map a virtu- nel, and the goal of near-zero TB misses was ally and physically contiguous block of properly reached. aligned pages, all with identical protection The benefits of granularity hints became immedi- attributes. These pages are marked for the hard- ately obvious, and the mechanism has since been ware by a flag in the PTE. expanded. Now, the OpenVMS AXP system also uses Given granularity hints, near-zero TB miss rates the code region for user images and libraries. This for the kernel became attainable. To this end, we feature extends the benefits not only to images sup- e~~hancedthe kernel loading mechanisms to create plied by the OpenVMs system, but to customer and use granularity hint regions. applications and layered products as well. Of The OpenVMS AXP kernel is made up of many course, usage of this feature is only reasonable for separate images, each of which contains several images and libraries used so frequently that the regions of memory with varying protections. For permanent allocation of physical memory is

116 Vol. 4 No. 4 Specinllssue 1992 Digital Technical Journal Porting OpenVMSfrom VXX to Alpha AXP warranted. However, most applications are likely to Red/Write Ordering An I/O device is simply have such images, and the performance advantage another processor, and the sharing of data is a can be impressive. multiprocessing issue. Since the Alpha AXP archi- tecture does not provide strict read/write ordering, I/O a number of rules must be followed to prevent incorrect behavior. One of the easiest changes is to Unlike the architectural changes, the new I/O archi- tecture structures an area that previously was use the memory barrier instructions to force order- rather uncontrolled. The project goal was to allow ing. Driver code was modified to insert memory more flexibility in defining hardware 1/0 systems barriers where appropriate. while presenting software with a consistent inter- The devices and adapters must also follow these face. These seem like contradictory aims, but both rules and enforce proper ordering in their interac- must be met to build a range of competitive, high- tions with the host. An example is the requirement performance hardware that can be supported with that an interrupt also act like a memory barrier in a reasonable software effort. providing ordering. In addition, the device must The Alpha AXP architecture presents a number of ensure proper ordering for access to shared data differences and challenges that impacted the and direct memory access. OpenVMS AXP I/O system. These changes spanned areas from longword granularity to device control Kernel Processes Another important way in and status register (CSR) access to how adapters which the Alpha AXP architecture differs from the may be built. VAX architecture is the lack of an interrupt stack. On VAX systems, the interrupt stack is a separate stack for system context. With the new Alpha AXP CSR Access A fundamental element of I/O is the design, any system code must use the kernel stack access of CSRs. On VAX systems, CSR access is accomplished as simply another memory reference of the current process. Therefore, a process kernel stack must be large enough for the process and for that is subject to a few restrictions. Alpha AXP sys- tems present a variety of CSR access models. any nested system activity. This burden is unreason- Early in the project, the concept of I/O mailboxes able. A second problem is that the VAX V0 sub- system depends on absolute stack control to was developed as an alternative CSR access model. The I/O mailbox is basically an aligned piece of implement threads. As a result, most of the I/O code is in MACRO-32, which is a compiled language on the memory that describes the intended CSR access. Instead of referencing CSRs by means of instruc- OpenvMS AXP system that does not provide abso- lute stack control. tions, an I/O mailbox is constructed and used as a command packet to an 1/0 processor. The mail- These facts resulted in the creation of a kernel box solves three problems: the mailbox allows threading package for system code at elevated inter- access to an I/O address space larger than the rupt priority levels. This package, called kernel pro- address space of the system; byte and word refer- cesses, provides a set of routines that support a ences may be specified; and the system bus is sirn- private stack for any given thread of execution. The plified by not having to accommodate CSR routines include support for starting, terminating, references that may stall the bus. As systems get suspending, and resuming a thread of execution. faster, these bus stal Is are increasingly larger imped- The private stack is managed and preserved iments to performance. across the suspension with no special measures on Mailboxes are the I/O access mechanism on the part of the execution thread. Removing require- some, but not all, systems. To preserve investment ments for absolute stack control will facilitate the introduction of high-level languages into the I/O in driver software, the OpenVMS UPoperating system implemented a number of routines that system. allow all drivers to be coded as if CsRs were accessed by a mailbox. Systems that do not support Performance mailbox I/O have routines that emulate the access. As stated earlier, the main purpose of the project These routines provide insulation from hardware was to deliver the performance advantages of RISC implementation details at the cost of a slight perfor- to OpenVMS applications. We adopted several mance impact. Drivers may be written once and methods, including simulation, trace analysis, and a used on a number of differing systems. variety of measurements, to track and improve

Digital Technical Jorrrrrnl Vol. 4 No. 4 Special ISSLL~1992 117 Alpha AXP Architecture and Systems operating system and application performance. 35 This section presents data on the performance of OpenVMS services and on the SPEC Release 1 bench- 30 mark suite.5 Note that all Alpha LYP results are preliminar)! +m25 (I) 20 Performance of OpenVMS Services 0 To evaluate the performance of the OpenvlLls l5 system, we used a set of tests that measure the CPU 5 processing time of a range of OpenVMS services. 10 These tests are neither exhaustive nor representa- tive of any particular workload. We we relative CPU 5 speed (it., VAX CPU time divided by Alpha UPCPU

~p ~ - time) as a metric to truly conlpare CPU perfor- O 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0>2.0 mance. For I/O-related services, a RAM disk was RELATIVE CPU SPEED used to eliminate I/O latencies. Notes: The tests were run on a Vi\X system and an Alpha 1. The relative CPU speed equals the CPU time on a VAX system divided by the CPU time on an Alpha AXP system. LYP system that are the same except for the CPrJ 2. A relative CPU speed greater than 1.0 implies that the Alpha AXP Table 1 shows the configuration details of the two system is faster. systems. Figure 2 shows the distribution of the rela- 3. The total number of tests is 198. tive CPU speed for the OpenVMS services measured. Most tests ran between 0.9 and 1.7 times faster on Figure 2 Distribution of Relative CPUSpeed the Alpha AXP system than on the VAX system. Table for OpenVMS Services 2 contains the results for a representative subset of the measured OpenvlLls services. their time performing application-specific work and a small fraction of their time using operating Application Performance system services. For these applications, perfor- Applications vary in their use of operating system mance tlepends mainly on the performance of services. Most applications spend the ~najorityof hardware, compilers, and run-time libraries. We

Table 1 Configuration Details for OpenVMS Services Test Environment VAX System Alpha AXP System

Model number VAX 7000 Model 61 0 DEC 7000 Model 61 0 Clock rate 91 MHz 182 MHz Memory size On-chip cache size 1KB virtual I-cache 8KB physical I-cache 8KB physical I- and 8KB physical D-cache D-caches Backup cache size 4MB I- and D-caches 4MB I- and D-caches Translation buffer 96 entries 12 ITB entries 32 DTB entries Page size 51 2 bytes Number of registers 16 32-bit GPRs 32 64-bit integer 32 64-bit floating-point OpenVMS version

Key: I Instruction D Data ITB Instruction translation buffer DTB Data translation buffer GPR General-purpose register

118 Vol. 4 No. 4 Special Iss~ce1992 Digital Tecb$ricalJourrral Pol-ting OpenVMSfro~nVAX to Alphcr AXP

Table 2 Relative CPU Speed for a Subset deferred from the first version of the OpenVMS AX]' of OpenVMS System Services system. Beyond this, we anticipate taking signifi- and Primitives cant steps to better exploit the hardware architec- OpenVMS System Service Relative ture, including evolving to a true 64-bit operating or Primitive CPU Speed system in a staged fashion. Also, detailed ;~n;~lysisof performance results shows the ~ieeclto alter inter- Memory Management Services Create virtual address space nal designs to better match KISC architecture. Delete virtual address space Finally, a graclual replacement of Vr\S NLlCRO-32 Expand address region source with a high-level 1;ingu;lge is essential to sup- Page fault without I10 port a 64-bit virtual address space and is an impor- (soft page fault) tant element for increasing performance. Logical Name Services The OpenVMS AXP system clearly tlemonstrates Translate a logical name the viability of making dramatic changes in tlic Event Flag Services fundamental assumptions of a m;lture opernt- Set an event flag Clear an event flag ing system while preserving the investment in software Iayeretl on the system. Thc future Process Control Services Create a process and challenge is to continue operating system evolu- activate an image tion in order to provide more capabilities to appli- File System Services cations while maintaining that essential level of (File on a RAM Disk) compatibility. File open File close File create Acknowledgments File delete The work described in this paper was done by Record Management System (RMS) members of the Open\JMS AXP operating system Services (File on a RAM Disk) group. This work would have been impossible witli- Get record from a sequential file Put record into a sequential file out the help of many software and hardware engi- neering groups at Digital. Thanks to Bradley Note that the relative CPU speed equals the CPU time on a VAX Waters, who measured OpenVMS performance, ant1 system divided by the CPU time on an Alpha AXP system. A relative CPU speed greater than 1.0 implies that the Alpha AXP to John Shakshober ancl Sandeep Dcshmukh, who system is faster. obtained the SPEC Release 1 benchmark results. We also thank Barbara A. Heath and Kathleen D. Morse for their comments, which lielpetl in preparing this paper. used the SI'E(: Release 1 benchm;~rksas representa- tive of such applications. Table 3 shows the details of the VAX ;llid Alpha ASP systems on which the References SPEC Release I suite was run, ;inti Table 4 contains 1. R. Sites, "Alpha ASP Architecture," I)i~itrrlTech- the results. "l'he Sl-'E(:rnark89 comparison shows nicfll Jotirncll, vol. 4, 11~). 4 (1992, this issue): that the OpenV1US AXI' system outperforms the 19-34. OpenViLIs M\X system by ;I blctor of 3.59. The perform;~nceof OpenVNlS services and the 2. D. Bhandarkar and D. Clark, "Performance from SPECmark results ;Ire consistent with other stiidies Architecture: Comparing a NS<;ancl a CISC with of how operating system primitives and SPECmark Similar Hardware Organization," Proceedings results scale between ClSC and RISC6 Overall, the the Fourth International Corzferelzce orz Arcbi- results are very encour;~gingfor a first-version tectzlre Support Jor Propmmitzg Larz~rtrct~,es product in which redesigns were purposely limited and Operating Systems (ASI-?(,OS-IV) (New York, to meet an aggressive sclietli~le. NY: The Association for Computing Machinery, 1991): 310-319. Conclusions Some OpenVMS VtlX features such as symmetric 3. T. Leonard, ed., VAX Architecture ReJerence multiprocessing and VMScluster support were Manual (Bedfortl. ,MA: Digital Press, 1987). Alpha AXP Architecture and Systems

Table 3 Configuration Details for the SPEC Release 1 Benchmark Test Environment

VAX System Alpha AXP System Hardware Model number VAX 7000 Model 61 0 DEC 7000 Model 61 0 Clock rate 91 MHz 182 MHz Backup cache size 4MB I- and D-caches 4MB I- and D-caches Memory size 128MB 256MB Software Operating system and version OpenVMS V5.5-2 Field Test OpenVMS V1.O Compilers and version VAX C V3.2 Pre-release C compiler VAX FORTRAN V5.7 Pre-release FORTRAN compiler with HPO V1.3 (high-performance option) Other software KAP V1.O for VAX C KAPFIKAPC V1.49 native and FORTRAN KAP for Alpha AXP systems

Key: I Instruction D Data Note that DECram, a memory-resident disk device, was used to create and manage memory-resident disks

Table 4 SPEC Release 1 Benchmark Results VAX 7000 DEC 7000 SPEC Benchmark Model 610 Model 610 Relative Name and Number SPECratio SPECratio Performance

Note that relative performance represents the ratio of DEC 7000 Model 610 performance to VAX 7000 Model 610 performance.

4. 0penKtfS Calling Standard (Maynard, MA: General References Digital Equipment Corporation, October 1992). R. Goldenberg ancl S. Saravanan, VlW forAlyhu Plnt- 5. SpecNewslettel; vol. 4, no. 1 (March 1992). forms Internals ancl Data Strzrctures, Preliminary 6. T. Anderson, H. Levy, B. Bershad, and E. edition of vols. 1 ancl 2 (Maynard, {MA:Digital Press, Lazowska, "The Interaction of Architecture and 1993, forthcoming). Operating System Design," Proceedings of the J. Hennessy ant1 D. Patterson, Computer Architec- Fourth International Conference on Architec- tzrre, A Quantitative Approach (San Mateo, CA: ture Strpport for Programming Langtlages an61 Morgan Kaufmann Publishers, Inc., 1990). Operating Systems (ASPLOS-IV) (New York, NY: The Association for Computing Machinery, R. Sites, ed.,Alpha Architecture Refi?rerzceMnr?unl 1991): 108-120. (Burlington, MA: Digital Press, 1992).

120 Vo1. 4 I\'(). 4 Specinllssue 1992 Digilnl Technicul Jouri~c~l David S. Blickstein Peter W Craig Caroline S. Dauidson R. Neil Faiman,Jr. Kent D. Glossop Richard B. Grove Steven 0. Hobbs William B. Noyce The GEM Optimizing Compiler System

The GEIVI colnpiler systenz is the technology Digital is using to bziild state-of-the-art conzpiler prod~lctsfor a uc11~iet31of langz~clgesand ha?.d~~~are/softzilareplatforms. Portc~ble,modt~lar soJltuare colnponemzts with carefttllly specifiecl intevaces simplify the engineering of diverse compilers. A single optimizer; independent of the kcin-

gtiage and the targetplntfonn, transjomns the intermediate lnmzgzlage gene~uted@J the front end into a semantic all]^ equiu~rlentfornz that execzltes faster on the target machine. The GLibl system supports a range of languages and has been s~lccessfL11ly retargeted and rehosted for the Alpba AXP GIIZ~,VlIPS architectures and for several operating environl?zents.

I11 the past, Digital has made major investments diverse compilers, GEM compiler products share a in optimizing compilers that were specifically basic architecture. The compiler is divided into sev- directed at one hartlware platform, namely \',AX eral major components, in effect, the fi~ntlamental computers. When Digital began broadening its building blocks from which a compiler is con- hardware offerings to include reduced instruction structed. The interfaces among these components set computer (MSC) architectures, it became clear are carefully specified. The major components of a that new optimization technology was needed, as GEM compiler are the front end, the optimizer, the well as a new strategy for leveraging investments in code generator, and the compiler shell. The logical compiler techno log)^ across an increasing rllrmber clivision of GEM colltponents and the range of GEM of I~ardwareplatforms. support is shown in Figure I. Note that the host is This paper presents a technical description of the computer on which the conlpiler runs, and the the GEM compiler technology that Digital uses to target is the computer on which the generated generate compiler products for a wide range of object runs. hartlware and software combinations. We begin The front end performs lexical analysis anel pars- with an explanation of the GEM strategy of leverag- ing of the source program. The primary outputs are ing investmellts by using portable, modular soft- interme~liatebng~iage (IL) graphs and symbol ware components to build compiler products. The tables, which are both standardized. In an IL graph, bulk of the paper describes the GEM optimizer and each node, referred to as a tctple, represents an cotle generator technologies, with a focus on how operation. Front ends for a11 source languages they adclress challenges posed by the Alplla AXP translate to the single standard IL. All Ianguage-spe- architect~1re.lW then move to a discussion of com- cific code is encapsulated in the front end. All piler engineering and conclutle with an overview knowledge of the source language is cornmuni- of some planned enhancements to the software. cated in the IL or through callbacks to the front end. Knowledge of the target liarclware is represented in GEM CompilerArchitecture tables and in a minimal amount of conditional cocle. Because of the many hardware platforms available, The optimizer transforms the IL generateel by the often with multiple operating systems and a v;~riety front end into a semantically equivalent form that of languages offered on those platforms, builtling a will execute faster on the target machine. A signifi- compiler from scratch for each combination is no cant technical achievement is that a single opti- longer feasible. To simplify the engineering of mizer is used for all languages and target platforms.

Digilrrl Technical Jorrrtirrl Vol. 4 IVO. 4 Spccirrl Issl~e1992 121 Alpha AXP Architecture and Systems

FRONT END SHELL CODE GENERATOR

LANGUAGES OPERATING SYSTEM HOST CPU OPERATING SYSTEM TARGET CPU Alpha AXP OpenVMS BLISS ULTRIX ULTRIX Olhers Windows NT Windows NT COBOL Fortran Pascal OPTIMIZER Opal

Figzlre I GEIM Corlzponents and Supported CPUs, Operatiizg Syste~ns,cind Lcrngz~~zglges

The codegc.rzerc~r"ortranslates the 11. into ;I list ot of ;I program, certain information about other rele- code cells, each of which represents one machine vant piu-ts of the source program can be useful. instruction for the target hardware. Virtually all the Figure 2 illustrates the overall process of compil- target machine instruction-specific code is encap- ing a program. Since <;EM compilers include inter- sulated in the cotle generator. procedural optimizations, as much of the program The shell is a collection of common compiler as possible shoultl be presentetl to the optimizer at functions such as listing generators, object file the same time. For this reason. <;EM co~i~pilers emitters, and command line processors. Basically, allow the user to process multiple source files in a the shell is a portable interface to the external envi- single compil;~tion.The front end parses these ronnient in which the compiler is used. 7'liis inter- source files ;ind constructs the symbol t;ible ant1 a face allows the other components to remain compact form of 1L in memory before invoking tlie independent of the operating system. <;EM back end. The portion of tlie user's program There are numerous benefits to this modu1;ir thus compiletl is called a con~pil;~tionunit. approacli: The GEM back-end interprocedural optimization phase is the first to operate on tlie program. This Adding a new feature to a common component phase analy~csthe routines within a compilation enhances many products. unit to tlevelop a call graph that shows which Source language compatibility is ensr~red;itnong routines might call which other routines. all cornl~ilersthat use the same front entl. Interpsocedural optimizations :ire applietl to the routines ;a;I group. Standardizetl interfaces enable us to plug in ;I Next, the global optimizer and the cotle genera- new front end to build a conlpiler for a new [;in- tor process each routine in a bottom-up order, guage, or a new shell to allow the compiler to resulting in :I tr;inslation of the program to cotle run on a new host. cells th;~trepresent oper;~tionsat niacliine level. This bottom-up order is convenient li)r certain opti- Wlien ;I new language is added, it can be offered mizations, as discussed in the Optimization section. quickly on rnany platforms. The first action of the global optimizer is to trans- IT. When a new target Cpu or operating system is late the roiltine's from the compact form pro- supported, many 1;lnguages are i~nmediately videcl by tlie front entl to ;in exp;intlecl form i~seclby ;ivailable to that target. the optimizer and the cotle generator. Since only one routine at ;I time is storetl in expanded form, a much 1;lrger tlat:~structilre c;111be used to store the Order of Processing results of the optimizer ;~nalysis.The expansion When compiling a program, the overall order of pro- from comp;1ct form illso expands certain shorthand cessing must be carefully arranged so that each com- forms, which are convenient for a front end, into ponent of the compiler can see a large portion of the explicit oper;itions in the expanded 11.. much like a program at one time. When processing one portion ni;lcro expansion facility in a source language.

122 H>1.4 No. 4 Special Issue 1992 Digilnl Tecbnicrrl Jorrrtrrrl D.7e GEM Optimizing Compiler System

SOURCE PROGRAM 1 FRONT END COMPILER SHELL SCANNER AND UTILITIES PARSER SEMANTIC PROCESSING FILE 110 SUPPORT MESSAGING SYMBOL TABLE COMPILER DEBUGGING TOOLS COMPACT INTERMEDIATE LANGUAGE LOCATOR PACKAGE INTERPROCEDURAL INLINING COMMAND PROCESSING OPTIMIZATION LISTING GENERATION COMPILATION ORDERING MEMORY MANAGEMENT I I SYMBOL TABLE COMPACT INTERMEDIATE LANGUAGE

GLOBAL OPTIMIZATION INTERMEDIATE LANGUAGE EXPANSION FLOW GRAPH REDUCTION LOOP UNROLLING COMMON SUBEXPRESSION CODE MOTION VALUE AND CONSTANT PROPAGATION STRENGTH REDUCTION TEST REPLACEMENT SPLIT LIFETIME ANALYSIS

SYMBOL TABLE 1 EXPANDED INTERMEDIATE LANGUAGE CODE GENERATION CODE SELECTION INTERMEDIATE LANGUAGE SCHEDULING REGISTER HISTORY REGISTER ALLOCATION CODE EMISSION STORAGE ALLOCATION

SYMBOL TABLE I CODE CELLS INSTRUCTION PROCESSING PEEPHOLING CODE SCHEDULING I BRANCHIJUMP RESOLUTION

SYMBOL TABLE I CODE CELLS OBJECT MODULE CONSTRUCTION

0OBJECT MODULE

Figure 2 GEM Compiler Order of Processing

Once all the routines have been processed by mizations and instruction scheduling, are per- the global optimizer and the code generator, a formed on this program description. Finally, the complete description of the program is available at optimized machine instructions are converted to the machine instruction level. Certain machine- the appropriate object language for the target oper- specific optimizations, such as peephole opti- atingsystem.

Digital Techrrical Joccrrrnl Vol. 4 No. 4 Sf)eciol Isst~e1992 123 Alpha AXP Architecture and Systems

Optimization Out of concern for the time required to compile The GEM compiler system's optimizer is state-of- large programs, GEM also established the design the-art ant1 independent of the language ancl the tar- principle that the order of complexity as a function get platform. The input to the optimizer is the 1L of the number of IL operations should be as close to and symbol table for multiple routines; the output linear as possible. is the semantically eq~~ivalentIL and symbol table, both modified to run faster on the target platform. Data Access Model and GEM optimizations include interprocedural opti- Side Effects Interface mizations, modern optimizations for superscalar Since GEM compilers translate all source languages IUSC architectures such as the Alpha UParchi- to a common IL and symbol table format, the tecture, plus a robust implementation of the classi- semantics of these languages must be specified cal global optimizations. In addition, (;EM'S code precisely. Many optimizations require an exact generator includes a number of optimization fea- understanding of which symbols are being written tures that help it produce extremely high Local code or read by operations in the IL, and which opera- quality. tions might affect the results computed by other operations. Design Principles The GEM team developed a detailed specification Certain general design approaches or principles known as the data access model, which defines were applied throughout the optimizer. For those operations that can write to memory and instance, choices had to be made in the design of those that can read from memoq7. Each of these the IL; the front end could either provide a higher- memory-accessing operations can explicitly desig- level description of program features or rely on the nate the syn~bolbeing accessed when it is known. back end to derive the higher-level description The model also requires the front end to speclfy frorn an analysis of a lower-level description. In when symbols may be aliased with parameters and cases where accurate, well-defined algorithms for when they may be pointer aliased A pointer- deriving those higher-level features exist, GEM aliased symbol may be accessed through pointers chooses to derive the descriptions. or other operations that do not spec@ the symbol Describing source code loops is a key example of that they access. the implementation of this design principle. Most The model can indicate that the pointer-aliased source languages have explicit syntax for writing property is derivable, i.e., a symbol is pointer loops, and the front end could translate these lan- aliasecl only if an operation that stores its address is guages into a higher-level IL that designates those present in the IL. A special IL operator marks such loops. Instead, GEM uses a lower-level IL with primi- operations. When the derivation of this property is tives such as conditional branch and label opera- deferred, the optimizer can avoid marking symbols tors. The advantage of this approach is that GEM pointer aliased. recognizes all loops, even those constructetl with The data access model provides a standard way GOT0 statements. for a front encl to indicate how IL operations affect A general design approach that emerged from or depend upon synlbols. However, some front experience gained during the GEM project is the ends can provide additional language-specific dis- use of cnahling or expanding transformations to crimination of operations that cannot be allowed to support fundamental optimizations. Often, repre- interfere with one another. For example, a strongly senting operations in the IL in a way that hides cer- typed language like Pascal may stipulate that an tain implicit operations is a compact ancl efficient assignment to a floating-point target must refer to approach. However at times, making these implicit different storage than an integer read, even when operations explicit allows a particular optimization the assignment target is accessed indirectly through routine to operate on them. Agood solution to this a pointer. problem is to initially represent the operations in To represent language-specific rules while adher- the 1L in the compact form. Then, before applying ing to the philosophy that the back end should have optimizations that could benefit from seeing the no knowledge of the source language, GEM compil- ituplicit operations, apply expanding transforma- ers employ a unique interface with the front end, tions to convert the IL into a longer form in which calletl the side effects interface. The front end pro- all operations are explicit. vides a set of procedures that GEM can call during

124 1/01. 4 No. 4 Special Isstre 1992 Digital TecbniculJournal The GEM Opti~nizingCo~npiler System

optimization to ask which IL operations have side passed as arguments, and the corresponding for- effects and which IL operations depend upon those mals are described as not aliased with one another, side effects. eliminating the formal parameters through inlining discards valuable alias information."5 Also, the order in which inlining clecisions are Interprocedural Optimization made can be important. In a chain of calls in w.hich GEM's interprocedural optimization phase starts A calls B and B calls C, the call from A to B might be by walking over the 1L for all routines to build the most desirable inlining candidate. However, if the call graph. The call graph is a directed multi- the call from B to C is inlined first, the size of B may graph in which the nodes are routines, and the increase, making it a less attractive candidate for edges are calls from one routine to another. The inlining into A. Consequently, GEM uses its heuris- graph is not a tree because recursion is allowed. tics to preevaluate all calls and then orders the calls A special virtual routine node represents all by desirability. GEM inlines the most desirable can- unknown routines that might call or be called by didate first, and then reevaluates the caller's prop- a routine in this compilation. erties, possibly acijusting its position in the ordered GEM walks the graph to determine which local list. symbols that are potential targets of up-level access In many C programs, the acldress of a variable are actually referenced in a called routine. When (especially a struct) is passecl to a called routine up-level references do occur, GEM can also deter- that refers to the variable through a pointer for- mine the most efficient way to pass that context mal parameter. After inlining, a symbol's address from the routine that declares the variable to the is stored in a pointer and indirect references are routine that references it. made through the pointer. Later, while optimizing On the same walk, GEM analyzes the use of sym- the routine, GEM's value propagation often elimi- bols whose pointer-aliased property is derivable. If nates the pointer variable. Finally, whet1 one or operations that store the address of such a symbol more pointer-storing operations have been elimi- are present, then the symbol is marked as pointer nated, GEM reevaluates the pointer-aliased prop- aliased. The front end's indication is also retained erty of derivable local symbols, ant1 the variable that so that this analysis can be repeated after address was once passed by address is no longer pointer storing operations are eliminated. aliased. The most significant interprocedural optimiza- After interprocedural analysis, the routines of the tion that GEM performs is procedure inlining. user's program pass through the optimizer and Inlining is a well-known method for reducing code generator one at a time. GEM'S interprocedural procedure call overhead and for increasing the phase chooses a bottom-up routine order in the call effectiveness of global optimizations by enlarging graph. Except for recursive cycles, this order causes the scope of the operations seen at one time. GEM to generate the code for a called routine before Inlining has additional benefits on superscalar generating the caller's code. GEM takes advantage of FUSC architectures, like the Alpha AXP system, this property by recording the scratch registers that because the optimization allows the compiler to were actually used in a called routine ant1 adjusting schedule the instructions of the two routines register usage at its call sites.' GEM also determines together. whether or not the called routine requires an argu- GEM's inliner reviews all calls in the call graph ment count. and uses heuristic algorithms to determine which calls should be inlined for maximum speed without unreasonable increases in code size or compilation Intermediate Lnlzguage Peepholes time. The heuristics consider the number and kind GEM uses a peephole optinlizer to improve the 1L. In of IL operations, the number of symbols referenced, addition to performing the many obvious simplifi- and the kinds of optimization that would likely be cations such as n~ultiplyingby one or adding zero, enabled or disabled by inlining. the optimizer performs other transformations. When callers pass constants as actual parame- Integer division by a constant is expanded into a ters, better optimization is likely to result from multiply by a reciprocal operation, which can be inlining because the corresponding formal parame- efficiently inlplemented with a UMULH instruction. ter will have a known constant value. On the other String operations on short fixed-length strings are hand, when two sections of the same array are converted into integer operations, to benefit from

Digital Technical Joumnl Vo1. 4 No. 4 Special Issue 1992 Alpha AXP Architecture and Systems various optimizations performed only on scalars. Base Binding Also, integer multiply operations by a constant are On RISC machines, a variable in memory is refer- converted into an equivalent set of shift and add or enced by loading the address into a base register and subtract operations. then using indirect addressing through the base reg- IL peepholes sometimes expose new optimiza- ister. To reduce the number of address loads, sets of tion opportunities by expanding complex opera- variables that are closely allocated share base regis- tions into more explicit components. Also, other ters. GEM considers two address expressio~lsfor- optimizations such as value propagation may create mally ecli1i\7alentifthey differ by an amount less than new opportunities to apply peepholes. To take the range of the hardware instruction offset field. advantage of these opportunities, GEM compilers The CSE detection algorithm determines which apply these IL peepholes n~ultipletimes during the adtlress expressions are formally ecluivalent and optimization of a routine. thus can share a base register, and the code motion algorithm moves the base register loads out of loops. Data-flow Analysis In previous Digital compilers, the use of data-flow Induction Variables analysis was limited largely to the elimination of Some of GEM'S most valuable optimizations require common subexpressions (CSEs), value propaga- the identification of inductive expressions and tions, and code motions. We generalized the data- induction variables, which is done during data-flow flow analysis technique to perform a wider variety analysis. An expression in a loop is inductive if its of optimizations including field merging, induction value on a particular iteration is a linear function of variable detection, dead store elimination, base the trip count. The simplest forms of inductive binding, and strength recluction. expressions are the control variables of counted The process of detecting <:SEs is divided into the loops. Expressions that are linear functions of tasks of induction variables are also inductive. GEM'S implementation of data-flow analysis uses Knowing when two expressions would com- a technique for determining what variables are pute the same results given identical inputs. modified between basic blocks in the flow graph.6' Within GEM compilers, such expressions are said The variables modified between a basic block ant1 to be fomnlly eqttirmlent. its dominator are represented as a set called the Verifying that the inputs to formally equivalent IDEF set. The mapping from variables to set ele- subexpressions are always identical. Such ments is clone using the side effects interface. expressions are said to be value eyuiunlent.This The algorithm for detecting induction variables verification is accomplishetl by using the side starts by presuming that all variables modified in effects mechanism. the loop are induction variable candidates. It then tlisqualifies variables not redefined as a linear func- Determining when an expression dominates a tion of themselves with a coefficient equal to one. value eqi~ivalentexpression.5 This information The loops that GEM chooses to analyze have a loop guarantees that GEM will have computed the top that dominates all nodes within the loop. The dominating expression whenever the dorniilated IDEF set for a loop top is exactly those variables that expression is needed. are modified within the loop and thus serves as the Code motions introduce the additional task of starting value for the induction variable candidate finding those places in the flow graph to which an set, again using the side effects mapping of vari- expression could be legally moved such that ables to set elements. During the walk of the loop, whenever a disqualifying store is encountered, the The moved expression would be value equiva- contents of the candidate set are updated. Thus, at lent to the original expression, ant1 the end of the walk, the remaining variables it1 the The moved expression would execute less often set are known to be true induction variables. than the original expression. Strength Reduction of Induction Variables The following sections describe how GEM Strength reductiol~is the process of replacing an detects base-binding and strength-reduction candi- expensive operation with a less expensive opera- dates by substituting slightly different equivalence tion. The most basic example of strength reduction functions. on induction is as follows:

126 Vol. 4 No. 4 Special Issue 1992 Digital Techtrical Journal The GEM Optimizing Compiler Sjntem

If the original source prowam was V' and V" can be assigned to different registers, DO 20 I = 1,10 each with shorter lifetimes than the original vari- 2 0 PRINT 1*4 able V 'The allocator can thus pack registers and memory more tightly. strength reduction would reduce the multiply to an acld as follows: V' and V" can be scheduled indepentlently. For example, the computation of Z in line 2 could be I' = 4 DO 20 I = 1,10 scheduled after the redefinition of V in line 3. PRINT I' 20 I' = I' + 4 A lifetime that begins with a fetch is an uninitial- ized variable. GEM issues a diagnostic in such cases. Note that the most common array references are of the forin A(I), which implies a multiplication of Any lifetime with only stores is effectively I by the stride of the array. Thus, strength reduction "dead," and thus, the stores can be eliminated. yields a significant performance improvement in When a lifetime of an incluction variable con- array-intensive conlputations. tains an equal number of stores and fetches, the To detect strength-reduction candidates, we variable is used only to compute itself. Thus, the redefine formal and value equivalence as follows: whole lifetime can be eliminated. This is called Two inductive expressions are formally equiva- induction variable elimination. lent if, given identical inputs, they differ only by GEM uses split lifetime information to optimize a constant. the flushing and reloading of register variables Two formally equivalent inductive expressions around routine calls. are value equivalent if their inpiits are value GEM uses split lifetime information to determine equivalent or are direct references to induction what variables are potentially refcrenced by variables. exception handlers. Thus, strength-reduction candidates appear Lifetimes often need to be extended around loop loop invariant, and two expressions that are value tops and loop bottoms. Split lifetime analysis has equivalent can share a single strength reduction. fill1 information in many cases in which the code Code motion yields the initial value of the strength generator's lifetime computation must make reduction. pessimistic assumptions. Thus, analyzed vari- ables are allocated more efficiently inside loops. Split Lijetime Analysis The technique GEM uses for split lifetime analysis The GEM optimizer analyzes the usage of certain is based on the VAX Fortran SPLIT phase.# The tech- variables to determine if the stores and fetches of a nique includes several extensioils in the areas of variable can be partitioned, i.e.,split, into disjoint induction variables, unselected variables (the origi- variables or lifelinzes. nal algorithm analyzed only a fixed number of vari- For example, consider the following program ables), and exception handling. segment. Code Generation The GEM code generator matches code templates to sections of IL trees.9 The code generator has a set of approximately 600 code patterns and uses dynamic The references to V can be divided into two dis- programming to guide the selection of a least-cost joint lifetimes V' and V" without changing the covering for each statement tree in the IL graph pro- semantics of the program as in: duced by the global optimizer. Each code pattern specifies a set of interpretive code-generation actions to be applied if the tem- plate is selected. The code-generation actions cre- ate temporaries, determine their lifetimes, allocate registers and stack locations, and actually emit V' and V" can be treated as two completely sequences of instructions. These actions are independent variables. This has several useful applied during the following four separate code- applications. generation passes over the IL graph for a procedure:

Digital Techtrical Jourital '01. 4 No. 4 Special lsszie 1992 127 Alpha AXP Architecture and Systems

Context. During the context pass, the code gen- A result mode that encodes the representation erator crcates tlat;~structures t11;lt describe eiich of;^ value computed by the template's generated temporary variable. The information computerl code. includes the lifetime, usage counts, and ;I \wiglit scaled by loop depth. An intcger that represents the cost of the code generated by this template. Register history. 1)liring the register history p;iss, the code generator eloes a rlominator-order The result modes :ire an enurner;ttion of the tlif- walk of the flow graph to identify pote11ti;il ferent ways the compiler can represent a value in redundant loads of values t1i:lt could be ;ivail;tble the m;~chine.~~<;EM conipjlers llse the following in registers. result modes: Temp name. During the temp name pass. the Sc;iL;ir, for ii value, negated value. and comple- code generator performs register allocation mented value using the lifetime ;inel weigllt inform;ition coni- Hoolean, for low-bit, high-bit, and nonzero values piited tluring the context pass. The code genera- tor also ilses register Iii~tor)~to ;llloc:ite Flow, for ;I Boole;ln represented ;IS control flow temporaries that holtl the same vi~luein the same register. If successfi11,this ;lction eliminates lo:~tl Result motles for different sizes of integer literals ancl move instr~ctions. Result nodes for clelayetl generation of adclress- <:otle. I>uring tlie cotle pass, the code generiitor ing calculations emits instructions ;~nclcotle labels. 'The resulting Result modes indicating that only a part of a code cells are an internal representation :~tthe v;ilue has been materialized, i.e., the low byte, or assembly code level. Each cotlc cell conL;~ins;I that the n1ateri;ilizecl \;;ilue has usetl :i lower-cost single target mirchine instruction. The cotle cells solution have specific registers ;rncl boiincl offsets from base registers. References to Iiibels in the cotle As templates are m;~tchetlto portions of the 1L stream are in a symbolic form, pentling further tree, c:ich node is 1;tbeled with ii vector of possible optilnization and final offset ;lssignmcnt after solutionb. The vector is inclexed by result mode, instruction peephole optimiz;ition and instruc- ancl the lowest-cost solution for each result mode is tion schecluling. recortletl on the forw;lrd bottom-up walk. When a root nocle is encountered, the lowest-cost template in its vector of solutjons is chosen. 'This choice then Code tenlpli~teenumeration and selection occurs tietcrmines the required result mode and solution during tlie context pass. The enumeration pllase for e;1c11 leaf of the pattern, recursivel!~. scans IL notles in execution ortler (11otto111-tip);ind labels each node with alterni~tivepatterns ant1 GEM Code Generator Action Language costs. When ;I root notle such ;IS n store or br;inch Tllc (;I;&] cocle generittor uses and extends methods ti~pleis reached, the lowest-cost ternpl;~teh)r that clcvelopccl in the 131,lsS compilers, the Ci~megie- node is selectetl. l'he selection process is then Mellon llniversity 1)rocliiction-Quality Compiler- applietl recursively to the leavcs for the entire Compiler Project, ancl Digital's VAX Pascal tree. lo cornpiler.1i.13One key <;EM innovation is the use of The IL tree pattern oPa cocle-generi~tiontenipl;~te :I formalizctl action Language to give a ~~nified consists of four pieces of inform;rtion: description of all actions performecl in the four cotle-generation passes. The same formal action A pattern tree that clescribes thc rirr;1ngemcnt of descriptions ;ire interpretetl by four different inter- 11. nodes that can be cotled by this template. The preters. For example, the Allocate-TN action is interior nodes of the pattern tree are If. operit- ilscd to iilloc;~telong-lived temporaries that may be tors; the It.dves are either result rnocle sets or 11, in a register or in memory. This action creates a data operxtors with no operantls. structiire describing tlie temporary in the context A predicate on the tree notles of the pattern. The pass, ailocates a register tluring the temp name pretlic:ite milst be true in orcler for the piittern to pass, and provides the actual temporary location be i~pplicable. h)r code emission.

I*/. 4 IUo. 4 Specir11 lssrre I992 Digital TechrticulJourtial The GEM Optimizing Compiler System

Tree-matching code generators were originally Scheduling developed for complex instruction set computer To take advantage of high instruction-issue rates in (CISC) machines, like the PDP-11 and VAX comput- Alpha AXP systems, compilers must carefully sched- ers. The technique is also an effective way to build ule the object code, interleaving instructions from a retargetable compiler system for current NSC several parts of the program being compiled. architectures. The overall code-generation struc- Performing instruction scheduling only once after ture and many of the actions are target indepen- registers have been allocated places artificial con- dent. Some IL trees use simple, general code straints on the ordering, as illustrated in the follow- patterns, whereas special cases use more elaborate ing code example: patterns and result modes. rO, a(sp) ; Copy a to b rO, b(sp) Register Allocation rO, c(sp) ; Copy c to d rO, d(sp) GEM compilers use a simple linear model to charac- terize register lifetimes. The context, temp name, If the load of c and store of d were to use some and code passes process the basic blocks and the IL other register, the code could be rescheduled to nodes of each block in execution order. Each code save three cycles on the DECchip 21064 processor, pattern has a certain number of lifetime ticks to as shown in the following code: represent points at which a temporary value is cre- rO, a(sp) ; Copy a to b ated or used. The lifetime of a temporary is then the rl, c(sp) ; Copy c to d interval defined by its starting lifetime tick and end- rO, b(sp) ing lifetime tick. rl, d(sp) Simple expression temporaries have a linear life- On the other hand, scheduling only before regis- time contained within a basic block. User variables ter allocation does not incorporate decisions made and CSEs may require that lifetimes be extended to by the code generator. Therefore, instruction cover loop tops and loop bottoms. The optimizer scheduling in GEM compilers occurs twice, before inserts special begin and end markers to delimit the and after registers are allocated. This practice is precise lifetimes of variables created by the split fairly common in contemporary RISC compiler sys- lifetime phase. tems. In most other systems, scheduling is per- The code generator uses a number of heuristics formed only on machine code. GEM has two to allocate registers to avoid copying. If a new different schedulers-one that schedules machine lifetime begins at exactly the same tick as another code and one that schedules IL. lifetime ends, this may indicate that they should share a register. Otherwise, the allocator uses a round-robin allocation to avoid packing registers Intermediate Language Scbedziling too tightly, which would inhibit scheduling. The IL scheduling is performed one basic block at a Move-Value action is used to copy one register to time. First, a forward pass over the block gathers another and provides a hint that the source and des- information needed to control the scheduling, and tination should be allocated to the same register. then a backward pass builds the new ordered list of Actual allocation of registers and stack tempo- tuples. During the forward pass, the con~piler raries occurs in the temp name pass. The allocator builds dependence edges to represent the neces- uses a bin-packing technique to allocate each com- sary ordering relationships between pairs of tuples. piler and user variable to a register or to memory.'* Tuples that would require an excessive number of The allocator first attempts to assign variables to edges, such as CALL tuples, are considered markers. registers; lifetimes that conflict cannot be assigned No tuples can be reordered across a marker. to the same register. The allocator uses a density The compiler uses the data access model to fi~nctionto control the process. A new candidate determine whether two memory-access tuples con- can displace a previous variable that has a conflict- flict. Also, if two tuples have address operands with ing lifetime if this action increases the density mea- the same value (using data-flow information) but sure. After the allocation of temporaries to registers different offset attributes, the tuples must access is completed, any unallocated or spilled tempo- different memory. Thus, no dependence edge is raries are allocated to stack locations. needed, and more rescheduling is possible.

DfgftalTechnical Journal Vol. 4 No. 4 Special Issue 1992 129 Alpha AXP Architecture and Systems

The general code for an expression tuple places value of this heuristic is the Sethi-Ullman number, the result into a compiler-generated temporar): i.e.,the number of registers needed to evaluate the and the general code for a store into a register vari- subexpressions in tlie optimal order, keeping their able moves the value from a temporary into the intermediate values, plus the additional registers to variable. Many GEM code patterns for expression evaluate the tuple itself.Ii If the second pass sched- tuples allow targeting, where the expression is ules tuples with a lower count later in the program, computed directly into the variable instead of into the register usage will be kept low. Without such a a temporary. These cocle patterns are valid only if mechanism, scheduling before register allocation there are no fetches of the variable between the tends to cause excessive register pressure. expression tuple and the store operation. Similarly, CSEs can be treated similarly to subexpressions in a fetch tuple need not generate any code (called this computation, but with two complications. The virtual),if no stores exist between the fetch and its first pass cannot predict the last use of the CSE and consumer. For example, therefore treats each use as the last one. The sched- uler ignores any register usage associated with CSEs T = A-I; A = B+1; C = T; that are not both created and used within the block might generate the GEM IL being scheduled. This action allows the register allocator to place the CSEs in memory, if the sched- uled code has better uses for registers. The second pass of the IL scheduler works back- ward over the basic block. The scheduler removes all tlie tuples up to the last marker and makes avail- able those that have no clependence edges to tuples In this example, SUB operates directly on the reg- that must follow. The scheduler then selects an ister allocated for A, and ADD targets its result to the available tuple and places it in the scheduled out- register allocated for A. The obvious dependence put, updates the state of each modeled f~~nctional edge is from FETCH(A) to STORE(A,. . .). However, IL unit, and makes available new tuples whose depen- scheduling must be careful not to invalidate the dences are now satisfied. When the marker is code patterns, which would happen if it moved scheduled, the scheduler continues to remove the FETCH(A) between ADD and STORE(A) or STORE(A) preceding group of tuples from the block until the between FETCH(A) and SUB. To ensure valid code entire block has been scheduled. patterns, the first pass moves the head of clepen- The scheduler keeps track of the number of dence edges backward from targeted stores to the scheduled cycles and the estimated number of live expression tuple that does the targeting Sinlilarly, registers. When choosing among tuples, the schetl- the first pass moves the tail of dependence edges uler prefers one whose subtree can be evaluatetl forward from virtual fetches to their consumers. In within the available registers, or, failing that, one this example, the edge runs from 2$ to 4$ and pre- whose subtree can be evaluated with the fewest vents either of the illegal reorderings. registers. When several tuples qualify, the sched- In addition to building dependelice edges, the uler chooses the one with the greatest AET. first pass computes heuristics for each tuple, to be Limiting register pressure, while not important used by the second, i.e., scheduling, pass. One for all programs, is important in blocks with a lot of heuristic, the anticipated execution time (AET), available parallelism. With this feature, IL schedul- estimates the earliest time at which the tuple could ing is a significant contributor to the high perfor- execute. The AET for tuple T is either the maximum mance of GEM-compiled programs. AET of any tuple that must precede T, or the maximum AET plus the latency of T's operands. If some of the tuples that must precede T require the Instruction Peepholing same hardware resources, the AET may be opti- After cocle has been generated or code cells have mistic Nevertheless, the AET is a usefill guide to the been created directl-): the instruction processing scheduling pass. phases are run as a group. Instruction peepholing The first pass also computes the minimum performs a variety of localized transformations, typ- number of registers (separately for integer and ically by matching patterns of adjacent instructions floating-point registers) needed to evaluate the and replacing them with better patterns. From the subexpression rooted at a particular tuple. The perspective of instruction scheduling, the most

130 \lo/.4 No. 4 SDeci~~lIssue 1392 Digital TechnicalJournal The GEM Optimizing Compiler System interesting function of the instruction peepholer well is often preferable to obtaining a good sched- is to perform a set of branch reductions. The peep- ule. A cycle model allows other instructions to holer also replicates short sequences of code to "float" into the no-issue slots, while allowing the facilitate instruction scheduling and to eliminate critical resource to be scheduled well. the instruction pipeline effects of branches. ~Moclelingthe machine allows easy determination A control flow processing phase follows the of where stalls are occurring, which in turn allows instruction peepholing phase. Currently, this phase instructions from the current block or from suc- determines labels that are backward branch targets cessor blocks to be moved into no-issue slots. for alignment purposes. This action occurs before instruction scheduling, because instruction align- Modeling the machine in a forward direction ment is important for the DECchip 21064 Alpha AXP captures the fact that processors are typically processor, in which instructions must be aligned "greedy" and issue all the instructions that they on qiiadword boundaries to exploit dual instruc- can issue at a given time. tion issue. In the near future, the control flow pro- The cycle model allows a variety of dumps, cessing phase will collect register information for which can be useful both to users of the com- each basic block to allow additional scheduling piler system and to developers who are trying to transformations. improve the performance of generated code. Instruction Scheduling The forward pass does a topological sort of the instructions. The scheduler moves instructions that The instruction scheduler is the next phase. At this have either a direct dependence or an antidepen- point, all register binding ancl code modifications dence (e.g., register reuse) to a data structure other than branch/jump resolution have occurred. called the issuing ring for fiiture issue. The scheduler docs a forward walk over the basic The scheduler represents the instructions avail- blocks in each code section to determine the align- able for issuing ;IS ;I list of data structures known as ment of the first instruction in each block. heaps, which are priority clueues. Each heap on the For each basic block, the instruction scheduler list contains instructions with a similar "signature." does two passes that are effectively the inverse of For example, a heap might contain all store instruc- the passes that the IL scheduler performs, namely a tions. When looking for the next instruction to backward walk to determine instruction-ortlering issue, the scheduler examines the top instruction in requirements and path length to the entl of the each heap. Within each heap, instructions are typi- block, and a forward pass that actually schedules cally ordered by their AET values, with occasional the cotle. small biases for different instruction properties, The backward ordering pass uses an AET compu- such as loads that may have a variable execution tation similar to the one usecl by the 1L scheduler. time longer than the projected time. The instruction scheduler knows the actual instruc- The heaps are, in turn, orcleretl in the list accord- tions to be scheduled and has a more detailed ing to how desirable it is that a particular heap's top machine model. For the DECchip 21064 processor, instruction be issued. All nonpipelined instruction for example, the instruction scheduler has tletailed heaps are first on the list, followed by all semi- asymmetric bypassing information ant1 information pipelineil heaps and, last, all fully pipelinecl ones. about multiple issue. For architectures that have A semipipelined resource may prevent particular branch delay slots, the AET computation is biasetl instructions from issuing in certain fiiture cycles so that instructions Likely to be able to fill branch but can issue every cycle. For example, stores on delay slots will occur immediately before branch somc machines interact with later loads. operations. Instructions that use multiple resources are rep- The forwarcl scheduling pass does a cycle-by- resented in the heap ordering. For example, float- cycle model of the machine, inclutling modeling ing-point multiplies on the MIPS machine multiple issue. The reasons for choosing this use both the multiplier ancl some of the same approach rather than an approach that just selects resources as additions. As a result, the heap that an ordering of the instructions are ;IS follows: holcls multiplies is always kept ahead of the heap For machines with significant issue limitations, that holds adds. This ordering scheme works well e.g., nonpipelined functional units or multiple for both machines with a significant number of issue pairing rules, packing the limiting resource nonpipelined units, such as the MIPS processors,

Digital Technical Journal 1/01. 4 rVo. 4 Speciul lssrre 1992 131 Alpha AXP Architecture and Systems

and machines that have largely pipelined fi~nctional lower-level operators. The IL generated by the front units with only particular combinations of multiple end for two field extractions as shown in (a) of issue allowed, like the DECchip 21064 processors. Figure 3 is expanded into the IL shown in (b) Note that, other than the architecture-specific of Figure 3. With the loads exposed as fetches, data- computation for AET and per-processor imple- flow analysis is now capable of finding the common mentation data tables, the scheduler is completely subexpressions of 1$ and 3$. target independent. For example, currently, proces- Similarly, each field store expands into a fetch of sor implementation tables exist for the MIPS R3000 the background longword, an insertion of the new and R4000 processors, the DECchip 21064 pro- data into the proper position, and a store back. cessor, and Alpha AXP processors that are under Given two field stores, value propagation can elimi- development. nate the second fetch, and then dead-store elimina- tion can eliminate the first store. Field Merging Example In some cases, a program operates on the field Generating efficient code for the extraction and and thus eliminates the extract and insert opera- insertion of fields within records is particularly tions. For example, the following example gener- challenging on RlSC architectures, like Alpha AXP, ates the machine code shown in Figure 4. that provide only 32-bit (longword) or 64-bit (quad- typedef struct node C word) memory operations. char n-kind; char n-f Lags; Often, a program will fetch or store several fields struct node *xl-car; that are contained in the same longword. Without struct node *xl-cdr; optimization, each fetch would load the longnlord 1 NODE; from memory, and each store would both load and #define MARK 1 store the longword. However, it is possible to per- #define LEFT 2 form a collection of field fetches and stores with a void demo(ptr1 single load ant1 store to memory. As another exam- NODE *ptr; ple, two bit tests within the same longword could while (ptr) C be done in parallel as a mask operation. if (ptr->n-kind == 0) C In the IL generated by the front end, each field ptr->n-flags I= MARK; ptr->n-flags &= -LEFT; operation is generated as a separate IL operation. 1 Thus, the real task of optimizing field accesses is to ptr = ptr->xl-cdr; identify IL operations that can be combined. 1 In the initial IL, a field fetch or store is repre- sented as an IL operator. The underlying problem is The unoptimized code would contain a load and that the redundant loads and stores are not visible an extract for each reference to n-kind or n-flags, in this representation. The first part of the solution plus an insert and a store for the latter two involves expanding the field fetch or store into references. The optimizer has eliminated two of the

I$: FETCHX(RECORD, COI, [I]) ; Fetch the #I(low-order) bit ; from memory 2s: FETCHX(RECORD, CII, [I]) ; Fetch the #2 bit from memory

(a) Pre-field merging 1L

1$ : FETCH(REC0RD) ; Fetch the longword 2s: EXTV(l$, LO], CII); ; Extract the #I from the longword

3s: FETCH(RECORD1 ; Fetch the longword 4s: EXTV(I$, [I], [I]); ; Extract the #2 from the Longword

(b) Post-field merging IL

Figure 3 Field Merging Example

132 Vo1. 4 No. 4 Speciallssue I992 Digital Technical Journal The GEM Optimizing Compiler System

demo: : BEQ ptr, LS5 NOP LS7: LDL RO, (R16) ; Load n-kind and n-flags AND RO, 255, R1 ; Extract n-kind BNE R1, LS9 MOV 256, R17 BIS RO, R17, R17 ; Set MARK (in place) M0 V -513, R1 AND R17, R1, R17 ; Clear LEFT (in place) STL R17, (R16) ; Store back LS9: LDL ptr, 8(R16) BNE ptr, L$7 LS5 : RE T

Figure 4 Machine Code with Field Merging three loads, two of the three extracts, both inserts, and one of the two stores. Brancb Optimization Examples Branch instructions can hurt the performance The ANDSKlP and WNDC tuples implement the of high-performance systems in several ways. In conditional-ANDoperator. If tuple 2$ is false, tuples addition to consuming space and causing time to be 4$ and 5$ are skipped, and the result of the LANDC expended while issuing the instruction, branches is false. Otherwise, the LANDC uses the result of can disrupt the hardware pipeline. Also, branches tuple 5$. can inhibit optimizations such as code scheduling. Similarly, the SELTHEN, SELELSE, and SELC tuples Therefore, the GEM compiler system uses several implement the select operator. If tuple 6$ is true, strategies to avoid branches in the IL and generated then tuples 8$ and 9$ compute the result, and code or to eliminate some bad effects of branch tuples 11$ and 12$ are skipped. If tuple 6$ is false, instructions. then tuples 8$ and 9$ are skipped, and tuples 11$ Some branches appear as part of a we1 I-defined and 12$ compute the result. pattern that need not inhibit optimizations. GEM These operators allow programs to represent uses special operators for these cases. A simple branching code within the standard basic-block example is the MAX function. For Alpha AXP sys- framework but require branches in the generated tems, IMAX can be implemented using the CMOVxx code, to avoid undesired side effects of the skipped instructions, avoiding branch instructions entirely. tuples. In some cases, though, GEM can determine For other architectures, the main benefit is that the that the skipped tuples have no side effects and then branch does not appear in the IL. A more compli- converts the operators to an unconditional form, cated example involves the so-called "flow- allowing the generated code to be free of branches. Boolean" operators. Consider the C code example, GEM ~erformsother transformations on the IL to eliminate branches and thus enable further opti- which generates the following GEM JL: mizations. For example, GEM transforms if (expr) var = 1; else var = 0; into

var = ((expr) != 0)

Alpha AXP implementations typically include a branch prediction mechanism. Correctly predicted

Digital TechnicalJournal Vol. 4 No. 4 Speciullssssrre 1992 133 Alpha AXP Architecture and Systems

branches take several cycles less time than mispre- do c dictecl branches. The fastest conditional branch is if (aCi1 != bCi1) goto Label; aCi1 = 0; one that is correctly predicted not to be taken. GEM if (aCi-11 != bCi-11) goto Label; uses several strategies to arrange branches for best aCi-11 = 0; performance. if (aCi-21 != bCi-21) goto Label; aCi-21 = 0; GEM selects an order for the basic blocks of a pro- if (aCi-31 != bCi-31) goto label; gram that may differ from the order in the source aCi-31 = 0; program. For each basic block that ends with an 1 while (i-= 4); unconditional branch, GEM places the target block This code executes four fall-through branches next, unless that block has already been placed. and one taken branch, whereas the original code Similarly, ifa basic block within a loop ends with an executed four fall-through branches and four taken ilnconditional branch, a target block within that branches. loop is placed next, if possible. For example, Certain code patterns generate code that is likely while (--i> 0) C not to be executed. For example, when the com- if (aCi1 != bCi1) return aCi1-bCi3; piler believes th;it ;I 16-bit value in memory is apt to aCi1 = 0; be naturally aligned, but may be unaligned, it gen- 1 erates the instructions shown jn Figure 5 to load To eliminate the unco~~ditionalbranch when the the value, given the address in 10. The code runs loop iterates, GEM transforms the pretested loop quickly for the aligned case, bec;~usethe branch is into a posttested loop. Since the return statement is correctly predictetl to fall through, but gets the cor- outsidc the loop, the generated code looks like rect value for unalignetl data, as well. A similar code pattern hantlles stores. if (--i> 0) do ( if (aCil != bCil) goto Label; Compiler Engineering aCi1 = 0; Engineering compilers for a large combination of ) while (--i> 0); ... languages and platforms required a considerable number of innovations in the area of project engi- label: return aCil-bCil; neering. In this section we describe some of the (;Eht can also ilnroll loops and thus reduce the project methods and tools GEM uses. number of times the branch back must be exe- cutecl. more important. GEM often allows opera- Opal Intermediate Language Compiler tions from different iterations to be scheduled The task of a <;EM compiler is to translate a pro- together. Unrolling by four transforms the above gram presented by the front end in the form of an loop into a cleanup loop and the main loop into IL graph and symbol table into machine code. In code that resembles the early stages of GEM tlevelopment. no front

: 3-instruction inline sequence if aligned

Ldq-u rl, (1-0) extwl rl, r0, rl blbs rO, 108 208:

; out-of-line sequence to Load and merge , 10s: Ldq-u r28, l(r0) extwh r28, rO, r28 or rl, r28, rl br r31, 208 ends existetl to generate IL graphs ant1 symbol but has a front entl that compiles only one program. tables. To fill this recluirement, a synt;~cticspeci- The KFOLD builder scans the GEM operator signa- fication of the 1L and symbol table was designed ture table ant1 constructs a procedure that contains and an 1L assembler called Opal was built to com- a many-way conditional branch to select a case pile this syntax. Opal uses GEM components such based on the I[* operator specified in the argument as tlie slicll and thus supports a robust set of fea- list. Each case fetches operand values from the tures including listing generation, object files, argument list, applies the oper;itor, and returns the inclutle files, clebug support, and language editor result. Since most GEM 11. tuples operate on several diagnostics. data types, additional subcases rnay be based on tlie Even with the availability of front ends, Opal operator type or result type. We have already recov- rem;iins a vital project tool: it allows GEM develop- ered the investment in developing the automatic el-s to exercise new features before front-entl sup- KFOLD builder, and it significantly eases the task of port is available; front-end developers use Opal to retargeting GEM. experiment wit11 different 1L alternatives; ant1 the Opal syntax serves as the output form;~tof the JL Conclusion d~~~iper. This paper describes the current GEM compiler system. However, a portable, optimizing compiler Attrib.~lteand Operator Signature Tables provides many opportunities that we have not yet <;EM tables give a complete description of all <;EM exploited. Some enhancements planned for future data structures, including IL operators and symbol versions are: t;tble nodes. The operator sign;tture table contains the operator type, result type, ntlmber of operantls, Additional 1L operators ant1 data types, to sup- and leg;~l operand types for IL operators. The port more langliages ;~ttributet;tbles describe each component in a notle Support for xtltlitional architect~lreand operat- including loci~tion,abstract GEM data type, legal val- ing system combinations ues, node type for pointers, and special print for- mats. 1Vhen a new attribute is added to the <;EM Dependence analysis, to enable some of the specification, the attribute is described once in the following enhancements tables and automatically the Opal compiler 11ndel.- Loop transformations, to improve the use of the stands the syntax and semantics, the <;EM dump memory hierarchy utility is able to clump the attribute, and tlie <;EM integrity checker is able to verify tlie structure. Software pipelining, to increase parallelism in vectorizable loops Automatic KFOLD Builder Better reordering of memory references during The <;EM compiler needs to evaluate constant instruction scheduling expressions at compile tinie, which is referred to as constant folding. GEM'S intermediate language has The schecluling of instructions into different many IL operators ant1 data types. A constant folder basic blocks is ~IILIS;I complic;ited routine with many cases, and The relaxing of tlie linear restriction on the the compile-time and run-time results must be lifetime model, i.c., ;~llomlingholes in register itlentical. lifetimes After writing our first, incomplete, hantlcraftecl const;lnt folder, we searched for a methocl to ;~uto- The GEM compiler system has met demanding marc the process. No source language supported all technical and time-to-niarl

Acknozuledgments 13. W Wulf et al., The Design of an Opti7nizi7zg The authors wish to acknowletlge the contribu- Compiler (New York, h?': American Elsevier tions of the following indivicluals to the design and Publishing Co., 1975). implementation of the GEM con~pilers: Ron 13. B. Leverett et al., "An Overview of the Pro- Brender, Patsy Griffin, Lucy Hamnett, Brian duction-Quality Compiler-Compiler Project," Koblenz, Dennis Murphy, Bob Peterson, Paul Conzpz~ter;vol. 13, no. 8 (iiugust 1980): 38-49. Winalski, Stan Whitlock (Fortran), Bevin Brett (Ada), and Farokh Morshed (C). 14. R Johnsson, "An Approach to Global Register Allocation." P1i.D thesis, Carnegie-Mellon References University December 1975. 1. R Sites, ed., Alpha Architeclure Reference 15. R. Sethi and J. Ullni;ln, "The Generation of khnz~al(Burlington, bLA: Digital Press, 1992). Optimal Code for Arithmetic Expressions," Jo~~rnalof the ACM, vol. 17, no. 4 (October, 2. K. Cooper, M. Hall, and L. Torczon, "The Per- 1970): 715-728. ils of Interprocedural Knowledge," Rice COkIP Th'90-132 (1990). General Reference 3. K. Cooper, M. Hall, and L. Torczon, "Unes- I? Puiklam et al., Engineering a Co~~.piler(Bedford, t\ t\ pected Side Effects of Inlilie Substitution: ILL\: Digital Press, 1982). Case Study," TOPLAS (March 1992): 22-32. 4. E Chow, "Minimizing Register Usage Penalty at Procedure Calls," SI(;PD~IV'88 Conference on Program~ning Lcrnguage Design and Implementation (June 1988): 85-94. 5. T. Lengauer ancl R. Tarjan, "A Fast Algorithm for Finding Dominators in a Flowgraph," TOPLAS, vol. 1, no. 1 (July 1979): 121-141. 6. J. Reif, "Symbolic Interpretation in Almost Lin- ear Time," CotzJivwux Records of the Fzyth ACtW Symposi~lrnon Principles of Program ming Languages (1978): 76-83. 7. J. Reif and R. Tarjan, "Symbolic Program Anal- ysis in Almost-Linear Time," SIAtW Journal of Computing, vol. 11, no. 1 (February 1981): 81-93.

8. K Harris and S. Hobbs, "VAX Fortran," Optimization in Co~npilers,ed., F. Allen, B. Rosen, and E Zndek (New York, NY: ACM Press, forthcoming). 9. R. Cattell, "Form;~liz;~tionand Automatic Derivation of Code Generators," P1i.D. thesis, CMU-CS-78-115, Carnegie-Mellon Universit): April 1978. 10. A. Aho and S. Johnson, "Optimal Code Gener- ation for Expression Trees," Jo~~rnalof the AChI: vol. 23, no. 3 (July 1976): 488-501. 11. B. Leverett, "Register Allocation in Optimiz- ing Compilers," Ph.D. thesis, CMU-CS-81-103, Carnegie-Mellon University, February 1981.

136 It11 4 1\10 4 Spectnll\\~~e1'1% Digitul Technical Jourrral Richard L. Sites Anton Chernofl Matthew K Kirk Maurice I? Marks Scott G. Robinson

Binary Translation

Binary translatio~zis a technique used to change an exect~tableprogmnz for one computer architect~ireand operating system into an exec~~tablep~~ogm~~ora dq &rent comnptiter architecture and opemtirzg system. Two binmy tra~zslatorsare 6imong the nzigmtion tools available for Alpha AXP computers: EST translates Open VIMS VAX binary images to Open KVIS AXP images; ~nxtranslates ULTRIX I\IIP~P images to DEC OSF/I AXP images. In both cases, translated code zisually runs 011 ?.Alpha AXP computers as fast or faster than the original code runs oiz the origilzal nrchitectz~re.In contrast to other migration eflorts in the industry, the VAX transla- tor reprod~icessubtle CISC behauior on a RISC machine, and both open-ended trans- lators provide good performance on ~Lynai~zicallymodified programs. Alpha A XP Di~zary translators are importalzt migration tools-hundreds of translated Open VbfS MX and ULTRIXIIIIPS images currently run on Alpha AXP systems.

When Digital started to design the Alpha A?(P archi- Software interpreter (e.g., Insignla Soluttons' tecture in the fall of 1988, the Alpha jLYP team was SoftPC) concerned about how to run existing VAX code and Microcodeti (e.g.,PD1'-11 compatibility soon-to-exist MIPS code on the new Alpha N(P coni- mode in early VAX computers) puters.l.2To take fill1 advantage of the performance capability of a new compilter architecture, an appli- Binary translator (e.g., Hunter System's XDOS) cation must be ported by rebuilding, using native Native compiler compilers. For a single program written in a stan- dard programming language, this is a matter of A software interpreter is a program that reads recompile and run. A complex softmiare application, instructions of the old architecture one at a time, however, can be built from hundreds of source performing each operation in turn on a soft- pieces using dozens of tools. A native port of such ware-maintained version of the olcl architecture's an application is possible only when all parts of the state. Interpreters are not very fast, but they run build path are running on the new architecture. on a wide variety of machines and can faithfully Therefore, devising a way to run an existing (old architecture) binary version of a complex applica- tion on a new architecture is an important interim FASTER measure. Such a technique allows a user to get applications up and running immediately, with minimal porting effort. Once a user's everyclay envi- ronment is established, applications can be rebuilt over time, using handwritten native code or par- TRANSLATOR tially native and partially old code.

Background SOFTWARE Several techniques are used in the industry to run INTERPRETER the binary code of an old architecture on a new architecture. Figure 1 shows four conimon tech- Figcli"e1 Co1n1nonTecl~1ziyuesforRu1zni1zg01d niques, from slowest to fastest: Code on Nezu Conz)ctter.s

Digital Techtrical Joc~rrral Vol. 4 iVo. 4 Sl~ecinlissue I992 137 Alpha AXP Architecture and Systems reproduce the behavior of self-modifying pro- also directly or indirectly invoke operating system grams, programs that branch to clata, programs that services. In simple environments with a single dom- branch to a checksum of themselves, etc. Caching inant library, it can be sufficient to rewrite that interpreters gain speed by retaining pretlecocled library in native code and to interpret user pro- forms of previously interpreted instructions. grams, particularly user programs that actually A microcociecl emulator operates similarly to a spend no st of their time in the librar),. This strategy software interpreter but usually with some key is conllnonly used to run Witldows and Macintosh hardware assists to decode the old instructions programs ilncler the UNIX operating system. quickly and to hold hardware state information in In more robust environments, it is not practical registers of the micromachine. An emulator is typi- to rewrite all the sharecl libraries by hand; collec- cally faster than an interpreter but can run only on tions of dozens or even hundreds of images (such as a specific microcoded new macl~jne.This technique typical VAX ALL-IN-I systems) must be run in the old cannot be used to run existing code on a recluced environment, with an occasional excursion into the instruction set computer (RISC) machine, since RlSC native operating system. Over time, it is desirable to :trchitectures do not have a microcoded hardware rebuilt1 some images using a native compiler while layer unclerlyjng the visible machine architecture. retaining other images as translatecl code, and to A translated binary program is a sequence of achieve interoperability between these old and new-architecture instructions that reproduce the new images. The interface between an old environ- behavior of an old-architecture program. Typically, ment and a new one typically consists of "jacket" much of the state information of the old machine is routines that receive a call using old conventions kept in registers in the new machine. Translated and data structures, reformat the parameters, per- cocle hithfillly reproduces the calling stantlard, form a native call using new conventions anel data implicit state, instruction side effects, branching structures, reformat the result, and return. flow, and other artifacts of the old machine. The Alpha ILYP Migration Tools team considerecl Translated programs can be much faster than runnjng olcl VAX binary programs on Alpha AXP interpreters or , but slower than native- compilters using a simple software interpreter, but compiled programs. rejected this method because the performance 'Translators can be classified as either (1) would be too slow to be useful. We also rejected bounded translation systems, in which all the the idea of using some form of microcoded emula- instructions of the old program must exist at trans- tor. This technique would compromise the perfor- late time ant1 must be found and translated to new Jnance of a native Alpha AXP implementation, and instru~tions,3-~.~or (2) open-ended translation sys- VA)r' compatibility woultl be nearly impossible to tems, in which code may also be discovered, cre- achieve without microcode, which is inconsistent ated, or moclifiecl at execution time. Bounded with a high-speecl NS(: design. s)rstems usually require manual intervention to find We therefore tilriied to open-ended binary trans 100 percent of the code; open-ended systems can lation. We were aware of the earlier Hewlett- be fi~llyautomatic. Packartl binary translator, but its single-image HP To run existing VAX and MIPS programs, an open- 3000 input code looked much simpler to translate ended system is absolutely necessary. For example, than large collections of hancl-coded VAX assembly some customer programs write license-check code language programs.' One member of the team (VAX instructions) to memory, ant1 branch to that (R. Sites) wrote a V~x-to-vAXbinary translator in code. A bounclecl system fails on such programs. October 1988 as proof-of-concept. The concept A native-compiled program is a sequence of new- looked feasible, so we set the following ambitious architecture instructions produced by recompiling protluct goals: the program. Native-compiled programs usually 1. Ope1-1-ended(completely automatic) translation use newer, faster calling co~lventionsthan old pro- of almost all user-mode applications from the grams. With a well-tuned optimizing compiler, OpenVMS VkY system to the OpenVMS AXP native-compiled programs can be substantially system faster than any of the other choices. Most large programs are not self-contained; they 2. Open-ended translation of almost all user-mode call library routines, winclowing services, data- applications from the 'lJT.TRIX system to the DEC bases, and toolkits, for example. These programs OSW~system

138 Wl. 4 No. 4 .$peci~~lIsstre 19.92 Digital Tec/~tricalJournal 3. Run-time performance of translated code on To achieve these goals, the Alpha AXP Migration Alpha .UP computers that meets or exceeds the Tools team created two binary translators: VEST, performance of tlie original code on the original which translates OpenViMS VN( binary images to architecture OpenVMS t\)(P images, and mx, which translates ULTRlX MlPS images to DEC OSWl AXP images. 4. Optional reproduction of subtle old-architecture However, binary translation is only half the migra- details, at the cost of run-time performance, e.g., tion process. As shown it1 Figure 2, the other half is complex instruction set computer (CISC) to build a run-time environment in which to exe- instruction atomicity for multithreaded applica- cute the translated code. This second half of tlie tions and exact arithmetic traps for sophisti- process must bridge any differences between old cated error handlers and new operating systems, calling standards, exception handling, etc. For open-ended transla- 5. If translation is not possible, generation of tion, this part of the process must also include a explicit messages that give reasons and speclCy way to run old code that was not discovered (or did what source changes are necessary not exist) at translate time. The translated image While we were creating the VLY translator, we environment (TIE) and mxr run-time environment discovered that the process of building flow graphs support the VEST and mx translators, respectively, of the cocle and tracking data clepenclencies yielded by reproducing the old operating environments. information about source code bugs, performance Each environment supports open-encled transla- bottlenecks, and dependencies on features not avail- tion by including a fallback interpreter of old code, able in all Alpha AXP operating systems. This analy- and extensive run-time feedback to avoid using the sis information could be valuable to a source code interpreter except for dyt~atnicallycreated code. maintainer. Thus, we aclded one more product goal: Our design philosophy is to do everything feasible to stay out of the interpreter, rather than to increase 6. Optional source analysis information the speed of the interpreter. This approach gives

OLD BINARY

TRANSLATOR (VESTIMX)

IMAGE I ::?OLD CODE I I AND ERROR OPTIONAL ZFi'AL I I I I GRAPHS I NEW CODE MESSAGES

RUN-TIME OTHER OTHER SUPPORT TRANSLATED NATIVE (TIEJMX) IMAGES IMAGES

Figure 2 Binary Translation and Execution Process

Digital Tech~iicnlJournal Vol. 4 No. 4 Special Issue 1992 139 Alpha AXP Architecture and Systems

better performance over :I wicler range of progranis Cotle analysis can detect many problems, includ- than using pure interpreters or bounded transla- ing some that indicate latent bugs in the source tion systems. image. VEST can detect, for example, uninitialized The remainder of this paper tliscusses the two variables, improperly formed VAX CASE instruc- binary translator/run-time environment pairs ;~v;iil- tions, stack depth mismatches along two different able for Alph;~ icvP computers: VEST/'T'IE and paths to tlie s;lme code (the program expects data mx/mxr. ?b establish a basis for the discussion, the to be at a certain stack depth), improperly formed reader must ilnderstarid the following terms: returns from subroutines, and modifications to a datum, alignment, instruction atomicity granular- VAX call frame. A latent bug in the source image ity interlocked uptlate, :ind word tearing. should be fixed, since the translated image may Definitions of these terms appear in the References demonstrate incorrect behavior due to that bug. and Note section.- Aniilysis also detects the use of unsupported OlxnVMS features inclutling unsupported system VEST: Translating a VAX Image services. The source image must be modified to Translating a V.kX image involves two main steps: eliminate the use of these features. analyzing VkY cock and generating Alpha AXP code. Some problems reported by VEST result from The translated images produced are OpenVMS i\XP code that is hackish in nature. For example, we images and may he run just like native i~iiages.~ foi~ndcode that expects a call mask at an entry Translateel images run with the ;issistance of the point to be executed as a no-op instruction so that translated image environment, wliich is discussed the code preceding the subroutine can simply exe- later in this paper. The VEST binary translator is cute the call mask, rather than go through the over- written in C++ and runs on Vfi, MIPS, ancl Alpha head of a \%X jump (JMP) instruction. VEST MI' machines. The TIE is written in the Open\lbls reproduces the behavior of the VAX program, even system programming langu;lges, BLISS ;Inti A.lpha if this beh:~vioris a result of luck. assembler. A VEST-generated flow graph is displayed in Figure 3. Dashed lines represent code paths fol- lowed if a conditional branch is taken. Solid lines indicate fall-through paths. A problem is high- To locate VAX code, \fESI' starts disassembling code lighted by a wide, dashed pointer whose bottom at known entry points ant1 recursively traces the entl inclicates tlie basic block in which the problem progr:~m'sflow of control. Entry points come from was uncovered. Full blocks show the path that main and global routines, debug symbol table reveals the error; empty blocks show basic blocks entries, and optional information files (including that are not in the error path. In Figure 3, a path run-time feedback from the TIE). exists by which register 3 (R3) may be used without As VEST traces the program, it builcls ;I flow graph being set if the VtLK BNEQ (branch if tlie register that consists of basic blocks (it.,htr;iight-line code tloes not equal zero) instruction in the second basic sequenccs) annotateel with information clerivetl block is true the first time through the code from parsing instructions. VEST then performs sev- sequence. eral ;in;ilyses on the flow graph to propagate con- text information to each basic block and eliminate unnecessary operations. Context information Code Generation includes contlition code us;ige, register contents, The VEST tr;inslator generates code by converting stack depth, ancl ;I variety of other information that each VA>< instruction into zero or more Alpha .GYP allows VEST to generate optimized code. instructions. The ;irchitecture mapping is straight- Analysis is important for achieving good perfor- forward because there are more Alpha AXP registers mance. For exanlple, no condition codes exist in than VAx registers. The VAX architecture has only 15 the Alpha LYP architecti~re.Without analysis it registers, which ;ire used for both floating-point woultl be necessary to conlpute condition codes and integer operations. The Alpha AXP architecture for each VAX instruction even if the codes were not h;~sseparate integer and floating-point registers. used. Furthermore, several forms of ;lnalysis were VAX RO through R14 are mapped to Alpha AXP RO invented to ;~llowcorrect tr;~nslation.For example, through R14 for all operations except floating VEST automatically determines if a subroutine does point. R12, RJ3, and R14 retain their V.desig- a normal return. nations ;is argument pointer, frame pointer, and

140 Vo1. 4 No. 4 Specictl lssrre 1992 Digital TecbniculJournal Binar), Translation

remaining Alpha AXP registers are used as scratch DHRY'srMC\Prcc2\504 ICI R3 used registers or for OpenvMS MI' standard calls. tVES-I-W5TKALl,U, PICW: Nen-standard call uses Aj. VEST connects simple branches directly to their translated targets. VEST performs backward sym- bolic execution of VAX instructions to resolve as many computed branch targets as feasible. If more than one possible computed target exists, a run- time lookup is done on the Vkd target adclress. If the lookup fails to find a translated target, a fallback VMinterpreter is used, as described in the TIE sec- tion Failure to Find All Code during Translation. Unlike boundetl translation systenls, which must achieve 100 percent resolution of computed tar- gets, the VEST and mx binary translators require no manual intervention. Translated Images A translated image has the same format ;IS an OpenVMS AXP image and contains the original OpenVMS VAX image as well as the Alpha AXP instructions that were generated for the VAX coclc. The run-time VAX interpreter TIE needs the original instructions as a fallback. (Also, some error handlers look up the call stack for pointers to spe- cific VM instructions.) The addresses of statically allocated data in the translated image are identical to their V.&Xaddresses. The image contains a VM-to- Alpha AXP address m:~ppingtable for use during lookups and may contain an instruction atomicity table, described in the VAX Instruction Guarantees section. Translated images use the OpenVMS VAX calling standard. Native images use different conventions, but translated images interoperate with native or translated shareable images. Automatic jacketing services are providetl in the TIE to convert calls using one set of conventions into the other. In many cases, jacketing services permit substitution of a native shareable image for a translated share- Figure .? VESTgenemted Flow Graph Sboioing able image without modification. However, a jacket Uninitialized Variable routine is sometimes required. For example, on OpenVhlS UP systems, the translated FORTRAN run-time library, FORRTL-TV, invokes the native stack pointer, and R15 is used to resolve PC-relative Alpha IU(P library DEC$FORRTLfor I/O-related sub- references. Floating-point operations are mappecl routine calls. DEC$FORRTLhas a different interface to FO through F14. than FORRTL has on an OpenVMS VAX system. For The VAX architecture has condition codes that these calls, FORRTL-TV contains handwritten jacket may be referenced explicitly. In translated images, routines. condition codes are mapped into R22 and R23. Similar to the HP 3000 translator, R23 is used as a Files Used fast condition code register for positive/negative/ Translating an image requires only one file-a VAX zero results." R22 contains all four condition code executable image. Several optional files make trans bits and is calculated only when necessary. All lation more effective.

Digital Tecbnical Journal Vol. 4 No. 4 Special Issue 13-2 Alpha AXP Architecture and Systems

1. 1ni;lge information Files (IIFs). VEST automati- describcd in detail in the section Special c:llly creates IIFs to provide information ;ibout Considerations for Instruction Atomicity. shareable image interfaces. The inform:ition inclucles the adtlresses ofent~~~points, n;lnies of Untranslatable Images routines, and resource utilization. Some characteristics make OpenVMS VAX images 2. Symbol information files (SIN). VEST automati- untranslatable, including: cally generates sIFs to control the global syrubol table in a translated shared library, facilitating Exception handler issues. Images that tlepend interoperation between translated ant1 n;~tive on ex;~mini~igthe Vsprocessor status longword image>. (~'sL) (luring exception Iianclling must be niodi- fiecl, because the VAX PSL is not available within 3. Hand-edited information files (HIFs). The TIE exception handlers. a~~tomaticallygenerates I-IIFs, which m;ly be hand-edited to supply information that VEST can- Direct reference to undoci~nlentedsystem ser- not cleduce. HIFs cont:~indirectives to tell VEST vices. Some software contains references to about i~ndetectedentry points, to force it to unsuj>l'orted and undocumented system ser- ch;lnge specific assumptions ;ibout an image tliir- vices, such as an intern2il-to-\/MSservice, which ing translation, ant1 to provide known interpice parses image symbol tables. VEST highlights properties to be prop;~g;~tedinto an [IF. these references. Exact VAX memory management requirements. VEST Per$ormance Considerations Images that tlepend on exact VtU( memory man- In evr~lu;~tingtranslated cotle performance, we rec- agement behavior clo not function properly and ognized that there was a significant trade-off must be modified. These images inclucle those between performance ant1 the accuracy of emulat- that tlepentl on VN( page size or that cxpect ing the VAX architecture. VEST perlnits i~sersto certain objects to be m;ippetl to particular select several architectural assumptions and opti- addresses. miz;ltions, including: 1m;ige format. Programs that use images as data are not able to read Open\OtS SPimages with- D-float precision. The Alpha UParchitecti~re out modifications, because the image formats jxovitles hardware support for D-float with only are different. 53-bit mantissas, whereas the vhx architecture provitles 56-bit mantissiis. Tile user may select translation with either 53-bit hardware support TIE Design Overuiew (f~ster)or 56-bit software support (slower). The run-time translatetl image environment TIE assists 111 executing translated OpenVMS VrU( images Alignment. Alpha AXP instructions slipport Only under the OpenmSA)(P operating sptem. Figure 4 nati~rallyaligned longword (32-hit) ant1 quatl- and Table 1 show the contents of the TIC wortl (Whit) memos). operations. LJnalignetl memory operations cause alignment fi~ults, Problems Solved at Run Time wliicli :Ire handled tr:unsp;rrently by softw;~reat Complici~tions may occur when translated signitlcant run-time expense. The user may OpenVMS \IU images arc run under the OpenVMS direct VEST to assume th;it data references are ~~n:~lignetlwhenever ;rlignment information is AXP operating system. This section tliscusses the following related topics: the failure to find a11 code ~~n;~v;iil:~ble, during translation, \'AX instruction guarantees, Instruction aton1icit)l. Multitasking and multi- instruction atomicity, memory update, and preserv- processing programs may depend on jnstri~ction ing VtLY exceptions. atomicity and memory operation cllaracteristics simil;~r-to those of the VAS architecture. VEST Failure to Find All Code dt~ring TI-c~nslation uses special code sequences to produce exact When the \/EST binary translator encounters ;i \/AX memory characteristics. VEST and the TIE brancli or subroutine c;ill to an unknown destina- cooperate to ensure VAX instruction atomicity tion, VEST generates code to call one of the TIE wlien instructed to do so. This mecliaoisni is lookup ro~~tines.The lookup soutilies niap a Va

142 kl. 4 Vo. 4 Specic~lr.sslrr I992 Digital 7ecbrriccrl Jour~inl instruction address to a translated Alpha tLYP code If the target of the flo-w change is translated cotle, address. If an address mapping exists, then a trans- the interpreter exits to this code. Otherwise, the fer to the translatetl code is performed. Otherwise, interpreter continues to interpret the target. the VAX interpreter executes the destination code. Lookup operations that transfer control to the When the VAX interpreter encounters a flow of con- interpreter also record the starting Vku code trol change, it checks for returns to translated code, address in an HIF file. The ViU( image can then be retranslatecl with the HIF information, resulting in an image that runs faster. Lookup routines are also usecl to call native MAIN AND NATIVE Alpha AXP (nontranslated) routines. The TIE sup- SHAREABLE IMAGES IMAGES plies the required special autojacketing processing that allows interoperation between translated and native routines with no manual intervention. At OPENVMS AXP load time, each translated image identifies itself to the TIE anti supplies a mapping table ilsetl by the lookup routines. The TIE maintains a cache of trans- CALLBACKS lations to speed LIPthe actual lookup processing. 4 4 4 Every translated image contains both the original VAX code and the corresponding Alpha UPcode. When a translated image identifies itself, tlie TIE marks its original VtU( addresses with the page pro- tection called fault on execute (FOE). An Alpha AXP processor that attempts to execute an instruction on one of these pages generates an access violation

MANAGER fault. This fault is processed by a TIE condition Iian- dler to convert the FOE page protection into an appropriate destination address lookup operation. INTERPRETER For example, the FOE might occur when a trans- lated routine returns to its caller. If the caller was interpreted, then its return address is a Vtm code address insteatl of a translated VAx (Alpha Axp Figztrc?4 VEST Run-time Eizzlirontnent code) address. The Alpha AXP processor attempts

Table 1 TIE Contents VAX-to-Alpha AXP Address Mapping Used to find computed destinations and other cases (VAX State Manager) where VEST did not find the original VAX code. Each translated image has a mapping table included. VAX Instruction Atomicity Controller Achieves VAX instruction atomicity for asynchronous (VAX State Manager) events. This allows data sharing between the single asynchronous execution context (AST) provided by OpenVMS and non-AST level routines. VAX Instruction Interpreter Executes VAX instructions not found by VEST. VAX Complex Instructions Some VAX instructions do not have code generated in-line by VEST. Those instructions are processed in the TIE. Examples are MOVC3 and MOVC5 that move byte strings. OpenVMS VAX Exception Processing Certain aspects of OpenVMS AXP exception processing are necessarily different from OpenVMS VAX. For example, the VAX computers have two scratch registers, but Alpha AXP computers have 15. Translated condition handlers are passed the VAX equivalents. Routines for Differences between OpenVMS Some operating system interfaces were rearchitected. VAX and OpenVMS AXP System Services The TIE intervenes to make the differences transwarent.

Digital Technical Jourt~al 1'01. 4 1Vo.4 SpeciuI Issue 1992 143 Alpha AXP Architecture and Systems

to exccute the VXZ code ant1 generates a FOE condi- unclerlying IUSC instructions manipulate four or tion. The TIE condition h;~ndlerconverts this into a eight bytes at a time. Finally, VEST/TIE can provide ,1J41' lookt~poperation. exact memory read-write ordering and arithmetic exceptions, e.g.,overflom~. All these provisjons are VAX I~zstt~zrctio~zGunr~rritees Instruction guaran- option;~land require extra execution time. tees :ire characteristics of a computer architecture Tables 2 and 3 show the visibility differences th:~tare inhercnt to instructions executecl on that between various guarantees on VAX and Alpha AXP architecl~rre.For example, on ;I VLG computer, if systems as well as for translated VhX programs. instruction 1 writes data to memory ancl then instruction 2 writes data to memory, a second pro- Special Considerations for- Instrzrction Atoi?licity cessor must not see the write from jnstruction 2 The VAX' ;lrchitecture requires that interrupted before thc write from instruction I.. This property instructions complete or appear never to have is called strict read-write ordering. started. Since translation is ;I process of converting The VES'IYTIE pair can provide the illusion that a one VAX instruction to potentially many Alpha AXP single (:IS<: instruction is executed in its entirety, instructions. run-time processing must achieve this even thougli the underlying translation is :I series gu;lr;lntee of instruction atomicity Hence, a \'AX of RlX: instructions. VEST/?'IE can also provide the instruction atomicity controller (MC) was created illusion of two processors updating adjacent mem- to manipulate Alpha AXP state to an equivalent ory l3yte.s without interference, even thoirgh the VAX state. When a translated asynchronous event

Table 2 Single Processor Guarantees Single Processor Guarantees Characterized by What an Observer Sees on the Same Processor That Executes the Data Change Topic VAX Translated VAX Native Alpha AXP Instruction An entire An entire translated A single Alpha AXP Atomicity VAX instruction VAX instruction with instruction /PRESERVE=INSTRUCTION -ATOMICITY and TIE'S instruction atomicity controller, else a single Alpha AXP instruction

Table 3 Multi~leProcessor Guarantees Multiple Processor Guarantees Characterized by What an Observer on a Different Processor Sees versus the One Executing the Data Change Topic VAX Translated VAX Native Alpha AXP Byte Granularity Yes, hardware Yes, with Yes, via LDx-L, ensures this /PRESERVE=MEMORY merge, STx-C -ATOMICITY sequence Interlocked Update Yes, for aligned Yes, for aligned datum Yes, via LDx-L, datum using interlock using VAX interlock modify, STx-C instructions instructions sequence Word Tearing Aligned longword Aligned longword or Aligned longword or writes change all quadword writes quadword writes bytes at once change all bytes change all bytes at once at once Other writes are allowed to change one byte at a time

144 Vo1. 4 iVo. 4 Speciallssue I992 Digital Technical Journal Binary Translation

processing routine is called, the L4C is invoked. The access to shared memory. Longword-size updates LAC examines the Alpha ASP instruction stream and to aligned locations are performed using nor- either backs up the interrupted program counter to mal load/store instructions to ensure longword restart at the equivalent vainstruction boundary granularity. or executes the remaining instructions to the next Many multiprocessing vAX programs depend boundary. Many V!programs do not require this on byte granularity for memory update. VEST guarantee to operate correctly, so VEST emits code generates byte-granular code if the condition that is VAX instruction atomic only if the qualifier /PRESEU\~E=MEMOU\'-ATOMICln is specified when /PRESERVE=INSTRUCTION-ATOMICITY is specified translating an image. In addition, VEST generates when translating an image. strict read-write ordering code if the qualifier VEST-generated code consists of four sections /PRESERVE=RUD-WRITE-ORDERING is specified that are detected by the L4C. These sections have when translating an image. the following functions: Preserving VAX Exceptions Alpha AXP instruc- Get operands to temporary registers tions do not have the same exception characteris- Operate on these temporary registers tics as VA); instructions. For instance, an arithmetic fault is imprecise, i.e., not synchronous with the Atomically update VAX results that could gener- instruction that caused it. The Alpha kXP hardware ate side effects (i.e., an exception or interlocked generates an arithmetic fault that gets mapped access) into an OpenvMS AXP high-performance arith Perform any updates that cannot generate side metic (HPARITH) exception. To retain compati- effects (e.g., register updates) bility with VkY condition handlers, the TIE maps HPrUiITH into a corresponding VA?< exception when The \TAX interpreter achieves VAX instruction calling a translated condition handler. Most VAX atomicity by using the atomic move, register to languages do not require precise exceptions. memory (AMOVRM) instruction. The AIIOVRM For those that do, like BASIC, VEST generates instruction is implemented in privileged archi- the necessary trap barrier (TRAPB) instructions tecture library (PAL) subroutines and updates a if /PRESERVE=FLOATING-LYCEPTIONS is specified contiguous region of memory containing \TAX when translating an image. state without being interrupted. At the begin- ning of each interpreted vAx instruction, a read and OpenVMS AXP and set flag (RS) instruction sets a flag that is cleared OpenVfWS VAX DzIfferences when an interrupt occurs on the processor. Functional Differences Most OpenVMS AXP AMOVRM tests the flag, and if set, performs the system services are identical to their OpenVMS VAx update ant1 returns a success indication. If the flag counterparts. Services that depend on a \TAX-spe- is clear, the h10VRhlinstruction indicates failure, cific mechanism are changed for the Alpha AXP and the interpreter reprocesses the interrupted architecture. The TIE intervenes in such system ser- instruction. vices to ensure the translated code sees the old interface. Isszles with Changing Memory VAX instruction For example, the declare change mode handler atomicity ensures that an arithmetic instruction ($DCLCMH)system service establishes a handler for does not have any partially updated memory loca- VAX change mode to user (CHMU) instructions. The tions, as viewed from the processor on which that handler is invoked as if it were an interrupt service instruction is executed. In a multiprocessing envi- routine required to use the V.return from inter- ronment, inspection from another processor could rupt or exception (REI) instruction to return to the result in a perception of partial results. invoker's context. On OpenVMS AXP systems, the Since an Alpha MP processor accesses mem- handler is called as a normal procedure. To ensure ory only in aligned longwords or quadwords, it compatibility, the TIE inserts its own handler when is therefore not byte granular. To achieve byte calling OpenVMS iiXP $DCLCMH.When a CHMU is granularity, VEST generates a load-locked/store- invoked on Alpha tUP computers, the TIE handler conditional code sequence, which ensures that a calls the handler of the translated image, using the memory location is updated as if it were byte granu- same VAX-specific mechanisms that the handler lar. This sequence is also used to ensure interlocked expects.

Digital Technical Journal Vo1. 4 No. 4 Special Issue 2992 Alpha AXP Architecture and Systems

Exceptio?z Handlirg OpenVMS AXP exception started after VEST was filnctional, and we took processing is alrnost identical to that performed in advantage of the VEST common code base for much the OpenVMS VAX system. The niiijor tlitference is of tlie analysis nnd Alpha AXP code assembly phases that the VA)C. mechanism array neetls to hold the of the translator. In hct, about h;ilf of the cotle in value of only two temporary registers, RO and R1, mx is compilecl from the same source files as those whereas the Alpli~LYP mechanism array neetls to usetl for VEST, with some architectural specifics holcl the value of 15 temporary registers, RO, R1, ant1 suppljecl by differing inclutle files. The cocle-shar- R16 through R28. ing aspects of C++ have proven quite valuable in this regard. Co~nl~lexInstrz~ctions Translating some VAS mxr is the run-time support system for translated instri~ctions woulcl require many Alpha AXI-' programs. It provicles services simil;~rto TIE, emu- instructions. Instead, VEST generates code that calls lating the ULTRIX MIPS environment on a DE<: OSUl a TIE subroutine. Subroutines are implemcntetl in AXP system. mxr is written in C++, C, and Alpha two ways: (1) handwritten native enlulation rou- ;~ssenibler. tines, e.g., MOVC5, ancl (2) VEST-translatecl VAX emu- lation routines, e.g., POLYH. Challenges Together, VEST ant1 TIE can translate ant1 run most Creating a translator for the MIPS /R3000 existing user-mode VAX bin;irlr inxlges. As shown in architecture presented us with a Iiost of new oppor Table 4, performance of translated VAX programs tunities, along with some significant challenges. slightly exceeds the original goal. Performance The basic structure of the mx triinslator is much depends heavily on tlie frequency of use of VAX fea- simpler than that of VEST. Both the source and tures that are not present in Alpli;l AM-' machines. the target architectures are IIISC: niachines; there- fore, the two instruction sets have a consideriible ULTRYX MIPS Translation similaritj~.Many instructions translate one for one. mx is the translator that converts IJ1:SRIX >lIl'S pro- The MIPS architecture has very few instruction sick grams to DEC OSF/l AXP programs. The rnx project effects or subtle architect~~ralciet;lils, although

Table 4 Translated VAX Performance, Normalized to Native-compiled OpenVMS AXP Code

VEST VAX Time Translated Time Native Time on VAX 6610 on DEC 7000 AXP on DEC 7000 AXP Program (83.3 MHz) (167 MHz)* (167 MHz)

gee express0 spice2g6 doduc nasa7 li eqntott matrix300 ~PPPP tomcatv Geometric Mean (without gcc)

Notes: The larger the number, the slower the performance. These performance numbers were measured on derated field test hardware and software at various times during 1992; production results will vary somewhat. The SPEC benchmarks are written in FORTRAN and C; no conclusions should be drawn about other classes of programs wr~ttenin other languages. 'The DEC 7000 system was running at a derated speed compared to product~onDEC 7000 systems. t~iminginformation for this run is not available. those that are present are particularly tricky. architecture has more than 32 registers, including Furthermore, the format of an executable program the HI and LO registers used by the multiply and i~nclerthe IJLTRIX system collects all code in a single divide instructions, and a floating-point condition contiguous segment and makes it easy for rnx to register, whose layout and contents do not corre- reliably find close to 100 percent of the code in the spond to the Alpha &YP floating-point condition MIPS application. The system interfaces to the register. rJLTRIX and DEC OSF/1 systems are similar enough In mx, we assign registers using standard conl- that most ULTRlX system calls have functionally piler techniques. To assign registers to 33 MII-'S identical counterparts under the DEC OSF/l system. resources (the 32 general registers plus one 64-bit The challenges in mx stem from the fact that the register to Iiold both HI and LO), certain registers source architecture is a RlSC machine. For example, are permanently mapped, and other M[PS registers DEC OSF/l LKP is a 64-bit computing environment, are kept in either AXP registers or memory. The i.e., all pointers used to communicate with the MIPS argument-passing registers A0 through A3 are operating system are 64 bits witle. This environ- permanently assigned to Alpha AXP registers R16 ment does not present a problem when the pointer through R19, mihicll are the argument registers in is passed in a register. However, when a pointer (or the DEC OSF/l AXP calling standard. This correspon- a long data item, such as a file size) is passed in dence simplifies the work needed when mxr must memory, it must be converted between the 32-bit take arguments for an UUrlUX system call and pass representation, used by the LILTRIX system, and the them to a DEC OSF/1 system call. Similarly, the argu- 64-bit mP representation, even when the seman- ment return registers VO and V1 are mapped to the tics of the operating system call are the same on Alpha &XI3 argument return registers RO and R1. The both systems. return address registers and stack pointer registers A significant challenge is the fact that our users' of the two machines are also mapped. MIPS RO is expectations for performance of translated pro- mapped to Alpha AXP R31, where both registers grams are much higher than for VEST. Reasoning contain the same hard-wired zero value. We reserve that the source and target machines are similar, Alpha AXP registers R22 through R24 as scratch reg- users also expect mx to achieve a translated pro- isters and also use them when interfacing to mxr. gram perfornlance better than that of the source We reserve Alpha LKP R14 as a pointer to an mxr program, since Alpha UPprocessors are faster. communication area. Finally, we reserve three Thus, as our performance goal, we set out to pro- more registers as scratch registers for use by the duce a translated program that runs at about the code generator. same speed as the original program would run on a The remaining 16 Alpha AXP registers are avail- MIPS R4000 machine with a 100-megahertz (MHz) able to be assigned to the remaining 23 MIPS internal clock rate. resources. After the code is analyzed and we have register usage information, the 16 most freqi~ently Mapping the Architectures used MIPS registers get mapped to the remaining 16 At first glance, it appears that we could simply Alpha AXP registers, and the remaining registers are assign each MITJS register to a corresponding Alpha assigned to memory slots in the mur cornmunica- iu;P register, because each machine has 32 general- tion area. When a MIPS basic block uses one of the purpose registers. The translated code would then slotted registers, rnx assigns it to one of the scratch have two scratch registers, since the MIPS architec- registers. If the first reference reads the old con- ture does not allow user-level programs to use reg- tents of the register, mx generates a load instruc- isters KO and K1, which are reserved for the tion from the communications area. If the value of operating system kernel. the MIPS resource changes in the basic block, the Unfortunately, translation requires more than scratch register is stored in the communication two scratch registers The Alpha I\XP arch~tecture area before the end of the block. As in most compil- does not have byte or halfword (16-bit) loads or ers, if we run out of registers, a spill algorithm stores, and the code sequences for perform- chooses a value to save in the communication area lng these operations require four or five scratch and frees up a register. registers. Furthermore, rnx reclulres a base register Alpha AYI-' integer registers are 64 bits witle, to locate mxr without having to load a 64-bit whereas MlPS registers are only 32 bits wide. We address constant at each call. Finally, the ivlIPs chose to keep all 32-bit values in Alpha UPinteger

Digital Trchtrical Jourtcal Val. 4 No. 4 Special Issue 1992 147 Alpha AXP Architecture and Systems

registers as sign-extended values, with the high 32 AXP registers, ancl which MIPS resources are nor- bits equal to bit 31. This approach occasionally mally kept in memory. This table allows mxr to find requires mx to generate additional code to create the MIPS resources at run time. canonical 32-bit integer results, but the &-bit com- pare operations do not need to change the values Finding Code that they are comparing. As with the VEST translator, mx finds code by The floating-point architecture is more complex. starting at entry points and recursively tracing Each of the 32 MIPS floating-point registers is 32 bits down the flow of control. mx finds entry points wide. Only the even registers are used for single using the executable file header, the symbol table precision, and a double-precision number is kept (if present), and feedback from mxr (if present). in an even-odd register pair. We map each pair of Finally, mx performs a linear scan of the entire MIPS floating-point registers onto a single 64-bit text section for unexaminetl words. mx analyzes Alpha AXP floating-point register. Also, one Alpha any data that looks like plausible code but does not AXP floating-point register represents the condition connect this data into the main flow graph. code bit of the MIPS floating-point control register. Plausible code consists of a series of valid MIPS Thus, the mx code generator can use 14 scratch instructions terminated by an unconditional trans- registers. nLu goes to considerable effort to find fer of control. paired loads and stores in the MIPS code stream, and While finding code and connecting the basic to merge them into one Alpha AXP floating-point blocks into a flow graph, mx looks for the code operation. sequence that indicates ;I switch statement, i.e., a MII'S single-precision operations cause problems multi-way branch, usually through an element of a with floating-point corresponclence. Since on MIPS table. mx finds the branch table and connects each machines, the single-precision number is kept in of the possible targets as successors of the branch. only the even register of the register pair, the even and odd registers in a pair are independent when single-precision (or integer) operations are done in Code Analysis the floating-point unit. On Alpha AXP machines, Our static analysis of hundreds of NlIPS programs computation must be done on a value extended to indicates that only 10 instructions account for double format in the whole 64-bit register. We about 85 percent of all code. These instructions are defined two forms for values in Alpha AXP floating- LW, ADDIU, SW, NOP, ADDIJ, BEQ, JAL, BNE, LtJI, and point registers: computational form, in which com- SLL. The corresponding sequences of Alpha AXP putation is done, and canonical form, which code range from zero operation codes, or opcodes, mimics the MIPS even and odd registers. If a MlPS (for NOP, since the Alpha AXP architecture does not program loads an even register and uses this regis- require NOPs anywhere in the code stream) to two ter as a single-precision value, mx loads the value opcodes (for SLL). from memory to be used computationally. Lf a MlPS Code analysis for source programs is much more program loads only an even register but does not important in mx than in VEST, because the coding use this register in the basic block, mx puts the 32- idioms for many common operations dlffer bit value into half of the Alpha AXP floating-point between the Alpha AXP and MlPS processors. The register. This permits correct behavior in the patho- simple technique of mapping each MIPS instruction logical case where half of a floating-point number is to a sequence of one or more Alpha AXP instruc- lo:~dedin one place, and the other half is loaded in tions loses much of the context information in the some other basic block. If a register is used as a sin- original program. gle-precision number in a basic block without first For example, the idiom used to load a 32-bit being loaded, the code generator inserts code to constant into a register on M[PS machines is to gen- convert it from canonical to computational float- erate a load upper immediate (LUI) opcode, placing ing-point form. If a single-precision value has been a 16-bit constant in the high-order 16 bits of a computed in a block and is live at the end of the register. This operation is followecl by an OR imme- block, it is converted to canonical form. diate (OM) opcode, logically ORing a 16-bit mx inserts a register mapping table into the zero-extended value into the register. The LUI translated program that indicates which MIPS corresponds exactly to the Alpha i\xP load address resources are statically mapped to which Alpha high (LDAH) opcode. However, the Alpha AXP

148 1/01. 4 No. 4 Special Issue 1992 Digital Technical Jourtral Binary Translation

architecture has no way of directly ORing a 16-bit optimization phase, which looks for MlPS coding value into a register and cannot even load a zero- idioms and replaces them with abstract machine extended 16-bit constant into a register. When the instructions that better reflect the idiom. For exam- high-order bit of the 16-bit constant is 1, the short- ple, mx changes loads of immediate values to a non- est translation for the ORI is three instructions. The MlPS hardware load immediate (LI) instruction; shift mx translator scans the code looking for such and add sequences to abstract operations that idioms, and generates the optimal two-instruction reflect the Alpha AXP scaled add and subtract sequence of Alpha AXP code that performs the 32- sequences; and sequences that change the floating- bit load. No opcode exists that corresponds to the point rounding mode (used to truncate a floating- ON, but the results in the registers are correct. point number to an integer) to a single opcode that When we started writing the mx translator, represents the Alpha &V convert operation with we listed a number of code possibilities that we the chopped mode (/C) modifier. thought we would never see. In retrospect, this was MlPS code contains a number of common code a misguided assumption. For example, we have sequences that cross basic block boundaries, seen programs that branch into the delay slot of but which can be compressed into a single basic other instructions, requiring us to indicate that the block in Alpha AXP code. Examples of these are delay slot instruction is a member of two different the min and max functions, which map neatly basic blocks-the block it ends, and the one it onto a single conditional move (CMOVxx) instruc- starts. We have observed programs that put soft- tion in Alpha AXP code. The code generator looks ware breakpoint (BREAK) instructions in the branch for these sequences, merges the basic blocks, delay slot, and thus BREAK ends a basic block with- and creates an extended basic block, which out being the last instruction. Some compilers includes pseudo-opcodes that indicate the MIPS schedule code so that half of a floating-point regis- code idiom. ter is stored and then reused before the other half is After the optimizer completes the list of instruc- stored. The general principle that we intuit from tions, it translates each abstract opcode to zero or these observations is "if a code sequence is not more Alpha AXP opcodes, again building a linked expressly prohibited by the architecture, some pro- list of instructions. This process may permit further gram somewhere will use it." improvements, so the optimizer makes a second pass over the Alpha AXP code. Code Generation When processing a basic block, the code genera- tor assumes that it has an unlimited number of tem- After the program is parsed and analyzed and the porary resources. Since this is not actually true, the flow graph is built, the code generator is called. It code generator then calls a register assigner to allo- builds the register mapping table and then, in turn, cate the real Alpha AXP temporary resources to the processes each basic block, generating Alpha AXP intermediate temporary registers. Tlie register MIPS code that performs the same functions as the assigner will load and spill MIPS resources and gen- code. erated temporary registers as needed. At each subroutine entry, mx scans the code Finally, the list of Alpha AXP instructions is assem- stream with a pattern-matching algorithm to see if bled into a binary stream, and the instruction the code corresponds to any of a number of stan- scheduler rearranges them to remove resource dard MlPS library routines, such as strcpy. (Note that latencies and use the chip's multiple issue capability. the ULTRlX operating system has no shared libraries, so library routines are bound into each binary image.) If a correspondence exists, the Image Formats entire subroutine is recursively deleted from the The file format for input is the standard ULTRIX flow graph and replaced with a canned routine to extended common object file format (COFF). In perform the subroutine's work on Alpha AXP pro- most ULTRIX MlPS programs, the text section starts cessors. This technique contributes significantly to at 00400000 (hexadecimal) and the data at the performance of translated programs. 10000000 (hexadecimal). In virtually all programs, For each remaining basic block, the instructions a large gap exists between the virtual address for are converted to a linked list of intermediate the end of text and the start of the data section. opcodes. At first, each opcode corresponds exactly When mx creates the output image, it places the to a MIPS opcode. Tlie list is then scanned by an generated Alpha AXP code after the MIPS code and

Digital Technical Joztrnal Vol. 4 No. 4 Speciallssue 1992 Alpha AXP Architecture and Systems before the MIPS data. This allows the program to Translation Goals and Classes have one large text section. The Alpha AXP code of Programs Not Supported begins at an Alpha AYP page boundary, so that we 011sgoill was to translate most user-mode MIPS pro- can set the memory protection on the IMII'S code grams compiled for a MIPS R2000 or R3000 machine separately from the Alpha t\XP code. running III*TRIX Release 4.0 (or later) to run iclenti- The translatecl image is not in LIE<: OSF/l tU1' exe- cally on the I>EC OSF/l AYP system with acceptable cutable format. Instead, it looks like a Mil's (:OFF 1xrh)rrn;lnce.As shown in Table 5, performance of file, but with tlie first few bytes changetl to the translatetl MIPS programs meets or exceeds the string "*!/usr/bin/mxr". origin;~lgoal.

n Execw ting Translated Program Table 5 Translated MlPS When a translated image is run on I)E(: OSWl xx~. Relative Performance its ~i~otlifieclheader invokes mxr first. mxr uses tlie nienior!r map (mrnap) system call to loacl the trans- MlPS Time on Translated Time DECstation on DEC 3000 lated program at the same virtual atltlress th;lt it 5000 Model 240 AXP Model 500 would have liad under tlie LlL'I"I1LX operating Program (40 MHz) (1 50 MHz) system. rnxr resets the protection of the MIi-'S code to read/no-write/no-execute,the Alpha ASP code SPECint92 to read/no-write/esecute, and the c1;1t;1 to read/ espresso 2.4 1.1 (1 .O)* li 1.6 1.2 (1 .O) write/no-execute. eqntott 1.6 2.1 (1 .O) mxr allocates a communication area ancl ini- compress 2.7 1.0 (1.0) tializes Alpha &YP R14 to point to this are;). The SC -t - colnmi~nication area contains save areas for gee 2.1 1.2 (1 .O) >,IIPS resources, iliitialized pointers to nlxr ser- Geometric Mean 2.0 1.3 (1.0) vice routines, ancl other scr:~tclispace. mxr then (without sc) constructs new conimantl argument (nrgv) and SPECfp92 environment vectors as 32-bit wicle pointers (as the spice2g6 - MII-'S program expects). arranges to intercept cer- doduc 1.7 tain signals from the I>EC OSF/l UPsystem. and mdljdp2 2.7 transfers control. to the translated start acldress of wave5 1.1 the program. tomcatv 3.0 When a system signal is tleliveretl to the program, o ra 1.5 alvinn 1.6 control goes to the signal intercept code in mxr. ear 1.7 This cocle transforms the signal context structure mdljsp2 1.4 from the DEC OSF/l ILYP system and constructs an swm256 2.3 LIL'TRIX PIIPS style context, which it tllen passes to su2cor 2.7 the translated signal h:lncller. hydro2d 2.9 nasa7 2.6 Certain signals are processed specially. For ~PPPP 2.2 instance, a program that attempts to transfer con- Geometric Mean 2.0 trol to a location containing MIPS code rather than (without translated code gets a segmentation violation, since spice2g6) the MIPS code is not executable. This situation can occur if a routine modifies its return adclress Notes: The larger the number, the slower the performance. These to be a MIPS address constant, mxr will examine performance numbers were measured on derated field test the target address antl, if it correspontls to the start hardware and software at various times during 1992; production of ;I pretranslated MIPS basic block, divert the flow results will vary somewhat. The SPEC benchmarks are written in FORTRAN and C; no conclus~onsshould be drawn about other of control to the translatetl cocle for that block. classes of programs wr~ttenIn other languages. If not, mxr enters the MII-'S interpreter. The 'The values in parentheses are from running once, then interpreter proceeds to emulate the MIPS code retranslating with the run-time feedback from the f~rstrun; until a translated point is reached. mxr then this gave a significant performance difference only for the programs shown. resynchronizes its machine state and reenters the t~iminainformation for this run is not available. translated code.

I50 It)/.4 i\'o. I .S/)c,cir/lI.E<: SF/^ at' systems, programs that expect files tomer applications. Tlie group members, especially to be in particular places or in a particular format Shamin Bhindarwala and Robi d-Jaar, were sources m;ry fail. Similarly, programs that read /dev/kmem of extremely valuable customer feedback. The and expect to see an LJLTRIX MIPS memory layout Engineering System Group under Mike Greenfielcl f~il. also made extensive early ilse of tlie translators and Certain other classes of programs are not cur- provided valuable feedback. rently supported, but are technically feasible. The Alpha iLYP Migration Tools team is relatively These include big entlian MIPS programs from non- small for tlie subst;lnti;rl ;rmoutit of work acconi- L)igit;~l >Ill'!, environments, programs that use plished in the past two and one-half years. Every 1c4000 or 116000 instructions that are not present person has made several key contributions. In atltli- 011 tlie R3000 motlel, programs that need to be tion to the authors of this paper, the team members ~nulti~rocessorsafe, and programs that require cer- are: Kate Burleson, Peigi Cleniinshaw, George tain categories of precise exception heh;~vior. Darc): Catherine Frean, Bruce Gordon, Rick Gorton, Kevin Koch, Mark Herdeg, Giovanni Della Szcmmary Libera, Nikki Mirghafori, Srinivasan Murari, Jim Builtling successful turnkey binary translators Paradis, and Ashutosh Roy. reqi~ireshartl work but not magic. We built two dif ferent translators; VEST and mx. In both c;rses, the References and Note oltl and new environments are, by design, quite similar in fundamental data types, memory atltlress 1. R. Sites, ecl., Alpha Ar.cbitect~rr-e ReJerellce ing, register ;rnd stack usage, and operating system rllanunl (Burlington. MA: Digital Press, 1992). services. Translators between dissiniil;~r;rrchitec- 2. R. Sites, "Alph;~ tLYt' Architecture," Digital tures or operating systems are :I different matter. Tech~zical./ounza~,L,vol. 4, no. 4 (1992, this issue): Translating tlie code might be a reason;lbly straight- 19-34. h~rwardtask. However, emulating a run-time envi- ronment in which to execute the code might 3. C. Hunter ant1 J. Banning, "DOS at RISC," Byte ,?resent insurmountable technical ;rntl business ~Mngnzine(Novern her 1989): 361- 368. 01~st:icles.Without capturing the environment, an 4. Echo Logic, News Release (May 4, 1992). instruction translator would be of no use. The ide;~of binary translation is becoming more 5. L. Wirbel, "1)OS-to-lJNlX Compiler," Electronic common in the conlputer industry, as v;~riousother Eizgineel.i~lgTilnes (,M;lrch 14, 1988): 83. co11ip;lnies start on their transitions to 64-bit 6. A. Bergh, K. Keilman, D Magcnheime~ancl ;~rchitectures. J. Miller, "HP 3000 Emulation on HP Precision Acknowledgments Architecture Coniputcrs," He~ulett-Packor-clJour.- 1zc11(December 1987). Steve Hobbs originally suggestetl the binziry transla- tion p:rth in the arcliitect~lretask force tliscirssions. 7 Datum is the term ilsetl to refer to a piece N:~ncyKronenberg ancl Bob Supnik ;~cldedcritical of information that has ;in adtlress ancl a size. e:rrly support and later coordination. Jucl 1.eonard Alignment is the property of a datum of size set the engineering direction of cloing careful static 2" bytes. This clati~mis aligned if its byte atldress translation once, instead of on-the-fly dynamic has n low-order zeros. A size or address not tr21nslation ;it each execution. Butler Lampson boostetl morale at a critical time. Jim Gettys has meeting this constr;~intimplies that the datum is irnaligned. also been an important and vocal supporter. The success of the translators would not have Instruction atomicity is the property of instruc- been possil~lewithout the enthusiastic support of tion execution on single processor systems the OpenVklS tCYP and DEC OSWl iUCP operating such that an interrupted instruction has been

Digital Tecbrricnl Jo~~rnnl%I. 4 ,Vo 4 Specitrl 1ss11e1992 151 Alpha AXP Architecture and Systems

completed or has never started, i.e.,partial exe- intlepenclent uptlates to the saine alignetl datum cution of an instruction is never observetl. will be consistent. This property causes serial- ization of the independent re;id-niodify-write Granularity is tlie property of memory n~riteson sequences ant1 is not guaranteed for an niultiprocessor systems such that independent unalignetl datum. writes to adjacent alignetl tlata produce consis- tent results. The terms byte, word, longword, Word tearing is tlie property of alignetl memory quadwortl, and octaword granularity refer to writes on multiprocessor systems such tli;it a writing I-, 2-, 4-, 8-, and 16-byte size adjacent reader inclepenclent of the writer c;in see partial data. results of the write. Interlocked update is the property of memory 8. N. Kronenberg et al., "Porting OpenVMS from updates (read-modify-writesequences) on niulti- VAX to Alpha AX[-'," Digitul Tech~zicalJo~~rrzd, processor systems such that simultaneous vol. 4,110. 4 (1992, this issue): 111-120.

152 Vol. 4 ATo 4 Speciol Isslie 192 Digital Techrrical JorrrrrnI Jeflrey A. CofJler Zia Mohamed Peter M. Spiro

Porting Digital's Database Management Products to the Alpha AXP Platform

The cornerstowe softulare conzpolzent of higb-end production sjatevns is a database rnancrgenzent y~stetnDigital has successfklly ported the DEC Rd6 for Open Kt/.$ rela- tiolzal dc~t~~buse~narzage1tze1zt s.yste~?z and the DEC DB,IlS for OpenKrrs nettilor15 d~lt~dxrsenzanagen~erzt systelrz to the Alpha AXPplatforln Rd6 nrzd Dfi1ll5 z1ler.e per- haps the most co~~zplexh~)e~eclproducts to be ported The tzght couplii~goj these two products to the Open KVS VAX system made theport a challetzgi~zgtcuk. To avoid the fz~tz~reproblem of integrati~zgtu~o source code bases, the porting teailz decrded to cise a common code base and to overlap current VAX development with the Alpha AXP port. The goal ZLIC~Sto l~ronidecln easy ~nigmtionpclth for softzvnre products to the Alpha AXPplc~~orm

Digital is one of a small number of ventlors conipet- transaction processing and information manage- ing in the high-end, complex production systems ment products such as <:I)I>. A<;MS, and DEC RALLY, market. Applications for this market support intl~~s- which integsatc with the Rclb and DBMS systems. tries such as banking, stock exchanges, telecomrnu- Database management systems are anlong the nications, ant1 information services. The Alpha AXI' most conlplex of all software protlucts. Applica- platform is ideally suited to meet the response tions expect these systems to have 7 by 24 availabil- time, throughput, and availability requirements of ity, sophisticated concurrency capabilities, fast data these applic;ttions, since it offers increasetl perfor- access, high-speecl backup and restore mecha- mance while maintaining the superb :~vailability nisms, and largc buffer pools. 17) provide such hlnc- characteristics of VMScluster systems. tionality, the Rtlb and 1>13>IS products make Although high-end production systems involve a extensive weof the OpenV~\4S\!AX system, the VAX collection of software packages, the cornerstone run-time libraries, and the HLISS and VAX MACRO-32 software component is a database management programming languages. The current release of the system. Digital offers two database management product set uses more than 100 system services or systems for high-end commercial systems: I)E<: ILIb run-time library c;~lls.The two protlucts utilize for OpenVMs, a relational database n1;lnagement almost every RI.lSS D1IILTIN function, it.,a machine- system, ;lnd DEC DBMS for OpenVMS, a network specific function c:ill tli;it generates in-line code. (<:OI)ASYL) database management system. Digital Combined, Rtlb and 1)HMS comprise more than 30 had to port the DEC Rdb for OpenVMS VAX and DE<: different images. The products run in elevated pro- DHMS for OpenVMS VAX database systems to the cessing modes, both executive and kernel, and Alph;~i\XIj pl;~tfornias early as possible to continue inclucle user-written system services. to compete in this commercial arena. The resulting Further compounding tlie coniplexit)~of porting products are the DEC Rdb for OpellViMS AXP and tlie Rclb ant1 DBMS software to tlie Alpha AXI' plat- DE<: DBMS for OpenVivIS AXP systems. (Since these form is the fact that they are niatilre products; DHMS two products for tlie Alpha AXP system are the was released in 1981, ltclb in 1984. Because varioi~s same as those for the VAX system, hereafter, we system capabilities did not exist in the early 1980s, will refer to the protlucts as Rclb and III3MS.) the two database n1an:igement systems include Atldition;illy, both software protlucts clrive many code that is no longer required. For example, both sales of Digital's OpenVMS operating system ant1 prorlucts have code to move bytes from one data

Digilal Technical Jozrrnnl Vol. 4 IVo. 4 .S/~rcictllsst~e 1992 153 Alpha AXE' Architecture and Systems

type to another. Also, during image runclown, the The DBMS procluct also provides language pre- products rely on undocumented, operating system processors, an interactive query front encl. ;uncl behavioral patterns such as the ;~synclironous other software necessary to define, create, ancl system trap (AST) delivery protocols. In ;tddition, manage data in simple or complex databases. In the lidb software contains a niotlifiecl version of the contrast to Rtlb, I>IlMS provides a CODASYL inter- Ope11\~1\4sSORT routine. face to the d;it;ib;~se. Rdb and DBMS were initially clesignetl to run Fjgure 1 shows tlie relationship of the Rtlb anti only on the OpenVMS VAX operating system. L)BMS software products to the KODA d;~tab:ise Consequently, both products heavily utilize VAX- kernel. specific features for performance gains.' For exam- ple, Rdb generates VAX machine code routines as Porting Policies part of query execution plans; the machine cocle is Initially, we cle\~elopeclpolicies to guide our port to carefully generateel for maximum execution effi- the Alpha ,\S1' platform. These policies, which ciency. This tight coupling of Rdb ancl DBMS to the applied to the KOl>r\, Rtlb, anel I>RklS teams. mere OpenvMs vAX system made the port a challenging clesigned to simplify the port ;ind to ease long-term task. maintenance requirements. Since the OpenVNJS ant1 BLISS groups were busy with their own porting projects, mein the Database Systems Group had to ;iccomplish our port with lit- Common Source Code Base tle outside help. The task was noteworthy because, Our most important clecisiorl was to have a com- by necessity, the team had to port its procluct set to mon source code base. 'That is, mie wantetl to 1i;ive the Alpha AXP platform earlier than most of the one set of source code that could be compiletl a~?d other porting groups. At the same time, Rdb and run on either a VAX or an Alpha AXP system. At the I>IlkfS were perhaps the most complex layered time we began our port, the OpenVMS group was proclucts that \voulcl be ported. Our goal was to the only other software group that had started their port these two proclucts in a timely fiisliion, so that port, anel they had chosen to have two distinct cock lligital woulcl truly succeetl in providing an easy bases. (The OpenVMS AXP porting schedule dic- migration path for software products to the Alpha tated the choice.) So with respect to code base, the AXP platform. path we chose was untested. \Ve also decidecl to In this paper, we first present a brief description maintain common command procedures to con1- of the architecti~reof the two database manage- pile, builcl, and link, and common regression tests ment system products. We nest clescribe tlie guitl- between the VAX and Alpha IGYP systems. ing policies we formulatecl to allow the port to A primary season for our code base decision w;ls proceed as efficiently as possible. Then, we docu- that we did not 11;ive the resources to manage two ment porting issues that we resolved for the two different code b:ises. Also, :~lthouglitwo divergent products. Finally, we summarize our experiences code sources woulcl have ;illowetl for a stable cocle related to this effort.

Product Architecture lligital is unique in tlie tlatahase industry in that we provitle two different types of clatahise manage- RDB DBMS ment systems that layer on top of the same database kernel, which is called KOIIA. The KOIIA kernel provicles journaling and recovery, locking, access nietliods (e.g., B-tree, hashing), recorcl ancl page KODA DATABASE KERNEI nianagenient, and buffer pool m:in:igement. The Rtlb software provides 1:ingu;ige preproces- OPENVMS OPERATING SYSTEM sors, an interactive query front end, a ca1l;tble inter- face, catalogue management, query optimization, and relational operations such as join, select, ancl project. Rtlb supplies a relational interface to tlie Figure I Rel~~tionshipof Rdb n77d DB1Lf.S cl;it;ibase. to the KODA Dat~d>nseKer~~el

154 Vat. 4 No. 4 Sjx~JnlIsstti? 1992 Dkitnl T~L.~/JIZ~CO/JOI~/"II~/ Porting Digitc~I'J'Database Mcirt~ige~rterztProducts to the Alpha AXP PlcrfJi,r~n

base witli wliicll to begin the Alpha AXP port, the On the first pass, we decidecl to port the \'AX cocle group strongly wantetl to avoid having to merge the wjthout designing any new algorithms. We tlitl two code bases at a future date. Consequentl}: clean up some code for style, conventioti, and per- since our prelirnin;lry investigation indicated that a formance, but basically, the Alpha ,+XI-' relei~se single code base was feasible and that we could remains functionally equivalent to the latest \'AX hick most of the platform dependencies through release. the superb macro capability of the BLISS language, we proceecletl with tlie common source cotle Correct and Fast Code Execution implementation. The single code base allowed us to We clitl not prioritize our effort to first, be correct, built1 and release Alpha AXP ant1 VAX versions of our anti second, be kist. We clecided that we must be products at the same time. correct a~zdfast on certain key issues. For ex;rmplc, 011 VAX systems, our argument-passing mechanism Concurrent Releases i~tilizetlthe argumetit pointer (AP). To minimize Our release scheclule complicated the process of code chatlges, we could have used the AR<;I'?'II con- adhering to the single code base policy. To meet the struct it1 the BLISS cross compiler. However, ARGITR schedule, we hat1 to overlap some of the Alpha AXP is inefficient aml, therefore, not appropriate for our port with our current VAX releases. That is, the sce- needs. Consequently! we ensured that our new nario we followed was Nm': work on a VAX release; argument-passing design was efficient, even complete all necessary code changes; stabilize the thougli cloi~igso was time-consuming. release; ant1 then create ;I newer set of sources for the Alpha )jXI-' port. Rather, for the beginning por- Minimizing Platform -specific Modz~les tion of the Alpha AXP port, we also had to change Cotle contlitionalizatio11, i.e., producing sep;lr;lte source code destinetl for a \'AX release. Tlius, if a code for tlie VAX and the Alpha AXP platforms, module had to be changed for the earlier VAX requires various levels of code duplication. For release and the same module bacl already been example, the process may require the duplication ported h)r tlie Alph;~AXP release, the engineer had of an entire module, routines within a tnodule, or to propagate the code change to the Alpha AXP certain lines of code within a routine. To rnini~iiize source cotle. the amount of cotle duplicatecl, we condition;~lizetl To minimize the effect of double cocle changes, on the srnallest code segment possible, using a sen- we first worked on those modules for the Alpha sible approach. For example, when forced into AXP release that were reasonably stable in the cur- using conditional code, we avoidetl duplicating rent VAX code stream. For example, the BLISS motlules by choosing to keep within a single mod- REQUIRE files that we use for data definitions were ule. Ideally, we conditionalized just a few lines. reasonably stable for the VAX release by the time the Wherever possible. BLISS macros were motlifiecl to Alpli;~AXI-' port began. The modules that did not hide tlie code conditionalization. change for the vt\X release were also good cancli- dates for helping us to avoid making double code Rd6 Is Rdb changes. When we finally began to port the bulk of \Ve wanted our database management protlucts to the motlules, they were mostly stable and, as a "look and feel" the same on an Alplia I\XP system as result, only bug fixes for the VAX release required they did on a VAX system. So, to paraphrase from the that we manually modiQ the same module for the OlxnVMS operating system maxim, we wantetl Rdb Alpha AXl' release. to be Rdb! That is, the ported Rdb should have tlie F~rrtherrnore,once we began work on the Alplia same utilities, the same data structures, tlie slme AXI' release, we neetled the capability of being able data tlefinitjon c;~pabilities,tlie same tlata manipu- to compile, link, ;mtl test on both the Alpha AXP lation constructs, etc., as the DEC Rdb for 0penV~MS ant1 VAX platforms. So m7e had to modify our devel- Vhx' j>rotluct. Incorporated in this tlesire h)r s;lme- opment environment to allow us to identify the ness was the fi~ndamentalpoint tl~atwe were not code change session as either an Alpha ,UP or a VAX going to change the on-disk structures. IIRMS was session. ported witli the same goal in mind. Ab New Functionality No Changes to On-disk St?~zrctures The Alpha AXI' release of the database management The KODA kernel stores recortls on database pages system PI-oductset contains no new fi~nctionality. Unfortunatelj: the database page is not nat11r;llly Mpha A;YP Architecture and Systems aligned; page heatler fields and fields within the Varianted code reconls are not aligned. Although aligni~igthese Data alignment and field resizing fields \vo~ilclboost performance, to realign all the structures 011 the d;ltabase page woulti req~~iretlie Argument- as sing mechanism tlat:ib;we to be unloatled and then reloaded. Current BUILTIN functions customers c:innot afford the downtime needecl to perh)rm the conversion, so we decided to maint;~in VAX testing the sztrne p;tge/recortl structure. Furthermore, by The CALI.('; mechanism and AP references m;~int;iiningthe same on-disk structure for the VAX zinc1 Alpha .\>(I' tlatabases, we tlo not preclude VAX MACRO-32 modules f~titseconcurrent access to the database in a Message file support mixed-architecture \'MScluster. Thus, our present clesign tloes not require an unload/reloatl opera- Vurbnted Code To simplify conditional code, we tion, since performing that action woultl be too added a set of literals, for example KOD$K-VAX or 1i1~1cliol an impeclime~ltto migrating to the Alplia KOI)$K-ALPHA, that can be usetl in all our BLISS AXI' p1;ttforrn. However, we do plan to investig;~te modules. We could then use these literals to concli- the potential performance boost from aligned tionalize code. The code example shown in Figure p;tges/reco~-cls;inti, if the gain is substantial, to offer 2 illustrates the conditionalizing of the PROBE some alignment solution. Note that tliis section instr~~ction.The I'liOBE instruction checks the refers only to data structures tied to on-disk struc- re;~d/writeaccess of a memory location. On Alpha tures. We did align all in-memory structures. ant1 ASP systems, the instruction is quite different from we cl;~bos;itcon this topic in the next section. the corresponding instruction on VAX systems. However, BLISS easily handles tliis difference in :I Porting Details macro, which allows us to change the name and the In this section we describe a general set of issues order of the arguments, pass arguments by value ;1nd solutions that ;~pplietlto all the groups involved instead of reference, and use an offset instead of a in porting the tlat;tb;tse management system soft- length. By developing such a macro, the actual ware to the Alpha AXP platform. We then explain source code tlitl not have to change. sorlie of the more interesting issues and solutions pcrt;iining to each group. Ll~itctAligtztnent and Field Resizing On the first pass, we it~imecliatelymodified all in-memory data strilctilres so that they were naturally aligned. 'This (;i)??znzoizIssues step avoided incurring a significant performance t\ t\ collection ot'genes:~lposting issues applietl to tlie penalty on the Alpha AXP platform. In atldition, Rtlb. I)lnls, :lncl KOIIA groups. For ex;trnple, all since no single Alpha AXP instructions exist that groups needed the cap;tbility to contlitionalize could be i~setlto easily manipulate bytc5 or wortls, code jn ;I motlule. so that the compiler on an Alplia 111:11iy of our in-memory byte (8-bit) and word /\XI' s!.atc.~n \vould procluce one set of object cocle. (16-bit) fieltls were changed to longwords (32 bits) ;~ndthe compiler on a \%X sj~sternwoultl produce to reduce the object code size and i~nprove ;tnothcr set. <:on~rnonissues were: performance.

$PROBER (BASE, LEN = 4, MODE = 0) = %IF KOD$K-ALPHA %THEN (BUILTIN PAL-PROBER; PAL-PROBER (BASE, LEN - 1, MAX (MODE, $PREV-MODE))) %ELSE (BUILTIN PROBER; PROBER (%REF (MODE), %REF (LEN), BASE)) %FI %, I

Figzire 2 Conditiorzalized PROBE Itzsh'uction

156 \+,I. 4 :Vo. 4 Speiinl Issue 1992 Digital Techtzicrrl Jourtrnl Porting Digilnlj: Dat~~lbase

Once we aligned tlie in-memory data strilctilres, support library. Again, we used BLISS macros to two groups of data structures remained unaligned: solve the problem. Essentially. our macros catego- those tied to the database root file, which records rized the RLJlLTINs ant1 then performed the appro- database parameters such as associated files and priate expansion, basecl on the category. For database settings, and the database pages that actu- example, the PROHE IlIllLTlN differed markedly ally contain the data records. Since the database between the \AX ant1 Alpha ASP implementatioiis, root file is relatively small (i.e., less than 100 blocks as indicated by Figure 2. in size), it was aligned also. Thus, the root file is automatically re-created in a conversion that VXX Testing Another general problem that we occurs when upgracling a database product to sup- had to guard against was the possibility that tlie port both the t\lplia AXP and VAX architectures. Alpha &\'IJ cocle changes wo~tlclintroduce bugs into Since this conversion invariably takes place when the VAX versions of the products. Consequently, we converting to a newer version of either the Rdb or adopted a policy whereby all Alplia AXP changes the DBMS product, the additional realignment of had to be testeel on :I VAX system. Tliis policy the root is a minor additional expense. ensured that we maintained a stcady pattern of cor- Thus far, we have not pursued any potential mod- rect VAX behavior. Also, since tlie VAX environment ifications of the page data structures, s~iclias align- was more stable than tlie i\lp11;1 )\XI' environment, ing them once they are fetched into memory. Note testing on a \%X system helped tremendously in that these structures do not generate un:~lignetl identtfying and fixing bugs rel:~teclto the port. faults. Insteacl, they force the compiler to generate a few aclditional instructions to hanclle the odd The CALLG ~lfcchar~is~~ia~zdAP Refcreizces The alignment. Alpha iD(P platform does not tlirectly support CALLG, a VAX procedure calling mechanism, and Argurne~zt-J~clssingMechanism The VAX ;Inel references to the AP. The <:.-\l.I.<;~ilechanisni ancl AP Alpha AXP argument-passing mechanisms are references are slow since they :Ire sin1ul:ited iintl entirely different. Rather than using the st;lntlarcl automatically allocate st;lck space to accommotlate BLISS mechanism, the existing code clepended the largest possible argument list (i.e.,255). In sitil- strongly on the VAX argument-passing mech;~nisms ations where perforin;~ncewas not critical, for by using BLISS macros to reference arguments from example, in an error hancl Ier, we replaced CALLG by the AP. This approach was not possible on Alplia a standard routine call on both the VAX ant1 the AXP systems due to the lack of an AP register. (You Alpha software versions. When performance coulcl force the AP to be generated, but that process was an issue, we usetl conclition;~lcode to retain the would be slow and would waste memory.) CALLG mechanism for the VAX code and to use ;I Therefore, we changed our procedure headings to standard routine call in the Alpha AXP code. 111 declare a generic formal parameter list (e.g., 1'1 instances where the (:t\LL<; mechanism is used to through PN) for both the Alpha tLYP and the VAX pass the argument list to the next routine, we con- systems and then developed another set of BLISS structed an arguiuent vector ;~ntlreplaced CALLG by macros that allowed us to bind to the arguments a special call linkage. The new mechanism passed based on the generated formal parameter list. Since the pointer to the argument vector b!, means of a this process involved changing every routine decla- single parameter or a global register. This solution ration, we developed a text-processing tool tIi;lt guaranteed good performxnce on both VAX :lntl would autoniatically change the routine headings Alpha AXP systems yet avoided any conclitionalizing and thereby avoid the expensive ant1 error-prone of the code. taslz of nianually changing each routine.

VAXMACRO--32Mo~lzilcs For ;I variety of reasons, BlJILTIN Fzinctions Together, the KODA, Rdb, ;u~itl we used VAX LMACRO-s2to code some rot~tinesin DBMS code uses most of the BLISS BlJlLTlN hlnc- the Rdb, DBMS, and KOl>A softw;~re.For example, tions. This fact presented a problem for the team basic operations such ;IS record compression, record porting the software to tlie Alpha AXP platform. expansion, ant1 buffer initii~lizationare performed Some VAX BUlLTlNs were not supported, some through calls to VAX MA<:RO-32 routines that are behaved differently, and some were eliminated as heavily optirnizetl for efficient operation. Some BUILTIN5 but emulated by Starlet, an OpenVMS routilies are cocted in \ii\X MACRO-32 for ease

Digital Techirical JOU~HU~1/01, 4 IVO 4 4pe~icrlIssue 1992 157 Alpha AXP Architecture and Systems

of character manipulation. Also. we used u\X changed the mechanism to define the KOD$- sym- ,\.W<:RO-32 to code machine instructions that were bolic values to be compatible with both the VAx not avai1;tble tl~rougha Ul.1Ss BIJIL1'IN function. and Alpha t1)cl' architectures. We adopted various solutions for these VAS First, we removed all .LITERAL declarations from MA<:110-32 routines. For those routines where per- the KODA message file. As a result, all KODA mes- formance was not an issue and BLISS generated sages were defined strictly as RDMS or DBMS acceptable code, we converted to t3I.ISS cotle. For messages. Then, after passing the message source routines where performance wns absolutely criti- file tlirough the 111ess;lge co~i~~ilerto get tlie mes- cal, we rewrote the routine in Alpha AXP mc:~O-64 sage object file, we invoked tlie ANALYZWORJECT to utilize the additional registers. Fin;illy, in some facility to get 3 listing of the message symbol codes c;lses where we could not rewrite the routine in ant1 values for each message. Finally, we wrote a 13LJSS code and did not have tlie resources to con- small utility to reatl the ANALYZWOBJECT output vert to XIA<:Ii\ literal declarations. This process woultl no tiation of the software. For example, in the DBMS longer work since Alpha AXP object files have a clif- interactive query (DHQ) facility, the user can per- ferent format than \IAX object files. As a result, we form the following operzition:

MODULE DBMKODMSG = BEGIN

GLOBAL LITERAL KOD$-ABORT-WAIT GLOBAL LITERAL KOD$-ACCVIO GLOBAL LITERAL KOD$-AIJACTIVE GLOBAL LITERAL KOD$-AIJALLDONE . . .

END ELUDOM

Figure .3 BLISS Cocle to Gerzelz~teKOD ~%IessageDclfinitiorzs

158 VoI. 4 So. 4 .S/)c2ciollssrrc 1992 Digital TechrricnlJorrrtral iWol?nge~?zrntProd~~cls to the Alpha AXP l'/[email protected]

dbq> ! Attach to first database as userl. handler also cleaned up OpenVMS clata structilres dbq> BIND DBI ON STREAM 1 such as the pending AST queue. These OpenVMS dbq> data structures changed significantly for the Alpha dbq> ! Attach to second database as user2. dbq> BIND DB2 ON STREAM 2 AXP architecture. For example, an Alpli;~ AXI-' dbq> system has five pending AST queues instead of one. dbq> ! Establish userl context. dbq> SET STREAM 1 In addition, this Iiancller routine would acquire the OpenVMS scheduler spinlock and perform ,'poor This example has tlie user ;~ttacheclto two differ- man's locktlown," wliicli effectively pages the entire enr databases, DH1 ancl 1>132. To issue queries against routine into memory (since the code cannot incur a either elatabase, the user enters the SET STRM,tl page fault at elevated interrupt priority level, IPL). command. 111 response, KOIIA establishes the cor- For Alpha AXP, code and data cannot be 1oc;lted in rect data structures :~nd stream context for this the same PSECT, so this trick was not possible. database session. This process involves switching Instead, we used the $LK\vsET macro to lock pages dxta structures and stack context. Consequentlj: in memory and then to clean up the Opcn\'h,lS data KOIIA manages its own stack for its executive rnode structures. code ant1 data structures. This stack-switching After we completetl and tested the code, the mechanism is con~plex,and this code is intin1:ltely database and OpenVMS engineering teams clecidecl tied to tlie VAX procetlure calling mechanism. For that such intricacy was needlessly con~plex,and cxa~iiple,whenever a (1ilel-y must stall (e.g., while that the OpenVMS AXl' software could clean up waiting for a lock request), KOIIA saves the current the data structures based on its image control executive mode context and then switches back block and related structures. This example shows throi~ghthe stream cocle oi~tto user mode. This how the OpenVMS AXP system offers clifferent func- iiction allows the process to receive user-mode tionality than the OpenvMS MX system, i.e., the ASTs. This mech:anism essentially saves a call fsame port offered the opportunit)r to clean up existing so that after the user-niocle stall has con~pletetl, mechanisms. KOIIA can set up the appropri;~testack and return to the calling routine by me;ins of the saved call frame. Bc~gcheckDrr~~z/) Mechanism Complex, sophisti- The calling/retum mechanism is entirely differ- cated software products are by nature clifficult to ent for tlie VAX anel Alpha AXP architectures. On tlebug, Most of these products utilize ;I data struc- Alpha AXI-' systems, for each routine, the compiler ture dumping mechanism whenever an internal generates prologi~ecode and epilogue code to man- software or hardware error is encountered. KODA age the routine calling nlechanism. Accortlingly, has a mechanism called a bugcheck (lump that per- the KODA stack mech:inism had to rely on this new forms tliis service. Wlien an unexpected exception mechanism. In adclition, for this level of support, is generated, the bugcheck dump code prints all rel- tlie routine that was coclcd in BLISS for the \iAX plat- evant data structures into a file. In addition, the form had to be cotled in NM<:RO-~~on the Alpha dump includes a stack dump. On VAX systems, the AX1' platform. bugcheck clump traces back clown the st;lck using the saved call frames and prints out all the fielcis in Kc.1.r7e/-1nodeRLIIZL/O~.~JIZ fI~1nd1ers Another esanl- each call frame, the routine name, and tlie argu- ple of KODA's close tie to OpenVMS behavior ments passecl. involved the use of KOfIt\'s kernel-mode rundown In particular, the method for printing the syni- hancllec On \RX systems, in the event of an abnor- bolic name of the routines is especially clever. After m;ll failure, we must cle;ln up certain data struc- linking an image, we utilize a program that scans tures and release resources such as locks or the symbol table (.STR file) produceel by the linker. ch;~nnels.Furthermore, elatabase recovery must Then the program creates its own object file, which start before the image sunclown is completed, so inclucles a relative offset of all the routines ancl their that surviving processes cannot acquire locks on symbolic names. Finally, the image is relinkecl, and resources before the databases are recoverecl. this new object file is included into tlie image in a We accomplish tliis image cleanup through the particular PSE<:I'. When tracing back clown the call use of a user-definecl system service (i.e., a system frames, the bugcheck tlump also checks the special service not defined by the OpenViLlS system), PSECT to locate ;~ntlprint the correct routine name. which acts as a kernel-mocle rundown handler. This dump is an invaluable tool in determining tlie 111 atldition to releasing tlatabase resources, the causes of unexpected errors. Figi~re4 includes two Alpha AXP Architecture and Systems

Saved PC = 000408AF : DIOSFETCH-DBKEY + 0000004F ARG# Argument [data...] ...... 1 00206484: OOOIFCFC 002064F4 0020650C 207COOOO 000277C7 00010000 00020001 2 00000001 Handler = 00000000, PSW = 0000, CALLS = 1, STACKOFFS = 0 Saved AP = 0020644C, Saved FP = 00206430, PC Opcode = EO SR2 = 002646DO: 00000000 00000000 00006918 FFDAA3E8 FFF63770 00000000 00000000 SR3 = 0000BC41: 013A2048 CZFFFFFF FFFFF85E EOO09507 D512A4EO 40000000 18C00040 SR4 = 00264600: 00000008 0020645C 002646AO 00000000 00000000 00000000 00000000 20 bytes of stack data from 0020641C to 00206430: 002646B000000001002064B400000002 0000 ' ....4d .....OF&.' 001C7D08 0010 ' .I..'

Saved PC = 00055241 : PSI$MODIFY-STITM + 00000033 ARG# Argument [data...] ...... I 002064B4: OOOIFCFC 002064F4 0020650C 207C0000 000277C7 00010000 00020001 2 00000096 3 002646DO: 00000000 OOOOOOOO 00006918 FFDAA3E8 FFF63770 00000000 OOOOOOOO Handler = 00000000, PSW = 0000, CALLS = 1, STACKOFFS = 0 Saved AP = 00206490, Saved FP = 00206464, PC Opcode = DD SR2 = 00256042: 00020096 0000005F 00000057 00000000 00000002 OOOIOOOO 002E2A13 SR3 = 00264680: 00000000 00000001 00000008 002646A0 00264670 00000000 00000000 24 bytes of stack data from 0020644C to 00206464: 002646D0000000960020640400000003 0000 ' ....4d ..... PF&.' OOlC7CF8002646CO 0010 '@F&.x(..'

Figure 4 R~~gcbeckDIIIIZ~) routine calls from a stack trace, intlicatetl by the process is simpler than directly creating an object lines of code thi~tbegin with "S;~veclPC." file, as we tlicl previously. Alpha ASI' systems have tio equi\alent to the \:iX E\len though tlie routines provitled this call trace- c~llframes, so it is impossible to use the call frii~iie back capabilit~we were missing the arguments mechanism to trace down through the stack. As passed to the routines, perhaps the most important meiitionetl previously, Alph;~AXP routines i~tilize part of the st~ktrace. The VtlX mechanism cap- prologue ant1 epilogue code for returning from rou- tured this tlata, because very often a bugcheck tine calls. I-'rocetlure clescriptors contain informa- results fro111one ro~itinepassing an improper ;I~~LI- tion such as entry address :mcl register s;nre tnent to another routine. The Alpha AXP system inforn1;ition. does not provicle a way to capture this information, On Alpha AXP systems, another Digital group because the routine calling sequence reuses regis- supplietl ;I set of routines that allows tracing tlie ters R16 through R21 for passing arguments. call sequence. This set provitled the basic calxibil- ity to print the routine calling sequence tIi;it lecl to Porting Rdb an ;tbnormal exception. In ;iclclition. the Alpha AXI' Some issues handled by the Rrlb porting group linker ~)rocluceda symbol t:tble file. However, we were :~ssoci:~tedwith the clisp;rtch code, Alpha I\XP decided to simplify our bugcheck mecIi;~nism. code generation, Rdb precompilers, and Rdb Although we still search thc symbol table file for all system relations. routine addresses, rather tli;~~create an Alpha AXI' object file, we create a W\X JLIACRO-.~~file th;lt Disl~crtchCode The tlispatcli cocle is the topmost inclutles the routine name and address/offset. layer of the Rdb software ant1 is called tlirectly by Then, \vr simply use the Alpli:~hlt\(:RO cross co~ii- the user ;ipplication by means of relational call piler to generate the Alph;~AXI' object, which gets interface (KI) calls.?The main h~nctionof dispatch linked into the image on the second p;lss. In fact, cotle is to direct the user request to tlie correct tar- we changed our VAX bugclieck routine to produce a get Rdb executive (local or remote) for processing. >WCR0-.52 file with routine n;ime and offsets. ?'his On VAX systems, tlie dispatch code passes the user

160 Ibl. 4 No. 4 Special Issue 1992 Digital Technical Journnl Por-tirig Iligitlilk Dritribase 11.Iuiz~igenzelztPr-odircts Lo the Alpha AXP PlntJbrnz

arguments to the Rdb software using the <:hl.LC; DATA ITEM^ I DATA ITEMZI .. . NULL BIT VECTOR linkage.', On Alpha AX'P systems, <:AI.I.<;linki~ge is very inefficient. Therefore, the dispatch code was ch;lngetl to build a user argument vector in the same style ;ISthe VAX argument list, :rnd the pointer to the argument vector was passed as a single par;inieter. The cocle in Rdb was cli;~ngetlto bintl to systems.?Therefore, the algorithm was motlifiecl to tlie user arguments using the offset fro111 the fetch a batcli of null bits into a register. \When all pointer to the argument vector. null bits in the register are processed, the batcli is Using two clifferent calling niecli;~nisrnsin the written and the next l>atcli of null bits is fetched. tlispatch to pass user arguments WAS a c;~reful This approach reducetl the number of load and design. On VAX systems, the existing <:AI.I.<;mecha- store instructions and niacle the code sequence nism was ret;~inetlto ensure backward compatibil- nli~chmore efficient. ity between different versions of the Rclb clispatch, On Aph:~ASP systems. the machine code rou- Rdb layered protlucts, ant1 gateways. A new calling tines generatetl by Rtlb use four different aclclress- mechanism was used on Alpha ASl' sysl.enis to ing modes to access clata items: absolute atldress, ensure gootl performance, since every user request base register pills offset, integer register content, to tlie Rdb executive goes through the tlispatch. and floating-point register content. Each of the Alpha ASt' registers 1112 through R15 is used as a Cock lcCelz~.mto~*Rrlb uses compilecl BLISS cocle base register. Thus, any data stored within 256K ant1 generated machine code to execute user (4 X 64K) of memory space can be accessed effi- recluests. Il~~ringrequest compilation, Rclb gener- ciently. To maximize tlata access efficiency and ates highly efficient routines using the target caching, changes were rnacle in the code generator m;tchine instructions. These routines perform to allocate data densely. To improve performance basic data operations inclucling d;~t;~con\iersion, hlrtlier, data items were allocatecl at quatlword or data movement between buffers, aggregirtion. ;rncl longwortl aligned addresses. expression evaluation. An Alpha ASP code sequence executes more 'The design of the Rdb code generator to protluce efficiently when instructions cxn be multi-issuetl Alpha ASP machine code was i~ncloubtedlythe and executed in p;~rallel.'This can be achieved most complex porting task. Iisc of a niechnnism by reordering the sequence of instructions other than code generation woulcl h;~veretli~cecl while maintaining any chronological depenclency the porting effort. However, at the time we beg;~n between instructions. 1-0 take advantage of this porting Rdb, it was not clear if an alternate mecIi;~- Alpha AXP feature, I%I..ISSn1;rcros were developetl nism woulcl gu;~ranteean accept;lble level of perfor- to reorder and interleave tlie instrirctio~isin a gen- mance. Good performance was consitleretl critical erated cocle sequence. to tlie success of Iitlb on Alpliil ?\XIJ systems. On Alpha i\SP systems, backward branches in tlie TI~ereh)re,we decided to add f~rnction;llityto the code slow clown the exec~~tionbecause of instruc- Rclb code generator to produce Alpli;~ASP cocle. To tion stream invali~lation.~Changes were made in generate efficient Alpha AXI-' cotle sequences, we the Rdb code generator to minimize bachw;rrtl obsrrvetl specific gi1ide1ines.l branches. This cli;~ngeat times increased the size of On Alpha ASP systems, cocle that references data the generatecl code but improvetl the code execu- itenis with incre;rsing memory atltlresses executes tion efficiency Further, Boolean code generation more efficiently. Therefore, the ;ilgorithrn was algorithnis were modified to incorporate branch changed to first order tlie data items by increasing prediction logic; code sequences with a smaller memory ;rtldresses and then generate code to pro- probability of execution were branched out of tlie cess tlie dnt:~. main cotle stream. 'T'his technique maximized tlie In Rdb, e;~chdata item has ;I null bit that indicates effect of instruction stream caching. whether or not the value of the dat;~item is known. As shown in Figure 5, to conserve space, the nirll KLID P~~ecoi~zl,ilc.r=sAn Rclb precompiler prepro- bits of clifferent data items arc stored together like cesses a user app1ic;rtion program that includes 21 bit vector within a record. Loading/storing ;I Rdb statements ant1 repl:lces these statements by null bit is ;In expensive operation 011 Alph;~ASP standard R<:I c;rlls to rlie Rdb software.' The Rtlb

Digital Technical Jourtral Vol. 4 r\To. 4 .SI~ec.inlI.SSLI~ 1992 161 Alpha AXP Architecture and Systems statements ernbedcled in the applications can be Porting DBMS one of three types: structured query 1;lnguage This section tliscusses some experiences of the (SQL), Rdb preprocessors language (Kdbl'KE). or DBMS porting group, namely those related to the re1ation;ll data manipulation language (ftvI;i71'RIEVE. The RclbPRE precon~pileris a proprietary lan- After receiving control, DBMS32 performs some guage interfice to Rclb. The long-term goal is no processing and then, using the <:~1.1.(;mechanism, new functionality and minimal maintenance. So passes the entire argument list to lower-level rou- the main objective was to recluce the effort tines for furtl~erprocessing. These lower-level rou- required to port this compiler. This was achieved by tines, in turn, often pass on the argument list, retaining the existing tlesign and using the Alpha sometimes :IS tleep as five or six levels. MACRO cross compiler to produce Alpha AXP Because we founcl CALLG to be inefficient, we objects froni VAX NlA<:R0-:$2files. tlecidccl to ch:~ngethe primary entry point into the The RD>IL precompiler is also a proprietilry Ian- DBCS. Rather than passing up to 26 separate argu- gilage interface to Rclb. I;nlike the Rclbr~1:precom- ments, DDMS creates a vector of longwords; each piler, this compiler does not produce MX AWCRO-52 longword contains an argument that would have files. So porting it was an e~ts)~ancl straightforw;~rd been passed using a p;lrametel: Once this vector is task. created (often during the compilation phase for the precompilers), l>UM$J2-\IEC (the \fE<:'I30IIversion Rd6 Sj~sten?Relutions Rdb uses system relations of DBM$~~)is called with a single parameter: the to record infornlation bout the user relations and atlclress of the argument list. An ex;tniple is show1i the database. Tlie system relations are storetl on in Figure 6. disk and loadetl into memory on deni;~nd.Since Layereel products using DBMS were ;idvised of the they are frequently referenced during user request new interface and were requested to use it as soon processing, efficient access to data in system rela- as possible. I-Iowever, since the cli;~ngedinterface tions is critical for performance. On Alpha AXP was incomjx~tiblewith some existing proclucts, the systems, accessing data froni memory is efficient if old interface was retained. DB~$32-\/1l<:uses the it is located on either a longword or a quadword new interface, ;~ndDBM$~~ homes the ;Irgument list address boundary Therefore, changes were niatle (thus creating the above vector) ;md then passes to the in-memory system data structures to align that, by reference, to DBM$~Z-VE<:. each dat;~fielcl to at least ;I longword address bound- ary. Further, data fields that were a byte or ;I word Sllyyorl o$ H-FLOAT Data opes Tlie H-FLOAT were espancled to ;I longwortl. data type is fully supportecl on tlie vi\X processor, The data in system relations was accessed by but the Alpha AXl' processor has no high-precision using RdbPRE st;iternents embedded in Rdb source floating-point formats. Although Eicilities esist on modules. Porting such Rdb modules posed a Alpha AXP processors to read an H-ITLOAT data dilemma. To compile these mocli~les, first the type, no such Pdcility exists to write :in H-FLOAT RtlbPRE compiler had to be ported to the Alpha t\XP data type. platform. Vice versa, to port and test the Rdbl'RE As a result, I)HMS customers arc atlvised to elimi- precompiler, Rtlb hatl to be portetl ant1 running on nate any H-FI.OAT data in tlatabases beh)re moving the Alpha AX[' platform. Moreover, Rdbl'RE was no them to an Alpha AXP system. Tlie 1)1%,\4S Ilatabase longer a strategic language interface. Therefore, Restructure IJtility (DR'II) can be llsetl to change all new BLISS macros were clesigned that replacecl the H-FLOAT data to ;unother common floating-point embeddecl RdbPRE st:ltements. form:~t. Po~-tit~'yDi~,ilal'.s JlatuDase Manag,oone?ztPI-ocllrcts to the Alpha AXP Platforr~

DBM$32 INTERFACE DBM$32_VEC INTERFACE

ARGl = FIRST PARAMETER ARGl -t ARGP = SECOND PARAMETER

FIRST PARAMETER

ARGN = NTH PARAMETER

HNTH PARAMETER

Figure 6 DBCS Routine-calli~zgInterface

In preparation for mixed VAX and Alpl~aAXP 5. Most of the changes that we made in DRMS were VMScluster systems, DBMS was modified such that not conditional, that is. tlie changes would affect tlatabases with H-FLOAT data can still be accessed. both VAX and Alpha AX]' systems. As a result, we However, a rim-time co~iversionerror occurs if were able to test our code on vh\; aystems with a H-FLOA'I' data is accessed from an Alpha AXP fairly I~igl~degree of certi~intythat our code was system. correct, barring any operating system or com- piler bugs. Use of AfJl) The Alpha User-mode Ilebugging We did eventually get an AIID version of DBMS Environment is a set of facilities that aids testing working. However, since we spent a considerable atit1 tlebugging of native Alpha AXI' code on any amount of time ;~ccomplishingthis, and we did not OpenVMS Vt\X system. AUD allowetl as much Alpha actually find any bugs in our code by using AlJD, we AXP user-niode code as possible to be ported imme- decided not to use MJI) in further areas of DBMS. diately to the Alpha AXP system and to be substan- Shortly after using AUI>, we received our Alpha tially debugged before Alpha AXP hardware was Demonstration Unit (t\1)11) and coultl test our cock avail;tble. Early in the DBMS porting effort, we used on actual Alpha AXP hardware. The only problems A[~I>to verify our port and to ensure that our code we found, which were missed tluring our initial was working correctly. port, were VtlX-style argument list assumptions. However, several issues hampered the success of Some of our code assunietl that routine arguments using AlJl) in porting the DBMS software: were contiguo~isin virtual memory; on Alpha AXP I. I)RMS makes frequent use of signaled excep- systems, this is not the case. tions. AlJl) hat1 difficulty in handling exceptions that cross the boundary between the Alpha AX]) Conclusion and VAX systems. To conclude tlie paper, we discuss our plans for per- 2 1113MS uses special stack m;~nipulation code formance testing ant1 our reflectiolis on the porting (stream code) to perform n~ultithreadingh~nc- process. tions. MID woultl become confusetl if the stack were to change unexpectedly Performance We have only begun our performance tests. Cur- 3. At the time we were using AUD, the l>BCS hat1 rently, we are running the 'TPC-B performance been ported, but KODA (i.e., the low-level ser- benchmark. We also plan to test against all TPC vices used by the DRCS) had not. As 21 result, benchmarks (A, B, and C) and other benchmarks many variables needed to be defined as crossing such as the Wisconsin benchmark. We are trying to the boundary between the Alpha AXP and VAX minimize the amount of time spent in PALcotle, systems. The setup time to define this informa- decreasing the code path length, reducing the cycles tion was significant. per instruction, and optimizing internal algorithms. 4. Since the code was still running on a VAX proces- Planned testing will also evaluate the effect of sor, many VAX dependencies were not c;~ughtby additional data alignment. As mentioned earlier, the In p;~rticulal;system services that ch;inged ease-of-migration issue is par;lmount for our current in subtle ways would work as before because the customers. Consequently, we have not realigned operating system was still the OpenVMs system. the database pages because that action would

Digital Technical Journal Vol. 4 1Vo. 4 Sl~ecirilIssire 1992 163 Alpha AXP Architecture and Systems require too much clowntime. Nevertheless, we do Surprisingly the code dicl not grow appreciably not want to preclude new customers, or current in size or complexity. One strength of the Kdb customers who need the performance boost, from and IlUMS softnrare has been the ability to easily utilizing a properly aligned database page. To test modify the code and to add new functionality the potential performance improvement, we plat1 Even after the port, m7e fincl that the products to create ;I test database that is completely aligned, ;ire as malleable ant1 as easy to modify as before. in memory 21nd on disk, and compare tlie 'TPC per- We h,und i~nreportedbugs in our \hX products. formance against the standard database. Virtually all the groups involved did a masterfill Reflections job. 'The program team and various Alpha )\XI' com- mittees anticipated potential issues and ensured At the beginning of the paper, we stated that oilr thi~tthe program proceeded smoothly and pre- goi11 Mias for Digital to provide an easy migration clict:ibly. The cross compilers from the language path to tlie Alpha AXIJ platform for software prod- groups worked superblj~.The OpenVMS AX[' and ucts. Although we encountered some difficulties, hardware groups delivered their products on time, DBMS we believe our Rtlb and porting efforts attest ;~nclwhen a user logs in to an Alpha to(P system, the to Iligital's success in this endeavor. OpenVA4S AXP system is not only familiar but faster. As one example of how the experience influ- enced our ;~ppro:~chto porting, we had to 1e;lt-n new methodologies, practices, and system behavior Acknowledgments on the Alpha ASP machines. For instance, when The successf~ilport of the Rdb and IIHMS software stepping through a particular code sequence n~itli to the OpenViMS AXP operating system was a result the debugger, we would entl up in an infinite loop; of the contributions made by many of the engineers if we just rrrn the cotle, the sequence would work. in the Database Syste111s G~oLI~.The autliors sin- Althoi~gh this behavior was documented, we cerely acknowledge the effort of each engineer in encountered the problem several times before we achieving tlie project goal, that is, to be able to fully untlerstood the ramifications ant1 appropri- quickly offer correct versions of Rdb ant1 DBMS on ately changed our development methods. the Alpha AXP platform. Finally, an unsung hero in Overall, the porting effort had the following pos- the company-wide effort was Digital's VAX Notes itive results: coninlunications facility. VAX Notes h~nctionetlas ;in excellent medium for identifying and sIi;lring The port allowed us to clean up our cocle. even problems ant1 solutions. though we tried to avoid algorithm changes. Ijcc;iuse we had to port ancl review ever). line of cotle. we managed to moire the code to ;I more References consistent cotling convention. 1. T. Leonartl, VA X Architectzir-e Refesence iMa17ual The port :~cterias a learning experience for nlost (Betlford. MA: Digital Press, Order No. EY-3459E- of the engineers. Most niature products contain Ill', 1987), home code that has not been modified in years. 2. /lMI Hnizc[Dook (Maynard, hw: Digital Equipment The port forced us to review ancl understand Corporation, Order No. AA-GV71A-TE, 1986). such code becluences As a result, we ended up with more knowletlgeable engineers. 3. O/~erzKb%\' Cnllilzg Stnndald (Maynard, M/\: Digital Equipment Corporation, Order No. AA- Thc port allowetl us to transform the code into 1'QY2A-TK, 1992). a more portable state. As we moved away from tight ties to \/AX behavior, we simplified future 4. R. Sites, ed., Alpha Architect~ire Rcfere~zcc tasks such ;IS moving to the OSF/l alitl Wi~itlows Mar?ual (BurJington, MA: Digital Press, Order NT oper:lting systems. NO. EY-IJ52OE-DP, 1992).

Although overlapping current VAX development with the Alpha AXP port slowecl dow~ltlie port- ing process, the tlecisiorl to use a common code base elimin;~tedthe future need to integrate two divergent source codes. James K Colombo PamelnJ Rickard Paul Benoit

DECnet for OpenVMS AXP: A Case History

The DECnet for OpenVlMS AXP networking software facilitates the itztegration of Open R?fSAXP systems into existing DECnet computing environments. This new soft- ware product supports application migration by prouidiizg the follozui~zg net- working capabilities: support of compatible libraries, consistent application programming intetfaces, and the assurance of a common semantic operation with the OpenE\fS VAX system. The tecim implemented a phasedporti~zgprocess and ee- cuted the project coopemtiue& The eflort resulted in a solid knowledge Ocise with ~uhichto approachfiltureporting u~zdertakingsUsing comnzon code wwherepossi- 61e and avoiding architectz~re-specificcode were lessons learned during the project.

The DECnet for OpenVMS IU(P networking software and identifies specific problem areas. It concludes product plays an important role in the integration with a summary of the lessons learned during the of OpenVMS AXP systems into existing DECnet com- course of the project. puting environments. The availability of DECnet software on the Alpha A?(P hardware platform facil- Project Overview itates application migration. The networking capa- In addition to presenting the DECnet for OpenvMs bilities needed to support this migration activity AXP features, this section details how we derived a include support of compatible libraries, consistent project schedule and gives an overview of the soft- application programming interfaces (APIs), ant1 the ware components. assurance of a common semantic operation with the OpenViMS VAX system. The network features such as network file transfer, remote file access, Software Code Base remote login, downline load, and local and remote Prior to the formation of a team to port a DECnet network management allow the OpenVMS MP product from VAX: to the Alpha AXP architecture, system to participate fully in a DECnet network. the DECnet-VAX development group completed The purpose of this paper is to describe the pro- a feasibility study of porting DECnet-VAX Phase IV cess of porting the DECnet-\AX product to the to the Alpha AXP architecture. This effort was nec- OpenVMS AXP operating system. The DECnet-VAX essary because the DECnet-VAX software was not product consists of networking software written in designed with porting in mind. The study con- the MACRO-32 and BLISS-32 programming languages. clutled that it woultl take four engineers twelve The software contains privileged system code, months (i.e., 48 person-months) to port DE<:net- device drivers, and user-mode utilities. VAX to the OpenVMS AXP operating system. After This paper is divided into two major sections. examining the proposal and investigating the alter- The first section presents an overview of the proj- natives, we decided that the best approach would ect, including discussions about the DECnet fea- be to start by porting DECnet-VAX V5-4.3,a Digital tures supported in the OpenVMS AXP operating Network Architecture (DNA) Phase IV implementa- system, the project schedule, and the major DECnet tion.' One of the n~ostimportant factors in making for OpenvMS AXP components. The second major this decision was that this software version was section details the process of porting DECnet-VAX in external field test and was nearly ready for software to the OpenvMS AXP operating system, shipment to customers. Another consideration was including testing and clebugging. This section pro- that some very important fixes had been made in vides information on nonportable coding practices that release, and we wanted to offer our customers

Digital Technical Journal &)I. 4 r\'o. 4 Special lssr.te l992 165 Alpha AXP Architecture and Systems

the highest quality possible in the first version of of code per software module. The Motlule Count DECnet for OpenVMS AXIJ software. Since that time, Method uses the number of moclules per software we have continued to improve our DECnet software component to determine the workload. Both meth- for the OpenWS AXP operating system ancl have ods talze into consideration the programming lan- recently incorporated some fixes from DECnet for guage used in each module. Table 2 presents details OpenVMS VAX V5.5-2. of the component module count and sizes. Me fur- ther categorized the software being ported into DECnet for OpenVMSAXP Features three groups: privileged code, device driver, and user-mode utility The software type was used to The first release of the DECnet for OpenVNIS ASP estimate tlie amount of time needed for linking. In networking product is packaged with the OpenVMs general, we allocated more time for privileged code ASP operating system. The initial offering includes ant1 device drivers. the support of DECnet Phase IV protocols running The estimates were used to derive a first-pass over Ethernet or fiber clistributecl data interface scliedule ancl to determine resource allocation. A (FDDI) local area networks. This release supports number of other factors affected the final schedule. distributed task-to-task communications using the A major factor that we could not quickly estimate same set of documentecl programming interfaces was the portability of the software. The software supported in the DECnet-VAX environment. At this techniques encountered and described in this time, DECnet for OpenVMS AXP software does not paper such as coroutines, up-level stack references, support wide area communications devices and and condition code usage had a direct impact on host-based routing. Future releases of DECnet for the schedule. Also, during tlie first three months of OpenVMS AXP may include symmetric multi- the project, significant time was spent learning processor (SMP) and cluster alias support. how to port code. During this learning period, we developed the skills, knowledge, and techniclues Project Schedule used throughout the remainder of our porting The DECnet for Open\lMS iD(P project scheclule was work. primarily driven by the overall OpenVMS i\xll oper- Once we established the estimation metrics, the ating system product schedule, with the DECnet com- data was compiled ancl time estimates calculated ponent scheduled for clelivery in November 1991. for each component. Tables 3 and 4 show the aver- The DECnet-\RX porting project officially began in age amount of time required to port each DECnet early January 1991, after the code base was selected. for OpenVMS AXP component. Based on these calculations, we estimated that it Porting Estimates After analyzing the work would take 13 person-months just to port the required to achieve the port, we developed general net-VAX software. We then used project man- porting guidelines and estimates based on a num- agement software to plan the scheclule. The schecl- ber of factors, including the language the software ule shown it1 Table 5 indicated that it would take 48 was written in, the amount of software to port, and person-months to meet the OpenVkiS AXP sched- the number of software component modules. We uled completion date of November 22, 1991. We then combined these estimates to determine an made our first network connection onJuly 25, 1991, overall project schedule. Table 1 presents the 20 person-months into the project. Although much guidelines we used for the porting estimates. work remained, we were well ahead of the We used two methods to estimate the amount of November target date. work required to complete the port. The Module Since we were ahead of schedule, we assisted in Size i~lethodtaltes into account the number of lines the porting of other components, including RTPrU), CTDRIVER, RTTDRNER, ancl REMACP, all cliscussed later in the paper. In addition, we were able to add Table 1 Guidelines for Porting Estimates support for FDDI. Lines of Code Module Count Language (Per week) (Per week) Milestones The OpenVMS AXP project schedule BLISS 10,000 10 consisted of a series of functional internal base levels numbered one to five. In terms of the whole MACRO 3,000 5 OpenVMS AXP project scliedule, DECnet for

Val. 4 No. 4 Special Issue 19.92 Digitul Technical Jozir~tal Table 2 Component Module Count and Sizes

Average Software Module Number Number Component TY pe Language Count of Lines of Lines

DTRIDTS User MACRO EVL Privileged BLlSS HLD Privileged MACRO MIRROR Privileged MACRO MOM Privileged BLlSS MACRO Subtotal NCP User BLlSS MACRO Subtotal NETACP Privileged MACRO NETDRIVER* Driver MACRO NlCONFlG User BLISS NMLt Privileged BLlSS MACRO Subtotal NETSERVER Privileged BLlSS

Notes: * Includes estimates for NDDRIVER Includes estlmates for NMLSHR

Table 3 Module Size Method

Total Time Component BLlSS MACRO Link per Component DTRIDTS EVL HLD MIRROR MOM NCP N ETAC P NETDRIVER* NlCONFlG NMLt NETSERVER TOTAL Weeks Months Years

Notes: ' Includes estlmates for NDDRIVER Includes estlmates for NMLSHR

Note that the data presented IS in weeks, unless otherw~sespecified. A week equals five working days, a month equals 4.33 weeks, and a year equals 12 months or 52 weeks.

Digilnl Techrr ical Jorrr~rd %I. 4 l\'o. $ .c;l,ecirrl /ssl(e 1992 167 Alpha AXP Architecture and Syste~ns

Table 4 Module Count Method

Total Time Component BLISS MACRO Link per Component

DT WDTS 0.00 2.80 2.00 4.80 EVL 1 .OO 0.00 2.00 3.00 HLD 0.00 1.80 2.00 3.80 MIRROR 0.00 0.20 2.00 2.20 MOM 1.50 1.40 4.00 6.90 NCP 3.50 0.40 4.00 7.90 NETACP 0.00 4.80 6.00 10.80 NETDRIVER* 0.00 0.80 6.00 6.80 NlCONFlG 0.70 0.00 2.00 2.70 NMLt 3.1 0 1.40 4.00 8.50 NETSERVER 0.30 0.00 2.00 2.30 TOTALS Weeks 10.1 0 13.60 36.00 59.70 Months 2.33 3.1 4 8.31 13.78 Years 0.1 9 0.26 0.69 1.15

Notes: * lncludes estimates for NDDRIVER lncludes estimates for NMLSHR Note that the data presented is in weeks, unless otherwise specified. A week equals five working days, a month equals 4.33 weeks, and a year equals 12 months or 52 weeks.

Table 5 Planned Project Schedule Code Total Time Component Port Debug Review Test per Component

DTWDTS EVL HLD MIRROR MOM NCP NETACP NETDRIVER" NlCONFlG NMLt NETSERVER TOTALS Weeks Months Years

Notes: * lncludes estimates for NDDRIVER Includes estimates for NMLSHR Note that the data presented is in weeks, unless otherwise specified. A week equals five working days, a month equals 4.33 weeks, and a vear eauals 12 months or 52 weeks. DECnet for OpenVMS AXP: A Case History

OpenVMS AXP was targeted for base level five. made arcl~itecture-specificcode conditional on the However, it was highly desirable to provide file platform on which it would execute. Our long-term transfer and remote login capability over DECnet as goal is to incorporate common code into future early as possible. The project team worked closely DECnet for OpenVMS products. with the OpenVMs AXP group to deliver this sup- port prior to b;~selevel four. DECnet for OpenWS AXP Components This section describes the major DECnet for Common Code OpenVMS AXP components and lists the porting One of the most important decisions that helped us issues relevant to each.* Figure 1 shows the inter- deliver our software ahead of schedt~lewas build- connection of the various components of the ing common code for the VAX and Alpha AXP DECnet for OpenVMS LYP software. Detailed infor- systems. During the course of porting code, we dis- mation for each porting issue is fi~rtherdiscussed in covered two advantages to building common code. this paper under the Porting Issues heading. The first was having the ability to generate Vi\X and Alpha AXI' images from a single set of source code. IVETDRWER NETDRIVER is a pseudo device The second was being able to debug our ported driver, i.e., a device driver that does not directly changes in a stable OpenV&fSVAX environment. We control any hardware devices. It implements the accomplished this by rewriting code that required routing, end communication, and session control change so that it worked on both platforms. We layers of the Phase rV version of DNA.'

7'7APPLICATION I NICE MESSAGES I LOCAL 1 1 REMOTE

NETSERVER

$QIO -$QIO NETDRIVER NETACP -- SESSION DATABASE END COMMUNICATION ROUTING DATABASE

$010

MOM

$QIO

NDDRIVER

I I DATA LINK DRIVER I

Figure I DECnet for OpenVMS AXP Components

Digital Tech~zicnlJourtrtrl Ih1. 4 A1o. 4 Specin1 /ssrre 1992 169 Alpha AXP Architecture and Systems

'I'lle queue 1/0 request ($~l0)s)lstenl service is the DNA command terminal (CTEIbVl) protocol. the interface into the session control layer. The CTDRIVER and RTTDIUVEII perform similar func- NETL)IWER routing layer communicates with other tions with the exception tl~atRIIDRIVER is used for device drivers that implement the clata link layer of interoperability with older implementations of DNA. NETDRlVER communicates with NETACP remote terminal support. REMACP is an ancillary (another component discussetl later in this section) control process (AW) that receives incoming to perform network management functions, to requests for remote terminal support. After REMACP report state and network topology changes, and to establishes ;I connection with the remote nocle, perform operations that require process context. either CTDRIVER or RTTDRIVER communicates NETDRIVER is written in NMCItO-32 code and pre- directly with NETDRIVER to send and receive sented us with many porting issues, includ- remote terminal protocol messages. ing device driver changes, coroutines, memory CTDRIVEI1, KTTDRNER, and RE;LWCP are written in management changes, page size clependencies, MACRO-32 code and presented the following port- atomicity ancl granularity problems, OpenVMS ASP jng issues: device clriver changes, unaligned refer- operating system data structure changes, unaligned ences, OpenVMS AXP operating system data references, and up-level stack references. structure changes, and for KEMI\CP, changes in the interface with CTDRIVER. IMO~W The mainte~~anceoperations module (MOM) image processes service operations defined NETACP NETACP runs as an ACP that assists by the maintenance operation protocol (MOP). One NE'['[)RTVER in performing network operations that such service operation is downline loading remote require process context. Examples include creating systems. MOM uses NDDRWER (described in the processes for incoming logical links and assigning next subsection) to communicate with the remote channels to data link clevices. NEI'DRNER ancl system over a DECnet circuit. MOM comnlunicates NE'TACP also work together to maintain information with NETACP to gather information about nodes about the state of the network. Another major firnc- req~~estingto be downline loadetl. NETACP creates a tion performed by NETKP is the management of process running the MOM image when a request for the network configuration parameters residing in a service operation is receivecl on a circuit enabled virti~almemory. to perform service operations. NETACP is written in MACRO-32 code. Porting MOM is written pri~narilyin BLISS-32 code. Porting issues included coroutines, usage of processor issues inclucled removing dependencies on the for- status longword (PSL) condition cocles, memory mat of a VAX argument list, condition handling management changes, page size dependencies, changes, and Alpha tLYP image Ileacler changes. atomicity and granularity problems, Openms AXP operating system data structure changes, and NlmR/VER The pseudo clevice driver NDDNVER unaligned references. It1 addition, tlie use of "poor implements an interface that allows MOM to use a programmer's lockdown," a method of locking DECnet circuit to perform service operations using pages into a working set, required moclification. DNA MOP. The &IOM image uses the $QIO system service interface to send MOP messages to and receive klOP messages from NDL)KNER, which then NETSERVER The NETSERVER image is run by communicates with the data link device clrivers to server processes created to handle incoming logi- transmit and receive these messages. NDDNVER cal link requests. NBTSERVER invokes the image or communicates with NETACP to perform tasks command procetlure associated with the network that require process context and to receive notifica- object specified by the incoming logical link. To tion of state changes to circuits enabled for service avoid the overhead of process creation, a server operations. process can be reused after the logical link it was NDDlUVER is written in MACRO32 code. Porting servicing is terminated. Iclle server processes regis- issues includecl changes to device clrivers, memory ter the~nselveswit11 NETACP so that they map be management, and OpenViMS UPoperating system reused for another logical link. data structures, as well as page size dependencies. NETSERVER is written in RLISS-32 code. The only porting change necessary was the addition CmRfVER,RTTDRIVER, anlzd REMACP CTDWER of the BLISS VOLATILE attribute to several clata is a pseudo device driver for remote terminals using declarations.

170 Vol 4 No. 4 .Specral/s.c~~c1992 Digital Techrricnl Jorrrtrnl DECnet for ODenVM.7 A XP: A C~rsc.Histor))

NCP The network control program (NCP) is tlie messages into something readable by a network user interface for network m;in;Igement. NCP com- manager. municates with other network management com- EVL is written in BLISS-32 code. Porting ibsues ponents using the network information and included adding the BLISS VOLriTILE attribute to control exchange (NICE) protocol. NCP can be usetl some data structure definitions and aligning data to manage thc 1oc;ll node ;15 well as I-emote nodes. structure fields on natural bountlaries. When managing the local node, NCP exchanges NICE protocol messages with the NMLSHR shareable DT5 and DTR The DECnet test sender (DTS) and image. For remote management, NCI' creates a logi- the DECnet test receiver (DTR) are cooperating pro- cal link to the network rnzlnagement listener (NMI.) grams that can be used to test the network connec- object on the remote node ancl exchanges NICE pro- tion between two nodes. DTS runs on the local node tocol messages over this logical link. and commuiiicates with DTR on the remote nocle. N<:l' consists primarily of HI..ISS-32 moclules. The DTS and DTR can be used to test the througliput and major porting issue ;~ssociateclwith NCP was chang- reliability of a line over which IIECnet is running. ing the code to use I,IB$TAHI.E-PARSErather than DTS and DTR are written primarily in MACRO-32 LIB$TPARSE. cocle. The two major porting issues associated with DTS and DTR were changing the code to use 1VMLSHK NMLSHR is ;z shareable image that pro- LIB$TABLE-PARSErather than LIR$TI'ARSE;~nd add- cesses NICE protocol nctwork management mes- ing some BLISS-32 code to support flo;~ting-point sages on an Open\/MS system. NMLSHR decodes operations. NICE messages received as input alld performs the requested n1an;lgernent operation. NMLSHR builcls RTPAD RTPAD provides the connection between NICE protocol messages as a response to requests a local terminal and the remote terminal services of asking for network n1an:lgement information to be a remote node. RTPAD is invoked as the result of returned. NCP and NML both link with tlie NMISHR executing the SET HOST cornm;lntl of the Digit;~l image to call the routines th;~tprocess tlie NICE pro- Command Language (DCL). RT13b\l) comniunic;~tes tocol messages. with REXIACP and CTDRIVEK or Rl-TI>RIVER on the NiMLSHR is written in BLISS-32 and MACRO-32. remote system to provide remote terminal support. Porting issues inclutletl dependencies on the for- RTPm accepts input from the local terminal (whicli mat of a VAX ;Irgument list and ch;inges required to could be another remote terminal) and sentls this link shareable images. data over the network to the remote node. Output from the remote nocle is receivetl by R'r'r'iil) ;incl tlis- IV,~!IL The network rn;~nagementlistener (NML) played on the local terminal. image is run when a remote node requests a con- RTPm is written in MACRO-32 code. Porting nection to the NML object to perform remote issues inclucled unaligned references ;~ntl;~ligning network management operations. NML sentls NICE data structure fields on natural boiind;~ries. protocol messages to and receives them from the remote node. NML passes NICE protocol messages NICONFIG NICONFIG is tlie Ethernet configurator receivetl from the remote 11otle to NMLSHR for that listens to the M<)P system itlentification mes- tlecoding and receives messages from NMISHR to sages broadcast on Ethernet circuits and maintains send to the remote notle. a database of configi~rationinformation for all sys- NML is written in I3LISS-32 code. The only porting tems heard. NCP is used to manage ;inel tlispl;~ythe change made to NML code was to atlcl the BLlSS information maintained by NICONFIG. NI(:ONFI<; VOLATILE attribute to one data declaration. runs as a process created by NMLSHR and comniuni- cates with NMLSHR over a I>ECnet logical link using EVL 'I'he event logger (EVL) receives event mes- the NICE protocol. sages from the various DNA layers. EVL can also act NICONFIG is written in BLISS-32 code. The only as an event sink for messages generated at a remote porting change was to remove the module switch node. EVL is started by NETACI1 ant1 declares itself LANGUAGE. *as a network object so that remote nodes can con- nect to the EVL object ant1 send event messages. EVL HLD The host loader (HLD) communicates with can log events to a file in binary form or format the the DECnet-RSX satellite loader to tlownline loacl Alpha Architecture and Systems

tasks to ;In RSX-11s node. H1.D is writ~enin ~i\(:tiO- procetlures were copied onto the disk from a UP 32 code. The only porting change was to upd;ite the system. Each time new DECnet for OpenVMS AXP structure definition language used to create one images or test procedures hat1 to be adtletl to the cI;tt;l struct~~re. sharetl disk during a test or debug session, the Alpha A)(P test system hat1 to be stopped, the disk MfRROR The loopback mirror p;u-ticipates ill mounted on the VAX system, images copied, the disk networlc services protocol and routing 1;lyer loop- tlismounted, ancl the Alpha AXP system rebooted. back testing. MIRROR is written in MM;lE<:net-VAXproduct to the OpenVlLlS operating platforni consisted of the following steps: code system. preparation, compilation, linking, code review, debug, and testing. We tlid not start the task of port- DECnet jor OpenWS MY ing DECnet-VAX with a completely clear vision of Dezielopment Environ~nent the tot:~l process. As we progressed and learned I>E(:net for OpenVMS AXP is built with ant1 inte- more about tlie tools ant1 porting process, we grated into the Openv~MsAXl) operating system. improvecl our porting techniques and, as a result, Many cl1;inges were being ni;~cle to system d;tta our protlucti\lit): structures that directly affected the I>EC:net soft- Our stratcg) was to begill by porting the drivers nr;tre. These changes required the l>EC~ietfor and pri\rileged code. These components were the OpenVMS AXP software to be built with ;~ntltested most complex: thejr were written completely in 011 mnny interim operating system base levels MACRO-32 cotle ;i~itlIiad the greatest potential for Ixfore the combinetl OpenV&!s ,\XI-' oper:~ting change. \Ve started with NETDRTVEK and NETACI', system and OECnet for OpenVlLlS AX1' kit was assigning one engineer to work on each compo- shipped for layered product development. nent. As our porting group grew in numbel; we Bec;~usethe development tools ch;ingetl tIiro~~g[~- began to port, in parallel, the BLISS modules that out the project, we used the same tools to port the comprise NU', NMI., NILII.SI-IR, EVL, ant1 i\lOkl. I)l:(;net-\3X software as were usecl to develop the The following is an overview of the process we opcr;tting system base levels. When we copied pol-- used to port the I>E<:net-VAXsoftware to the Alphi1 tions of an OpenvMS ASP base level, we also copied ASP pl:~tform.L:~ter sections contain tletails of cotl- the tool directories associated with the system ing practices that had to change. build. We used cross compilers for hlA(:RO-32 ;tncl I%l.ISS-.32.which allowed us to tlevelop Alpha ASP Code PI.C/)NIYI~~OIIOur first task was to create software on an Open\ra,Is \RS s!,stem.5 In ;tclrlition, procedures that we coulcl use early in the porting we i~setlthe OpenVMS AX]-' linker, 1ibr;trian. ;lncl process to compile single ~llodule~of a DECnet for system dump ;unalyzer (SDA) cross tools on the VAS OpenV,\IS A>\']' component. We also ported the com- system. \Ye also used the Open\l,\IS r\Xl' debug- ponent's bi~ildprocedure to use the new Alpha AXP ging tools Delta and XDelta on the Alpha .\XI' proto- cross tools. type l~ardware.~ Nest, we began to prepare the code for initial 1niti;ll J)E(:net for Open\/i\lS AX1) testing was cornpilntion. &1.\(:1{(>-.32code must have each entry accomplished on a VAX system. Such testing was point itlentit'iecl prior to the initial compile. Entry j7ossible because we designed ;i m:~jority of the points itre identified by a compiler directive such as I>L<;nctfor OpenVMS AXP cock to run on bofh VblS .ISB-ENI'RY iind .(:ALI.-ENTIIY. Each directive :111dAlpha AXP hardware plath)rms. accepts optional p;cl.;tmeters that identify register The Alph;~I.W' prototype system i~seclfor testing usage. Howc\,er, this information is not required utilizecl a sharetl disk that cont:iinetl the OpenVMS at this point in the porting process. The Alpha ,4XP operating system images. The imitges nnd test AXP ,\IA<:ItO-32 compiler will provide register

161. 4 No. 4 S/~cicrlIssue 1992 Digital Tecb~~icnlJorrmnl DECnet for OpenVMS AXP: A Case History usage hints during the compilation, if so directed. some atomicity and granularity problems that are As the team became familiar with tlie porting not resolved/atldressed might cause code failures process, we were able to combine these steps during debug.' ancl include the register usage information when NETLIIWER ancl NETACP cotitained architecture- declaring entry points. Also, as our experience specific code, including memory management, increased, we were able to make changes to non- driver tables, and structure definitions, which had portable coding practices prior to the initial com- to be made conditional for the OpenVMS AXP and pile of a module. OpenVMS \'AX systems A prefix file was added to Our experience proved, as we expected, that each hUCKO-32 module during the Alpha AXP com- BLISS code is far easier to port than MACRO-32 code. pilation step. This file containetl an Alpha AXP dec- For the DBCnet-VAX components containing BLlSS laration that allowed us to include the directives modules, we began the port by running the compo- required for conditional compilation. To comp~le nent's build procedure. BLISS routines do not the portecl code on a VAX system, it was necessary require that entry points be identified. The com- to provide a VAX declaration and macros for the piler can process each module, identtfy errors, and various entry-point directives that when espanded provide warning and informational messages. contained no instructions. These were placed in a common library file ant1 conditionally compiled. Conzpile Process After we completed the initial The library file is inclutled in each module. Figure 2 code preparation ant1 created c~~stomizedbuild is an exanlple of a library file that contains a VAX procedures, the real iterative process of porting declaration and macros. began. At this point we compiled one or more BLISS architecture-specific code was made moclules, made additional modifications based on conditional using the %if %bliss(bliss32v) or %if the compilation results, and recompiled until we %bliss(bliss32e) constructs for OpeaVklS VAX and were reasonably satisfied that all the errors were OpenVMS AXP, respectively. fiiecl. After porting a11 the ~iiod~~leswithin a compo- The Alpha AXI' cross compilers, the I\WCKO-32 nent, the component's build procedure was run to compiler in particular, have the capability of pro- ensure that each module hat1 been ported without viding a vast array of inforniational and warning error. This was typically the first attempt to link the messages. When compiling a module, we always component. We also ran the Openv&bv!X proce- requested all informational messages. The infor- dure to ensure that the code would continue to mation assisted us in identifying the input and out- compile ancl link on tlie OpenVMS VAX operating put registers as well as the registers that the system. compiler automatically preserved. Using this infor- mation, we verified tlie register usage in each rou- Linking The process of linking was difficult at tine and added the information to the entry-point times. The DECnet for OpenVMS AXP software con- directives. Other informatiotsal and warning mes- tains drivers, system images, and shareable images. sages directed us to cotling techniques that Each component required changes to the link pro- required change. By working wit11 one module at a cedures. We made these proced~~resconditional for time, we avoidecl making repetitive porting errors both the OpenVklS VAX and the OpenVMS AXP oper- in multiple modules prior to our complete under- ating systems. standing of bow to soLve the more complex porting The process of linking the portecl modules problems. brought to light many unresolved references. In Some informational messages caution that cer- general, these references were to external routines tain coding techniques such as data alignment that had changed for the OpenVMS AXP operating should be moclified. We observed that attempting system. One of the most difficult aspects of the to make changes to align all data structure ele- porting project was determining which changes ments prior to completing preliminary debug and to the OpenVMS operating system had an impact testing caused many debug problems. Therefore, on our project. Determining these changes was we decicled to establish a porting policy to change difficult because DECnet for OpenVMS AXP is only as much code ,as was absolutely necessary tightly integrated into the OpenVMS AXP operating prior to the initial debug ant1 test of a DECnet for system. The process of pprting applications to OpenVMS AXP software component. Adhering to the OpenVMs ASP environment should not be as this policy required careful consideration, since difficult.

Digital Technical Jour~ral VoL. 4 ~\b.4 Special Issue I992 173 Alpha AXP Architecture and Systems

.SUBTITLE SDECNETDEF

Define all those symbols that should precede all DECnet

I macro modules.

.MACRO SDECNETDEF .IF NOT-DEFINED Alpha-AXP

These make Alpha AXP code compile on VAX builds by doing

I nothing when encountered

VAX=l ; .JSB-ENTRY .macro .jsb-entry, input, output, scratch, preserve . endm .JSB32_ENTRY .macro .jsb32_entry, scratch, preserve . endm ; . CALL-ENTRY .macro .call-entry, preserve, max-args=O,- home-args=faLse, input, output, scratch . endm . ENDC . ENDM /

Cor4e lRIVER ancl NEl'I\(:I' conipo- lems fount1 during tlie compile ant1 link phases had nents. Since many of the required chilnges were been corrected, we began our code review process. common to both OpenVhlS AXP ;~tldOprnVICIS VAX The original Vi\X cotle, the ported code, ;~ntla dif- systems, we were able to debug much of this code ference listing were avai1:tble to the porting team. before we had access to Alpha IIXP liartlware. We One or more members of tlie team reviewed the foillid and fixed a number of problems using this ch:~ngesmatle ant1 pointecl oi~t;In!' problems tliat technique. were identified to the person responsible for tlie W/lien we were re;ison;lbly confitlent tli;~tthe module being reviewetl. We ;ill had previously ported code was working on the Open\l$ls \%X ;!greed that the reviews would be friendly ancl that operating system, we began testing on Alpha ASP egos wo~lldbe left out oftlie process. We found that prototype hardrn;lre, which fort~~natelyti:icl just our successfill code reviews were well worth the become available. We completecl the driver 1o:id effort. and KPinitialization testing. The initial test uncov- Initial reviews turnetl 1113 differing pliilos- ered some problems tliat required special ophies I-egardingthe porting process. We discussed workarountls to allow debug to continue. Fl'liese these difl'c.1.encc.s and rr;iched a consensus. The problems were corrected in 1;lter versions of tlie reviews uncovered errors in the porting process, tools. Since the user interk~cehacl not yet been ant1 correcting these problems red~~cedthe amount ported, test code was written to start 1)EC:net for of delx~ggingreq~~ired. The review process also OpenViMS AXP and begin debugging the $QIO inter- ;rllowed us to agree on and maintain coding stall- face to tlie tlriver. dartls. Eventually NCP, NMl,. ;lntl NMI.StIII were ported, and more comprehensive clebugging begctn. We Urb~~ggiiz~0i1r ;~py~);~chto tlehiigging the used the OpenVMS AXl' XDelta ;~ndDelta tools to DEClieC hrOpenV>IS AXl' softw;~re\vas to build the debug the .DECnet for OpenVMS AX]' cotle on our common portetl cocle for a \hX system and to Alpha AXI' prototype h;~rdw;~re.System failures replitce tlie OpenVMS \)AX images with our portetl were tlebi~ggeclusing the sDii cross tool on a Vt\X version on one of oilr wol-kst;~tions.We began by system. We learned how to trace c;ill chiiins b), studying the OpenVMs calling stantl:-~rtl.- subroutine c;ills. We conclutletl tlir~t tlie code, 1 ide erst an cling the format of linliage pairs, proce- which had existed for a long time, :ilreatly savetl ;~nd dure descriptors, ;~ndregister save areas m;~tle restored the correct registers. We decided tli;~ttr!.- tlebugging niuch easier prior to the avail;lbility of ing to con~mi~nic;~tethe correct list of input, out- these features in SDA. 1)ebugging on ;in Alpha ASP put, pass-through, and preserve registers to the system is more tliffici~ltthan on a VAS system since compiler could be an impossible task, esl>eci;illy most VAX instructions generate multiple Alpha ASr' given our scl~etlule.We investig;~tetlthe possibility instructions whose positions ;Ire optirilized by the of using the .JSH32_ENTRY directive. This directive conipiler to take ;idvantage of Alpha AXP architec- allows the specification of registers that must be t~lrefe;~tures. "T'lil~s, it is not ;~l\vayseasy to follow preserved but tloes not take any input, output, or tlie AIphhiiXI' cock line by line bec;iuse the gener- scratcl? parameters. The OpenVMS hXP ~lh<;UO-32 ated Alpha AXP code from one language statement cross compiler will not automatic;~llypreserve ;in!. is interspersecl with Alpha AXl' code generatetl registers when this directive is usecl. A great cle;ll of from aoother language statement. care must be taken when LIS~II~this entry-point tlirective. Testiug M'ter solving the ol?\/io~~sproblems clur- Our tlecision to use .JSB32-EN'I'RY to declare entr!- ing the tlebi~gprocess, we began to test the DE(:net points lecl to an interesting proble~nwit11 :IS!'II- for OpenVk1S ASP code. Elrly versions of the chronously executing code tli;~tcould jnterri~pt Openvprs ~ixl-'file system, record nianagement ser- other threads of execution. The 1)E(:net-\'AX code vices (RILIS), ;inti the file access listener (FA13 were that we ported i~sedI'IJSHR anel IIOI'R instructions made av;rilable to us. We in turn provided tlie to save and restore registers th;it were ~nodificcl 1)EC:net for OpenVMS ASP cock to the group porting by DECbet-\'AX cock jnterrupting another thread of Open\'&lS R>lS ;tnd FAL for their use in debugging. execution. When adding the . JSI~~)~_INTRYdirec- We were then ;tble to run test scripts that used ;I tives, we specified a register preserve par:lrneLer variety of I)(:[. con~m:inds to perform loops of only on external entry points, assuming that the remote copies, differences, and clirectosy listings of remailider of the original UE(:net-ViIX code was s~v- I-emote files. I>E(:netnetworl< n1;In;igement scripts ing the proper registers. The preserve parameter tested the network n1;lnagement interface. DTS ant1 ensures that all 64 bits of the registers specified ;Ire I)TR were i~sedto perform tl:~t:~transfer testing. saved at routine entry ant1 restoretl at routine exit. Since the I)E(:net for OpenV,LlS t\XI1 softmiare was The PUSl3R and L'OL'R instructions preserve only ;~vailableearly, it w:is used by other Alpha IU(P port- tlie low-ortler 32 bits of the specified registers. ing groups on Alpha AXP prototype llardware in However, if I)G<;net-VAXcode in ;I routine witl~oi~t various 1oc;ltions.As the code stabilized, a timeslur- the .JSB32_ENTRY preserve p;~rameterinterrupts ing system was set LIP, which provitlecl the opportu- another threacl of execution th;~tmakes use of tlie nity for more testing. ul>per 32 bits of a register, these upper 32 bits would not be properly restorecl when contl-ol Portitzg Issues returned to the interruptecl thresd. The solutjon When we beg;~nporting the I)E(:net-VAX softw;ire was to specify the register preserve parameter on to the Alph;~)\XI-' 11ardw:lre platf-i)rm, we fount1 the .JSB32_ENTR\i directives used to declare the many coding conventions coultl not be used. most entry points of routines in I>E(:JietI'or OpenVhdS of these coding p~lcticesare called out by the cross ASP that are c;ipable of inte~-r~~l>tjngother thre:itls compilers, which signific;~ntlyhelped the porting of execution. effort.3 Whenever we changed tlie inp~~tor outl?t~t The following is a discussion of some problems p;lrameters to ;In internal subroutine, we ;llso we enco~~nteretlwhile porting ;~ndhow we solvetl changed tlie name of that subroutine. This effoi-1 them. 11elped identify all the internal calls lllade to sub- routines whose interface had c11;inged. Eiityy Poilils Approximately four months into the project, the porting team cletermined that usi~igthe CO~OLL~~~L'SA feature of the Vt\s ;rrchitectt~reusccl .,\SI3_ENl'llYclirective in NE'T'1)RIVER was going to throughout the NETACP and NFI'I>RniER con>- make porting clilficult. The tlifficulty was clue to ponents is called a coroutine. Coroutines usetl the complexity of the code ancl the fact that some in >h\CRO-D a1 low a subroutine to call cotle friig- code paths contained more than ;I clozen layers of ments in other subroutines. This technique uses the Alpha AXP Architecture and Systems

jump-to-subroutine construct JSB @(SP)+ to jump Alpha AXP architecture. When a coroutine is split between coroutines. The code example shown in into multiple routines, some cocle, such as that test- Figure 3 denlonstrates the use of the JSB construct. ing returned values. 11i;~ychange relative loc;~tion. The general flow of the example is for ~~AINto In our example, the error processing at SAYE is no call (:OKOUTINE with RO equal to 0 and R1 equal longer necessary Insteatl. <:OIK>C)In'INEreturns to to 1. Usually, COROUTINE changes the value of R1 to MMN if it detects an error, and ,WN siti~l>lyreturns 2 and calls back iWNat address SAVE. If COROUTINE to its caller with the status in RO. The VA~code is entered with R1 not eclu:~lto 1, then RO is set to 1 exaniple in Figure 3 was converted to Alpli;~AXI' and the coroutine dialogue terminates. MAIN at code using our technique. The resulting code is address SAVE then tests RO and exits. Under normal shown jn Figure 4. circumstances, MAIN at atlclress St\VE continues, The use of coroutines on Alpha ,\XI' s),stenis storing the returned value of R1 in DATA and calling should be discouragetl because of the o\,erl~ead back the coroutine at adtlress FINAL. COROIJTINE at associated with storing the return adtlress in regis- adilress FINAL then changes the value of R1 to 3, sets ters and the ailditional consumption of stack space. the return status in RO to 1, ant1 returns to 1ktA1N at Rather than a simple return ;iddress on the st;tck, address TERMINATE. TEliMlNATE then exits MAIN via there will be ;I register save arc.;( on the stack for the RSB instruction. each subro~rtinethat makes up the corol~tinc. All entry points in MACRO-32 code on an Recursive coroutines can consume large qilantities OperlvMs AXP operating system must have an entry of stack space. We have since converted coroutines directive. Thus, it is not possible to use tlie]SB con- usetl in main code paths to straight in-line subro~~- struct to jump to any random line of code, as the tine calls. previous example clenionstrates. To tlo so, the code shown in Figure 3 would have to be split into sub- St~~ck.Uscf,~e MA(:IIO-32 coclc ilseb :I number of routines, each with a .JSH-ENTRY or .jSB32_ENTRY common coding techniques t1i;it reqiiire knowl- entry directive. Also, we hail to change the irnple- edge of the state of the stack ;~ndth;~t must be mentation of coroutines. Rather than use the stack changed for the OpenV,\IS ,.\XI' operating system. to pass return addresses, we passed each return One such technique, referred to as ;in up-level st;ick address in a register. reference, occurs nrhenever ;I subroutine ;Ittempts Since some coroutines ported were more com- to access information (;lddrc.ss or rl;~ta)stored on plex than the example shown in Figure 3, we clevel- the stack by its caller. Parameter passing sonietimes oped a technique to port VAX coroutines to the uses this technique. If :I routine pushes ;Irjiurnents

MAIN: MOVL #O, RO ; Assume failure MOVL #I, R1 ; Set initial value JSB COROUTINE ; Open a coroutine dialogue

SAVE: BLBS RO, TERMINATE ; No change in value MOVL R1, DATA ; Save the changed value JSB @(SP)+ ; Continue coroutine dialogue

TERMINATE: RSB ; Exit with status in RO

COROUTINE: CMPL R1, #I ; Should we change the value'! BNEQ EXIT ; If not, exit routine MOVL #2, R1 ; Change the value JSB @(SP)+ ; Call back to coroutin~

FINAL: MOVL #3, R1 ; Final value

EXIT: MOVL #I, RO ; Signal success R S B ; Return

Figure 3 VAX Code Example Shoulitzg the Use of the ConstructJSB @ (SPj+ h./lf rrll., Oetlireer~Chr'olr tirzcs

176 Vol. 4 !\b. 4 Special Issue 1992 Digital Ticl~nicrtlJour?~a~ DECnet for 0penVlW.S AXP: A Case History

MAIN: .JSB-ENTRY OUTPUT=,- SCRATCH= MOVL #O, RO ; Assume failure MOVL #I, R1 ; Set initial value NOVAB SAVE,RZ ; Next coroutine address BSBW COROUTINE ; Open a coroutine dialogue R S B ; Return to caller

COROUTINE: .JSB-ENTRY INPUT=,- OUTPUT= C M PL R1, #I ; Should we change the value? BNEQ EXIT ; If not, exit routine PUSHL R2 ; Save next coroutine address MOVL #2, R1 ; Change the value MOVAB FINAL,RZ ; Coroutine address for SAVE to use J SB @(SP)+ ; Continue at SAVE

EXIT: MOVL #I, RO ; Set status RSB ; Return to MAIN

SAVE: .JSB-ENTRY INPUT=,- OUTPUT= PUSHL R2 ; Save next coroutine address - FINAL MOVL R1, DATA ; Save the changed value J SB @(SP)+ ; Continue coroutine dialogue at FINAL RSB ; To COROUTINE

FINAL: .JSB-ENTRY OUTPUT= MOVL #3, R1 ; Final value R S B ; To SAVE

F~LLI-e4 Alpha AXP Code Exu~)zpleSl~oiuirzg the Use of the Constr~tct JSB Q(SP)+ toJ~tnzp Deliueen Coro~~trnes onto the stack prior to jumping to a subroutine, the Subroutilles removecl the return address from the called subroutine does up-level stack references to stack and returned to the caller's caller. We modi- retrieve the arguments. Other techniques include fied the cotle to remove the up-level stack refer- using the stack as ;I common data are;) or attempt- ence (the caller's return address) ant1 return a flag ing to manipulate the caller's return address in in a register to signal the caller that a change in the order to alter the progr;im flow. program flow was desirecl. MI these techniques require re-cotling. When we encountered code that passed parameters on the Condition Codes The Alpha A?

DLC'rict f?)r OperzV1l4S AX? A Case Nlsto~')~

Conclusion any other code written in &bi(\<:l;~viclGagne, ;inti As we tliscoverecl tluring the process of porting Ben Thomas of the I/O team; Karen Noel ancl Mike the DECnet-VAX software, iLLZ<:R<)-32code is signifi- Harvey of the executive group; and Steve Dipirro of cantly more clifficlllt to port than code written in the XDelta te;lm. Iiiglier-level languages. However, certain ;~rchitec- The DECnet for OpenVMS AX1' project was only ture-specific functions may have to be written in part of the Alpha ASP teani effort. We feel fortun;ite assembly 1;lnguage. Wk recommend that these func- to have experienced the synergy that this team tions be isol;~ted.In addition, we recommend that created.

Digital Techr~icrrlJournal Vol. d ~Vfr,.4 Special lsslre 199.2 170 Alpha AXP Architecture and Systems

References 7 OpenViVLS Calling Stnnclclrd (Maynard: Digital 1. A. Lauck, D. Oran, ant1 11. Perlman, "Digital Equipment Corporation, Order No. AA-PQY2A- Network Architecture Overview," Digitul TK, 1992). Tecbrzicnl Jo~1t.nnl vol. 1, no. 3 (September 1986): 10-24. 8. N. Kronenberg et al., "Porting OpenVMS from VAX to Alpha AXP," Digit611 Z~cbniccrlJourrzcrl, l? 2. Beck and J. Krycka, "The DECnet-VAX Prod- vol. 4, no. 4 (1992, this issue): 111-120. uct-An Integrated Approach to Networking," Digit~rllTecI3nicnl Jounlcrl, vol. 1, no. 3 (Septem- ber 1986): 88-99. General References DECnet/or Open Vll-IS Nel~llorkiVl~rnngetnet?l 1Jllli- 3 ~Migi-uti~zgto an Open L1f.S Alpha Systenl: Port- ties mayna nard: Digital Equipment Corporation, i~zgK4X ilIACRO Code (Maynard Digital Equip- Order No AA-PQYAA-TK, 1992). ment <:orporation, Ortler No AA-PQYEA-'TE, 1992). DECrzet Ji>r OpenKMS G~ridelo ~Vetrcorking(May- 4. Openr',lfS' Linker rLlununl (Maynartl: Digital nard: Digital Equipmcnt Corporation, Orcler No. Equipment Corporation, Order No. At\-P()?CA- AA-I'QY8A-TK. 1992). TK, 1992). DECizet for Open vil4.Y Nel lvolflkiizg 1Wnn LICIL (M;I)'- 5. Open KV1.S Alpha Sj'stem Dump Analyzer IJtility narcl: Digital Equipment Corporation, Order No. ~Mclnr~~rl(Majlnard. Digi t;~lEcluipmen t Corpora- A A-PQ'I'YA-TK, 1992). tion, Order No. AA-PQYCA-TE, 1992)

6. O11eiz ViMS Deltu/XDeltu Cltility ill~zrz~~nl(May- iVligrating to arz O11en V;WS All,hcr Sjlstellz: Pl~r12 rz irzg nard: Digital Equipment Corporation, Order No. forilligrution (Majlnard: Digita1 Equipment Corpo- A A-PQY PA-TK, 1992). ration, Order No. AA-PQY7AYTE,1992).

180 Vol. 4 ~\b.4 S[~ecin/Issrre1992 Digilul Technical Jourrrul George A. Dccrcy III Ronald R Brender StephenJ Morn's Michael K Iles Using Simulation to Develop and Port Software

Anio~zgthe tools deueloped to silpport Digitc~l'sAlpha AXPprogrnl?~zilere jolir soft- zuare sirnulc~tors.The iMa~inequi~zand I.SP instruction set si~)zulatorswere used to port the Open VIN and OSF/l operating systems to the AQha AXP platform. The Alpha User-nzode Debugging Eizt~ironment(AUD) allozi~edAlpha AXP user-/node code to be debugged iilzth s~q~portfi'o~rz the Ope~zviVSVAX rz~rz-tuneenz)iron~~ient on VAX hc~rclwc~re.AUD 11~sb~ilt from a conzbinatzo~zof new and existing Digital softlvare conzponents. The Alpha User-mode Debugging Enz~ironnzent for Translated I~nages(AUDI) allouted translated images to be debugged 012 a sinzulator rz~rzizi~zgon a K4X conzputer With these clebz~gginzg enuiro~z~nents,user-mode cQplications GLIZ~code comn1)oizents coz~ldbe tested before Alpha AXP hardware crnd operating systnn softzuare were availc~ble.

Digital developed several software simulators to in tracking the Alpha I\XP architecture and in root- support its Alpha AXP program.' These tools ing out software bugs that the OSF/1 group was able enabled engineers to develop rind port software for to boot the ULTRlX operating system on the hard- the 64-bit RISC Alpha AXP ;irchitecture concur- ware on the first attempt. The OpenvMS group hat1 rently with harclware development. The simulators similar success and booted the OpenVMS AXP oper- were used for a variety of purposes inclutling port- ating system in a few hours. ing, testing, verification, ancl performance analysis. Note that the Alpha Demonstration Unit (ADU) is This paper discusses four Alplia iurP software simu- an Alpha AXP prototype hardware system and lators: Nlannecluin, ISP, AUD, and AUDI. should not be confused with the Alpha User-mode Debugging Environment (AUD) or the Alpha User- The Mannequin and ISP Simulators mode Debugging Environment for Tr;inslated Images (AULII), two software simulator facilities dis- Two tUpha iD(P instruction set simulators, cussed later in the paper. Mannequin and ISP, were used to port operating systems to the Alpha AXP platform. Tlie OpenVMS group used the Mannequin si~nulatorto port the OpenVMS AXP Porting Openv>ls v,\X system to the AJplia ASP platform. The OpenvMS group usetl NIannequin as their Alpha Likewise, the OSF/l group used the ISP simulator in AXP instruction simulator in porting the OpenVMS their port of tlie ULTRlX ant1 OSF/l operating sys- VAX operating system to the Alpha AXP hardw;~re. tems to the Alpha AXP platform. Both sinlulators Never before had an OpenVMS porting effort been were also usetl for architectural and design verifica- able to debug as much operating system code tion, and for performance modeling. in advance of hartlware. Prior porting efforts l'he Nlannequin simulator grew out of a simula- debuggetl only up to VMB, a priniary boot stage in tor developed for ;111 earlier RlSC project at Digital. the OpenVMS operating system. IJsing Mannequin, The IsP simulator was written anew by engineers operating system developers were able to boot the closely associated with the Alpha tLYP architecture. entire operating system on tlie simulator ant1 actu- The two tleveloprnent gro~~pswere recluested to ally log in and debug utilities. boot their respective operating systems on the sim- Some developers used Mannequin's own win- ul:itors before attempting to boot on the Alpha dows interface and debugging facilities to debug Demonstration Unit (ADIJ), the Alpha ASP proto- their code. Others ran the XDelta utility on top of type hardware.' The simulators were so successfi~l mannequin? XDelta is a low-level system debugger

Digilrrl Tecbnicril Journal 1/01. 4 Are. 4 Speci~tlIssi~e 1992 18 1 Alpha AXP Architecture and Systems used to debug the OpenvYlS ViX kernel :~ndtlrivers. OSF/l AXP Porting However, the Ma~inequininterface was useful in ini- The OSF/l operi~tingsystem group usetl the ISP sim- tially clebugging XlleJta, since the Alpha AXP con- ul;~toras an Alpha AX'P instruction compute engine. sole allows neither breakpoints nor single stepping. The strategy was to connect the ISP simulator to To debug their code before the fill1 OpenVMS ASP dhx, a st:~ntl;ird UNIX source-level debugger, via operating system was avail;lble, other clevelopers cll?x's remote interface. hi interface \v;ls added to used Mannequin in conjunction with tlie Npha tlie ISP to support the following low-level debugger prim;~rjr boot (A13R) cotle and a test h:~rness. commands: Wlannequin was especially usef~ilin finding align- mcnt hults in the boot sequence. since the align- Instruction stream exiirnine and deposit nielit tools are not operational until the OpenVMS . Data stream examine ant1 deposit AXP system is completely booted. Alignment faults Register exmnine ant1 deposit occur when :in ;lttenlpt is rn;icle to access ;I unit of dat;~locatetl ;it ;in atldress that is not a multil>le of . Single step the size of the data.

Microcode L'@eedzip 'She dbx tlebugger was motlified to work with the One mail1 reason the OpenVMS team was :ible to 64-bit Alpha architecture. 'That is, atltlresses in debug 21 large part of the operating system in real the debugger were extentlecl to 64 bits, and an time was tlie use of specially written microcode to Alpha ,%XIJ clisassembler was providetl. The ISp speetl LIPthe siniul;ltor. Mannecluin is c;ip:~bleof simulator ant1 tlbx debugger operated as separate running with speci;il user-written microcotle on processes communicating on the same ni;~cliine thc VAX 8800 fi~milyof machines. This microcode by means of a socket. A socket is a protocol- is :ln addition to the normal VAX micrococle for independent connection point for interprocess [lie 8800 machines; the VAX microcotle remains communic;~tions. unchangecl. With micrococle support, a large subset Hjstoric;rll): the OSF/l group used the ISP-dl~ of Alpha ASP instructions is executed in microcode combination to port the I.I:rlu>; oper:iting system ;ind attains performance comparable to native VAX to the Alpha AXP platform as ;In advanced develop- instructions. The Mannequin microcode occupies ment effort. When the group began to port the 9:) percent of tlie total 1,024 worcls of thc user- OsW1 system, Alpha AM' prototype hardware writ:ible control store. (141)Us) and fieltl-test coml,ilers were av;~ilable. Using microcotle assistance greatly speetls up Thus, the OSI:/l group iisetl tlie ISP in its hl)Ii niocle. M;~n~ieql~inexecution, yielding from 350 thousand where the ISP simulator operated as a console to Alpli;~AXP instructions per ClYJ secontf (KII'S) to a the ADU hardw;tre system. Tlie ADU consists of an peak perforniance of 1 million Alpha AXI' instruc- All~ha AXI' 1)ECchip 21064 processor, memory, tions per CPli secontl (IMII'S) on ;I VAX 8800. disks, Ethernet, ant1 a I>E<:station5000 wwrksti~tion, lVitliout microcode assistance. iVIannequin perfor- which acted ;a the console interface. Instrilctions mance is on the order of 10 KII'S. (For conlparison, that norm;~llyexecute on the simulator were trans- the ISP siniul:~tor operates at approxim;~tely30 ferrccl to the AOU for execution. However, the KIl'S.) Code streams that execute completely in entire symbolic debugging environment remained Mannequin microcode show much better perfor- iinchanged. mance than those that switch back ;inti forth between micrococle ancl the software simulator. With microcotlc ;tssistance on an unloadetl VI~X Si~nulntorSpecifics 8800, it takes from 20 to 30 minutes to boot the Tlie ISP simul;~torwas written entirely in portable OpenVMS ASI' system and reach the Digital <;. The Manneqi~insimulator was a hybrid of the Command Language (DCL) prompt ;iftcr login. <:++ ant1 C languages IsI' consisted of approxi- Becai~seof this microcode speedup, softwzire engi- m;~tely27,000 lines of code, Mannequin 31,800 ncers were able to simulate and debug a much lines. The two simulators shared common cotle: larger part of the operating system and utilities than the ISP simulator provided Mannequin with float- ever before. ing-point routines ant1 a comprehensive instruction Usirzg Silrzulation to Iler~elopcind Port Softz~wre

test program; Mannequin provided ISP with I/O whereas Mannequin supports only the not-last- device routines. Thus, the simulators verified the used (NLII) algorithm. The configurable TLBs Alpha AXP architecti~reas well as each other. allowed the operating system and chip design The Mannequin and ISP simulators tracked ancl groups to analyze and finely tune the tr;~nslation supported changes in the evolving Alpha AXP arclii- lookaside buffers for optimum performance. tectilre ant1 in PALcode. IIALcode is special machine- depetitlent software that provides support for many low-level operating system services such as Tlie Mannequin and ISI' simulators also support faults ant1 exceptions. PALcode ;~lso provides execution of user-mode, st;~nd-aloneprograms, i.e., instructions not in tlie base Alpha ~xl'h;~rdware. those with little or no operating system ri~n-time The two simulators 1i;lve features comnion to support, by providing program loaders for several many simul;~tors,i~lcl~~tli~ig support for loading formats. These formats include two UNlX object for- and running progr;ims, setting breakpoints and mats ((;OFF and a.out), an OpenVMS AXP image for- watchpoints, accessing memory, and saving and mat, and a system (raw tlata) image format. restoring machine state. Also supportetl are many Programs were compiled with early field-test machine-specific features, such as I/() devices, Alpha AXP compilers. Program execution was espe- interval timers, and configurable translation look- cially useful for hardware tlesigners ant1 conipilcr aside buffers. Besides a commantl line interface, writers for performance :lnalysis and benclimark- the Mannequin simulator has a graphical windows ing purposes. Note that applications rccluiritig fill1 interface that allowed users to see most machine operating system support usecl the AUI> facility, resources in a windows-hased format, as shown in described in a later section. Figi~reI. The simulators can ge1ier;lte trace files in a stan- 'The N1;innequin :~ncl[SP simulators support three dard trace file format. This common;~lityenabled basic devices: the two facilities to share the same trace analysis tools. The trace files generated by Mannequin A console device ~~setlfor terminal I/() and ISP were also used ;IS input to the Alpha A disk device used to boot the operating system Performance Model, another simulator that gener- ated detailetl performance data. An interval timer used for interrupts EVILIST ant1 ALPHA$REPORTwere two tools fre- quently used to analyze trace files and generate The disk device on the simulators can be either statistics concerning ni:icliine resources used dur- a file or a physicill disk device. The OpenVMS ing program execution. Tlie types of data generated group usetl a shared disk so that developers could by ALPH~I\$REPORT'~~~~LI~~the following: boot from a comnion disk while running on tlie simul:~toc Instruction distribution by opcode, class, and The simulators provide 16 megabytes (MB) of format physical memory with ;I tlefault page size of 8 kilo- bytes (kH). The physical memory of the simulators Instruction and floating-point register utiliza- may be increased to tlie practical limit of available tion summary virtual memory on ;I VAX system (minus a small Distribution of code block run lengtlls amount for the actu;il simulator code). Both simulators have configur:tble instruction Opcode pair distribution by class stream (I-stream) ant1 data stream (D-stream) trans- Control/branch instruction flow suniliiary lation lookaside buffers (TLBs). A TLR is a s~ilall cache that holds recent virtual-to-physical address At1 acti~altrace analysis report generated by translation and protection information. The sirnula- ALPHA$REI'OHT' is shown in Figure 2. This example tor TLIb can have a variable number of entries ill conies from a scaled version of FPPPP (one of the 14 each of tlie four gr;lnularity hint block sizes. benchm:~rl

Digital Tecbnicrrl Jouiv~crl VbI. 4 ,Yo. 4 .S/)ecicrl Issue 1992 183 Alpha AXP Architecture and Systems

OOOOONOO 00000000 00000000 00000000 00000000 00000000 OOOOoOrO 00000000 OOOMOOOOOOOO 000000000000 000*00000000 OWOrOOOOOOOO kvv000000000 Mlnu.N00000000 000000000000 NNN*00000000 ...... omomomomomom UlnlOhWOlOr OOrrNNMM**UtUl NNNNNNMM 000000000000 EWWWCCEEE 000000000000 000000000000 000000000000 OOOmOOOO 000000000000 000nU00m 000000000000 000rL000 OOOOLOOO 000ln~000 OOOOLOOO OOOOLOOO OOOOLOOO

OOOOLOOO 0000L000 0000~000 0000u000 0000u000 0000u000 0000k000 oooo~omo

-I ZZZZ a 0000 C > 00000000 C U) am- ", 00000m00 m aazm OOOOOOrnO E mm~a OOOOOONO LLW I 0000*0~0 +J+JU' 000000~0 I I Id 00000000 nu^ m 00000000 U.7 mr c4-J a OIO ma, Q 0 *r E 0 0 A L U 0 m CI .. 0 C .7 00 3 XL 00 m 10

mkOOOOOO O~OOOOOO OWNOOOOO OWOOOOOO OlLOOOOOO OlLOOOOOO OILOOOOOO OILOOOOOO

OkOOOOOO 0v000000 OILOOOOOO OILOOOOOO OlLOOOOOO 0v000000 0~000000 OlLOOOOOO

O-NM-~~QV 00000000 WWWWWWEDI

184 Vo/. 4 No. 4 Spccia!lssr4e 1992 Digitnl Techriicnl Joirrri~rl 1Jsing Si~ntrlatiorzlo Develop and Port Softwulr~?

ALPHA Instruction Statistics Report 6-MAY-1992 FPPPP -- Quantum chemistry calculation of a two-electron integral derivative

Instruction Distribution by Opcode (Ranked from highest to lowest)

Instruction Cumulative Class Mnemonic Occurrence Percent Percent 6 LD T 2321 155 25.41 20 8 MULT 1732928 18.97 40 8 ADDT 1433798 15.70 6 0 6 STT 998446 10.93 7 0 1 L D Q 544385 5.96 1 LDL 241 142 2.64 1 STL 178828 1 .96 4 BIS 151120 1.65 3 ADDL 126321 1 .38 8 SUBT 95045 1.04

help in designing critical sections of cotle. For Alph;~t\XP code using simulation on VAX hardwiire. example, the register usage distribution report At the same time, OpenVMS VAX run-time services helped determine how many registers shoultl be calletl by the cocle could be executed ;is native VI\X preservetl by a call ancl Ilow many shoultl be instructions. Thus, modules could be ported and scratcli (usable by a c;~lletlroutine without being tlebugged one at ;I time, until almost the entire preserved). ;ipplication consisted of bug-free Alph;~AXP code. During the design of the AUD environment, two key technical issues were The AUD Facility Whereas the Mannequin and ISP sin1ul;itors were How to efficiently tletect calls nintle by execut- suitable for initial debugging of low-level software ing VAX code to a routine in Alpha ASP code tliat such as operating systems, direct use of these tools could be "executetl" only by simulation, ant1 for user-motle ;~pplications,it., 1;lyeretl products, is conversely, how to tletect calls made by Alpha a different mzltter. Porting ;inti tlebugging Alpha t\Sf'cotle being simulatecl to native \/i\X code. AXP user-mode code is at best tlifficult without the How to effect the tr;~nsformationof parameters, full run-time support of an operating system. User- both location and representation, from that pro- r~iocle:~pplications typically take advantage of a vided by the caller in one dom;~ininto the loca- wide v;~rietyof run-time libraries, including coni- tions and represent;~tionsexpected by the callecl piled cocle support (such ;IS the Fortran run-time routine in the other domain. Altliongl-r there library), mathematical routines, graphics I/() ser- existed well-defined and widely followed calling vices, ;lntl database software (such as Rdb for stantlards for both tlomains. a variety of special- 0pe11\/hIS).Even if all this software were i~nmedi- purpose, 11igh-performancc c;~llingconventions ately available for Alph;~!\XI' systems, running it were i~seclin many situations. under simulation would be proliibiti\~elyslow. Therefore, Digital developed a mixed-execution This mixed-execution environment was expected tlebugging environment. This Alpha User-mode to have a relatively short lifetime, because it would Debugging Environment (All[)) was built from a become obsolete 21s soon as significant nu~nbersof con~binationof new ant1 existing Digitcll software real Alpha iLYP h;~rtlw;lresystems bec;uiie avai1;tble. components. In the AIJD environment, user-mode Conseq~rently,AlJl) itself had to be simple and inex- code being tleveloped for or ported to the Alpha pensive enough to be created quickly and put into AXI-' pli~tforrncoulcl be compiled ;~ntlexecutecl as use. 'l'he clevelopment effort met this requirement.

Digital Techrricnl Joui-i~nI l'ol. Q ,\b.4 .5peci~1II.FSII~ 1992 185 Alpha AXP Architecture and Systems

The elapsed time from initial concept to first use h)rmat image as output. The stantlard VAX linker was about eiglit months; tlie total tlevelopment can therefore reference locations in the Alph;t AX[' effort for AIJD over its lifetime was between three image in the normal way, :tnd the stantl;trd and four man-years. OpcnVMS image activator can be ~~sedto load the Alplia AXP image for execution. However, to mini- AUD Components mize complexity, we did constrain the Alpha A>(I' Despite the desire for simplicity, AUD consists of a image to be linketl as an absolute image (i.e.. a n~~tnberof cooperating components: basetl image, in OpenVMS jargon). This restriction eliminated the problem of how to relocate Alpl1;r (21llable Mannequin Alpha Simulator AXI' instructions using the OpenvMS image activa- ~lrl)tlebugger tor. As mentioned previously, the Alpha AXP image also includes a global storage area to hold the sirnu- ~ljl)linker Ir~tedAlpha /\XI' n1;lchine state. Alpha ASP native services All~haAXP Natirle Ser-vices Alpha AX]-' native ser- vAX jacketing services vices is a special oper;~tingsystem shell, part of AIII) 1.inkage Analyzer (t\LA) which executes as Alpha AXP code (i~ntlersirnul;~- tion) ancl part of which is includecl in the AUD jack- Selected VA~jackets eting services. ?'he native services provide the lowest-level support for hardware esception han- Callable Mcrn~zeyzii~tAlpha Si~nul~~tor.Callable dling and the OpenVMS co~lditio~i-handlingfacility. Nl;innec[uin, the Alpha AX]' instrt~ctionset siniula- Wlljle iUiD ultini:~tely supportecl frame-based con- tor, is essentially a subset of tlie mannequin simu- dition handling within the Alphit t\XI' image, inter- lator described earlier. In particul;rr, Callable operation of applic:~tionexceptions between the Mannequin omits the user interfdce and Alpha AXP Alpha AXP and VAX domains was not supported. machine state. Insteatl, tlie A~JDdebugger supplies the user interface. Also, storage for the Alpha UP VAX jacketi~lgScrrlices VAX jacketing services is machine state is separately linkecl into the AUD VAX code tliat supports the ability to write jackets environment to make this information globally that pass control back and forth between VAX and accessible. Callable Mannequin does retain tlie Alpha AXP code. The mechanics h)r ;~ccotnplishing microcotle-assist feature. this ;are discussed in the Jacketingsection.

AUU Ilr11~rgge-r-The ALID tlebugger is a niotlified Linkage A~za@zer The A1.A is ;I specialized version of DEBUG-32, the user-mode tlebug utility compiler that reads a specializetl jacket description on the OpenVMS VAX operating system. The AUD I;~nguage. This langi~agedescribes how calls in del>ugger provides most of the same feat~~resof one tlomain ;ire to bc transforrnetl into calls in the DEBUG-32. A configuration option allows the other domain on a routine-b!,-routine, parameter- DERII(;-9 utility to use an internal. low-level by-parameter basis. The output from tlie AM is remote debugger interface to interface with a for- an Alpha AXP object module and a linker options eign target. (This capability was origin;~llydevel- control file, both used to link the Alpha UPimage, oped for use in other protlucts such ;IS VAXELN ;inti a VAX object module. The Alpha i\XP object Atla.) We developed new cotle to join DEH1J(;-32 ant1 module provides a transfer veclor into tlie Alplla Mannequin using this interface. As a result, the AIrD ASP procedures. The linker options control file debugger 'cvorks directly with VAX cotle, in the provitles symbol definitions in an encoded form to LISLI;I[ Inannet-, ant1 works with Alplia A~I'cotle by manage calls from tlie Npha 14x1' image to the main passing commands to the Callable Mannequin simu- VAX image, which is linked later. The VI~Xobject lator to set breakpoints, examine instructions, exe- motlule contains a table that encodes the jacketing cute cotle, etc. tlescription.

AUD J.i~zker The Alll) linker is a vari;mt of the Selected VAX J~icket.s Selected \'AX jackets are A[.A Alpha AX]' cross linker tliat reads Alpha AXP object jacketing files (in both source and compiled forms) modules as input ant1 produces an OpenVMS VAX for calling common VAX facilities from Alpba AX['

186 1W. 4 iVo. 4 Speciclllssrrc 1992 Digilnl Technical Jortr.acrl Using Sinz ~ilntiojzto Ilei~elopaizd Port Soft~iwr,c

code. J:~cketsare provided for OpenVMS system ser- part of an application and its linking with the AUD vices, the C run-time library and some parts of the manager to create the \It\>; main image. When the general-purpose, run-time library (LIBRTL). The mixed VtiS and Alpha AXP application is executed, DECwindows group also supplietl jacket definition these images are combined in memory with files for use by other groups. AUD users are able to Callable Mannequin, the A1:D debugger, and other supplen~entthese files as neetletl by creating and shareable images. This relationship is illustrated in cornpiling their own jacketing descriptions for Figure 4. other VAX facilities. Figure 3 shows the main steps in building an AUD Jacketing environment. The uppermost sequence shows the Jacketing is the key feature that allows VAX and cornpililtion ancl linking of the iUpI7i1 AX^' conipo- Alpha AXP interoperability i.e., gives a processor nents, which results in the creation of the Alpha the appearance of being able to execute both \/AS AXP image. The central sequence shows the compi- and Alpha iLYP instructions. Mthough the details of lation of the jacket descriptions, which results in jacketing are intricate, the result is simple and ele- the creation of components that are included in gant. Calls can be made freely back and forth both the Alpha AXP and the VAX images. The lower between VAX compiletl code and Alpha AXP com- rows of Figure 3 show the compil;~tior~of the VAX piled code, without any special compilation motles

ALPHAAXP 1 n ALPHA AXP ALPHA AXP

PROGRAM

ALPHA AXP ALPHA AXP LINKER IMAGE

ALPHA AXP JACKET OBJECTS v

JACKET JACKET VAX VAX D-DESCRIPTIONS COMPILER T LOADERAAA -t MEMORY

VAX

VAX VAX - LINKER IMAGE

DEBUG IMAGE PROGRAM El-

ENVIRONMENT CALLABLE MANAGER COMPILER OBJECTS MANNEQUIN IMAGE

Fig"'-e -3 Crenting an AUD Application

Digilnl Tecb~ricnlJorrnrnl 4 LVO, 4 .Speck11Iss~~c 1992 187 Alpha AXP Architecture and Systems

ALPHA AXP LlNK-

r CALLABLE MANNEQUIN VAX AUD AND JACKETING + AUD DEBUGGER -LlNK TABLES -IVAX COMPONENTS I I MAIN IMAGE SHAREABLE LIBRARIES

on either sicle. 'The A~JDsupport is fi11Iy rec~trsive an Alpha AXP routine. According to the VAX archi- ~IICI reentrim. tecture, the first 16 bits of a routine comprise a Static c;~llsfrom VAX to Alpha AXI' cocle are mask that encodes the registers to be preserved as clirected to cl~rmrnyentry points in thc object motl- part of the call. Bits 12 ;~nd13 of this nxuk are ~rleprotlucecl by the AW compiler. E;ich entry point ~lnusedand recli~iretlto be 0; if one of these bits is is simply :in instruction that loads a pointer to the set nt the time of a call, then a hardware exception jacketing clescription table for the t:lrget t\lpha ,-\>;1' results. According to the OpenVMS AX]' softwnre procedure, h)llowetl by a transfer into common ;~rchitecture,an Alpha AXP procedure address is jacket interpretation cotle. ;lctu;rlly the atldress of a procedure descriptor, Calls from Alpha AXP cocle to VAX code use which is a data structure ancl not the actu;~lAlpha the fact that the <:allable Mannequin component ASP cotle. By design, bits 12 and 13 of this data stops anti returns c01itr01 to the i\lll) environment structure must be set to 1. when it detects :in instruction that trilnsfers control VAX execution of a VAX CALL instruction that out of the Alpliii ASP image. In this case, the appar- ;Ittempts to transfer to ;In Alpha AXP 17rocetlure ent address is an encoded integer (created by the results ill an exception. A special AUD exception )\LA), whose high four bits make it look like an ille- Ii;~ndlt.rintercepts the exception, determines if the gal address (in the VAX reservecl Sl space) and illeg;~lentry mask is caused by ;I reference into an whose remaining bits ;Ire a two-level index (i.e., 12 Alpha AXP image, ant1 if so, calls into the All[> jacket- bits of facility cotle and 16 bits of ofhet) into the ing routines to reform;tt the call frame. This mecha- jacket description table for the target \'A>; proce- nism also works for handling asynchronous system dure. This two-level scheme wits chosen to allow tl.:11x(ASTs) from the Open\JMS MIX operating jacket descriptions for different sharetl library facil- syste11-1 to Alpha AXP cocle. ities to be preparecl ant1 compiled inclepentlently. For computed calls from Alpha AXP code, com- Tlie facility cock is :i niimber norm;illy ;ilready asso- piletl code calls an Alpha AXP run-time library rou- ciated wit11 t1i:tt facility by software convention for tine te) perform the comparable bit 13 test (untler other purposes. simulation). If bit 13 of the target location is set to I, When ;I routine is calleel sing ;I dynamically then simulated execution continues ant1 ;In Alpha- tleterminetl atltlress, such as ;in ;~clclrc.ssgi\.en in ;i to-Alpha call is carried out. Otfierwise, control function \~;iri:~ble,a property of the VAS ;ind Alpha tr;~nsfersto a speciiil V\X code entry point in A1113, AXP architectures is exploitetl to determine dynam- which terminates simulation and performs jacket- ically whether the target routine is ;I VAX routine or ing back to the Vt\X target procedtrre. Usirzg Sirn zrlntioll to Der~elopalzd Port Softz~wra

Basic Operation supplies when it sets up a VAX-to-Alpha call (i.e., an To start executing a mixed application, the AUD address in the reserved S1 part of the VAX address environment first performs several initializ;~tion space), the address is interpreted as a retilrn trans- steps. 111 particular, AlJD scans all tlie images loaded fer. Otherwise, t\LJl> initiates a new Alph;~-to-VAX in process memory to identfy tlie Alpha AXl' image call. (only one wirs allowed and supportecl). For a return operation, the AllD code copies the Some ALJI) options are set through the use of return \~~lueor values from the Alpha AXP registers OpenVMS logical names, which are interrogated and passes them back to the VAX code. A VAX return during image initialization. These options include instruction is then executed to resume execution of the calling \hX cotle. Selecting Alpha t\XP stack size For a call operation. the VAS code fetches the Enabling tlelivery of ASTs to alp hi^ AXP routines Alpha AXl' parameters and builds a VAX argument list, whicli is then i~sedto call the target vi\S rou- Disabling the normal Alpha ~ixPstack consis- tine. When the VAX routine returns, the contents of tency checks the result registers are copied to the corresponding Disabling i~nalignedmemory reference messages Alpha AXl' machine state loc;~tions,atid Mannequin is restarted to resume executing Alpha ASP code. Enabling AIJD initialization tracing Despite some limitations (e.g., only one Alph;~ Disabling integer overflow checking image and no exception handling across the VAX to Alpha AXI1 clomains), ALJD greatly aided the Debugging combined VAX and Alpha AXP code OpenV,MS hXP porting effort. The simulator pro- under the ACID e~ivironmentis similar to debugging vided software gro~11~with 3 pseuclo-Alpha AXI' normal VAX code untler the DEIKIG-32 OpenVMS environment in which to debug their Alpha AX[' debugger. Basically, if the address involved in a code, well before either Alpha t\SP harclw;ire or the debug command is within an Alpha Ml' image, OpenvMs rU(P operating system was available. then the debugger calls tlie Mannequin simulator to Many OpenVMS AXI' groups successfi~llyused MID perform the commantl for the Alpha AXl' cotle. to facilitate their porting. inclucling the Record Otherwise, the DEBU(;-32 debugger itself performs management Services (ltblS), I>ECwindows, Forms the command for the VAX code, as usual. Alph;~ASP Managenlent System (FMS), various OpenVklS com- machine state is kept in static global storage by mand utilities, text processing utility (TPI!), IXBUC;, Mannecluin and thus is visible to the AUD debugger. ;~ndGEM compiler b;~cl<-endgroups. In the DERIK; symbol table (DST) representiltion. variables that ;Ire allocatecl in the Alpha AXI-' regis- ters are clescribed as being allocated in the corre- The AUDI Facility sponding global state locations. This "trick" The \'AX Environment Software Translator (VEST) is allowed t\UD to handle the 64 Alpha registers an important part of the initial OpenVMS AXP offer- using the Vt\X DST representation, which can ing.5 VES'I' translates an OpenVMS VAX executable or encode only the 16 VA~registers. shareable jniage into an Openv~MSASI' image that Once simulation begins, Mannequin continues to can then be executed with s~~pporton ;in OpenVMS simulate Alph;~&YP instructions untjl it either AX11 system. As for other user-mode layer softn7are detects an instruction that would transfer control components, it was desirable to test VEST and outside of the Alpha AXP image, completes a single- images tr~nslatedby VEST as e;irly as possible in a step request, or detects an error condition. lJpon simulation environment such as AlJl). However. returning to the AUD environment, 34annequin sup- AUD coultl not be used directly to test tr;~nslated plies status information that indicates the reason images for two reasons: sinlulation entled. VEST directly creates an Alpha AXP image. In For a transfer of control from Alpha to Vi\X effect. VEST is a combinecl compiler 2nd linker. cotle, AUD must first cletermine whether the trans- Thus. the symbol mapping protocols ilsed by fer is a return from Alpha t\XP cotle ;IS a result of a AUD were extraneous, ant1 the linking protocols prior VAS call or a new call from A.lp11a AXP code to had to be completely replaced. VAX code. All0 is fully reentrant, so AUD cannot make this determination from global state. If tlie Actual execution of a translated image on target address is a distingilished ;~ddressthat AUD an Oper~\~klSAXP system makes llse of the

Digital TechnicalJournal 1/01, 4 ~\b.4 Special Issue I992 Alpha AXP Architecture and Systems

Trans1;cted In1;cge Environment (TIE).; The TIE equivalent VAX program counter (PC) is not tlirectly is a shareable library that contains support rou- available during execution, since translatetl code tines for translated images. In particular, TIE occupies different address space than the original provicles support for VAX complex instruction VAX code. Thus, register R15 is used instead as an procebsing, Vi\S-to-Alpha atldress tnapping, and in-image index register. Open\lMS VAX exception handling. Creating a The user-mode \'AX stack is split into a VAx stack VAX version of the TIE to use with AUl) required ant1 :In A!pha and emularetl VAX stack. The VAX intimate interfaces with the OpenVMs VAX oper- stack services both the A1Il)l environn~ent2nd any ating system as well as comp;~tibilitywith AUD. VAX system services or run-time library routines that the translated image may c;~ll.The Alpha ;cncl Thus, the need to debug translated images led to ernulatetl VAX stack services Alpha ASP and trans- the creation of the Alpha User-mode Debugging lated code. Environme~ltfor Translatetl Images (Ar1l)I). Just as Tr;cnslated imxges contain calls to the TIE as nec- Callable Nl:innequit~provided a key building block essary to simul;rte VAX complex instructions and for AIJI), ACID in turn provitled a key building block procedure calls. Complex instruction routines are for 1\111)1. Alpha 14>il' softmlare teams ant1 porting usetl to simulate VAX instructions that would othel-- centers used ALJIII to port both Digital alicl third- wise expand into excessive Alpha AXP code. party translatecl applications prior to the arrival However, since tILln1 is running on \4\X hardw;lre, of Alpha AXP hardware. The porting process was complex instructions can be executed native on the ;IS follows: a VAX application was translated to VAY, 1'1:rrclware. Alpha AXP code by means of the VEST translator; To initialize the AUDI environment, the translntetl this code was then run on ;I \'AX system using the image calls an initialization routine in the TIE by ALJDI simulator. means of an initialization program section (I'SECT). The Atll>lprocess components shown in Figure 5 This routine determines the address range of the incli~clethe Alpha ASP code and the location of the WX-to- Callable Mannecluin Alpha simulator Alpha atlclress mapping structure, saves tlie current AUD debugger Alpha AXP register state, and calls Manneclui~ito begin executing translated code at the appropriate VAX version of the TIE entry point Tr;~nslatedcode uses tlie ;~dclressmap- Translated ViiX cotle (Alpha AXP code) ping structure to find computed branch destina- tions on the fly. Callable WIannequi~ithen executes AUDI E~zuironment translated code until it encounters some instruc- Emulated VAX state in ALJDI is mai~ltaineclin a global tion that woultl transfer control out of translated context block. Emulated VAX registers RO through cotle. The cause of this transfer woi~ldbe either a R14 are used ex;ictly as their Vt\X counterparts T[E-b;csetlprocedure or complex instruction call! or The corresponclence between a translated and calls to native VAX routines.

CALLABLE MANNEQUIN ORIGINAL VAX CODE 1 I

TRANSLATED VAX CODE (ALPHA AXP CODE) -1 AUD DEBUGGER I TRANSLATED IMAGE I OTHER IMAGES I

AUDl ENVIRONMENT I

Figure -5 AUDI Pt40cessConzpone~zts

I90 Vol. 4 Xo. 4 S/)ec'ilrl Isstre I992 Digital Technical Jorrrnrrl Usin? Simulation to Der~elopcrrzd IJort Sofku~ure

Like AUD, AIII)l allows interoperation between not function properly. For outgoing calls (trans- translated VAX cotle (Alpha AXP cotle) and VAX lated VAX code to VAX code), routines in which code. Translated cotle can use existing VAX system a callee modifies its caller's stack fame argument services and run-time libr;~ries.AUDI does not use list or return address may produce 11npredict;tble the jacketing language described in the section The results. since the a11toj;icketing may be altered or ALJI) Facility. Instead, ALJnl automatically jackets clisconnected. procedure calls between the translated VAX code and the native VAX code. Autojacketing includes AUDI Example builtling proper parameter lists and call frames for Figure 6 shows the execution of a translated image the destination calling standnrtl. under AULII. Note that both the BASIC image The fact th;tt Alil>l does not use a jacketing lan- (HELLO-WORLD) and the BASIC run-time lil~rary guage leads to some proceclure call 1,imitations. (BASRTL) are translatetl. Run-time libraries that are However, note that these limitations do not appear used by the AUDI environment cannot be translated when running translated cocle on actual Alpha under ALJIII. Translating run-time libraries that AIll>l AXI' hardware. For incoming calls (VAX code to itself uses causes a "circularity in activation" ;untl translatetl VAS code), all AST delivery and condition incorrect or no execution. handlers execute as VAX code rather tli;tn as trans- In the HELLO-WORLD esample, there are 28 c;~lls lated VtiS code. Thus, translilted programs may to VAX routines, most likely those to LIBRTL and

$ RUN HELLO-WORLD-TV Hello World from VAX BASIC

AUDI V3.0 Runtime Statistics:

8085 Alpha AXP instructions were executed.

TIE Lookups: CALLx J SB JMP ...... Stayed in ALpha AXP routines: 4 5 0 ...... Went to VAX routines: 28 0 0 Total: 3 2 5 0

28 VAX returns used (28 RET, 0 RSB) to resume Alpha AXP code. There were no Fault-On-Execute conditions converted to Lookups. 21 CALLx Context Blocks were allocated - which were reused 7 times^

There were 19 TIE-based 'complex instructions' executed. Instruction INSQUE (OE) : 2 Instruction MOVC3 (28) : 8 Instruction MOVC5 (2C) : 8 Instruction MOVTUC (2F) : 1

There was 1 VAX routine call to Alpha AXP code.

There were 2 images containing Alpha AXP code: HELLO-WORLD-TV XO.0 from BL3.3 VEST of Mar 30 1992 09:27:02 BASRTL-TV XO.0 from BL3.3 VEST of Mar 30 1992 09:14:10

Execution depended on these images: LIBRTL-TV DECWSXLIBSHR LlBRTL2 MTHRTL-TV DECWSTRANSPORT-COMMON LIBRTL TIESSHARE VAXCRTL DBGSSISHR MQNSSHARE MTHRTL DECWSDWTLIBSHR CONVSHR LBRSHR SORTSHR

Figure 6 AUDI Evamnple- Twnslated Hello World l~?zcrge

Digilul TechnicalJourtcal Vol. 4 No. 4 Specir~lrss~te 2992' 101 Alpha AXP Architecture and Systems

OpenVMS system services. Therc are 21 unique Murphy, Scott Robinson, Larry Woodman, and <:ALLx contexts and 7 reused ones. In acldition, the James Wooldridge. Finally, much of the ACID1 example uses four different complex instructions. information jn this article is taken from work origi- nally clone by Scott Robinson. Othcr AlJDI contrit)l~- tors inclucle George L)arcy, Mark Hcrdeg, Matthew The softw;lre simulators Mannequin, ISP, ill). ;~ncl Kirk, N;~ghmrhMirghi~fori, ant1 Mut-ari Srinivas;ln. AIJDI grei~tlyaided Alpha AX[-' software porting and development efforts. Subst;inti;~lparts of both system ant1 application software were sirnul;ited 1. R. Sites, etl., Alj,l)a Archilect~lre Rejkrc~zce and verified concurrently with hardware develop- ~Mclnu~I(I3urlington. hm: Digital I'ress, 1992). ment. When Alpha ASP liartlware becalne ;l\l;~ilable,

most software coultl bc plugged in simply ant1 Kin L. C. Thacker, D. Conro): and I.. Stewart. "?'he exactly ;is expected. The use of these simulation Alpha Den1onstr;ation IJnit: A High-performance tools saved a year or more from the over;~llAlpha Multiprocessor for Software ;inti Chip l>evel- ASP schetlule. opment," Ilic~itc~lT~d?i~iccll.lori~m~I, vol. 4. no. 4 (1992, this issue): 51-65. Acknowledgments M1:1ny people throughout Digital contributetl to the 3. Open K1f.S DeIt~~/,Yneltu Utility ~tfunlral success of the Alpha ASP simulators. Ho1n:iyoon (,Majrn;irtl: Digit:il Equipment Corpor;~tion, Akhiani, Ray Lanza, Stephan Meier, Steve Morris, Order No. Ah-PQYPA-I'K, 1992). Antlrew Payne, ant1 Jon lieeves worked on Llie ISP 4 S. Mislir;~,"The VtlS 8800 ~Microarcliitect~~re:' model. George Dare): Mark Hertleg. Kevin Koch. Di'itcrl TccbniccllJour~c~,vol. 1. no. 4 (February Eric Rasmussen, and Scott Robinson contributrrl to 1987). 20-35 the ~Manneqi~insininlator. The Alll) cffort includetl several groups from across Digital. Their primary 5. R. Sites, A. Chernoff, M. Kirk, M. Marks, and contributors were LW~lterArbo, Ronald Brcnder, S. Robinson. "I%in;~ryTr;~nslation," Digit~~l Henry Grieb, Nark Herdeg, Michael lles; Ja~iies Tec/~rzicr~l./o~i1'11~11,vol. 4, no. 4 (1992, this issue): Johnson, Robert Landau, Mar~riceMarks. Dennis 137-152.

192 Vol. 4 hb, 4 Speciallssue 1992 Digital Techr~icalJorrrncrl Peter E Conklin I

Enrollment Managemenl; Managing the Alpha AXP Program

Digital's ~nultjjecrrAlpha AXP program has involved more than ~ZLJOthousand engi~zeersacross lizalzjldisciplitzes. I1znot1atiz~emnnagenzent stjdes arlcl techniques were required to delizler this high-quality program on an aggressiue scheclule. The Alpha AXP Program Oflice used a four-point methodology for nzanagemelzt: (I) establislj a17 appropriately large shared vision, (2) delegate co~rzl~letelyand elicit specific co~nrrzitments;(3) inspect rigorozis(~;proz~icling supportive feed- Back; (4) ack~zou~lerigeerJer- rldzmnce, learning ns the progmlrz progresses, We cor-zsciocrsly used each project ezient to propelp/.ogress crtzd gall? ~nonzentt~nz. Digital deliuered the Alpha AXP progmnz on sched~ilewith industq~-Leadership c~pa6ilities.

Introduction tliscussion of the model's evolution tluring the The program to develop the Alpha AXP systems Alpha AXP Program. has been the largest in Digital's history and one of the largest in the compilter intlustr): During Size of the Alpha AXP Program the course of the program, the Alpha iLVP Pro- Digital's Alpha AXP program encompassed the gram Office developetl ;I moclel that provided the design of a world-leadership microprocessor chip, tools necessary to manage the program. At times, a new 64-bit system architecture, multiple h;~rtl- this paper may seem to imply that the program ware systems (from personal computers to main- team cleveloped the tools and then used them in frames), multiple operating systems, and huntlreds a pure form. In practice, the team developed these of softw;lre products layered on these systems. The approaches basetl on many years of experience antl development of the first-generation products on the management tl~eoriesof experts; we also extencled over several years ant1 involved more than learned and applied these lessons as we managed two thousantl hardware, softw;~re,and systems the program. engineers at its peak. Digital managed the overall Although the positive effects of timely delivery development program from ;i Program Office and high quality are particularly noticeable results staffed by eight professionals. of sucli a large program, Digital has also used the Across Digital worldwide, the Alpha AXP pro- tools to goocl effect on smaller projects. Moreover, gram development spanned more than 22 software teams within the Alpha AXP program usetl the tools engineering groups and 10 hardware engineering recursively, project by project. The author's experi- groups. The hardware effort included the semicon- cnce is that this management model is applicable to tluctor design group ant1 groups for each of the projects of ne;~rlyany size. bardw;rre systems platforms. The software efforts The discussion th~th)llows briefly ilcfines the encompassed four opt-rating systems groups, and scope of the program and explains why traditional groups clesigning migration tools, network sys- methods were inappropriate for managing the tems, compilers, databases, integration frame- development of such a complex product set in a works, and applications. Some groups peaked at short time period. The Enrollment Management more than 150 development engineers plus sup- Model and the concept of cusps-a key element of porting staff. Many also co~ltracteilwith suppliers the moilel-are then clefinetl and clarifiecl through both within and outsitle Digital.

Digital Technical Journal Vol. 4 No. 4 Speciol Issue 1992 193 Alpha AXF Program Manage~nent

Inappropriate OrganizationalApproclcbes PERSONAL BUSINESS GOALS PUBLIC PROJECT OBJECTIVES Implementing such a broitd, complex program pre- sented not only technological challenges but a man- agement challenge as well. The Program Office therefore consiclered ant1 rejected a nirmber of tra- ENROLLMENT ditional organizational approaches. In the classic organizational model, a hierarchi- COMMITMENT cal, or line, organiz:~tionis brnlecl, containing all DELEGATION the primary implenienters. The problem with this aplxoacch to large programs is that it takes too long to form the organization. Staffing the tealus and TRUST ACCOUNTABILITY establishing operational procedures take longer (TASK-OWNER-DATE) tti;rn the market wintlow and avail;~bletechnology allow. Tlie result is grant1 visions ancl projects deliv- ered ye:rrs behind schetlule. Further. "temporary" Figure I Bzr-olkrnent Ic.I~~~zage~~ze~?t~LIode/ organizations must be folcled back into the niain- stream at the encl of the progratn. Tlie management The model in this form was developed by goal of the tUpha AX11 program was to keep esper- the ~uthor.Some elements are derived from man- tise concentratetl to achieve synergy across many agement seminars ;tnd fro111 consultants' rccom- projects within a particular tliscipli~ie.~ mendations. 'The particular forms used for vision. An alternative approach is to form small coriimitment, and acknowledgment emergetl dur- entrepreneurial teams ant1 challenge them to work ing tlie AJpha t\Xll program. The insgection- long hours to achieve the goals. This works well in support stage was developed by the author tluring s~iiallprojects suititble for "skunk works." The origi- many years of project management and reviews. nal design work was concluctecl in this fashion. Two concepts are key to implementing this However, when this approach is applied to large model for large programs. First, the Program Office, programs, the result is that team members burn out which has ;rlreacly bcen ~i~e~itionetl,provides the without ;tchieving the aggressive schetlules necessary cohesion, program vision, and inspec- demancled. Management then becomes frustratetl tion structures, while allowing the skills and and tries again with different teams, but the results resources to remain within their ~lati~ralorganiza- are no better. tions. Moreover, the office lends consistency across The Program Office est:tblishetl tlie Alpha AX]-' the progr;Im and encourages each contributing program as an integration of project teams that group to holcl to its commitments. The small Alpha woi~ldremain within the existing line organiza- AXP Program Office, made up of a diverse group of tions. Thus, for exitmple. each hardware ant1 soft- product and operations rn:inngers. Ii;tcl no formal ware project resitled within its functional group authority (not even budget autliority); so it exerted influence only through rigorous enrollment and (semicolitluctors, servers. Openv~MS,IJNIX, compil- delegation outlined by the ni;lnagerilent model. ers, database, CPLl development, networks, etc.). The Program Office integr;rted the work of the intli- The second key concept is the project "cusp," vitlual project te;tms, which provided the atltli- which is a critical event that propels chaligc. <:IIS~S tional atlv;intage of program resilience in tlie face of at-e further definetl in the sections Inspection- functional group reorganiz;rtions. Supl>ortant1 Using Project Cusps l>elow. Vision-Enrollment TheEnrollment Managemerct Model The Program Office uses vision to enroll the related Tlle Enroll~~ientManagement Model (Figure 1) for groups in tlie goals of the program. For example, the Alpha AXP program conlprises four stages. the vision can be tlie top-level business goals and customer needs. For subordinate projects, the vision can be the objectives of the larger project. In Commitment-Delegation all cases, the enrollment happens only when the goals are set in the contest- of the ;~i~dience(the Inspection-Suppost project team). In particular, the Program Office is most effective when it expresses tlie program3

194 Vd.4 Nr,. 4 S/~~falfme 1392 Digirnl Tecb8ricnl Jorrrnnl ~ztil/Jnn61gelne1it, ~MLIIILI~~II~ the Alpha UPProgtzlnz vision in the terms and language of the group being application of the Enrollment Management Model enrolled. The vision has to be Large enough to anel the role they played in the creation of the encompass all the recluirerl commitments and the model itself. ultini;~teresults. Acknowledgment-Learning Commitment-Delegation At each step of the project, the Program Office As the manager of a project develops plans, he or acknowledges progress both personally and pub- she delegates the tasks to sub-groups ancl solicits licly. For each event, the management team repeat- specific commitments to content ant1 schetlule.3 edly asks what was learnetl and how managers ant1 Since these commitments are made within the con- the team can do even better nest time. Teams are text of the larger vision, the subortlinate commit- frequently coached to improve their methods for ments become qiiite strong for sub-project better results. members. A key element of the delegation process is the explicit specification of the results such that Using the Model they ;ire measurable ant1 identified with an intlivid- 111 principle, the Program Office used the Enroll- ual owner. The owner is a single indivitlual empow- ment Management Model in each conlponent proj- ered by the committing group and held ect. Of course in practice, not all groups used this accountable for the deliverable:' An important methodology. Early in the program, only a few point here is that the term "owner" does not neces- groups signet! up. As the Alpha AXP Program Office sarily refer to the person who actually cloes the began organizing the overall program, we started work. The owner is responsible ancl therefore formalizing the methoclology. As noted above, we accountable for getting the work done on time. In learned extensively as events progressed. We found our particular program, the Program Office hat1 to few useful manuals applicable to running such a clarify ant1 reinforce this distinction carefully as large program effectivel}~.Insteael, the Program part of thc enrollment stage. Office developetl many of the tools "on the job," learning as the project ~~nfoltlecl.~~This paper exag- Inspection-Support gerates a pure model rather than presenting its The project manager trusts in the comnlitments incremental development. To balance the picture, made and continually inspects the project to ensure we show some of the pitfalls and side paths. delivery on schedule. This inspection strictly takes Most project managers followed the Enrollment the form of supportive feedb;ick, thereby encourag- Managemellt Motlel either by instinct (experience) ing people to disclose risks before they become or by example. In several instances, they formally problems. Whenever the projected results are ;it reached outside for training in running projects risk of falling short of the commitment, the project of this complexity. Depending on the size of the manager declares a project "cusp." project or sub-project, m;lnagers used the model The term "cusp" is adapted here from Gleick to with varying tlegrecs of rigor. For example, the clescribe the potential turning points, or critical larger projects anel the program overall used formal events, in a project.5 (Other terms in conventional inspection meetings and reviews. Subordinnte parlance include "gotchas:' setbacks, crises, turning projects were free to use formal or informal inspec- points, project breakdowns, and "calls to action." tion processes. The program team inspected The ni;inagers i~sedthese terms during the propam. each group's inspection processes to ensure that For our purposes, we adopt the term cusp as an there would not be any utlfortunate management emotionally neutral term. It is importz~ntthat at any surprises. point it1 the project the term used be one that gives an opening for the possibility of making a difference Using Project Cusps and for moving the project forward.) At the point of As described earliel; cusps are critical project a CLIS~,everyone is ready to embrace change events, or crises. Conve~itional project manage- because it furthers the overall program objectives. ment concentrates on rigorous planning to avoid The management team col1abor;itecl to take such crises. The Program Office took the opposite advantage of cusps to propel project momentum approach: We strove to iintlerstand the critical toward the established goal. Examples of cusps in events and nlilestones and used these cusps to the Alpha t\XP program are presented throughout increase project momentum, as Figure 2 illustrates. this paper to demonstrate their integral value in the As the project approached each cusp, the Program

Digital Tecbnicnl Journal Vo1. 4 rV0.4 Spclul issue 1992 195 Alpha AXP Program Management

2. Nineteen ninety-two is the first year in which microprocessors have achieved over 100 MIPS (million instructions per second) of computing. 3. It is now cost-effective to place more than 4 giga- bytes of main menlory on a system; hence 32-bit addressing is insufficient. 4. Networking technology now allows the con- CUSP struction of networks with over 100-megabit throughput. Figure 2 Cusps as n Way to Change Directions 5. Cost-effective storage systems now exceed the many-gigabyte range and are approaching Office dealt with the event promptly to ensure that terabytes. the project continued to move toward the overarch- These computing systems will inclucle large ing goal. In other words, the managers did not amounts of parallelism as compared with classical develop a plan just to follow the plan. Instead, they designs. The levels of performance and connec- cleveloped a plan to understand the overall project tivity finally allow computing to realize greater flow and used the milestones and other events as human productivity: mobile, highly inteructiue opportunities to adjust the project velocity to keep computing that supports group work with algo- moving toward the goal.' In many cases, we gener- rithms that intelligently analyze, simulate, and ated a cusp to propel the necessary change (for synthesize in szlpport of u zuide variety of human example, by creating a schedule crisis). In other endeavors. The application of this technology clual- cases, we took advantage of a cusp to make a neces- ifies as the fifth generation of computing."9 sary change. The program vision for Alpha AXP systems, as As the management team became comfortable shown in Figure 3, is to be the first family of systems with using project cusps constructively, the to implement the technology and applications for Program Office actively solicited more of them. the fifth generation of computing. This family is These increased the velocity and resulting momen- fiilly compatible across all members now and will tum of the program, thereby achieving a "slingshot" be into future generations, ensuring that applica- effect. The Program Office used each cusp to tion binary programs will run unchanged. With no acknowletlge progress. As the team acknowledged compromise to hlture performance, the initial fam- more and more progress, the program's niomentutn ily members also maintain a high degree of com- moved from very low to break-even, and finally into patibility with current systems to allow easy high gear. migration for customers as they begin to require this technology. Delivering a family of high-quality Vision-Enrollment Stage: systems in a timely fashion reestablishes Digital's Magnitude of the Program's Vision reputation for technology and systems leadership. The vision for a program or project becomes the ultimate goal or deliverable. Thus, the Alpha AXP Program Manager's first task was to establish a vision shared by all groups that would contribute to 64-BIT MEMORY the program. This vision had to be large enough to 5L! TERABYTES STORAGE encompass all the work.

Alpha AXP Systems as Fzyth-generation Computing SAME ARCHITECTURE. The Alpha AXP family is at the confluence of five COMPATIBLE SYSTEMS I major trends in computing. 1992 TIME 1. Nineteen ninety-two is the first year in which it is feasible to achieve 64-bit computing on a single microprocessor. Figure 3 Alpha AXP Vision

196 Vol. 4 No. 4 Special Issue I992 Digital Technical Jozrrnal Enroll~rzentManagenzerzt, Managing the Alpha AXP Progra~rz

Getting Started Office pointed out to the project managers that tliis The Alpha AXP program grew out of research revenue could translate to approximately $0.01 on on computing, specifically the teclinology and the stock price for eacli hour of schedule improve- benefits of different RlSC (reclucetl instruction set ment. With this concrete business metric in mind, computing) architectures, ancl from advanced the key project managers were willing to consider developmetit in compiler designs, VLSI (very large- new ways to tackle the program's challenge. scale integration) design, and senliconductor fab- Once tlie Alpha AXP program was approved, the rication. In 1988, Digital's Executive Committee I'rogram Office began holding Alpha AXP quarterly challenged Engineering to tlevelop a system that review meetings. At these forums, groups reported would meet the customers' needs for competitive plans and progress to a wide, cross-disciplinary performance in all of Digital's computing envi- audience. Initially, the audience was composed of ronments. Engineering formed ;I cross-disciplinary engineering, manufacturing, and service groups. As task force that developed the requisite systems the program gained momentum, other disciplines architecture :lnd designs a~itlprocluced the above such as marketing and sales joined and began to vision and hence the Alpha AXP program. Digital's report on their own progress. These forums helped Executive Committee approved the Alpha AXP pro- generate belief and solidify enrollment. They also gram in October 1989.1° helped tlie Program Office identify problem areas before they became crises. First Cusp: Executive Challenge to Accelerate Schedule First Cusp Result By the end of 1989, Digital had completed the We established a program-wide understanding of advanced developments and signed off on the archi- tlie importance of volume deliveries in 1992. tecture and primary design documents. In a major review during January 1990, upper management Commitment-Delegation Stage: challenged the program to improve the schedules Delegating and Eliciting Commitment to capture the market window for this new tech- With the key project managers sharing a common nology. The project managers understood the vision, the next step was to establish a work plan rationale for this demand but could see no way to and to ensure that eacli group committed to deliver meet the aggressive schedule. The result was a loss on its parts. of rapport between management ant1 the technical It had been 15 years since Digital attempted to staff, with coninieots such as "Don't talk to me change siniultaneously its architecture, hardware, about crazy schedules" and "This is just going to be operating systems, compilers, and other layered a lot of hard work.'' products. Since the introduction of the VA;Y systems The Progr;lm Office viewed this cusp as an in tlie fall of 1077, each component had progressed opportunity rather than the crisis tlut it appeared relatively intleyendent of the development sched- to be. The office enrollecl key project managers in ules of tlie others. Fewer than half a dozen project the overall vision, i.e., in the business value of a team members had participated in the VA)( design. timely execution. For some projects, it was suffi- For most participants, the system had always been cient to focus 011 the classic "time-to-market." in existence, and hence the developer of each sub- However, for many, the ship date proved an insuffi- system could invoke and depend on tlie existence cient motivator. Therefore the Program Office of all tlie othcr s~~bsystems. frariied the vision clifferently, as follows. A program The need for tlie sirnult;~neousdevelopment of becomes profitable when it reacl~esbreak-even multiple hardware and software systems cornpli- (it., cumul;~tivcrevenues meet ant1 then exceed catecl the coordination task. The Program Office cumulative expenses). addressed this complex coortlination in two climen- The time taken to achieve this point is known as sions: technical and project management. In the the 'ctime-to-profit."'lThe Program Office estitliatecl teclmical dimension, the office formed a team of tliat the program's schetlule would affect Digital's technical leaders from the core engineering groups, revenue at the rate of $1 million per hour. That is, known as the EJST, shown in Figure 4. (EJST is an for each hour tliat the project could improve acronym for EVA)( Joint Systems Team. EVXi was an (lower) the time-to-profit, Digital would achieve an early name for the Alpha AXP program. An earlier additional $1 million of revenue. 'The Program forum, the EVAX Technical Team, merged into tlie

Digital TecbaicrrlJortrt~nl Vol. 4 No. 4 Special Isstre 1992 197 Alpha AXP Program [Management

I ENGINEERING I I I I -

I 1L I PROJECT I I MANAGERS I I ------A 11 - -

ALPHA PROGRAM OFFICE

E,ISI' process over time.) This group met weelily to establisl~a program-wide m~orluting engineering tlevelopment groups to because I don't know what everyone else is doing 11i;llte commitments and to be account;tble for and m~lienthey will I>e clone with their piece." At cleliverables. This team mlas known as the ASl'M this time, we hat1 already established the cross- (Alph;~t\XP System Project M;~n;tgers).Given tlie functional hSl'k1 team of experiencetl project man- magnitude of the overall task ant1 the complexity of agers representing most of tlie development full!. untlel-standing the interdepentlencies, tlie groups. This team was unable to develop the com- project m;rn;lgers tended to view tlie l>roject :IS ponent plans bec;~i~sethey 1;lcked a master plan. i~n~x~ssibl!~complex. At the progr;lni level, the clial- lenge then became to establish the Alpha AXP mas- Choosi~zgthe Strategy ter 1>1;11i. A master plan, however. evolvetl mi~ch The engineering tlevelop~nent group lnanagers more slowly than expectetl. met in a full-clay meeting to establish the over- all paranietersof the Alpha AXP program's plan. Second Cusp: Cannot Choose First, they est;tblished the business goals and exam- the Order of the Work ined the various technical constraints. The group I\l;~n;~genient'sinability to provide an overall tested the inclusion of each component with plan induced a crisis of disbelief. Tlie project the question "Is this coml>onentcritical to the over- ni;lnagers threatened to revolt (or move to other all business success of the Alpha A)(P program?" projects). The technical leatlers were generating This process established solid reasons for the ever-larger design documents. The engineering contents of tlie mastcr pl;~nand kept the respon- development group managers tleclaretl that tlie sibility for the inclusion or exclusion of a compo- I'rogram Office had a crisis on its Iiantls: \Ye h:lcl to nent with the responsible clevelopment group. The

1'01. 4 \.o. .\/)cciol lss~re1992 Digital Techrricnl Jorrrrrnl irilt ~Mc~nnge~neizt,~\,l~iizaging the Alpha RYP Progr~~iiz group then tleterminetl the org;~nizationalimpli- the overall system's value. A key prerequisite to this cations of such a work plan. Members of the group conversation was to establish a fi~ll-timeproject balanced the climensions of bus~ness,technolog): manager for each component group. who became and organization to establish the priorities and the coordination point and who was held account- work order. We institutionalized this group into the able for each tleliver~ble. Alpha AXP System Board of Directors (ASBOD). Decide What to Do before How to Do It Representing the Plan The Program Office found that each group went \With the major program priorities and constraints through a disbelief process similar to the one seen established, the Alpha UPprogram manager then earlier for tlie program. The program manager set off to establish the master plan. For all groups to urged each group to first focus on the "what" of see their contributions, he held the master plan to a their de.liverable, before trying to decide the "how." single page. He established the content during an The program manager ensured that tlie group intense periotl in which lie asked contributors to groundecl its overall estimates in reaJity For exam- describe their assumptions :ind tasks and to show ple, a software group might count the number of where on the overall plan their pieces would fall. modules to port and estimate tlie person-days per The single-page format forcetl the management module. This kind of high-level, clil;intifiable esti- team to keep the plan to a high-level view ant1 mate allowetl the project manager to make an over- allowecl contributors to see their pieces without all estimate without needing to understand the aclding the complexity of their own group's tletajls. order of the specific tasks. Furthcr, in review meetings it w:rs easy for everyone in the room to view the same picture so that the Third Cusp:Need for Project results coultl be seen, debated, ;ind agreed upon. Mnnngenzent Expertise Once the n1;lnagement team h;~tloutlinetl the Members of several of the larger projects deter- plan, it was recommentled by tlie project managers mined that they tlid not have sufficient project man- (ASPM) and ;~pprovedby the engineering clevelop- agement experience. Previously, this realization ment group n1;inagers (ASROI)). Tlii~steam mern- woultl have resulted in replanning to move out bers knew their goals woultl not change without the target schedule, perhaps repeatedly. Instead, clearly statetl reasons. Further, others could start given the group's commitment to the larger result, building their pl;ins based on ;I consistent set of we founcl ;I much more aggressive behavior. For ;issuniptions. The resulting single page also becanie example, the OpenVMS AXI' group publicly com- a reference, which we called the "straw Iiorse," to mitted to their target schetlule and stated, "We establish ancl reinforce constancy of purpose. tlon't know how to achieve this, but we commit to Figure 5 is an example of the Straw Horse Plan. (We finding a way." The next day they went to a project later i~pgradedthe name to be the "tin horse" to management consultant for training on how to connote the increasing degree of solidity of the build an aggressive, attainable schedule. This con- underlying plans ancl commitments.) sultant conducted the seminar nuny times through- out the project for various groups.I2 Second Cusp Result We agreed on the overall single-page plan upon Third Cusp Result which teams coi~ldbuild their own plans. Groups introtluced education ant1 rigor into project management. Enroll~nentand Delegation: Value of Each Contribution Inspection-Support Stage: With the master plan outlined (the straw horse Inspection with Supportive Feedbncb reviewed ancl approved), the next step was to One of our vice presidents in the e;lrly 1980s hat1 ;In obtain the commitment of each contributing aphorism: M)u get what you inspect, not what you group. To :iclclress continuing skepticism about tlie expect. I11 other words, a conillion failing is that necessity of e;~cIicomponent and its schedule, the managers obtain someone's promise and expect program manager walked each group through the that the results will be what they expectecl. overall program and the economic value of its Unfortunately, tlespite everyone's best intentions, urgency. The group was then askecl to contribute to circumstances and unexpected requests can easily

Digital Tecb~ricnlJournnl Vi)i J At). 4 .S/?CC~LI/ISSIIL' 100 100 Alpha AXP Program Management

Stmw Horse Plan AUg 1990 stasxe 1: Zgchnical DeveZopWst system PO- & c perEamsmce systmt Siwle hm%am platfon Wlish oaly

Forbwn. C, BUes, Assembler, Debng, License %ant Facilb, CasE tools {TPUt COde aagnt WS't%rB. %St=, PesPa- Cbde Aoal?@er, laagmge Sensitive -tar, Digital Test Mgr) , kqpmnd bmmmL Archit%cture, met MeIv task-to-task. DEWincWws client iTiq tAT Some

Stage 2 : -ial De~e1qarentSystem setcod hardwan? platfonnt fnI:~f~ticKla;iwrsians follow 3 m3nths later

-, P-, C++r m, CRDIRetjository, m, threads-rtl, RFC, m,ImC's. Perma System, DBCforms, File cad#, vhxset, Diatr semm (am=,tbzr file, queuing), mate System kr@ager, -u-IN-1 base, CWt mite, fV end rmde and ~ix?ms, m/IP. PATmmw, t81Ptnaster, ABSS Bxrmsims

Sbge 3: Tscbniebl %ka%t Sptatcem open &HC~; -ic Multi-&-ins

WSP, &/I, user-qrf ttsn &rivers, K1SW, disk shadming. neais, ABk4S, DAS or e(iuivalent, full NAS, aaQlet V end node, x.25 access, ALLIM-I fully supported

Stage c: Camercid *stem A&h Clrtstefsr Btematiwal mmsiorus reled 5imuJtanesusLy

New &ttch?Print, $11 Bystem Integrated Prcducts, DECnef V rwtfng node, SWb access

stage 5: 'lmnwetim Wtem

Trmsuction Mbnitctr, thr@d%

Fig~ire5 Tbe Sirigle#uge Pl~lrz:All Extr~~ctfr-onz the Straz~fHorse Ylun divert the promiser away from fulfilling the program ant1 shared our sense of scheclule urgency. promise. Thus, managers learn to inspect regul;~rly Suddenly, we were shocl

200 Wl 4 No.4 Specirrl lsszte 1992 Digilcrl Tecktricnl Jorrrrrtrl Evzrolbnent Management, Managing the Alpha AXP Program reviews had been imposetl by line management ancl milesto~iesis at the top, so one can readily see any tended to encourage the reviewees to cover up repetitive slips. The emphasis is on critical path issues until it was too late to recover. In contrast, events completed last month and those coming up the program manager, operating under the Program next montll. At the bottom are listed those issues Office model, had no line authority ant1 set up the that have been resolved and issues being opened, monthly operational inspections in an open and with clearly inclicated ownership and due dates. supportive environment. The presenters were the designated project managers from each develop- Operational Excellence ment group. The Program Office encouraged all To ensure that every project implemented the presenters to bring in their risks and problems strategies, the Program Office established the prin- before it was too late to address them effectively. ciple of operational excellence across tlle Alpha We used the single-page format again, as shown in AXP program. The office consistently recognized Figure 6. Note that the simple, visual history of all teams that accomplished their results on time and

em: ALPnh/W DATE: April 8, 1992

SCHEDULE: 1 Qd 1991 1 Q1 1992 1 42 1992 1 Q3 1992 1 Q4 1992 lOct Nov =!Jan Feb MarlApr My JunlJul Aug SeplOct Nw Dec [---[---[---[---[---[---[---[---[---[---(---[---[---[---[--- 845 1 E I Sep 91 B 45 6 I E 1Nav 91 B 45 6 I E l Jan 92 B 45 61 U E IWr 92 B 45 61 U E l Apr 92

Milestones B -- Base Level 38 (Editor, debugger, TIE, base DECnet) -- mNE 4 -- BaseLevel 4 (More DECnet, utilities, andiMclients) -- DIXSE: 5 -- Base Level 5 (EV4 support, TFF, performance) -- EONE 6 -- 5ase We1 6 (Performance & Tapes) -- DQNE I -- Internal field test & Pilot Porting Activity - FiT U -- Internal field test update - FT2 E - External field test & Early Support Program - FT3 S -- V1.0 suhnit to SSB

CRITICAL PATH EVENTS PAST MCNI'H: Shipped BL6 on March 12 - stable on ADU, Ruby, Cobra, Flamingo Shipped BL6 AEa7 porting tmlkit Achieved FT1 (PPA) rode freeze Received 2 Flamingo systems in Varese, Italy, for WSIX development With SPE (CSSE), delivered worldwide field test support training FT1 stabilization continuing

ACTIVITIES ALONG TKE CRITICAL PATH (NEXT W3QE-I) : Ship FT1; revised target is Ppr 10 Ship ET1 AM2 porting toolkit Complete PPA R-diness Rwiew Begin FI2 stabilization

ISSUES / DEPENDENCIES RESOLED: Flamingo SFB graphics support formally accelerated into Vl. 0

ISSUES / DEPENDENCIES WT RESOLVED: GDI BL24 compilers needed for ESP integration: D.L., May 15 Rollout support staffing is not XI anyone's plan: J.S., May 29

Figure 6 The Single-Page Revim

Digital Techrricrcl Jouvnal Vol. 4 No. 4 Specla/ Issire 1992 201 Alpha AXP Program Management

predictably. We ;~lsoused the monthly program- acknowletlgment of their work. This contrasts with wide inspections to maintain a published record of the conventional management wisdom that it is progress. Thus. each project was encouraged to necessiiry to give Ereqi~entmonetary rewarcls to excel operationally ant1 to learn froni the esj~eri- motiv;~telxo],le. Although everyone appreciates ences ancl presentations of the others. tlie fin;lncial rewi~rds,tlie biggest motivator is tlie professional recognition that the contributor tlid a Fo ~wtbCusp Result good and necessary job! The second benefit of the acknowledgment was Tlie Progr:t~ii Office established monthly inspec- its effect in creating a sense of momentum throitgll- tions using a consistent single-page document to out all the project teams. Repeateclly, peer man- record pertinent information. agers would comment that the Alpha ASP team was ni:~kingsignificant progress. This in turn gave us a Acknowledgment-Learning Stage: sense of ;~ccornplishnient;IS well. So the program Building Momentum re;~lized;I double benefit from the original :rckno\vl- Developing tlie vision ancl pli111resul~ecl in :I Ken- etlg~nentand ;I h~rtlierslingshot effect with recog- era1 sense of eupliori;~.SliortljJ ;~t'ter\vards, the real- nition coming back to the Program Office. it!. of the work ahead clescentled like a cloutl of despair. At this point, the primary challenge Fzytk! Cusp Result was to start bi~iltlingmomentum in the program. Program-wicle. managers establishetl the norm of In the Enrollment Management Nloclel, building fretli~entacknowletlgment of progress. momentum-the ackno\vleclg~i~ent-Imi~igstage- As tlie Alpha ASP progr;1m made further prog- is tigl~tl!~intertniinecl with the inspection st;~gc.; ress, the 1'rogr;ini Office ;~ctivelysolicited third- that is, events reportecl during inspections wcrc p;1rtjr;~nd c~~sto~iier invol\iement. 'T'his invol'clenient i~seclto built1 momentum, The Program Office rein- pro\.icletl good feedback on progress and liatl the forcctl the vision 21ncl usecl niomentiim b~~iltlingto effect of reinforcing the fact that the program was mini~llizetlie time period during which tlie team 011 track to meet customer neetls. 111 some cases, the felt despair about the work ahc~d. project teams :rltered the Alpha ASP pl:fns to better help our customers atlclress their business needs. Fzytb CLLS~:Despair This further contributed to the credibility ;inel Since tlie overall progr;1m h;itl such 21 formicl;~ble monientuni of tlie progr;irn as well as the sense of goal, man!, of' the contributing teams 11ec:lnie acconiplishment. stalled with tlie m;~gnitucleof the task aheacl of tlieni. 'This manifested itself in tlie form of com- Sixth CZLS~!:Broken Debe~zde~zcies ments about the large amount of work. tlie result- and Replanrzi~~g ing potenti;il h)r scheclule delays, :~nda fe:~rof Like :~nyproject, not every ;wsumption ant1 tlepen- overtime clemands. This syndrome is common in tlency proves to be correct or totally accurate. At any large project, especi;illy when commitments one point, one of the major Alpha ASP hardwcire ;Ire matte that involve taking large risks. The systems slipped its scheclule for delivery of proto- appro;lch the program team took \V;IS to start recog- types to softw:~re.After considering a number of nizing ei~clielement of progress. As we elistributeel altern;itives, the oper;~tingsystem group proposed ;Innouncements of progress widely (using I>igit;~l's ;in ;iltern;~tepl:in using ii different hardware system worltlwicle electronic 11i;lil network), we beg;~nto :~ncla cIi;~iigedorder of events. They s:iitl in their I)i~iltImomentum around the Alpha AXP progralii. management presentation at the time. "The ques- Other groups picket1 up on this momentum and tion is not one of blame. Insteatl our goal is to pre- contributed to it themselves. 'Tl~iseffect cascaded serve the ultimate schedule goal of the program, throughout the entire program-more groups per- specific;~llyits volume availability date." ceived their tasks alie:~das achievable; r;~pitllye;~cli group wanted its own progress acknowledgetl; :~nd momentum increasetl. Sixth Cusp Result The I'rogra~ii Office founcl tli;~tthe members of Program-wide, teani tnembers established the prin- a project. :~ppreciateclancl were niotivated by tlie ciple of focusing on tlie desiretl state of time-to- simple "tli;~~ik~OLI" rel~resentetl by tlie pi~blic profit rather than on bl:~niingothers for hili~res.

202 Ihl. 4 ,Vo. 4 .S/~eric~llssi~c 199 Digilal Techrricnl Jorrrrrnl EfzrollrnentManagement, n'lunci,qing the Albba AXP Yrogrc~nz

At another point, one group was at risk because it turing's readiness to build systems. The Alph;~ASP neetled a critical skill Ibr ;I week. A (historically) Program Office broadeneel this process and asked competing hardware group responded by asking for a program-wide readiness review to identify what sort of resource, and then freely supplied the the "showstopper" risks. As a result, the Program resource despite its own very tight schetlule. In the Office celitralized tlie allocation process so that we past, these groups woi~ltlcompete for the same could maintain tlie prototype alloc;itions in real resource without coll;iborating for the common time. The result was to reestablish sufficient soft- good. ware test time and maintain momentum with mini- mal program impact. Seventh Cusp: Incomplete Asst~~nptions and the Need for the Performance Team Eighth Cusp Resz~lt Less than half way through the Alpha AXP prowam, The program teams decided that prototypes would the program team realizetl that some projects' be delivered based on program priorities. not solely assumptions were incomplete. RISC systems are on existing plans. notorious for recluiring careful design and tuning to meet aggressive performance goals. Evidence from Ninth Cusp: Need for Quality Metric.$ a related program at Digital suggested that some Each goup in the Alpha program adopted very of our system performance homework was weak. high standards for the cluality of its work. The ni;ln- Tlie Program Office quietly asked the appropriate agelnent team repeatedly found reinforcement teams to assign some resources to measure key of Phil Crosby's tlictum: "Quality is free."'{Ilesults components ant1 subsystetns of the design. This in group after group showed that early ;~nclcon- confirmetl the program team's concerns that the tinuous attention to quality resulted in held or curretit plans were incomplete. Quickly, we pulled improved schetlules. together ;I cross-disciplinary task force to analyze However, the program team noticecl tIi;~t we the infortnation rigorously and to liiake recomnien- were not inspecting and rnei~suring progress in dations. These analyses resultecl in changes in the quality at the total systems level; customers care architecturc, the chip design, tlie systems designs, about only the quality of the tot;~lresult. As tlie ancl the softw;lre. 'The changes liave proved to projects started integrating into ;I tot;~ls),stelii, the increase performance substanti;~ll)l. Program Office establishetl ;~nindepentlent group to measure overall quality levels. 'fhe classic reac- Seventh Cusp Result tion to such independently derived quality metrics The program e\t;~blialiedn performance team to is that they are meaningless. Instead, since the change tllc de51gnand pl'ins ~snecded program established tlie metrics ;it tlie niollielit when everyone saw the need, the re;iction h;~s been to focus on the total system's cluality without Eighth Cz~sp:Prototype Allocation Process dropping attention on tlie individu;~lcomponent As manufilcturing st;irted to deliver prototypes, tlie metrics. Program Office fount1 that the early manufiicturing built1 rate was lower tIi;in planned. This was the Ninth Cusp Result result of normal st;lrt-up problems. At the same time, initial demand had increased substantially. The program formalized system-wide qu;rlity Nevertheless, the project ;~clministratorscontinued metrics. to ship the systems to engineering and ;ipplications groups in the original orcler. If this had continued, Results and Lessons Learned dependent software woultl have been delivered Digital met exactly tlie program's overall schedule progressively later because of inadequate testing to the month (i.e., date for high-volume shipments), cyclcs. Our impact analysis indicated that the Alpha despite numerous setbacks ;ilong the W;I~Tlie ASI' volume ;~v;~ilabilitywould slip by three Alpha AXP systeni is meeting the original per- months. formance goals, and quality is excellent. Digital's The review team highlighted this problem in an Board of Directors has approved the fill1 Alpha AXP early program readiness review. Traditionally, program business plan and the investments neces- Digital uses readiness reviews to establish manufac- sary to capitalize on the Alpha AXI' family's e:lrly

Digital Technical Joirnral Vol. 4 /Vo. 4 .'il,ecirrl lssrrr 1992 203 Alpha AXP Program management successes. Initial reactions from customers have erables despite having the most complex tasks. For been favorable. Third parties have committetl example, the OpenVMS AXP system group not only Alpha AXP support for their products in record met its original schedule but also accelerated num- numbers. erous tieliverables into earlier base levels or releases. Figure 7 shows the OpenVMS schetlule and actual What Worked Well tlates of availability; footnotes indicate functional The Program Office in conjunction with the ;~ccelerations.The networks group tlelivered DECnet Enrollment Management Model has worked well. If on the AXP system an entire base level early. The the management team had followed traditional clatabase systems group accelerated its schedule by approaches, we would still be getting organized. several months and tlemonstrated products four Using the model, each group has been able to bring months early at Digital's DECWRLD '92 trade show. its fill( capabilities to bear as problems have arisen. Clearly one of the major lessons was to establish ?'he project teams have accepted the introcluction a constancy of purpose and holcl to it while contin- of multiple levels of inspection, ancl other programs iially learning and updating the detailed plans. The within Digital are copying this aspect of the model. single-page representation of the goals and master Further, the notion of ~isingproject cusps creatively plan is a key element in maintaining this constancy. has been an effective tool to huiltl momentum. Finally, a common schedule and inspection disci- What We Would Do Differently pline allowecl the schedule to become an opportu- Loolting back, we would have approached the nitjr to reinforce ;I shared vision. This positive view program differently in two areas. First, project contrasts with the conventional interpretation of teams would have benefited from broader early a schetlule as a burden. education on project methodology. Several projects As a result, most groups met very aggressive goals had significant slips, causing undue hardship on on schedule. Several groups accelerated their cleliv- other groups. The Program Office should have

ALPHA/VMS SCHEDULE RESULTS

MILESTONE ORIGINAL ACTUAL

Phase 0 closure Rug 30, 1990 Aug 30 Alpha VMS minimal lagin Jun 17, 1991 Mar 20 BL1 ship - minimal login Jul 15, 1991 May 31 BL2 ship - RTLS, DW (1 ) & LAT Aug 26, 1991 Jul 12 BL3A ship - ISAM, linker n/a Aug 23 BL3B ship - prog devel L TIE (2), DECnet (3) Oct 7, 1991 Oct 10 EL4 ship Nov 18, 1991 Nov 15 BL5 ship - functionally complete(4) Dec 30, 1991 Jan 10 BL6 ship - Ruby complete(5) Feb 21, 1992 Mar 6

FTIJPPA Apr 3, 1992 Apr 10 Phase 1 May 1992 May 20 FT2/PPA n/a 1992 May 22 FT3/ESP (6) Jul 2, 1992 Jul 8 FT4/ESP n/a 1992 Aug 14 V1.0 SSB submission (LRS) Oct 2, 1992 Oct 26 Notes:

(1) DECwindaws (2) Translated Image Environment (RTL far translated images) (3) DECnet accelerated from BL4 to BL3B (4) Full graphics support accelerated from next version to V1.0 (5) Support for multiple hardware platforms accelerated from next version to V1.O (6) FDDI support accelerated from next version to V1.O

Figure 7 Original 0penVM.S wiles stone and Deliuery Dates

204 Vol. 4 iVo. .I 5plic-iollsst~e 1992 Digital Technical Jorrrlznl Enrolbnent ~Vunugelnent,Managing the Alpha AXP Progrum introduced Ron LaFleur's project methodology References and Note sooner and pervasively. Instead, we waited until 1. R. Waterman, T. Peters, and J. Phillips, "Struc- each group saw the need and then tried to intro- ture is Not Organization," B~lsirlessHorizor~s. duce it. For groups such as the OpenVMS AXP no. 80302 (June 1980). system group, the methodology was introduced early. However, other groups needed (and still 2. C. Savage, Fiftlg Generation Manngernent need) this discipline. (Burlington, MA: Digital Press, 1990). Second, the office would have conducted more 3. W. Oncken and D. Wass. "Management Time: pervasive project inspections. Several groups were Who's got the nionke):" Harrwrd Busirless very late in protlucing schedules and plans that the Reuiew, vol. 18, no. 6 (November 1974): 75-79. Program Office could understand. The office was unable to obtain their cooperation to hold detailed 4. M. McMaster ant1 J. Grintlel-, PRECISIOIV: and frequent inspections. Eventually, the Program A Neci! A~~I.OGIC~to Co~?zlnu?zicntion (Bonny Office was invited to set up and participate in Doon, CA: Precision Models, 1980). appropriate inspections of schedule, process, etc. 5. J. Gleick, CHAOS: Making a New Science However, we should have insistetl on these much (New York: Penguin Books, 1987). sooner. 6. F? Senge, The F1ytb7 Discipline: The Art ana Practice oftbe Learning Orflrlization (New York: Doubleday, 1990). The Alpha AXP program is the most con~plexpro- gram in Digital's history and h;w been delivered on 7 A. Scherr, "Managing for Breakthroughs in schetlulc with high quality. The Alpha AXI' Program Protluctivity" Hzirnan Resolirce ~VIar7~1ge- Office used a rigorous management me tho do log)^ ~nerzt,vol. 28, no. 3 (Fal I 1989): 403-424. to build the program-level teamwork necessary to 8. L. Tesler, "Networked Computing in the accomplish this breakthrough. The office proved 1990s." Scientific A rnericarr (Sep tem ber the effectiveness of the Enrollment Management 1991): 86-93, Model: vision-enrollment, commitment-delega- 9. The five generations of computing are as fol- tion, inspection-support, and acknowledgnlent- lows: I950s, standalone; 1960s, batch; 1970s, learning. Integral to this motlel and empowering to timesharing; 1980s, personal: 1990s, mobile the team is to take each cusp head-on and to use distributed. them to increase momentum. The management team has been learning as the program progressed 10. R. Comerford, "How UEc: Developed Alph;~," and has identified areas needing strengthening for 1E.M Sl,ectr~~~rz(July 1992): 26-31 future programs. 11. C. House ant1 R. Price, "The Return Map: Tracking Product Teams," Harl~ardBcisir~ess Acknowledgments Rciliezu, vol. 69, no. 1 (January 1991): 92-100. The author thanks the following senior managers 12. R. LaFleur, "A Seminar in Project manage- for tlemonstrating the importance of good manage- ment" (Scituate, MA: Project Management ment: for architect~~reand a clear strat- Assist;rnce Co., 1990). egy; Ken Olsen for demanding simple, single-page plans; Jeff Kalb for operational excellence; David 13. F? Crosby, Qualit])Is Free: The Art of rMaking Stone for the model of focusing on the tlesiretl Q~iulitj~Certoitz (New York: McGrnw-Hill, state; Bob Supnik for the paradign~of the Program 1979). Office. The author also thanks key members of the Alpha General References hXP I'rogram Off ce for their contributions in man- E Brooks, The 1I4ytl?icaI ~Ma1z-irzorzth:ESSUJJS OJZ aging the program and developing the Enrollment Softulal-e E~zgineerilzg (Reading. MA: Addison- Management Methodology: Al Avery for systems Wesley, 1975). integration and significant help preparing this paper; Scott Gordon for competitive benchmark- R. Neustadt and E. Ma): Thinking In Time: The uses ing; Ellen Salisbury for planning; and Ken Schultz of history for decision makers (New York: 'Tl1e for operations and inspection. Free Press, 1986).

Digital Technical Journal Vol 4 No. 4 Special Issue 1392 205 I Fwther Readings

Con~poundDocument Architecture The Digital Technicill Journal Vol. 5 No. I, Wi~ztel-1990, EY-~196~-DP pu6lishespnpers that explore the tccl?izologicalfoz11zdatiot7s Distributed Systems of Digit~il'srnajor.prodrrcts. Ec~cl? Vol. I, No. 9,June 1989, EY-C17c)E-DP Joumalfoc~rses012 at le~zstone prod~lcfor*ecl andpr.e.serits ci Storage Technology co~?zpi/~itio~zofp~if?e)=~ toritten Vol. I, No. 8, F~DI.uLI~~1989, EY-(: 166~-DP 031tl~e crr,qilzeers rishode~ ~elobecl th7eptv)dr~ct.The collteut for CVAX-based Systems thcJourn;~lis selected Ojl the VO~.I, NO. 7, A~~gust1988, EY-6742~-~~ Jourr?c!l Advisory Board. Digitcil engineers upkorvozrld Software Productivity Tools like to coiztrib~rteci paper Vol. I, No. 6, Febrzrarj~1988, EY-&!59E-DP to the Journal should co~ztact the editor- atKDVAX::RM KE. VAXcluster Systems Vol. I,i\b. 5, Sef!tenzDer 1987, EY-8258E-1)P Topics coverecl in previous issues of the Iligitnl TechnicalJoun7ul are as follows: VAX 8800 Family Vol. 1, Are. 4, Februury 1987, EY-6711E-DI-' NVAX-microprocessor VAX Systems I/i,l. 4, No. -3, S~lr?l~?zer192, EY~1884E-l)I' Networking Products Vol. I,/\To. 3, Se/!te~nDer1986; ~~-6715~-~1' Semiconductor Technologies I/i)l.4, No. 2,Sprilzg 1992, ~~-~521~-1)1' MicroVAX 11 System Vol. I, No. 2, ~Wclrch1986, EY-.347413-DP PATHWORKS: PC Integration Software kl. 4, No. I, Wintel- 1992, EYd825E-111' VAX 8600 Processor fil. I,No. I, ALI~IIS~1985, EY-~~~~E-IIP Image Processing, Video Terminals, and Printer Technologies Subscriptions to the Digitul Tech~zicalJo~lrrzlrlare Cbl. .?, IVO. 4, FLIII 1991, EY-H889E-DP avail;tble on a prep:~idbasis. The subscription I-ate is $40.00 for four issues ant1 $75.00 for eight issues. Availability in VAXcluster Systems/ Orders should be sent to Cathy Phillips, Digital Network Performance and Adapters Equipment Corpor;~tion.~~01-.3//868, 146 1M:lin Vol. 3, No. -3, Srr~?~nze~.1991, EY-H890E-111' Street, M;rynartl, MA 01754-2571, U.S.A.. 'Teleplio~~e: (508) 493-2894, FAX: (508) 493-0637 Inquiries Fiber Distributed Data Interface can be sent electronically to [email protected].<:OM. I/r)l. 3, iVo. 2. S/lrirzg 1991, EY-H~~~E-I)~' Subscriptions must be paid in 1l.S. dollars, ilntl checks shoultl be made payable to Digital Transaction Processing, Databases, Equipment Corporation. and Fault-tolerant Systems Single copies and past issues of the Digital Val. -3, No. I, Winter IYgI, EY-F588E-Dl' TechtlicalJotrrnal are available for $16.00 eac1-1 from Digital Press, Department EER, 1 Burlington \'AX 9000 Series Woods Drive, Uurlington, MA 01830-4597 Single Val. 2, NO. 4, Ihl1 1990, EY-E~~~E-DP issues can also be orclered by c;~ltingDE<;direct at 1-80O-l11GITAI. (1-800-344-4825). DECwindows Program fi/ 2, ivo .j, Siir?11?ze1.1990, EY-E~~~c-I)~'

VAX 6000 Model 400 System L'ol. 2,No. 2. .S/)l'ing 1390, T:Y-ClC)7E-I)1'

206 Vol. 4 /\'o. 4 Specir~lls.strr 1992 Digilal TechrricrtlJourrrd I Recent Digital US Patents

Thefollou~i~zgpatentswere recently issued to Digital Equi)rneizt Corpomtion Titles and nallzcs s~~f~pliecl to us @ the US Potent and Tr~~~ie~rzc~r.kOflice c11.e re/!~~ocl~~cccleLJraclIy as tl7ej' af~f~ear or^ tlge OI*I~IIZL~/ plrblishecl /~ate?it

D327,261 K. L. Korellis and R. T. Faranda Front Face Panel Portion for Enclosure Doors for a Computer D327,477 K. L. Korellis Front Panel for an Integrated Storage Assembly for Computer Storage Units 5.092,631 M. G. M.Masnik ant1 Safety Enclosure for Gas Line Fittings R. C. Martel 5,093,628 1. T. Chan Current-Pulse Integrating Circuit and Phase-Locked Loop 5,093,775 W. R. Grunclmann, R. E ~MicrococleControl System for Iligital Dat~Processing System Boucher, and T. FOSSLI~II Method for I-'rovicling ;I Metill-Scmicondi~ctorContrict 5,095,441 D. E Hopper, E. G. Fortmiller, Rule Inference and Localization during Synthesis of Logic S. Kundu, and D. E Wall Circuit Designs Rotating Priority Encoder Operating by Selectively Masking Inpi~tSignals to a Fixecl Priority Encoder 5,095,471 M. D. Siclman Velocity Estimator in a Disk Drive Positioning Sysretn 5,095,613 K. R. Hussinger and Thin Film Head Slider Fabrication Process M. L. Mall;~ry 5,097,370 Y. Hsia Subambient Pressure Air Bearing Slicler for Disk Drive 5,037,387 J. L. Griffith Circuit Chip Package Employing Low Melting Point Solder for Heat Transfer 5,007,411 I? L. Doyle, J. I? Ellenberger, Graphics Workstation for Creating Graphics Data Structure E. 0.Jones, D. C. Carver, Which Are Stored Retrieved and Displayed by a Graphics S. D. Dipirro, B. J. Gerovac, Subsystem for Competing Programs W l? Armstrong, E. S. Gibson, R. E. Shapiro, K. C. Ri~shforth; and Vi! C. Ro;lch 5,097,436 1. H. Zurawski High Performance Atlder IJsing Carry Prediction 5.097,468 E. Earlie Testing Asynclironous Processes 5,099,367 M.D. Sidman Methocl of Autorn;itic Gain Control Basis Selection and Methocl of Half-Track Servoing multiple Bit Error Detection and Correction System Employing a Modifietl Reed-Solomon Code 1ncorpor:lting Adclress I-';lrit!z and Cat;~strophicFailure Detection 5,099,485 W F. 13ruckert, T. D. Bissett, Fault Tolerant Computer Systems with Rii~ltIsolation D. Mazur, J. Munzer, F. Bernaby, and Repair ant1 J. I-I. Bhatia 5,099,517 A. Gupta, W R. Hawe, Frame Status Encoding for Comrnunic;~tionNetworks IM. E Kempf, ant1 C.S. Lee OOE. E. Cox, Jr. and M. I? Rolla Resonant Technique and Apparatus for TIierrn;~lCap;icitor Screening 5,101,362 E. Sirnouclis Modul;~rBlackboartl-Based Expert System 5.101.402 D. Chiu and R. Sudama Apparati~sand Method for Realtime Monitoring of Network Sessions in a Loci11 Area Network 5,101,485 F. I... Pe~azzoli,Jr. Virtual k1emo1-y Page Table Paging Appar;~tusand Method 5,101,493 R. L. Travis and W R. Laurune Digital Computer Using Data Structure Including External Reference Arrangement 5,103,352 W Y. Moon and R. Y. Noguclii Phased Series Tunecl Equalizer

Digital Techrrical Journal Vol. 4 IVO -1 .Tf~ectrri13 \[re I992 207 J.I! Harris. I>. 1-eibholz, Methocl of I>ynamically Allocating Processors in a Massi\rely ;ind B. Miller Processing System M. M;lllary Rletliod of Making a M;~gtieticRecording IJead W C.Moone): J. R. Santantlreu, Tunnelled Millticonductor Systeni ancl ~Metliocl ancl K. Kshonze K. 0.Reck~iian System for Displ;~yingVitleo from a Plurality of Sources or, a l>isplay E. I,. Steltzer Transverse Positioner for ILadfiVrite Head N. K. S. Lee, J. W Howard. Optical Heat1 with Flying Lens I? K. Tan, ant1 W Hsjrtsay D. A. Bailey Cooling System for Computers W R. Grundmann, V K. Ha): Self Tiniecl llegister File Having Bit Storage Cells with 1.. 0.Herman, ;inel Emitter-Coul>led Output Selectors for Common Bits Sharing 1). &I. Litwi~ietz :I Common Pull-Up Resistor ;~nda Common Current Sink C.M. Riggle, L. Weng, High Blndwidth Reed-Solomon Encoding, Decoding :lnd Error and I? N. Hui <:orrecting Circuit 1..,I. Wng ant1 H. A. Leshaty Error Tr;ipping Decoding Method ant1 App;lratus M.1~. &lallary Laminatetl l'oles for Recording Heatls M. Sitlman Continuous-PIils-EmbecldedServo l);it;l l'osition Control System for Miignetic Disk Ikvice D. 13. Fite, T. Fossum, W R. Methotl and Apparati~sIJsing ;I Source Operand List and <;rundmann, D. l? M:lnle): ;I Source Ol>cr;~nclPointcr Queue between the Execution lJnit E X. McKeen, J. E. Murra\; ;rnd the Instruction Decoding and Operand I'rocessing (.nits R. M. Salett, E. Samberg, of a Pipelinetl Ilata Processor anel I>. I? Stirling S. (1. 1);ts ;lnd M. L. NIallar)7 Three-Pole Magnetic He;ccl with Reducetl Flus Leakage I). I). Donaldson and Lookahead Bus Arbitration System with Override of R. H. Gillett, Jr. <;onditional Access Grants by Bus Cycle Extensions for Multicycle Data Transfer U. <:. Eclern, R. I? Helliwell, Data Integrity Features for a Sort Accelerator J. T. Johnston, and R. E Lary E Tjtcomb ant1 J. Cordova Hytlroclyn;lniic Bearing Q,Y A'g Methoel for Provicling ;I Lubricant Coating on the Surbce of :I Magneto-Optical Disk i1nt1 Resulting Optical Dislc J. I.. Finnerty Integrating the Logical and I-'hysical Design of Electronically Linked Objects I). H. Fite. K. C. Hetherington, Virtual Instruction Cache S!.stem [.sing Length Responsi\re M. M. McKeon, I). I? ,Manle): 1)ecoded Instruction Shifting ant1 Merging with Prefetcli and J. E. ~Murray Buffer Outputs to Fill Instruction Buffer E X. McKeen, T. Fossun~, Method ;~ndApparatus for Handling Fai~ltsof Vector I). l? Hli:~nd;trk;~cancl Instructions <>lusingMemory ivlanagement Exceptions <:.A. Wiecek M. 1). Sidman F;ILI~~Tolerilnt Frame. <;i~arclbandand Index Detection Methotls E~libecldedBurst Democli~l;~tionant1 Tracking Error (;ener;~tion W A. Samaras. I). 'T. Vnughan, Method and Apparatus for Stabilized Data Tmnsmission ;inel A. D. Ingraham 1. S. Fitch ancl W I<. Hamburgen Micro-Channel Wafer (:oolitlg Chuck S. Miller Object Identifier Generator for Distributed Conipi~terSystem

208 WJI. 4 iVo 4 Spea'rrl lss~le1992 Digital Technical Journal