Editorial Jane C. Blake, Editor Helen L. Patterson, Associate Editor Kathleen M. Stetson, Associate Editor Circulation Catherine M. Phillips, Administrator Sherry L. Gonzalez Production Terri Autieri, Production Editor Anne S. Katzeff, Typographer Peter R. Woodbury, Illustrator Advisory Board Samuel H. Fuller, Chairman Richard W Beane Donald 2.Harbert Richard J. Hollingsworth Alan G. Nemeth Jeffrey H. Rudy Stan Smits Michael C. Thurk Gayn B. Winters The Digital TecbnicalJoutnnl is published quarterly by Digital Equipment Corporation, 146 Main Street ML01-3/B68, Maynard, Massachusetts 01754-2571. Subscriptions to the Journal are $40.00 for four issues and must be prepaid in U.S. funds. University and col- lege professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Technical./ournalat the published-by address. lnquirks can also be sent electro~icallyto DTJ~CRL.DEC.COM.S~~~I~copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 1 Burlington Woods Drive. Burlington, ,MA 01830-4597 Digital employeesmay send subscription orders on the ENET to RDVAX:,JOURNAL or by interoffice mall to mallstop ML01-3/B68. Orders should include badge number, site location code, and address. All employees must adv~seof changes of address. Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address. Copyright O 1993 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved. The information in theJournal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in theJortrnnl.
Documentation Number EY-J886E-DP The following are trademarks of Digital Equipment Corporation: ACMS, ALL-IN-1,Alpha AXE the AXP logo, AXT: DEC, DEC 3000 AXE DEC 4000 AXI: DEC 6000 AXI: DEC 7000 AXp DEC 10000 AXl? DEC DBMS for OpenVMS, DEC Fortran, DEC OSWl AXE DEC Pascal, DEC RAL1.Y- DEC Rdb for OpenVMS, DECchip 21064, DECnet, DECnet for OpenVMS &XI: DECnet for OpenVMSVAX, DECnet/OSI, DECnet-VAX, DECstation, DECstation 5000, DECwindows, DECWORLD, Digital, the Digital logo, DNA, OpenVMS. OpenVMS AXI: OpenVMS RMS OpenVMS VAX, PDP-11, Q-bus, Thinwire, TURBOchannel, IJLTRM, VAX, VAX-11/780, VAX 4000, VAX 6000, VAX 7000, VW 8700, VrLv 8800, VAX 10000, VAX Fortran, VAX Pascal, VMS, and VMScluster. CMY-I is a registered trademark of Cray Research, Inc HP is a registered trademark of Hewlett-Packard Company. Cover Design IBM is a registered trademark of International Business Machines, Inc. The DECchip 21064, the fzrst implementation LSI Logic is a trademark of LSI Logic Corporation of Digitalk Alpha AXP computer architecture, Macintosh is a registered trademark of Apple Computer, Inc. is the world'sfastest single-chip micropocessolr MIPS is a trademark of MIPS Computer Systems, Inc. Represented on our couer by the RrlP logo, the Motorola is a registered trademark of Motorola, Inc. DECchip takes its phce among sywzbols of other OSWl is a registered trademark of Open Software Foundation, Inc. dewicesfrom computing history, including PAL is a registered trademark of Advanced Micro Devices, Inc. the vacuum tube, apuncb card, sketches of SPEC, SPECfp, SPECint, and SPECmark are registered trademarks of the Standard Bubbagek Analytical Engine, a wheel from the Performance Evaluation Cooperative. ~aciculke,and an abacus. SPICE is a trademark of the University of California at Berkeley The cover was aksi~nedby Deborah Fulck oJ UNIX is a registered trademark of UNM System Laboratories, Inc. Digital's Corporate Human Factors Group ,with Windows and Windows NT are trademarks of Microsoft Corporation. the help of Knzn Design. Book production was done by Quantic Communications, Inc Contents
17 Forauord Robert M. Supnik
Alpha AXP Architecture and Systems
19 Alpha AXP Architecture Richartl L. Sites 35 A 200-MHz 64-bitDual-issue CMOS Microprocessor Daniel W Dobberpuhl. Richard T. Witek. Randy Allrnon. Robert Anglin, Davitl Bertilcci. Sharon Britton, Lintla Chao, liobert A. Conrad, Daniel E. [)ever, Bruce Gieaeke, Soha M.N. H;lssoun, Gregory W. Hoeppnel; KathrynKuchler, Maureen Ladtl, Burton M. Lear): Liam Madden, EtlwardJ. McLellan, Derrick R. Meyer, James Montanaro, Donald A. Priore, Vidya Rajagop:~l;~n,Sridhar Sarnucl~~l;~, and Sribal:~nS;lnrIianam 51 The Alpha Demonstration Unit:A High-performance Multiprocessorfor Sopware and Chip Development Charles I? Thacker, David G. Conro!: ;lnd Lawrence C. Stewart 66 The Design of the DEC3000 AXP Systems, Two High-performance Workstations Todd A. I)i~tton,Daniel Eiref, Nlrgh R. K~rrth,JnmesJ.Reisert, ant1 Robin I,. Ste~vart 82 Design and Performance of lhe DEC 4000 AXP Departmental Server Computing Systems Barry A. M;~skas,Stephen E Shirron, and Nicholas A Wlrcliol 100 Technical Description of the DEC 7000 and DEC 10000 AXP Family Brian R. Allison and Catharine van Ingen 1 11 Porting OpenVMS from VAX to Alpha AXP Nancy P Kronenbrg, Thomas R. Benson, Wayne M. (hrtloza, Ravindran J;~jian~~atlian, ant1 BenjaminJ. Thomas Ill 12 1 The GEM Optimizing Compiler System Davitl S. Hlickstein, Peter W Craig, <:aroJine S. Davidson, R. Neil Riman,Jc, Kent D. <;lossop, Richartl H.Grove, Steven 0.I-lobbs, and William H. Noyce 137 Binary Translation Richard L. Sites, Anton Chernoff. Mattliew B. Kirk, M;~uriceP Marks, and Scott (;. Robinson 153 Porting Digital's Database Management Prodzrcts to the Alpha AXP Platforrn Jeffrey A. Coffler. Zia Mohatnecl, and I'eter M. Spiro 165 DECnet for OpercVMS AXP: A Case History James V. (:olombo, Pamela). Rickard, ant1 Paul Benoit 181 Using Simzrlation to Develop and Port SoJfu)are George A. 1):lrcy 111, llonaltl F, Rrcntlel; SrcphenJ. Morris, ;~ntl~Michacl V. IIcs
AIpha AXP Program Management 193 Enrollment Management, Managing the Alpha AXP Program Peter E Conklin I Editor's Introduction
req~~irements:~nd to allow :~pplicationflexibility The result of their design efforts is a microproces- sor that operates at speeds up to 200 MHz-the fastest commercially available chip in the industry Early implementations of this chip became part of a prototype system, the Alpha Demonstration Unit. As Chuck Thacker, Dave Conro): and Larry Stewart explain in their paper, the prototype servecl the overaII Alpha ,tYl' program by giving software devel- opers early access (ten months) to AXP-compliant hardware. Because of tlie architectural emphasis on Jane C. Blake multiple processors, prototype designers focused Editor on delivering a robust multiprocessing system. The ;iuthors tliscuss the significance of tlie choice of a backplane interconnect for a multiprocessor, corn- 'T'his special issue of tlie Digit~llTeclnrlical Jourf1crl pare different ;ipproaches to cache coherence, ancl presents the computer architecture that Digital describe the system modules and packaging. believes will become the universal platform for With constraints different from those of the pro- computing over the next 25 years. A signific;uit totype, the li;~rdwareprocloct projects are repre- ~liilestonein the comp;cny's history the i\lplia ASP sentetl here b), three different implementations: architecture arises out of Digital's extensive etigi- desktop, departmental, and data center systems. In neering experience antl puts into place a cohesive, the desktop area, the I>EC 3000 UPfamily of work- .I'lexible framework for high-perform:~nce 64-bit stations are balanced u~iiprocessorsyste~ns. Totlcl IlISC computing. This issue contains papers repre- Dutton, Dan Eiref, Hugh Kurth. Jim Reisert, and sentative of the scope of tlie program across Iiobin Stewart review tlie rlecision to replace the Digital's Engineering organization, including hartl- traditional common system bus with a crossbar \\?;Ire systems, an operating system, compilers, system intercontiect constructetl of ASI(:S. This new binary tr;~nslators,network ant1 database software, interconnect allorved the designers to meet the and simulators. goals of Low memory latency, high memory band- The results of the engineering efforts cliscusscd width, ;~nd~>ii~ii~nal (:I'll-I/O memory contention in in these papers reflect three primary goals fix a cost-competitive manner. tlie Alpha AXI' architecture: high performance, The I>EC 4000 AXI' system is a tlepartmental longevit!; and e;~symigration from tlie 32-bit VAX server that iniplements the IEEE Futurebus+ stan- VMS computer line. Dick Sites, one of the chief tlard. Barry Maskas, Stephen Shirron, and Nick Alpha AXl' architects, 1i;rs written a definitive p;cper Wrcliol present tlie reasoning behind the system th;lt explains how key architectural decisions were architecture antl technology decisions that resulted made relative to the gonls. He reviews the similari- in the ;~chievementof optimized uniprocessor per- ties and differences between the UP;rrcIiitecti~re form;ince, tlu;rl-l>rocessorsymmetric multiprocess- and other RIS<: ;crchitectures, ant1 then presents ing, and baJ;uiced 1/0 tlirouglipi~t.lletnils of the details of-'the design, including data and instruction subsystems that make up this expanditble modular formats. In his conclusion, he projects evolutionary system are also provitled. c1i:lnges in the ;~rchitect~lreantl the resultilig per- 'The I)EC 7000 ant1 I>E<: 10000 systems are po\ver- formance increases of a thous;~~idfoldover tlie next fill niitl-rangc and ~ii;~inframeplatforms intended 25 years. for large commercial applications ancl clesigned to 'The first implementation of the Alpha t1XP arclii- utilize multiple hiti~regenerations of the DECchip. tecture i.4 tlie I>EC:chip21064 microprocessor, which Described by Brian Allison ancl (:;~tl~arincvan can execute up to 400 million operations per Ingen. tlie heart of these systems is a high-perfor- secontl. I)an L)obberpuIil ancl members of the mancc interconnect that allows communic;~tions Alpha chip team offer an overview of the CMOS pro- between multiple processors, memory arrays, and cess tech~~olog)!the chip micro;~rchitect~~re,ant1 I/O subsystems. The ;~uthorsrevie131 e;ich of the the external interface. They tlieti detail the circuit modules and the I/() subsystem design, which implementation ancl explain the design choices includes interfaces for SMI and Futurebus. Notably, directed toward meeting architectural performance ;I 32-bit \'AX <:I'~I ~iiodulehas been designetl to tlie reqi~irements of the higli-performance system this design approach and provicle tletails of tlie interconnect. Users who wish to migrate from the individi~alporting efforts. VAX system to Alpha UPneetl only swap module The process of porting l)C<:nct-VKX to the boards. OpenV1\1S operating system is described by Jim Migration to Alpha tU(P from other architectures, Colombo, Pam Rickard, and Paul Benoit. They dis- in particular from VAX VMS, is one of the major goals cuss the DECnet features supported in the operat- set by the Alpha architects. Existing software- ing system, tlie softm~aretechniqi~es used, and the operating systems, languages, programs-must be importance of the clecisio~ito build common code adapted to run effectively on 64-bit RISC systems. A for the VAX and Alpha IU;P sjSsterns.The autliors paper by Nancy Kronenberg, Tom Benson, Wavne share details of the port ancl lessons learned that Cardoza, Ravintlran Jagannatlian, ant1 Ben Thomas can be applied to tiiture porting efforts. addresses the ch;~llengesof porting the OpenVMS Complementary to the previously mentionecl operating system-originally eleveloped specifi- prototype hartlware system are h)ur software simu- cally for 32-bit VAX systems-to Alpha AXP systems. lators that enabled engineers to tlevelop softw;ire To deal with the huge amount of code, the project for Alpha UPconcurrently with harclware develop- team developecl a compiler that treats VrLY ,assembly ment. Described by George Darcy Ron Brendel; langiiage (VLY MA<:RO-32) as a source language to be Steve Morris, and Mike Iles, the M;ln~iequinsi~nu- compiled. The authors also cliscuss the major arclii- lator was i~setlby the OpenvivlS group to boot tectural differences in the kernel, performance, and the entire operating system and clebug utilities; some future directions for tlie system. the ISP simulator was used by the l)EC oSF/l group The GEM compiler system is the technology with similar success. A major section of the paper Iligjtal is using to build state-of-the-art compiler focuses on the Alpha User-niotle Debugging Envi- products. <;E>I is describetl here by David ronment in which user-mode code being devel- Blickstein, Peter Craig, Caroline Davidson, Neil oped for Alpha .UP platforms c;un be compiled and Faiman, Kent <;lossop, Rich Grove, Steve Hobbs, executed as N11h;l ILYP code. and Bill Noyce. A significant achievement in the Tlie closing paper is an uni~si~alone for tlie clevelopment of this compiler is that a single opti- Jo~lrnnlbecause it addresses engineering manage- mizer is used for all languages and platforms. ment, not strictly technical issues. Peter Conklin Developers of compilers will find in-tlepth informa- offers insights into the reasons for the success of tion in the ;~i~tIiors'disc~~ssions of optin~izatio~i one of the largest engineering programs untler- techniques, code generation, compiler engineer- taken in the intlustry. He defines tlie enrollment ing, and future enhancements. management rnoclel used for tlie Alpha UPpro- Binary translation is another rne;ins of moving gram and explains key concepts, i~icludingthe cornplex software applications from one architec- program office ;~n(lproject "cusps." ture and operating system to another architecture The editors are very gratefill for the help of Hob and operating system. Two binary translators are Supnik, Vice I'resident and (:o~-porateConsultant. the subject of a paper by Dick Sites, Anton Chernoff, in planning this special issue ;inti for writing its M;~ttliewKirk, Maurice Marks, ~111clScott IULTRIX/MII'S images to DEC OSWI AXP images. Barbara Watterson from Digital's semiconductor An easy migration path to Alpha A.XP for two organization; I>i;rne Crawford, Executive Eclitor of database rnali;lgement systems usecl in large com- the CACM; the I>TJ editors; anel the ;ii~tIiorscooper- mercial applications is the subject of a paper by Jeff ated so that these inforni;~tivepapers coultl be Coffler, Zia Moh;~mecl,and Peter Spiro. Tlie authors niacle available to ;I broad technical audience. define the issues involved in por~ingtlie complex W DBMS and Rdb/VivlS proclucts to the tutP plat- form. Adding to the challenge but balanced by its advantages was the decision to have a common source, or single code, base. The authors review Biographies
Brian R. Allison Hrian Allison is a senior consultant engineer for Digital's mid-range VAX/Alpha AXP systems groi~pant1 is the stem architect responsible for the coordination of the VAX a~lclDE<: 7000 ant1 10000 system definition and design. Prior to this work, he served as system architect for the Vr\X 6000 product. Brian holcls a B.S.E.E. and a B.S.C.S.from Worcester Polj~echnicInstitute (1977).
Randy Allmon After receiving a B.S. tlegrce in electrical engineering from the University of Cincinnati, Randy Allmon joined Digital in 1981. As a circuit designer in the Semiconductor Engineering Group, he has contribt~tedto the development of numerous high-performance <:Mas processors. Currently, Randy is responsible for the technical tlesign and manilgernent of :I next-genera- tion processor based on the Alpha AXP architecture. He is the coauthor of four high-performance processor papers given at ISSC<: and has one patent pentling.
Robert Anglin Robert Anglin received S.B.and S.M. degrees in electrical engi- neering in 1989 from the M;lssachusetts Institute of Technology. In the same year, he joined Digital's Semiconductor Engineering Group. where he has worked on the design of high-perform:~ncemicroprocessors. Robert is a niem- ber of Signla Xi. He is currently pursuing an M.B.A. degree at Harvard University.
Paul Benoit Paul Senoit is a ]?rincipnl software engineer in the Networks and Communications Group. He is the project/technical leader for tlie DECnet for OpenVMS AXP project; the team receivecl an Alpha Achievenient Award for early completion of project commitments. Previous to this, Paul led the DECnet-VAX Phase 1V effort. He holds an M.S.S.E. (1991) from Boston University and a H.S.C.S. (1986) from the llniversity of Lowell. Paul is a member of ACM anti IEEE Computer Society.
Thomas R. Benson A consulting engineer in the OpenVMS AXI' Group, ?i)m Benson was the project leader ;lnd princip;~ldesigner of tlie V4X ;Llr\CRO-32 com- piler. Prior to his Alpha AXP contributions, he led the VhlS UECwindows Fileview and Session Manager projects and bro~~ghtthe Xlib graphics library to the V,MS operating system. E;trljel; lie supported ;in optimizing compiler shell used by several V!\X compilers. Torn joined Digi1;il's VhX U;~sicproject in 1979, after receiving 13,s. ant1 ,M.S. degrees in computer science from Syracuse li~liversit)~,He has applied for four patents rel;ited to his Alpha AXP work. David Bertucci David Bertircci received ;I B.S.E.E.degree in 1982 from Wayne State University ant1 an M.S.E.E. degree in 1988 from Michigan State University. He joined Digital's Semiconductor Engineering Group in 1989 and worlzed on advanced (:MOS microprocessor design. Currently, he is employed at Sun Microsystems, Inc.
David S. Blickstein Principal software engineer David Blicksteill has worked on optimizations for the GEM compiler system since the project began in 1985. During that time, he designed various optimization techniques, including induc- tion variables, loop unrolling, code motions, common subexpressions, base binding, and binary shadowing. Prior to this, David worked on Digital's PDP-11 and VAX API. implement:~tionsand led the VAX-11 PL/I project. He received a B.A. (1980) in mathematics from Rutgers College, Rutgers University, and holds one patent on side effects analysis and anotller on incluction v~riableanalysis.
Ronald F. Brender Ron Brender is a senior consultant software engineer, contribilting to the GEM compiler back-end project in the Software Development Technologies Group. He has worked on con1,pilers and program- ming langui~gedefinition for Alpha UP,VAX, PDP-11, and PIIP-10 systems, inclutl- ing Acl;~,FOII'TIWN and HI.ISS. A member of various standards committees since the mid-1970s, Ron is now responsible for Vi\X and Alpha AXP calling standards. He joined Digital in 1970, after receiving a P1i.D. in computer and communica- tion sciences at the University of Michigan.
Sharon Britton Sharon Britton received a B.S.E.E. degree from Boston Ilniversity in 1983 and an M S.E E degree from the Massachusetts Institute of Technology in 1990. She joined Digital in 1983 to work on tlie design and clevel- opment of 80186-based controllers for read-only and write-once optical disk drives. Sharon's graduate research involved tlie development of an integrated content adtlressable memory system with error detection c;~pabilityCurrently a member of the Semicontluctor Engineering Group, she is involvetl in the design ant1 implementation of high-performance CMOS microprocessors.
Wayne M. Cardoza Wayne Cardoza is a senior consultant engineer in the OpenVMS AXP Group. Sincc joining Digital in 1979, he has worlied in various areas of the OpenVMs kernel. \Vdyne was also one of tlie architects of PRISM, an earlier Digital RJSC architecture; lie holcls several patents for this work. More recently, Wlyne participated in the design of the Alpha AXP architecture and was a member of the initial design team for the OpenVMS port. Before coming to Digital, WAyne was employetl by Bell Laboratories. Wayne received a B.S.E.E. from Southeastern Massachusetts University ancl an M.S.E.E. from MIT. Linda Chao Lintl;~Chao received ;I 1% S E E degree from the i\il;tssacht~setts Institute of Technology in 1987. Since joining Digital in the Semiconductor Engineering Group/Atlvancecl llevelopment in 1987, Linda has been engaged in the design of microprocessors basetl on the Vi\X and Alpha AX1' architecti~res. She is currently pursuing master's degrees in electrical engineering and nianage- nient through the MIT Leaclers for illanufirct~~ring1'rogr;rm.
Anton Chernoff A~itonChernof'f is ;I member of tlie technic:rl st;rffat Digital Equi~mentCorporation. working in the Alpha t\XP Migration Tools Group. He joinecl Iligital in 1991, but also worked at 1)igital between 1973 ant1 1981 as proj- ect le;~der;rnd developer of the In'-11 ant1 IISTS/E oper;rting systems. Anton spent 1982 through 1991 ;rt L.iant Soft\v;lre Corporation as a senior consulting engineer in compiler ant1 deb~~ggertlevelolxnent.
Jeffrey A. Coffler A prjncipal software engineer in tlie 1)at;lbase Syst-enis Engineering Gro~lp.Jeff Cofl'ler led the effort to port I>IIMS to the Alpha AX1' plat- form. Prior to this, Jeff worked 011 the DBVS and Rdb b;~ckup/restorefacility ancl on new DBMS fe;itures and ni;rintenance. He is currently working on the project to port Rclb for OpenVMS to operating systems such as Winclo\vs N'f ant1 OSB1. He has ;~lsocontributed to the RSTS/E operating system, WPS-I'LIJS porting, ;rntl workflow nianagernent projects. Jeff joined Digital in 1984 ancl holds a R S<: S (1981) I'roni Californi;~State Ijniversity ;~tNorthridge.
James V. Colombo Project/technical leatler James Colonibo is currently responsil,le for tlie next rele:rse of DE(:net/OSI for OpenVMS for the Vr\X :~nd Alpha AXP computing environments. Prior to this, he lecl the port of DECnet-VAS Phase Iv to the OpenV,\lS AXI' oper;~tingsystem; the team received an Alpha Achievement Aw;~rtlfor earl! completion of the project. Jim also let1 the DE<:net for OS/2 V I .O and v;rrious PIITH\VORKS procluct efforts. Before conling to Digital in 1983, Jirn worked at Prinle <;ornputer, Inc. and Computer Devices, Inc. He holtls ;I 1%SC S fro111Host011 I!niversity ant1 is ;I rnember of lEEE.
Peter F. Conklin Peter Conklin is tlirector of Alpha t\XP Systems Develop- ment. Since joining Digital in 1969, lie h;ls held engineering nian;rgement posi- tions in large ant1 small systems and terminals groups, direct hardware :rntl softw:rre engineering, product management, base product m;~rllS layerecl prod- uct set. Peter received an A\.[\. in n1athem;rtics from Marvard University in 1963. Robert A. Conrad Robert Conrad receivecl a H.S.degree in electrical ant1 com- puter engineering from tlie 111iiversity of Cincinnati in 1984 ancl nn M.S. degree in electrical and computer engineering from tlie University of ~Mii~~achu~ett~in 1992. In 1981 he joined Digital's Semicontluctor Engineering Groilp, where he worked ;IS a co-op student in the Architectilrally Focused Logic Group. Since 1984 Rob has been engaged in the research and tlevelop~lientof VLSl micro- processors, including thc MicroVAX CPU, a 50-MHz RISC CPU, ;lncl most recently the DECchip 21064 microprocessor.
David G. Conroy Dave Conroy receivetl a 13.A.Sc. degree in electrical engi- neering from the University of Waterloo, Canada, in 1977 After working briefly in industrial automation, Dave movetl to tlie United States in 1980. He cofoundecl the mark Williams Company and built a successh~lcopy of the Iliul>; operating systerr~.In 1983 he joined Digital to work on the DECtalk speech synthesis system and related products. In 1987 he became a member of Digital's Semiconductor Engineering Group, where and has been involved with system- level aspects of RISC microprocessors.
Peter W. Craig Peter Craig is a principal software engineer in tlie Software Development Technologies Group. He is currently responsible for the design ant1 implementation of a tlependence analyzer for use in future compiler procl- ucts. Peter was a project leader for the VAX Code Generator used in the VrLY C and VAX PIJI compilers, and prior to this. he cleveloped CPlr perform;~ncesimulatioli
I software in the VAX Architecture Group. He received a R.S.E.E. (magna cum I;lucle, 1982) from the University of Con~~ectici~t;~ndjoined Digital in 1983.
George A. Darcy I11 As a senior software engineer in the Alpha Migration Tools Group, George Darcy has worked on the Mannequin Alph:~)\XI' sin~ulator, the VEST' binary translator, and the Translateel 11nageEnvironment ('l'lE) run-time library. 111 his ten years at Digital, he has also developed a virtual disk driver for the OpenVMS V5.0 SMP operating system, soltware behavioral models of a high- end VAX processor, and various simulatio~iand <:AD software tools. George receivetl a R.S.C.E. (cum laude, 1984) from Boston Universit): where he was an Engineering Merit Scholar ancl a member of Tau Beta Pi.
Caroline S. Davidson Since joining Digital in 1981, Caroline I);~vidsonhas contributed to several software projects, primarily relatetl to code generation. Currently a principal software engineer, she is working on the (;EM compiler generator project and is responsible for tlie areas of lifetimes, storage ;II location, and entry-exit calls. Caroline is also a project leatler for the Intel cocle generation effort. ller prior miork involved the VAX FOIITRAN for IJL'I'KIX, VAX Code Generator, ancl FORTRAN 1V software products. Caroline has a H.S.<:.S. from the State University of New York at Stony Brook. Daniel E. Dever 1);ln Dever receiveti a I3 S E.E. degree in 1988 from the University of Cincinn:iti. He joined Digit;il's Semiconductor Engineering Group in 1988, where he worked on the tlesign ant1 logic verific;~tionof CMOS V!\X tnicroprocessors. Since 1990 he has been involved in the design of R1S<; arcliitec- ture microprocessors, including the floating-point unit of the DECchip 21064 ~~iicroproceorIhln is currently involvetl in the design of integer arithmetic logic for the next-generation processor b:lsetl 011 the Alpli;~t\Xll architecture.
Daniel W. DobberpuN Dan Dobberpulil received a I3.S.E.E. degree from tlie [Jniversity of Illinois in 1067 Subsequent to positions with the Department of Defense and Ge11er;tl Electric Company, he joined Digital's Sernicontluctor Engineering Group in 1976. Since that time, he has been active in the design of four generations of microprocessors, including the first single-chip PDP-11 and the first single-chip VAX. Most recently, 1)an was the project leader for the first \rl.SI implementation of IXgital's new 64-bit Alpha tD(P computing architecture. He is co;~ut hor of the text, The Desi'yz arld ArznI)!sisof VLSJ Cir-ctlits.
Todd A. Dutton A priticipal hartlw;ire engineer, Totld Dutton was responsible for the overall design integration and timing verification of the DEC 3000 AXl' Motlel 500. Prior to this, he led a team in developing vector processor hardware in the Advanced VAS Development Group. Todd joinetl Digital in 1987 Pre- viously, he was employed at Xumerix Corporation and at Signal Processing Systems. Inc. Totltl 1i;ls a 1%.S.degree in computer science from the ~Massachusetts Institute of Technology and was electetl to Tau Beta Pi. He lioltls a patent on vec- tor processor technology and has published two papers on vector processors.
Daniel Eiref Dan Eiref joined Digital in 1987 after receiving B.S. and MS. degrees in electric;il engineering from <:olumbia University. At Columbia he was electetl to Tau Ret;~Pi ant1 WAS awartled tlie Steven Abbey O~~tstandingStudent- r~thletei4m~artl. He is currently attentling Harvard Business School. A principal hardware engineer, Dan was responsil>le for the design of tlie memory and clock systems of the l>E(: 3000 AXP Model 500. He also designetl the workstation's SI.I<:E and rU)l)K ASI<:s. Prior to this project, he worketl as an ECL hardware tlesigner in the Advancerl VAX Development Group.
R. Neil Faiman, Jr. Neil Fai~n;lnis ;I consultant softw:ire engineer in the Software Development Technologies <;roi~p.He was tlie primary architect of the (;EM intermediate I;lnguage and a project leader for the <;EM compiler optimizer. Prior to this work. he led the BLISS conipilrr project. Neil c;lme to Digital in 1983 from MDSI (now SchIu~~~berger/Applicon).He has B.S. (1974) and k1.S. (1975) tlegrees in computer science, both from ~Micl~iganState Ilniversity Neil is a mem- ber of Tau Beta I'i and A<:M, and ;In affiliate member of the IEEE Computer Society. Bruce Gieseke Bruce Gieseke received a B.S. degree in electrical engineering from the University of Cincinnati in 1984. and an M.S. degree in electrical e~lgi- neering from North Carolina State IJniversity in 1985. In 1986 he joined Digital's Sen~iconcluctorEngineering Group, where he has been e~lgagetlin the imple- mentation and circuit clesign of RIS<: microprocessors.
Kent D. Glossop Kent Glossop is ;I princip;il engineer in the Softw;ue Development ?'ethnologies G~OLIP.Since 1987 he h;~sworked on the GEM com- piler syste~n,focusing on code gencr;ition and instruction-level tr;insformations. Prior to this, Kent was the project leader for a release of the VAX PWI. compiler and contributed to version 1 of the VAX Performance and Coverage Arxalyzer. Kent joinecl Digital in 1983 after receiving a B.S. in conipilter science from the University of Michigan. He is a meniber of IEEE.
Richard B. Grove Senior consultant software engineer Rich Grove joined Digital in 1971 ant1 is currently in the Software Develol>ment Technologies Group. He has led the GEM compiler project since the effort bcg;~nin 1985, con- tributing to the code generation phases. Prior to this work, Rich was the project leader for the PDP-11 ancl \(AX FORTRlN compilers, \worked on VAX Ada Vl, and was a member of the ANSI X3.13 FOK'I'RAN Committee. He is presently a member of the clesign team for Alpha AXI' c;illing standarcls and architecture. Rich has H.S. ancl M.S. degrees in mathematics from C~rnegie-MellonUniversity.
Soha M.N. Hassoun Soha Hassoun received a H.S.E.E. degree from South Dakota State University in 1986, ancl all S.M.E.E. degree from the Massachusetts Institute of Technology in 1988. From August 1988 to August 1991 she was employecl at Digital as a custom tlesign engineer in the Semiconductor Engineering Group. She contributetl to the design of tlie flo;tting-point unit of the I)E<:chip 21064 processor. Soha was the recipient of a Digital Minorit)r ;in(I Women's Scholarship in I991 and is pursuing a 1'h.l). degree ;~tthe IJniversity of \Vashington, Seattle, (:oniputer Systems Engineering Uepartme~?t.
Steven 0.Hobbs A ~nemberof the Software Development Technoltigies Group, Stcven Hobbs is working on tlie GEM compiler project. In prior contribu- tions at Digital, he was the project leader for \'AX Pascal. the lead designer for the global optinlizer in VAX FORTIWN, ant1 a member of the Alpha AXP architecture design team. Steve received his A.R. (1069) in mathematics at Dartmouth College and while there, helped develop the original BASIC: time-sharing system. He has an M.A. (1972) in m;ithcmatics from the Univel-sit). of Michigan and has done additional graduate work in compiiter science at Carnegie-~MellonUniversity. Gregory W. Hoeppner Gregory Hoeppner graduatetl with tlistinction from I'urtlue IJniversity in 1979. His research topic was ion-implanted optical wave- guicles. In 1980 he worked at C;ener;~l Telephone ant1 Electronics Researc1.r L;~bor;itor!: \vilere he performecl basic properties rese;lrch on (;ails for fabrica- tion of submicrometer FETs. From 1981 to 1992 he helcl ;I number of positions at Digitill Eqi~ipmentCorporation's Nuclson, iLL\ site, including co-implementation 1e;lcler of 1)igit;il's DECchip 21061. He is currentl). emplo!recl ;is a senior engineel- at III.LI, Adv;~ncedWorkstation Division.
Michael V. Iles Michael Iles is a senior technology consultant at (he UK Alpha MI' Migration Centre. Since joining Digital in 1975, Mike has worketl in various fieltl positions, in Advanced \'AX development :IS iI microcotler, and for VMS engi- neering as a software engineer. He worked on the migration of OpenVMS VAX to the Alpha AXl) platform. designing and implemcntillg a user-mode simulation environment that became AuD. Mike has a B.s~.in electrical engineering (hon- ors. 1973) from City University. Lontlon. and holds a patent for tligital speech synthesis techniques. Hc has several patents pentling for AriI).
Ravindran Jaga~athan Ravintlran Jaga~inathanis :I principal software engi- neer in the OpenVMS l'erformance Group currently investig;~tingOpcnVMS ASI' ~nultiprocessingperformance. Since 1986, he has worked on perforrn;~nceanal- ysis ;III~ch;~r;lcteri~;~tio~i, ;~nd;~lgoritlim design jn the ;ire;is of OpenVMS ser- vices, S>IP, \ihXcluster sjwems, ;tncl host-b;~setlvolume sh;itlowing. Ravindrian receivecl a H.E. (honors, 1983) from the University of M;ltlras, Incli;~,and M.S. degrees (1986) in oper;~tionsrese:~rch and st;~tislicsant1 in computer ;rncl sys- tems engineering from Rcnsselaer Polytechnic Institute.
Matthew B. Kirk Matthew Kirk is ;I senior software engineer in the SE(;/I\I> kYI) ,Migration Tools Group, where he works on binary tl.;inslator de\~elopnirnr. testing, ant1 support. He joined Digital in 1986 ;lnd has also tlqsignccl ;ind clevel- oped automated architectural test software for pipelined VAX harclw:ire and the (3 computer interconnect. Matthew holds it 13,s. in computer scicnce (1986) from the University of Massacl~usetts.
Nancy P. Kronenberg Nancy Kro~ienbergjoined Digital in 1978 and has cleveloped VMS support for several vrV( systems. She designetl and wrote the VklS CI port clriver ;111dpart of the VkIscluster System Communications Services. In 1988, Nancy joined the team th;~tinvestigated ;~lternativesto the VAX ;irchitec- ture ant1 drafted the propos;il for the Alpha /\XI1 ;irchitecture and for porting the (.)pcn\/.LlSoperating sjrstem to it. N;lncy is a senior consulting software engineer ;inti technic;~lclirector for the OpenVhIS t\Xl1 <;SOLID. She holcls ;11i !\.1$, degree in physics froni <':orncllUniversity Kathryn Kuchler Kathryn Kuchler received a R.S. degree in electrical engi- neering from Cornell University in 1990. Upon graduation, she joined Digital's Semiconductor Engineering Group, where she worked on tlie first implementa- Ition of a RISC microprocessor basecl on the Alpha AXP architecture.
Hugh R. Kurth Hugh Kurth joined Digital in 1986 after receiving a R.S. degree in electrical engineering, computer engineering, and mathematics from Carnegie-Mellon University. At Carnegie-Mellon, he was elected to Eta Kappa Nu and was awarded the David Tuma Undergraduate Laboratory Project Award. A senior hardware engineer, Hugh designed the TCDS ASIC and SCSl subsystem for the DEC 5000 AXP Model 500. Prior to this work, lie designed floating-point hardware for two projects in the Advanced VAX Development Group.
Maureen Ladd Maureen Ladd received a B.S degree in computer engineering frorn tlie University of Illinois in 1986. She then joinecl tlie Semiconcliictor Engineering Group within Digital and worked on a 32-bit RISC microprocessor. Maureen received an M.S.E. degree in electrical engineering from the University of Michigan in 1990 through Digital's Graduate Engineering Education Program. Upon her return to Digital, she worked on the implementation of the first micro- processor based on tlie Alpha AXP architecture.
Burton M. Leary Mike Leary is currently a consulting engineer in the Semiconductor Engineering Group/Advanced Developn~entMemory Groi~p.He designed the instruction and data caches for the DECchip 21064 CPlJ and is cur- rently working on the design of advanced memory products. Milze joined Digital in 1980 after receiving a B.S.E.E.degree from the University of Massachusetts.
Liam Madden Liam Nladden joinecl Digital in 1984 and has since designed both ClSC and RISC microprocessors and contributed in the area of CMOS process development. He is currently a consultant engineer in Digital's CPU Advancecl Development Group and his interests include circuit design and CMOS tech- nology development. Prior to joining Digital, Liam designed industrial micro- controllers for ~Mahon and McPhillips, Ireland, and worked for Harris Semicontluctor. He received a B.S, clegree from University College Dublin in 1979 C and an M.E. degree from Cornell University in 1990. Maurice P. Marks Maurice Marks is a senior engineering manager in the Semiconcluctor Engineering Advanced Development Group. He currently man- ages the AXP Migration Tools Group and contributed to the design and imple- mentation of the translators. In Maurice's twenty years with Digital, he has led compiler, operating system, hardware and software tools, CAD, system, and chip projects. He holds B.Sc. and B.E. degrees from the University of New South Wales ant1 l~aspublishecl papcrs on transaction processing, software portabilic): and CAD technology. Maurice is a member of the Australian Computer Society.
Barry A. Maskas Barry Maskas is the project leader responsible for architec- ture, semiconcluctor technology, and development of the t>EC 4000 AXP system buses, processors, and memories. He is a consulting engineer with the Entry Systems Busi~~essC;roup. In previous work, he was responsible for the architec- ture and development of custom vLS1 peripheral chips for VAX 4000 and MicroVAX systems. Prior to that work, be was a codesigner of the MicroVAX 11 CPU and mem- ory modules. He joined Iligital in 1979, after receiving a R.S.E.E.from Pennsylvania State University. He holds three patents and has eleven patent applications.
Edward J. McLellan Etl McLellan is a principal engjneer in the Semi- contluctor Engineering Group. He has co~ltributedto the design of several pro- cessor chips. Ed joined Digital in 1980 after receiving a B.S. degree in computer and systems engineering from Rensselaer Polytechnic Institute, where he was elected to Eta Kappa Nu. He holds three patents in computer design and l~asone application pentling.
Derrick R. Meyer Dirk Meyer joined Digital's Senliconductor Engineering Group in 1986. He was initially involvetl in the design of the cache and memory systems for a chilled CMOS VAX processor. He 11% since been involved in the development of n~icroprocessorsbased on the Alpha AXP architecture. Prior to joining Digital, he was employed at Intel Corporation, where he was involved in the design of various CMOS microcontrollers, inclucling the 80C51 and 80~196. Dirk received a R.S. degree in compilter engineering from the University of Illinois in 1983.
Zia Mohamed Zia Mohamed has been a member of the Database Systems Group since joining Digital in 1989. He works in the area of query optimization for the DEC Rdb for OpenVMs products; his contributions involve cost-based optimization of database queries and algorithms for execution of optimized cluery plans He has tleveloped clynamic OI~optimization techniques, refine~netlt of cost-model, ant1 algorithms for better access plans for views. Zia holds a B.S. tlegree in electrical engineering from Bangalore Universit): India, and an M.S degree in cornpiitcl-science from Texas Tech Universit)! James Montanaro James Montanaro received B.S.E.E. and M.S.E.E. degrees from the Massachusetts Institute of Technology in 1980. He joined Digital Equipment Corporation in 1982. He was a circuit designer on the floating-point chip for the LSI 11/74 and a MicroVAX peripheral chip. He led the physical imple- mentation of the uPRrSM CI'IJ, a 70-MHz prototype RlSC CPU completed in 1988. James also led the pllysical implementation of the first CPU chip based on the Alpha AXP architecture and then contributed as a circuit designer for the DECchip 21064 CPU. He is currently with Apple Computer, Inc.
Stephen J. Morris Stephen Morris is a consultant software engineer in the Semiconductor Engineering Advanced Development Group. In addition to writ- ing the Alpha ISP simulator, he wrote the OpenVMS antl OSF PALcode for the Alpha AXP program. In previous work, Stephen designed the control sections of the instruction prefetch and translation look-aside buffer for an experimental Digital MS<: chip. He also worked on the MicroVAX chip team, doing console ant1 debug work, and in the RSTS/E operating system group. Stephen joined Digital after receiving a U i\ in biology from the University of Rochester in 1977.
William B. Noyce Senior consultant software engineer William Noyce is a member of the Software Development Technologies Group. He has developetl several GEM comp~leroptimizations, inch~dingthose that eliminate branches. In prior positions at Digital, Bill implementetl support for new disks and proces- sors on the RSl'S/E project, led the development of VAX DBMS V1 and VAX RdbNMS V1, and designed and implemented automatic parallel processing for VAX FORTIIAN/HPO. Bill received a B.A (1976) in mathematics from Dartmouth College, where he implemented enhancements to the time-sharing system.
Donald A. Priore After receiving an S.M. degree in electrical engineering and computer science from the Massacli~lsettsInstitute of Technology, Donald Priore joined Digital in 1984. Initially, he worked on device characterization, yield enhancement, and yield modeling of NMOS and CMOS processes in manu- facturing. Subsequently, he joined a CMOS design group, working first with low-temperature CMOS technology and later with conventional CMOS in high- performance microprocessor design. His interests include signal, clock, and - power integrity in the on-chip environment.
Vidya Rajagopalan Vidya Rajagopalan received a B.E degree in electronics engineering from Visvesvaraya Regional College of Engineering, Nagpur, India, in 1986, and an M.S. degree in electrical engineering from the University of Maryland in 1989. She was with Norsk Data India Ltd, from 1986 to 1987 as a systems design engineer. In 1989 she joined Digital's Semiconductor Engineer- ing Group and was a member of the design team of the DECchip 21064 RISc microprocessor. Vidya is currently involvecl in the design of high-performance microprocessors. James J. Reisert A senior hardware engineer.Jim Reisert designecl the TC ASIC for tlie DE(: 3000 AX[' Moclel 500. I'rior to this project work, he designed instruc- tion parsers/decotlers for two \RX imp1ement:itions. Jim holtls a patent for his tlesign of a method for replaying instructjons after a microtrap. Before joining I>igital in 1986. he received an S.B. in electrical engineering from tlie Massa- cliusetts institute of Technolog)! He is currently in charge of timing verific;ition h)r another AXP workstation.
Pamela J. Rickard Principal software engineer Pan1 Rickard is a member of the team porting DECnet/OSI for OpenVMS to tlie Alpha AS1' pl;rtform. As the ini- tial member of the DECnet for OpenVMS AXP porting team, Pam took responsi- bility for creating an effective team, ported NETDRIVER ant1 other MACRO-32 cocle, atitl debugged major portions of the portecl product. Si~lcejoining Digital in 1978, she has contributed to I.'hTHWORKS for OS/2 ant1 led the console, microcode, and system test activities of the VAX-11/785 project. Pam receivecl a H.S. (1970) in mathematics and computer science from the IJniversity of 1)enver.
Scott G. Robinson Scott Robinson is a software engineering mirnager in the AXP Migration Tools Group. He contributed to the design ant1 implementation of the binary translators, particularly the \'AX tratislatecl iniagc environment. Scott has also developed iniplementations of DE<:net ilntl CAI)/<:ANl systems to design \%X processors. Prior to joining Digital in 1978, Scott worketl on ;I vi4riety of Digital hardware ant1 software implementations. He holds a B.S. in electrical engi- neering from the University of Arizona anel is a member of IEEE.
Sridhar Samudrala Sridliar Sa~~iutlralais ;I consulting li;rrtl.iv;~reengineer in the Semicontlirctor Engineering Group, where lie is currently working on a new (:I'[I chip. He joined Digital in 1977. Since t1i:rt time, lie h;rs \vorketl on tlie design and verification of PDPil1/23 chips, VAX 8200 micrococle tlevelopnient, ancl on the ;rrchitecture ;~ntldesign of floating-point chips. He holtls t\vo p:ttcnts :rnd has three patent app1ic;rtions pending, all on floating-point design. Sritlhar received ;III M.Sc. ('Tech) tlegree from Antlhra Universit); India, and ;III M.S E.E. tlegree fron'l the llniversity of Wisconsin.
Sribalan Santhanam Sri Santhanam receivccl a n.E. degree in e1ectric;il cngi- neering from Anna University, Maclr;~s,India, in 1987, anel an M.s.E. dcgrcc in co~il- lxtter xiecne crntl engineering from the University of Michigan in 1089. I Jpon gratluation, he joined Digital as a design engineer h)r the Semicontluctor Engineering Group, responsible for tlie full-custo11-1tlesign rind clevelopnient of high-perform;~nceCMOS VLSI processors. Sri worked on the design of the flo;ct- ing-point unit of the UECchip 21064 CPU. He is currently involveel in the tlesig~lof another high-performance microprocessor. Stephen F. Shirron Stephen Shirron is a consulting software engineer in the Entry Systems Business Group and is responsible for Open\/MS support of new systems. He contributed to man)' ;lre;is of the I>EC 4000, including PALcode, con- sole, and OpenVMs support. Stephen joined Digital in 1981 after completing B.S. ant1 MS. degrees (summa cum lautle) at <>~tholicUniversity In previous work, he tleveloped an interpreter for VAX/Smalltalk-80 and wrote the firmware for the ItQDX3 disk controller. Stephen has two patent applications and has written a chapter in Smc~lltalk-80:Hits of lfistor:~~,Words oJAdi)ice.
Richard L. Sites Dick Sites is a senior consultant engineer in the Semicon- tluctor Engineering Group. where he is working on binary translators ;~ntltlie Npha AXP architecture. He joined Digital in 1980 and I~ascontributecl to v;lrious VAX implementations. Previously, he was employetl by IBhl, Hewlett-Packard, and Burroughs, and taught at the I.iniversit)' of C;~lifornia.Dick received a 13,s. in mathematics from MIT and a Ph.1). in computer science from Stanford University He also studied computer architecture at tlie Ilniversity of North Carolina. He Iiolds a number of patents on computer hnrclware and software.
Peter M. Spiro Peter Spiro. ;I consulting software engineer, is presently the technical director for the Rdl:, ant1 IIt3;MS software proclucts. Peter's current focus is database performance for Alpha AXl' systems and very large database issues. Peter joined Digital in 1985, after receiving 1M.s. degrees in forest science ant1 computer science from the Ilniversity of Wisconsin-Madison. He has four patents related to database journaling and recover): and he has authoretl two papers for earlier issues of the L)iLawrence C. Stewart Larry Stewart received an S.U. in electrical engineering from M1'1 in 1976, followed by M.S. (1977) and P1i.D. (1981) degrees from Stanfortl IJniversity, both in electrical engineering. His PI1.D. thesis work was on data com- pression of speech waveforms using trellis cotling. Upon graduation, he joinetl the Computer Science Lab :~tthe Scrox Palo Alto Resr:~rchCenter. In 1984 he joined Digital's Systems Research Center to work on the Firefly multiprocessor workstation. In 1985) he moved to Digitnl's Gunibridge Research Lab, where he is currently involved with projects relating to multimetlia ant1 AXP products.
Robin L. Stewart Robin Stewart joined I)igit;~lin 1986 after receiving a 1l.s.in electrical engineering from tlie University of \/ermont. She is in the process of obtaining an M.B.A.degree from Boston College. A senior technology (liartlware) engineer, Robin had responsibility for the integrated circuit teclinolog)~in the I>EC 3000 aPMoclel 500 workst:ttion. Prior to this project work, she was a com- l~onentengineer in Digital's Semiconductor Hi~sinessOrganization. Charles P. Thacker Chuck Thacker h;~sbeen with 1)igital.s Systems Research Center since 1983. Ikfore joiiiing Digitill, he was a senior rese;trch fellow at the Xerox I'alo Alto Rcsearch Center. His research interests inclutle computer archi- tecture, comprlter networking, ;inti computer-;~idedclesign. He holds several patents in the ;ires of computer organization :u~tlis coinventor of the Ethernet local network. [ti 1984. Chuck was the recipient (with B. Lampson and R. T;~ylor)of the i\(:kl Software System Award. He received :In A.H. degree in physics from the ljniversity of California in 1967. He is a member ofACM and IEEE.
Benjamin J. Thomas 111 Benjamin Thom;~sjoined the OpenVMS AXIJ project in 1989 ;IS project leader for I/O subsystem design ;mcl porting. In this role, he has also contributetl to the I/O architecture of current ancl future AXP sgtenls. Ben joined Digital in 1982 ant1 has worked in the VklS groi~~since 1984. In prior work, he WASthe director of software engineel-ing ;it ;I microcomputer firm. Ben is ;I consulting engineer and has a H.S. (197%) in physics from tlie University of New Hampshire :ulid an M.S.C.S.(1990) from Worcester Polytechnic Institute.
Catharine van Ingen A consulting softw;lre engineer, <:atliarine van Ingcn was co-system ;~rchitectfor the \'A)(. and [>I':(:7000 proclucts. is cur- rentl!, on Ie;~vefrom Digital and is n7orking 011engineering document manage- ment in large heterogeneous systems. Hefore joining Digital in 1987. she worked on d;lta ncquisition systems for two I;irge physics tlctectot.~:it tlie Ferrni N;ltio~l;iI Acceler;~tor1.abor;ttory and Stanfortl Linc;ir Acceler;~torCenter. She holcls serf- era1 clegrees in civil engineeri~ig,inclutling a 13 s. ;untl ;In k1.S. from the University of Nicholas A. Warchol Nick Warchol, a consulting engineer in the Entry Systems Ii~isiness<;soup, is the project le;itler responsible for I/O architecture and I/() rnoclule tlevelopment for the DE(: 4000 AXI' systems. In previous work, he contributeti to he development of VAX 4000 s)rstelils.He was also a tlesigner of the MicroV,\\S 3300 and 3400 processor moclules and the RQDXJ disk con- troller. Nick joined 1)igitnl in 1977 ;lfter receiving a I3.S.E.E. (cum 1;rude) from the New Jersey Institute of Technology. In 1984 he received an k1.S.E.E. from Worcester Polylechnic Institute. He h;~shmr patent :ipplications.
Richard T. Witek Rich Witek joined l>igit;~lin 1977 to work on DE(:net network nrchitecture during Phase 11. In 1982 lhe joined lligital's Semiconductor Engineering <;SOLI~~where he worked on <:,\I> rlevclopment, MicroVAX VLSr chil>s,ant1 ;I variety of internal MS<:projects. Rich \v;~s;I coclesigner of the ALpha .\XI' ;uchitectilre :rntl the principal micso;~rchitectof the DECchip 21064 CP11 chip. He recei\rctl a 13.,1. degree in computer sciencc from 11uror;a College. Rich is 1 'w!.1 currently- employed by Apple Computer. Inc. I Foreword
the proposed solution l~aclto provide strong conl- patibility with current products. After several months of study, tlie team concluded that only a new RISC architecture could meet the stated objec- tive of long-term competitiveness, and that only the existing VMS and UNlX environments coulcl meet the stated constraint of strong compatibility. Thus, the challenge posed by the task force was to design the most competitive MS<; systems that would run the current software environments. Key groups in Engineering responded to this challenge. A cross-functional team from hardware Robert M. Supnik and software defined the basic architecture. Corporate Consultant, Advanced development teams began work on the Vice Preside~zt knotty eiigineering problems: it1 the serniconduc- Tkch~icnlLlirectol; tor group, the specification ant1 design of a fast Engineering microprocessor, and the automatic translation of executable binary images; in the operating systems groups, on the porting of ULTRlX and of VMS (which was not portable!); and in the compiler group, on It all started with eight people in a conference superscalar code generation. In the fall of 1989, room.':' Alpha became an officially sanctioned advanced The time was the summer of 1988. Digital development pr~gram.~In the summer of 1990, it Equipment Corporation had just closed the best transitioned to product development. fiscal year in its history, with record revenues ancl From the original core in semiconductors, oper- profits. Digital's vA>; systems were the most widely ating systems, and compilers, work expantlecl i~secltimesharing systems in the intlustry and were throughout Engineering. The server and work- the "blue-ribbon standard" for mid-range comput- station hardware groups specified and started ing. Digital was tlie second-largest workstation ven- designing a family of systems, from desktop to clata dor. The company hat1 just introducetl the VA>( 6000 center. The networks group began porting DE<:net, system, its first exlxmdable multiprocessor, was TCP/IP, X.25, LAT, ant1 the many other network- deveJ.oping a true VAX mainframe, ancl had decided ing products. 'l'he layeretl software group inve~l- on a rapid thrust into lilSC workstations to capital- toried the existing portfolio of products and ize on that growing market. What could possibly go prioritized the ortier and i~nportanceof clelivei-J: wrong? The research group pitched in by designing an Nonetheless, senior managers and engineers saw experimental multiprocessor as a software devel- trouble ahead. Workstations hat1 displaced VAX WS opment testbed. from its original technical market. Networks of per- In parallel with the engineering work, market- sonal computers were replacing timesharing. ing, sales, and service teams worked closely with Application investment was moving to standarcl, business partners and customers to shape tlie tleliv- high-volume computers. Microprocessors had sur- erables and messages to meet external require- passed the performance of traditional mid-range ments. These teams briefed key customers ancl computers ant1 were closing in on mainframes. And partners early in the development process ant1 aclvances in RISC technology threatened to aggra- vate all of these trends. Accordingl): the Executive The Corona Borealis conferencr room in the LTNI f:~ciljryin Committee asketl Engineering to develop ;I long- Littletoi~.Mass. !,I4N1 was clloscn bec:~ilsc~t nins thc geogr;~phic term strategy for keeping Digital's systems cornpet- epicenter of the arc of Digital engineering k~cilitieson ,\lassa- itive. Engineering convened a task force to study chusetts Koc~te495, the Corona Borealis bec:~useit was the only conference room mith ~.c'iotlo.ivs. the problem. l'he task force looked at a wide range ofpotential +Mergoing through Illore than one nanlc change. Thc original solutions, from the application of atlvancetl pipe- study team was c;~lletlthe 'RIS(:y \'AX Tksk Forcc:"The ;~tlvancedtlcvclopment nlorli W;ISlabeled "EVhX:' Wl~e~irhc lining techniques in Vtm systems to the tleployment program w;kh al,provetl, the Executive Committee den~;~ntletl;I of a new architecture. A basic constraint was that nr~~tralcode tl;lrne, hence "Alph;~." incorporated their aclvice into the develol~ment The work of Engineering, M;~n~~hcturing.M;II-- prog~-:lm.Ongoing partner ant1 customer aclvisory keting, Sales, and Service c;rme together in Noveni- bo;~rtlsprovitlecl 1-egr1l;ir!kc.tlb;~ck on all aspects ber 1992 with the annoilncement of the Alpli;~&XI' of the 1xogr;Im ;uicl helped shape two critic;il systems family: seven s),stenls, three operating sys- extensions of the original concept: the open licens- tems, six languages, multiple networks, migration ing of Alpha technology. and the porting of tools, open licensing of technology, hartlwilre ancl Wi ndo\vs N1'. software partnersl>ips, and more than 2000 com- Taken together. the scope of the Engineering mitted applic;~tions.Totl;~; Alph:~ ASI' e~iibodies21 effort, the ~ieetlfor il1:lrketing. Field, and Service fi~ndamentalrepositioning of 1)igit:il Equipment involvement, ant1 the liigli degree of customer antl Corporation to be the technology alitl solutions bilsiness partner p;~rticipation,posed irniclue man- leader in twentjyfirst centrlry conipi~ting:;I com- agement challenges. Rather than organize a large- pan). dedicated to meeting customers' neetls n~ith scale hierarchical project, the company chose to the best computing, business, ancl service technol- manage Alpha as a clistributed program. A small ogy available. The tlelivery of Alpl~aAXI' recluired progr;lni te;m irsed enrollment m;~nagernentprac- the largest engineering progr;lm in 1)igit;il's histor): tices ancl strict operation;~ldiscipline to coortlillate spanning lllore tli;tn twenly Engineering groups and inspect activities ;Icross the cornp;lny. This net- worldwide. This issue of tlie lli~qit~ilExlnt?icnl worked appro;~chto n1;ln;lgernent g;ivc the program Jozrrr?al documents just ;I few of tlie lli~ntlredsof both flexibility ant1 resiliency in the face of rapitlly projects involved in bringing Alph:c to fruition; changing business and organizational conditions. filture issues will continue the story. Richard L. Sites I
Alpha AXP Architecture
The Alpha AXP 64-bit conzptiter. arcl~itect~ireis designed for. high pefonizance GIH~ lor~~evitl!Bewi~ise oftlnefoc~is or1 i~~~~ltipl~ilistr~~dior~iss~/e,the ar-chitectui~ does not contain jhcilities s~ichcis brc~~zch~lelc~~ slots, lyte zur3ites,arzclprecise arithnzetic e,vceptions. Brccluse of thefocus 012 multiple processors, the architecture does con- tail? a careful shared-memory rrzodel, atomic-tipdate pri~nitirei~zstluctio~zs, atzd r.elcixed read/ii~riteorcleri~zg. Tl~e first itrzplenzentation ofthe Alpha AXP arclgitec- ture is the zi~orld'sfastestsirzgle-chill ~~zicr~oprocessorThe DECtbip 21064 r.utzs 11zulti- ple operwti~~gsjsterns and r~i~zsnati~le-conzpiled progra~rzs that were tra~ulated fro111 the 1!4X arzd 11.flPSarchitect~ires.
Thus in all these cases the Romans did what all Architecture Distinct wise princes ought to [lo; namely, not only to look from Implementations to all present troilbles. I,ut ;~lsoto those in tlie From tlie beginning of the Alpha AX[' design, we fut~lre,;lgailist which they provided with the Lltniost prudence. distinguished the architecture from tlie implemen- -Niccolo Machial-i-lli. The Y~Yrrce tations, following the distinction made by the IL\M System/360 architects: Computer ;~rcliitectureis tlefinetl as cl~c;~tt[.ibutes Historical Context ant1 behaviol. of ;I computer as seen by :I mz~chine- The Alpha IUP architecture grew out of a srn;~lltask language I,rogr:unmer. This definition includes the force chartered in 1988 to explore ways to preserve instruction set, instruction formats. oper:~tion codes, addressing motles, and all registers ;~ntl the VAX VMS custon~erbase through the 1990s. This memory loc:rtions th;~tmay be tlirec[ly m;~nipu- group eventually came to the conclusion that ;I new laced by a machine-1:lnguageprogr;lninier. retlucecl instruction set computer (RISC) architec- 1mplement:itioll is tlefinccl as the actu:~lh;trdw;~re ture would be neecled before the turn of the cen- structure, logic design. and data-path org;~nization tury, primarily because 32-bit architectures will run of a particular embotliment oF the architecture.' out of address bits. Once we 11i;ltle the decision to Thus, the architecture is a doc~~mentthat pursue a new architecture, we shaped it to do describes the behavior of all possible irnplementa- much more than just preserve the VLI 15 .' customer tions; an implementation is typically a single com- base. puter chip2The architecture and software written This paper cliscusses the architecture from a to the architecture are intended to last several number of points of view. It begins by making the decades, while indivitlual implementations will distinction between ;lrcIiitecture and implementa- have much shorter lifetimes. The architecture must tion. The paper then states the overriding arclii- therefore carefi~llydescribe the behnvior that a tectural goals and cliscusses a number of key machine-langu:lgr programmer sees, but must not arcliitect~~raldecisions that were derived directly describe tlie nie;lns by which a p;lrticul;~rimple- from these goals. The key clecisions distinguish the mentation achieves that behavior. Alpha AXP architecture from other architectures. A similar appro;~chhas been used with much The remaining sections of the paper discuss the success in specifying the PDP-11 ant! VtlX klmilies of ;irchitecture in more (letail, From data and instruc- computers. An alternate approach is to design and tion formats through the detailed instruction set. build a fast RISC chip, then wait to sce if it is suc- The paper conclirtles with a discussion of the cessful in the marketplace. If so, successive imple- designed-in futi~regrowth of the architecture. An mentations are often forced to reproduce accidents Appenclis explains some of the key technical terms of the initial design, or to introduce slight software l~seclin this paper. These terms are highlightecl incompatibilities. This approach works, but with with ;In asterisk in tlie text. varying success. Alpha AXP Architecture and Systems
Architectural Goals Digital RIS(: clesign cnlletl I'RISM.~ We p1:icetl the When we st;irted the tletailetl clesign of the Alpha underpinnings for interrupt tlelivery ;uid reti~m, AXP architecture, we had ;I short list of goals: exceptions. context switchitig, memory manage- ment, ant1 error hantlling in a set of privileged 1. High perfi)rrnance sol'twarc subroutines c:~lledI'r\Lcotle. These sub- routines have control let1 entry points, run with interrupts ti~rnedoff, ;~nd11;lve access to real hard- 3. Capability to run both Vh1S ;lnd IJNIX oper;iting ware (implementation) registers. By inclueling tlif- s!stems ferent sets of PA1.code h)r different operating 4 Easy migr;ition from Vr\X ancl MIPS architectures systems, neither the harclware nor the operating system is burclened wit11 ;I b;itl interface m;~tch,anti Tliese goals directly influencecl our key decisions the arcl~itectureitself is not bi:isetl tow;ircl a partic- in clesigning tlie architecture. ular compi~tingstyle. In consiclering performance atlcl longevity, we To ru~iexisting VA~and MIPS binary images, we set a 15- to 25-year design horizon and tried to avoitl atloptetl the idea of bin;rry tr;msl~~tion,':'as tlescribed any clesign elements that we thought could hrco~iie in n cornpanion paper.^.^.'^ The co~iil,in:ttion of limitations during this ti~iie.In current arcliitec- Pr\I.cocle anel binary transl;ition gave us tlie luxury tures, ;I primary limitation is the 32-bit memory of designing ;I new architecture. Other tli;ln the h~n- address. Thus we adopted a fill1 64-bit architecture, clamental integer and floating-point tlata types. with :I mjnimal number of 52-bit operations for there ;Ire no specific VAX or M~PSfeatures carried backw:irtl compatibility. directly into tlie Alpha i\>;ll instruction-set architec- W'e also cotlsidered how implenietitation perfor- tilre for compatibility re;isons. mance shoi~ldscale over 25 years. During the p;ist 25 ye:irs, computers have become about 1,000 Key Design Decisions times faster. Therefore we focusetl our design deci- This section presents the design tlecisions that clis- sions on :~llowingAlpli;~ ASP system imple~nentn- tinguisli the Alp1-1;~AXI' arcl~itccturefrom others. tions to become 1,000 times kister over tlie coming 25 years. 111011s projections of future perforni;ince, we re;lsoned that raw clock rates woulcl improve by a factor of 10 over that time, ant1 that other tlesign Tlie Alpha AXP architecture is ;I tradition;il HIS<: dimensions would have to provide two niore kic- loacl/store ;irchitecture, All ti;it;i is JII~V~~between tors of 10. registers ;inti Iiletnory without cornpiitalion, ;lnd ;ill If the clock cannot be made fastel; then more computation is done bctwcen values it1 re&''ISteI-S. work milst be done per clock tick. We therefore Little-entlian byte adtlressing ;lnd both VAX and IEEE designed the Alpha AXP architectirre to encourage floating-point operations':' arc carried over from the ~nultipleinstruction issue" implementations that \biS ;ind %]IPS;~rchitecturcs.- We assumed th;it most will eventi~;illysust:iin ;~l)oi~tten new instructions irnpleoicntations woultl pipeline instrilctions, i.~., starting every clock cycle. This aggressive tech- they woulcl start execution of a second, thircl, etc. nique of starting multiple instructions distin- instruction before the execution of ;I first instruc- guishes tlie Alpha AXI' architecture froni many tion con~pletes.We ;~ssunirtlthat the implementa- other Ibytes" has an adclress with N ning. 'This early accommotl;ition for multiple pro- low-orcler zeros. Other memory oj>erantls ;ire cessors also distingi~ishes the Alpha AXI' tenlied u11:iligned. architecture from many other RlSC architect~rres, which try to atltl the proper primitives later. Fz~ll64-bitDesign To run the OpenVMS AM-' and the DE<: OSF/l Tlie Alphii hXl' architecture uses a linear::'&-bit vir- AXP-;inel now the Microsoft Wjntlows N'Y-operat- tual ;~ddressspace. Registers, acldresses, integers, ing systcms, we adopted ;111 itlea from :I previous f1o:iting-[?oilit numbers, :inti character strings are ;ill operated on ;IS full &-bit qi~antities.There ;Ire instances of the same instruction. For example, a no segmented atldresses." dedicated string wgister makes it hard to 11;lve three unrelated string operations in the pipeline at once. Register File To illustrate the performance loss associated In choosing the register file design, we consideretl witli special or liitltlen processor resources, con- both a single combined I-egisterfile ant1 split integer sider a dual-issuc implementation witli ;I four-cycle- and floating-point register files. We chose a split deep pipeline. At the beginning of each cycle. up to register file to support aggressive multiple issue. A six prior instructions are partially executed and combined file is somewhat more flexible, espe- two more are about to be issuetl. Six prior instruc- cially for programs that are heavily skewed toward tions can have six pending writes to result regis- integer-only or floating-point-only computation. A ters, plus six sets of side effects on special or combined file also makes it easier to pass a misture hidclen processor resources. The next two instruc- of integer and floating-point subroutine parameters tions can specify ;I total of four operand registers, in registers. However, split files ;illow graceh~ltwo- two more result registers, ant1 two more sets of sitle chip implementations ant1 smaller integer-only effects on special or hidtlen resources. The decision implementations. They also need fewer read/write to issue 0, 1, or 2 of the next instructions involves ports per file to sustain a given amount of mi~ltiple 36 simple comparisons of pairs of register numbers instruction issue. and 12 complex coniparisons of sets ofsitle effects. We ;ilso considered whether e;icli file sho~~ldcon- The number of sucll comparisons incre;ises as a t;iin 32 or 64 registers. We chose 32, largely hecause function of the issue width, the pipeline tlepth, and the number of special or hicltlen processor 1. Thirty-two registers in each file are enough to resources. The complexity of these comp-I' I-' lsons support at least eight-way multiple issue. can limit the clock rate. The register-number con]- 2. Two valuable instruction bits are better i~sedto prisons are unavoicl;tble, therefore we triecl to make a 16-bit tlisp1:icenient fjelcl in memory- lj~nitspecial or Ilidtlen processor resources. h)rmat instructions. Bmrzcb Delay Slots The Alpha AX]' ;irchitecture More registers might seem better, but excess reg- has no brancl-1 delay slots. The branch tlelay slots isters consume chip area and access time, found in some KISC architectures require exactly save/restore speetl across subroutines and context one following instruction to be executetl after a switches, and instruction bits that might be put to conditional br;incIi. In 1988 this was, perIi;ips, a better use. Compilers can deliver substantial per- goocl itlea for overlapping branch Intenclr jn a sin- h)rm;~ncegains when given 32 registers instead of gle-issue chip with a one-cycle instruction cache. In 16, but there is no clear evidence of similar gains 1995, however, it will not scale well to a h)ur-w~y with 64 registers. Deniantl for registers is likely to issue chip with a two-cycle instruction cache. increase slowly in the future, but a number of Illstead of one instruction, up to eight instructions implementatio~ltechniques, such as short latency would be needed in tlie delay slot. Br;uncli del;~y pipelines ant1 register renxrning, shoultl satisfy this slots also introtluce ;I restart problem jf the instruc- demand. tion in the tlela)~slot fiiults: one restart pr.ogr;rm counter is needed for the delay slot and ;inother one Mtiltiple I~zsl'ructio~zIssue for the actual br;inch target. Our design sought to eliminate any mechanism that would hinder aggressive multiple instruction issue Slippressed Iizstructioizs The Alpha AXP archi tec- implementations. Therefore we tried to ;ivoitl ;II.I ture has no suppressecl instructions, wllerehy tlie special or Iiidclen processor resources.~hus,the execution of one instruction conclition;llly sup- Alpha AXP ;lrchitecture 1i;is no condition codes, no presses a 1-01Lowing one. Suppressetl (or skipped) glolx~lexception enables. no multiplier-quotient or instructions are fountl in other RISC architectures. string registers, no bsanch delay slots, no sup- The suppression bit(s) represent nonreplicated pressed instructions or skips, no precise arithmetic hidden state, so multiple instruction issue is diffi- exceptions, and no single-byte writes to memory. cult for more than one potential suppressor. If an All of these features, found in some 1US(: :~rchitec- interrupt is taken between a suppressor ;uncl sup- tures, have the effect of hintlering multiple instruc- pressee, or if the silppressee takes a rest;~rtable tion issue, or hinclering pipelining of multiple exception (e.g., page fault), tlie correct version of
Digital lkcbnical Journal Vo1.4 llkr 4 Special Issue 1392 2 1 Alpha AXP Architecture and Systems
tlie suppression state 111irst be swed ant1 restored. ;I little more complicatetl anel a little slower. With There are ;~lsoclefinitiotlal problems with this nonreplicated hitlden state, it is difficult to issue ;ipproach: Are exceptions ever reported for sup- another byte store until the first one finishes. pressed instructions? What happens if the sup- Fin;~lly,the existence of a byte store instruction has pressed instruction suppresses a third instruction? led to programs and library routines for other NSC implementations with single-byte move ancl com- Byte Loud or Store Iristt-uctions The Alpha AXl' pare loops. String manipillatioa on Alpha AXP architectlrre has no byte load or store instructions implementations is up to eight times faster by pro- anel no iniplicit irnalignetl accesses. There also are cessing eight bytes ;it a time." no partial-register writes. The byte load/store Insteacl of inclueling byte loacl/store, we followed instructiolis and unaligned accesses found in some tlie RISC philosophy of exposing hidden computa- RIS<: architectures can be a perforniance bottle- tion as a sequence of many simple, fast instructions. neck. They require an extra byte sl~ifterin tlie In the Alpha UParchitecture, a byte load is clone as speetl-critical loatl and store paths, ant1 they force a an explicit load/shift sequence; a byte store as an hard choice in fast cache tlesign. The partial-regis- explicit load/modify/store sequence. We tuned the ter writes found in other RlSC architectures can also instruction set to keep these sequences short. The be a performance bottleneck because they require instructions in tliese sequences can be intermixed, masking ancl shifting in the filndamental operation scheduled, and issued as multiples with other com- of accessing a register. putation, as can tlie rest of the instructions jn the On a previous project involving a MII'S implen~en- architecture. Table I gives :I sunlmary of the Alpha t;~tion,we found the shifter for the loatl-left/lo;rd- fit' instruction set. right instri~ctions to he a direct cycle-time bottleneck. Also, the VAX 8700 i~~iplementation~tio~~A~'il%?/rietic E.vceplio7l.s The Alpha IU(P architec- (circa 1986) removed the byte shifter in the ture has no precise arithmetic exceptions. lo;~cl/storehardware in favor of a faster microcycle, Reporting an arithmetic exception (e.g,,overflow, with 2 cycles for ;I byte 1o;rcl and 6 c!.cles for :III unclerflow) precisely means that instri~ctions ~ln;rlignetl32-bit access. This decision irchieved ;I subsecluent to tlie one c;rusing the exception net performance gain. Our experience encour;~getl must not be executetl. This is straightforwarcl 11sto avoid byte load/store. in 21 slow implementation that runs a single instruc- An addition;il problem with byte stores is that an tion to completion before starting the next one, iliiple~ilenterniay easily choose only two of the but becomes substantially more difficult to do three design features: fast write-back cache, single- qi~icklyin a pipelined four-way issue implemen- bit error correction code (E<:<:), or byte stores. tation. There are standard techniques available Byte stores are straightforward in simple byte- for clelivering precise exceptions while run- parity write-through cache implementations. ning quickly (checking exponents, supl~ressing Except for tlie expensive design of four or five E<:<: register writes, exception silos and backout), but bits for every eight bits of clata, a byte store to ;I fast thesc techniqiles consume substantial design E<:<; write-back cache ilivolves time and can cost some performance. They appear not to scale well with wider multiple issue or 1. Reading a11 entire caclic word::' faster clocks. 2. Checking tlie ECC bits and correcting any single- Exceptional cases are just that-exceptional, or bit error mre, events. Based p;~rtlyon customer requests, we cleciclecl to eliipliasize the performance of normal 3. Moclifiing the byte operations at the expense of exceptional cases. 4. Calculating the new E<:<: bits Rather than an implicit exception ortlering between every pair of instructions, we atlopted the 5. Writing the entire cachc word Cr;iy-1 model of :~rithmeticexceptions-in which This reatl-rnotlifi-write sequence requires l~iclclen exceptions are reported eventirally-plus an seqire~icerli;~rtlware and hidden state to hold the explicit trap barrier (T1WR) instruction that c;111be c:rche word temporarily. The secluelicer tencls to used to make exception reporting as precise as slow clown ortlinary full-cirche-wortl stores. The desired.lo We also tlocumented ;I code-generation neetl for byte stores tends to ripple tl~roughout clesign that needs one trap b;rrrier per branch (at the memory subsystem tlesign, m;rking each piece most) to give precise reporting. Using TRAPB Alpha AXP At.c./7itectrrt.e
Table 1 Alpha AXP Architecture Instruction Set Summary I Load/Store, Byte Manipulation CMPLT Compare signed quadword < CMPLE Compare signed quadword 5 LDA Load address CMPULT Compare unsigned quadword < LDAH Load address high CMPULE Compare unsigned quadword 5 LDL Load sign-extended longword MULL Multiply longword LDQ Load quadword MULQ Multiply quadword LDQ-U Load unaligned quadword UMULH Multiply quadword high, unsigned LDL-L Load sign-extended SUBL Subtract longword longword, locked S4SUBL Subtract longword, scale by 4 LDQ-L Load quadword locked S8SUBL Subtract longword, scale by 8 STL-C Store longword, conditional SUBQ Subtract quadword STQ-C Store quadword, conditional S4SUBQ Subtract quadword, scale by 4 STL Store longword S8SUBQ Subtract quadword, scale by 8 STQ Store quadword AND AND logical STQ-U Store unaligned quadword BIS OR logical EXTBL Extract byte low XOR XOR logical EXTWL Extract word low BIC AND-NOT logical EXTLL Extract longword low ORNOT OR-NOT logical EXTQL Extract quadword low EQV XOR-NOT logical EXTWH Extract word high SLL Shift left, logical EXTLH Extract longword high SRL Shift right, logical EXTQH Extract quadword high SRA Shift right, arithmetic INSBL lnsert byte low CMOVEQ Conditional move if reg = 0 INSWL lnsert word low CMOVNE Conditional move if reg # 0 INSLL lnsert longword low CMOVLT Conditional move if reg < 0 INSQL lnsert quadword low CMOVLE Conditional move if reg 5 0 INSWH lnsert word high CMOVGT Conditional move if reg > 0 INSLH lnsert longword high CMOVGE Conditional move if reg 2 0 INSQH lnsert quadword high CMOVLBC Conditional move if reg low MSKBL Mask byte low bit clear MSKWL Mask word low CMOVLBS Conditional move if reg low MSKLL Mask longword low bit set MSKQL Mask quadword low CMPBGE Compare bytes, unsigned MSKWH Mask word high ZAP Clear selected bytes MSKLH Mask longword high ZAPNOT Clear unselected bytes MSKQH Mask quadword high lnteger Branch Floating Point Load/Store BEQ Branch if reg = 0 LDF Load F format (VAX single) BNE Branch if reg # 0 LDG Load G format (VAX double) BLT Branch if reg < 0 LDS Load S format (IEEE single) BLE Branch if reg I0 LDT Load T format (IEEE double) BGT Branch if reg > 0 STF Store F format (VAX single) BGE Branch if reg t 0 STG Store G format (VAX double) BLBC Branch if low bit clear STS Store S format (IEEE single) BLBS Branch if low bit set STT Store T format (IEEE double) BR Branch BSR Branch to subroutine AddressIConstant JMP JU~D LDA Load address JSR ~umpto subroutine LDAH Load address high RET Return from subroutine -JSR-COROUTINE Jump to subroutine, return lnteger Computation and Conditional Move Floating Point Branch ADDL Add longword S4ADDL Add longword, scale by 4 FBEQ FP branch if = 0 SBADDL Add longword, scale by 8 FBNE FP branch if # 0 ADDQ Add quadword FBLT FP branch if < 0 S4ADDQ Add quadword, scale by 4 FBLE FP branch if I0 S8ADDQ Add quadword, scale by 8 FBGT FP branch if > 0 CMPEQ Compare signed quadword = FBGE FP branch if 2 0
I I I conrinucd on ncx~page Alpha AXP Architecture and Systems
Table 1 Alpha AXP Architecture Instruction Set Summary (continued)
Floating Point Computation CVTGF Convert G to F format and Conditional Move (VAX double/single) CWQ Convert T format to quadword CPYS Copy sign (IEEE double) CPYSN Copy sign, negate CVTQS Convert quadword to S format CPYSE Copy sign and exponent (IEEE single) CVTQL Convert quadword to longword CWQT Convert quadword to T format CVTLQ Convert longword to quadword (IEEE double) FCMOVEQ FP conditional move if reg = 0 C~S Convert T to S format FCMOVNE FP conditional move if reg ;t 0 (IEEE doublelsingle) FCMOVLT FP conditional move if reg < 0 CVTST Convert S to T format FCMOVLE FP conditional move if reg 5 0 (IEEE singleldouble) FCMOVGT FP conditional move if reg > 0 Dlw Divide F format (VAX single) FCMOVGE FP conditional move if reg 2 0 DNG Divide G format (VAX double) MF-FPCR Move from FP control register D~VS D~videS format (IEEE single) MT-FPCR Move to FP control register DlVT Divide T format (IEEE double) ADDF Add F format (VAX single) MULF Multiply F format (VAX single) ADDG Add G format (VAX double) MULG Multiply G format WAX double) ADDS Add S format (IEEE single) MULS Multiply S format (IEEE single) ADDT Add T format (IEEE double) MULT Multiply T format (IEEE double) CMPGEQ Compare G format = SUBF Subtract F format (VAX single) (VAX double) SUBG Subtract G format (VAX double) CMPGLT Compare G format < SUBS Subtract S format (IEEE single) (VAX double) SUBT Subtract T format (IEEE double) CMPGLE Compare G format 5 (VAX double) Srstem CMPTEQ Compare T format = (IEEE double) CALCPAL Call privileged architecture CMPTLT Compare T format < library (IEEE double) TRAP8 Trap barrier (precise exception) CMPTLE Compare T format I FETCH Prefetch (cache) date hint (IEEE double) FFICH-M Prefetch (cache) data, CMPTUN Compare T format modify hint unordered (IEEE double) ME Memory barrier (serialize) CVTGQ Convert G format to quadword WMB Memory barrier (serialize) write (VAX double) RPCC Read process cycle counter CVTQF Convert quadword to F format RC Read and clear (VAX single) RS Read and set CVTQG Convert quadword to G format PALRESO PALcode reserved opcode 0 (VAX double) PALRES1 PALcode reserved opcode 1 CVTDG Convert D to G format PALRES2 PALcode reserved opcode 2 (VAX double/double) PALRES3 PALcode reserved opcode 3 CVTGD Convert G to D format PALRESQ PALcode reserved opcode 4 (VAX double/double)
instructions in the first Alpli;~AXP imp1emenl';ttion lowers ~,erformance 3 percent to 25 percent in real The Alpha ASP architecturc's sl~arecl-nlcmory floating-point progralms ;~iirlless than 1 percent ill ~~lultipl-oc~~si~~gmodcl is ;In integral p;irr of the integer progr;irns, but improves cycle time :lpprosi- clesign. It is 1101 the ;~tltl-onfouncl in other RlSc: mately 10 percent. arcl~itcctures. In co11tr;tst to aritll~l~eticexceptions, nlemor). 'I'hc untlerl!ing pri~iiitivcfor safe updating of nlanagenicnt exceptions, such :IS page faults. ;Ire a mi~ltilxocesor-s11:trctl memory location is a reportetl precisel!: This is not ;IS much of ;I I,urtlen sequcncc of RI:.r' instructions: loacl-locked. in-regis- on implemcnters as precise arithtnetic mceprions ter modify, store-conditional. test. If this seqilence a~o~~ltlbe. ;tntl I;lck of precise nlenior!r manapmcl1t completes with no it~terrupts,no exceptions, ;~iitl faults would be a severe burden on softw:tre no interfering write from anothc-r processor, then writers. the store-conditional stores the motlified result. Al)lna AXP A ~.clnitecl.~~i1.c~ ant1 the test indic;~tessuccess: an atomic update arcliitecture differs from other Rlsc architectures was in fact performed. by careli~llyspecifying a canonical form for 32-bit If anything goes wrong, the store-conditional tlat;~in 64-bit registers. A c;inonical form is ;I stan- does not store a result, and the test i~iclicatesfail- cl;lrclizetl choice of clata representation for retlun- ure. The program must then retry the sequence d;~ntly encoded values. Since 32-bit operations until it succeetls. We chose this primitive sequence assume canonical operands and give canonical (quite similar to tlie MIPS R4000 chip designs) results, very few explicit conversions between 32- because it can be implemented in a way that scales ;incl64-bit representations are needed. up with processor performance. 111the xbsence of The fi~ndamentalunit of data in the Al~lpli;~AXl' an interfering write, the entire secluence can be architecture is a 64-bit quadword." As shown in tlone in an on-chip write-back cache, ant1 hi~ndreds Figure I, quadwortls may reside in memory or regis- of chips can tlo noninterfering sequences simulta- ters. For backwartls compatibility, 32-bit long- neously. The sequence can also be ilsetl to achieve wortls:' may also be stored in memory. byte granularity" of writes in sharecl memory.(' There are three fi~ntlamentalclata types: integer, Tlie Alpha AX[' arcliitecture has no strict multi- IEEE floating point, and VAX floating point; each processor reatl/write ordering, whereby the is available in 32-bit and 64-bit form~.~-I~VAX floating- sequence of re;icls ant1 writes issuetl 1,y one proces- point values in memory have 16-bitwords swapped, sor in a mi~ltiprocessorconfiguration is tlelivered for compatibility with VAS (and PDP-11) formats. to :ill other processors in exactly tlie order issued. The VAX floating-point load and store instructions Strict order is simple, but has a problem similar to tlo word swapping" to give a common register that of byte stores. An implenienter may easily order. The 9-bit loat1 instructions expand vali~esto choose only two of the three tlesign llentures: 64-hit canonical form, anti tlie 32-bit store instruc- pipelined writes. bus retry, or strict reacl/write tions contract 64-bit values back to 32."AII register ordering. to-register operations are thus done on fill 1 64-bit If one processor starts ;i write to location A and a values in a common integer or floating-point for- write to location 13, then tliscovers tli;~tthe write to 111at. No partial-register I.~:ICISor writes ;ire tlone. i\ i\ 11;ls hiled (bus parity error. etc.) :11itl retries it suc- The cano~licalform of ;I 32-bit value in ;I 64-bit cessfully, then a second processor will observe the integer register has the most significant 33 bits all writes out of order: H, then A. equal to bit<31>. In essence, bit<31> is kept ;IS ;I Before Alpha AXI' implementations, many VAX "fat bit." This allows signet1 integer values to be implementations ;ivoicletl pipelinetl wrilcs to main used directly in 64-bit ;~ritIimeticand branches. memory, multibank c;~ches,write-buffer bypassing, This canonical for111 is maintained as ;I closed routing networks, crossbar memory interconnect, system (even for 32-bit data considered to be etc.. to preserve strict reatl/write ordering. Tlie "i~nsigned")by using a combination of 64-bit oper- Alpha AXl' architecture's shared-n~emory111otlel ates, 32-bit add/subtract/rnulti~>ly,and two-instruc- instead specifies 110 implicit orclering between tlie tion sequences for shifts. reads and writes issuecl on one processor, ;IS viewed 'T'he canonical form of a 32-bit value in a by a different processor. This programming model 64-bit floating-point register has the %bit exponent is an enabling technology for a wide variety of high- field expanded to 11 bits ant1 the 23-bit mantissa perfor~iiance i~iil)lement;~tiontechniq~~es. Strict field expanded to 52 bits. Except for IEEE tlenor- orclering can be hpecificd when neecletl by insertion mals.':' this allows single-precision floating-point of explicit memory barrier (Me) instructions, quite values to be used directly in double-precision arith- similar to the lB*l System/370 serinlizatio~itlesign.11 metic and branches. This canonical form is main- tained as a closecl slatem by using single-precision Data Representation instructions. and Processor St-ate Bytes nntl words (16-bit quantities) are not fund;^- This section tlescribes the fundamental Alpha AXP mental data types. Tl~e!. may be transferred tlata types ant1 their representation in memory and between memory ant1 registers with short I-egisters.It also tlescribes the con~pleteIl;~rdware sequences of jnstructioiis ;~nclmanipulatetl jn regis- register state for e;ich processor ant1 outlitles ters using normal arithmetic ;incl the byte-m;inipil- the adtlitio~ial state maintained by operating- lation instructions described in the Operate system-specific 1'AI.cotle routines. The Alpha AXP Instructions section.
DigiLnl TecbaiUrtJoumd Vd. 4 hb, 4 Specic~l/.s.srrc 1992 2 5 Alpha AXP Architecture and Systems
QUAbH6RD lNfffiER (MEMORY) QUADWORD INTEGER (REGISTER)
TEE€ T#.mlWG POINT (MEMORY) IEEE T-FLOATING POlNT (REGISTER) fiR
VAX G-FLOATING POlNT (MEMORY) ,-. mn VAX G-FLOATING POINT (REGISTER) - U OJ L '.=. '.. . . I .. W 1 M2 EXP MI M2 ~3 M4 FX Im war.I I r I 16 16 16 1 11 4 52
LONGWORD INTEGER (MEMORY) LONGWORD INTEGER (REGISTER) 0 -. SSSSSSSsssSsSSSSsSs ... S - - RX 1 3 1 32 1 31
IEEE S-FLOATING POlNT (MEMORY) IEEE S-FLOATING POlNT (REGISTER) 31 0 63 0
EXP MANTISSA SXX EXP MANTISSA 00000000000000000... 0 FX 18 23 1' 11 ' 52
VAX F-FLOATING POlNT (MEMORY) VAX F-FLOATING POlNT (REGISTER) 31 0 63 0
M2 S EXP WI SXX EXP M1 M2 00000000000000000... 0 FX
16 18 7 1' 11 ' 52
'rlie liartlware processor state, shown in Fig~rre2, Hardware implementations may optionally includes 9 integer registers RO..K?I of 64 bits each; inclutle ;I pair of state registers for menlory R31 is alw;~!lszero. There are ;ilso 32 floating-point prefetching (FETCN/FFI'(:I-I-hl instructions), ;inti an registers FO..P31 of 64 bits cacli; FS1 is alw:ys zero. optional interrupt flag for use only by tr;rnslatetl Writes to R31 ;rnd F31 ;Ire ignorecl. VAX OlxnVMS ASP progr;lms th;rt reprotluce com- A 64-bit program counter (IIC) contains a long- plex instruction set computer (C1SC8) instruction word-aligned virtual byte atldress (i.e., tlie low 2 atomicity using a sequence of RISC instructions.(> bits of the I>(: ;Ire ;~lwayszero). The VAX arcliitectr~rr In acltli tion to the above hartlw;rre state, the privi- keeps tlie I><: in general I-egister 15, wlicre it is legctl :ircIiitecture library routines for the \;;irio~rs directly used for PC-rel;~ti\:ememory ;iddressing. In operatitlg systems implement ;itltlitional st;rte. This the Pclgh;~AX]' architecti~rc,lio\vever, code ;rntl tl:~t;~ state m;iy be m:~intainedby h;rrtlware or (I'AI.code) pages are usirally separated by 64 kilobytes (K11) or softw;ire, ;it the option of the implementer, and it more to allow separate memory protection, but tlie varies from one operating system to anotlier. 16-bit displ;lcement in lo;~d/storeinstructions c:ln- Typical I'i.\l.cotle state includes a processor st:ltrrs not span more than 64~11. (13s) worcl, kernel and user stack pointers, ;l process The hardware processor state includcs ;I lock flag control block base for context switching, a process- ant1 ;I locked physic;il atltlress for the lo:rtl- unique v;tlire for thre;ids. ;rntl ;I processor number locketl/storc-condiriornrl sccjirence. It also h;ts ;I for mi11tipl-ocessor disp;itchil~g.Atldition;il I'til.code floating-point control register containing the IEEE state may include a floating-point enable bit, inter- dynamic rountling motle."' rupt jxiority level, ant1 translation loolc-aside Alphci AXP Architecture
HARDWARE STATE
/ 4 d /
R30 (STACK POINTER) F30 R31 (ALWAYS ZERO) F31 (ALWAYS ZERO)
a IEEE FLOATING-POINT DYNAMIC ROUNDING MODE
I LOCKED PHYSICAL ADDRESS I 0 LOCK-FLAG
OPTIONAL HARDWARE STATE I PREFETCH STATE A I PREFETCH STATE B I 0 INTR-FLAG TYPICAL PALCODE STATE
I PS I I KERNEL STACK POINTER I USER STACK POINTER I PROCESS CONTROL BLOCK BASE I PROCESS-UNIQUE VALUE I WHO AM I (PROCESSOR NUMBER) I I] FLOATING-POINT ENABLE (FEN) 1INTERRUPT PRIORITY LEVEL
-
I-STREAM TRANSLATION BUFFER 1 D-STREAM TRANSLATION BUFFER 1 T
buffers for mapping instruction-stream nnd clata- two aligned accesses or report a fat;~lerror, depentl- stream virtual adclresses. All of this state is soft in ing on the operating system design). the sense t1i:lt it is defined only in relationship to illpha I\XP implementations allow clata to be the I'ALcotle routines for a specific operating accessed using either a little-e11di:tn"'view (byte 0 is systeni. In ;I tni~ltiprocessorimplementation, all of tlie low byte of an integer), or ;I big-entlian'. view the above state is replicated for each processor. (byte 0 is the high byte of an integer). As tlescribetl in the Load/Store Instructions section, there is a Mernory Access one-instruction bhs in tlie secluenceb for little- ant1 Alpha AX11 memory is byte atltlressetl, using the low- big-endian byte manipi~lation. est-numbered byte of a datum. Only aligned long- Virtual addresses are a full 64 bits; implenienta- wortls or clu:cclwords may be accessetl: an :~ligned tions may restrict ;~tlclressesto have some number longworcl is a fonr-byte clati~mwhose atltlress is a of identical high-ostler bits, but must always distin- niultiple of four; an aligned quadwortl is an eight- guish at Least 43 bits. Virtual ;~ddresses;Ire nx~pped byte datum m~lioseaddress is ;I multij,le of eight. in an operating-specific way to plij~sical;~dtlresses, Normal loatl or store instructions that specify an using fixed-size pages. Memory protection is clone unaligned :~cltlresstake a precise data alignment on a per-page basis. Atldrcss mapping errors (e.g.. trap to I?4Lcocle (which may clo the access irsing protection, page faults) take precise traps to
Digilnl Tecbnicnl Jortrnnl Vo1. 4 !\'a 4 Speci~~lIssue 1392 27 Alpl~aAXP Architecture and Systems
I'i\l.code. Each page ma!. :~lsobe m;irkccl to provicle The rem;~ining bits cont;~i~ifiltlction (opcode a fault 011each read, write, or instruction-fetch. extension). literal, or dispI;~cetnentfields. To mini- Virt11;il ;~ddressesmay be further qu;~lifietl bj- mize register file ports in fast implementations, R13 atltlress space numbers (ASAS).to allo\v multiple is never written, ;~nclR(: is never rcatl. disjoint addresses spaces. l'hc choice ofclisjoint or All the operate instructions are three-operand common mapping across all processes is clone on a register-to-register, c:llcul;rting RC = RA oper~lteRR. per-page basis. In integer operzttes, tlic opcode and a 7-bit firnction 'The virtual- to physical-atltlrt.ss mapping is tlonc field specify tlic exact operation. Integer operates 011:I per-page basis. Each i~iiplc~~ient;irio~im;iy 1i;ive may Iinve ;In 8-bit zero-extentlecl literal instead of ;I p;tge size of 8m, I~KH,-32Kll. or 64Kl1. ?'he 0 ih;~ RH. In floating-point operates, the opcode and ;in up1?er bound allows a linker to ;~lloc;iteblocks of 11-bit function fieltl specify the exact operation. memory with cliffering protection or ,\h\ proper- Tliere :ire no flo;iting-point literals. ties htr enough apart to work on all implementa- Memory format instructions are used for lo;tds. tions. The virtual- to p.hysic;~l-;ttltlrcshmapping can stores, ;und a few miscellaneous operations. Loads be m;in) to one, i.e.. sgnon!.ms :~rc;~llo\i~ecl. In ;I ant1 stores ;ire two-opcrancl instructions, specifying multiprocessor implement;ition, sIi;~reclmain rnern- a register KA ;uid a base-disp1;lcernent virtual byte ory 1oc;ltions haw tlie same pl1ysic;ll ;~tltlrc.sson all adtlress. l'hc effective atltlress calculation sign processors. Per-processor i~~i.sI~:irccl1oc;ttions ;ire extertcls the 16-bit clispl;~cementto 64 bits ancl :iclcls also ;~llowetl. the 64-bit RI$ b;ac rcgister (ignoring overflow). The Memory has longworcl gl-;~ni~l:irity:two proces- resulting virtu:~lbyte ntldress is mappecl to a pliysi- sors III;IJ'si1ni1Itaneo~is1y ~CCC.SS ;tclj:tccnL longwortls cal adclrrss. The ~nisceIl;~aeousinstructions makc without rnut~~alinterScrcnce. T1ie loatl-locketl/ other uses of the 11A, RH. ;ulcl displacement fields. store-co~~tlitiollalsequence disci~sx~l~>re\,io~~sl!. c;in Branch h)rm;lt instructions spec@ a single regis- be i~sedto achieve rnultiproccs~orbj,te granul;lrit!: ter It1 ant1 ;I signet1 PC:-relative lotigword displ:ice- Inpi~t/oi~tpi~tis memory m:~ppeti: some phys- rnent, The br;uicIi target c;tlculation shifts tlie 21-bit ical memory adclresses may refer to I/() device tlisplacemcnt left b)~2 bits to make it a longwortl registers whose access trigpcrs sirlc cl'l'ccts (such (not byte) tlisplacemetit. then sign extc~ldsit anel as tlie transfer of d;~t;l).Sicle effects on re;~clsare atltls it to tlie uptlatecl I>(:. Conditional br;inch cliscour;tgecl. instructions test register hi, ;ind unconclitional branches write tlie uptl;~tedPC to FU for subroutine Irtstraiction Formats linkage. The 1:lrge longwortl displacement allows a Four funclamental instruction format>-operate, range of+4Rill3, subst;~ntiallyreducing the need h)r menlor!;, branch, and <:h1.IL124L-are shown in branches arountl or to other branches. Figi~re3. All instructions :Ire 32 bits witlc ant1 reside The (:~r.I.-l%l.instruction has only a 6-bit opcotle ill memory at aligned longwortl ;icltlresses. E;tch ;ind :I 26-bit function fielcl. The hnction fieltl is ;I i~~structioncontains :1 &bit opcocle fieltl :mtl zero small integer specifying one of a few dozen privi- to three ?-bit register-number t'ieltls. Kh, Rl). ;tntl RC:. Iegecl ;ircl-litecture Iil)r;~rj~s~~broutines.
OPERATE FORMAT BRANCH FORMAT 31 31 26 21 0 LITERAL 1 FUNC. INTEGER. LITERAL OP RA DISPLACEMENT
INTEGER. REGISTER 6 5 21
RB FUNC. FLOATING POINT CALL-PAL FORMAT 31 26 0 6 5 5 11 5 I I 1 MEMORY FORMAT 1 Op 1 FUNCTION I 31 26 21 16 0
OP RA RB DISPLACEMENT 6 5 5 16
28 I hl. .# .Vo.-I .T/~c,c,ir/l/s.s/tc~ 1992 Digitnl Tec/~niralJour~rnl Alpha AXP Arclnitect~lr.c
OJemte Instructions placing it in the low end of a register with high 'I'here arc five gro~~psof register-to-register operate order zeros. A pair ol E>(Txl./EXt'xFI instructions can instructions: integer arithmetic, logical, byte- perform unalignecl loacls, pulling the two parts of manipulation, floating-point, and miscell;~neo~~s. an unalignecl cl:~tum out of two qlladwords anel All instructions operate on 64-bit qu;~dworcls placing the parts in result registers. A si~npleOR unless otherwise specified. operation can then co~nbinethe two parts into the full tlatum. Integer Arithinetic I~~structiorzsThe integer arith- The insert (INSxx) ;ind mask (MSKxx) instruc- metic instructions are add, subtract, multiply, and tions position new t1:1r;1:incl zero o~~toIt1 data in reg- compare. Acld. subtract, and multiply have variants isters for storing bytes. words, and i~naligtleddata. t11;lt enable ;~rithmeticoverflou~ tr;~ps. ?'liejr ;~lso If the Alpha AXI1 ;ircliitecture were a four-operantl 11;lve longword variants that check h)r 32-bit over- one, inserting and masking coulcl have been com- flow (instc:id of 64) ancl force the high 33 bits ofthe bined into ;I single instruction. result to all equal bit<31>. Add ancl subtract also The compare-byte instruction ;~llowscharacter- Ihave sc;~leclv;lri;rnts that shift the first operanel left string search and conipnre to be done eight bytes at by 2 or 3 bits (with no overflow cl~ecl~ing)to speed a time. The ZAP instructions ;~llowzeroing of arbi- trary patterns of bytes in a register. These instruc- 1113 simple subscripted address arithmetic. The tions allow very fast iml>leme~lt;~tionsof the C IIMII1.H instruction (from I'RISM) gives the higll 64 bits of an unsigned 128-bit product and may be language string routines, among other uses. used for dividing by a constant. rl'lie~.eis no integer clividc instruction; a software subroutine is usecl to Floatingpoilzt AritI~~~zeIicIi~struc'tior?~ Tlie float- tlivicle by a nonconstant. The compare instructions ing-point arithmetic instructions :ire aclcl, subtract, are signed or i~nsignedancl write a Hoole;~nresult (0 multiply, clivide. conipare, ;inti convert. The first or I) to the target register. four have v;iriants for IEEE ;~nd\b\X floating-point, a~idsingle- and double-precisio~iclat;~ t!.pes. ?'hey Logiccrl Itistr~lctiorls The logical instructions :Ire ~lsohave variants that en;ible combin;ttions of arith- AND, OR, and XOR, with the secontl opcs;lnd metic tral~sanclt1i;lt spccih the roi~nclingmode. optionally complemented (ANDNOT, ORNOT, Tlie single-precision instructions write canonical SOIINOT). The sliifts are shift left logic;il, shift right 64-bit results, but clo exponent checking and logical, ;lncl shift right arithmetic. The (,-bit shift rouncling to single-precision ranges. The compare count is given by RB or a literal. The conclition;~l instructions write ;I Boolean result (0 or nonzero) move instructions test RA (same tests as the branch- to the target register. Tlie convert instr11ctio1.1~ ing instructions) and conditionally move RIi to It(:. transfer between single nncl clouble, floating-point These can be llsetl to eliminate branches in short and integer, ancl two forms of Vt\X double (D-float sequences such as &llN(a,b). ;Inel G-bat). A conlbination of liarelware and soft- ware provides full IEi(li ;~rithmetic.Operations on Bj)te-~i~unip~~btionctonIt~structiorts The byte-manip- \%S reserved operancls;' clirty zeros,'.' IEEE denor- i11;ition instructions are usetl with the lo;~tl:~ncl mals, infinities, and not-;1-1ii1rnht.rs are done in store i~nalignedinstructions to manip11l;lte short softnlare. un;~lignedstrings of bytes. Long strings shoultl be There are also :I fc\v f1o:lting-point instr~~ctio~is manipulated in groups of eight (aligned quad- tlxit move tlata withoi~t;~pl,lying any interpretation worcls) whe~ieverpossible. The b\~e-m;u~iipi~l:ltion to it. These include a complete set of conditional instructions :Ire fi~nclamentallymaskecl shifts. They move instructions similar to the integer conditional differ born normal shifts by h;~vinga byte count moves. (0..7) instead of ;i bit count (0..63), ;lncl by zcroing some bytes of the result, based on tlic cl;~t;~size h4iscellar~eon.s I~zslruclio~zsThe miscellaneous given in the function field. instructions incl~~de:Inernory prefetching instruc- The extract (ESTxx) instructions extract part tions to help decre:~sememory latency a re:ld cycle of a I-, 2-, 4-, or 8-byte field from ;I qu;iclworcl counter instruction for perh)rrn;111cemeasurement, anel place the resulting bytes in a fielcl of zeros. A ;I trap barrier instruction for forcing precise arith- single ES'l'xL instruction can perform b!.te or worrl metic traps, ancl nlernor!. barrier instructions for loads, p~~llingthe datum out of a qi~aclwortl;11ic1 forcing multiprocessor rc:lcl/write ordering.
Dig ~I~IIT~CIJ~I~CL~I JOU~ILLCI 14~1.4 !YO. 4 .S/)L,CIO/ /.C.SIIO 1092 29 Alpha AXP Architecture a~ldSystems
conditional moves, the ;irchitecture contains hints The load ant1 store instructions only move data. to improve branching performance. They never apply an interpretation to tlie data and The integer contlitional branches test register RA therefore never take any data-tlepentlent traps. This for an opcotle-specified condition (>O >=O =O !=O design allows moving completely z~rbitrarybit pat- <=O routine call, return, CASE (or storing ;I byte (the first two and last two instruc- SWIT(:H) statements, and coroutine linkage. tions might issue sin~ultiineouslyon the first Alpha The architecture specifies three kinds of brancli- AXP implementation). Example 4 shows a sequence ing hints in instructions. The hints need not be for ;In explicit unalig~ietlIoatl quatlwortl (no data correct, but to tlie extent that they are, implementa- ;il ignment trap). tions may perform bster. 'l'lie integer load-lockecl ;lnd store-conditional The first for111 of llint is an architectetl static (1,I)Q-I., LDL-L, ST(>-C, STL-<:) instri~ctionsare branch precliction rule: forward conditional inclutled in the architecture to facilitate atomic branches :ire predicted not-taken, and backw;~rcl i11~tl;ites of multiprocessor-sh;~red tiAs ones taken. To the extent that compilers and hard- tlescribed above, they can be i~sed in short ware implenlenters follow this rule, programs c;~n sequences of RISC instructions to do atomic read- run more quicldy with little hardware cost. This modify-writes, Example 5 shows ;I sequence for hint does not eliminate the use of dynamic br:unch doing a multiprocessor test-;inti-set. Note that prediction in ;in implementation, but it may recluce ch;lnging the r.DQ-~J/Sl'Q-tl in Exr~niple 5 to tlie need to use it. ANI)/~.I>Q-L/ST(~-C/BEQgives a byte-store sequence The secontl for111 describes co~npi~tedjump tar- chat is s;ll'c to use with n~~~Itiproccssor-~l~:~recldata. gets. Unusetl instruction hits are defined to give the There :ire two related loxl :~dtlrcssinstructions. low bits of the most likely target, using the same tar- 1.1)~ calculates the effective atltlress and writes get calcul;~tionas unco~iditionalbranches. Tlie 14 it into R(;. LDAH first shifts the tlisp1;lcernent bits providetl are enough to specifi the instruction left 16 bits, then calculates the effective aclclress offset within ;I page, which is often enough to start ;inti writes it into RC. LDAM is inclilcletl to give a sirn- a fastest-level instruction-cache read many cycles pie way of creating most $2-bit constants jn a before the :~ctualtarget value is known. pair of instructions. (Because I.I>A sign-extends Tlie third for111 tlescrjbes sr~broutine;lot1 corou- the tlisplacement, some v;ilucs in the range tine returns. By marking each branch and jump as 000000007FFF800O .. 000000007FFFFI:FF require c:ill, return, or neither, the architecture provides three instructions.) Constants of 64 [?its :ire loadetl enough information to maintain a small stack of with I,DQ instructions. likely subroutine retilrn atldresses within an iniple- mentation. This implement;~tionstack can be usecl Blwrzclning Instructions to prefetch subroi~tinereturns quickly. 'The br;incli instructions include conditional The conditional move instructions (discussed branches, unco~lditio~l;~~br;lnches, and c;llculated previously in the L.ogic:~lInstructions section and jumps. In ;~cltlition to the previously described the Floating-point Arithmetic Instructions section) Alpha AXP Architecture
EXAMPLE 1: LOAD BYTE (UNSIGNED. LITTLE-ENDIAN) 7 60432 10 LDQ-U R2,O(R1) 1 ( BYTE 1 R2 76543210 EXTBL R2,Rl ,R2 r 0 IBYTE R2
EXAMPLE 2: LOAD BYTE (SIGNED. BIG-ENDIAN) 01034567 LDQ-U R2,O(R1) I ( BYTE ( R2
SUBQ R31 ,R1 ,R3 I -2 R3 01234567 EXTQH R2,R3,R2 BYTE 1 R2
EXAMPLE 3: STORE BYTE (LITTLE-ENDIAN) 76043210 LDQ-U R2,O(Rl) I I OLD I R2 76543210 INSBL RO,Rl,R3 1 NEW 1 R3
OR R2,R3,R2 I ( NEW ( R2 76543210 STQ-U R2,O(R1) I NEW I (o(~1)
EXAMPLE 4: EXPLICIT LOAD QUADWORD (UNALIGNED, LITTLE-ENDIAN) 7 6043210 LDQ-U R2,O(RI) LOW PART I R2 15 14 13 @ 11 10 9 8 LDQ-U R3,7(R1) HIGH PART R3 76543210 EXTQL R2,Rl ,R2 I LOW PART R2 76543210 EXTQH R3,Rl ,R3 I HIGH PART R3 76543210 OR R2,R3.R2 I HIGH PART ( LOW PART R2
EXAMPLE 5: MULTIPROCESSOR TEST-AND-SET
LDQ-L R2,O(R1) I FLAG R2
BNE R2,FLAG-SET FLAG R2
BEQ R2,CONTENTION I STORED? IR2
Figure 4 Load/Store Instrzlctio17s
Digital Tech~cicalJo~i~-lrnl Vol 4 iVo 4 J/I~LLCI/I.FSLIL' 1992 3 1 Alpha AXP Architecture and Systems
anrl the bratiching hints eliniinate some branches 1,000. liacli ;irchitected IYl'P; form:it also hi~sonc bit ant1 speetl 1111 tlie remaining ones without conipro- reservetl for hture expansion. mising multiple instruction issue. Sevcr:il other soft 1)ALcocle registers, such ;IS tlic 1'5 or ,iSN. that need only a few bits toclay ;Ire allo- Supmu ision cated ;I fi111 64 bits for future exp:insion. The ;ictions unclerpinni~igan operating system are Exception processing can limit perforrn;~ncr. perforrnetl in [?.\J.cotlesubroutines ant1 are a flexi- I'Al.code routines tleliver exceptions to an oper;lt- ble 1x11-tof the architecture. All ;isyncI~rotioiis ing s).stcln, so the dcsign can be gr;iclt~;~lly events, such as intet-rupts, exceptions, and m;~chine im]>rovecl. In hct. I?,\Lcotle routines for t1-1~.clat;~ errors, are rnediatecl by l%I,code routines. I'i\Lcotlc ;~lignmentIxtve bcen iniprovetl in the Open\ ITISAXP establishes the initial state of the machine before ;untl Ill:(: OSWI t\XP operating systcnis. Some cur- execution of the first software instruction. l'A1~c0tle rently specified softwarc exceptions (such ;is Ilrlil; routines 11iecli;iteall Accesses to physical liartlw:~rc clenornl;il ;irithmetic) could be mo\~etlinto 1'/2l,coclc resources, includitig physical main memory anel or hartl~v:ire. memory-rn;i1>1>etlI/() device registers. ?'here ;)I-e;I llumber of areas of instl-i~ction-set flesibilit!. designetl into the architecture. tour of This design allows irnple~iientersto craft ;i set of l't\Lcotle routines tliat closely match an oper;tting the (,-bit opcodes are no~iiinally reser\.ecl for s),stem tlesign, not onljr for trnclitioniil oper:iting ;itlcling intcgcr and floating-point alignctl oct:~- s),stems, but also for specialized environ~iientssuch \s;ort14(128-bit) load/store instructions I' Nine more ;is rc;il-time or bigblj. secure cotiiputing. As new 0-bit ol~codesremain for other cxpiinsion. Witliin co~~ipi~tingp:ir;~dignisare atlopred ;lntl new opcr;it- e;icli opcocle, the function firltl contains room for ing systems ;ire created, the Alpha r\XP architecture fi~~-tIicresp;tnsio~l. For csample. tlie sc;ilcd ;icltl/sirl~- may we1 l prove flexible enougli to accommotl;~le tract f~rnctionswere atldcd between l~rototypc then1 cfficie~~tly. chip iind protluct chip. The fnct th;it tlie function fielcls ;ire not f~11Iypolicecl is a mistitke. Future Changes Within the IiXE floating-point f~tnction ficltl, coclc points are nomin~llyreserveel for clo~~blc- The Alpli;~AXP ;~rchitecturewill surely ch:ulge estenclrtl" precision (128-bit) aritlimetic. Within during its lifetime. In addition to the I'~iI.cocle {:lie memory b:trricr instruction gr0~113,tbre~ COCIC' flexibility discussed above, explicit performance poi~itswere reserved for subset bal-ricrs. One of flexihilit)~and instruction-set flexibility exist in these h;is alre;idy been redcfi~inetlas ;I write-write the arcliitcct~~re. b;trricl: Architectural fields that are too sm;11l can limit Not all changes involve growth. l'herc ;*resubsct- perfo-ni;ince. The Alpha hXP architect~~rethere- ting rulcs clcfined for removing either one or both fore h;~smany fields tleliberatel!. sized for 1;iter (1I;I;l; ;uicl US) floating-point clat:i types. If both ;Ire exp;insion. removecl, the floating-point registers c:ln ;~lsoI>e i\lthougli initial implernentations ~1stonl!, 43 rerno\,etl. Tlie h.\l(>Yss P~\Lcocleroiltinc> :~ntl1<5/11(; bits of virtual ;~cldress.they check the reriiaining instructions are defined as option:il ;~ntlc;in be 21 bits, so tli;~tsoftware can run ~inrnoclifictl011 tleletecl nknthe trailsirion of tl-;insl;itcrl \ii\S coclc 1;lter implementations that use (up to) :ill 04 bits. is cornplcted. Other unneecled l)r\l.cocle rolltines Furthermore, ;~ltl~oughiniti;il implementations use c:m ;~lsobe removed e\;enti~;ill\: only 34 bits of pliysical atldress, the nrchilcctetl I>;lge t;tble entry (PTE) formats ancl p:~ge-size choices allow growth to 48 bits. By exp;lncling into SZ~VI~?~~~?~~ a 16-bit I'?'E fieltl tli;lt is not currently used by m;~p- FShcgoals rii;lt shaped tlie Alpha AXl' ;~rchitecture ping Ii;irtlw;~re,another 16 bits of physical adtlress clesigli have largely been realized. For high perfor- grcnvth can be acl~ieved,if ever neetletl. mance, the first irnp1ement;ttion (the I)lcc:cliil~ Initial implementations also use only SKI3 p;lges, 21004 11iicroprocessor) is listcd in the October 1992 but tIie clesign accommodates Ii~iiit~tlgrowth to (,'r~irilicss:soot2 of Recor.ds ;IS tlie world's cistest sill- 6,iKIi pages. Beyontl that, page t;tljle gr:in~~l:lrit)~ glc-chip microprocessor. It is too early to mc;isure liin~s:illow groups of 8, 64. or 513 p;iges to be longe\..ity but tlie fact that \\.c had clesigncd-in flcsi- rrc;ttc.cl ;is ;I single large page, th~iseSf'ectivcljr bility in pl:~ceatl1;it cliangecl tlurir~gclc\.clopment is cstcncling tlic. p:~gc-sjzerange b). :I ktctor of ovc,r ;it 1c;ist encour:iging. OpenVYlS ,\XI', [>kc:OSI/l ,\XI'. and Windows NT operating systems a11 run on IEEE not-a-~zuinber(NLLAV-A symbolic entity Alpha AXI-' implementations today. Programs from encoded in a floating-point format. The lEEE stan- the VAX ant1 ,\LIPS architectures transport easily to dard specifies some exceptional results (e.g., 0/0) Alpha AXIJ implenientatiolls ant1 run quickly Many to be NaNs. of the ideas in the Alpha AX11 design are now being Linear crddressilzg-A memory addressing tech- adopted by other architectures in the industry. nique in which all addresses form a single range, from 0 to the largest possible address. Subscript cal- Appendix culations can create any address in the entire range. Biiznry traizsl~~tioiz-A software technique to change an executable program written for one Little-endian rnenzory addressirzg-A view of architecture/operating-s~rstempair into an equiva- memory in which byte 0 of an operand contains the lent program for a different architecture/ol>erating- least significant bit of an integer. The terms little- system pair. endian and big-endian are borrowed from Gulliver's Tr~iz~elsin which religious wars were Rig-endiaiz nleinory addressing-A view of niem- waged over which end of an egg to break. ory jn which byte 0 of an operancl contains the Longz~~ord-A32-bit datum. most significant (sign) bit of an integer. Compare lit- tle-endian memory addressing. Mz~ltipleinstr~iction issz~e-A highperformance computer implementation technique of starting Byte-An 8-bit datum more than one instruction at once. An implements Byte granularity-The appearance that two pro- tion that starts (up to) two instructions at once is cessors can update adjacent bytes in memory with- called dual-issue; four instructions, quad-issue or out interfering with each other. four-way issue; etc. CLSC-Complex instruction set computer, charac- terized by variable-length instructions, a wide vari- Quadu~orcl-A 64-bit datum. ety of memory aclclressing motles, and instructions that combine one or more memory accesses with RISC-Reduced instruction set computer, charac- arithmetic. <:Is(: designs express computation as a terized by fixed-length instructions, simple Inem- few complex steps. ory acltlressing modes, and a strict decoupling of load/store memory access instructions from regis- IEEE clei~ormalizeclnunzber (denorma1)-A float- ter-to-register arithmetic instructions. NSC designs ing-point number with magnitude between zero express computation as many simple steps. and the smallest represelltable normalized number. Numbers in this range are typically not repre- Segmented addressi~zg-A memory addressing sentable in other floating-point arithmetic systems; technique in which acltlresses are broken into two such systems lnight s~gnalan untlerflow exception or more parts (segments). Subscript calc~~lations or force a result to zero instead. can only be done within a single segment, and elab- orate software techniques are needed to extend IEEE clouble-extended fornzat-A loosely specifed addressing beyond a single segment floating-point format with at least 64 significant bits of precision and at least 15 bits of exponent VAX dirty zero-A zero value represented with a witlth; typically implemented using a total of 80 or non-zero faction; must be converted to a true zero 128 bits. result.
IEEE clynntnic rounding mode-One of four differ- VAX flroating-point-A form of computer arith- ent rounding rules. metic specified by the VAX architecture manual.4 VAX arithmetic includes rules for reserved IL5k f%o~!tiizgll,oi~zt-Aform of computer arith- operands and dirty zeros. metic specified by lEEE standartl 754.l' IEEE arith- metic inclr~desrules for tlenormalized numbers, VAX reserved operand-A non-number that signals infinities, and not-a-numbers. It also specifies four an exception when used as an operand in VAX float- different modes for rounding results. ing-point arithmetic. IEEE i~zfinitjl-An operand with an arbitrarily large VAX ulord su~apping-The rearrangement needed magnitude. for the 16-bit pieces of a vi\X floating-point number
Digital Technical Journal %)I. 4 IVO. 4 Special Issue 1992 Alpha AXP Architecture and Systems to put the fields in a more ilsr~alorder; this is an arti- 8. There are two special-resource ;inornalies in krct of the l'l>IJ-l L 16-bit ;~rchitecture. the architecture that we were unable to avoid: the tleclicated state for the load-locked W~I*L/-)~16-bit tI:~ti~ni, instruction and the dynamic rountling-mode register required for f11ll IEEE conh)rm:~nce. Acknowledgments 9. This is borne out in a large customer's recent Hirntlretls of people have worked on the Alpha AX[' C string manipillation benchmark result, run- ;~rcliitectt~re,h;irdware, ancl software. M;my Alpli;~ ning 3 to 6 ti~iiesfaster than the custoniet-~s AXI' arcllitectural icleas came from the PRISM expectation (wl~icliwas basecl solely on clock design, most notably the PALcode idea. The archi- rate ratios). tecture work was clone in the rich envirolirnent of dozens ;inti later Iiundrecls of bright, thoi~glitfi~l, 10 C~-LIJ~-~Colrzputer Systenz KeJerence ~Vl~~~zuol, and outspoken professional peers. Ellen bat bout;^. For111 2240004 (Minneapolis Crny Rese:ircli, Dileep 13handarkar, Richard Brunnel; Wayne Inc., 1977). <:;irdoz;~.Dave Cutler, Daniel Dobberpuhl, Robert I I. IB,l-1 Systenz/.?70 Prilzciples of 0~)erwtion. (iiggi. Henry Grieb, Richard Grove, Robert Form C~22-7000-4(Armonk. M': Ilih4 Corpo- M:~lste;~d.Jr.. MicIi;~elHarvey, Nancy Kronenberg, ration. 1974): 28. R~ymonclLanza. Stepheti Morris. Willialn Noyce, Charles Nylanclec Ilave Orbits, Mary Payne. Audrey 12. Institute of Electrical and Electronics Engi- Reith, Robert Supnik. Benjamin Thomas, Catharine neers. "Binary Floating-point Arithlnetic for van Ingen, ant1 Rich Witek all contributed directl!, ,Microprocessor Systems,'' Standard Number to the written specification. Rich Witek is co-;rrclii- 11313-754 (New York, 1985). tect ;~ndis the other half of the term "we" usetl in 3. The careh~lreader will notice that Alpl~;~,\XI) this paper. implementations require a longword shifter in the loacl/store patli for 32-bit operands. References and Notes Although we briefly considered ;I design with no 32-bit operands, we decided to Iccep 52-bit 1. C;. Amdahl, G. Blaauw, and E Brooks, Jr., load/store support for good business reasons. "Architecture of the IBM System/S60," Ih'M Similarly, iUpha AXP implementatiolls require Jou1.7~rrloJKesearcl7 arzd Derlelopment, vol. 8, ;I worcl swapper in the load/store patli for VAX no. 2 (April 1967): 87-101. flo;~ting-pointoperands. We clecidetl to keep 2. R. Sites, ed., Alkbn Architectur-e Refererlce VAX floating-point support for good I)i~siness lLIr~rz~~~l(Burlington? ha: Digital Press, 1992). reasons. Depending on market needs, \'AX floating-point support can be removetl in the 3. 11. Conr;ltl et al., "A 50 MIPS (Peak) 32/(i4b h~ture. Nlicrolxocessor," ISSCC Digest cg ~~'c%?I~~L.cI/ Pz~I~ers(February 1989): 76-77 14. Many commercially successful arcliitcc- tilres have grown to double-witlt11 memory 4. It. Brunner, etl., I2.X Arclnitectnre Kt?jbr.er?cc. implementations in mid-life: the lohl 709 ,Wcln~~oISecond Etlition (Bedhrtl, MA: 1Iigit;ll series from 36 to 72 bits; the II%MSysteni/3GO Press, 1991). series from 32 to 64 bits: the Digital I'1)1'-11 series from 16 to 32 bits; ;rnd the 1)igit;ll 5. <;. Kane and J. Heinrich, ,llIP,F R1.K Ar'cbitec- VAX series from 32 to 64 bits. This trentl is ture (Englewoocl Cliffs, NJ: Prentice-Hall, likely to continue. 1992).
6. R. Sites, A. Chernoff, M. Kirk, M. Marks, atid S. Robinson, "Binary Translation," L)igitz~I Tcclnnic-a/Jourrznl, vol. 4. no. 4 (1992, this issue): 137-152.
7 The little-endian bias is very slight; both big- and little-entli;~nAlpha AXP systems and soft- wxre are in fact being built. Daniel W Dobberpuhl Robert A. Conrad Liam Madden Richard 1: Witek Daniel E. Deuer EdwardJ McLellun Randy Allmon Bruce Gieseke Derrick R. Meyer Robert Anglin Soha M.N Hassoun James Montanaro David Bertucci Gregory W Hoeppner Donald A. Priore Sharon Britton Kathryn Kuchler Vidya Rajagopalan Linda Chao Maureen Ladd Sridhar Samudralu Burton M. Leary Sribalan Santhanam
A 200-MHz 64-bit Dual-issue CMOS Microprocessor
A 400-1n~s/200-1~1FLOPS(peak) custom 64-bit VLSICPU chip is described. The chip is fabricated in a 0.75-pmCMOS technology utilizing three levels of metnlizntio~zand optimized for 3.3-Voperation.The die size is 168 mm X 13.9 mm and contains 1.68 lnilliolz transistors. The chi) includes sepamte 8KB instrtiction and dcit~rcacl?es and a fully pipelined floating-point unit tlgat can handle both IEEE and VRY standard floating-point data types. It is designed to execute two instructionsper cycle among scorebonrded integer;floating-point, address, and branch execution units. Pozoer dissipation is 30 W at 200-MHz operation.
A reduced instruction set computer (NSC)-style CMOS Process Technology microprocessor has been designed and tested that The chip is fabricated in a 0.75-pm, 3.3-V, n-well operates up to 200 megahertz (MHz). The chip CMOS process optimized for high-performance implements a new 64-bit architecture, designed to microprocessor design. Process characteristics are provide a huge linear address space and to be devoid shown in Table 1. The thin gate oxide and short of bottlenecks that would impede highly concur- transistor lengths result in the fast transistors rent implementations. Fully pipelined and capable required to operate at 200 MHz. There are no of issuing two instructions per clock cycle, this explicit bipolar devices in the process as the incre- implementation can execute up to 400 million oper- mental process complexity and cost were deemed ations per second. The chip includes an 8-kilobyte (JSB) I-cache, 8KB D-cache and two associated trans lation buffers, a four-entry, 32-byte-per-entry write buffer, a pipelined 64-bit integer execution unit Table 1 Process Description - - with a 32-entry register file, and a pipelined floating- Feature size 0.75 Krn point unit (FPU) with an additional 32 registers. The Channel length 0.5 pm pin interface includes integral support for an exter- Gate oxide 10.5 nm nal secondary cache. The package is a 431-pin pin 0.5 Vl-0.5 V grid array (PGA) with 140 pins dedicated to V,,/y, Ynl 4, (power supply voltage/grouncl). The chip is fabri- Power supply 3.3 v cated in a 0.75-micrometer (pm) n-well comple- Substrate P-epitaxial with n-well mentary metal-oxide semiconductor (CMOS) Salicide Cobalt-disilicide in diffusions process with three layers of metalization. The die and gates measures 16.8 millimeters (mm) x 13.9 mm and con- Buried contact Titanium nitride tains 1.68 million transistors. Power dissipation is Metal 1 0.75-pm AICu, 2.25-pm pitch (contacted) 30 watts (W) from a 3.3-volt (V) supply at 200 MHZ. Metal 2 0.75-~mAICu, 2.625-~m pitch (contacted) 0 IEEE. Reprinted, with permiasion, from thc /EEEJortr.i?alof Metal 3 2.0-pm AICu, 7.5-pm pitch SolidStute Circuils, volumc 27, number 11, pages 1555 to 1567, (contacted) November 1992.
Digital Tecbnicnl Jourrrnl Vol. 4 No. 4 Sl,ecinl lssi~e1992 3 5 Alpha AXP Architecture and Systems
too large in comparison to the benefits provided- I-CACHE principally more area-efficient large drivers such as clock and I/O. The metal structure is designed to support -F the high operating frequency of the chip. Metal 3 E-BOX - - F-BOX is very thick ancl has a relatively large pitch. It is important at these speeds to have a low-resis- - I-BOX tance metal layer available for power ant1 clock ttl t t l~ BIU distribution. It is also used for a small set of special - FRF signal wires such as the clata buses to the pins and the control wires for the two shifters. Metal 1 t A ancl metal 2 are maintained at close to their maxi- mum thickness by planarization and by filling metal 7 7
1 and metal 2 contacts with tungsten plugs. This A- A-BOX removes a potential weak spot in the electromi- gration characteristics of the process and allows more freedom in the design without compromising reliability.
Alpha AXP Architecture The computer architecture implemented is a 64-bit Figt~~-c?I CPU Cbip Block. Dingizlm load/store RISC architecture with 168 instructions, a11 32 bits wick.' Supportecl data types include E-box), the floating-point unit (FRF plus F-box), the 8-,16-, 32-, and 64-bit integers ancl both Digital and loacl/store unit (A-box), and the branch unit (dis- IEEE 32- and 64-bit floating-point formats. Each of tributecl). T'he bus interface unit (UIU), described in the two register files, integer and floating point, the next section, handles all communication contains 32 entries of (54 bits with one entry in each between the chip and external components. The being a hardwired zero. The program counter and microphotograph (Figure 2) shows the boundaries virtual address are 64 bits. Implementations can of the major functional units. The dual-issue rules subset the virtual atltlress size, but are required to are a direct consequence of the register file ports, check the fi1I1 64-bit adclress for sign extension. tl~efunctional units, ancl the I-cache interface. The This ensures that when later implementations integer register file (IM) has two read ports and one choose to support a larger virtual address, pro- write port dedicated to the integer unit, and two grams will still run and not find addresses that have read anrl one write port shared between the branch dirty bits in the previously "unused" bits. unit and the load/store unit. The floating-point reg- The architecture is designed to support high- ister file (FRF) has two read ports and one write speed multi-issue implementations. ?b this end the port dedicated to the floating unit, and one read architecture does not include condition codes, and one write port sharecl between the branch unit instructions with fixed source or destination regis- and the load/store unit. This leads to dual-issue ters, or byte writes of any kincl (byte operations are rules that are quite general: supported by extract ant1 merge instructions Any loacl/store in parallel with any operate within the CPU itself). Also there are no first-gener- ation artifacts that are optimized around tod;iy's An integer operate in parallel with a floating technology, which would represent a long-term lia- operate bility to the architecture. A floating operate and a floating branch
Chip Microarchitecture An integer operate and an integer branch The block diagram (Figure 1) shows the major func- tional blocks and their interconnecting buses, most except that integer store and floating operate and of which are 64 bits wicle. The chip iniplements floating store and integer operate are disallowecl as four functional units: the integer unit (IRF plus pairs.
36 W)1. 4 No. 4 S~ecidlssue1992 Digilal Tecbrrical Jozrrrzal A 200-MHz 64-bitDual-issue CMOS iMicro~~rocessor
CLOCK
Figure 2 ~klicrophotogrl,hof Chip
As shown in Figure 3a, the integer pipeline is subroutine return stack. The architecture provides 7 stages deep, where each stage is a 5-nanosecond a convention for compilers to predict branch deci- (ns) clock cycle. The first four stages are associated sions ancl destination adtlresses, inclutling those for with instruction fetching, dccotling, and score- register indirect jumps. The penalty for branch mis- board checking of operands. Pipeline stages 0 predict is four cj~cles. through 3 can be stalled. Beyond 3, however, all The floating-point unit is a fi~llypipelined 64-bit pipeline stages advance every cycle. Most arith- floating-point processor that supports both vioc metic and logic unit (ALU) operations complete in standard and IEEE stantlarcl data types and rouncling cycle 4, allowing single-cycle latency, with the modes. It can generate a 64-bit result every cycle shifter being the exception. Primary cache accesses for all operations except divide. As shown in Figi~re complete in cycle 6, so cache 1;ltency is three cycles. 3b, the floating-point pipeline is identical and The chip will do hits under misses to the primary mostly shared with the integer pipeline in stages 0 D-cache. through 3; however, the execution phase is three The I-stream is based on autonomous prefetch- cycles longer. All operations, 32- and 64-bit (except ing in cycles 0 and 1 with the final resolution of divide) have the same timing. Divide is handled by a I-cache hit not occurring until cycle 5. The nonpipelined, single bit per cycle, dedicated divide prefetcher includes a branch history table and a unit.
Digital Techtrical Jourt~al W1. 4 /\lo. 4 .S/)eciullssue 1992 37 Alpha AXPArchitecture and Systems
0 2 4 5 6 IF 10 A1 A2 W R
CACHE DECODE ALUl WRITE ACCESS 4 4 4 SWAP ISSUE ALU2 PREDICT RF READ
PCGEN ITB I-CACHE HITIMISS
VAGEN DTB D-CACHE HITIMISS
(0) Integer Unit Pipeline Tirning
0 1 2 9 4 5 6 7 8 9 IF SW 10 I1 F1 F2 F3 F4 F5 FWR '4 BY PASS I I CACHE DECODE ADD LID SHIFT ADDIRND FRF WRITE ACCESS 3X MULl MUL2 ADDJRND FRF WRITE SWAP ISSUE PREDICT RF READ
KEY: PC GEN GENERATE NEW PROGRAM COUNTER VALUE VA GEN GENERATE NEW VIRTUAL ADDRESS ITB INSTRUCTION TRANSLATION BUFFER DTB DATA TRANSLATION BUFFER
Figure 3 Pipeline Tinzi~zg
In cycle 4, the register file data is formatted to pages and 4 that map 4-megabyte (MB) pages. l'he fraction, exponent, antl sign. In the first-stage data tr;~nslationbuffer contains 32 entries that can :rclder; exponent difference is calculated and ;I 3 x map SKU, 64~~,512KB, or ~MBpages. ~n~lltiplicandis generated for multiplies. In aclcli- T11e CPLI supports performance measurement tion, a predictive leading 1 or 0 detector using with two counters that accumulate system events the input operands is initiated for use in resillt nor- on the chip such as dual-issue cycles and cache malization. In cycles 5 and 6, for add/subtract, misses or external events through two dedicatetl alignlnent or normalization shift and sticky-bit cal- pins that are s;lmpled at the selected system clock cul:ition are perfornietl. For both single- ant1 tlou- speetl. ble-precision mi~ltiplication,the multiply is done in ;I radix-8 pipelined array multiplier. In cycles 7 and External Interface 8, the final acldition and rounding are performed in The external interface (Figure 4) is designed to parallel and the final result is selected and driven directly support an off-chip backup cache that can back to the register file in cycle 9. With an ;tllowed range in size from 128KB to 8h4B and can be byp;~ssof the register write data, floating-point constructed from ordinary SRAMs. For most opera- latency is six cycles. tions, the CPU chip accesses the cache directly 'The <:I'IJ cont;rins all the hardware necessary to in ;I combinatorial loop by presenting an address support a tlemand paged virtual memory system. It antl waiting N CPlJ cycles for control, tag, ant1 tlat:~ inclutles two translation buffers to cache virtu;il-to- to appear, where N is a mode-programmablt:nibl num- physic;~laddress translation. The instruction trans- ber between 3 and 16 set at power-up time. For lation buffer contains 12 entries, 8 that map 8KR writes, both the total number of cycles and the adr-h<33:5> 4 RAM-cll sys-RAM-ctl -1 -;-1-I I I I SYSTEM DEPENDENT LOGIC 1 . I I
I RAM CPU MEMORY CHIP SYSTEM INTERFACE
osc<2> (400 MHz)
Fi'q~1r.e4 CPU Exterrzal I~zter;firce t1ur;ltion and position of the write signal are the 16-byte pin bus. In &-bit mode, it is four cycles. progranimable in units of CIJU cycles. This allows This yields a maxiniuni backup cache read bantl- the module designer to select the size and access width of 1.2 gigabytes per second ((;B/s) antl a write time of the SkL\ls to match the tlesirecl price/ bandwidth of 711MH/s. Internal cache lines can performance point. be invalidated at the rate of one line per cycle The interface is designed to allow all cache pol- using tlie dedicated invalidate address pins, icy clecisions to be controllecl by logic external to iAdr-11<12:5>. the c:l'ri chip. There are tliree control bits associ- In the event extern;~lintervention is requirecl, a i~teclwith e;~chbackup cache (B-c;~che)line: valid, request code is presented by the CDU chip to the shared, and dirty. The chip completes a B-cache external state machine in the time domain of the re;ld ;IS long as valid is true. A write is processetl by SYS-CLK as describecl pre\~iousl)! Figure 5 shows the <:l'lJ only if valid is true and sliarecl is false. the read miss timing where each cycle is a SYS-CLK When ;I write is performed, the dirty bit is set to cycle. The external transaction starts with the true. In ;ill other cases, the chip defers to an exter- address, the quadword within block and instruc- nal state machine to complete the transaction. This tion/data indication si~ppliedon the cW>Iask-11 stare m;~chineoperates synchronously with the pins, ant1 REAII-I3I.OCK function supplied on the S\'s_(;l.l