<<

Editorial Jane C. Blake, Editor Kathleen M. Stetson, Associate Editor Helen L. Patterson, Associate Editor Circulation Catherine M. Phillips, Administrator Sherry L. Gonzalez Production Terri Autieri, Production Editor Anne S. Katzeff, ppographer Peter R. Woodbury, Illustrator Advisory Board Samuel H. Fuller, Chairman Richard W Beane Richard J. Hollingsworth Alan G. Nemeth Victor A. Vyssotsky Gayn B. Winters The Digital TechnicalJournal is published quarterly by Digital Equipment Corporation, 146 Main Street MLO 1-3/B68,Maynard, Massachusetts 01754-2571. Subscriptions to the Journal are $40.00 for four issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital TechnicalJournal at the published-by address. Inquiries can also be sent electronically to [email protected]. Single copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 1 Burlington Woods Drive, Burlington, MA 01830-4597 Digital employees may send subscription orders on the ENET to RDVAX::JOURNAL or by interoffice mail to mailstop ML01-3/B68. Orders should include badge number, site location code, and address. All employees must advise of changes of address. Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address. Copyright O 1992 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved. The information in the Journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the Journal. ISSN 0898-901X Documentation Number EY-J884E-DP The following are tradeinarks of Digital Equipment Corporation: Alpha AXP, DEC, DECchip 21064, DECstation, DECwindows, Digital, the Digital logo, KA50, KA52, KA675, KA680, KA690, MicroV!, MS44, MS670, MS690, Q-, Q22-bus, Thinwire, TURBOchannel, lXI'RM, VAX, VAX-11/780, VAX 4000, VAX 6000, VAX 7000, VAX 10000, VAXcluster, VAX MACRO, VAXstation, VMS, and VXT 2000. is a trademark and PAL is a registered trademark of Cover Design Advanced Micro Devices, Inc. The WRXmicroprocessor is Digital'sfastest VRX implementation SPEC, SPECfp, SPECint, and SPECmark are registered trademarks of and the common theme ofpapers in this issue. Our cover graphic the Standard Performance Evaluation Cooperative. joins the WHXproject code name with an image of speed - the SPICE is a trademark of the University of California at Berkeley. chip's most salient characteristic and theperjomzance advantage TPC and tpsA-local are trademarks of the Transaction it brings to a range of new VAxsystm. Processing Performance Council. The cover design is by Deb Anderson of Quantic Communications,Znc. Book production was done by Quantic Communications,Inc. Contents

9 Foreiuord Robert id. Supnik

NVAX-microprocessorVAX Systems

I I The NVAX and NVAX+ High-perfomname VAX C;. Michael Uhler, Debra Bernstein, Larry L. Biro, John E Brown 111, Jolui H. Eclrnondson,Jeffrey D. Pickholtz, and Rebecca L. Stamnl

24 The NVAX CPU Chip: Design Challenges, Methods, and CAD Tools Dale R. Donchin, Timothy C. Fischer, Thomas E Fox, Victor Peng. Ronald l? Preston, and William R. Wheeler

38 Logical Verificationof the NVAX CPU Chip Design Walker Anderson

47 The VRX 6000 Model 600 Lawrence Chisvin, Gregg A. Boucliard, and Thomas M. Wenners

GO Design ofthe VAX 4000 Mode1400, 500, and 600 Systm~s Jonathan C. Crowell, Kwong-Talc A. Chui, Thomas E. Kopec, Samyojita A. Naclkarni, ancl Dean A. Sovie

73 The Design of the VAX 4000 Model 100 and MicroVAX 3100 Model 90 Desktop Systems Jonathan C. Crowell and David W Maruska

82 The VAXstation4000 Model 90 Michael A. Callander, SK,Lauren M. Carlson, Anclrew R. Ladd, and Mitchell 0. Norcross

92 VAX 6000 Error Handling: A Pragmntic Approach Bri;in I'orter I Editor's Introduction

paper on the new mid-r;~ngcVtjX 6000 niultipro- cessing s)lstem, Larry Chisvin, Gregg Bouchartl. and Tom Wenners explain the module design clecisions that s~~pportedthe goals of 6000-series compatibil- ity anrl time to market. Of particular interest are the sclieclule and performance benefits derived from developing a routing and control interface chip. The engineers for new low-end desksicle systems also chose to develop custom chips-a menlory controller chip, memory module, and an I/() con- troller Jon Crowell, Kwong Chui, Tom Kopec, Sam Nadkarni, ant1 Dean Sovie discuss the chip filnc- Jane C. Blake tions tliat were key to exceeding the perh)rrn;~nce Editor goal of three times that of the previous !4000. For tlie low-encl VXX 4000 Moclel 100 system anel The WM is a high-performance, the MicroVla 3100 desktop servers, designers saveel single-chip implementation of the VAX architecture. significant time by "borrowing" existing compo- It is today's fastest Vru; microprocessor and the Cl'L1 nents from proven systems. Jon Crowell ant1 I>ave at the hart of the mid-range, low-end, and work- Maruska relate decisions that allowed them to dou- station systems described in this issue of the Digital ble performance and complete the work within the Technic~zlJourncll. extraorclinarily short time of nine months. The WtLY chip is not only fast, with cycle times as The newest Vastation , b;~sedon low as 11 ns, but also holds a unique position in the WAX, is the Model 90. lMike Callander, L;u~ren Digital family of microprocessors: NVAX is both an Carlson, Andy Ladd, and Mitch Norcross present upgrade path for existing VAX systems and a migra- their design methodology. Most significant for tion path to Alpha &YP systems. In their paper on clevelopment was the tlecision to implement new the WAS anel WAX+ chips, Mike Uliler, Debra logic in programmable technology, which allowed Bernstein, Larry Biro. John Brown, John Edmond- bug fixes in minutes rather than weeks. son, Jeff Picklioltz, ancl Rebecca Stamm present an Not ;tbout system design but rather error han- overview of the complex microprocessor tlesigns dling in 6000 systems, Brian Porter's paper and relate how RlSC techniques are used in this CIS<: describes an approach tliat reduces the aniount of machine to achieve dramatic increases in perfor- unique coding traclitionally required for error han- mance over previous implementations. dling. He details the development of sophisticatecl Increases in performrlnce are also attributable to error 11;1ntlling routines tliat ;~ccornmodate the the CMOS-4 0.75-micrometer technology in complexity of the symmetric VAS which the WAX is implemented. In their paper 6000 models. about the verification of the physical design, Dale The editors thank Mike llhler of the Semi- Donchin, Tim Fischer, Frank Fox, Victor Peng, Ron contluctor Engineering C;roi~p,who ensuretl tli;rt I-'reston, and Hill Wheeler describe tlie methods and the standarcls of excellence ;rpplietl to NVN; TI issues will continue to be refereed so bugs existed in the design. Highlighting major that wc may offer engineering ant1 academic read- strategies, he reviews the behavioral moclels and ers informative and relevant teclinical cliscussions. pse11dor;rndoni exercisers at the core of the verifi- cation effort. Each system design team chose a different approacli to take advantage of WAX performance and to meet system-specific requirements. In a Biographies I

Walker Anderson Principal engineer Walker Anderson is a member of the Models, Tools, ancl Verification Group in the Semiconductor Engineering Group. Currently a co-leader of the logical verification team for a fim~rechip design, he led the NVAX logical verification effort. Before joining Digital in 1988, Walker was a diagnostic and testability engineer in a CPU development group at Corporation for eight years. He holds a B.S.E.E. (1980) from Cornell University and an M.B.A. (1985) from Boston Universitj!

Debra Bernstein Debra Bernstein is a consultant engineer in the Semi- conductor Engineering Group. She worked 011 the CI'U design and architect~~re for the NVkY ~nicroprocessorand the VAx 8700/8800 systems and is currently co-architecture leader for a future Alpha AXP processor. Deb received a B.S. in (1982, cum laude) from the University of Massachusetts in hnherst. She holds one patent, has two patent applications pending, and has coauthored several technical papers.

Larry L. Biro Larry Biro joined the Electronic Storage Development (ESD) Group in 1983, after receiving an M.S.E.E. from Rensselaer Polytechnic Institute. While in ESD, he contributed to the advanced development of solid-state disk

I products. Larry. ,ioined the Semiconductor Engineering- Grou[> as a custom cir- cuit designer on the NVAX E-box. Later, as a member of the NVAX+ chip imple- mentation team, Larry designed the clock and reset logic, coortlinated back-end verification efforts, and co-led the chip debugging. Currently, he is the project -- leader for a future, single-chip ViLv implementation.

Gregg A. Bouchard Gregg Bouchard is a senior hardware engineer with the Scniiconductor Engineering Group. His current responsibilities include the mod- ule design of a DECchip 21064 daughter card that contains a CPIJ-to-businterface. Previously, Greg worked on chip design of the WAX-to-KVI bus interface for the VAX 6000 Model 600, and field programmable chips for the vmstation 4000 Model 90. Gregg joined Digital in 1986 after receiving his B.S.E.E. from the Rochester Institute of Technology. He also holds an M.S.E.E. from Northeastern University and has a patent pending related to hardware queue structure. John F. Brown After receiving an 3.I.S.E.E. from Cor~lellUniversity in 1980, John Brown joined the engineering staff ;it Digital. At present, he is a consultant engineer working on Alph;~AS]' niicroprocessor advanced development. John's previous responsibilities inclutle ni;inaging the tlesign of tlie instruction decotle section of tlie WAX micropl.ocessoc He ;ilso made technical contributions to the Viu( 6000 Model 200 ancl 400 chip sets, ;ulcl w:rs Iiardware engineer h)r the extended flo:~ting-pointcn1~;incernent to the VU-11/780 system. John Iiolcls two patents antl has seven iipplic;~tionspending.

Michael A. Callander, Sr. Michael Callantler is a principal engineer in Iligital's Semiconductor Engineering <;roup. Mike was tlie technical leader for the VMstation 4000 Motlel 90 system. His previous experience with I>igit:il includes tlesign :rncl nrcI1itcctur;rl specification for various CI'U motlules ;mcl sys- tems, incl~idingtlie VfiS 8200 ;ind tlic Vr\S 6000 kloclel 400 and Motlel 500, Mike received his B.S.E.E. from the Ilniversity of iMassacbi~settsin 1982 ;incl joinetl Digital upon graduation. He has nuthorecl several technical papers ;incl hiis ;I number of patent ;~pplic;ttionspentling.

Lauren M. Carlson A senior hnrtlware engineer in tlie Semicontluctor Engi- neering Group, L;~iiren(:;~rlson is cut-rently working on the design of a periph- eral chip set for 21 new microprocessor. I'reviously, she tlesigned the Vt\\Sstr~tion 4000 Model 90 <:DAI.-to-EI)Al.:idiipter chip (CEAC;) gate arra): mihich is part of thc I/O subsystem. 1.aurt.11 :ilso contributetl to the design of the VAXstation 4000 Moclel 90 system module and ;~notlier\'AX system CPIJ module. I'rior to this. Lauren worketl in the Aclvancetl VAX I>evelopment Group. She receivetl her B.S.E.E. from Worcester Pol!rtechnic Institute in 1986 and joinetl Digital in 1087

Lawrence Chisvin A l>rincip;~lh;~rclw;~re engineer in tlie Semiconcluctor Engineeri~igGroup, L;~rr!,<;hisvin is in\~olvedin the tlesign of motlulcs ;mtl systems based on the I,E(:cIiip 21064 microprocessor. Larry also provicles tech- nical support for customers, including 1Upli;i hS1' architecture presentations and example designs ant1 :ipplic;~tionnotes. Previously he workecl on ~,rocessor ant1 memory motlules for the Vr\X 6000 Motlel 600. He holtls a H.S.E.E. (summa cum laude) fro111 North~:~ster~iIiniversity i11icI an M.S.E.E. from Worcester Polytechnic Institute. He is ;I member of the lEEE Computer Socjety and the ACM.

Kwong-Tak A. Chui Kwong-Tak <:l~uiis a principal liarclware engineer in tlic Semiconcluctor Engineering Group. He is working on the design of the I/O con- troller section of a new <:1'11. Kwong was the project leader for the 1/0 controller chip of the NVhY chip set ilsecl in the \X 4000 Moclcl400, 500, and 600 systcnis. Since joining Digital in 1985. he h;is worl

Dale Donchin Dale Donchin manages schematic enti-). ancl layout verifica- tion CAD tool development in the Semiconductor Engineering Group. He facili- tated the use of CAD tools for NVKX design, primarily for layout and pli)aical chip verification. Dale is presently pcrformlng in a similar capacity for a new micro- based on the Alpha AXP architecture. He joined Digital in 1978, and was previously a clevelopinent manager in the RSX group. Dale holtls a B.S.E.E. (1976, honors) and an M.S.E.E. (1978) from Rutgers University College of Engineering and is a member of IEEE and ACM.

John H. Edmondson John Edmondson is a principal engineer in the Semi- conductor Engineering Group. At present, he is co-architect of a firture RIS<; microprocessor. Previous to this, lie was a member of the V!6000 Model 600 CPU chip design team. Before joining Digital in 1987, John designed mini- compilters for five years at Canaan Computer Corporation. He also worked at Massachusetts General Hospital for two years, researching applications of tech- nology to anesthesia and intensive care medicine. John received a B.S.E.E froni '(A. the Massachusetts Institute of Technology in 1979.

Timothy C. Fischer Tim Fischer is a senior hardware engineer with the Seliii- conductor Engineering Group. He is working on tlie design of a floating-point unit hr 21 future high-performance microprocessor basecl on the Alpha AXP architecture. Prior to this work. Tim was a member of the E-box design team and contributed to the design of global clock generation and distribution for the NVAX microprocessor. He also worked on the tlesign of the bus interface unit on the NVAX+ chip. Tim joined Digital in 1989 after receiving his M.S. in computer engineering from the University of Cincinnati.

Thomas F. Fox Frank Fox, a consulting engineer in the Semiconcluctor Engi- neering Group, co-led the implementation of the NVAX microprocessor ancl con- sulted with the Advanced Semiconductor Development Group on the tlesign of the ChllloS-4 technology. He joinecl Digital in 1984 and worked on the implemen- tation of the CVAX microprocessor. Frank received a B.E. degree from University College Cork, National University of Ireland (1974), and a Ph.D. degree from Trinity College, Dublin University (1978), both in electrical engineering. He has published papers on ultrasonic instrumentation, MRI scanners, and VLSI design. Thomas E. Kopec Principal engineer Tom Kopec was ;I n1enihe1-of tlie Entry Systems Business Group that designed the \JhS 4000 Nlodels 200 throirgli 600, ;lnd the Micro\$= 3500 and 3800 systems. He also led an Alpha ~?

Andrew R. Ladd Andy Ladd is a principal engineer 111 tlie Semiconclirctor E~aneeringGroup and is the project leacler for tlie VASst;ltion 4000 Nlotlel 90 (:PIT ant1 low-cost graphics module designs. Previously, lie providetl timing verifi- cation support for the DECchip 21064 and co-designecl two 1x1sinterkice chips for the ViU( 6000 Motlel400 CPU moclule. iltitly joinecl1)igit;ll in 1986. Me receivetl his 1j.S. in compirter engineering from the University of Illinois (1984) ant1 his M S in computer science and engineering from tlie University of Michigan (1991). ilntly is ;I member of the IEEE Computer Society, XILIBeta Pi, and Eta K;IPP;I Nu.

David W. Maruska Principal engineer Davitl Marusk;i is a member of tlie Entry Systems Business Group and is presently involved in tlie design of the next generation of \irUc 4000 CPUs. He was the lead designer h)r the KT\%);~t~d &I52 (:l'lls ant1 project leacler for the \'AX 4000 Model 200 system, tlie K%QSr\ Q-bus-to- S<:Sl ;~d;~pter,;lnd the Futi~rebus+exerciser. Dwe jojnecl Digital in 1982, after receiving a B.S. in computer engineering from Boston Universitj.. Me workeel on graphics \vorkstations for Mos;~icTechnologies and li;~sterT'echnologies from 198.3 to 1985 and then retilrnetl to Digital in 1986.

Samyojita A. Nadkarni Sam hradkarni, :I princijx~lIi;~~-clw;~re engineer in tlie Seniiconductor Engineering C;roup, is currently the project 1e;lder for a set of support chips for the DE<:chip 21064 processor She w;~sthe project le;ltler for the I\rMC chip ilsetl in the V!X 4000 Model 400, 500, ancl 600 syhtems. She ;~lso worked on memory con~roller/busatl;lpter cliips for the U\S 1000 ,Moclel $00 anel MicroVAS 3500 systems. Sam joined Digital in 1985 ;~ndlioltls ;I Il;~cliclorof Technology (1983) from tlie Indian Institute of 'l'echnology ;uitl an ,All S (1985) from Rensselaer Polytechnic Institute.

Mitchell 0. Norcross Senior engineer Mitch Norcross h;is been clesignjng ;~nclan;~lyzing digital subsystems since he joined I>igit;rl in 1986. A:, ;I member of the Semiconductor Engineering Group, Mitch designecl a g;lte ;lrr;ly for the Vi\Xst;~tion4000 Motlel 90 and contributed to the design ;rncl ;~n;ilysisof sever;~l system and <;PIJ modules. Prior to joining SE<;, Mitch designed ;I g;lte ;Irr;ly for Digital's first fault-tolerant VXY system, the VAXft 3000. Me received ;I l1.E. in elec- trical engineering (1985) and an h1.S. in computer engineering (l987), both I'rom I\lanh;~tt;~nCollege. Mitcli holds one patent 011 fault-toler;~ntsystem design. Victor Peng Victor Peng is a consultant engineer in the Semiconductor Engineering Group. He received his B.S. in electrical engineering from Rensselaer Polytechnic Institute in 1981, and his M.S. in electrical engineering from Cornell University in 1982. Victor joined Digital in 1982. He was a member of the design teain for the VAX 8200/8300 memory interface chip and patchable chip. He led the implementation of the VAX 6000 Model 400 float- ing-point chip and was co-manager of the NV/LY chip design team.

Jeffrey D. Pickholtz Jeffrey Pickholtz received an A.S. (1977) in specialized technology from Penn Technical Institute and a B.S.E.E.T. (1989) from Central New England College. He joined Digital in 1977 as a technician and worked on a variety of midrange computers. More recentl): his responsibilities have included technical contributions to the VAX 6000 Models 400 and 600, and VAX 7000 Model 600 chip sets. Currently, Jeff is a senior engineer in the Semiconductor Engineering Group, leading the implementation of a future CMOS microprocessor chip.

Brian Porter As a consulting software engineer in the Systems Group of VMS Development, Brian Porter was responsible for CPU error handling in the VAX 6000 family. Prior to this work, he was responsible for support of VAX systems and was an author and maintainer of the VMS error log utility SYE. Brian is the author of the original VMS striping driver, which was later developed by others into the VMS striping driver product. He currently works in the Executive Group of VMS Development and is responsible for symmetric multiprocessing. He has two patents pending on memory error handling. Brian joined Digital in 1973.

Ronald P. Preston Ronald Preston is a principal engineer in the Semi- conductor Engineering Group. Since joining Digital in 1988, he has worked on the design of several microprocessors. Ron was the circuit design and imple- mentation leader for the E-box on the NVAX microprocessor. He is currently designing the instruction issue logic for a superscalar RlSC microprocessor. Prior to joining Digital, he worked on the design of CMOS for Signetics Corporation. Ron received his B.S.E.E. in 1984 and his M.S.E.E. in 1988 from Rensselaer Polytechnic Institute. He is a member of Eta Kappa Nu and IEEE.

Dean A. Sovie Design engineer Dean Sovie joined Digital in 1981 and is a member of the Electronic Storage Development Group. He is currently involved in the battery backup laser memory design. Prior to this work, he contributed to the design of the high-performance memory module used in the Vhu 4000 Motlel 400, 500, and 600 systems. Dean also helped design memories for the VhY 6000, VAX 4000 Model 200, and POP-11 systems. In addition to his present responsibili- ties, he is pursuing a degree in electrical engineering at Northeastern University. Rebecca L. Stamm Rebecca Stamm is a principal hardware engineer in the Semiconductor Engineering Group. She led the design of the backup , the bus interface, and the pin bus for the NVAX CPU chip. Rebecca then led the chip debug team from tapeout through final release of the design to manufacturing. Since joining Digital in 1983, Rebecca has been engaged in microprocessor archi- tecture and design. She holds two patents and has seven applications pending. Rebecca received a B.A. in history from Swarthmore College ancl a B.S.E.E. from the Massachusetts Institute of Teclinolog)~

G. Michael Uhler Michael Ubler is a senior consultant engineer in the Semi- concluctor Engineering Group, where he leads the advanced development effort for a new high-performance microprocessor. As chief architect for the NVAX and REX520 microprocessors, Mike was responsible for the CPU architecture, perfor- mance evaluation, behavioral modeling, CPU , and CPU and system debug. He received a B.S.E.E. (1975) and an M.S.C.S. (1977) from the University of Arizona and joined Digital in 1978. Mike is a member of IEEE, ACM, Tau Beta Pi, and Phi Kappa Phi and holcls eight patents.

Thomas M. Wenners Thomas Wenners is a senior hardware engineer in the Serniconductor Engineering Group. He is responsible for the design of a next- generation Vm workstation CPU moclule and for CPU module designs of DECchip 21064 microprocessor-based products. Tom's previous work inclutles the mod- ule design of the \ii\X 6000 Model 600, module design and signal integrity sup- port on ESB products, and analysis and evaluation of advanced chip and module packaging. Tom joined Digital in 1985. He received a B.S.E.E. (1985, cum Iaude) and an M.S.E.E. (1990) from Northeastern University.

William R. Wheeler A principal engineer in the Semicontluctor Engineering Group, Bill Wheeler performed architectural definition work for the NVAX microprocessor and was project leader of the NVAX M-box. Prior to this work, he designed the I-box unit for the third-generation VLSI VAX implementation. He is currently involved in advanced development work for improving custom design tools and methods. Before joining Digital in 1983, Bill received both his B.S.E.E. (198'2) and his M.S.E.E. (1983) from Cornell University. He has coauthorecl three papers on vLSI designs and holds three patents. of unparalleled performance and quality. The results speak for themselves. From its initial shipment in October 1991 through today (a year later), Wh,Y was (and is) tlie fastest shipping CISC microprocessor in the world, whether measured by , SPECmarks, or transactions per second. NVAX had fewer bugs after design completion, and went from tape-out to production more quickly than any microprocessor in Digital's history. Robert M. Supnik NVAX systems, spanning the range from work- Corporate Consultant, Vice President station through mainframe, all shipped on or Technical Directol; ahead of schedule, meeting or exceeding pre- dicted performance. Alpha AXP and VAX Systems An outstanding engineering achievement indeecl! The roots of NVAX can be traced back a decade to two distinct engineering programs: the High-end Systems Group's studies and implementations of If, as the popular saying goes, "Once is happen- highly pipelinetl VAX systems; and the Semicon- stance, twice is coincidence, three times is con- ductor Operations Group's projects in process certed action," then four consecutive instances of development and microprocessor tlesign. outstanding engineering achievement must be even The High-end Systems Group started work on more significant. highly parallel VAX systems in 1979, designing and Since 1985, Digital has designed, developed, and building the vAX 8600-the first VAX to include shipped four generations of leadership VAX micro- overlapped operand tlecotling (see tlie Digital processors and CMOS-based systems: TechniccilJournal, August 1985) At the same time, In 1985, the MicroV,%x chip and resulting systems a research team tlescribetl HyperVAX, a hypotheti- (such as the MicroVAX I1 and tlie Vmstation cal fully pipelined design Although HyperVU was 2000) never built, its had a strolig influ- ence on the design of the VAX 9000, Digital's ECL In 1987, the CVAX chip and resulting systems mainframe (see the Digital Technical Journal, Fall (such as the MicroVAX 3800, the VAX 6000-200, 1990). And the microarchitecture of the VAX 9000, and the VAXstation 3100) in turn, was the basis for WAX. In 1989, the chip and resulting systems The Semiconductor Operations Group also (such as the VAX 4000-300, the VAX 6000-400, started work in 1979, formulating a multiyear pro- and the VAXstation 4000-60) gram for the development of both semiconductor process technology and leading-edge nlicroproces- In 1991, the NVkV chip and resulting systems sors This program spanned the years 1983 to 1987 (such as the VhY 4000-500, the VAX 6000-600, and encompassed the development of the V-11, ant1 the VkYstation 4000-90) MicroVA>( 11, and CVAX microprocessors In 1986, The first three were described in the Digital the plan was extended through 1991, encompassing Technical Jozrrnal issues of March 1986, August the development of Rigel, Mariah (a Rigel variant), 1988, ancl Spring 1990, respectively; the last is the and a fourth-generation WIV&Y code-named NV.. subject of this issue. The goals for NVAX were ambitious. First, its WAX and its systems are the culmination of targeted performance was more than 25 times everything Digital and its engineers have learned faster than the VAX-11/780 (more than 10 times about chip and system design over the last decade. faster than the just-introduced CVm chip), requir- The teams involved drew on many disciplines of ing significant improvements in both microar- hardware engineering, from microarchitecture to chitectural efficiency and in cycle time Seconcl, whole-system verification, to produce products the chip development scliedule coincided with the semiconductor process development scheclule, tion, CAD development focused on improvements requiring breakthroughs in concurrent develop- to productivity ant1 accuracy through design syn- ment of product and process. And tliird, the time thesis, electromigration analysis, and capacitance allottccl from chip design conlpletion to system ancl resistance extraction. shipment was tlie shortest in Digital's history, The size and complexity of the design, as well as requiring unprecedented accuracy in chip and the stringent schedule constraints, also dictated an system design and verification. early start on verification issues. The verification As in past projects, work in various disciplines- strategy formed an integral part of the design effort seniiconcluctor process developnient, chip micro- from the outset. The verification team developed architecture and circuit design, ~nicroprocessor tools and strategies for verifying tlie microarchitec- design tools, chip anel system verification tools, ture, the n~icrocode,the logic, the circuits, the chip and system design-cascaded from process as a component, and the chip in a system. through systems. First to start was a team from Lastly, the various system groups-data center Advancecl Semiconcluctor Development (ASD), systems, office systems, -began which designed, simulated, and introduced into designing systems to utilize the JW~LVchip's capabil- manufacturing CMOS-4, Digital's fourth generation ities. Each group was able to build on the work of ClLlOS technology (see the Digit611 R~chnicnl done in past VUsystems ancl designed an N\KY- Jourfzal,Spring 1992). Builcling on prior technology based system that functioned both as an upgrade of generations, <;,MOS-~contained many features- past systems and as a formidably competitive new three layers of metal intercot~nect,salicide, preci- system in its own right. sion resistors, local interconnect, deep diffusion The work of these project teams dovetailed per- ring-which directly supported the performance fectly. Wacompleted design and taped-out in late requirements of WAX. In addition, ASD and Semi- November 1990, just as the CivlOS-4 process was conductor Manufacti~ring pioneered new tech- ready for chip prototyping. Due to the outstancling niques for process transfer and qualification which work of the chip design, system design, Cm,and dramatically shortmeel the time requirecl to debug verification teams, first-pass parts booted the \'Ms ancl qualify the CMOS-4process. operating system at speed in early March 1991. The In parallel, a clesign team from the Semicon- process team qualified CMOS-4 in October 1991, and ductor Engineering Microprocessor Group initi- systems using second-pass parts shipped for rev- ated microarchitectural and circuit studies. The enue that same month-three months ahead of team starteel with the VtM' 9000, but they quickly schedule-with performance significantly greater discovered that the difference in implementation than the original goal. media (n~ultichipECL gate arrays for the VAX 9000, Clearly, the outstanding results from all the NVa single-chip custoln C~L~OSfor NVA>o required sig- engineering projects are neither happenstance nificant changes anel new concepts. The micro- nor coincidence; rather, they represent concerted architectilre sub-team ~~sedabstract anel detailed action-team excellence and individual brilliance- performance models, studies fro111 existing V! at its finest. Hundreds of people contributed to the systenis, and expcricnce with past designs to drive outcome. This issue of the Digital Ethnical quantitative decisioiis about features ;und functions Jozit.~znlis their story. in NVU. At the snnlc time, the circuit sub-teanlfor- mulatecl the overall design, circuit, ant1 clocking methodologies for the chip anel established the feasibility of the tqet cycle time, chip size, ancl lay- out floor plan. As the microarchitectural concepts solidified, the clesign team realized that NVA?< woulcl be the largest ant1 ~iiostcomplex chip ever desig~ieclat Digital, and that it would place unprecedented stress on the capabilities of both designers ant1 design tools. Accordingly, they initiated a partner- ship with the Setniconductor Engineering CN) Group to iillprove current tools and to clevelop new tools. In addition to tratlitional areas like simula- G. Michael Uhler Debra Bemzstein Larry L Biro John E BwnIII John H. Edmorulson Jeffrey D. Pickholtz Rebecca L Stamm

VAX Microprocessors

The IVVM and I\VAX+ CPU chips are higI3-perforinance VM fnzcroprocessors that use tecbnzqzies l~zlditionallynssoci~~ted 2111th RISC nzicroflrotessor designs to dranzati- cally linprove VXXperforrnance.The tu~ocbips)rovide an ztlpgradepathfor existing VXX systems and a migration path from VXX systems to the new Alpha A,V systems. The design evolved throzigbout the project as time-to-market,performance, and conzplexity trade-offs were made Special design features address the issues of deb~~g,~nzciintenance, and analysis.

The NVAX and WAX+ CPUs are high-performance, on the high-leverage design points and on an single-chip microprocessors that implement unprecedented verification effort.5 Digital's VN( architecture.' The WAY chip provides Support for multiple system environments, an upgracle path for existing systems that use the cornpatibility with previous VAX proclucts and previous generation of VAX microprocessors. The systems, and a means to migrate from traditional WAX+ chip is usecl in new systems that support VAX systems to the new Alpha AXP platforms were Digital's DE<:chip 21064 microprocessor, which also important design goals. These goals had a pro- implements the Alpha N(P architecture.'.:' These found impact on the design of the cache protocols two WtU( chips share a basic design. ancl the external bus interfaces. NVAX and Wiuc+ 'The high-performance, complenientary metal- engineers workecl closely with engineers in oxide semiconcluctor (CMOS) process used to Digital's systems groups during the definition of implement both chips allows the application of these operations. pipelining techniques traditionally associated with The paper begins by comparing the basic fea- reduced instruction set computer (RISC) CPUs.< tures of the NVAX and NV'+ chips and then Using these techniques dramatically improves the tlescribes in cletail the chip interfaces and design performance of the NVAX and NVAX+ chips as coni- elements. This description serves as the foundation pared to previous viU( tnicroprocessors and results for the ensuing discussion of design evolution and in performance that approaches and may even trade-offs. The paper concludes with information exceed the performance of popular industry RlSC about the special design features that address the microprocessors. issues of debug, maintenance. and analysis. The chip design evolved throughout the project as the goals influencetl the scheclule, performance, Comparison of the NVAX and complexity trade-offs that were made. The two and WAX+Chips primary design goals were time-to-market, without The WAX and WAX+ chips are identical in many sacrificing qualit): ant1 improvecl VAX CPU respects, differing primarily in external cache and performance. Our internal goal was for the WAX bus support. is intended for systems that use CPU performance to be more than 25 times the previously designed VAX microprocessors. 'The performance of a VAX-11/780 system in a datacenter following systems currently use the IWAX chip: system. Achieving these goals required meeting the VAXstation 4000 Model 70; the Microl~~ aggressive scheclules ancl thus concentrating 3100 Model 70; the VAX 4000 Models 100, 400,

Digital Technical Joctrrzal Vol. .4 No 3 SL~II~I)I~I.192 11 WAX-microprocessorVAX Systenls

500, ;uncI 600; :11itl thc V,\S 6000 Moclel 600.".-5') to l)ro\~iclrhigh perh)rmance, are overlappecl with NVV\Ssupports ;in extern;~l write-l);icl\: ci~chc ;irbitr;~tionfor S~ltilretransactions ancl ;iclE<:chil>21064 nlicroproccssor inll)le~ne~~t;~tjo~itrans;~ctions. Esternal interrupt requests :Ire of the i-\I~,li;i hr;i' ;irchitecturc anel is currently useel receivecl tliroi~glideclic;~tecl lines and arbitratetl by in tlie \I&i 7000 Moclel 600 ;ind the VAS 10000 logic it1 the <:llll. Moclel 600 s!-stems. N\lAS+ supports ;III extern:~l The NIhS+ chip interfi~ces to ;in external cache ;mcl I)i~sprotocol th;it is compatible witli th;~t write-back 13-c;1cheimplemented with tag anel data of the r>Ec:chip 21064 microprocessor. In existing stittic 1I1\Ms on the module through ;I port shared syste~iis.VAS+ is configi~reclto si~pport;11i exter- witli system control logic, as shown in Figure 2. nal write-back c;iche tli;~timplements a conelition:il Iiesponsibility h)r controlling the cache port is writc-update snoopy coherence protocol.11 shared between N\G\X+ ;und the system environ- The two <:Pli chips ~~rovitleboth the tnc;itls to ~iient:the N\'AX+ chip h;lndles the co~iirno~~cases upgr:icle inst;illed VAX sj'stems. thus protecting pre- of re;id hit ;~ndexclusive write, i111cl the system vjorls in\7estrnents,;lncl ;I migr:~tionp:~th from :I \:AX environment provides cache policy control for microprocessor to ;I I>l':(:cliip2IOO.i micsoproces- other events. -The size and speed of the cache can sor in the new Alpha ,\XI' systems. be configill-ed to ;~llo\v:I range of possible system configl~r;itions. Chip I~zter&ces l'he I>l':<:cliip21064 d;11:1ancl ;~cltlresslines (ELI~~L) 'l'he Nl'I\S chip in~erf;icesto ;in estern:~lwrite-lxlck constitute :I cle~nultiplexed,bidirectional bus with c;~clie(H-c;lclie) tliro~~gh;I priv:~te port with t;~,q 29 bits of :~clclress.128 bits of data, i~ndthe associ- :~ncl cl;lt;~static r;intiom-;iccess nicmorics (RhAls) ;~tc.rlcontrol signiils. This 1x1soperates ;it one-1i;llf. on the moclulc, as ~Iiownin Figure 1. 'l'hc size :~ntl one-thil-tl, or one-fourth tlie frequency of the <:pLl slxecl of the c;icIic ;ire progr;imm;ll,le, ;~llo\ving l'rom clocks provideel by tlie <:l-'ll. The speed of the the chip to ;~ccoinmocl;~te;I r;iIige of possible s!.s- s!.stern clocks C;I~be progs;~mmeclto :iccornmo- tern confignrntions. cl;~tev;~rious RAM ;inel stem speecls. At power-up The NVAS cl;~t:~;inel ;~cltlrcsslines (NDl\f.) con- time, initi:tliz:~tion inform:~tio~i,including stitute a 64-bit hidirectio~i:~lextcrn:11 hus witli ;ISSO- timing, ;ind di;~gnostics;Ire loaded fro111 ;I serial ciatecl control sign:~lsth;tt operates ;it one-thircl the re;~cl-onlymemory (ROM) into tlie on-chip cache. frequency of the <;1'11 from clocks providecl by tlie The external interrupt hiindling is similar to that of <:I1(:.Acldresses ;incl clatn ;Ire ti~iie-~iiultil?lrsed;inel. the NM\S chip.

OSCILLATOR

NVAX

NDAL 4 * CONTROL SYSTEM - LOGIC INTERRUPTS SYSTEM CLOCKS

I c / N VAX Chip 11zteiyi~ce Block I)i~i~llwnz

I1 1.i)l. .f No. .$ .Srctrrt~lev19% Digilnl Technical Jocrrnrrl The NVAX and AWAX+ Higb7perfornzul-zcVAX 1Micl-o/~roccs.sor.q

OSCILLATOR CACHE TAG SERIAL RoM :rlriplDATA RAMS

1 DA _ CONTROL SYSTEM * * LOGIC INTERRUPTS SYSTEM CLOCKS

Figure 2 WAX+ Chip I~zterfaceBlock Diugranz

Electrical and Physical Design Architectural Design Process technology, clocking scheme, clock fre- l'he WAX/NVA?(+ design is partitioned into five quency, ancl

Digilal Technical Journal I/ol. 4 r\;o 3 S~r~irl~rer.I992 AROTATOR

MV$UTCHES

Figure -3 Block Diagrnnz of tbe NVAx/WAX+ Core The NVAX and NVAX+ High-performance VAX Microprocessors queues, and the branch-related information into and porn. A population provides the num- the branch queue. ber of bits set in a mask and is used in the CALIS, For operand specifiers other than short literal or CALLG, PUSHR, and POPR instructions. In addition, register mode, the I-box decode logic invokes the microcode can operate the pipelined complex specifier unit (CSU) to compute (ALU) and shifter independently to produce two the effective address and initiate the appropriate computations per cycle, which can significantly memory request to the Mbox. The CSU is similar in improve the parallel operation of the complex function to the load/store ilnit on many traditional instructions. RlsC machines. In addition to riormal instruction processing, the The I-box automatically redirects the program E-box performs all power-up functions and inter- counter (PC) to the target address when it decodes rupt and exception processing, directs operands to one of the following instruction types: uncon- the F-box, ant1 accepts results from the E-box. To ditional branch, jump, and subroutine call and guarantee that instructions complete in instruction return. The branch-taken penalty is two cycles for stream order, the E-box orchestrates result stores any conditional or unconditional branch. To keep and instruction completion between the E-box and the pipeline full across conditional branches, the E-box. I-box includes a 512-bit by 4-bit branch prediction array. The prediction is entered in the branch queue The F-box by the I-box and compared with the actual branch direction by the E-box. If the I-box predicts incor- The F-box performs longword (32-bit) integer mul- rectly, the E-box invokes a trap mechanism to drain tiply and floating-point instruction execution. The the pipeline and restart the I-box at the alternate E-box supplies operands, and the F-box transmits PC. A branch mispredict incurs a four-cycle penalty results and status back to the E-box. for a branch that is actually taken and a six-cycle The F-box contains a four-stage, floating-point penalty for a branch that is not taken. and integer-multiply pipeline, and a nonpipe- lined, floating-point divider. Subject to operand availability, the F-box can start a single-precision, The E-box floating-point operation during every cycle. and a The E-box is responsible for the execution of all double-precision, floating-point or integer-multiply non-fJ.oating-point instructions, for interrupt and operation durjng every other cycle. exception handling, and for various overhead func- Stage 1 of the pipeline calculates operand expo- tions. All functions are microcode-controlled, i.e., nent difference, adds the fraction fields, performs driven by a with a 1,600-wordcon- recoding of the multiplier, ant1 computes three trol store and a 20-word patch capability. Since the times the multiplicand. Stage 2 performs align- control store does not limit the cycle time, we ment, fraction multiplication, and zero and leatling- chose to implement a single microcode control one detection of the intermediate results. Stage 3 scheme, rather than hardwire control for the simple performs normalization, fraction adtlition, ant1 a instructions and provide microcode control for the miniround operation for floating-point add, sub- remaining instructions. tract, and multiply instructions. Stage 4 performs The E-box begins instruction execution based rounding, exception detection, and condition code on information taken from the instruction queue. evaluation. References to specifier operands and results are Stage 3 performs a miniround operation on the made indirectly through pointers in the source and result calculated to that point to determine if a full- destination queues. In this way, most E-box instruc- round operation is recluiretl in Stage 4. 'lbdo this, a tion flows do not need to know whether operands round operation is perfornied on only the low- or results are in register, memory, or instruction order three (for single-precision) or six (for double- stream. precision) fraction bits of the result. If no carry-out To improve the performance of certain critical occurs for this operation, the remaining fraction instructions, the E-box contains special-purpose bits are not affected and the fill1 stagc 4 round oper- hardware. A mask processing unit finds the next ation is not required. If the full round is not bit set in a mask register and is used in the follow- required, stage 4 is dynamically bypassed, resulting ing instructions: FFC. FFS, CALLS, CAL1.G. RET, PUSHR, in an effective three-stage pipeline.

Digital Tecbrrical Jottr11a1 Vol. 4 No. Summer 199-3 15

The WAXon61 WAX+Hi,q/3-l~erfor~nn~zce VAX Micro/~rocessors implernentetl by the ~~Cchip21064 micropro- write-invalidate protocol for data that was not cessor. This C-box interface includes the basic recently referenced. interface control for the external B-cache and for To accommodate the programmable nature of the memory and 1/0 system. The NVAX+ C-box both the system and cache clock frequencies, the receives read ancl write requests from the M-box. NVAX+ C-box supports nine different combinations These requests are queued and arbitrated within of cache and system clock frequencies. This sup- thc C-box ant1 result in cache or system access port allows efficient use of the chip in a wide range across the EI)I\L. The NVAX+ C-box also maintains of different performance class systems. cache coherency by sencling invalidate requests to the M-box whcn requested by external logic. Pipeline Opera tion The NVLY+ C-box implementation provides many The NVAX and WAX+ chips implement a macro- of the same features and performance enhancements pipeline. Multiple VAX macroinstructions are pro- available in the NVAX C-box. Includecl is support for cessed in parallel by relatively autonomous software-programmable B-cache speeds (one-half, functional units with queued interfaces at critical one-third, or one-fourth times the CPU frequency) boundaries. Each functional unit also has an inter- and sizes (128KH to 8MB), write packing, write queu- nal pipeline (micropipeline) to allow a new oper- ing, and reatl-write reortlering. In addition, the ation to start at the beginning of every cycle. The WAX+ C-box supports the newer platforms and pipeline operation can be logically depicted, as increases the degree to which WAX+ is compatible shown in Figure 4. with the IIECchip 21064 microprocessor. NVAX+ In pipeline segment SO, instruction stream data C-box features include program~nablesystem clock is read from the VIC. The next VAX instruction com- speeds, I/O space-mapping, and a direct-mapped ponent is parsed, and queue entries are made in option on the P-cache. segment S1. For short literal and register specifiers, A major difference between the WAX and WAX+ no other processing is required. Requests for fur- implementations is in the B- proto- ther processing for all other specifiers are queued col. Rather than mandate a fixed B-cache coher- to the CSU pipeline, which reads operand base ence protocol, the NVAX+ implementation allows addresses in segment S2, calculates an effective systems to tailor the protocol to their particular address, and makes any required M-box request needs, hW+ cache coherency is implemented contained in segment S3. If an M-box request is jointly by off-chip system support logic and by the made, address translation and P-cache lookup CPU chip, with relevant information passed occur in segments S4 and S5. between the two over the EDAL bus. To allow clupli- Instruction execution starts with an E-box cate cache tag stores (if they exist) to be properly control store lookup in segment S2, followed by updated, the NVA)i+ C-box provides information to a read of any required operands in off-chip logic, indicating when the internal caches segment S3, an ALU and/or shifter operation in seg- are updatecl. External logic notifies the NVAX+ ment S4, and a potential result store or register file C-box when an internal cache entry needs to be write in segment S5. If an M-box request is required, invaliclated because of external bus activity e.g.,for a memory store, the request is made in seg- Existing systems configure the B-cache to imple- ment S4; address translation or PA queue access ment a conditional write-update snoopy protocol occurs in segment S5; and a P-cache access occurs carrietl out using shared and written signals on the in segment S6. system bus. Writes to sharetl bloclcs are broadcast to Floating-point and integer-multiply instruction other caches for conditional update in those execution starts in the E-box, which transfers oper- caches. A CI'U that receives a write update checks ands to the F-box. The four-stage F-box pipeline is the NVU+ P-cache to determine if the block is also skewed by half a cycle with respect to the E-box present in that cache. If the block is present, the pipeline, beginning halfway through segment S4. B-cache update is accepted and written into the The fourth segment of the F-box pipeline is con- B-cache, ant1 the 1'-cache is invaliclated. If the data is ditionally bypassed if a full-round operation is not present in the P-cache, the B-cache is invali- not required. The result is transmitted back to the clatecl. This results in a write-update protocol for E-box, logically in segment S5 of the pipeline. data that was recently referenced by a CPu (and Pipeline bypasses exist for all important cases hence is valid in the P-cache) and recluces to a in the I-box and E-box pipelines, so that there are

Digital TechnicalJorrrrral Vo1. 4 No. ..j! Sttrn~nerI992 17 NVAX-microprocessor VAX Systems

SO S1 52 S3 54 S5 S6 INSTRUCTION FETCH, VIC INSTN OPERND ADD DECODE, SPECIFIER €VAL, TB READ PARSE READ MEMREQ P-CACHE MEMORY REQUEST

INTEGER, LOGICAL INSTRUCTION EXECUTION

FLOATING-POINT INSTRUCTION EXECUTION

KEY: S SEGMENT ALU ARITHMETIC LOGIC UNIT SPECIFIER EVAL SPECIFIER EVALUATION CS CONTROL STORE INSTN INSTRUCTION EXPON DlFF EXPONENT DIFFERENCE OPERND OPERAND ALIGN, FRMULT ALIGNMENT AND FRACTION MULTIPLY MEMREQ MEMORY REQUEST NORM, FRADD NORMALIZATION AND FRACTION ADDITION TB TRANSLATION BUFFER BYPASS ROUND RESULT ROUNDING AND BYPASS

no stalls for results feecling directly into subse- N~~rnDerof CPU Chips quent operands. The M-box processing of memory The VLY 6000 Model 400 core CPIl in~l>lemenration references initiated as a result of operand speci- is a three-chip design: a processor chip, with a fier processing by the I-bos is usually overlappetl sm;~llon-chip primary cache; a floating-point chip; with tlie execut~onof the previous instruction in and a secondary cache controllel; with internal the E-box, witli few or no stalls occurring on cache tags. The initial attempt at NVLY CPIJ defini- P-cache hit. tion was a two-chip design. One chip contained the I-box (with a 410 VIC), the E-box. the F-box, and Design Evolution and Trade-08s the Mbos (witli a ~GKB,direct-mapped P-cache). The second chip held the C-box ant1 the B-cache tag The N\Wi and W!+chips are tlie latest in a line of array. The project design goals, especially time-to- CMOS VM microprocessors designed by Digital's engineers and represent a continuing evolution of market, let1 to a single-chip solution. rather than a two-chip design. architectural concepts from one irnp1ernent;ltion to the next. The preceding chip clesign was the To condense the design from two chips to one, we halved the sizes of the Wc; ant1 the P-cache and CPrJ for the VAX 6000 Model 400 system.I2To meet tlie time-to-market ancl performance goals, we rnovecl the R-cache tags to external static UMs, leaving the B-cache controller on-chip. Later, we had to modify the NV~U/NVAX+ design throughout tlie project. were able to reduce the penalty of halving the size of the I-'-cacheby making it two-way set associative One of the early vehicles for making tlesign rather than direct mapped. With these changes, the trade-offs was the WAX performance model, performance moclel showed a performance loss of which predicts CPIJ and system performance less t11;1n 1.4 percent across all the traces, relative and aids in quantifying the performance impact of various tlesign options. The performance to the two-chip design, with a worst-case pen~lty mode.1 is a detailed, trace-clriven motlel which can of 3.9 percent. There ;Ire strong z~dvantagesto the single-chip be easily configuretl by changing any of ;I variety solution. of input parameters. The modcl stimuli used were 15 generic timesharing ancl 22 bench- Designing a single chip takes less time. mark instruction trace files that were captilred This design requires the production ancl main- by running actual programs on existing VAX tenance of only one design and one systems. mask set. The followi~igsections describe tlie evolution Latency to the B-cache is shorter. of the chip design, including the number of chips, the pipelining technique used, and various cache An off-chip tag store provides more flexibility in issues. H-cache configurations. The WAX' and /WAX+ High-performance VAX 1~4icro,!1rocessors

Macropipelining unlike the original F-chip implementation, the Run-time perfortnance is the product of the cycle NVA,Y/NVAX+control of the F-box allows a fully time, the average time to execute an instruction pipelined operation, which significantly improves ( [CHI) ant1 the number of floating-point performance over the F-chip clesign. instructions executed. CMOS process improve- Although a totally new design would have had ments made it possible to decrease the hViu;/ shorter floating-point latencies, the combination of N\iiU(+ cycle time with respect to the previous gen- a fully pipelined operation and a final-stage bypass eration of ~amicroprocessors, thus improving the allowed us to achieve our performance goal, while first factor in run-time performance. meeting our time-to-market goal. The VM 6000 Model 400 CPU design uses tra- ditional microinstruction pipelining, i.e., micro- Cache Coherence pipelining, to achieve some amount of overlap and Performance studies with the previous generation to decrease the CPI. However, using micropipe- of VAX microprocessors clearly indicate that system lining techniques would not reduce the WAX/ bus write bandwidth .limits performance unless an N\IN(+ <;PI to the level required to meet the perfor- external write-back cache is in~plementetl.In addi- mance go;lls of the NVAX/h'VLY+ projects. We tion, the VA)( architecture requirecl that we imple- acliievetl this reduction by using MSC clesign and ment the cache coherence protocol in hardware. implementation techniques referred to as macro- The NVkY implementation uses a directory-based pipelining. In a macropipelined architecture, the coherence protocol for compatibility with existing I-box acts much like a load/store engine, dynam- and planned target system platforms. The NDAL bus ically prefetching operands prior to instruction supports multiple outstanding read and write execution. Using the macropipeline technique in requests, which allows the microprocessor to uti- the ~Wto;and WAX+ CPUs makes it possible to lize the capability of the system bus to process retire one basic complex instruction set computer these operations in a pipelinetl fashion. We investi- (CISC) macroinstruction per cycle, as in a simple gated the possibility of implementing both direc- RlSC design. Although macropipelining introcluced tor~~-basedand snoopy coherence protocols, but considerable complexity into the NVAXI/NVAX+ time-to-market considerations and the opportunity design, this complexity resulted in a significant to optimize the design for performance in existing performance improvement over a traditional system platforms outweighed the desirability of micropipelinetl design. supporting snoopy protocols. For the WAX+ implementation, the coherence Number of Specifers per Cycle policy is determined by hardware external to the The NVm/NVAX+ I-box can parse at most one NVAX+ chip, in the given system. The WAX+ cache opcocle and one VAX specifier per cycle. The I-box and system interface allows the system environ- design initially consiclered was capable of parsing ment to implement a variety of coherence pro- two specifiers per cycle. Although this parsing tocols. Compatibility with the DECchip 21064 scheme represented significant complexity and cir- interface definition required limiting WAX+ to one cuit risk, intuitively, it seemed important to quickly outstanding external cache miss. However, this retire specifiers in the I-box in order to keep the limitation is more than offset by the significantly macropipeline full. However, the performance better main memory access times achieved in target model predicted a maximum performance improve- systems. ment of less than two percent on our traces, and we One significant advantage of the hW~x+scheme decidetl to l~rnitcomplexity and scheclule risk by is that most policies associated with the external parsing only one specifier per cycle. cache are determined by hardware outside the NVA>(+chip (such as the coherence policy), allow- ing the chip to be used in a wide variety of systems. F-boxDesign Implementing the DECchip 21064 interface on The Nvm F-box design is highly leveraged from the WkY+ greatly reduces the hardware engineering VAX 6000 Model 400 F-chip design. Rather than investment required to deliver a VAX CPU and an start from scratch, we integrated the existing Alpha AXP CPU in the same system environment. design onto the NVAX and Nvm+ CPu chips and For both the NVAX ant1 the WAX+ chips, cache added a final-stage bypass mechanism. In addition, coherence is maintained for the P-cache by keeping WAX-microprocessor VAX Systems

it ;I subset of the extrrn;rl cache. Exter~lall!~orig- fills c;ln resitlt in ;i l~igherP-c;iche d;it;~stre;~ni miss in:itetl inv;~litl;~terequests are forwartletl to the rate, bec;~llsethe!, replace rl:~tathat is likely to be 1'-c;rche only when the block is it1 the extern;il rcferencetl ;ig;lin. We used the performance model c;icIie. Tliis mininiizcs the number of 1'-cache cycles wit11 :~v;~il:ibletraces to tleterniine that looking up spent processing invalidate requests. The two-wa!- VI(: misses in the P-cache generally rcsultecl set-;~ssoci;~tiveP-c:~clie might have been slightl!, in higher performance. In specific ap],Iic;itions, more effective if it were not a subset of the 1;irger higher ~)erSorrn;i~icecan be achieved by not looking direct-nlapped extcrn;il cache. However, this effect up instl-uction references in the P-c;~che.As ;I is far less significi~ntthan tlie effect of expencling ;I result, we iml>lenlentedP-cache configuration bits 1'-c;~cliecycle for every external invalidate event. tli;~tnllow system designers to implement either Virtu;~lc;iches :itmost always have lower l;~tency scheme. 13y tlefaiilt, h'va and N\'AS+ systcrns ;ire tlxrn pliysic;~lc;~ches ant1 usually do not require ;t configuretl to en;tble both instruction ant1 clata tledicatetl translation buffer. The VrU( architecture caching in the 1'-cache, but this may be cli;~ngetlby supports the use of ;I VI<: b!; allowing the caclie to tlie console softw;ire in certain systcnis to supl>ort be incoherent with respect to the data stre;~n~,i.e.. prcp;~cli:igetl;~pl>lic;~~ion systems. not i~j>tl:ltedwith recent writes by tlie (:H:contiiin- ing the Vlc:, or 1)). ;in! other (;l-'U. However, solile Exte1-Izal Cache Size mec1i;unism must be clefined to make the \'I(;coher- I3oth NVAX ant1 N\;tX+ support niultiple extern:il ent with the d;it;~sLrc;tni. In the \Iu architecture. c;~cliesizes to ;~llowsystem tlesjgners fi11I fle~ihilit!~ the execution of thc \:tY return from interrupt or ill selecti~lgextern;il caclie configurations. With exception (1ll:l) instri~ctionperforms this function. existing static iL-\hl technolog!; smaller external We chose to perfi)rm a complete flush of the \'I(; caclie configi~r;~tionsare usually faster than larger as p;irt 01' the execution of every RE1 instruction. configur;~tions.Performance motleling j11tlic;ltetl Rec:~use an IlEI a11v;iys follows a process context that m;Iny ;~pplic;~tions,especially some popul;ir , ;I Flush tluring ;ln llEI removes the process- l>cnchm;~rks,fit entirely in a c;lcIie \vhose size js specifjc virtu;il ;itltlrcsses of the previous process ;~ntl 512Kll or less, resulting in slightly better 1,erh)r- prevents conflict with (potentially identical) virtu:~l m;ince. However, many common app1ic;ttions ;~tltlresscsfor the new process. We coultl have also utilize more memory than will fit in such c;~ches chosen to keep tlic \/I(: coherent with the data stre;lnl ;incl benel'it more from an external cache whose ;~nclimplement per-process qualifiers th;it woulcl size is IMI) to 4hI13, elmwith the atldition;il I;~tency have matle per-process virtu;ll atldresses unique. involvctl. As ;I result. OLI~system designs use 1;lrger Howevrl; coherence would have requiretl both. :in but slightly slower external cache configu~~tions. inv;~liclateaclclrcss ;~ncl:I control path to the VIC. :II~ some h)rm of backm;~pto resolve \,irtii;rl ;~dtlress Block Size ;II i;~ses.I'cI--process qi~:~lifiers would h;ir.e requiretl 1)uring tlie analybis of the previous generation a vr\X ;~rchitecturech;inge ant1 significant operi~ting s!.stenl software ch;~nges.To reduce project risk, of \<\X microprocessors in existing systems, we observctl th:~tthe 10-byte block size was too srn;lll we chose to flush the VI<: on every RE1 instructio~i. to ;ichie\.e optim;~lperformance on nl;injr ;~pplic;~- tions, As a result, IVC chose a 32-byte block size h)r Cuche Hie~zircl~l the NVAS ant! N\RS+ internzil caches. Tliis sizc The NVAX :~ntlNVAS+ chips have three levels of ~>twvicles;I goocl Ixilance between fill size ;ind the caclie hierarchy: the vI<:. tlie I-'-cache, ;ind tlic ni~mberof cycles I-equired to tlo tlie fill, give11 H-cache. The V1(; ant1 P-cache are firlly pipclinetl 8-I)ytc fill tI;it;i 1~1tl~::. ;~ntl1i;lve ~nininii~mli~tency, which allows instruc- I'or comp~tibilit)-with installed s)lstetiis, the size tions to be fetched ancl processed in parallel at very of the N\hX external cache block ancl the c:iche high ratcs. fill size is 32 Rytes. On NVILY+. the extern:~lc;iche The tlel;~ult 1'-c:~checonfigi~r;~tinn caitscs Vl<: I>loch:size rn;ilr be 1;lrgt.r and js (74 bytcs j~ithe V,\X misses to be lookctl 111) in tlie 1'-cache. l'his lookup 7000 Wlotlel 60(.):~ncl \:\S 10000 hlotlel 600 s).stcms. ptx)cessis ;~civ;~~it;tgeo~~ssince the vl<: typic:~ll!. Ilcc:~usc both s!-stems irnpiement low-latency exlxrienccs a sm;~llcrmiss penalty because latency rne1lior\. ;ind high-b:tritlnridtll buses, the incre:lse in for 1'-c;tche hits is roughly one-third t11;1t for ester- extcrn;ll c;lclie block size results in better n;~lc:~clic hits. The clis;~tl\r;~nt:igeis that instruction pcrform:ince.

l ill. -1 .GI..i SII~IIIIIL,I.133.2 Digital Tec6rricnl Jorrrrrnl X riizd NVAX+ ~//R~-/I~~O~'II?L~IZCCVAX iWicro/)l-occssors

Special Feal-ures the failurc. We refined the patdl to place additional The NV~IX'/NV~\\S+design inclutles several fe:ttures checking into various instructions in the sequence. that supplement core chip fi~nctionsby providing This refinement allowed us to isolate the exact acldetl value in areas of debug, syste~n~naintenance. instruction th;tt was causing the tra~isferto PC: 0. and syste~ns;III;I~J'S~S. Among the features are tlie With this information, we werc then able to repro- patchable control store (P(:S) and the perforniance duce the problem in simulation ;uid correct tlie sec- monitoring hardware. ontl-pass clesign. Without this tli:~gnosticcapability we probably would have neeclccl weeks or months ofaeldition;rl clebug time to iso1;tte the fq.1' I ure. Pntcli~nbLeControl Store In atltlition to itsing the powerhi1 diagnostic Tlie 1x1s~m;~cliine micrococle is stored in a ROkl capability of the I'CS, we usetl jx~tchesto correct or control store in the E-box. Tlie 1,600-micro.rvortl work aro~~ntlthe few fi~nction;ilIILI~S that rem;cinecl c;cp;ccity of the E-box controls niacroinstruction in the first-p;css design. For ex:tmple, a microcotle l'<:S execution and exception hitnclling. 'The con- p:~tcliwas ilsetl to correct ;I condition code prob- sists of 20 entries th;~tcan be configured to repl;tce lem ca~~sed17). ;I ~nicrocodebilg during the execu- or supplement the microcode residing in ROhl. E;lch tion of an integer-rn~ltil)l!~instruction. Because tlie PCS entry cont;cins ;I co~itent-:ttltlressalsleniemor)r E-box is centr:~lto tlie execlition of all instructions, (CAM)/I~I\Mjx~ir that stores the p;~tcIimicrowortl we were ;tlso :tble to use p;ltches to correct 1i;rrd- atlclress ;cntl patch microwortl, respectively. l'he ware problems in other boxes. In one inst;~nce,a RO>I co11t1-olstorc and the I1(:S ;ire ;lccessed in p;~r:tl- patch was used to inject a synchronization primi- Iel. Typic:~lly worcls ;Ire fetched from the KOM tive into the M-box in onler to correct an M-box control storc. but if a microwortl ;~tldressmatches tlesign error As a result of tlic simplicit!, ant1 ele- the <;.\A1 in one of the I'CS entries, then the PCS RAM gance of this solution. the final second-pass cor- for that entry supplies tlie microword, and tlie RON1 rection was to move the piltcli into microcode KONl, output is clisabled. rather than motlify the M-box li:~rdw;rredesign. Privilegeel software controls the loatling of tlie PCS by means of internal processor registers. 111 sys- tem operation. :I 1,atch file is nor~n;tllyloaded into Perfot-nzalzce iMorzitori~zgEizuiron~nelzt the P<:S early in the boot proccclure, so that any As computer tlesigns incre;cse in complexity theil. mini~n;~lsystem c;~pableof st;trting system boot c;~n tlynaniic beli;~\iiorbecomes lcss intuitive. <:onl- install p;~tchesto the bast microcode. This fe:ctl~re pl~terclesigners rely more ant1 more 011 enipiric;~l presents ;I w:cy to ~liodifythe base NVAX/NVt\X+ [xrforni;cnce c1;tt;c to aicl in the analysis of system cliip tliroi~glisoftw;~re; the majority of engineering helia~~ior;in0 to provide a b:~sisfor making 1i;crtl- change orders (E(:Os) can be ;tccomplisIietl 13). ware ant1 sol'tm~aredesign tlccisions. In atltlition, releasing new p;~tchfiles, thus ;~llcvi;itingthe neetl multiple levels of logic integration on VLSI chips to change tlie 1iicl.dwi11-e tlesign ;1nd re1001 for the restrict tlie collection of this 1,erfi)rmance d;lt;c, very large-sc;~leintegration (Vl.SI) ktbrication. because many of the interesting cvents are no longer We booted tlic VMS oper;tting system within 16 visible to extern;~l instri~ment;ttion.Tlie N\IAX/ clays of receiving first-pass witfers fro111 fabrication, NVm+ dlip design includes 1i;trclwnre a tribute to ;I very thorough clcsign verification. ;111d counters ~1i;ttcan be configurecl to count any of Howevel; tlie continuing rigorous testing on proto- a set of predetermined, internill st:lte changes. type systems revealed sever;~l1,roblems with the Two 64-bit performance counters are m;~in- base niicrocodc ;~ndli;~rdw;~re. Tlie IJCS mechanism tained in nicmory for each <:I'll in an NVL.X/N\NS+ helped to iclcntify, isolate. ant1 work arountl many system. Tlie lower 16 bits of each counter are imple- of the problems tluring systcnl tlebug and thus mentetl in h;~rclw;~rein the <:l'(l, ;lnd at specifietl allowed extensive system testing to contini~eon points, the clil;tdword coun~crsin memory arc first-p;~sschips. uptlatecl with the contents of tllc l~;~rclwarecoun- For ex;cmplc. we ilsetl ;I secjilence of P(;S p;ttclies ters. Privileged software c;tn be usetl to configi~re during syslem clebug to iso1:lte ;In ol~scurefailure tlie Iiartlw:~recounters to count ;In), one of a b:lsic whose syml~romwas a tr;~nsferto virtual adtlress set of intern:ll events, such ;is c:tclie access ancl liit, 0. By patching the main microcotle exception li;~ncl- ?'B access and hit, cycle ancl instri~ctionretire, :Inel ling routine LO check for this event, we identifietl cycle and st;~ll.When the 16-bit counters re:cch ;I the instruction stream secluence that was cxusing half-full state, tlie perfornir~ncemonitor requests NVAX-microprocessorVAX Systems

an ititerrupt. Tlie interrupt is serviced in ;I ~iorm;~l Results and Conclusions way, i.e.,between instructions (or in tlie rnitltlle of With ;I focus on time-to-market, we shortened tlie interruptible instructions) and at an architecturally originally projected iWkY design schetlule, from specifietl interrupt priority level. Unlike other inter- the start of implementation to the completion of rupts, the performance monitor logic interrupt is the chip design, by 27 percent. We booted the oper- serviced entirely in microcode ;~ntlthen dismissed; ating system just 16 days after prototype wafers no software interrupt handler is required. became available. Tlie use of the pcS allowed us to The microcode component updates the counters quickly tlebug nnd work around the few functional in memory when it services tlie performance moni- bugs that remainetl in the first-pass design. F3ec:luse tor interrupt. During a counter upd;~tc,the ~ilicro- of tlie quality achieved in first-pass chips, we were code temporarily disables the cou~iters,reads ;lnd able to shorten tlie schedule fro111 chip clesign cle;~t-atlie li;~rclwarecounters, ~~pdatesthe counters completion to system product delivery As ;I result, in ~iie~iior):enables the counters, ;ind rcsumes s)-stems were tielivered to customers four months instructio~iexecution. The b;lse ;itldress of tlie earlier than the origin;~lIy projected date. counters in memory is taken from ;I system vector At the same time, we were ;tble to dramatically t;tble ;~ndoffset by the specific <:Pt1 nuliiber, creat- improve CPLl performance relative to previous VAX ing a data structure in memory that contains microprocessors by implementing a macro- ;I p;iir of 64-bit counters for each (:I-'tJ. pipelinecl clesign, in which multiple autonomous Combining the use of hartlware, softw;ire, and fi~nctionalllnits cooperate to execute VAX instruc- the I'<:s createcl a versatile performance monitoring tions. Our internal goal was performance in excess environment-one that goes beyond the scope of of 25 times the pcrform;~nceof the VKX-11/780 s)a- the lx~sicbardware capabilities. 111 this cnviron- teni. We signific:~ntlyexceetled this goal 21s denion- ~iie~it,we c:ln correlate the counts with Iiigher-level stratecl by tlie follo\ving St;lndard Performance ~)~ste~i~events and change the representiition of the Evaluatjon (:ooper;~tive(SPEC:) Release 1.2 perfor- collectetl tlata. For example, micrococle can enable mance r:~tings:'5 tlie counters every time n process context is loatletl ant1 tlis:tble the counters when a process context is saved. This feature allows us to set up workloacls ant1 g;lther dynamic statistics on ;I per-process basis. We can also use I'<:S patches to motlify the SPECint 30.4 memory counter adtlress in order to provide an These ratings were measured on a VKi 6000 adtlitional offset based on one of the five Vt\X pro- Motlel 600 system ;~tthe initial announcement and cessor operating modes: interrupt, kernel, execu- are two to three times higher thm those for the pre- tive, supervisor, or user. This tecliniclue provides vious VAX microprocessor running in the s;lme sys- ;I new performance counter cl;~t;~striIctilre that coJ- tem. Softw;~re2nd system tuning has subseqi~e~~~tl~~ lects st;~tisticson a per-mode, per-process, per-<:l'Ll improvetl tlie initial numbers on all systems. b:rsis. Also, microcode patches can I>e used to ntld The NVAS/NVAX+ design provides an upgratle context checks that filter and count v:lrious events. path ant1 system investment protection to cus- For ex;~niple,we can patch the \'AX context switch tomers with inst:~lletl \'.A>< systems, as well :IS :I instruction to count context or patch the migration patll froni an hVh,X+ microprocessor to a interlocketl instructions to count the number and DECchip 21064 microprocessor in the new Alpha typesof ;lccesses to nlultipl-ocessor synchroniza- AXP systems. tion locks. The performance nlonitoring environment is ;I Acknowledgments powerful tool that we have i~seclto collect the data We woulcl like to ;~cknowletlgetlie contributions of recli~ireclto a~i;~lyzehartlware ;~nclsoftware behav- the following people: ior ;~ndinteractions, and to develop ;in i~nderstand- W Antlerson, E. Arnold. I). Asher, 11. H;~cleatr, ing of system perfor~iiance.We l1;lve ;~pplietlthis I. Hahar, M. Henoit, U. Uenschneider, D. Rh;rvs;t~-, knowletlge to tune the perh)rm;~nceof operating M. Hlaskovich, 1' Houcher, W Bowhill, L. Briggs. systems ant1 application softw:lre, and continue to S. Butler, R. <:alcagni. S. (krroll, M. Case, R. <:;a- ;ipply the knowledge to improve the design ant1 per- telino, A. Cave, S. <:hopr;r. ,M. Coiley E. Cooper, formance of future hardware ;~ntlsoftware. R. Cvijetic. K. I>avies, iM. Delane): I). Deverell.

72 I'ol. 4 IVV..l .Stftnmcr 1992 Digital Tecbrrictrl Jorrrrrtrl A. Dipace, C. Dobriansky, D. Donchin, R. Dutta, 9. L. Chisvin, G. Bouchard, and T. Wenners, "The 1). DuXlrne): J. J. Ellis, J. l? Ellis, H. Fair, H. Feaster, ViUC 6000 Moclel 600 Processor," Digital T: Fisclie~;T. Fox, G. Franceschi, N.Geagan, S. Goel, Technicc~lJounzcll, vol. 4, no. 3 (Summer M. (;ow;ln, J. Grodstein, I! Gronowski, W Grund- 1992, this issue): 47-59. mann, H. Harkness, W Herrick, R. Hicks, C. Holub, 10. A. Agarwal et ill., "An Ev;~luationof Directory J. Muhcr, 11. Jain, K. Jennison, J. Jensen, M. Kantro- Schemes for Cache Coherence," Proceedings witz, R. Kh;lnna, R. Kiser, D. Koslow, D. Kravitz, of the 15th An~zclalInternational Sjjmnpo- K. Ladd, S. Levitin, J. Licklider, J. Lunclberg, siurn on Cot7zputer Architect~we(May 1988): R. Marcello, S. Martin, T. McDerniott, K. McFadden, 280-289. B. McGee. J. illeyer, M. Minardi, D. Miner, E. Nangi;~, L. No;~ck,T. O'Brien, J. Pan, H. P;lrtovi, N. 1';1tw;1, 11. I? Stenstrom, "A Survey of Cache Coherence V: I'eng, N. Phillips. R. Preston, R. Razdan, J. St. Schemes for illultiprocessors." Collzputer Li~urent,S. S;~mudrala,B. Shah, S. Sherman, W Sher- (June 1990): 12-24. wootl, J. Siegel, K. Siegel. C. Somanathan, C. Stolicny, 12. H. Durtlan et al., "An Overview of the ViU( P Stroppaso, R. Supnik, M. Tareila, S. Thierauf. N. 6000 Model 400 Chip Set," Digital Technical W~cle,S. W~tkins,Wl Wheelel; G. LVolrich, and Y. Yen. Journal, vol. 2, no. 2 (Spring 1990): 36-51.

References 13. VAX 6000 Datcicelzter Systenzs PerJor~nance Report (Maynard, Mi\ Digital Equipment 1. R. Hrunnel; ed., ViM' Architecture Reference Corporation, 1991) it.lcirz~~al,Second Edition, (Betlfortl, NLA: Digi- tal Press, 1991). 2. D. Dobberpuhl et al., "A 200MHz 64b Dual-Issue CMOS Microprocessor," IEEE Inter- ~zaliorzal Solid-State Circziits Co~zference, Digest of Technical Paj~ers,vol. 35 (1992): 106- 102 3. R. Sites, ecl., Alpha Architecture Reference Mlnn~lal,(Burlington, MA: Digital Press, 1992).

4. 0. Fite, Jr. et al., "Design Strategy for the Vfi 9000 System," Digital Technical Jo~lrncll, vol. 2, no. 4 (Fall 1990): 13-24. 5. W Antlerson, "Logical Verification of the WitY CPU Chip Design," Digital Technical Journal, vol. 4, no. 3 (Summer 1992, this issue): 38-46.

6 M. Callantler et al., "The Vhystation 4000 Motlel 90," Digital Technical Journal, vol. 4, no. 3 (Summer 1992, this issue): 82-91. 7 J. Crowell and D. Maruska, "The Design of the VAX 4000 Nlodel 100 ant1 MicroVAX 3100 Model 90 Desktop Systems," Digit611 Technical Jo~frnal,vol. 4, no. 3 (Summer 1992, this issue): 73-81.

8. J. Crowell et al., "Design of the VIIX 4000 Model 400, 500, and 600 Systems," Digital Techniccll Journal, vol. 4, no. 3 (Summer 1992, this issue): 60-72.

Digilrrl Zechrricrrl Jour~znl Vol Q iVo 3 \rrtttttter IYW 23 Dale R. Donchin Timothy C. Fischer Thomas E Fox Victor Peng Ronald E! Preston William R Wheeler me NVAX CPU Chip: Design Challenges, Methods, and CAD Tools

The ArVAX (PL1 chip is 0 1.3 117rlllol1trc~~zsisto~ 12S I?IICI'O/II'OC~SSOI.desrgized ill Digital's 0 75-11zio.orrzeter 6110.Y-4 tecl~rzologj!It has a tjpic~11cjlcle ti112e oj' 12 ns u~zde~.worst-case operating conditio~zs.The goal of the cl~ipdesign teallz was to design a high-pe~$)rrrzmzce,robust, n12d reliable chip, z~itl7irzthe con- stmirfts of n short scl?ecl~~leDesign strategies zilere clez~elopedto uchielle this goal, inclz~dingouga/zizalion of a ~hlydesig~zJloiu n~zd 11ez11 il~~l~kl?rozt~~tro~iand ilerl- fication neth hods 1Veio pr*ol)rletllry CAD tools also plqyed i/rzlIot~t~llntroles in the ch@ deieoelopr~zent.

The N\RX CPU chip is a 1.3 million , \lkY The I-box fetches, parses. and decodes instruc- microprocessor clesignecl in 1)igital's 0.75-micro- tions, ant1 predicts conditional branches. The E-box meter hmrtli-generation coniplen~ent;lr)~metal- ~LI~S~~ndcr microprogr:~mmetl control ancl executes oxitle semiconductor (<:M<>S-4) technolog): The all instructions. except flo:iting-point and long- implementation of the ~'VRYc:pU chip posed many word integer multiply instructions, wliicli at-e exe- design and complexity management challenges. cuted by the F-box. The M-box processes memory The combination of the cliip performance god, the references, including \rirtual-to-pllysical atldress high level of integration, and the sm;ll l feature sizes translation. The <:-box controls tlie off-chip backup of tlie <:NI~s-~tecl.lnology increasecl the severity of cache (tlie secontl-level c;~chefor tlata and thircl- on-chip electrical issues and tlie difficulty of verify- level c;iche for instri~ctions);tntl contains the bus ing the physical design. These challenges were interpdce unit. "l'he cliip also h;ls a direct-m;tpped intensifiecl by a short design sclietlult.. This paper 2 kilobyte (KU) virtual instruction cache (\TIC), ;I describes some of the design striitegies, methotls. two-way set-associ;ltive 8K13 data and instruction techniques, ant1 proprietary computer-aided design primary cache (1'-cache). ;I 12KB control store reatl- ((:AD) tools ~lsedby the chip tlesign team, which only memor)r (RON!), ;I 96-entry, fully associative were instrumental in meeting these cli;illenges. translation bi~ffer,:~ncl ;I 512-bjt by 4-bit condition;il branch history rr~nclom-;~ccessmemory (R,\kI) for Chip Overview branch prediction. The chip photomicrograph, I11 order to ;ippreciate the design ch;illenges that with box and large ;irr;iy bo~tntlariesoutlinetl, is urere faced, it is necessary to untlerst;incl the size shown in Figure 1. and complexity of the design. ?'he NVAX c:IYJ chip The NVAX chip's l;tyout clatab;usc is composecl of Ji;ts :I m;~cropipelinedmicroarchitecture ;~ndimple- over 4,200 uniclue ci~stotiicells, ant1 has a total tran- ments the VUi17struction set.' Hec;iitse it is a com- sistor count of 1.3million. It was the first product plex instruction set computer (CISC) architecture, chip to be iniplementecl in 1)igit;il's 0.75-micromr- the vt\x arcliitecture implements varable length ter, three met;tl layer. 3.3-volt (V) CMOS-4 technol- instructions with complex addressing mocles, ;ulitl ogj..? The chips typical c!.cle time under worst-case precise traps and exceptions. The chip is composetl operating conditions is 12 nanosecontls (ns) or 83.3 of five subchips, or functional boxes. calletl tlie megahertz (MHz), and it tlissip;ites 14 watts (W). A I-box. E-box. F-box, M-box, anel C-box. s1rrnm;iry of tlie chip fc:~turesis given in Table I.

24 %I. 4 No..{ St~rrrmcr1992 Digilal Technical Jozrrrrnl The NVAX CPU Chip: Ilesig~zCl.alle?zges, ~l~etl~ods, and CAI1 7hols

L+E-BOX F-BOX

Design Goals and Constraints previous designs. Due to competitive marketing Our design goal was to develop a high-perform:~nce pressures, this schedule was substantially reduced, chip that is electrically robust ;rnd reliable. We had making it significantly more aggressive than for pre- to accomplish this within the <;>10S-4 process vious clesigns. Our cycle time was constr;iinecl to constr;~ints;~ntl the design time :illottecl by the a maximum of 14 ns untler worst-case conclitions. development schedule. 011sinitial implenientntion Electric;il reliability h;icl to be guaranteed for cycle scheclirle was based on schetluling metrics from times clown to 10 ns ~inclerworst-case conditions. NVm-microprocessorVAX systems

Table 1 Summary of Chip Features In atltlition to minimizing risks to the schedule. we had to solve several g1ob:il tlesign and veri- Transistors 1.3 million fication problems to achieve the cycle time of 14 Die size 16.2 mm by 14.6 mm ns. The tlesign team assumetl a 10-ns cycle time Cycle time 12 ns (typical) when it ;~ddrcssedproblmms that are exacerb;~ted Signal pads 261 by a brster cycle time, such ;IS interconnect reliabil- Supply pads 121 ity ant1 signal integrity Some of the key concerns Package 339 pin THPGA* were on-chip power, grountl, ant1 low skew clock Power dissipation 14 W (average) distribution, ant1 the routing of long signal inter- at 12 ns connects. Verification of the custom layout was

Note: "Through-holepin grid array another ch;rllenge, particul;irly given the schec[ule constrainrs. Tlie use of <:AI) tools was a significant benefit in cleveloptnent of tlie chip. These issues ;Ire addressetl in more cletail in the remaining sections Llnsed on <:ILI<)s-~process limits, the chip clie size of this papec was constrained by :I maximum diagonal length of 21.8 millimeters (mm). The tr;~tle-offsbetween clesign time and chip c11:i- Design Flow and Management racterjstics, SLICII ;IS pel-for~ii;ince,rirca, and function, Tlie chip clesign team was org;inized into five semi- were the tlominant issues tliroi~gboutthe project. autono~nousgroups, eacli of which focused on the design of a h~nctionalunit (C-box, E-box, F-box, I-box. 1M-box). To ensure design compatibility and Design Strntegy and Challenges consistency across the chip. e;rcli team adheretl to 'To achieve the goal of a 14-11s cycle time, we the same design giridelines :illtl methods. For exam- tlesignetl custoni circuitry iincl lajlouts, inclutling ple, box-level interfaces were rigorously definecl, n clynamic, ;is)lnchronous, :rntl tlifferential logic. To consistent register transfer level (RTL) motleling deal with the size and complexity of the chip, stylc w;rs userl, ant1 circuit noise margins, transistor together with the schedule constraint, our strategy sizing, ;ind other circuit and 1:iyout gi~idelineswere c;rlletl for ;I large tlesign te;im. Complex, custom observetl. The design team followetl the top-down, very large-sc;rle integration (VI.SI) chip tlesigns hierarchical tlesig~lflow tlepicted in Figirre 2, but inherently h;ive high levels of clesign schetlule risk. there wits muc11 overlap between the activities. To retluce our exposure to scl~etluleslips, we ni;rtlc Complexity was managecl by thoroughly reviewing several higll-level design decisions early in the proj- and testing the clesign at c;icIi level of abstraction ect. Wherever possible, we avoided using circuit (microarchitecture perforn1:ince model, RTL model, structirres that were time-consuming to analyze. logic. circuit, and layout), iintl by using <:AD tools We defined ant1 followetl tlctailed tlesign gi~itlelilles to verify th;it all the design representations were to ensure clesign consistency robustness, ancl re- consistent ncross the levels of ;rbstf-action. When li:rbiljt)~ We usetl a top-down tlesign flow ;inti making clesign tlecisions, we consitlered the im- m:itle extensive use of proprietary CAD tools th;rt plications :icross the levcls of abstraction. For were expressly tlevelopcd for high-perform;mce example, when we made rnicroarchitectilral tr;itle- custom VLSl design. Lastly: we minimized chip area offs, we considered the implications for power by hanclcrafting Iqout in sections of the chip consumption, logic complexity circuit speed and cycle time, l:ijroiit. die size ;~ntl,of course, schetlule. where the leverage on reducing overall die size W:IS signific:~nt;or when critic;rl p:rth node had to be minimizccl to satisfy the path timing constr;~ints. Perfor~nnnceModel and The floor plan was accur;~tely monitored Mic-oarchitecture S'~ecijicution tl~rouglioutthe project. This was essential because The NVAS performalice motlel is a software pro- initial tlie estimates illclicated that the chip size was gram that models tlie micro:irchitecture of tlie close to the maximum size t11;it the CivlOS-4 tech- h?Ifi chip. The architecture te:rnl ilsetl the nioclel nology could sup,port. These strategic decisions to study the effcct of varioi~sn~icroarchitectur;il reduced tlesign time ant1 allowecl us to focus on factors on overall CPU perh)rrn;ince and to tlefine ;~chieving;I first <:l'Li cycle time \vitliout compromis- the chip's microarchitecture. The performance ing the qu;ility of tlie tlesign. model was updated as the microarchitecture The NVA X CPCI Chip: Design ChallenLqes,&lc.tl~ocls, and CAD Tools

detailed box-level floor pl;lns. All floor plans were

AND MICROARCHITECTURAL entered directly into the layout tlat;lb;lse and main- tai~ietlthroughout the project. Trdcking the floor plan ;~tseverr~l levels eased the tlifficulty of integrat- ing the box l;lyouts to form tlie full-chip composite

RTL MODEL AND late in the project. More details on floor plans ilntl CHIP FLOOR PLAN the use of <;All tools are described in the section I I Floor Plan Tecliniques.

RTL Verqicntion and Sche~nnticDesign RTL MODEL VERIFICATION AND SCHEMATIC DESIGN We verified the RTL model by running a com- bination of pseucloranclorn tests, stanclartl VAX architecture tests, and Ii;~ndwritten implemen- tation-specific tests. In ortler to identify flaws SCHEMATIC LOGIC AND CIRCUIT VERIFICATION AND before time-consuming sclicmatic and layout PHYSICAL LAYOUT DESIGN changes were implemented, we ran regression tests on rhc rnoclel whenever we ~natlechanges to tlie design. 'Ti) track tlesign changes and issues, clesign- FULL-CHIP LAYOUT ers posted a tlescription of the cllanges and issues VERIFICATION along with tlie ramifications for other parts of the design in :in electronic bulletin boartl. Eacli new entry to the bulletin board was ~iiailedelectroni- cally to every ~iiemberof the teilm. This tracking procedure helped reduce clesign iterations caused evolved so th;rt the team coi~ltlassess tlie impact of by stale i11h)rmation. tlesign c11;rngcs on perfornlance. We ~~setlthe R'I'L model a.4 ;I specification for logic/ The chip niicroarcliitects wrote a textual speci- circuit design. To synthesize logic directly from tlie fic;~tio~iof the chip that clocuniented its function RTL model for circuits with less critical area ;lntl ;~ntlmicroarchitecture. As thc tlet;~ilsof the tlesign speed constr;~ints,we used ;I <:AD tool, O(:(:/\M. were resolved, tlie specific:~tionwas uptlatetl 21ncl Becal~sethese constr>iintswere stringent for ;I 1;lrge exp;~ncleclto reflect the actilal implementation. The portion of the chip, engineers designed most of the functional clesign verification team used the syeci- circuits. We tleveloped a librilry of common circuit fication to develop imple~iient;~tion-specifictests. and layoi~tstructures to recluce the design and veri- fication effort. We clefinetl ;ind followed detniletl RTL Mo6tel cr nd Floor Pl~rns clesign gi~itlelinesto ensure clesign consistency. More information on the types circuits usetl on The design team developctl :I cletailed RTI. model of of tlie chip once the chip specification was the NVAX chip is being published in "A 100 MHz completeel. 'lliis model was written in Digitxl's Macropipelinetl VhX (:MOS Microprocess~c"~ proprietary 131)s liardware tlescription language. As the HIIS cotle w:ls being written, many logic/circuit Sche~naticT/'c~rificatio~z and Layout Design feasibility stirclies were sp;lwnetl. TLle model was We held clesign reviews ;111tl i~sedCAD tools to used to verify tli;rt the proposetl ~iiicroarchitecture check scliem;~ticsfor unintentletl cleviations from eseci~teclVt\X code correctly. It ;~lsoservetl to guicle the clesign guiclelines. \Ye performed extensive logic implementation ;IIICI W;IS i~sedby system logic simulation on the schematics. We used SI'I<;E design temis to develop modules based on tlie for accur;lte critical path timing analyses, ant1 ;I WAX microprocessor.u 'rhc IU'l. rnodel was 111)- static circuit timing an;~lyzer,N'l'V, to detect unidcn- d;itetl ;IS the project progressetl. tifietl slow p;~ths.('(NVAX timing verification is The chip floor plan was tlevised early in the describetl in ;I I;~tersection.) Figure 3 is a histogram design to estimate ant1 track the clie size ant1 to of slow paths as a function of cycle time that was provide area-impact data for microarcIiitectur;~l generatetl by NTV about two months before we trade-offs. Once the chip-level floor pl;ln was tapetl out the first pass of the cliip. Because NTlJ stable, thc box tlesign te;ims cleveloped more does not pretlict circuit delays as accurately as WAX-microprocessor VAX systems

la).oitt clcsign was cornpl.eted, tlic final stages of lay- out veril-'ic:~tionensured tii:it the ;~ssembleclchip layout !net rrli;~bilit!~and integrity guiclelines for global nodes such as power. clocks, ant1 signals. Most of the I;~youtchecks were perforrnecl by (:All tools th;lt usetl the CMOS-4 I;~youtdesign r~~les;ind NVAX-specific electrical rules. (More details on 1;1y- out verific;~tionare given in tlie section Layout \'erificntion Tools. Floor Plan Techniques 1VitIi 1.3 million transistors. ~iearlyhalf a million signal notles. ;ant1 over 16 million polygons on the ~WA>;tlic, precise monitoring of the floor plan was CYCLE TME (RELATIVE TO DESIGN TARGET) critic;~l.From the earliest :II-e:~estimates, it \+/;IS Note: Oala is normalized based on 14-ns design larget. clear th;~ttlie chip size was close to the m;~sim~rm The cycle tlme of 1.0 equals 74 ns th;ic the (:h[0~-4technology coultl support. We h;~d to ensure that tlie die clitl not cxceed the technol- 1 I . I 11-iNTV flistog1~111z ogy limit.

SI'I(:E, all c1uestion;tble p;tths fl;~ggetlb!~ N'1.v \\/ere simulatecl ubing SL'ICE. Those that were foi~nclto hc Prelimin;rry estimites of the tiit. ;trea were m;~cleby slonl were retlesipnecl to meet the target c!.cle time. extrapo1;ltions from previous <;I'll designs.'" Better Logic ;inel circuit chatiges resulting from tliesc are;l cstim;ttes were tlevcloped for regu1;ir struc- an;llyscs ;tntl the in~j>;~ctof these changes on other tures, such ;is the cache :I~S;I~Sancl data p;lths, by design represent;ttions ;~ntlverific;ltion tests were creating test 1a)iout structures. More accurate floor tr;~clictlon the electronic b~rllctinbo:~rcl. Since m;cny plans of the co~ltrojsections of tlie chip were clcvcl- people were worliing 011 tlie clesign sin~ult;meousl): oped after the 1enissues pr~\~cd cases, the jx~rtitioningof the 111'1. model was :I close cruci;tl to meeting oirr scl~vtlule.A single ch;~r~gc ;~plxoxim;ttio~ito the tlesirctl sche111;ttic;~nd I;~!~out of cu recli~irrcl1i1odific;ttion ;~ntlrcverific;ttion of p;irtitioning. 11) estimate tlie ;ire2 of r;uidom con- the IU'I. moclcl. scliem;itics, I;~)~out,verification tests, trol logic, we itsed the O<:<:Akl logic sy~~thesistool, anel in some cases tlie chil-, testu;~lspecification. and in nxlny cases Digital's ~~roprietaryla!-out syn- Ltyout design proceetletl on :I subbox, or section, thesis tool, <:I.EO. b;tsis. A t!.l,ical sectiori consistctl of ;~pprosim;~tcly i\s seen in the chip pl~otomicrogr;t~>liin Figure I, ren rel;ttctl sc1iem;ttics. hch scction was clieckctl the clock gener;itor ant] rlri~~ers((;LKGEN) Miere tiis- for con~iccti\;it!~itntl correctness ;lfter its layout \v;~s tributetl in ;I narrow no~-th-soi~thchannel near the coml,leted. Sections within ;I box were then inte- center of the chip. That 1oc;ltion was chosen to min- grated :~ndvcrifietl (boxes cont:~jnedfrom 5 to I5 imize clocli skew. The I-box, E-box, M-box, :lncl sections) hcfor-e the complete chip composite w:ts C-box s~~I>chips;Ire loc;~teclon the right side of rhe ;~sse~nl,lctl. chip. 'I'lleir relrttive positions were chosen to Scl1enl:ctic i'crification :~nclI;t).out design wcrc accommotl:~tcthe flow of the pipelined d;lt;~in the perfornlecl concurrently tluring much of the pnjj- main tl;~t;~p;lth. which runs north-south just to the ect. Although this overlap lctl to desig11 c11;111ges right of <:I.K(;EN. The \TI<:, F-box, ;~ndP-c;lchc were t11;lt incre;lsccl the amount of I;~youtrework, the placetl on the left-l~antlside of the chij,, ;ttlj:lcent to chi17 tlevelol,ment schetlule \voultl have been signif- tlie boxcs with which they con~n~unicate. icantly (ongcr if schem;ttic verific;ltion and I;~yo~tt elesig11h;icl been done seri;~lly.L;i!-o~tt design tool< I~ztercolzlzectPlnlz~zilzg and Routing about :I !.e;~rto complete. Mter we tleterminecl the initi:ll placements of :~ll major control sections. \vc t~setl;I new two-level bloclc router c;~lletlc;I.C)\V to route the layoi~tfor the Sectioo- ;lncl box-level I;lyout verification was per- interlwx ;IIICI intrabox signals. Kouting w;ls per- fornletl in p;~rallelwith tlie 1;tyo~ltdesign. Once formed :it the same time schematics for the control areas were in tlesign. Since layout clid not exist Third Melal Layer for tlie control blocks. <;LOW was ;lllowetl to roi~te The top al~~niinuminterconnect I;~!.er (M.3) in tlie sign;~lsover the blocks witli some restrictions, ;is CMOS-4 process was specific;~ll!clesignecl to mcet well ;IS route signals in clx~nnels.We influencetl the electric;~lreqi~iremel~ts of tlie N\k-C\: chil?. 'fhe how C;I.oW routed signals by specifying some areas thirtl met;~l1:tyt.r mias tlesignecl for ;I low slieet rcsis- of blocks 21s op;1clue (no over-the-block routing per- tivity ant1 high current c:ip;~cityin orcler to li:~ntlle j~iissible);~ncl some as porous (over-the-blockrotrt- the electric;~lproblems associ;~teclwitli tlie powcr ing is permissible if cli;~nnelsare fi~ll).'l'ypic;~ll~ grid, clock clistl-ihution, critic:~lsign;~l routing, ;incl cell hlocks tli;~tcontainetl regul;~rarrays (such as large arr;)!. clesign. progr:~nini;~l~lelogic arrays) or critical circuitry were identified as opaque, wliere;~smost rancloni PQLU~I?a~zd GI-ound Di.sh.iD~~liolz control ;lre;is were identified 21s porous. 1Vhen the NVA); niicroprocessor is run at m;~xirn~~nl The c;~p;~cit;incevalues of the interbox ant1 intra- speecl, it clraws ;I tlirect current of about 5 :impcres. box signal interconnects were extracted from the Due to (:hIOS s\~itcIii~lgtl.;~nsients, the a1tern;iting layout ;rntl used in circuit sini~~l;~tionsof critic;il current pe:tks ;ire consicler;~l,l!. higher. 1)istribtrting paths. In some cases, the pl;~cementof sections ;lntl power (l;,,,) and ground (r;) ;icross the cliil, while the sign:~lrouting were alteretl to reduce intercon- keeping power gritl volt;~getlrops (I]{) ilncler 300 nect c;~p;lcit;~nceon critic;~Isig~i;~ls. The use of syn- millivolts (10 percent of rnini~iii~m L;,,) was ;I major thesis tools SLICII ;IS (;I.OW, O<:CAI\JI, ancl (:1.130 challenge. 'It) ;~dclressthis constraint ;117d nieet allowed 21 much more tletailetl floor plan to be interconnect reliability go;~ls,\ye ~~setltl?e low- developed than was typical for prior design~.-~T'lie resistancc M.3 layer extensivel!. to distribute I;,, ability to ketl capacit;~nce infixmation from and yi. As shown in Figure 4, we tlesignecl the floor plan routing back into SI'ICE simirl;itions right-li;~nclside of tlie chip to he covel-etl witli provetl in\~:ilu;ible. an intercligit:~teclarray of ;~ltern;~tingI.;,(, ancl L:,

/ MAIN CLOCK ROUTING (M3) I POWER OR CLOCK VIC LINES (M3) SUPER BIT LINES (M3) POWER OR CLOCK M2 STRAPS

CONTROL F-BOX STORE ROM M3 POWER SUPER BIT STRAPS LINES (M3)

POWER OR CLOCK LINES (M3)

POWER P-CACHE OR CLOCK SUPER BIT -- LlNES (M3) LINES (M3)

M3 POWER RING ' M3 POWER SPINE NVAX-microprocessorVAX systems lines, each 17 ~iiicrometerswide. Vertical met;~l On-chip Clock Distributio~z two (M2) .lines are usecl to strap the power lines In ortlcr for 11s to meet the perform;itlce go;~ls,it ;lntl form a t;,,, grid and a y., grid. The ant1 1; was critic;~lto keep clock skews small ant1 edge distribution of the left-li;i~idside of the chip w;is rates sh:irp across the chip. As shown in Figure 5, different from that on tlie riglit bec;iiise of the spe- special attention w;is given to the clock distribu- cial layout requirements of the cache ;crr;lys ant1 the tion scheme. Differential outputs 1-'rom ;ui off-chip F-box. oscillator were sul~pliedto a receiver locatecl at the Indivitlual cell layout tlitl not cont;~inMS. The top of the chip. The output of the receiver was powel-, groirncl, and clock connections for a cell routecl to the global clock generator (<:I.K<;EN), were routed by short vertical 1M2 lines inside each which was placetl ;it the center of tlie chip to cell. These iM2 lines were connected to the Mj retluce clock skew. The outpi~tsof the gloh11clock grids automatically by a CAI> tool. generator were buffered b). four inverters to

OSCILLATOR-LOW -,, ,.,rOSC1LLATOR-H IGH DIFFERENTIAL AMPLIFIER \\ / ,-UPPER PAD CLOCKS (M3)

E-BOX CLOCKS M2 STRAPS

OSCILLATOR 3 E-BOX (M3) CLOCKS (M3)

GLOBAL CLOCKS (M3)

\ I LOWER PAD CLOCKS (MJ) The NVAX CPU Chip: Design CL7allenges, ~llethods,and CAD Tools increase their driving capability. The clocks were need for sense amplifiers and voltage reference gen- then distributed, using the low-resistance third erators. This substantially recluced the time metal layer (17 milliohms per square), from the top required to clesign and verify the control store 1m)hl. to the bottom of the central clock routing channel that spans the chip. Primary Cache Clocks were supplied to the different functional A similar technique was used in the 8KB P-cache to boxes by locally tapping off the central clock rout- ease the timing requirements. The three high-orcler ing and buffering each signal with four inverters to P-cache address bits must be translated and conse- further increase the signal's driving capability. This quently become valid later than the untranslatetl buffering helps to minimize the capacitive loads lower-orcler bits. By dividing the P-cache into eight seen by the clock phases in the central routing subarrays, each with its own sense amplifiers, the channel in which the RC delays are held to 30 pico- cache subarray access can be started before the seconds (ps). To reduce distribution skew between three translated address bits are valid. When the the global clock lines, loading on each global line last three address bits become valid, the outputs of was balanced by adding dummy loads to the more the subarrays are multiplexed onto the M3 super bit lightly loaclecl lines. The buffered clock phases lines, resulting in a faster cache access time. were distributetl to the east and west of the central clock routing channel, again using M3 to reduce RC delay. The east-west clock routing was strapped Layout Vmification Tools with M2 as shown in Figure 5. These straps were Verifying the WAX chip layout presented several not allowed to cross box boundaries. Box-level CAD software challenges. Prior to the NVhs design. clock skew was reduced by using a common the existing layout verification tools were able to section buffer design ancl layout, ancl by carefillly extract full-chip netlists from layout for all large tuning the buffer drive capability to the clock Load designs in a single batch process. However, the in each section. existing layout netlist extractor could not hanclle Finally, before the clocks were usecl by the logic, designs such as NVhK with over one nlillion transis- the clock signals were locally buffered. These final tors. Also, a more accurate capacitance extraction stages of local buffering served two purposes: they algorithm was required to calculate side-to-sideant1 recluced the gate loading on the east-west clock fringing capacitance, which came to show signifi- routing, and they helped to sharpen the clock edges cant effects in the small physical dimensions in seen by the logic. C&IOS-4. Furthermore, accurate interconnect resis- The global clock routing network was spaced so tance extraction was needed for NVAX. A combinil- that the RC delays of local clock branches woulcl tion of new CAD tools (see Figure 7) and design never exceed a negligible 125 ps. \Ye calculated the methods was employed to meet the WAX layout RC delays of local clock branches using the W~WoTH verification requirements. layout interconnect analyzer (described in the section New I'roprietary CAD Tools) and, where Partitioning Using "CleanBelts" necessary, rerouted branches to meet the 125-ps design goal. A sample RC plot, generated by To address the problem of extracting parasitic WAWOTII. for a section of local clock routing, is capacitance data from such a large design, the WAY given in Figure 6. The clock skews and edge rates chip layout was constructecl so that each chip parti- across this 1.62-centimeterchip are less than 0.5 ns tion could be indepenclently extracted without and 0.65 ns, respectively. introducing inaccuracies in the results. The chip was partitioned into nonoverlapping regions, each of which had a rectilinear annulits or "clean belt" Microcode Control Store around its periphery. A clean belt is a rectangular The design of the 12KB RO&l control store was sim- region that contains only metal lines and satisfies ;I plifiecl by dividing it into four subarrays. Each subar- number of layout design rules beyond those set by ray has its own M1 bit lines. The M1 bit lines from the technology. The clean belt layout rules pre- the subarrays are cascacled onto low-capacitance vented design rule violations within the clean belt M3 super bit lines that extend over all four subar- and between adjacent clean belts. The rules also rays. Since the capacitance of the 1M3 super bit lines ensured that extracting parasitic capacitance from is low, the access time is very fast, obviating the a region enclosecl by a clean belt could be done

Digital Techiticril Jourrcrrl Vol 4 No .; J~r~unrer1992 3 1 WAX-microprocessorVAX systems

Nole: Times are given in picoseconds

;~ccur;~telyreg:irclless of the m:~trrials that border out cell inst;i~~ces:11ic1 in many cases needs to thc region. Partitioning the chip in this manner extract cell-orily tlcl'initions. 111 contrast, the previ- nlatle it easier to locate glob;~lwiring errors. ous rietlist extractor "fl;ittetletl" the layout data into one tio111iierarchic:rlcell and, therefore. extr;~ctecl Hic.t"n~-cl?icc/lNetlist Extr~~ctiol? all data witlio~~tre~rsi~ig previoi~sly extractetl ccll f\ new netlist estractor, HILES, \\i:~~LISCCI to meet the definitions. The :ict~r;il~,erh)rmance inipro\lement high d:it;i capacity requirements of the NVAX niicro- realized by H1I.I.S tlepe~ltlson the hierarchy of the ~rocessor.H1I.M is more efficient th;m the previous cllip layoi~tclcsign. If very I'ew cells are replic;~tetl. in-house netlist extractor bec;l~lsrit recognizes 1:iy or cells ;ire rcplic:~tetlin ;I way that recllrires the The WAXCPU Chip:Design Challenges, Methods, and CAD Tools

I LAYOUT I

t F -I J CUP HlLEX

t t

RESISTANCE * WAWOTH 1 1

Figure 7 Simplified Layout Verification Tool Flow

extractor to explode the cells (i.e., create more flat VMS operating system architecture. To go beyond data) to extract them properly, then minimal per- the VMS limit, the internal memory formance improvements are seen. An example of management routines within HILEX were modified the latter situation is an array of a repeated pair of to allocate additional heap from the process stack overlapping cells that forms one or more transis- (in P1 space) when the VMS memory allocation tors due to the overlap (one cell contains diffusion routines indicated that PO memory space was areas that become source and drain regions when exhausted. This technique was used to allocate the overlapped with the other cell, which contains 2.5 million virtual pages required for fuI I-chip con- polysilicon lines that become transistor gates). nectivity extraction. Several layout design guidelines were defined to ensure that performance improvements from using Netlist Co~nparison HlLM would be realized. Adherence to the guide- A utility called WLC was used to verify that netlists lines minimized situations that require HILEX to extracted from layout by HILEX matched netlists explode cells and encouraged the use of hierarchy generated from the chip schematics. Since the in the layout. However, since it was not always pos- NVAX schematic hierarchy rigorously matched the sible to adhere to these design recommendations, layout hierarchy only at certain levels, the con- HlLEX was designed to handle large amounts of flat nectivity comparison was performed flat, wLC data. employed a graph-building and graph-traversing The chip netlist was extracted from the complete algorithm that worked well for comparing less than chip layout prior to tape out. This presented quite a 500,000 device connections. However, substantial challenge since even with the use of HILEX, extract- paging occurred when comparing larger netlists. ing the chip netlist from the 225MB chip layout file Since the NVXX chip contained 1.3 million devices, in one pass required more than the maximum of the performance of KCwas inadeqtiate for full- two million virtual pages of memory allowed by the chip netlist verification.

Digital Tecbnical Journal Vo1. 4 No. 3 Summer 1992 33 NVAX-microprocessor VAX systems

To improve the elapsed time of netlist co~np;~ri- extraction batch jobs. CUP sectioned the I;IYOLI~ son b;ltch jobs, ~liultiprocessingwas employed. tl;tt;tl>;~seinto fixed-size stripes, which were HILEX was moclified to write the extracted netlist of inserted into the batch queues of niitltiple C:Prrs. the clean belt partitions. Each partition was then This method retluced the data complexity and compared by WLC, in parallel on multiple CPlls, to ;~llowetlas many parallel computations as there its ec1uiv;llent schematic-generated netlist. This were processors. During NviUi chip design, c;lp;~ci- appro;lch reduced the total elapsed time for the tance extraction was partitioned across as many as I\IVrU( chip netlist comparison from more than three 20 CPUs. Multiprocessing reclucetl, for example, the days to seven I-lours. Cross-reference files output by NVAY I-box capacitance extraction fro~n26 hours to W1.C: ;tncl the schematic netJists were processecl by ;I just 8 hours using 4 processors. Extractiorl of the separate program, MATCH-CHECKER, to verify the E-box took 40 l~oursusing one processor, but only connectivity of nodes that crossed partition bound- 12 hours with 4 CPUs. Table 2 shows the device and aries. This ;rdditional step added only eight minutes node counts of the WAX boxes (excluding the to the total elapsed time for comparing chip caches), and the CUP extraction run times on ;I MIX netlists. 6000 Model 500. Each box run resulted in ;11,proxi- ~nately500.000 extracted parasitic cap;lcitors. Capacitance Extraction The small spacing dimensions of the submicro~l Resistance Extraction <:,\.IOS-4process caused fringing and lateral capaci- Veri@ing the IWAY power, ground, ;~ntlclock net- t;inces to contribute significantly to the total nodal works, allti long signal lines required ;lccur;lte c;lpacit;lnce. The existing layout extraction tool extraction of interconnect resistance from layout. only extracted overlapping parallel plate capaci- To meet this requirement, the REX resist;unce tance. Tliits, a new layout capacitance extractor, extractor was developed.~liEX processeel the (:111', w:ts written to accurately extr;rct fringing, lilt- output of HI1,EX to produce a series ;inel p;lralJel er;~l,;inel ;trea c;~p;~citances. con1bin;ltion of resistors that modeled a node's (:[]I' extracted interconnect capacitance from interconnect. The resistor network was gener;~tecl layout by decomposing interconnect layout into by fracturing the node layout into polygons b;lsed pieces of uniform layout cross sections. The geome- on changes in the layout geometries (wiclth, length, try of each interconnect piece, and its distance 1,ends) of the nocle or the occurrence of'cont:lcts. from 1;lyers above, belon: and adjacent to it, are The effective resistance of each polygon ant1 con- used to calculate its area, fringing, and lateral com- tact, or cluster of contacts, was then determined ponents of capacitance. Thc empirical formulae from technology parameters and the polygon used to c;tlculate the capacitive co~uponentswere geometries. h:~setlon curves of two-climensional electrostatic The power and ground resistor networks were simulation data of various layout cross sections. extracted for individual boxes rather than the entire This technique produced accurate internodal ant1 chip. The resulting networks were still cluite large total interconnect capacitance data. This accuracy clue to the fine granularity of the REX extr;~ction resulteel in <:(It' being very compute intensive. process. Table 3 shows the extraction ti~nrsfor a multiprocessing was enlployed again to reduce REX job running on a VrU( 6000 Model 500 ancl the the el;~psetl turnaround time for capacitance total number of resistors extracted from each box.

Table 2 CUP Parasitic Capacitance Extraction Batch Run-time Data for NVAX Boxes Single CPU Four CPUs Box Device Count Node Count (Hours) (Hours) I-box 107,000 36,830 26 8 M-box 102,000 38,770 29 8 E-box 107,600 41,760 40 12 C-box 92,400 45,050 42 12 F-box 129,150 55,550 45 12.5

3-i Val. 4 ,\lo, .j ,YLIII/I~~~1992 Digital Tecbtticf~lJo~~t-t~~r/ The WXX CPU Chip:Design Challenges, Methods, and CAD Tools

Table 3 REX Extracted Parasitic Resistance simulation code is designed to execute in a straight- Data and Batch Run-time Data for line fashion. Due to the high switching event densi- NVAX Boxes ties we observed on NVAX, 18 percent on average, Extraction Time this straight-line compiled approach to simulation Box Resistor Count (Hours) was more efficient than event-driven sin~ulators, which typically fail to compete when event densi- ties increase beyond 3 to 5 percent. The CHANGO translation process further optimized the sim- ulation by partitioning the sitnulation according to signals that should be evaluatecl cluring each par- ticular clock phase. This avoided processing signals during clock phases when signal transitions could not occur. Further, evaluation of a switching event New Proprietary CAD Tools was only performed when the signal could affect Several other novel CAD tools were specifically the evaluation of some other signal. This prevented designed for the NVAX chip. These tools provided simulation of unimportant switching events that practical solutions to verification and analysis prob- were ignored by the remaining design. Redundant lems that were previously unmanageable or signals (i.e., nodes with the same logical behavior) intractable. were grouped together as a list of synonym signals in order to moclel multiple nodes by only one simu- lation event. CWGOLogic Simulator CHANGO was an important development for NVAX functional verification because it allowed sig- NTV Timing Verzyie~ nificantly more simulation cycles and functional NTV is a static timing verification tool developed for verification tests to be performed from the WAX use on the WAX: micropro~essor.~~'NW processetl transistor-level clescriptio~lthan was previously 350,000 circuit paths and checked 42,000 timing possible. CHANGO is a two-state gate-level logic constraints on the N\IA?( design. NTV eliminated the simulator designed to maximize performance need for the pattern-dependent dynamic speed through compiled, straight-line simulation. Elec- verification strategy used by other chip designs trical issues such as gate delay and charge sharing and significantly reduced the extensive speed ver- were not modeled since CIWGO was used ification work needed for SPICE sin~ulations.It for functional, not timing, verification. CHANGO's identified critical paths that woulcl have otherwise parallel simulation capability allowed the sirnul- remained undetected clue to the complexity and taneous execution of 13 different NVAX moclel size of the NWdesign. simulations on one CPU, which resultecl in an eight- NTV read multiple flat transistor netlists with or fold increase in simulation performance. Overall, without parasitics and automatically identified cir- CHANGO has been shown to accelerate simulation cuit structures such as cornplementar): clynamic, one to two orders of magnitucle over traditional and cascode gates as well as several types of latches. event-driven gate-level simulators. Its high through- Based on the classification of these structures, NTV put enabled us to boot the vMS operating system iclentifiecl timing constraints. For example, NT\r twice (75 million cycles) prior to tape out. checked that the latch storage nodes become valid To create a CHANGO moclel, a transistor-level before the latches closed. NTV also read user-speci- netlist description of the design was input to a pre- fied timing for primary inputs and propagated node processor called <;EN-MODEI.. GEN-MODEL trans- timing throughout the design basecl on when sig- formed the netlist into a logical tlescription of the nals arrived at gate inputs, the drive capability of design, consisting of simple Boolean elements, each gate, and its output loading. D-type latches, ancl SR . <:FUNGO transformet1 N'Ill has three delay models that were usecl for this logical description into a highly optimized sim- calculating gate delay: (1) unit delay was used for an ulation stream of VN( assembly code. initial rough timing estimate before real parasitics CHnNGO achieved its high simulation through- were known, (2) a SPICE-calibrated lumped RC put in many ways. Conclitional branch latency model was used for delay calculation of comple- penalties were largely avoicled because CHAN(;O mentary gates, and (3) an Elmore-distributed Kc:

Digital Technical Journal Vnl. 4 No. .? Summer 1992 WAX-microprocessor VAX systems

model was i~setlfor other structures." NTV flr~gged tlerivetl from average node-switching frequencies circuits that hiletl to meet tlie itlentifiecl timing calctrlatetl from logic simulation tl;ita. Over 1,800 co11str:lintswithin ri user-specified time toler;lnce. sign:~lnodes were ;ilso analyzed by \W.W01'1-[. Like other static tinling verifiers, sonie paths identi- fied by Nnwere "don't cares" or were logically Conclusions impossible. The user eliminated these false patlis Our tlesign strategies, methotls, and (:t\l> tools by tleleting timing constraint checks or by specifj.- 211 lowed LIS to complete tlie NVLY CP'LJ chip tlesign in ing niutu:ll exclusivity between specifietl groups of 30 percent less time than our initial schedule hat1 nodes. required. Typical parts operate at 8.3.3 MHz (a 12-11s cycle time) under worst-case conditions for tem- WA WOTH Interconnect A nnlyzer per;lturc ;lnd power supply This is 2 ns better tli;~ti o~~rnl;~xi~ni~ni cycle time design constraint. The 'Fradition:~I ni;unu;ll techniques for checking R<: chip bootetl the VMS operating system ten clays clela): IR noise, and electromigration (EM) were after the first prototype wafers were available, :md impractic;ll for iWrL\: clue to the size and complexity booted the IILTRIX system a few days I;tter. Multi- of the design. A suite of CAD tools called \Yi5.WOTH processing was ri~nuingwithin weeks. Fifteen was tlevrloped to perform these checks ;li~tom;~ti- obscure logic bugs were found in the first-p;~ss cally, more accun~telyand in far less time than chips. but none of them impeded system debug or would otherwise h:i\.e been possible. prototype development. No circuit design bugs During EM ant1 IR ;~nalysis,current sources rep- were founcl. Only one design revision was neetled resenting gate-s\vitcliing events miere :iddetl to :i prior to \~oluniechip ~il;u~iufact~~ring. REX-generated resistor network. The magnitude of <:arefill design. complexity management, and each current source was calculated based on the proprietary CAD tools targeted to large custom average switching frequency of the gate, the lo;id it (:&los integraletl circuits pl;lyed crucial roles in the drove, ant1 whether average or peak current wns successfi~ltlesign of the Nvi\x microprocessor. clesiretl for tlie current analysis mode. The networli node voltages were then solved through 1.1.1 tleconl- Acknowledgments position. Peak voltages were flaggetl for IR analysis. We \voulcl like to acknowledge the contributions of ant1 average ;~ndpeak current densities were calcu- tlie following pcople: W Antlerson, E. Arnold, latecl for each resistor element alicl checked against R. Bacleai~,I. Bah;u. IM.Benoit, B. Bensclineitlel; EN1 limits. I. Uernstein, D. Bh;ivsar, L. Biro. M. Hl;akovich, Ihring R(: ;mnlysis, node capacitance was I? Bouclier, W Bowhill, 1.. Briggs, J. Brown, S. Hutler, proportiotiateljr tlistributetl along the resistor net- R. <>~Ic;~gni,S. Carroll, M. Case, R. <;:~stclitio, work. The resulting RC network was processetl A. C;ive, S. Chopra, A?. Coile): E. Cooper, R. Cvijetic, by Carnegic-;Llellon's,\WE algorithin to generate :I iM. 1)el;uie); D. Ileverell, A. DiP;rce, D. I)uV;~rne!: close ;~pproxinl;~tio~iof the transfer function for R. Dutta, J. Edniondson, J. Ellis, H. Fair, <;. Fr;unces- tlie network.I2 From this, the step response I<(: chi, N. C;e;lgan, S. Goel, M. go war^, J. Grotlstein, delay w;u crrlculatcd ant1 the delay response to any I? Gronowski, W Grundmann, H. 11:1rkness, W specifietl ctlge \vas found through convolution of I-Icrrick, R. Hicks, C:. Holub. A. Jain, K. KIi;~nti:~, the tr;insfcr function. R. Kiser, D. Koslow. K. Ladd, S. Levitin. S. Martin, Since it was neither possible nor necessary to T. ivIcI>erniott, K. McFadden, B. McGee. J. Meyer. perform It<.<:and EM analysis on a11 signal nodes, M. Minartli, D. iMinel; T. O'Brien, J. Pan, H. l'irrtovi, WAWOTH cont;~inerla number of tools that identi- N. IY~illi~s,J. Picltholtz, R. Raztlan, S. S;~mutlr;~l;~, fied only tliose nodes that would have sonie cli;~ncc j. Siegel, K. Siegel, Somanathan, J. St. Lluretit, of hliling these checks. To decrease run time, we <:. R. Stanim, It. Supnik, &I.Tareila, S. Thieriu~f,M. reclucetl the size of tlie files that were input to IJhler, N. Watle, S. Wat kins, ant1 Y Yen. WAWOTH by eliminriting any devices anti parasitics that were not rel;~tcclto the node under ex:imina- References and Note tion. Noteworthy were the large data requireliients nict by \YAIVOTH. For example, WAWITH calculated 1. <;. IJliler et al., "The N\RX ant1 R'vA);+ Higli- the current through the 71C).000 resistive elements performance \'AX Microprocessors," Ili'yit~I tlirlt compose tlie power ancl ground notles of tlie fich~zical,/ol~i'i~nl, vol. 4, no. 3 (Sunimer E-box. Current stjrn~~lusof the network w:is 1992, this issue): 11-23, The NVAX CI'U Chq~:Design Challenges, ~MElhods,aizd CAD Tools

2. B. ZetterJuncl,J. Farrell, and T. Fox, ra micro pro- TechniccllJo~lrizal, vol. 1, no. 7 (August 1988): cessor Performance and Process Complexity 95- 108. in CMOS Technologies," Digit~il Technical W! Jourizal, vol. 4, no. 2 (Spring 1992): 12-24. 8. W Durdan, W. Bowhill, J. Brown, Herrick, R. Marcello, S. Samudrala, G. tlhler, and 3. J. Crowell, K. Chui, T. Kopec, S. Nadkarni, and N. Wade, "An Overview of the \IILY 6000 D. Sovie, "Design of the \rN( 4000 Model 400, Moclel 400 Chip Set," Digital Technical 500, and 600 Systems," Digital Technical Jounzal, vol 2, no. 2 (Spring 1990): 36-51. Jounzal, vol. 4, 110. 3 (Summer 1992, this issue): 60-72. 9. J. Hwang, "REX-A VLSl Parasitic Extraction Tool for ElectronligEdtion and Signal Analysis," 4. L. Chisvin, G. Bouchartl, ancl T. Wenners, "The 28th Design A LI torncftioiz Conference (1991): VAS 6000 Model 600 Processor," Lligital 717-722. Tkclnizical Jo~lrizal,vol. 4, no. 3 (Summer 1992, this issue): 47-59. 10. J. Pan, L. Biro, J. Grodstein, B. Gruntlmann, antl Y. Yen, "Timing Verification on a 1.2M- 5. R. Batleau et al., "A 100 MHz Macropipelineel Device Fu l I-Custonl <:MOS Design," 28th VtOC CMOS Microprocessor," Accepted for Design Automation Conference (1991): publication in IEEE .loul-izal of Solid State 551 -554. Circ~lits,vol. 27,110. 11 (November 1992).

6. SPICE is a general-purpose circuit simulator 11 W Elmer, "The Translent Response of program developed by Lawrence Nagel antl Damped Llnear Networks with Particular Ellis Cohen of the Department of Electrical Regard to Wide-band Amplifiers," Journal of Engineering and Computer Sciences, Univer- Applied Physics, vol. 19 (1948) 55-63 sity of California, Rerkeley. 12. V Raghavan and R. Rohrer, "A New Nonlinear 7 T. Fox, I? Grononrski, A. Jain, B. Leary, and Driver Model for Interconnect Analysis," 2Stl9 D. Miner, "The CVAx: 78034 Chip, a 32-bit Sec- Design Azllomatioiz ConJerence (1991): oncl-generation VN( Microprocessor," Digitcil 561 -566. Walker Anderson I

Logical Vmicationof the WAXCPU Chip Design

DigitalS AVRX high-perfonnance inicl-oprocessor has ct co?~zple.xlogical design. A rigorous simulation-based uerific~itioneffort was ~~rzdertakeizto ensure th~lt there zilere no logical errors. At the core of the effort were inz~llelnent6~tiolz-orie1zted, directed,pseudorandom exercisers. Tl~eseexercisers were supplemented with inzple- mentation-specificfocused tests and e-ristirzg VRX architectural tests. Only 15 logical bugs, all unobtrusive, were detected in the first-pass design, and the operating systellz booted withfirst-pass chips in a prototype system.

The WAX CPU chip is a highly complex VAX micro- Group (SEC) whose primary responsibility was to processor whose design recluiretl a rigorous verifi- detect ancl eliminate the logical errors in the NViOL cation effort to ensure that there were no logical design. The detection and elimination of timing, errors. The complexity of the design is a result of electrical, and physical design errors were left to the advanced microarchitectural features that make separate efforts.' up the NVkV architecture, such as branch predic- Given the design complexity, the critical need for tion, micropipelining and macropipelining tech- highly f~~nctioaalfirst-pass chips, and the fact that niques, a three-level hierarchy of instruction the designers had other responsibilities related caching, and a two-level hierarchy of write-through to the circuit and physical inlplementation of and write-back tlata caching.' Also, the chip was ini- the full-custom chip, special attention to logical tially intended for two different target system con- verification was considered a recluirement. Every figurations and had to be verifiecl for operation in verification engineer approachetl the verification both. Product time-to-market goals mandated a problem with a tlifferent focus. Each member of short development schedule relative to previous one group of engineers focusetl on the verification projects, and there was a limited number of verifi- of a single box, while other engineers focused on cation engineers available to perform the tasks. functions that spanned several boxes. Certain veri- The verification team set two key goals. The first fication engineers were available throughout the was to have no "show stopper" logical bugs in the project to test the fi~nctionsof the chip that first-pass chips antl, consequently, to be able to required extra attention. This variety of perspec- boot the operating system on prototype systems. tives was an important aspect of the verification Meeting this goal would enable the prototype strategy. Most verification engineers followed the system hardware ancl software developn~entteams process tlescribed below. to meet their schedules and \voultl allow more intensive logical verification of the chip design to 1. Plan tests for a function. continue in prototype systems. The second key team goal was to design second-pass chips with no 2. Review those plans with the design and architec- logical bugs, so that these chips could then be ture teams. shipped to customers in revenue-producing sys- tems, meeting this goal was critical to achieving the 3. Implenlent the tests. time-to-market goals for the two planned NVAX- basecl systems. 4. Review the actual testing with the design and architecture teams. Team Organization and Approach Logical verification was performed hy a team of engi- 5. Implement any adclitional testing that was neers from Digital's Sen~iconductorEngineering deernetl necessary.

38 Vol. 4 1Vo. 3 S~~rw~net.1992 Digital TecbrricalJotirrral Lowical Verificationof the lWAX CPU Chip Design

Simzclntion and Modeling Methodology (i.e.,board) and then system n~odels.The chip veri- The verification effort employed several models of fication team performed only a limited amount of the full NVAX CPU chip and of the individual design simulation using these module and system models. elements. Each model had its strengths and weak- These simulations were used primarily to verify nesses, but all models were necessq to ensure a that the chip model functioned correctly in a more thorough logical verification of the design. accurate model of a target system configuration and to better test the multiprocessor support filnc- tions in the design. The system development Behavioral Models groups performed more extensive simulations with The behavioral models of the chip were written by such models. design team members using the DECSIM behavioral modeling language; to achieve optimal sinlulation performance, they were written in a procedural style. These models are two-state models that are Schematics-derived models were created and simu- lated at both the box and full-chip level. The logically accurate at the CPU phase clock bound- CHANGO simulator is a two-state, compilatio~i- aries. These fairly detailed behavioral models repre- driven simulator and, like the behavioral motlel, is sent logic at the register transfer level (RTL), with every latch in the design represented and the com- accurate only at the CPU phase clock bounclaries.' binational logic described in a way similar to the The full-chip CHANGO model linked together the ultimate logic design. following: the code that was automatically gen- Rehavioral simulations were performed first on erated from the schematics; C-code models for box-level models, where most of the straightfor- chip-internal features such as control store and ward design ancl modeling errors were eliminated. random-access memories (&IS); C-code models A box is a functional unit such as the instruction to perform simulation control functions; and the prefetch/clecode and instruction cache control same DECSIM behavioral models for the backup unit, the , the floating-point unit, cache, main memory, ancl system functions that the memory management and primary cache con- were used in the fiill-chip behavioral model. The trol unit, or the bus interface and backup cache simulation performance of the full-chip CHANGO .' model was only about one-half that of the unaccel- The box-level models were then integrated into erated, full-chip behavioral model. Although these a fill]-chip behavioral model, which also included a models were not useful for electrical or timing veri- backup cache model, a main memory moclel, and fication because they did not model timing or elec- models to emulate the effects of several system con- trical characteristics of the design, their sin~ulation figurations. The pseudosystem models did not performance made them extremely useful for logi- model any one specific target system configuration cal verification. but coulcl be set up to operate effectively like any Another full-chip model was created to riun on target system configilration or in a way that exer- an event-driven, n~ultiple-state simulator. How- ciseel the chip more intensely than any target ever, only a limited amount of simulation was per- system would. Available early in the project, this formed using this model, because its performance model was the primary vehicle for logical verifica- was very slow when compared to the CHANGO and tion until the schematics-derived, f~~ll-chip,in- behavioral models. Since it was the only model that house CHANGO model was created. The hill-chip could run on a multiple-state simulator, this third behavioral model could simulate approximately 13 model was used primarily to verify chip power-up cycles per VAX vMs CPU second and was used to and initialization. simulate about one billion CPU cycles. l'he procedural, filll-chip behavioral model also Pseudorandom Exercisers ran on ;I hardware simulation accelerator where it Early in the project, it became aplm-en1 that, given achieveel simulation performance of about twice the limited number of engineers, the short sched- that of the unaccelerated simulation. The simula- ule, and the complexity of the NVAX chip design, a tion accelerator was used primarily for simulating strategy of developing and simulating all conceiv- long, automatetl, noninteractive tests. able implementation-specific test cases would be In addition, the moclel was encapsulated in an ineffective This strategy would have required the event-driven shell and incorporated into module engineers to implement tedious, handcrafted tests WAX-rnicroprocessorVAX Systems

Instead, the verification team adopted a strategy script files that contain hierarchical text generation that depended heavily on the use of directed, pseu- templates and implements the basic functions of dorandom tests referred to as exercisers. This strat- a programming language. egy generated and ran many more interesting test SEGUE provides a notation that allows the user to cases than would ever have been conceived by the specify sets of possible text expansions. Elements verification and design engineers themselves. of these sets can be selected either pseudoran- The basic structure of an exerciser consisted of domly or exhaustively, and the user can spec@ the the following five steps, which were repeated until weighting desired for the selection process. For a failure was encountered: example, a hierarchy of SEGUE templates typically comprised three levels. At the lowest level, a SEGUE 1. Set up the test case. tenlplate was created to select pseudorandomly a 2. Simulate the test case on either the behavioral or VtU( opcode, and another template was created to the CHtUVGO model. select a specifier, i.e., operand At an intermediate level, the verification engineers created templates 3. Execute the test program on a VrU( reference that called the lowest-level templates to generate machine. short sequences of instructions to cause various 4.Analyze the simulation and accumulate data. events to occur, e.g., a cache miss, an invalidate from the system model, or a copy of register file 5. Check the results for failure. contents to memory. At the highest level, these Figure 1 depicts the interoperation of the tools intermediate-level templates were selected pseutlo- used to construct an exerciser and its basic flow. randomly with varied weighting to generate a com- plete test program. Setup Because the SEGUE tool was developetl with Setting up the test case involved generating a verification test generation as its primary applica- short ~ssemblylanguage test program, activating tion, the syntax allows for the easy description of some demons to emulate various system effects, test cases and the ability to combine them in inter- and selecting a chip/systetn configuration for esting fashions. Using SEGUE, the verification engi- simulation. neers were able to create top-level scripts quickly ?'he assembly language test programs were gen- and easily that could generate a diverse array of erated using SEGUE, a text generation/expansion test cases. These engineers considered SEGUE to tool developed for this project. This tool processes be a significant productivity-enhancing tool and

SEGUE SCRIPT FlLE

DEMON AND CONFIGURATION SETUP AND SIMULATION CONTROL I

BEHAVIORAL OR CHANGO SIMULATION EXECUTION

MEMORY DUMP BINARY 1 AREA FILES 1 1TRACE FlLE I PASSIFAIL CUMULATIVE INDICATION COVERAGE DATA

Figure I Ver~yicntionTool Flowjbr Exercisers

liol. 4 ivo. .3 Sl.l~nmer1992 Digital Technical Journal Logicul Verijiccltion of the NVAX CPU Chip Design

preferretl sing SEGIJE to hand-coding Inany ant1 various internal signals, were created. As the focusetl tests. exerciser test programs execucecl, various VAY Before simulations were performed, motlel architectural state information was written period- demons were set up. Demons were enabled or dis- ically to a region of modeled memory referrecl to as abled, and their operating modes were selected the dump area. When the sirnul;~tetlexecution of pseudorantlomly. Demons tilap be features of the the test program completed, tlie contents of the lnoclel environment that cause some exteni;ll clump are;l were storecl in a file. Also, the test pro- events such ;IS interrupts, single-bit errors, or cache gram was executed under the VMS operating invalidates to occur at pseutlor;~ndom,varying system running on a Vr\X cornpilter used as a refer- intervals. 1)emons may also be niotles of operation ence machine. At tlie end of execution, tlie con- for the systenl model that cause pseudorandom tents of tlie memory dump ;trea were stored in variation in operations such as the chip bus prottr another file. col, memory latency, or the order in which data is retilrnetl. Some demons were implemented to force Analysis chip-internal events. e.g., ;I primary cache parity A tool calletl SAVES allo\vs users to create C pro- error or ;I . 'These chip-internill grams in ortler to analyze tlie contents of bin;lr). clernons li;~tlto be carefully irnplementecl, bec;iuse trace f les. SAVES was used to ~,rovitlecover;lge ;ln;ll- sometimes tliey forced ;II internal state from which ysis of tests, and to check for correct be1i;lvior of tlie chip was not necessarily designetl to operate. 111 chip-internal logic and give a pass/fail indication. a pseuclorantlomly generated test, it is frequently For coverage analysis purposes, information tlifficult or i~iipossibleto check for the correct han- such as the number of times that a certain event dling of an event caused by ;I demon, e.g.. check occurred during simulation or the interval that an interrupt is serviced by the proper h;~ndler between occurrences was ;~ccumulated across with correct priority However, simply triggering several simulations. This tlat;~gave the verificzltion those events ant1 ensuring that the design tlitl not engineer ;I sense of the over;lll effectiveness of enter so~iiecatastrophic state proved to be a 1,ower- an exerciser. For example, ;I verification engineer fill verification techniclue. who wanted to check an exerciser that was Chip/system co~lfigurationoptions such ;IS c;~che intended to test tlie branch prediction logic w;~s enables, the floating-point unit enable, ant1 the able to use the SAVES tool to measure the number of backup cache size and speetl were also preselected branch mispreclictions. pseuclorandomly. Asisit from testing the chip in all Frequently, verification engineers ilsecl the SAVES possible configurations, e.g., with a specific c;~clie tool to perforn~cross-protluct ;~nalysisand tlata dis;ibled, varying the configuration in a pseudo- acculiiulation. For cross-procluct analysis, the engi- random manner c;~usetlinstruction sequences to neer sl,ecifietl two sets of clesign events to he xna- execute in ve1-y tlifferent ways and evoked many tlif- lyzed. The ;u~i;clysisdetermined the number of times ferent types of OLI~Sunrelatetl to the specific con- that events from the first set occurred simultane- figuration. Also, specific configurations and demon ously with (or skewed by some number of cycles setlips would signific;lntly slow down tlie simu- from) events in the secontl set. For example, one lated execution of the test program, sometimes to verification engineer analyzed tlie occurrence of tlie point where intended testing was not being different types of primary cache p;~rityerrors rela- accomplished. To work arountl this probleni, the tive to different types of metnory accesses. verification engineer coultl force the configuration ,4nalyzi1igthe cross-protluct of state 11iachinest;ttes ant1 demon selection to avoitl problematic setilps. against one :inother, skewed by one cycle, allowecl state m;icIiine transition cover;lge to be qilickly Simulation and VMReference Execution understootl. After ;~ssernblingand linking the test program, it The verification team used this SAVES information was loadecl into modeled memory, and its execu- about the exerciser coverage in the following ways: tion was sin~~~latedon either the behavioral or tlie To enhance productivity by helping the engi- CkMN(;O moclel. As the test program simi~l;~tion neers identify plannetl tests that 110 1o1ig~'r took place, ;I simulation log file ancl a binary-format neetlecl to be developed ;lnd run bec;~usethe file, which cont;rinecl a trace of the state of the pins exerciser ;rlreadjr coveretl the test case WLY-microprocessor VAX Systems

To indic;itc bignificant areas of the design where CHAN<;O nioclels, and proved to be very effective at coverage may have been deficient detecting subtle, complex bugs in tlie clesign. Each exerciser concentr;~tedon testing a single box, a To dctermine how the exercisers might be subsection of a box (e.g., branch prediction logic), adjusted to become more effective or thorough, or a particular global chip function. By adjusting or to focus on a particular low-level function of the SEGIIE template weightings, preventing or forc- the chip design ing the use of a particular demon, or forcing a par- P~~ss/T.rrilChecking ticular configuration parameter. the exercisers could be controlled at a high level to focus on low- Several checking mec11;11iisms were e~nployeclto level fi~nctions.Verification engineers traded inter- cletermine whether tests p;lssed or klilecl. The SAVES esting SEGUE templates among themselves to tool was i~sec!to check for correct behavior of the provide each exerciser with a rich and diverse set design. especially where correct behavior was cliffi- of possibilities for code generation, while still cult to observe at a Vr\X architectur-;~llevel. For maintaining the intended focus of tlie exerciser. example, the verification engineers ~~seclSAVES to check the proper hlnctioning of performance- Focused Tests eliliancing fe;lti~rcssuch ;IS branch prediction logic, Several focused tests were generated to supple- pipelines, and caches. ment tlie exercisers. These were necess;lry to test i\ i\ V&lS command proceclure automatic;~Ily i~iiplerne~itatioli-specificaspects of the tlesign that scanned simulation log files for error output fro111 coulcl not be checltetl by eoriip;~ringresults against any of sever;~lclesign assertion checkers built into a VtU reference machine. 111 some cases, ;in exer- tlie niodcl. These ;~sscrtionclicckers variecl witlely ciser col~ldhave been used to test a particul;~rfi~tic- in complexity For cx;lmple, simple assertion check- tion, but the verification engineer judgecl it easier ers ensuretl tli;~tunused clicoclings of multiplexers' to hantl-cock a focusecl test program than to con- select lines ne\ier occurred. As another examplc, ;I trol an exerciser in orcler to accomplish the testing. more sophistic;lted ant1 coliiplex assertion checker Focused tests were necessary and particularlp chal- verified that the CPU had maintained cache colier- lenging to create anel maint;lin when very precise cnce and proper subsets ;\mong tlie three caclies timing of events was required to test a certi~insce- and the main memory. nario of chip operntion. This timing could be The same \'%1S cornlnand procedure checked the achieveel only by hanclcrafting an assembly lan- simulation log file to verify that the simulation of guage test and running it uncler caretillly controlled the executio~iof the test program reachetl the simulation conditions. proper completion . Finally, a sirn- Each of tlie focusecl tests was run at least once on ple program compared the memory dump area files the fi~ll-chipbehavioral moclel and then again generated by the simul;~tionand the reference on tlic hill-chip <:HANG0 model. 111:ichine execution to verify that the memory clump areas were iclentical. Although thc simulated test Other Tests program may h;lve followetl a different cxccution Several tests that had been used for the verification p;ltIi from the VL~reference execution because it of previous UhX implementations were also ilsecl was simulated in the presence of demons, the com- for verificiltion of the hVrU( <:HIchip. The use of pletion points of both executions were the same, these tests allowccl the NVAX logical verification :md theVrl\; architectural state informntion that was team to focus on the implementation-specific corn- co~iiparedwas iclentical. plexities of the N\i.\X tlesign and riot expend as If these checks founcl no errors, the exerciser much effort on iniplenient;~tion-intlepenht,\'AX loopecl back to generate another test case. Recailse architectural verific;ltion. this whole process was ;~utorn;~ted,the verification The H<:ORE suite 01' tests can be used to verify engineer coulcl run this test continuousl); on ;III sever:~lpermutations of all \'AX instructions, :IS well ;rv;~ilablecompi~ting resources. as some \Om architectural concepts, e.g., memory man:lgement.? HCORE was valu;~l>lein that it w:ls the Olher Aspects of the Exercisej8s first test usecl to debug both the Cull-chip behav- Thc csercisers were tlie core of tlie NVAX CPU chip ioral moclel and tlie (:MUGO motlel. verification effort. They were run nei~rlycontinu- Srn;lll portions of tlie HCORE suite were used as a ously throughout the project on be11;lvioral and/or nightly model regression test. I11 general. very little Logical Vcrificntion of the WAX CPU CI?i) Desig~z regression testing of tlie NVAX models took place; simulation process and a box-level CHANGO the tearn believed that using computing resources simulation process iinder the \'his operating system to run pseudorancloni exercisers and other new and then comtnunicating between the processes. tests was of more value to the verification effort Stimulus and response data from the behavioral than consuming resources with extensive, frequent simulation is used to drive the inputs to and check regression testing. Consecluently, the entire HCORE the outputs from the box-level CHAN(;O model. In suite was rut1 at only a few key checkpoints during addition to comparing primary outputs from the the project. box, this technique was used to compare many AXE is a VAX arcl-litectural exerciser that pseudo- chip-internal points. The I-'OTF technique elirni- randomly generates single-instruction test cases.4 nated the need to extract and maintain large pat- MAX is an extension of AXE that generates multiple- tern files from behavioral simulations and proved instruction test cases with complex tlata dependen- to be a straightforward way of comparing tlie two cies between the instructions. Both tools set up motlels. Exercisers and focused tests were run enough vAX architectural state to prepare for a test using the POTF method, and several bugs were case, simulate the test case on a model, execute the quickly and easily isolated. Because a close correla- test case on a VhX reference machine, compare VAX tion between the behavioral models and the imple- architect~~ralstale i~iformation from the sin~u- mentation as represented by the schematics had lation ant1 VAX reference execution, ancl finally, been maintained, few conceptual, logical design report any discrepancies. Each test case may force errors were found by the box-level, POTF simula- some number of exceptions; the AXE and MAX tools tions. These simulations were, however, extremely ensure that all exceptions are detected and prop- useful for finding simple schematic entry errors. erly handled. The tU(E and MAX tools generate tests with no Full-chip CHANCO Simulation knowledge of the particular VAX implementation Next, the team constructed tlie fi~ll-chipCHANGO being testetl anel thus differ from the implernenta- model. The sirnulation environment of this model tion-specific exercisers. Consequently, IYXE and inclurlecl many features available in the behavioral MAX are less effective than the implementation- model environment. After simulating the HCORE specific exercisers for intensive exercising of per- suite of tests, all the focused tests were run on formance-enhancing features that are transparent the full-chip CHANGO model, and the exercisers from a VAX architectural perspective. However, were run on this model for several weeks. In adtli- iMkY was an effective test for the micropipelin- tion, 44,000 AXE cases and 33,000 &LAX cases were ing and macropipelining aspects of the WAX run on the full-chip CHAN<;O model. All these simu- design. Altogether, about 706,000 hXE test cases lations uncovered only one additional schematic and 137,000 &bu test cases were run on the beliav- entry error. ioral model. Simulation of the VMS Boot Process Schematic Vm~ication To ensure the success of operating system booting, An initial goal of the NVAX CPU chip verification i.e., initial processor loading, on first-pass chips and tearn was to perform a more extensive verifica- as a final functional test of the design, members of tion of the schematic design than had been accom- the architecture team simulated the VMS operating plished in the past. Hecause of the development of system boot process on the full-chip CHANGO the CHANGO simulator, with its significant perfor- model. The operating system source code was mance advant;rge over previously used logic simu- modifietl to add support for the Nva-specific fea- lators, the team met this goal. Approximately 75 tures and for the modeletl system environment. million WAX C1'1J cycles were simulated on the A ViMS system clisk that contained the changes was schematics-derived, full-chip (:HANG0 tnodel. created on an existing VAX systeni. Each block of the disk was copied to a VMS file, which was then Box-level CCHANGO Simulation used as the system disk image during sitnulation. First, box-level CHANGO motlels were constructed A disk model with a simple programming inter- antl tested using a technique called patterns- face antl a tlirect memory access (Dm) capability on-the-fly (POTF). This technique involved simul- was added to the simulation environment of the full- taneously starting a full-chip behavioral model chip CIWNGO model. The disk tnodel read blocks

Digital TechnicalJoztrrtal Vol. 4 No. .$ S~irj~rrler~1992 NVAX-microprocessorVAX Systems from the system disk image file, ancl wrote data to a intormation about a bug was col lectecl on the proto- small cache of internally maintainetl tlisk blocks. To type system, ant1 then the failing scenario was accelerate disk transfers, the disk model woultl reproduced on the behavioral model, where the examine cache state and use tlie system bus for the scenario could be analyzed and better untlerstood. disk transfers only when the data was present in the The chip-internal signals were extremely difficult cache and rec]uired a cache invalidate or write back. to observe, but a 12-bit, allowed access In other cases, the data was transferred directly into to one of eight sets of signals from various sections the memory subsystem in zero simulatetl time. of the chip. The ability to monitor the control store Wiile tracking the progress of the simulation, address bus by means of this parallel port proved to the team itlentified operating system code tlrnt be an essential debugging feature. executed a time-consuming search algorithm. To The control store patching mechanism that was limit the amount of time spent in this loop, the part of the chip design helped identify some bugs code was rewritten to implement a much faster in prototype chips. The debugging engineers suc- algorithm. However, because the booting simula- cessfully used microcode patches to work around tion effort could not be restarted from the begin- several of the Ii;~rdwareant1 micrococle bugs. In ning, several utilities were clevelopecl that allowed cases where a microcode bug was patcliecl, exten- tlie code to be replaced in the system disk image sive system testing verified that tlie planned change file and in simulatecl memory during a pause in the was correct. simul;~tion. To provide the ability to restart the simulation Bug Tracking and Design Release effort ant1 move it to any available computing resource, simulation state was saved after every Bug detection was a key status i~ltlicatorthrough- 50,000 to 100,000 cydes of simulation. In total, out the N\l= logical verification effort and thus approxim;ttely 25 million cycles were simulated. helped to steer the team's work. Hugs were tracked The si~ni~lationwas stopped at the point where mul- carefi~llj.with an on-line system ant1 analyzed each tiple processes were created ant1 the main start-up week to consider trends, successful and unsuccess- process began executing. Even though this effort hi1 bug-finding techniques, and bug hot spots, identified no bugs in the design, it tlid provide a high wl~ichrequired additional attention. The bug detec- degree of confidence that the design was ready to tion rate was fairly constant throughout the project be released for fabrication of first-pass chips. at about 22 per month, with the exception of the last month in which the rate clropped to nearly zero. An analysis of the bug-cletecting effectiveness Prototype Chip Verification of eacli testing technique shows that all test tech- The prototype NVm chips were verified in several niques were effective ant1 seemed to co~iiplement VAX 6000 Motfel 600 multiprocessor systems. The each other. Table 1 shows the percentage of bugs <:Pu motlule was the only new liarclware coni- detected by each technique. This table inclutles ponent in the system; tlie backplane, memory and data on the ever-valuable, nonsirnulation verifica- 1/0 subsystem were know~ito be robust, because tion technique of simply reviewing, inspecting, ancl they were used in the Vm 6000 Model 500 system. cliscussing the design and its many representations. One logic analyzer WAS connectetl to the system The decision to release the design for fabrication bus, and another was connected to the pins of the of first-pass chips was a consensus decision made WILY chip. by t.tle verification, architecture, and design teams. 'I'he strategy for thc early prototype verification From a verification perspective, tlie design was was to boot the low-le\~clconsole user interpace, ready for release when the bug tletectiori rate run the HCORE suite of tests, boot the VMS operat- remained at zero for several weeks ant1 the majority ing system, and then run the User Environment Test of the planned tests had been inlplemented. The Package (UETP) sytenl exerciser. \Vithin 10 clays of verification of some areas of the design was receiving tlie first prototype chips, all tliese tasks deferred until after the release of the first-pass had been accomplished. Later, the AXE ant1 MAX tlesign. The development team decitled that any exercisers were run 011 the prototype systems. bugs that might be found in tliese areas would not The rigorous testing that continued 011 prototype have a significant negative impact on the system systems revealed ;I few logical bugs which hat1 gone development schedule, whereas additional delay in i~ntletectedduring simulated verification. Typic;~lly, releasing the tlesign would.

44 Vol. 4 No. 3 Slrt~r~ner1992 Digital Tecb$ricalJorrrrrnl Table 1 Bug Detection Using 1. One simple bug was cletectecl by running the Various Techniques HCORE test suite on the prototype system with Percent of the flo;~ting-pointunit (F-box) clis;tbletl. This bug Technique Total Bugs Found coultl have bee11 fount1 in the same way through simul:~tion,but the test suite was not run as a Focused tests 28 final regression test with the F-box disabletl. In Directed, pseudorandom general, focused tests like H(;ORE were not run exercisers 23 with varied cliip/system config~~r;~tions.The vcr- Review, inspection, ification team concludecl that ;~llfocused tests observation, thought 20 shoultl be run with different chip/system config- Detection technique unknown 12 urations. At the minimum, ;I configuration that AXE 8 disables all possible functions should be tcstecl. MAX 6 HCORE 2 2. Another bug was discovered because the (:l'll chip generated spurious writes to memory in the Other 1 prototype system. The exercisers probably tlitl generate the contlitio~isnecess;~r)~ to evoke this bug; however, the spi~rioi~swrites went unno- Results and Conclusions ticed. It is extremely tlifficult to verify th;~t;I machine cloes everything it is supposetl to tlo O~ily15 logical bugs were found in the first-pass anel nothing more. Adtlitional ;usertion checkers I\~JAXCl'li chip design, al.1 of \vIiich were eitlier e;is- or monitors in tlie ~iiodelsmight detect sucl1 ily worked ;lrountl or did not impact normal system bugs in the f~rti~re. operation. 7'1ie nature of the bugs found in the first- pass design ranged from straightforward bugs that 3. A third bug was evokecl when a prototype escapetl tletection for clear-cut reasons to extremely system exerciser executetl ;I translation buffer complex bi~gsthat required hours or weeks of rig- inv;llid;~teall (TBIA) instruction under cert;~in orous prototype system testing to uncover. Some of conclitions. On a real system, the Tslh instruc- the bugs escaped cletection during simulatecl verifi- tion is ~~sedonly by the operating system. In our cation for cl;lssic reasons such :a: verification effort, the pl'lllA instruction w;~slittle used by the eserciserh t11;lt were simi~l;~tecl. Testing of the filnction was perfornietl juht Operations that are perfornietl only by the oper- before release, in a hurried manner. ating system should not be untlerempIi;~sizt.d in exercisers. Simulation performance prohibited running a certr~intype of test c;lse. 4. One first-pass bug was rel;~teclto tlie halt inter- rupt, which js usetl only tluring tlebugging oper- A test was not run in a certain mode due to the ations. The halt interrupt receiveel minini;~l difficulty of running it ill all possible modes. testing and was not testetl at all in any type of exerciser. Iliscovering this bug was especially It took an exerciser running on a simulator a ;11111oyingbecause a similar bug hat1 escr~petl long time to encounter tlie conditions th;~t detection by the initial logical verification effort would evoke the bug. for a previous VXX implementation. This turn of A test was inaclvertently tlropped from the set of events reinforces the belief tli;lt there is \r;~luein exercisers tl~;~twere run contini~ously. reviewing the escaped bug lists from other proj- ects. Also, tluring the verification effort, tliere Details ;lbout five of the more interesting bugs seemed to be a natural, but erroneous, tendency found in the first-pass design follow. Included to untlertest functions i~sedinfrequently or not is information about how the bug was detected, at all tluring normal system operation. Such a hypothesis on why the bug eludetl detection functions sonietilnes require extra ;ittention, before first-pas5 chips were kll>ricated, ;~nclles- becausc tl~eymay be quite complex ;~ntlm;ly sons le;lrnecl from the detection and elimination have been given less carehil thought tluring the of the bug. design process.

Digital TecbrricnlJournal 161. .I .\'o. .3 Slr~n~rlo.I992 4 5 RnTAX-microprocessorVAX Systems

5. A state bit that needed to be i~iitialized011 power- development and support of tools. l'he <:MANGO up was not. This problem was noticed during simulator would not Iiave been ~~ossiblewithout initialization simulation but erroneously r;rtio- the significant contributions of Kevin Ladtl. Will nalized as being acceptable. Design assumptions Slierwood provided high-quality top-level guid- and assertions about initialization shoultl be wr- ance tliro~~gh;ill pli;rses of the project. The s!.stem ified through sirnul;~tionor other means. development groups performed system-lelrl simu- lations ;md rigorous prototype testing. The VAX Overall, the WAS (:I'll chip logical verification Architecture <;ro~~l'rLYE/&kY team, once again, effort was a success. 'The pseudoranclom testing provided ;mtl supportecl an effective verification strategy tletectecl several complex ancl subtle logi- tool. Lastly, tlic success of the project ;mcl the final cal bugs that otllerwise probably nroulcl not h:lve quality of tlie NVAX chip logical design ;Ire as m~lcll heen detected by simulation. Tlie extensive si~iii~l;~- a tribute to tlic work of the \WAX ;~rchitcctureand tioti perfornied on the schematics-tlerivetl nioclel tlesign tt';Ims ;IS the), are to the work of tlic verifica- of the chip provided a high degree of cnnficlence in tion team. tlie design. Tlie goals of producing highly fi~nction;~lfirst- pass chips and bug-free, second-pass chips were References both met. Neither the bi~gsin first-pass chips nor 1. G. Ilhler et ill., "The NVta ancl Nvi\S+ High- their work-arounds impeded prototype system perform;ulce VAX' Microprocessors," 11;iXli~al debugging in any significant way ;111cl first-pass Tccf)r~ica/Jonnzal, vol. 4, no. 3 (Summer chips with work- rounds were used in prepro- 1S)92. this issue): 11-23. duction, fielcl-test systems. The verification te:rm correctecl the 15 first-p;iss design bugs for second- 2. D. l>oncliin et al., "The N\XS <:1'1: Chip: p:iss chips, which were sliiypecl to customers in Ilesign <:li:~llenges,Methotls. i~ncl<:AI) Tools," revenire-protlucing systems. L)icyit~t1llxY~r~ical Jour-rial, vol 4, no. 3 (Summer 1992, this issue): 24-57 Acknowledgments 3. R. (:;ilc;rgni :rnd W Sherwood, "VAX 6000 The WAX logical verificr~tioneffort wils perfi)rmetl Motlel 400 (:ITl Chip Set Function;il Ilesign by a team of engineers from the SEG microproces- Veri ficatio~i,"Digital T~cI?I~~c.LII,JOLII.I?CII, sor verification group. Members of this teirm vol, 2, no. 2 (Spring 1990): 64-72, included Walker Anderson, Rick Calcagni, S;lnj;~), Chopra, and John St. Laurent. h!kY architects Mike 4. D. RIi;~ntlark;~l;"Architecture ,Ll;~n;~gernentfor G'hler and Debra Bernstein provided extensive tccli- Ens~~ri~igSoftw;l~-e Compatibilit\r in the \Jiu; nical direction and assistance to the verification Fan~il)~of <:omputers," /I;/:'/:' Conzll~iter team. Tlie SEG CAD group helped by its continual (Febr11;rry 1982): 87-93.

46 W. 4 jVo. .3 Strmmer 1992 Digital TecbrricrtlJorrrrrnl Lawrence Cbisvin &egg A. Bouchard Thomas M. Wenners

The VAX 6000 Model 600 Processor

The rMoclel600 is the newest member of the VM GOO0 series ojXbI[2-based,/n.ultl- processing compziters. The Model 6OOprocessor integrates easily into existing pht- forms. Each processor module provides 40.5 SPECmarks of pe!fonnance made possible 031 the ~VrrAxCPlJ chip. The major VLSl interfnce chip, called IVEXIVII, was opecited using Lligztal's inter-rzal ClVIOY-3 design and layout process The ability to clesigrz and fabricate the i~zterrfncechip internally uns critical to delzueri~zgn z~~ork- ing CPUprototjpe nodule on schedzile. The aggressive modzile timing goals ulere nzet by emnplcying preuiotis lnodule experience in combinntion rvith exterzsive SPICE sinz ulcition

The Model 600 system is the latest addition to Table 1 Comparison of CPU Performance Digital's VN( 6000 family of midrange symmetric VAX 6000 VAX 6000 multiprocessing computers.' The Model 600 was Benchmark Model 600 Model 500 desig~ieclto be integratecl cost-effectively into an existing VAX 6000 platform, and to provide a signifi- SPECmark3 cant performance improvement over the previous SPECint generation of systems. Table 1 compares the perfor- SPECfp mance of the Model 600 with the previous genera- VLlPs tion Model 500 on several important bencl~marks. Notes: SPECmark is a quant~tativemeasure of performance, The powerfill NVAX single-chip microprocessor determined by running a suite of 10 benchmark programs. enables this level of performance.' SPECint defines the performance on the subset of the tests that are integer intensive, and SPECfp defines the floating-point intensive tests. A VAX-1 ln8O system has a performance of 1.0 Design Goals SPECmark by definition. The SPEC tests used for the table 'The primary goal of the project was to deliver a values are from the System Performance Evaluation Cooperative, module that included the appropriate support Release 1. VUP is a VAX unit of performance. One VUP equals the performance of a VAX-11/780. functions, performance, and v'6000 system com- patibility. Equally important, the module had to be deliveretl on schedule to prevent an adverse time- This paper relates the background ancl design to-market impact on the program. This includetl process for the VC 6000 Model 600 processor the delivery of a working prototype module before module. The module and its system context are the first WtD( microprocessor chips were available, described, as well as many of the design decisions since a vN( 6000-based platform wor~ldbe used to that were made during the project. The first section debug the initial N\IN( CPU chips. Furthermore, the describes the nlodule in general terms, along with prototype moclule had to allow the VMS operating some of the trade-offs and choices ;~ssociatrdwith system to be bootecl and tested. its development. The second section foci~seson the Our goals were achieved. When the NVAX CPU NEXMI support chip, which is the prim;:ry interface chip, the prototype moclule, and the NEXMI sup- device positioned between the NWCPIJ data and port applications specific (ASIC) address lines (NDAL), the XM12 system bus, and the were integrated for the first time, the harclware support ROMBIIS. It eliscusses the very worked almost immediately. The software team large-scale integration (VLSI) design process used to was well prepared, ancl the fill1 VMS system boot create ant1 verify the functions of this interface. The took place 11 days after the hardware was put final section details some of the physical aspects of together. Moreover, the first-pass modules were the module design, including the important work used for all the system debugging, and were of suffi- performed to ensure good module signal integrity cient quality to be used for system field test. ancl thermal management.

Digital TechrticalJoumnl Val. 4 No. .3 Sctn?n~zer1331 IWAX-microprocessorVAX Systems

Description of tbe Processor Module Figure 1 shows the VAX' 6000 Model 600 CPU mod- ule. Figure 2 is a block diagram of the module, showing the m;~jorsubsect~ons. The CPU module contains two VlSI components: the NV' micro- processor and the NEXMI ASIC interface chip. The module also holds the backup cache static random- access memory (SRAIvI) devices, the XMI2 corner bus interface, and the supporting logic necessa1-y to implement a ViLV 6000 processor node. The NVkV CPU directly controls its external backup cache SRA~MS.When the tlata is resident in the cache, the NVAX CPU modifies it there, and implements a write-back sc11erne.When the data muses in the backup cache, or when an 110 access is necessary, the IWXX CPU places the commantl 011 the NDAL, along with any associated control and data information. The NEXMl accepts and processes the command The NEXMI VLSI chip provides an interface to the Figure I VM6000 Model 600 functions necessary to integrate a VAX' 6000 proces- Processor Mod~lle sor module into an mI2-based system. In particu- lar, the NEXM1 chip: Forwarcls invalidate traffic to the ~WIUCl'U for lookup and potential write back Translates the NDAL bus commands to XMI2 bus commands Controls NDAL bus arbitration

= Returns read data from the %MI2 bus to the NVN( The NESMI contains a programmable interval CPlJ (via the NDAL) timer, reset logic, halt arbitration logic, and secure

XM12 BUS &@lQlulADDRESS TAG STORE t

NEXMI ASIC OSCILLATOR NVAX CPU * CORNER BUS - El- CHIP INTERFACE t 0DATA CACHE

Figure 2 Block Diagrulnz cftl'tlne A4ode1600 iVIod~~le

48 Vol. 4 'Vo. 3 Sif/)t/rzer1992 Digilnl Tecbrrical Jourtral The VAX 6000 Model 600 Processor console logic on chip. It accommodates the rest of non-writeback queue services both tlie Xi1112logic the support fi~nctionsthrough the ROMBUS. and the SSC logic. Depending on the function, tlie The ROMBUS is controlled by and interfaces to the SSC logic might then send the comrnancl on to the NEXMI; it supplies a path for the low-speed devices ROMBUS. Each queue is loaded in the NDAL time that are necessary to create a working computer. domain, and a request signal is sent to the appropri- The devices on the ROMBUS are off-the-shelf parts ate %MI2 or SSC logic for processing. that contribute specific functions, including ROM On read cornniands, data is returned from the for the boot diagnostic and console program, a XI12 responder queue or SSC control section, and stack Wi for storage of diagnostic/console forwarded to the NDAL through the NDAL transmit dynamic information, an electrically erasable pro- section. The NDAL is then requested, and when bus grammable read-only memory (EEPROM) for saved access is granted the data is driven onto the NDAL to state, a universal asynchronous receiver/transmit- be accepted by the NVAX CPU. ter (UART) for console communication, an input The XMI2 logic also sends potential invalidate and output port, and a time-of-year (TOY) clock. addresses to the h'vm, where the information is The bare module itself is a sophisticated compared with the existing tag address in the printed wiring board (PWB) with the following indexed backup cache block. An invalidate address characteristics: is nothing more than the address associatecl with 11.024-inchby 9.18-inch module size a command initiated on the XM12 by another pro- cessor. If the cache block matches, and if the WiM' 10-layer module with 0.093-inch thickness must relinquish control of the data, the block is 4 signal layers, 4 power/ground layers and either invalidated or written back. The choice 2 component/dispersion layers depends upon the type of transaction and the state of the cache block. 0.010-inch vias In the Model 600, the NDAL arbitration is handled 5-mil etch/7!5 mil space minimum for signals by tlie NEXMl chip. Tlie NVAX CPU does not imple- 10-mil etch/40 mil space minimum for clocks ment the NDAL arbitration on chip because tlie microprocessor must acco~iiniodatemany differ- NEXMI Support Chip ent types of system platforms. A method of arbi- tration that is fair and efficient on one type of The NEXMI chip is the routing and control interface system (e.g., a single processor workstation with between the three major buses that reside on the several potential NDAL master nodes implemented VAX 6000 Model 600 processor module. A func- in off-the-shelf programmable devices) might be less tional diagram of the NEXMI is provided in Figure than satisfactory for another type of system (such 3. The three major buses are the NDAL, X\II2, and as the NEXMI). ROMBUS. The NEXMI chip contains the following The NEXMI and the hWCPU are the only two functions, each function is associated with one or nodes on the NDM in this implementation, so a sini- more of the three major buses: ple priority scheme is used. The NEXMI always has NDAL bus arbitration hi'ghest priority, since it is either returning data or forwarding Xi112 bus transactions for potential NDAL receive and transmit logic invalidation and write back. In both cases, some NDAL/XMI2 queues other entity on the bus is actively waiting for the Xi12 clata/invalidate responder queue information to be returned. XM12 bus control logic Choice of NEXrMI Technology System support control (SSC) logic The NDAL is the WAX CPLJ external interface bus. It The NDAL receive logic latches and decodes com- is a 64-bit, bidirectional, multiplexecl address ancl mands and data from the NVLY CPU, and routes data bus that runs synchronously with the CPU. them to one of two queues. If the command is Although it is significantly slower than the internal a write back, consisting of an address and 32 bytes CPU speed (an NVAX with a 12-nanosecond [ns] of write data, it is placed in the XMI2 write-back clock cycle has an NDAL with a 36-11s cycle), it is still queue. All other commands (reads and 8-byte aggressive in many respects, and presented a chal- writes) are placed in the non-write-back queue. The lenge to design using standard parts. Tlie NDAL

Digital Technical Jourrrnl Vol 4 No 3 Sc~rniner1992 , RECEIVE I WRITE-BACKQUEUE I

KEY:

ADR CMP ADDRESS COMPARE PAR GEN PARITY GENERATOR CMD DEC COMMAND DECODER REQ CTL REQUEST CONTROL NDAL ARB NDAL BUS ARBITRATION XCI XMl CHIP INTERCONNECT PAR CHK PARITY CHECK

Figure 3 Block Diagmwz oJtl9e NEXMI Chip The VAX 6000 Model 600 Processor arbitration for the next cycle happens in parallel direct contact with the VLsI design, process, and with the data transfer for the current cycle, and the manufacturing groups. As our design progressed, fill1 request and grant loop must be performed in we had constant co~nmunicationwith the people one NDAL cycle. In addition, the NDAL data path is who wrote and supported the computer-aided bidirectional. An NDAL master must be able to trans design (CAD) tools, the people who had designed mit its data to the receiver within a single cycle, the circuits and standard cell library elements, and then allow time for the bus to become tristate the people who would eventually fabricate the before the next master can drive the bus. chips. At any stage of the project, we could deter- The original product plan was to use several gate mine how to obtain a performance advantage with- arrays to implement the module control logic. One out risk to either chip yield or product reliability. gate array would have contained the Xi12 interface When a tool problem surfaced, or when a tool did logic, and the other would liave controlled the not do exactly what we needed, the issue would be system alpport functions and interface. It became immediately addressed and resolved. clear early in the project that the external I/O cells By having easy access to the individuals who had of the gate array could not meet the timing man- detailed knowledge of the circuits, we could attain dated by the NDAL specification. Furthermore, we the maximum possible speed and density from were unable to obtain timely and accurate SPICE the design. We benefited from the experience of models of the gate array output stage, which com- the internal full-custom design comnlunity, and pounded our general difficulty in determining the profited from access to its complete and accurate design trade-offs for the module.-l SPICE libraries. Consequently, we were certain of The use of an external commoclity gate array how we could obtain a design advantage, yet still implied another design-related drawback. The nor- allow the chip to perform reliably under worst-case mal gate array design process included the submis- conditions. sion of our design for chip layout after we had The Digital semicustom design process completely verified it for both logical correctness (described in more detail in the section NEW1 and estimated timing constraints. The timing was Design Process) provides constant feedback on cir- estimated since the actual timing could not be cuit layout. Therefore, our functional and logical known until the ASIC was routed. The gate array timing simulations were always up-to-date with routing would be performed by the vendor, and accurate gate and wire delays. By the time we were actual delay numbers would be used to verlfy the ready to freeze the design (after we had verified design again. This sometimes meant changing logic it for logic and timing correctness), only one interconnections to fix timing-related violations, regression run was needed on a minor change from especially if the design team had been aggressive in the last iteration. This approach prevented last- either the chip cycle time or gate usage. The design minute surprises, and resulted in a smooth transi- would then have to be verified again, and sent back tion from design description, through layout, and to the vendor where the process was repeated until into fabrication. everything worked at speed. Consequently, we decided to design the NEW1 NEXMI Design Issues ASIC chip using Digital's CMOS-3 standard cell pro- During the design of the NEXMI chip, several inter- cess at the Hudson, &LA site. Once the decision had esting problems were addressed. been made to use the internal process, we realized other significant benefits. We believed that we Interblock Control Signal Synchronization The could collapse the design into one package, since clock that drives the NVAX and NEXiiI chips runs at we could makc use of Digital's chip expertise to 36 ns, and is not synchronized to the 64-ns clock create full-custom sections where necessary. We that drives the mi2bus interface. Without any spe- would know early in the design cycle if our assump- cial care, the data that is generated in one clock tions were wrong, since the design flow uses a sub- domain can cause the target latching device out- chip approach. The approximate size of each puts to enter a state called "metastable." This state subchip is known as soon as the first pass of the is characterized by oscillations, or an output volt- structural design is finished. age level that is neither high nor low for an Another major advantage of using Digital's inter- extended period of time. This can cause unreliable nal CMOS design process was that it afforded us system operation.

Digital Technical Journal Vol. 4 No. .? Summer I992 51 WAX-microprocessor\'AX Systems

T'radition;rlly. a synchronizer is provided for the stare machines, interblock control signals, and control signals between two different clock queue head and tail pointers. \Ye then grouped domains to allow communication without c;lusing them into similar units so that related signals were metastability. A synchronizer is a casc:~tletlset of visible within one control group. Internally, the sig- latches, allowing the first latch to go metastable, nals were segregated within their boxes, and routed but characterized so that it settles clown beh)re the so that they ditl not ;~dverselyaffect the top-level secontl latching device is sampled. The drawback chip routing. to a synchronizer is that it increases the latency of con~munication due to the cascatletl latching NEXMI Design Process devices. The NES&II chip contains 250.000 transistors in a There are no inexpensive anel easy remedies for die that measures 0.595 inches by 0.586 inches. It is latency of communication when the system goes housetl in a custom-designed, 339-pin, ceramic pin from idle to active. However, we decided to reduce grid array (P<;A). This section describes the NEXMI the synchronizer latency on a busy system. Our chip design methodology and the unique <:AD tools methotl optimizes the case where information is that were used to design Digital's largest <:MOS-3 in either the NDAL queue or the XM12 respontler semicustom chip. (The chip is physically larger ancl queue, waiting for transmission when the current has more transistors than any previous standard command has been successfully transmitted. For cell produced using the Digital semicustom pro- these situations, we created a set of round-robin cess.) This section also covers many of the tratle- synchronizers, with a ring of request and done sig- offs made during the chip tlesign, and explains the nals in each direction. While one request/done pair reasoning behind our tlecisions. is being serviced, the pipeline overheat1 for the next The design of the NESMl chip can be character- pair is hidden by the overlapping synchronizers. ized by three m;~jorphases: Behavioral mocleling IVEXICU QL~~LL~SFor tlie three major qucues in the NESPII chip, we determined reasonable queue sizes Structural design based on our chip space constraints ancl perfor- Pllysical chip implementation rn;unce simulations. System testing on tlie real hartl- w;rre during our debugging ant1 systern integration 'The three design phases of Digital's internal phase confirniecl that our decisions were correct. CMOS design process significantly overlap each None of the queues cause perh)rrnance clegrada- other. Each tlesign stage is explained ant1 analyzed tion on an actual running systern. in this section. We decided to make the non-write-back and the XLMI responder queues into integrated queues Behavioral Moclelirzg Phase rather than have separate queues for each function. The first rni~jordesign effort focused on describing Queuing theory shows that a shared resource is bet- the chip functions at ;I high level of abstrac- ter utilized when only one queue is served by the tion. This is normally referrecl to as behavioral or nest ;~vailableresource, rather thati liaving a sepa- functional mocleling, ant1 the design team ilsetl rate clileue for each resoi~rce.~ Digital's internal li;~rclwaredescription I;rngilage, DECSIM-BDS, for this t;rsk.(>Functional design sec- Visibility Port A problem that always exists with tions were allocateel to different design engineers, dense, complex integrated circuits is how to tliag- and interfunctional block boundary descriptions nose problems that happen in an actual running were specifiecl. system that do not show up during simulation. A behavioral modeling strategy was attractive for Altlio~~ghno such problems appearetl in the NEXMI many reasons. A wel I-defined liierarchical partition chip, the designers wanted to provitle visibility to of the design was quickly realized, and s~nallersub- ;IS many internal states as possible. To this end, we sections (or subchips) within the chip were identi- cre;~teda parallel port to provide visibility to some fied. Behavioral models of each subchip were important internal signals. initially tlevelopetl to prove firnctional correctness. Given the size of the design, we could provide As the tlesign progressed, these subchips were only the minimum number of signals for visibility. replacetl by functionally equivalent gate-level struc- W/e tried to predict the internal sign;rls chat might tural models. IJsing this mixed-mode functional help diagnosis and adjustment, such as the internal simulation strategy, each tlcsigner could progress

5 2 Vol. 4 IVO. 5 SIIIIIIII~,~.1992 Digital Techrrical Journal The VAX GOO0 Model 600 Processor at his/her own pace, from behavioral definition tion process, we were also able to take advantage of to structural implementation, independent of the a wealth of full-custom knowledge providetl by our status of other subchips. support groups. One of the original reasons for The behavioral modeling effort identified many using the internal \TSI process was the tight timing architectural problems early it1 the design cycle. on the NDAL. Therefore, the NEXMI pad ring was Details such as queue sizes and structure, flow con- custon~designed for speed, control, and TTL-level trol mechanisms, and interblock communication compatibility. Other major handcrafted sections protocols were all emphasized, and each decision were the dense, multiported queue structures. yielded valuable information about the feasibility of To coordinate our large, multiple-person chip a single-chip implementation. design, we used the organized chip design (ORCHID) Behavioral motleliag allowed a functional file management system. The ORCHID system man- description of the NEXMI chip to be integrated into ages the files created by each design tool for every a system-level model quickly. This enabled the veri- hierarchical subchip. It contains translation tools fication team to write and debug tests early in the that convert the schematic data into file formats design cycle. As each structural subchip was fin- accepted by the simulation, layout, and verification ished, it replaced its previous behavioral counter- tools. The system allowed individual designers to part in the verification model, and was tested for work independently on subchips at various stages functional correctness This step-wise, integrated of development (schematic entry, sitni~lation,floor approach permitted the tests, models, and logic to plans, layout, verification) yet still maintained a be verified incrementally. coherent hierarchical design database that could be One major advantage of our behavioral modeling shared by all members of the design tarn. The itlea strategy was that it let the design progress without of a shared database facilitated the reuse of com- targeting a specific technology. The designers mon logic (e.g., counters, parity trees, testability focused their attention on logical implementation devices), and designers often borrowed from each rather than on the technology-specific details, such other to avoid primiti~~edesign duplication. as the timing and loading constraints ~mposedby a choice of technology. The Choice of NEXMI Senzicustom Design Flow Technology section described the process of select- ing the CMOS-3 implementation path. The initial Figure 4 shows the semicustom design process and behavioral phase of the project allowed some of the the individual tools that were used during the struc- design to be finished before the final choice of tural and physical implementation phases of the CMOS technology was matle. NEXlMI chip design. ALOE, Digital's in-house graphical editor, was our schematic entry vehicle, and was used in conjunc- Structural Design and Chip tion with the primitive syrnbols and models from Implementation Phases the CMOS-3 standard cell library (SCL3). The tools The next major project design phase involved map- within the ORCHID system were used to translate ping the behavioral subchip models into their schematics into generic wirelists with estimated equivalent gate-level structural representations. delays. The schematics were then input to other The design methodology chosen followed directly tools for logic and timing verification. Specific from the decision to implement NEXMI using design tools inclutled the internal logic simulator, Digital's CMOS-3, 1-micrometer, semicustom pro- DECSIM, a timing analysis tool, AUTODLY, and the cess. The semicustom process includes a fully spec- SPICE circuit simulator. ified library of primitive elements, called standard Automated was used to convert cells, similar to the cells included in a gate array some behavioral models into structural entities. library OCCAM, an internal CAD tool, was usecl to synthe- The advantage of the semicustom approach is size a gate-level representation of the scattered that it gives the designer ful I control over the place- address decode, and was able to minimize the logic ment and routing of the inclividual primitives to meet the aggressive bus timing.: Controllers or groups of primitives (subchips) within the chip. embedded within the %MI and SSC subchips were This allows the engineer to easily take advantage of synthesized using SMD2SIM, a tool developed at special placement for speed-critical paths. Because Digital's Boxboro, i?M site for another project. This we were using the internal tool suite and fabrica- synthesizer allows large programmed logic array

Digital Technical Journal Vo1. 4 No. 3 Summer 192 53 NVAX-microprocessor VAX Systems

I DESIGN ENTRY 1

SCHEMATIC STATE MACHINE SYNTHESIZER I

1 GENIE ORCHID TRANSLATION TOOL SUITE I GENERIC I WIRELIST I I GENERATOR I I I I I FLT I HIERARCHICAL I I FLATTENER I I I b I I ------I I - ---- 1 I I ~TZS----- I I DLY EST I I WLUICHAS I ITOOL DELAY I SPICE WIRELIST 1 SUITE TWEDT ESTIMATOR I I EXTRACTOR TWOLFEDITOR 1 I I DELAY I---__-_ I I I DISTRIBUTOR I- - - TWOLF I 1 I DLYCALC SPICECNV I TIMBERWOLF I I I PLACEIGLOBAL DELAY WlRELlST I ROUTER I I CALCULATOR TRANSLATOR I I I I I I I I SCASM I I I STANDARDCELL IDECSIM I I ASSEMBLER 1 I NETLIST TRANSLATOR I I DETAILED ROUTER I 1 FAME II-__-__-__- _--_____- I I--__-_---_ -----_--- 1 FLOOR PLANNER 1 AND TOP-LEVEL I I

DECSIM SPICE I LOGIC ANALOG I 1 SIMULATOR SIMULATOR I I I 1 ------I I I

HIERARCHICAL DESIGN RULE EXTRACTOR

I IVCMP I I I COMPARISON I I PROGRAM PHYSICAL CHECKS I L------

Figure 4 Semicustom Design Flow

(PLA) structures to be realized from a simple LISP- After each gate-level subchip was created, the like behavioral description. During the course of ATLAS tool suite was used to place and route the the design, controller function changes became standard cells within the subchip. The TWOLF edi- easier to maintain using a textual description. tor (mDT) was used to place special cells, such as

54 Vo1. 4 No. 3 Summer 1992 Digital TechnicalJournal The V'AX 6000 Model 600 Processor clock buffers and timing-critical gates, within the sizes of some of the driving transistors, based upon subchips. The rest of the subchip was then placed potential problems identifietl by the XREF program. automatically ancl globally routetl using the TW'OLF Figure 5 is a representation of Iligital's CMOS-3 program, which relies on a simulatetl annealing design process. It shows the chip floor plan, along placement algorithn~.~Lastly, the standard cell with an example of the schematics usecl to drive the assembler known as SCASM was used to assign stan- layout process, and a timing diagram created by the dard cell rows and to conlplete the tletailecl rout- sim~~lationprogram. ing.WS<;ASM uses both channel routing and "over-the-cell" routing, wliicli reduced the overall Verzj.icationof the VAX GO00 Model 600 size of the subchip by permitting metal routes over Early in the design cycle, our verification team was the underlying cells. Subchip post-layout timing able to integrate a behavioral chip-level model into information was then fed back into the logic and a CPU module model and perform system-level test- timing verification tools for a more detailed analy- ing of the v!x 6000 Model 600 These tests gave the sis. The ability to analyze post-layout delay informa- designers timely feedback on the effects of their tion ancl to iterate through the layout process early design decisions on the system as a whole. in the design phase was crucial to meeting our An NDAL/XMI:! bus monitor, calletl the BEMAR, scheclule. was written to verify the bus activity between the The top-level floor plan was progressing in paral- NDAL ;und MI2 ports of the chip, and to log valuable lel with the process of subchip placement and rout- cycle information. The NEAT, an NDAL cycle emula- ing. FAME, another tool from the ATLAS tool suite, tor, was also used to create NV&Y transactions on was used for the chip floor plan ant1 top-level rout- the NDAL. It was written to reduce simulation test ing.l(J.llInforniation about top-level routing was fed run times and to ease the test verification process. back to the subchip place and route tools to change Most of the tests written were focused tests, tar- the basic shapes and subchip boundary pin place- geting a specific function within the NEXMI chip. ment inforniation. This was used to keep inter- However, a random bus exerciser was also written subchip routing to a minimum, which in turn to test unexpected combinations, and was instru- recluced overall loacling and timing delays. Many mental in catching two design flaws that the successive iterations were necessary to achieve an focused test cases had missed. In all, over 100 optimal top-level floor plan that would meet our focused tests were written and verified against the timing goals and fit onto a single die. Fortunately, simulation model, giving us a high degree of confi- much of this work coulcl be done by the designers dence that first-pass silicon would be functional. themselves. The quality of the prototype systems is due in large After the final place and route iteration hacl been part to this complete verification. clone on the tlie entire chip, an exhaustive set of A subset of the h~nctionaltests was also used checks was performed to ensure tlesign integrity by manufacturing to generate chip test vectors. and circuit reliability for first-pass silicon. A VLSl Test vectors were automatically convertecl from design rule checker was run on the indivitlual sub- DECSIM-formatted trace files to Takeda pattern sets chips, then on the entire chip, to verify the layout through a special program called TEMPEST Prior to against the CMOS-3 process design ~~~les.As a pre- the return of first-pass silicon, the generated pat- cautionary measure, the original chip schematics tern sets were convertecl back into DE(:SIM format, were extracted to SPICE wirelists using an internal to ver~that the pattern sets would run success- wirelist tool, and another program, called fi~llyon the Takeda chip tester. This test pattern FIII.EX/CUP, clid the same thing using tlie geometric generation ancl simulation process reduced the inform:~tion in the final layout database. A wirelist time needed to debug the test patterns when the comparison program, called IVCMP, compared the NE)(IMI chips became available. two wirelists to ensure that the design we hat1 sim- ulated was the same one we would builcl. Modzcle Design Issues Finally, a program called XREF used the capaci- During the des~gnof the VM 6000 Model 600 tance from each internal chip node and the topol- processor, many module-related issues were ogy of the routed interconnection database to considered. The next section describes some of the predict tlie cross talk each signal could expect. more interesting issues that we encounteretl tluring Changes were made to the signal routing and the the physical module clesign process. In many cases,

Digital TechnicalJorrrrral Vol. 4 No. 3 Suin~rro/992 WAX-microprocessorVAX Systems

Figzlre 5 Representation of the CiMOOY-.3Design Flow

we describe the method that was adopted by the The footprints for the two SkkM cache chips that design team to prevent problems through careful represented the cache size trade-offs were differ- platitiing and analysis. ent, which made it difficult to create one prototype module to accom~nodateboth sizes. Creating two Backup Cache Size and Speed separate modules to test the different combinations We chose to defer the selection of the backup cache of cache sizes would have burdened the rest of size and speed until late in the design cycle. This the project, so special surface-mount pads were allowed us to make the decision based upon the tlesigned to handle the different SUM geometries pricing and availability of the S&\1 devices, and the on the same module. Once the cycle time of the performance to be gained from combinations of NVAX' CPU was determined to be 12 ns, we were able these devices. We also wanted to wait until we to make the final choice on the cache size and S&i knew the final speed of the NVAX CPU. The two speeds. The final decision was to provide a 2MB most likely cache sizes were 512 kilobytes (KB) and cache with 20-11s256~~ by 4 Smis for the data and 2 megabytes (MB). We felt that simulation co~~ltl 15-11s 64~~by 4 SUMS for the tag. This provided the provide insight into the performance trade-offs, best balance between cost and performance in the and sitnulation studies were perfor~nedwith both Model 600 system. sizes to establish approximate performance num- bers. We knew, however, that running real modules 011a variety of benchmarks under actual workloads ROMB US Decisions was the most accurate way to determine the perfor- General-purpose computing systems neecl a core of mance side of the price/performance trade-off. system functions to supple~nentthe computing

56 Val. 4 No. .3 S~l?~??ner1032 Digital Technical Jozrrnal The VAX 6000 Model 600 Processor

power of the processor. On the VAX 6000 series of Previous VA)i 6000 systems used a system sup- computers, this includes: port chip for most of these functions. After we decided to combine ;w: much support logic as possi- ROM to hold the boot, diagnostic, and console ble into the single ASIC device (NEX'MI), we had to code determine what functions to include insicle the EEPROM to hold items such as boot paths and chip, and what functions to locate external to the error information chip. We saw a potential schedule risk if some func- Console terminal UART tions, such as the UART and TOY clock, were placed inside the NEWI. These f~inctionswere relegatecl Time-of-year (TOY) clock to outside the chip since industry-stanclard,off-the- Battery backed up RAM shelf components were available to perform exactly the functions we needed. Programmable interval timer Once we decided to implement the UART and TOY Time-of-day register clock outside the NEXMI, ancl added the normally external KOM, EEPROM, and input/output ports, we Input ports to sense external switcl~esand other found that the NWMI pin count was hig11er than we state could afford. Since all the support devices miere Output ports to drive module light-emitting slow, and each one was byte-wick by its nature, diodes (LEDs) we solved the pin problem by creating a single, Reset logic for the module slow-speed, bidirectional bus, which we called the ROMBUS. Figure 6 is a block diagram of the ROMBUS Halt detection and arbitration and its components. The ROMBUS reduced the 4 Secure console logic NEXiiII pin count by 40 pins for the same fi~nctions.

r-lEEPROM 0TOY CLOCK

I3OM ADDRESS I *I::GENERATION I _mPORT 0

ROMBUS

1M-lINPUT PORT

RAM ROMBUS

ROMBUS COMMAND<3:0> NEXMI CHIP

Figure G Block Diagram of the ROMBUS

Digital TechnicalJournal Vo1. 4 IVO. ..3 S~tontner192 5 7 NVAX-microprocessorVAX Systems

Using the ROhllnUS for the UART and TOY clock ceramic through-hole PGA on a 100-mil grid was the reduced the risk entailed in designing them inside best trade-off. the semicustom NEXMI ASIC, but the ROMBUSitself was eventually burdened with 12 separate devices. NLIAL and Backup Cache Sirnrllatio~z The most This presented a large capacitive and direct current important module-level signal integrity analysis load to all the components on the bus. The UART was the extensive characterization of the NDAL and and TOY chip in particular were not well suited to backup cache. As the interconnection bus between this large load. We attempted to find CMOS-equiva- the NVAX and the NEXhII, the NDAL controls not lent devices for all the 'ITL components to eliminate only the transmission speed between the compo- the direct current problem, but were unsuccessful. nents, but also the speed at which the N\IAX can run We finally decided to split the bus into two sec- (since the NDAL is synchronous to ant1 scales with tions, and place all the CMOS low-drive-capability the NVM speed). The speed of the backup cache devices on a separate segment. A transceiver con- was a performance concern for obvious reasons. nected the two segments. We determined that estimates of the intercon- Given that most of the devices on the ROMBUS nect and routing would be insufficient for our were non-Digital components. we had the normal aggressive timing goals. Several trial layouts were problem of obtaining accurate SPICE models to performed to provide accurate input for the SPICE determine the ROMBUS timing. Extensive lab testing simulations. The I/O drivers for both the NVU and was performecl on the devices to characterize their the NEXMI chips were known to be complete and output delays under different capacitive loading accurate, since they were both designed internall)! conditions. These loading tests helped generate The timing requirements for reliable operation SPICE models for the ROMBUS simulations. were also well understood for the same reason. The module-level SPICE simulations provitletl guidelines Signal Integrity Considerations about the length and routing rules associated with Module signal integrity analysis was one of the most each high-speed signal trace, such as: important aspects of the project. A system that Daisy-chain routing of a1 l signals includes components n~nningas fast as the NVAX ant1 NEXMI can easily see performance degradation Clock routing scheme (e.g., matched length and or unreliability if the module signals are not care- termination) fully placed, routed, ant1 terminated where neces- Maximum length requirements for each type of sary. The past experience of the signal integrity network team played a major role in the eventual success of the process. Several approaches were taken for this Treeing, or ordering, requirements for networks aspect of the design. Impedance of different networks Package Selection SPICE simulations were per- These requirements were used as design guide- formed on each level of the design. This included lines. A cross-talk prediction program within the the chip, package, module, and backplane. Even layout tool verified that the coupling between sig- though each particular simulation analysis was nals was within an acceptable range. After the protv focused on one aspect of the design, we under- type modules were delivered, measurements were stood that each individual area affected the entire taken of all the critical signals on the module, show- system. For example, the selection of a package ing excellent correlation with the S1'ICE results spanned several important levels of design hierar- chy. The connections between the die pads and the Thermal Management module signal pins, as well as the connection of the During the early phases of the module design pro- package to the board, needed to be considered; as cess, the power dissipation of the NVto; chip was did the module layout that accompanied each pack- estimated to be as high as 20 watts, though the final age size and type. One aspect of the signal integrity figure for the 12-11s component that we shipped might show that a particular package type was with the product was 14 watts. Cooling such a part superior to another, while another aspect might presented a challenge to the module designers. favor a different approach to module component Since the Model 600 was an upgrade option in the interconnection. Electrical SPICE simulation deter- VAX 6000 family, there was no possibility of system mined that the performance and reliability of a rnoclifications to improve the thermal design.

58 Vol. 4 No. 3 Suinn~er.1992 Digital Technical Jozrrtrnl The VMGO00 iModel600 Processor

Figure 1, which is a picture of the Model 600 Engineering and Computer Sciences, Univer- module, shows that the heat sink for the WAX is sity of California, Berkeley. larger than the package. The final dimensions of the 5. S. Ross, Introduction to Probability 1Models heat sink are 3.285 inches by 3.285 inches by 0.325 (Orlando, Florida: Academic Press, Inc., inch, while the PGA package is only 2.2 inches by 2.2 1985). inches. One of our limitations was the maximum component height of 0.420 inch on side 1 for any 6. M. Kearney, "DECSIM: A Multi-Level Simula- module in a V&X 6000 backplane. This restriction tion System for Digital Design," Proceedings forced the heat sink to grow wider rather than of IEEE International Conference on Com- higher. The WAX chip was also placed closer to the puter Design: VLSl in Computers (1984): edge of the board, where the airflow provides max- 206-209. imum cooling. The final size of the NV'heat sink 7. R. Brayton, ASV, ant1 A. Wang, "MIS: A Multi- was a compromise between system requirements, ple-Level Logic Optimization System," IEEE board area, and thermal performance. Transactions on Computer-Aided Design, vol. CAI-6,no. 6 (November 1987). Summary The decision to use Digital's proprietary tool suite 8. C. Sechen and A. Sangiovanni-Vincentelli, and fabrication process was proved correct by the "Timberwolf 3.2: A New Standard Cell Place- quality of the VAX 6000 Motlel 600 ant1 its delivery, ment and Global Routing Package," Proceed- on schedule, for WAX CPU debugging and product ings of the 2.3rd Design Automation shipment. Access to accurate information about the Conference (1986): 432. components allowed decisions to be made early, 9. J. Reed, A. Sangiovanni-Vincentelli,and M. and entailed less risk. This advantage, coupled with Santamauro, "A New Sy~nbolic Channel a seasoned module development team and exten- Router: YACR2," ZEEE Transactions on Com- sive functional, timing, and circuit simulation puter-Aided Design, vol. 4 (July 1985): 208. resulted in a successfill project. 10. N. Chen, C. Hsu, and E. Kuh, "The Berkeley Acknowledgments Building-Block (BBL) Layout System for IrL.51 Design," Digest of Technical Papers, IEEE The authors would like to acl

References and Note 1. R. Allison, "An Overview of the VAX 6200 Family of Systems," Digital Technical.Journa1, vol. 1, no. 7 (August 1988): 10-18.

I. G. Uhler et al., "The WAX and W+High- performance VNL Microprocessors," Digital Techniccil Journal, vol. 4, no. 3 (Summer 1992, this issue): 11-23. 3. SPEC Nezusletter; vol. 3, no. 4 (December 1991). 4. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical

Digital TechnlcrrlJorrrnal Vol + No 3 SLLI~I~~I,1992 59 Jonathan C Crocvell Kwong-Tak A. Chui Thomas E. Kopec Samyojita A. Nadkarni Dean A. Souie

Design of the VM4000 Model400, m, and 600 Systems

The design ofDigital's NVM CPlJ chip provided the oppor-tuizity to bring RISC-class pe~:fori?zalzceto deskside CISC VRY co112p/.fter'systei~zs. The tlerv VAX 4000 model 400, 500, and 600 low-etzd systet?zs takeJ~111adz~antage of lir3eperfov~rzn1zceccryabllities of the IWAX mic~~oprocessor:The three sj~stenzsoffeel. jkom two toJour times theper- .fonnaizce of the pret~io~lscop-@he-line vitX 4000 il/lode/ 300 sjstei~zin the same deskside enclosure. To achieue this increased pet$omza~za~zce~Digital's systertzs erzgi- ~zeersdesigned LI nezil high-pelformarzce n2enzoly controller chip aspart ofthe CPU irzoclule, whose buic desi~izis shavecl bj~the Cbree systenzs, liz addition, a 1nigb-pe1- fonnaizce memwj, module and a VLSI bus adapter chip were designed.

The design of Digital's N\rM high-performance goals. As a result, the new top-of the-line VAX 4000 ~nicroprocessor offered systems engineers the Model 600 system performance is four times that opportunity to design computer systems with sig- of the ~Motlel300. nificantly improved performance.' The project Achieving the performance goals required structured to use the NVAX <:PU chip to upgrade the designing ;I new, high-performance memory con- VAX 4000 line of low-end desksicle computers troller chip called the NViuc data and acldress lines resulted in a family of three new systems ant1 the (NDAL) pin bus (NMC:). The associated CPll modules. These systems share a objectives were to double the memory bandwidth basic CPU module design but offer a range of per- of the WLY 4000 Motlel 300 system and to provide formance capabilities. This paper first presents a total system memory capacity of 512 megabytes the goals of the development project and then (MR). To support tlie NiLlC specifications, a high- describes the architecture, design, and implementa- perfornia~icememory module called the ~~690 tion of the resulting new VAX 4000 iWotlel400, 500, was designed.' and GOO systems. To reduce hardware and software development cost atid the risk of failing to meet project sclietl- ules, the system design incorporatetl all existing Goals of the VM4000 Project high-performance I/O adapter chips. These devices Schedule and performance goals were of prime include the second-generation Etllernet controller concern to the engineers committed to upgrading chip (SGEC), the Digital Storage Systems Inter- the ViD(. 4000 fanlily of comlxiters. 'Time-to-m;irket connect (DSSI) shared-host adapter chip (SFIAC), was a key goal of the development project. the CVAX Q22-bus interface chip (CQBIC), and the Consequently, hllLy ql~alifiedsystems were ready to system support chip (SSC).L.3' A very large-scale be shipped to customers when the NVkY CPlj chip integration (W.SI) bus adapter chip was required to was released ancl available in volume. The initial provide CVAX pin (cr) buses to connect these I/O goals for performance specified that the new sys- devices.5 The NDA1,-to-CP bus adapter chip (NCA) tems provicle three times the perfornlance of the was tlesignetl to meet this nectl. V&Y 4000 Model 300 system. Ultimately, the perfor- The three CPU modules designed to upgrade mance of the NVkY CP'LJ chip exceecled its tlesign existing VAX 4000 Model 300 systems retain tlie

60 Ih1 4 IVO .7 S~~irrr~rer1992 Digilrrl Technical Journal Design of the VRX 4000 iModel400,500, and 600 Sj~stems same ~~440systeni enclosure usecl for these sys- buses, a thiclc-wire and TliinWire adapter, tems. The upgrade requires that the older ~~670a Q-bus adapter, and the console serial line. Tlie memory module ilsetl in the Model 300 be replacetl CPU module performance is 16, 24, ant1 32 times tlie with the new, higher-performance ~~670memory performance of the well-known \%X-11/780 systern, module. The new systems had to support all of the for the three new VN( 4000 systems, respectively. Q-bus option 111ocl~Iesthat were supportetl on the The CPIJ cycle clock speed and cache sizes deter- \$a4000 Model 300. mine the procluct performance, as cliscussed in the section CPU-cache Subsystem. System Overview of the VAX 4000 The backplane in the systern enclosure provides the signal interconnection and power distribution Models 400,500, and 600 between systern components. There are connec- The ~~440s)fstem enclosure shown in Figure 1 sup- tors and slots for the CPU module, four slots for portstthe VhY 4000 Motlels 300, 400, 500, ancl 600. ~~690memory modules, ;lncl seven Q-bus slots. This petlestal enclosure was designed to operate in The CPU motlule has a 270-pin connector that an open office environment. lo allow the systems receives the module power and connects the CPIJ to operate cluietly, the cooling kins are speecl con- to the NvAX memory interconnect (NMI) bus, the trollecl, b;~setl on the ambient temperature. The system DSSl bus, and the Q-bus. The backplane was enclosure power supply provicles 644 watts of motlifietl to support the wider 72-bit clata path of direct current from a standard 15-ampere wall cir- the new 1~1~690memory moclules. This new baclc- cuit. The system was designed and qtlalifiecl to plane was ph;lsed into the Bi1440, enabling most operate in an cn\~ironmentwith a temperature \!U 4000 Moclel 300 systems to be upgracletl with- range of from 10 to 40 degrees Celsius. out requiring a backplane change. The system enclosure supports up to four DSSI or small computer system interface (SCSI) tape inte- INTEGRATED STORAGE grated storage elements (1SEs). These cableless ELEMENTS bricks support either one 5.25-inch, full-height drive or two 3.5-inch clrives. The ISEs are ;~v;~ilablein TAPE DRIVE variants that support the 2-gigabyte (GO), RF73 DSSI disk drive and dual 85MB, RF35 DSSI disk drives. The SYSTEM single-system pedestal can support six RF35 clevices ant1 a tape drive, providing 4.8~~of storage for I! ! POWER applications that require high 1/0 rates. This RF35 SUPPLY configuration can provide over 360 queued 1/0s CONSOLE MODULE per second for random I/Os. If IW73 drives are used,

FIVE DEDICATED the single-system box can provide SGB of storage. SLOTS BEHIND There are sever;~lways to expand the b;~se CONSOLE MODULE (CPU PLUS FOUR 4000 system; the most common way is to expand to MEMORY) another DSSI-based system ant1 create a two- or SEVEN Q-BUS U/ OPTION SLOTS three-node DSSI \iiU(cluster. The (2-bus in the \ILK 4000 system can be expanded to provide 10 adcli- KEY: tional Q-bus slots to each system using the B213A CARDCAGE Q-bus expansion enclosure. Tlie 1)SSI expansion e~iclosurestogether with tlic Q-bus IISSI adapter Figu7.e I BA440 Splenz Bizclosure (KFQSA) can exp;lnd tlie total available disk storage to 28 DSSI disks. Using the RF73 disk allows up to 56~~of disk storage. The new <:PI1 moclules, differentiated only by the part numbers ~1675,~680 and ~690,are utilizecl CPU Module as the engines for the VLY 4000 hqotlels 400, 500, The Cl'U moclule common to the three new VAX ancl 600, respectively. All three CPU modules, 4000 systcms is based on ;I highly integratecl CpU henceforth referred to as the CPU module, provide and 1/0 system built on the single 21.6-by-26.7- the same l/O fi~nctionalit-):including two DSSl centimeter (8.5-by-10.5-inch) module shown in

Digital Tecbrcicul Jozirnal Vol. 4 No. .j ,Striiriirer 1992 61 NVAX-microprocessor VAX Systems

Figures 2 and 3. The CPU ~nocluleprinteel wiring (ns) with a slip cycle, i.e., a two-cycle read (20 ns) board (PWB) consists of the following subsystems: a using 8-ns SbLMs. This write-back caching architec- central processor and its associated three-level ture significantly recluces the cle~narldson main cache; a pin bus, bus aclaptel; ancl memory control- memory by caching both reads ant1 writes without ler; and an I/O system with integrated controllers for the need for a memory access. On all previous VAX' DSSI and Ethernet buses. The CPU module also con- 4000 systems, the caches recluirecl that all write tains a CQBIC and 512KB of field erasable program- operations continue through to main memory, i.e., mable reacl-only memory (FEPROM) for console code. write through. The WitX CPr is clockecl by a clifferential emitter- CPU-cache Sz~bsystem coupled logic (ECL) surface acoustic wave oscilla- The CPU-cache subsystem is built around the single- tor. This oscillator runs at 250 megahertz (MHz) chip i\'V~xCPU, which provides a three-level cache (16-11s cycle time), 286 PlHz (14-ns cycle time), or architecture. The first two levels of cache, which 333 MHz (12-ns cycle time) on the ~675,~1680, are containeel on the chip, include a 2KB virtually ancl ~690CPU modules, respectively. The NllIU; aclclressecl instruction cache and an 8KB physically chip produces a four-phase internal clock directly addressed instruction and data cache. The thircl from this inlx~tand generates system clocks at one- level of cache, the backup cache, is constructed third the internal clock rate. using static random-access memories (SUMS) on The new CPU module design supports either a the module ant1 is completely controllecl by the 128-kilobyte (KB) or a 512KB backup cache. (512KB WAX CPIJ chip. The backup cache was designed to for the ~~690module and 128KB for the U680 and support a CpU cycle time as low as 10 nanoseconds ~675.)The tag store for the two cache sizes can be

NDAL-TO-CP BUS SGEC ETHERNET SHAC DSSl ADAPTER CHlP iNCAl ADAPTER CHlP ADAPTER CHlP

NVAX CPU CHlP

NVAX MEMORY CVAX &BUS SHAC DSSl CONTROLLER (NMC) INTERFACE ADAPTER CHlP CHIP (CQBIC)

62 Vol 4 iVo .j SLIIIIIIIL'I1992 Digital Tech~zicrrlJournal Design ofthe V'AX 4000 Model 400, 500, and 600 Sj,stems

DSSl BUS v SGEC SHAC DSSl ETHERNET ADAPTER ADAPTER CHlP 128KB OR 512KB CHlP B-CACHE SRAMS -

NDAL-TO-CP CP BUS 1 NVAX - BUS ADAPTER 2 CHIP (NCA) CPU m - SYSTEM FPU SUPPORT P-CACHE N- - 2 - CHlP (SSC) B-CACHE NVAX V) 3 CONTROL MEMORY m CONTROLLER a (NMC) SHAC DSSl OSCILLATOR ADAPTER 8 INTERFACE

BA440 CONNECTOR NVAX Q-BUS DSSl BUS MEMORY INTERCONNECT (NMI)

Figure 3 Block Diagram of the CPU~Vlo~lule constructed from S&bVis that have tlie same 24-pin NDAL Interconnect, Memory package with co~npatiblepinouts. This packaging Controllel;and Adapter makes it easy to design the P\yrR to support either The NDAL bus is ;I synchronous, multiplexetl cache. However, the data store, which is construc- address and data, pendetl interconnect. Each tetl from parts whose paclcages Ii;we incompatible device on the NDAL bus may be a commander pinouts, required a special dual footprint tlesign (request data transfer), a responcler (respond to to :~ccornmodate either 16K-by-4K or 64K-by-4K commantler requests), or both. In the new ViOi SRt\Vs. This footprint is clesigned witli the clecou- 4000 systems, the IW~OLCPU chip is only a comman- pling capacitors and series edge-limiting resistors der, the NMC is only a responder, and the NCA T/O carefirlly placetl in the outline of the footprint to adapter is botli a conimantler and a responcler. ;illow a clense, geometric packing of tlie 18 kt\ls Arbitration of the NDAL interconnect is per- necessary for the data store. formed by the NMC,which is also the default master The backup cache is a write-back caclie witli responsible for driving v;rlitl no-operation bus memoly coherence maintained tlirougli ;I direc- cycles when there is no master activity. The WAX tory-based broadcast colierence protocol.' When CPU is responsible for watching all NDAL traffic and the NVAX <:PlJ needs to write data to memory, the performing any invalitlates or write backs of pri- dat:~is first transferred into the backup cache with a mary and backup caclie data required to maintain request h)r write privilege command. Once ;I cache cache coherence with memory. row is stored in the cache as "written," any (DM) read to that memory address of tlisplacement of that cache row will result in a 1.0 Buses rele;rsc write privilege transaction, even if the data The CPU module uses a custom third-genera- was never actually written. The cache controller tion complementary metal-oxide semiconductor insitle the WiOi CpU is responsible for all activities (<:$lOS-3) process 110 adapter (the NCA chip) to related to cache maintenance. interface between the NDAL bus and a pair of 32-bit NVAX-microprocessorVAX Systems

CP buses. Two CP buses are usecl to prevent long requirements were met by using a strong driver in response latencies on the Q-bus from interfering the NVAX chip and a parallel R-C termination at the with buffer management on the Ethernet interface. far end of two of the three stubs. The R-C termina- One CP bus, CPl, operates under the synchronous tions absorb some of the incident energy, reducing CP bus protocol, and the other, CP2, uses the asyn- the reflections to an acceptable amount and allow- chronous protocol. The faster , the ing incident-wave switching without the reflected Ethernet and the two DSSI adapter chips, reside on wave reentering the threshold region. CPI and can take advantage of optimizations in the The backup cache data store was designed with NCA to reach a peak bus bandwidth of 33MB per strong drivers and incident-wave switching on the second (iLIB/s). CP2 has only one peripheral llivL4 address lines. The routing of a representative cache master, it., the CQBIC, which also connects the SSC address signal is shown in Figure 4. The stubs were ancl the console FEPlXOM to the system. The NCA arranged in such a way that the unavoidable reflec- acts as a master on both CP1 and CP2. Since only tion from the far entl of the lines was reduced by one of the buses is asynchronous, the system uses a partial reflection from the center junction. Thus, only one CP bus clock (CCLK) chip for signal syn- signals traverse the threshold region fairly cleanly chronization and CP clock distribution. and settle rapidly outside the threshold region, i.e., The two CI' buses share the same clocks. Con- approximately 3 ns elapse from the beginning of sequently, arbitration is perforrnecl by a single pro- the transition at the driver until the time when grammable sequencer, which serves both buses. a valid signal arrives at the receiver. The sequencer is clocked at 35 11s (~~680and ~690)or 40 ns (~~675);this is one-half the CP Printed Wiring Board Physical Design cycle time. The arbitration for CP2 is a simple two- priority scheme with the CQBIC at the higher prior- The NVrU( CPU chip draws several amperes of cur- ity. When more than one master is requesting the rent from the 3.3-volt power supply. This current bus, the minimum Dmrequest deassertion times draw has significant high-frequency components. on both the CQnl<: and the NCA effectively make The integrated decoupling capacitor on the NVAX this scheme behave like a round-robin arbiter. The die helps eliminate some of the high-frequency cur- internal state does not have to keep track of the pre- rent pulses, but much of this current must be sup- vious master. No special treatment is required for pliecl by the nloaule-level clecoupling capacitors. lock cycles because the CQBIC will never perform a Charge stored 011 the module in these decou- lock on behalf of the Q-bus. pling capacitors supplies this current. Any induc- tance in the path of the current reduces the effectiveness of the capacitors by limiting the rate- Signal Integrity of-change of the current. In addition, the larger the Signal integrity work began very early in the project, physical area enclosed by the current path, the and tlie effort was a close collaboration between more radio frequency (RF) energy will be radiated the cPrJ module design team, the design teams for into space that must be contained by tlie enclosure the three VLsI devices, and the VAX 6000 Model 600 in order to meet regulatory radiation recluirements. CPU module design team. Because some CPU mod- These two issues led to the exploration of how to ule team members had experience with designing minimize both the inductance of the decoupling CP bus modules, the signal integrity issues on the path and the physical area of the RF current spread. CP buses were generally well understood. The CP The internal PWB standard, as it existed when the bus data lines were routed with only a length con- new cpu module was being designed, required a straint. 'The control signals on CPl required signifi- minimum of 25 mils (0.025 inch) of surface etch on cant analysis ;1nd SPICE modeling to meet both any device before a via could be dropped into an settling time and waveform requirements. inner layer. Traditionally, the inductance of this 'I'he most critical signals in the backup cache connection was retluced by using a wide (i.e., 25- are the output enable and write enable signals. The mil) etch for this connection. The i~lductanceof output enable deasserting edge must be transi- surface etch on the module lay-up used on the new tioned quickly to avoitl tri-state contention on the CPU module (calculated with two-dimensional cache data lines. The write enable signal must be transmission line [TDTL]) is shown in Table 1. perfectly monotonic through the threshold region, Table 1 provides the data to calculate the total because it is an edge-sensitive signal Both of these inductance of a pair of 25-by-25-miletch segments,

Vol 4 No. 3 Su~rriner1992 Digital Tecbnicnl Jourrrnl Design of the VAX 4000 Model 400, 500, and 600 Sj~stems

OOOC EXAMPLE OF A OOOC SINGLE ADDRESS oooc LINE ETCH -0OOC OOOC lL-!%? lL-!%? C * "C OOOC

cN!5 c

OOOC

OOOC OOOC 0 0 0 c OOOC OOOC OOOC oooc OOOC

Figure 4 Backup Cache Treeing i.e., approximately 0.4 nanohenrys (nH). (This mea- inductance of the high-qualit): 1,000-picofarad surement is approximate, due to the short dimen- (pF) RF capacitor used on the CPU module is sions of the segments.) The effective series approxinlately 1 nH, including the inductance of the vias connecting the capacitors to the power planes. The reactance as a function of frequency for Table 1 lnductance of a Surface Etch the inductive component of this decoupling system Etch Width lnductance is shown in Table 2. Because the resulting impedance is still quite 10 mils high at upper frequencies, multiple capacitors are 25 mils used in parallel. The 1,000-pF capacitors are cho-

Note: This table represents the results of the inductance of a 0.5- sen from two different case styles to ensure that ounce copper strip on a 14-mil FR4 epoxy glass laminate the parasitic inductance of the capacitors is not dielectric over a ground plane. identical for all the high-frequency decoupling

Table 2 Reactance as a Function of Frequency for Decoupling with and without Dispersion Etch I Reactance I Frequency With Dispersion Without Dispersion (MHz) (Ohms) (Ohms)

Digitul Technical Journal Vo1. 4 No. .? Summer 1992 65 NVAX-microprocessor VAX Systems

capacitors. This method of selection staggers the moclule would have scan capabilities Thus, the freqi1enc)l of the parasitic resonances. overall module test strategy could not be basetl entirely on the JTAG specification. Cycle Design Goal and Testing As the teams reviewed the overall module Although the original design goal for the CpU mod- design, certain issues appeared to show promise for ule was a 12-11s CPU cycle time, as knowledge about the application ofa scan-based test: Digital's fourth-generation complenientary metal- 1. The automatic test equipment (ATE) pin density oxide semiconductor (CMOS-4) process increased, was likely to be very high in the areas of the the design teams investigated the critical paths for a three 339-pin pin grid arrays (PGAs). Reducing 10-ns operation. At this time, the CPU module had this high density would improve the reliability of not been routed, so a 10-ns cycle time was set as the the test fixturing. layout goal. The cache loop layout resulted in a mea- 2. Previous experience with module manufacture sured requirement that RAMS have an access time of in Digital's plants sl~otrredthat the risk for solder approximately 8.5 ns to meet worst-case timing defects woulcl be high in the cache area, because with no addecl slip cycles. of tlie J-leacl Sh11s.A fast way to isolate these RAlMs meeting this specification were not readily problems would help reduce debug time. available when the CPU module was designed. However, several vendors are beginning to ship 3. Providing at least a driver or a receiver for use by devices at this speed today. The NDAL data lines the JTAG scan ring would eliminate the need to were capable of running at the 30-11s NDAL cycle use continuity structures to verify bonding and that is generated with a 10-ns CPU clock. The point- solder integrity on the VLSI parts. to-point NDAL arbitration signals are the tightest Eventually, the niocl~~leand chip teams settled on timing path on the NDAL. The combination of analy- 21 subset implementation of the Jrl;l<; scan-based sis by the NMC team and a carefill hantl-routing of test. The NViLV CI'lI chip implements scan latches these signals by the module team allowed appropri- on all clata ancl control pads (input ancl output, in ate NMC speed binning (it., sorting the chips basecl tlie case of bitlirectional pads). Thus, the cache on correct operation at the fastest possible speed) S&Ms can be tested using only the JTM; port, and to meet the timing requirements for a 30-ns NIIAL tlie NVAX (:PI1 can act as the driver for scan testing cycle goal. of the NDAL interface. The NMC and the XCA imple- Very few MbU CPU chips were available that ment receive-only scan, so that the NDAL interface would f~~nctionat 10 ns over the fill1 range of volt- could be tested with no ATE pins required. The age and temperature. However, empirical signal- external tester was able to test the remaining pins delay measurements ant1 limited-range module solely by driving signals that could be scanned out testing have proven that the NMC and the NCA are of the pad latches. This subset implementation pro- ready to operate on the CPU moclule at this speed. vided the same module-level coverage as would An WN(1 CPU that hlnctions at this 10-ns speed can have been possible using a full JTAG irnplementa- operate the cache with no slip cycles using the tion. In addition, the implementation removed faster SRAMs. Future products may be based on an some design obstacles that were causing implemen- NVAX running at a 10-ns cycle, as sufficient yields at tation problems in the chips. this speed bin are reached. The ability to use scan-based testing is advanta- geous to the n~anufacturingprocess in the follow- Modzcle Testing ing two areas: The module and chip teams considered more than 1. The scan tester can find open circuit defects in one approach when determining what module- the cache area where the bed-of-nails tester level testability features to implement in the V[SI could not resolve whether the problem was a fis- devices Digital was building for \TAX 4000 Model ture contact problem or an actual open circuit. 500 computers. The scan-based Test Access ancl Boundary Scan Architecture (JrAG) proposal was 2. The ability to create "virtual test points" on coming into its omin, ant1 the teams desired to fol- scanned nets has allowetl the test coverage of low that specification, if scan-based test features the bed-of-nails tester to be expaticled without were to be i1sed.6However, no other devices on the having to purchase an expensive tester upgrade.

66 Vo1. 4 IVO..3 Slr1t1117el-I992 Digital Technical Jourrral Design of the VRX 4000 Model 400,500, ulzd 600 Systems

Clnfortunately,the nioclule and chip teams' previ- clocl<. The NMI timing scales with the NDAL clock ous experience with scan-based testing at the mod- cycle time. ule level had been spotty at best. The ability of this In this section, we describe the architecture of test to reduce the pin density, therefore, was not the NMC, the objectives of the NMC project, and the ilsetl to full advantage in the problem areas under results of the effort. the large PGA tlevices. Basecl on the experience test- ing the C13U module for the three new VPiX 4000 sys- NMC Architecture tems, follow-on products have been able to use this As shown in Figure 5, the NMC is partitioned into testing feature to good advantage. The teams now six major sections: the NDAL arbiter, the NDAL inter- have a firm base of experience on which to base face, the transaction handler, control and status reg- fiiture test strategies. isters (CSRs), the menlory interface, and the 0-bit interface. The NMC responds to all inelnosy space The WAXMemory Controller addresses when NDAL address bit 29 is equal to 0 The NMC is a 520-by-500-milcustom chip fabricated and responds to 1/0 space addresses in its allocatecl using Digital's 1-micrometer CMOS-3 process and range, i.e., 2101 0000 .. 2101 FFFF (hexadecimal). contains 148,000 transistors paclcagecl in a 339-pin The NDAI. arbiter gives highest priority to the PGA. The NMC provides the interface between the NMC for returning read data. The two 1/0 nodes NDAL bus and up to 512blB of main memory by have second priority; their requests are hand led in means of the NiLlI The NDAL bus supports three a round-robin fashion. The CPU has lowest priority. other nocles on the NDAL-the NVAX CPU and up to The NDAL interface consists of an input section two 1/0 adapters (101 and 102). In the new VAX and an output section. The input section moni- 4000 systems, the NCA serves as both I/O nodes on tors the NDAL for a new transaction every cycle. A the bus. The NMC contains the arbiter for the NDAL valid transaction that has been decoded by the NMC and also helps the Nva CPU maintain cache-mem- is put into one of four transaction queues ory coherency in the system by interfacing with a (INQUEUEs). There is one queue for each of the separate 0-bit memory. NDAL nodes: CPU, 101, 102, and the fourth, which The NMI can operate with either a 32- or 64-bit- stores release write privilege transactions. wide data path and supports single error correc- Commander nodes on the NDAL initiate release tion, double error detection, and nibble error write privilege transactions to release the write detection, and runs synchronous with the NDAL privilege of bloclcs in memory. The NMC must

CONTROL CURRENT ADDRESS DATA NDAL *ODRESS > TRANSACTION 0-BIT ADDRESS I-WNDLER WRITE DATA INTERFACE

NDAL NDAL INTERFACE CONTROL CSRS - ADDRESS CSR READ

MEMORY READ DATA

pElARBITER

Figure 5 WM Memwry Controller Block Diugmm

Digital Techrrical Jourtral Vol. 4 No. 3 Sunzmer 1992 67 WAX-microprocessor VAX Systems accept a release transaction from a node, irrespec- NMC Project Ob~ectiveand Results tive of the state of its INQUE'LJE; otherwise, there is a The NMC project objective was to create a liigli-per- potential for cleadlock. Consequently, the NMC has formance nielnory design that would be com- a separate queue for release write privilege trans- patible with the VAX 4000 Model 300 nlelnory actions. The output section of the NDAL interface subsystem and could provide two to three times buffers up to six qllaclworcls (it., six groups of four tlie performance of that subsystem. (The v.4000 contigilous 16-bit words for a total of 384 bits) of Model 300 has a banclwiclth of 47.4M~/s,at a cycle read data that must be returned to the NDAL. In a time of 28 ns, using 100-11s DltkVs with a 32-bit normal functioning system, a buffer depth of six memory data bus.) This goal was achieved by using c~uatlwortlstogether with an arbitration scheme a 64-bit memory data bus ancl an interconnect that that gives tlie NMC tlie Iiighest priority will never operates at ;I cycle time as low as 36 ns, using 100-ns result in ;I fi~lloutput queue. Therefore, there was DRAMS, and at an NDAL cycle as low as 30 ns, using no need to check for a fill1 queue and to stall while 80-ns DIWMs. The asymptotic bandwidtli on the loading the queue. NMI using the 100-ns DRAM technology and a 64-bit The tr~nsactionhandler arbitrates between the tlata path is lll.llMB/s, it., 2.3 times the bandwidth fo111-INQIIEIJEs ;untl stores selected transilctions in a of the VNL 4000 Model 300. Using faster 80-11s current transaction birffeec Tlie current transaction DRAMS, tlie batidwitlth is 133,33MB/s,i.e., 2.5 times buffer serves as a pipeline stage; this buffer allows the bandwidtli of the VN( 4000 Model 300. the corresponding INQUEUE to be loacled with the The NMC interface is efficient from the moment it next transaction while the current transaction is receives a transaction on the NUAL until it starts a being servicetl. The NMC can service back-to-back transaction on the NMI. This timing path was transactions with no stall cycles on the NMI. extremely tight and results in real memory read The CSR section of the NMC contains meniory bandwidth of 63.6~~/sand 76.32~~/sat 36-11s and configuration registers, error status registers, and 30-11s cycle times, respectively. motle ant1 tliagnostic registers. Tlie NMC chip was designed to meet an NDAL The memory interface contains the data path, cycle time of 36 ns, which made the timing very adtlress path, and co~itrolfor up to four memory critical. Most of tlie NMC chips produced can modules on the NMI. Tlie data path contains all the exceed this goal and will run at an NIIAL cycle error correction ant1 detection logic. The address time of 30 11s. Future designs based on the 10-ns path contains the row and column acltlress multi- NVkX chips will require this c)rcle time. plexers ant1 a refresh address counter. Tlie control To meet the performance goal, we chose to use is provitlecl by a state m:tcliine that can perform a 64-bit meniory interface. However, achieving multitransfer reatl operations, multitransfer write compatibility with the Model 300 menior)l moclules operations, ant1 read-modifywrite operations. presented a challenge with respect to the Ecc The 0-bit interface directly controls the 0-bit generation and checking mechanism. A simple dynamic random-access ~nernories(DMMs), whicli approach woulcl have been to include two separate are housecl on the (;I'[J n~odule.For every memory ECC trees, one for 64-bit operation and the other transaction, the NMC reads the corresponding 0-bit for 32-bit operation. This design would have been ill parallel with the memory access. If the block of very are^-intensive, so we chose 64-bit ECC code memory is written, the memory transaction is such that 32-bit Ec<: was a subset. 64-bit Ecc aborted until the corresponding rele;lse transaction requires eight check bits, and 32-bit ECC requires is received by the i\riL1C. If the block is unwritten, seven check bits. In oilr 64-bit code, the eighth the memory transaction is allowed to cornpletc. check bit depentls solely on the upper 32 bits. 11132- Initiating the memory transaction in parallel with bit mode, we force the upper 32-bits to a known the 0-bit access reduces the transaction latency on value; therefore, that check bit is always a fixed tr;~nsactionsthat are not written. Since most mem- value in 32-bit motle. ory accesses are to unwritten locations, using this The \'a4000 systems do not allow the use of 32- scheme improves memory performance consicler- bit memory modules, because it is tlifficult to meet abl).. 111 addition, the system design engineers were Q-bus latency requirements with the slower mem- able to use inexpensive DlLkls to implement the ory. This system constraint indirectly affected the 0-bit memory instead of faster, more expensive NMC. The CQBlC and SGEC devices, for example, had SIL\Ms. stringent low l;~tency requirements. The Q-bus

68 1'01. 4 LVO.3 Su~rrtt~er1992 Digital Technicril Journal 1;ltenc)i problem had to be solved without c;lubing tlie G chip (used in the VAX 4000 &loelel 300 sys- WE<:latency problems. Thus, the latency issue was tems), the N<:A chip was a new clesign. To optimize ;~cldressedin the following way: the DBh to memory b;lntlwidth and the bus access latency the N(;A provicles the two CP bus interface The NklC has a mode that indicates whether or not the Q-bus is present in the system. l'luring ports (CPI ;~ncl(:t)2, mentionetl previously), which this mode, the transaction handler gives the operate inclependently. 'Tllis strategy has three (2-bus node (102) the highest priority. However, advantages. First, the (2-bus adapter ancl the system ~~ad-o~llymemory (I<(oM)/consoleare connecteel to to keep the SGEC latency within required limits, the transaction handler must service trans- the CP bus separately from the Ethernet and the ;~ctionsfrom the 101 node at str;~tegictinies. mass storage devices. This arrangement allows the systeni to tune and optimize throughput on the "lo minimize the latency seen by any one node, Ethernet and mass stor;lge clevices without clegracl- the three NDAL nodes reqilire separate queues. ing the bus ;Iccess latency seen on the (2-bus. The simplest implementation of the NIIAL input Second, the Io:~dingon the two CP buses is recli~cecl. interface would have been to have two queues, Therefore, e;~clibus can operate at a liiglier fre- one for release write privilege tr;~nsactionsand quency and without external buffers; this also snves one Sor all other transactions. l'hus, preserving module area. Third, the dual-bus structure allows the order of NDAL transactions would have been the use of a simpler bus arbitr;ltion scheme. very easy. With three queues, it is necessary to In addition to using the tlual-bus strategy, tlie N<:A comp;lre queue addresses to preserve trans- uses write buffer ;lntl re:~dprefetch transactions to 21ction ordering. allow ])MA devices, particularly the SIWC and S<;l:<: The NMC combines the CPU request for write chips, to oper;lte efficiently in clouble octaword privilege, the Dmread, and the 1)MA write into a mode, where ;I two-oct;~worcl(32-byte) bi~rstof tl;lt;i single transaction before retiring the data to is transferretl within ;I single bus grant. For write mcmory This optimization reduces the latent!, transactions, the N(:h t)i~ffersup to two octnwords of written transactions. of data. Thus. the bus c;in operate without stalling. Although the features that were added to the while the N(:A arbitrates for the NDAL bus for the NM<: chip to recluce Q-bus and S<;E<: latency buffered write transactions. For read trans;~ctions, incre;aed tlie complexity of the chip, these features the N<:A contains 21 hexworcl-size (32-byte) prel'etch successfi~lly keep the (2-bus latency below 8 buffer Wliererls tlie maximnm burst length is only microseconds. an octaworcl on the <:I' bus, the NCA requests up to a hexwortl of d;lt:l cluring I)MA nielnory re;ld opera- NDAL-to-CP Bus Adapter Chip tions. The extra d;~t;~is stored in the prefetcli buffer NCA is a fi~ll-custom,high-performance I/O con- and is immedir~tely;~vail;~ble if the subsequent <:I' troller chip that provides tlie electric;~land func- bus read tl-ans;lction targets the same atldress ;IS the tional interface between tlie Whit NIML bus and prefetchetl clata. For WAS initiateel 1/0, up to four the 32-bit CP bus. In the new VAX 4000 systems, the operations can be buffered simultaneously. NJML supports three chips: the NVAX <:1'1J, the NU, The NCA is partitioned into four major sections: ant1 the NMC. 011the C13 bus, the N<:h su1,ports the the NDAL, <:I'I, (:1)2. ;lntl registers. The NDAL inter- SHA(:, SGEC, CQBIC, and SSC chips. N(:A is hbri- face and e;lch (:I) 1x1sinterface contain the master c:~tetlin Digit;ilSsCMOS-3 process, contains 155,000 and slave sequencer ;lncl controls for the corre- transistors, ;inel is packagrcl in a 3.-Ic)-pin tlC;A. The sponding bus. The N<:A chip also has 1/0 queues ;~ntl design goals for the NCA project were high quality, an internal arbiter to select operations from (:PI, improvecl performance, optimizecl time-to-market, CP2, anti N<:A register reat1 tr;lnsactions to the NLIAI. and leveraged use of existing (:I' bi~schips. The NCA bus, basetl on ;I precleterminecl priority. team ;lchievecl all design goals and completed the The register section contains the control and st;i- project by the scheduled manufacture release elate. tus registers ;uncl the interval clock timer registers. The interval clock is ;I soft\vare-program~ii;~ble NCA Architecture Overuiezv timer i~setlby the oper;~lingsystem to :lccount h)r and Partitioning time-dependent events. Although the original concept of the N<:A was moti- The NCA supports parity check ancl cletection vatecl by a memory and I/O controller chip called on both the NI)AI. and the <:P buses. The N<:A illso

Digital Techtzicnl Jorrnznl lhl. ,4 No. .j SIIIIIIII~,I. 1992 69 NVAX-microprocessorVAX Systems

supports all interrupts defined by the NDAL and CP by up to 66 percent. For an 80-11s CP bus cycle time, bus protocols for other types of error contlitions. the maximum bandwidth is 33.33MR/s. When an error occurs, an error status bit is set in the NCA error . Depending on the Testubility type of error, an address may be available for tliag- To assist motlule testing, the NCA contains features nostics. The NCA also provides a mechanism to that comply with the IEEE Stantlard P1149.1 JTAG force a parity error condition on any of the buses to testability.%t the pin level, five special pins are help debug the interrupt routines of the operating provided and work in combination with the inter- system software. nal test access port controller inside the NCA and a bed-of-nails tester to perform short- and open- Q-busLatency Support circuit interconnection tests. To reduce latency seen by the (2-bus devices, the NCA provides special logic to gain priority MSG90 Memory Module from the NDAL arbiter. The NCA informs the arbi- The ~~690family of CMOS memory modules was ter of the imminent Q-bus operation, for which designed to support the memory requirements set latency is a concern. When a Q-bus is present in forth by the i\'VAX memory controller.2 The NMC the systems, the NCA is programmed to use the two requires the Ms690 memory module to provide a IDS mode on the NDAL ancl to enable the Q-bus two-way, bank-interleaved, 72-bit tlata path. In addi- present bit of the control register. Upon tletection tion, a self-test feature is provided that was used on CP of the Q-bus "map" reacl transaction on the the VNI 4000 Moclel 300 memory subsystem. The bus, the NCA immetliately asserts a signal to the ~~690module returns a unique board itlentifica- NDAL arbiter. The arbiter will not grant the bus tion sig~saturewhen polled by the NMC. The niod- to other devices until after the Q-bus read trans- ule used existing qualified parts and fits on a NMC action is accepted by the or until the signal quad-sized PWB. is tleassertecl. Requests from the buffered write at A common goal of Digital's Electronic Storage the same interface are masked off until the signal 1)evelopment (ESD) teams is to utilize a single PW is tleasserted. Using this scheme, the Q-bus latency design to accommodate as many memory sizes in the new VAX 4000 systems was never more than as possible. The ESD teams routinely stretch the 8 microseconds. boundaries of Digital's manufacturing processes to provide world-class memorjr subsystems. Because Enhanced CVAX Pin Bus memory subsystems form the core of the ESD char- The NCA supports the standard bus protocol in ter, the ESI) teams are uniquely t:ined into, and both synchronous ancl asynchronous motles. The actively shaping, present and future device specjfi- existing CP bus protocol does not utilize the maxi- cations for all types of random-access tlevices. This mum bus bandwidth possible w~ththe standard cp advance ancl intimate knowledge allows us to build bus protocol. The fastest data transfer rate is two current technology products with the hooks neces- cycles per four-byte (i.e., 32-bit) longword, because sary to capitalize on the next generation of storage the two primary signals for the hanclshake use the devices. same clock phase for the assertion. When the sent The Ms690 options are available in 32MB, 64~8, signal is detected, it is already too late to generate and 128MB sizes and are self-configuring. The the received signal within the same cycle. To ~~690memories communicate with the NKby achieve the one-cycle transfer rate, a modified pro- way of the private NMI. All co~ltroland clocks sig- tocol is used. The received signal is changed to rep nals originate off-board via the N&lI from the NMC. resent a ready-to-receive signal. The received signal Up to four memory motlules of any density mix may is asserted regardless of whether or not the sent sig- coexist on the NMl with a maximum memory size nal is asserted. When the sent and received signals of 512MB. are asserted at the same time, both the master and The ~S690is an extension of the existing 39-bit slave devices know that the data was successfully ~~670memory product designecl for the VAX 4000 transferrecl. Model 300 product line. The ~~562GMX was This protocol works with the existing CP bus designed and protluced in Digital's Hutlson, Mass- chips ant1 has increased the theoretical bandwidth achusetts, plant for the ~~67032MB memory. This

70 1/01. 4 No. 3 S~rmi~ier1992 Digital Technical Jourrral Design of the VAX 4000 Model 400,500, and 600 Systems

GMX is a semi-intelligent, 20-bit-wide, 4-to-1 and of the ~~690memory option and thus help control 1-to-4transceiver, with internal test/compare/error costs by reducing product-unique inventory. All logging capabilities for its five 1/0 ports. The ~~670parts, except the bare PCB, are used on products required eight banks of 39 bits of data, hence the already protluced in volume at Digital's Singapore reqi~irementof two GkiX chips per module. and Galway, Ireland, manufacturir~gplants. The ~~670CPU moclule used in the VAX 4000 The ~~690-BAmemory module, which uses 100- Model 300 transfers 32-bit longwords of data. For ns 1M-by-1MDlUMs, can support NIMC cycle times every longworcl, 7 bits of EC<; must be allocated, of 36 ns and 42 ns, respectively, for the ViiX 4000 i.e., 8 x (32 + 7) = 312 DRAMS. The CPtJ module used Model 400 and 500 systems. The ~~690-CA/DAmod- in the vi\x 4000 Model 500 transfers 64-bit quad- ules use 80-11s 4M-by-1M L)lLZWs and can accom- wortls of data. For every quadword, 8 bits of ECC modate 30-ns, 36-ns, and 42-ns NMC cycle times. must be allocatetl, i.e., 4 x (64 + 8) = 288 DRAMS. The ~1~690memory is configured as two inter- Performance ledved bank pairs, each 72 bits wicle (64 bits of data The CPU 1/0 subsystems on all three products pro- plus 8 bits of Ecc); all transactions are 72 bits. The vide exceptional performance, as shown in Table 3. memory moclule supports quadworcl, octaword, The pair of DSSI buses on the CPU rnoclules for the and hexword read/write/read-mocI~-writetrans- V!4000 Models 500 and 600 were tested under actions. TI-ansactions less than 72 bits, i.e., bytes, the VMS operating system performing single-block wortls, and longwortls, are not supporteel. (512-byte) reads from RF73 disk drives. The read Doubling the data word length is advantageous in rate was measured at over 2,600 I/Os per second two ways: the I/O bandwidth effectively doubles, with both buses running. The Ethernet subsystem, and 24 fewer DRAii1s are required. This last benefit based on the SGEC adapter chip, is also very effi- results from the fact that only one additional bit is cient. It has been measured transmitting and receiv- required to protrct 64 bits of data as compared to ing 192-byte-long packets at a rate of 5,882 packets protecting 32-bit clata. The available PWB space per second. Packets 1,581 bytes long can be trans- allowetl room for two aclclitional GMXs to hallrlle mitted at a rate of 9.9 megabits per second. the 33 arlclitional data bits. The ability to use the The performance of the CPU subsystem has tradi- existing <;MX integrated circuit eliminated the neecl tionally been measured using a suite of 99 bench- for a new1, 40-bit-wide,<;MX-type VLSI development. marks.' Scaling the results against the performance Because DRAMS are edge-sensitive devices, mod- of the VAX-11/780 processor and taking the geo- ule layout, balanced etch transmission lines, and metric mean yieltls the VN( unit of performance signal conditioning are extremely important to a (WP) rating. The processor WP rating for the new quality product. The ~~690design team used VLK 4000 system with the lowest performance, a combit~ecltotal of 18 years of memory design the Model 400, is twice the WP rating of the sys- experience along with extensive use of SPICE tem it is replacing, the Model 300. The two new modeling to determine the optimal PWB layout. The high-end systems provide three and four times the result was a clouble-sided, surface-mount PW performance of the Model 330-an impressive per- panel that can accommoclate all density variations formance increase.

Table 3 Summary of Performance Results for the VAX 4000 Models 400,500, and 6007 Metric Unit Model 400 Model 500 Model 600 SPEC Release 1.0 SPECmark 22.3 30.7 41.1 SPECint 17.1 24.9 31.8 SPECfp 26.6 35.4 48.7 Single User 99 VUPs 16.9 23.8 31.4 TPC-A tpsA-local 51 .O 62.4 103.0 Integer MIPS 34.2 43.4 64.4 Whetstone Single MIPS 47.6 71 -4 83.3 Whetstone Double MIPS 32.3 45.5 52.6 LINPACKD (100 by 100) MFLOPS 4.8 6.9 9.5 LINPACKS MFLOPS 7.5 10.5 14.7

Digital Technical Joirrrral Vnl. 4 IVO. .? Summer 1992 7 1 WAX-microprocessor VAX Systems

The system performance in multistream and Electrical ancl Electronics Engineers, Inc., transaction-oriented environments was measured 1990). with TPC Benchmark A."liis benchmark, which 7 VAX Systems Performance Summary (May- simulates a banking system, generally indicates per- nard, MA: Digital Equipment Corporation, formance in environments that are characterized by Order No. EE-C0376-46,1991). concurrent CPU and 1/0 activity and that have more than one program active at any given time. The per- 8. Transaction Processing Performance Cou~z- formance metric is transactions per second (TPS). cil, TPC Benchmark A Standard Sl,ecificc~tion The nleasured performance of the VAX 4000 Model (Menlo Park, CA: Waterside Associates, 600 system was more than 100 TPS, tpsA-local. As November, 1989). shown in Table 3, the performance of the new VAX 4000 Moclel 400, 500, ant1 600 systems is impres- General Reference sive, even coml~aredto NsC-based systems. DEC STD 032-0 VM Arclnitecture Stanclard Acknowledgments (Maynard, MA: Digital Eq~~iymentCorporation, The authors would like to acknowledge the contri- Order No. EL-00032-00, 1989). butions of the following members of the design teams: Matt Badcleley, Roy Batchelder, Abed Chebbani, Harold Cullison, Todd Davis, Joe Ervin, Nannette Fitzgerald, Avanindra Godbole, Judy Gold, Bryce Griswold, Thomas Harvey, Jim Kearns, John Kumpf, Steve Lang, Dave Lauerhass, Bill. Munroe, Chun Ning, Jim Pazik, Cheryl Preston, Matt Rearwin, Dave Rouille, Prasad Sabada, Stephen Shirron, Atlam Siege],Jaidev Singh, Emily Slaughter, Michael Smith, Steve Thierauf, Bob Weisenbach, and Pei \Vong.

References

1. G. UIiler et al., "The WAX and NVAX+ High- performance VAX Microprocessors," Digital Technical .\o~frizal, vol. 4, no. 3 (Summer 1992, this issue): 11-23. 2. KAGSO CPU ~kloduleTechnical Matzual (May- nard, ,MA: Digital Equipment Corporation, Order No. EK-KA~~O-TM-OO~,1992). 3. R. Maskas, "Development of the CVAX Q22- bus Interface Chip," Digital Technical Journal, vol. 1, no. 7 (August 1988): 129-138. 4. J. \Vinsto, "The System Support Chip, a Multifi~nctionChip for CVlLY Systems," Digital TechnicalJournal, vol. 1, no. 7 (August 1988): 121-128.

5, I? Rubinfeld et al., "The CVAX CPU, a CMOS VtU( Microprocessor Chip," ICCD Proceedings, (October 1987): 148-152. 6. IEEE Stnlzdard Test Access Port and Bound- ary Scan Architecture, IEEE Standard 1149.1- 1990 (New York, NY: The Institute of

72 Vol. I iVo. ,7 Suirrrr~ar1992 Digital Techrrical JWJ-trnl Jonathan C. Crotuell David W Maruska 1 me Design of the VAX 4000 Model 100 and MicroVM3ltW Model90 Desktop Systems

The 1VJicroVAX .3J00 Model 70 and VAX 4000 illode1 100 s~tsterizstoere ~Iesigneclto meet the grozuiiz& dewband for lowcost, higbperfonnance desktop servers and tinze- sharing sjstems. Both syste7n.s are bmed on the IWSCPU chip and a set of VLSI sup- port chips, zvhicb p~-orliAeoutstanding CPU and I/O perybi-112arzce.Housed iiz like desktop e~zclosures,lbe tzuo sj~stenlsprouide 24 times the CP[Jpet$ori~za~rceoftl?e original VAX-11/780 cornputel: With otler 2.j gigabj~tesof disk storage aizd 2-38 megabytes of ?nuininernor??, the complete Dme sj~sternfitsin less thcrtz one cubicfoot ofspace The system design tucrs high!^^ leveraged frojn existing designs to help tneet an aggressioe schedzile.

The clernantl for low-cost, high-performance desk- found in all previous VAx 4000 systems, i.e., I)igit;~l top servers ant1 timesharing systems is increasing Storage Systems Interconnect (1)SSI) storage and rapidly. 'The MicroVAX 3100 Moclel90 and VU4000 Q-bus expansion. The CPrJ mother board for the Model 100 systems were designed to meet this VAX 4000 Model 100 system is cal let1 the KA52. demand. Both systems are based on the WAX <:PU The KA50 ancl KA52 <:Pus are built from a com- chip and :I set of very large-scale integration (VLSI) mon <:PI1 mother board design; the base (:I'll support chips, which provide outstanding CPLl nntl mother board is configilred to create either the 1/0 performance. K~50or the U.52 product module. The DSSl ancl Each member of the MicroVAX 3100 family of Q-bus optional hardware is added to the <:!'(I systems constitutes a low-cost, general-purpose, mother board to convert a m50 to a U52. Also, to multii~serVAX system in an enclosure that fits provide the additional superset functionality found on the desktop. This enclosure supports all the on the U52 <:PrJ, the different system read-only required components of a typical system. including meniories (ItOMs) are added tluring the m;inuhctl~r- the main memor): synchronol~sant1 asynchronous ing process. In this paper, the KA50 and &\52 <:l'lls communication lines, thick-wire and ThinWire are referred to as the CPU mother board or motlule, Ethernet, and up to five small computer system except where differences exist. interface (S<:sl)-basedstorage devices. The system design was highly leveragecl from The MicroVAX 3100 Moclel9O system replaces the existing designs to help meet an aggressive schecl- Model 80 as the top performer in the line; the new ule. This paper describes the design and implemen- model has considerably more than twice the <:P\I tation of these systems. power of the previous moclel.' The Plodel 90 systerii also includes performance enhancements Design Goals to the Ethernet and SCSI adapters, as well as an The design team's primary goal was to develop ;I increased maximum system memory of 128 mega- CI-'lJ mother bo;~rdthat woulcl provide at least twice bytes (MH). The CPU mother boartl for the MicroVAX the cru performance of the MicroVM 3100 Nlodel 3100 Model 90 system is called the KA50. 80, while supporting all of the same I/O fi~nctional- The \'AX 4000 Motlel 100 system is housetl in the ity of the previous sjatems. This new system wo~~lcl same desktop packaging ;IS the MicroVAX 3100 leverage the core CPLJ design fro111 the VAX 4000 model 90 and provides the same base function;~lity. Model 500 system, thus delivering the high perfor- The VAX 4000 Model 100 adds two key features mance of the WAX' CPU chip to the desktop?

Digital Ach~iicalJournnl Vol. d No. -3 Sumtner 1992 ?3 WAX-microprocessor VAX Systems

The team set additional goals to increase system measures 14.99 centimeters (5.3 inches) high by capability and performance. These goals were to 46.38 centimeters (18.26 inches) wick by 40.00 cen- timeters inches) deep. 1. Increase the maximum system memory from (15.75 The system enclosure contains two shelves that 72MB to 128MR support the mass storage devices. In the lMicroVAX 2. Provide error correction cocle (ECC) protection 3100 Model 90, these storage locations are cabled to to nlemory using memory arrays that previously support SCSI clisks and tapes. The upper shelf sup- supported only parity ports three SCSI disks, whereas the lower shelf supports two SCSI devices (any combination of 3. Increase the performance of the Ethernet adapter removable or 3 1/2-inch disks) with access through 4. Increase the performance of the SCSl adapter a door in the front ofthe enclosure. I11 the VAx 4000 Model 100, the top shelf is configuretl to support Early in the project, the team proposed creating a three 3 1/2-inch DSSI disks; the bottom shelf sup- second CPU design that would have the features of ports two SCSI devices, as in the MicroVa 3100 the larger VAX 4000 systems. This proposal resulted model 90. in the design of a DSSl adapter option for the CPU The VhX 4000 Model 100 DSSl support is provicletl mother board, as well as a Q-bus atlapter to provide by a high-performance DSSI adapter card based on a means to upgrade the CPU power of older the shared-host adapter chip (SWC), i.e., a custom Q-bus-based Micro\Ja systems. VLSI design with an integrated reduced instruction The project to design, implement, and field-test set computer (NSC:) processor.' The system is con- these systems was accomplished under an aggres- figured with DSSI as the primary disk storage. The sive scheclule. Both designs were ready to ship to DSSIbus exits the enclosure by me;ins of a connec- customers in just over nine months from the official tor on the back panel. This expansion port can be start of the project. used to connect the system to additional DSSI Systmn Overview devices, or to form a DSSI-based VAXcluster with a second V,4000 Model 100 or any other DSSI-basecl The MicroV. 3100 Model 90 system supports system. the same I/O functionality as the previous genera- The Q-bus support on the VAY 4000 Model 100 is tion of systems, the hlicro\lAX 3100 Models provided by the VLSI adapter chip, i.e., the C\JM 40 and 80. The features include a SCSl storage adap- Q22-bus interface chip (CQBIC).' There are no ter, 20 asynchronous communication ports, two Q-bus option slots in the system enclosure. The synchronous cornmunication ports, and an Q-bus corlnects to an expansion enclosure through Ethernet adapter. a pair of connectors at the rear of the system enclo- The VAX 4000 Model 100 includes the same 1/0 sure. Two shielded cables and the H9405 expansion functionality as the ~MicroViLV 3100 Model 90. In module are usetl to connect the Q-bus to the expan- addition, the system provides the I/O fi~nctionality sion enclosure. The near end of the Q-bus is termi- of tlie larger VAX 4000 systems, that is, a high-perfor- nated in the system enclosure. mance DSSl storage adapter and a Q-bus atlapter port that connects to an external Q-bus ellclosure. Both systems provide 24 times the CPU perfor- CPU Mother Board Design mance of a VAX-11/780 system The memory subsys- The design goals presented engineering with con- ten1 uses Digital's ~~44single in-line memory straints that forced design trade-offs, Some key modules (SIMMs) and thus provides ~~IMB,32>1B, constraints were (1) fitting the required functional- 64~~,80MB, or 128MB of main memory. ity on a single 10-by-14-inch module; (2) designing As shown in Figure 1, the system enclosure used the system to adhere to tlie system power and cool- to house both systems, namely the RA~~B,provides ing budget; and (3) minimizing changes to the func- mounting for the CP'IJ mother boarcl, up to five loca- tional view of the motlule over previous designs, to tions for disk and tape devices, a 166-watt power decrease the number of softw;tre motlifications suppl): and fans for cooling the system elements. In required for operating system support. addition, the enclosure shields the system from The primary way the design team minimized radiated emissions All l/O connections are filtered system development was to leverage as much as ancl exit the enclosure through cutouts 111 the practical from existing designs. The Cl'U mother rear panel. The system enclosure is compact and board design used components fro111 the VAX 4000

Ifol. 4 No. 3 S~~rnnrer1992 Digit~~lTechrrical Journal The Design of the VAX 4000 rModel100 and MicroVAX 3100 Model 90Desktop Systems

OPTIONAL DSW42 110 POWER MODULE AND LOGIC

OPTIONAL DHW42 110 MODULE AND LOGIC BOARD

MEMORY MODULES

Q-BUS EXPANSION PORTS / / \

OPTIONAL DSW42 PORTS

MMJ SERIAL-LINE PORTS

ASYNCHRONOUS MODEM CONTROL PORT THICK-WIRE ETHERNET CONNECTOR THINWIRE ETHERNET CONNECTOR SYSTEM AC POWER CONNECTOR ONIOFF SWITCH

Figure 1 VM4000 Model 100 System Enclosure Showing the CPU and Connectors

Model 500, Microvu 3100 Model 80, and VAXstation The CPU mother board includes a linear regulator 4000 Motlel 90 systems. Using proven design that generates local 3.3-volt (V) current for the components allowed for a shorter development CPU core chip set. The voltage is stepped down cycle, s~nallerdesign teams, and consequently, a from the 5-V supply. The regulator is necessary higher-quality design, while meeting an aggressive because the 3.3-V direct current (DC) of the system schedule. is not sufficient to meet the 2 3 percent tolerance The design is structured so that both CPu mother regulation or to supply the required maximum boartls can be built using the same printed wiring current. board (PWB). The added functionality for the KA52 is provided by a daughter card, additional hardware CPU Core ant1 cabling, ant1 different system ROMs. The shared The CPU core consists of three chips: the NVAx CPU design helped reduce the complexity in testing and chip, the N\7LY data and address lines (NDAL) qualifying the system design. memory controller (NMC) chip, and the NDAL- The CPU module contains three major sections: to-CVAX pin (CP) bus adapter (NCA) chip. The the CPlJ core, the memory subsystem, and the I/O NVAX chip directly controls the 128-kilobyte subsystem. Figure 2 is a block diagram of the basic (KB) backup cache. The core chip set is inter- CPU ~llodulefor the VAx 4000 Model 100 and connected by means of the NDAL pin bus, as shown MicroVh?( 3100 Model 90 systems. Figure 3 is a pho- in Figure 2. The NDAL bus is 64 bits wide, has a tograph of the module, including the DSSI daughter 42-nanosecond (ns) cycle time, and supports carcl option. pended transactions.1 The peak bandwidth of the

Digital Technical Journal Vol. 4 No. .? Sumnzer 1992 NVLY-~nicroprocessorVAX Systems

r------7 I CPUCORE I I B-CACHE 128KB OR NVAX CPU I I 512KB CHIP I I I I I NDAL-TO-CP NVAX MEMORY 1 BUS ADAPTER I I CHIP (NCA) NDAL BUS I I------_ . ------I Ti--- I I I10 SUBSYSTEM I CP BUS 2 CP BUS 1 I I I I I I ------I I I I I I I I I SHAC DSSl I FLASH ROM I I ADAPTER CHIP :* 1 I I I I I I ---,,,,,----1 I I I I CEAClSQWF SUPPORT CHlP EDAL MEMORY SUBSYSTEM CHlP I I I BUS L 1 ------_-----____I

CVAX 0-BUS SGEC ETHERNET INTERFACE CHlP ADAPTER CHlP I OPTION I1 I I -,,-----,,,,1 ------I I I I I DHW42 I t I ,-1 ASYNCHRONOUS ' ETHERNET SCSl ADAPTER -

I I I 1 ,-----,,,-em 1 I I I I L------! Note lhat - + ~ndicatesa connection to an external bus

Fir2 CPlJiMo~l~~leBlock L)ic~g~.rr~n

NIML bus performing 32-byte operations is 152 cycle clocli ancl NI>AI. bus clocks are generated with meg;lb).tes per secontl (>1R/s). an on-chip clock generator supplied by a 286-mega- hertz (All-lz) osci1l;lLor. The h'VrU< <;I'll is basecl on a high-performance IVVAX CPU Chip The h'Vf\\S <:PI1 chip is an ;itI\r;~ncetl macropipelinetl ;irchitectt~resioiil;tr to that of the iii~plcmentationof the VILY ;irchitccture in I>igit;~l's \!AX 9000 (:I'IT.Ji 'l'lit: VI<: allows the caching of fourth-gener:~tion colliplen~entary metal-oxide instructions that have already been translated to semiconductor (CMOS-4) technology. The Nvu virtual atltlrcsses. Having the backup cache con- device consists of 1.3 million tr:~nsistorson ;I die troller on the chip tlecreases baclIC is the 2K11 virtl~al instri~ctioncache (Vl(:). ;I 96-entry I\r\IL?I rnemor! controller iniplemented in Digital's trans1:ition buffel; an on-chip SKI1 pri~ii;~r!~c;icllc, tliircl-gener;~tioncomplement;~r! metal-ox~de semi- and a11 on-chip backup cache controller. The <:I'll concluctor (CMOS-s)tecl~nology.~ The NMC cotlsists

WI. 4 No..3 .SIIII!~~CV1992 Digifal Technical Joztr~tal The Desigiz oJthe VAX 4000 ~Vlodel100 and ~kfio.oVAX3100 Jfodel90 Desktop Systenzs

lie3 CPU 1Motlner Honrd of 148,000 transistors and is the liigli-speed inter- interface from the NIIAL to the <;P bus6 Tlie N<:A face to the system main memory. Tlie N;vl(: is the consists of 155,000 transistors and supports two <:p arbiter for the three chips 011 the NI)AL. bus, namely, buses. The <;I' bus used on the CVa rnicroproces- the NVLY, the NCA, and the N&I<:. Tlie NMC chip sor family is also used on many of Digital's custom m:lnages tlie array of ownership bits tli;~tcorre- I/O adapter chips, such as the CQBIC, the SHAC, the spontl to each 32-byte segment of memory Each of seco~~tl-generatio11Ethernet controller (SGEC:), ;~ntl these segnients corresponds to :I cache line. Tlie the system silpport chip (SS<:).-'*THThi~s, the hard- ownership bit indicates whether the valid copy of ware and software designs for these I/() fi~nctionb tlie data is in memory, in the <:1'lJ write-back cache. coulcl be leveragetl from previous efforts. The N<:A or in an I/O devices buffec performs tlirect nicniory access (DiLlA) from the I/O The NMC has four commancl clueues that accept tlevices :lntl supports the c;iche consistency proto- read, write, and remove write privilege tr;lns- col required for tlie NJ)fiI.bus. actions from the NDAI, bus. Buffers hold the read ?'lie N<:t\ w;~sclesigned to optimize DMA tr:lffjc tlat;~to be returned to the ~wdethat requested the from CP bus clevices. In tlie KA50 CPLJ, the <;I' bus d;lt;l. The NMC and tlie memory subsystem provicle devices inclucle the S(;Ec: Ethernet aclapter, the $$

K~52<:PI! also attaches the CQ131<: Q-bus 2id;ipter Support for the S<:SI bus is providetl by the 53<:94 chip and the SHAC DSSI host adapter cliip. SCSI adapter cI~ip.~('"l'he5334 chip is interfaced to the system on the E[)AI. bus a11tl uses the SQW chip Memory Subsystem to increase its I>MA performance. The SQWF The memory subsystem is controlled by the NMC chip makes it possible to buffer data moving to the chip. The main memory is implementetl using MS44 CP bus. The SCSl bus operates in synchronous mode SlMMs and low-cost gate array (LCGA) chips to pro- for high-performance storage access of SMR/S. vide an interface between tlie NM<: and the SIMMs." The QUAKT gate array supplies the logic for four The SIMivls are used in groups of four to provide built-in serial ports. Tlie QUART, originally used two interleaved banks, each with a 64-bit data path on the DZQll Q-bus device, provides the same and eight bits of ECC. This interleaving scheme software interface 21s that device. The third port increases the bandwiclth of main memory by alter- provides modem control functions by means of nating data between both banks of memory adtlitional logic; tlie first, second. and fourth ports The E<:C provides single-bit error correction and are data leads only. tlouble-bit error detection. The SGEC Ethernet adapter cliip was chosen The individual SIMMs are available in either 4&1~ because it provides liiglier performance than the or 16~~variants. Since four SIMMs form a complete Ethernet adapter used on tlie Micro\lkY 3100 Model fu~ictionalset, sets can be 16~13or 64~~in size. 80. The SGEC is the adapter chip used on all \'AX Therefore, because the system supports up to two 4000 systems. In atldition, this chip directly inter- sets of SIMMS, the total system memory size can be faces with the (:I' bus. either 16~~.32IVIR. 64>1~,80MB. or 128&1R,tlepencl- The limitecl sjze of the CPLl mother board ing on the combination of SIM1\/1 size ;~ndtile num- recluiretl the I>sSI adapter to be added by means of ber of sets. a daughter c;ircl. Tlie Q-bus ;~d:ipterchip and bus To coincide with the cache coherency scheme termination are provicletl clirectly on the mother lrsecl in tlie N\lltY CPU chip, the N,\il(; keeps track of board. the cache lines tliat lia\!e write privilege rcserved by the <:pU or l/O devices. This state is stored in sep- Console and Diagnostics arate djrn;lmic random-access memories (DRAMS). The MicroVhX 3100 Moclels 80 and 90 differ in their 'l'hese I>tli\>ls interface tlirectly to the NM: by collsole clesigns ;111c1 cli;lgnostics. Because the basic means ofa private bus. The ownership bits are pro- CPU core of the NilicroV~a3100 1Motlel 90 and the tected by ECC. VAX 4000 Model 100 systems is very similar to tliat of the Vta 4000 Model 500 system, the design team I/O Subsystem decidecl to adopt tlie console of tlie Model 500 :rncl Hec;iuse the MicroVAX 3100 Model 90 was intended add the required commands and functionality. ;aan i~pgraclefor the Model 40 ant1 80 systems, the Borrowing proven tlesigns, such as the console of I/() subsystem of the earlier systelns dictated the another hV'-based system, significantly shortenetl design of the new Model 90. In addition, the 1/0 the protluct development scheclule. subsystem of the ro\52 CPU motlule for the Viuc One enhancement to the CPU mother board was 4000 Model 100 supports two functions found in the addition of a FEPROM subsystem. If an upd;~teis the other \?&Y 4000 systems, the DSsl adapter and required, the console and diagnostic code on tlie the Q-bus adapter.Vhe I/O subsystem includes CPIJ can be reprogrammed in the field. In contrast. a ThinWIire and thick-wire Ethernet atlapter, four previous systems recli~iredthe memories to be in built-in asynchronous terminal lines, a connector sockets and the parts to be replaced in the field. for the asynchronous option, and the CUC and With FEPROMs, a program is loaded from any SQ\VF chips. bootable device. This program erases the FEPROMs A bus interface was incorporated in the I/<) sub- and reprograms them with the new ROM image. system to support the DSW42 synchronous commu- This eilhallceme~itserves as an easy mechanism for nication option. the SCSI atlapter chip, and the updating the ROMs in the field to provide new fea- QUART four-port asynchronous controller chip. The tures or to fix bugs that ma); be discovered. <:EA<: :ind SQWF chips, which are gate arrays On power-up, the (:Pr! starts executing from the tlesignecl for the VAYstation 4000 ivlotlel 90, ;ire FEPROivI memory ;~ntlruns the power-up self-test to used to create the EDM. bus. help verify that the system is firlly operational.

Vol. 4 No. -3 Svtntner 1992 Digilnl Tecb~zicnlJorrr7znl 711~L)CS~.I(IZ of tlgc VM4000 /Model 100 and iMio.oVAX 3100 1Moclel90Desktop Systelrzs

Upon completion of tlie execution of this test, tlie onboard cli;~gnostics.how eve^; ;IS for the console, system tr;insfers control to tlie colisole program. the philosophy used in tleveloping these diagnos- Depending on the values configi~retlill nonvo1;ltile tics w3s to leverage as rnucl~of the software tlesign memory, the console progr;lm either boots the as possible fom existing tlesigns. system with the correct p:lr;lmeLers or stops For With the advent of larger boot and diagnostic console input. IiOMs, the tli;~gnosticcoverage of the power-up self- In earlier systems, the speed of executing from tests grezitly increased, inclutling extensive testing RON1 could be more than an ostler of magnitude of the cache, the memory antl tlie CPU core. These slower t1i;ln running from cachet1 main memory. tests help ;issure the customer t1i;lt ;I fi~iledcompo- The WAS <:I'll chip added tlie virtual instruction nent will be detected antl reportecl upon power-up. cache, which allows the caching of instruction In many cases, the new tests can help isolate fail- stream references from I/(> space. This feature ures in the inclividual SI&livl or cache chip. This fea- greatly incre;~sesthe perform;~nceof the code. ture is used extensively in manufacturing, ;IS well as by field service. During the power-up sequence, an instl-uction The console program gains control of the c:rrJ exerciser (HCORE) is run to test the floating-point whenever tlie processor 1i;llts or performs a restart liartlware. This test provides very good coverage operation such as power-up. Tlie console provides of the floating-point unit. In the past, H<:ORE the following services: has been run as a standalone diagnostic in ni;mu- facturing before a system is shipped. The tlesign 1. Interface to the diagnostics that test compo- team for the two new desktop systems believecl nents of the <:PU and system that this test should run on every system power-up 2. A~~tolli:~Lic/rnant~albootstrap of an operating self-test. system following processor li;~lts The Cl'u core is designed to fi~nctionover a wide range of environmental conclitions. Some variables 3. An interactive comm;~ndlanguage that allows of the environment are temperature, voltage, and the user to examine and alter the state of tlie minimum/maximun~component parameters at a processor given clock frequency. Exceeding the worst-case These ;Ire minor diffel-ences between the KA50 design envelope can cause ~~npredictableresults. and k152 consoles. Largely these differences relate For exan~ple,to avoid problems caused by a defec- to tlie Kt\52 <:l'U mother boartl support for tlie IISSI tive main clock oscillator that may be running too bus ancl the (2-bus.Although the console is similar fast, the tli;~gnosticsmeasure the speed of the <:PU to that I'oi~nclon the \'AX 4000 Model 500, some new c)~cleclock to tletermine if it is within tlie ;rcceptetl commands were impleniented to provide function- tolerance. If the c).cle is f;~sterthan the design mar- ality that exists on previous MicroVta 3100 systems. gin dictates, an error is reportecl. These commands include I..O<;IN ant1 SET PSWI) (set password), which give support for a secure con- Design Tools sole; SET/SHOW SCSLID; SHOW <:ONFI(;IJRATION; The design of the cpu mother boartl uniquely SHOW ERICOR; and various commantls to support ;I merged components f1-0111 several designs. Tlie system exerciser. success of this approach relied on the use of tlesign On the US2 <:PC!, tlie console supports tlie [ISSI tools to perform the merge zlntl to verifj. the cor- bus ant1 the Q-bus with a set of commands. These rectness of the merge. comniantls ;lllow polling of the DSSl bus to deter- The normal design process is to create ;I set mine what tlevices are present irntl to configure the of design schematics and to verify these through internal par;imeters of each clevice. The system can simul;rtion. Once the design is logically verified, be booted From devices on the S<:SI, DSSI, or (2-bus the layout process begins. The layout process chips, as well as ~\~ertlie Ethernet port. includcs the use of the SPI<:Esimulator to give direc- tion to the physical layout structure and to check Di~ig~zosti~~ the integrity of the layout.11After the layout is com- Diagnostics help isolate k~ultsin the system down plete, the database is fed back into logic simula- to the level of the field-replaceable units. Significant tion, which again verifies the correctness of tlie effort was expended on tlie developrlient of design database. WLY-microprocessor VAX Systems

The <:PU mother boarcl tlesigners took ;I different read olxr;~tionsfrom IW35 disk drives. The read ;~pproach.Since the physical placement of tlie con- rate w21s meas~rretlat more than 1,200 l/Os per sec- nector portion of the motlule was tlie same ;IS hr ond. The s(:sI adapter on both <;P[Js was me;~surecl tlie Micro\fit\: 3100 Model 80 niotlule, this tlesign at more tli;111 500 I/Os per seconcl for single-block element was ilsetl as the starting point for tlie over- reatls. :111 design. The d;~t;~b;~sewas eclitetl using tlie \$LY The Ethernet subsystem used on both the Kt150 1;ryout s1,stem (\'LS), ancl the only components and k!52 ~ilotlulesis very efficient ;tnd has been s;~vetlwere those that were to be used in tlie new measuretl transmitting &byte p;lckets ;rt a rate CI'LI module. ?'his procetlure providetl the correct of 14,789packets per secontl. The measuretl receive placement for all I/o connectors that exited the rate for 64-byte packets was 14,785 packets per system enclosure. second. In addition, tlie \b!X 4000 Model 500 <:1'11 core The perforniar~ce of the CPLr subsystem has nTasusecl as the basis for tlie CPU mother boartl etch traditionally been measured using ;I suite of 99 Inyour. Tlie Moclel 500 design has proven to have benchm;~rks.~~Tlie results are scaletl ;~g:~insttlie very gootl signal integrity clue to its well-thought- performance of tlie \JAY-11/780 processor, and out circuit boartl layout. To leverage Plotlel 500 the geometric mean is taken. This calci~lationyields work in the l;~!,out ol' tlie <:PlJ niotlier board, the the VAS unit of performance (Wlr)) rating. 'The pro- designers extracted tlie signal cessor V111) rating for both the &I50 ant1 KA52 CPUs routing froni the Motlel 500. This signal routing is 24 W Ps-more than twice tlie perl'ormance included the <:I'll core and cache treeing, tlie most of the Micro\rtIX 3100 Model 80. Table 1 presents critical areas. This ;~ppro;~cheliminatetl the neecl to a summary of the performance results for the \'AX model critic;~lsign;~l interconnect in the design ancl 4000 Motlel 100 and the Micro\RS 3100 Wlotlel 90 gu;iranteetl t1i;lt tlie signal integrity ancl connector systems. la).out wo~lltlbe itlentic;~lto that of tlie proven Tlie perform;lnce of the system in niultistream h'lodel 500. and tra~isactio~i-orie~iteclenvironments was niee Database cornp;~risontools were usetl to gi1;lrvi- surecl with TP<: Benchmark A.k1This benchmark, tee that the schematics matched the p1lysic:lJ Iayoi~t which simulates a banking system, genel-ally indi- database. As ;I final step, the physical layout data- cates perform;~ncein environments ch;~r;~cterized hnse netlist was usetl to create a siniulation moclel. by concurrent <:l'[i and I/o activity and in which I>ECSII\I,Digital's simulation tool, was used to verify more than one progr;im is active at any given time. tlie final correctness of the tlesign tlatab;lse. The perforni;~ncemetric is trans:~ctionsper second ('fPS). The me;isurcd performance of the Vr\X 4000 Model 100 is 50 TPS tpsA-local; that of tlie Wlicro\'kX The CI'U I/O subsystenis OII both the MicroVAS 3100 Model 90 is 34 TPS tpsA-loc;~l.7'he tlil'ference 3100 Model 90 ant1 tlie VAS 4000 Model 100 pro- in performance between tlie VA>; 4000 Model 100 vitle exception;rl perform;~rlce.The DSSl bi~son and the NJicroVAS 3100 model 90 is ;I result of their the U52 CPLi was tested untler the VkiS oper;~t- different tlisk subsystems, i.e., tlie 1)SSI ant1 SCSI iiig system performing single-block (512-byte) adapter support.

Table 1 Summary of Performance Results for the VAX 4000 Model 100 and the MicroVAX 3100 Model 90 Systems13

VAX 4000 MicroVAX 31 00 Metric Unit Model 100 Model 90 -- SPEC Release 1.0 SPECmark 30.5 SPECint 24.3 SPECfp 35.5 Single User 99 VLlPs 23.8 Integer VUPs 19.6 Single VUPs 26.0 Double VUPs 31.3 TPC-A 34 Tl7e Desiyn of the VRY 4000 Model 100 and MicroVAX 31 00 &lode190 Desktop %~ste??zs

Acknowledgments Tech~zicaLJo~rr~zd,vol. 1, no. 7 (August 1988): The authors would like to thank all the designers of 79-86. other products whose logic components were used 9. M. Callantler, Sr., L. Carlson, A. Ladd, ancl M. in the \la 4000 Model 100 and MicroVAX 3100 Norcross, "The Vastation 4000 Moclel 90," Model 90 designs, namel): the Vastation 4000 Digital Tech~zicalJo~lrrral, vol. 4, no. 3 Model 90, VAX 4000 Model 500, and MicroVAX (Summer 1992, this issue): 82-91. 3100 Motlel 80 teams. The authors woultl also like to acknowleclge the many engineers who helped 10. NCR Microelectronics Proclucts Division, /VCR design and clualify the VLK 4000 Model 100 and 5.3C94, 53C95, 5.3C96 Advanced Controller MicroVkY 3100 Nlodel 90 systems on a very aggres- Data Sheet (Santa Clara, Ch: NCK Corpora- sive schedule. Special thanks to Harold Cullison tion, 1990). and Judy Gold (diagnostics and console), Matt 11. SPICE is a general-purpose circuit simulator Benson and Joe Ervin (module design and debug), program developed by Lawrence Nagel and Cheryl Preston (signal integrity), Dave Rouille Ellis Cohen of the Department of Electrical (design verification), Colin Hrench (EMC consul- Engineering and Computer Sciences, Univer- tant), Larry Wlazzotle (mechanical design), Billy sity of California at Berkele)! Cassidy and Walt Supinski (PWB layout), and Rita Bureau (schem;~tics) 12. VAX Sj~stenzs Perjiorrna~zce Szinzmary (Maynard, MA: Digital Equipment Corpora- tion, Order No. ~~C0376-46,1991). Refmeme and Note 13. SPEC: A 1Vezu Perspective on Coml,aring 1. G. Uhler et al., "The Why ancl WAX+ High- System Performance, rev. 10 (Maynard, MA: performance \IS Microprocessors," Digital Digital Equipment Corporation, Orcler No. Technical Jotirnal, vol. 4, no. 3 (Summer EE-EA~~~-46,1990). 1992, this issue): 11-23. 14. Trarzsaction Processing Perfofonnance Coulz- 2, klicroL4lX 3100 Models 40 and 80 Technical c~l,TPC Bencl711zn,kA St~r~zd~l~zlSpecijiccltion Infor~nntiorrMa~zual, rev. 1 (Maynard, 1Lh: (Menlo Park, CA \%iters~de Assoc~ates, Digital Equipment Corporation, Order No. November 1989). EK-A0525-TD,1991).

3. KA68O CPU Afodule Teclonical PIa~zzral (Maynard, IMA: Digital Equipment Corpora- tion, Orcler No. EK-KA~~O-TNI-OO~,1992).

4. B. Maskas, "Development of the CVAX Q22- bus Iiiterface Chip." Digital Technical Journa/, vol. 1, no. 7 (August 1988): 129-138. 5. J. Nlurra): R. Hetherington, ancl R. Salett, "VAX Instructions That Illustrate the Architectural Features of the VN( 9000 CPU," Digital Tech~ziccrl./our~znl,vol. 2, no. 4 (Fall 1990): 25-42. 6. J. Crowell et al., "Design of the \!AX 4000 Motlel 400, 500, ancl 600 Systems," Digital Tecbt?icaI JOZLTIZLI~,VOI. 4, no. 3 (Summer 1992, this issue): 60-72.

7 l? RubinCeld et al., "The CVluc CPG, a CMOS VAx Microprocessor Chip," ICCLI Proceedings (October 1987): 148-152. 8. G. Litlington, "Overview of the MicroVm 3500/3600 I'rocessor Module," Digital Michael A. Callandel; Sr: Lauren M. Carlson Andrew R. Ladd Mitchell 0.Norcross

The VAXstaliol?4000 illodel90 1s llw lutest nzel~zberoJ the l/AX~tcttroi~~~~'odlrctIiiw Based on the IVI/AX CPIJ, the .l!odcl 90 illas des~yilt.clcis ct ~~~odule~rp~ylwtke to the l/ilXstatioll 4000 1Woclel60sj)ste~lz The ,llode190 has 2 tiliic~sIhe CIJlJpotfo~~i~rc~r~ce ofthe iUodel60 ~~~ldpr'o~~iclesbcrse-lerbel, tr~lo-dii~lt.~ls~oiic~l~i~~phrcspe~$)~si,la~~c'c of 266,000 i~ectols/)erS second It sm~p/~o~.t.sup to 12al11B oq ~~rei~zor:];an SC\I- I Olrs rlltels- face, a TURHOc/7anlzd option, a SJ~I?C/~~OIZ~ZLSconl~~z~i~~ic~mlioti optioll, 61izd SL'LJ~I'GII graphics optio~zs.The desiglz te~rn?~lsed orzl~ prog~zrninluble dellices lo i~lz/)kr~zcrlt the nezi, logic clesiglzed irzto the sjste~~z.In addition. a O~.ea&oard psocli~led tbe bnsisfor logic arrtl softu wr.e ~ler~ificrrtiorz.

During the summer of 1991, Digital's Semicon- merit in the Motlel 60 components. Seconcl. Ijy ductor Engineering Group l>eg;ln planning a new using ;IS m;my components as possible from the VAY workst;~tionbased on the NVU CPU chip.' The Moclel 60 clesign, we could reduce the 11ardw:tre development process had three main go;ils: and software engineering effort required to pro- to incre;~se (:PU perform:lnc.c. to maintain ;In duce the new system. The &lorlel 90 s!'stem ~iiocIii[e aggressive time-to-market schedule, a~ltlto pro- provitles ;I direct upgracle from the Moclel GO. The vide ul>gr;itle coliipatibility with the VA>;st;ition only system component or option on tlie Motlel 60 Model 60. that is not supported by the ivlotlel 90 is the entry- The prim;~rygoal of the Vi\Xst;~tion4000 Nloclel level gr;iphics option. 90 design w;is to implement ;I workstation with This p:iper presents tlie tlesign methotlolog!- nre well over twice the CPLl perfor~ii;~nceof its pretle- followctl to meet our project go:ils. It discusses tlic cessor, the Model 60. The :~dvt.ntof high-perfor- four major components in the ,Moclel 90 s!'stenl. It lnance workst:itions based on reduced instruction descrilxs the physical clesign 01-' the system bo;~rcl set computers (RISC) required any new Vi%X work- and the bre;~tlboartlsystem we ilsetl for logic vcrifi- station to provide a significant performance cation and debugging of softw;~re.'['he p;iper ericls increase over ~reviousVAX workstations to be com- with a comparison of perform:itice data for Iligital's petitive in tlie marketplace. Tlie Model 90 met this workst:~tions. goal by achieving 2.7 times tlie performance of the \IhSstation Model 60. Design Methodology The second major goal of the project was to Tlie design methodology used during the Model90 develop and ship the system as quickly as possible. project consistecl of the following :ipproaclies: This was mandated by competitive pressures in the workstation market. We proposed an aggressive Complex logic, software. ancl firmware from best-case schedule wliich forecast a bre;ldbo;ird existing clesigns woultl be used whenever running within three months of tlie project PIT)- possible. posal, prototypes running tlle VMS system within All new logic woultl be implemented using pro- five months, ant1 a customer ship clate within gramm;ihle technolog!! eleven months. The development teams nchicvetl almost every project milestone within a few weeks A brc;itIbo;~rdwould be built :ISearlj. a5 possible. of the proposed schedule. Logic woultl he simillated onl! if it could not be The final major goal of the project was to design verified n~itlithe breadbo;~rcl. the system such that it could be offerecl as n simple module upgrade to tlie VAXst:ltion Model 60. There These :tpl>ro;lches were influencecl ant1 \lial>ecl were two main reasons for designing the system ;IS by our ;~ggressivesclicclule, by the en1ergcnce an upgrade. First, it protectetl tlie customer's invest- of new programmable technologies, and by tlic The VAXstation 4000 Model 90

availability of certain VAX system tlesig~isthat attached to a \!AX 4000 Motlel 500.l Unlike a con- includecl some of the subsystems that we ]bl;unnetl to ventional prototype, the breadboard logic was use. T'hese infli~encesare tliscussed in this section. expected to change; therefore we includetl recon- 'I'he strategy of LIS~I~~existing 1i;lrdware ant1 soft- figi~rableconnections to the FPGAs. Also, the bread- WLII-ecomponents stemmed from our goal to board system tlitl not need to meet any of the deliver the Motlel 90 as quickly as possible. The physical constraints, such as size and I;lyout, that project sched~~letlicl not al low time for the tlevelop- ;ire normally requirecl of a conventional prototype. ment of any major new pieces of liartlw;~reor soft- An early bre;~tlboardsystem provided the clear ben- \v;lre. Consetli~ently.we usetl as much li;~rdware efits of rapid testing ant1 change of hartlware and ;mtl aoftware from other VtL\; proclucts ;IS possible. software. Our aggressive schedule also prompted us to Because we could change logic quickly ant1 easily explore different technologies and verification on the hre;rdboard, the role of simulation on this methods. On our previous projects, we used con- project focused on verifying module interconnect, ventional gate ;irr;~yor st;~ntlartlcell technology, and not on exh;~ustivelogic verification. We main- :inti typically we strove for exh;~~~stivelogic sjrnula- tained a working system-simulation model, with a tion ;lncl timing verification prior to releasi~igchip basic set of regression tests, ;IS a reference for logic tlesigns. Our go;~lwas to hwe li~llyfunctional first- changes ant1 as a tool for debugging. Logic verifica- pass silicon. Ilnfortunately, the first pass of a gate tion was performed on the breaclboard to an extent array was rarely fi~llyfi~nction;il. This appro;~chhad not possilble using sirn11l;ition. two consequences on the project schedule: (1) First-p:us harclw;~rcwas usually tlelayed as much as Major System Components possible to allow for more thorough logic simula- Figure 1 is a block diagram showing the primary tion ant1 timing verification, and (2) a second pass components in the Model 90. In this paper we foci~s was needed it' first-pass silicon was not fi~llyfunc- on four distinct components in the system: the tionxl, ;~dtlingseveral months to the overall project core, tlie niemory subsystem, the I/O subsystem, schetlule. and the graphics subsystem. At the time ol our design, several new pro- The core chip set is composed of a 74.4-mega- gr;lrnmable silicon technologies were emerging hertz (MHz) NVAX CPLI, the RWXX memory con- that promisetl performance, tlensities, ant1 package troller (NMC), ant1 the NVAX I/O adapter (NCA). The sizes comp;~rableto gate ;lrray technology As tlie NVba CP'LI also corltrols ;I 256-kilobyte (KB) write- logical tlesign of the system progressecl through its b;~cksecondary cache that reduces memory reatl e;~rlystages, we ev;il~~atedthese new technologies latency and decreases memory write traffic. :mtl found that they had tnntured enough to be used The memory subsystem supports a 64-bit data in the design of the model 90. We chose Xilinx field path to main memory th;~tis cornposetl totally of progr;immable gate arrays (FI'<;As) ant1 A~bll) MACH single in-line tnemory rnoclules (SIMMs). Main mem- 1'Al.s to implement large-scale integration, ant1 stan- ory sizes of up to 125 meg;~b).tes(Mu) ;ire supported tl:lrtl PALS to implement sm;iller logic functiotis. by the Model 90. These progr;~mm;tbletechnologies allowecl us to The I/O subsystem coniprises two independent build first-pass h;~rclwarewith the full expectation 32-bit buses that communicate with the various I/o that we woultl need to make inevitable changes in ant1 graphics options of the Model 00. One bus response to logic bugs ;lntl timing problems. interfaces to the optional TrJRBOchannel adapter. Fortunately, with the new technologies, bug fixes the firmware rc;itl-only memory (ROM) chips used were made in ;I miltter of nlinutes or clays. inste;ltl of for console ant1 diagnostics, and the various graph- the weeks or months it woultl 11;lve taken using con- ics options available with tlie Model 90.5 The other ventional gate arrays. bus interfaces to the Ethernet and EDAI. controllers. 1)uring tlie Motlel 90 project, we examined our l'he EDAI. is ;I general-purpose 16-bit I/O bus. The previous notions about the roles of prototyping EIIAL controller consists of a CDAI.-to-El)i\l. chip :~nclsimulation in product tlevelopment. Because (<:MC)ant1 a small computer system interface the core of tlie model 90 w;~sborrowed from the (S<:SI) quadwortl first-in first-out (FIFO) chip, VAX 4000 Model 500, an opportunity arose for us to known as SWF. Tliese two chips communicate built1 a breatlboard system consisting of the pro- over the EIIAL bus with tlie system's remaining 110 grammable l/O ;~ncl graphics interface designs devices.

Digital TechrticrrlJorrrnal lt~l.4 iVo. .j SLIIIIIII~,J.199.2 83 WAX-microprocessorITAX Systems

I zfiy NVAYCPU I

CONTROLLER

MEMORY SlMMS

SPXG OR EDAL SPXGT BUS 2'SOUND CHIP >I LED REGISTERS I

Figu 1-c. 1 Block Di~tgianzofthe VAXst~~tioil4000 Model 90

Finally, tlie graphics subsystem provides support was completed and stable. The Nva CPU is ;t fully for three different firaphicsoptions. These options custom complementary metal-oxide semiconduc- include one low-cost graphics option and two high- tor (CMOS) (:I'll fabric;~tedin Digital's 0.75-micro- perforn~i~ncethree-diniensionaI ;~ccelerators. meter (:MOS-4 process. The NCA and N>I<: are ;~lso The nlajority of the components used in the ft~llycustom (:MOS chips, but ;Ire fabricated i~sing Motlel 90 liad been used in previous VrUi systems. Digital's 1.0-micrometer C,\iOS-3 process. The three Table 1 lists the major Motlel 90 components anti custom cllips communicate with each other over a indicates tlie source of these components. 64-bit bitlirectional bus n;lnicd the NDAL. The NVtiX (:PI1 contains a 2KI3 virti~alinstruction VAXstation 4000 Model 90 Core cache, ;In 8KR write-through instruction/data pri- The NVAX (:I'll, the nklc, tlie N<:h, and the backup mary c;~clie,and, on the Model 90, interfaces to caclie compose tlie core of tlie system module. This a 256~13write-back instruction/data second;~ry core ;lrchitect~lrewas talien directl). from the v,\>c cache. It contains an on-chip floating-point 4000 Moclel 500 system. This irrcliitecture w;~scho- unit ancl branch prediction logic. The N\IIU( <:l'U sen bec;~useit would meet our performance goals; pipelines instruction execution at the macroin- it provicled siml>le interk~ccsto 0111- memory, I/O, struction level as well as the traditional micro- ant1 gr;~pliicssi~bsystenis; :rntl because the tlesign instruction le\~el.

1'01. .I .Vo, .3 .S~IIIIIII~.I.1992 Digilal Technical Jorrrrrrtl Table 1 Model 90 Component Source NM<;provides most of the memory control signals, and only simple bank selection logic is requiretl Source external Iy. Core chip set VAX 4000 Model 500 The Motlel 90 memory subsystem impletnents Ethernet controller VAX 4000 Model 500 two sets of memory using the same 36-bit witle Memory SlMMs VAXstation 4000 Model 60 SIM>ls that are used in the Model 60 system. Four SPXg and SPXgt VAXstation 4000 Model 60 SIMMs are reqiliretl for each set. By using either the graphics options /~MHor 16~1%SIMMS, the Motlel 90 allows niemory TURBOchannel option VAXstation 4000 Model 60 configuration sizes of 16&113,32>1B, 64~1i,80AlD, or Synchronous VAXstation 4000 Model 60 128NIB. communications option Due to module space constr;~ints ant1 cost SCSl controller VAXstation 4000 Model 60 concerns, we investigated ;~lternativesto the four Enclosure, power VAXstation 4000 Model 60 <;MX memory d;~tapath chips used on the VAX supply, cables, brackets 4000 Moclel 500 memos), modules. We determined Memory transceivers VAXstation 4000 VLC th;~ttwo low-cost gate ;irrays clesignetl for the LCSPX graphics option Modified VXT 2000 VAXstation Vl.<; coulcl be ilsed instead. These module gate arrays providecl tlie same ant1 EDAL controller New design transceiver filnctions fountl in tlie lI on the Model 90 consists of onl!. interface two loads, the liigll-tlrive c;~l,ability of the (;MX chips was not requiretl. We used a simple PAL to decode tlie bank selection signals from the NMC The NCA provides direct memory access (DIU) and to generate the control sign;~lsrequired for the and programmetl I/O (PIO) support between the gate arrays. 64-bit NDAL bus and two 32-bit bidirection;~lCDAL Because the Model 90 tlesign uses the NMC, we buses nametl (:Pl and CP2. 111 the Moclel90 system, receivecl an atlditional benefit of havin~CI- -. I or COP these buses run at an 80-ns cycle time and interface rection code protection at no addition;ll cost to the to all the graphics and I/O devices in tlie system. system. The N>l<; implements :I single-bit error cor- The NCA also contains the VAX stantlard interval rect, double-bit error detect code (SE<:/I)Idevices resitled. The Motlel 60 also si~pported;I Memory Subsystem T[lRHOchannel adapter th;~tconnectetl to a 32-bit Tl~ememory sul>systeniof the 1Moclel90is ba.sed on <:I)AL bus. tlie tlesign of the VAX 4000 Model 500 system. In the The main task of tlie Model 90 I/<) subsys- memory subsystem, the N.\l<: h;~ndlesall NDAL ten1 design was to provide ;III interface between niemory references by transferring them over the the two 9-bit <:I1 buses provitled by the N<;A ant1 64-bit NMI. The NM<: supports tl;ita transfer rates up the 16-bit EI)Al. bus and tlie 32-bit TIJRl30ch;innel to 58.5MB per second over the NiLII when used with adapter option offered on the Model 60. Tlie design a 74.4-MHZWAX <:PU. Memory is configi~redin sets; work necessary included a small PAL design for the each set contains two banks of interleaved 64-bit TrJmOchannel interface on the cP2 bus and tlie wide memory. External multiplexers and trans- design of two progratnni;~blegate amlys for the ceivers are requirecl to perform interleaving. The interface between the 32-bit (:PI bus and tlie 16-bit

Digilal Techrrical Journal Val. 4 /\h 3 S~~irrner.192 NVAX-microprocessorVAX Systems

EDAL. The following list descril>es the Model 90 Synchrono~ls Communic;~tions Option-The I/O tlevices and options :~nclexplains why e:~ch Model 90 supports the s:tme multiple protocol- mias chosen. co~ilrnunic;itionsoption th;~tis offered by the &iotlel 60. 'l'his interfilce \vas implenientecl on Etliernet-'T'he hlotlel 90 Ethernet interblce is the EDAT. bus ancl allows iise of synchronous impleo~ented with the second-gelieratio~i wicle- re;^ network co~iimunic;~tionthrough pro- Ethernet controller (S<;E<:), wliich provides ;un tocols such ;IS high-level clat;r link control (tll)l.(:) Ethernet connection through a TliiliWire or a~itlsynchronous data link control (SDLC:). thick-wire cable, selectable by a switch on the rear of the system box. Tlie SGEC, which con- Miscellaneous EDAL Devices-The other clevices nects to tlie CP1 bus and was used on the VAX anel registers on the Model 90 16-bit El>hI.are a 4000 ~MotleI500 system, facilitates scatter/gatIier 16-bit systeni configur:ttion registel; an 8-bit mapping and dual internal FIFO buffering. We light-emitting diode register-, an Ethernet itlenti- cliose the Vr\X 4000 hloclel 500 tlesigl~to imple- fic;~tionlt(>kl, and a wi~tchchip. All of' these ment ;ln Etlierr~etcontroller bec;~useit rcqiiirecl devices ;ilso existed on the kloclel 60 El)r\~. bus no new logic design. ;~ntlwere ;lccessetl in ;I similar manner.

Small Computer System 1nterf;lce-The S(:SI bus interface is impleme~itetlusing tlie NCR 5.3<;94 CEAC and SQ WF Chi'Desigrzs SCSl controller chip that was used on the Model One of the major pieces of tlesign work requirecl h)r 60. The NCK 53C94 device connects to the El>A1. the Model 90 1/0 subsystem was to interface tlie bus ;lnd performs DMA operations to and from 32-bit <:I'I bus to all the I/() clevices that resirlc on main menlor!. in concert with tlie two pro- the 16-bit EDAL bus. This interhce was partitioned glamn~;~blegate arrays known as the CEAC and into two tightly couplecl designs called thr (:FA(: SQWE OMt\ virtual-to-physic;~I;tcldress tr;insl;t and SQWF. 'The CEAC chip is primarily responsible ti011 is performed by the SQWF chip b;lsecl on for I~ancllingcontrol of I/() I-egisterread and write 8,192 mapping registers i~iiple~iientedin exter- requests from the NCA to the v;trious clevices on the nal static Rt\&ls. EDAL. The SQWlZ chip h;~ncllcs1)4W transfers ancl buffering of data between the S<:SI controller chip Seri:il Lines-The D<:708S q~iadilniversal asy1.1- and the NCA. cbronous receiver/transniitter (UAKT) chip was Tlie (:EM: chip, which was first implemented in ;r chosen to provide the Motlel90 with four serial Xili~x3090 PPGA and later convertecl to a conven- lines for the keyboard, mouse, modem, and tional gate array, is a 3,400-gi~tedesign and uses I I9 printer/console ports. The 1)<:7085provides a I/O pins of n 160-pin plz~sticquad flat package 64-entry FIFO queue that is sli;~redby all h)ur (PQFP). It performs tlie <:PI bus arbitration receive lines and is implemented in a small exter- between the SQWF for s<:sl DkIA. tlie sc;Ec: for nal S%\Zl\hl. Ether~letDMr\ traffic, ;tncl the N<:A for I/O register Soi~ncl-The Model 90 sountl hinctionnlity is access. The (:WC respontls to N(:A I/O accesses th;lt impleniented using the Akll) 79CSO souncl chip are clirectecl ;it internal (:Ii/\(;/S()WFregisters anel just as it wits in the Model 60. 'P'lie programmetl EDAL tlcvice registers. Its sl;~vescclilcncer controls I/O interface to this device allows both record read, write, ;~ntlchip-select sign:~ls that control and playback functions thro~~gha jack on the EDAL tlevices. The CI:.\C has <:PI and EDAL ~nulti- front panel, and provicles voice-quality sountl. plexing logic wliich selects between addresses ancl data ant1 is controlled by the sl;~vesequencer. I'he TURllOchanneJ-The Moclel 00 provides a single CEAC cliip contains an interrupt controller which slot into which any TIIRB0ch;rnnel option that is consists of interrupt request :~ntlmask registers, supported by tlie vl\.IS operating systeni may be priority clecocling logic. ancl interrupt vector gener- installetl. On the Blodel 60, tlie TURBOcI1;lnnel ation logic. The CWC also hi~s;I ni:tster secluencer adapter \v;~stiesigned to interk~ceto a Cl)~il.th;~t that sul>portstthe SQWF cluring transfers of l).\lA was not ;I coniplete iml?lernentation of the gcn- data on the (:PI bus. eral-purpose CDAL bus. For the new design, :i Tlie SQWl; cliip, which w:~sfirst implemcntecl in small amount of interface logic was necessary to a Xilinx 4005 FIIW and 1;ller converted to a con- atlapt the TllltBOchannel option to tlie CP2 bus. ventional gate array, is a 3,900-gate tlesign ancl uses 110 1/0 pins of a 160-pin I'oFI'. The SQW responcls and software clevelopment cost, we niet witli a to requests from the N<:R 53(:94 SCSI controller number of graphics 11;lrdware and software engi- chip to tlo I>hM transferh. 1)uring SCSl IIMA. neers. We founcl that a new X terminal, the \'ST the S(ZW/i7 chip helps to optimize iitilization of tlie 2000, was being develol>ecl with graphics based on CP l/NI)AI./N>iI buses by bnffering up to eight bytes a cost-retliicetl version of the SPX graphics motlule of data in either clirectio~l.The SQWF performs byte originally ilsed in the Vtocst;~tion3100 system. This swapping to map the NcR 53<:94's 16-bit transfers module was close to the Model 60 LCC; in both cost to ;irbitr;iry main memory byte bounclaries. Tlie and 1xrform;lnce. 111 aclclitjon, it was designeel to SQV;/F contl~insa 22-bit main iiiemor)r acldress byte interface clirectly to a CDAL bus and was software counter :inel ;I tlirection bit which are accessible as compatjble with the VAXstation 3100 SPX. As a registers in I/() space. The S(Z\Y/F chip also performs result, the module could interface directly to our DNIA virtual-to-physical acltlress translation by refer- CP2 bus with a minimal number of changes to its encing an 8.192-page atldress map store basetl in supporting software. external SJb\>l. To use the v>;T 2000 St'X module in the Model 90, we needed to lay out the n~oduleagain to fit the Graphics Subsystem physical constraints of our system. This new niotl- One of tlie keys to proclucing a workstation ;~rountl ule was named LCSPX (low-cost SPX). No logic tlesign work was requiretl on Lhe 1.CSPS or on the the VA~-'to00 Model 500 core was the ability to inte- grate gr;~pliicssupport into the system successfi~lly system module to support it. A connector 011 the In atlclition, m;iint;lining the high level of gr;iphics CP2 bus provides the interfi~ceto the 1,CSPX module. perforrn;unce h)uncl in tlie Motlel 60 was viewetl Altlioiigh the perforrn;ince of the vXT 2000 sl'X as an important goal. The Moclel 60 offered three motlule was close to that of the LCG on the Motlel very goocl gr:~pliicsoptions. Tlie Model 60 low-cost 60, we \vanted to extract as much performance oiit graphics (I.<:<;) option features an inexpensive of the .I.<:SL)Xn~odule as posbible. To improve the frame buffer t.rmotluleand two-tlimensional gr;ipIiics perfor1n;lnce of the LCSI'X, we jncreased the clock acceleration logic contai~leclwithin a large gate speed of tlie module. A speed analysis of the mod- ule was performed to determine how much margin array 011 the system module. The other Model 60 graphics options, SPXg and SPXgt, are three-dimen- existed in the design. Tlie original VXT 2000 SllX sional graphics accelerators. Tlie SPSg is an 8pl;ule module ran at 20 MHz, and we deternlined that by upgrading a number of components, tlie L~SI~X option, ;ind the SPXgt is ;I 24-plane option. The three-din~e~~sio~ialgrapliics options simply repl;~ce could run at 25 MHz. As a result of this 25 percent the I.<:<; fr;~mebuffer in the Model 60. We realizetl increase in speed, the performance of tlie I.<;sI% that the Moclel 90 system li;~dto support a high-per- module exceeds the performance of the Model 60 formance, entry-level, two-climensional graphics LCG for ;llmost all operations. option ancl tlie three-tlin~ension;ilSPXg and Sl'Xgt options. Tlie first major task in the design of the SPXg nnct SPXgt Gmophics Options Model 90 was defini~~gthe entry-level graphics On the Model 60, the SI'Sg and SPXgt graphics option. options plug into the LC:(; frame buffer port, ant1 a subset of the LCG control logic provides access to LCSPX Grc~phicsOption these options. To support SIJXg and SPXgt on the From thc sr:irt of the Motlel 90 project, we Iknew I\ilotlel')O, ;I port that emulated tlie L.CG frame buffer we coulcl not support I.<:<;.'The 1,CG control logic port was required. The Motlel 60 supports both a on the Motlel 60 was enibetlded within a very large PI0 and a 1)MA interface to the SPXg and Sk'Xgt, but gate array that also served as ;I memory and I/O the Model 90 supports only ;I 1'10 interface. controller. This part was ~iotcompatible witli oils Wc consiclered a DM)\ interk~cefor the Model 90, core :~rchitecti~re.Reclrsigning the Moclel 60 I.(:(; but disc;irdccl the idea for several reasons. A logic to fit our system woultl 1i;ive been a major interface similar to the Model 60, which supports design t;isk requiring a midsize gate arr;ij: This virtilal I>MA, requires more logic than would fit in was nrell beyond our engineering schedule and the programmable technologies we were consider- resources. ing for the Motlel 90. A simpler DIM interface To finel a graphics option tli;~twould provitle woulcl not I~avebeen compatible with VMS graphics the tlesirecl perh)rm;~nceancl have a low 11ardw:ire systeni software and woultl have required a large NVAX-microprocessorVAX Systems

number of ch;unges to the software. Fin;~lly, it breaclboartl :~llowedquicker 21ntl more accurate ;lppeared that processing on the SPXg ancl SI'Xgt 11ardw;ire verific;ition tlian logic simulation. In modules, ancl not tl;~tabantlwidtli, was tlle limiting addition, the bre;~tlboardallowetl tlebugging of factor in perforn1:lnce in the Moclel 60 system. console ant1 VhIS software earlier th;~n;I conven- Based on this ;un;~lysis.;I liigh-speed 1'10 interhce tional prototype. was built. The bre:~tlboartlsystem was basetl on ;I \iit\:4000 The SPSg/SPSgt interface 011 the Model 00 simply Model 500. Logically, the breadboard simply translates (:I-'2 bus re;~dand write comrn;~ndsinto extended tlie <:IJ1 and CP2 buses of a \'AX 4000 frame buffer port trilnsactions. The interface is Model 500 system to include the complete I/O and pipelined such that it can keep up with the peak graphics subsystems of the Model 90. The Model 90 transfer rate of the (:P2 bus. We implementetl the breaclboartl was ;III eight-layer etch module and majority of Sl%g/Sl'Xgt interface logic using two included ;III the devices on the MocleJ 90 <:PI. CP2, AIMD ,\IL\(:H I'tlLs. One of these large PALS cont;lins and EDt\I. buses. l'he breadboard system used a VM the co~ltrolsequencer ;~ndgenerates all CP2, frame 4000 .Motlel 500 test backplane th;~t;~llou~etl com- buffer port, ancl tl;~t;~p:~tli control sig11;rls.Ihe other plete phjrsical access to both sitles of the \',t\;4000 h,I~(:t1 PAL co~it:~ins;In ;~dclresstlecocler ;inti ;Iddress Model 500 <:I'Il motlule. A socket wit11 pins that data path. A few miscellaneous medium-scale inte- extentlecl 1 incli througli the back of the ~iiodule gration components make up the remaintier of the was used on the NCA chip of the VAX 4000 Motlel interface. Performance analysis of the SI'Xg ant1 500 CPI: module. The hreadboartl, which contained SPXgt modules sh:)ws that performance on the the holes for the NCA, was then attached to the VtLX Motlel 90 is virtu;~llythe same as on the Model 60. 4000 Motlel 500 (:1'1J module by soldering it to the extended socket pins. CP bus clocks were not Physical Design directly routetl to the hreadboartl logic. To control The physical design of the Model 90 system boartl clock skew, ;I phase lock loop (PLL) was i~seclon tlie presentetl many ch;~llenges. Being a module breatlbo;~rtlto regenerate the CI-' bus clocks. With upgrade from the ivloclel 60, the Model 90 usetl ;i this configur;~tion,the breadboartl system was able syste~nboard that liatl many fixed-position obsta- to run at fill1 speetl. cles for placement and routing, such as connectors Once tlie breaclboartl system was ;~ssembled, ;und stand-off post holes. In addition to the seven we were ;tble to execute console com~n:~ndsafter connectors ;ind tlie single switch along tlie b:~ckof a quick tlebugging of the system. At this time, very the unit, seven more connectors scatteretl ;tbout little of tlie bre;~dboarcl logic w;a being used the module had to retain their positions. Also. the because the console program was using the VAX h$lodel90had to fit eight SIMMs in the same area that 4000 Model SO0 I/O devices ancl not tlie Model the Model 60 hat1 six SIMMs. Furthermore, pro- 90 devices. The hardware team beg;ln debug- grammable technologies generally proviclc logic ging the bre;~clboarcl logic piece by piece. of less density tlian conventional gate arriiys, rind Debugging was quick because a completely fi~nc- therefore require more module space. To Iiieet tional console i111cl 1/0 system alre:~tly existed. these challenges, we eliminatetl on-board rn;~in Sirnple functions, such as register re;~cls;1nt1 writes, niemor)r (8MB were present on the Motlel 60) ;inti were debuggctl using the consolc ex;~~iiincand reclucetl the size of the seco~1tla1-yc;lche from the deposit comm;~nds.More complex functions, si~ch originally pl;cnned 5121(11 to 256~~. as reading and writing to an SCSl clisl<,were tested The Model 90 system board measures I6 inches by writing test progratns in 1II.C); ,\Iu(:Ko.download- by 10.5 inches. ant1 h;a 8 layers of etch. approxi- ing them into melnor): and executing them i~sing mately 100 surfice-mount and through-llole conl- the console. ponents, 2.3 connectors, 5 oscillators, and over 300 After some of the major pieces of fi~nctionality discrete resistors ;lntl capacitors. All components were verified by the hardware group, members ;ire mounted 011 ;I single sitle. Figure 2 shows the of the V>J$ groilp began to use the breatlbo;~rd. Moclel 90 system motlule. A rnodifietl version of the VMS oper;~tingsystem mias usetl to debug VblS device drivers. Drivers Model 90 Breadboard System for the serial lines. LCSPX, SPXg. SI'Xgt, ;lnd tlie SCSl Our logic verification strateg!. dependetl on bi~iltl- port were debugged. In addition to soft\v;~redebug- ing a breaclboard early in the design cycle. 'l'his ging, this effort provided tlie softw;~reto perform CEAC SQWF NVAX I10 CHlP CHlP ADAPTER (NCA)

MEMORY SlMM LCSPX SPXGISPXGT F Y NVAX CONNECTORS CONNECTOR CONNECTOR

Fi'qu re -3 VAXslation 4000 Model 90 Sjstenz fMr,~lule extended ver.ific;~tion. The liartlware group was As soon as we assembletl the f rst Moclel 90 pro- able to use graphics test packages running under totype system, we realized the benefits of all I)ECwinclows software, disk exercisers, system the work performetl ilsing the breadboarcl system. exercisers, ant1 other tools supported by the V&lS During the first day of debugging, we ran the operating system. This provitled a verification envi- console program ;lntl booted the VMS system ronment we coi~ltlnever achicve with traditional with minimal effort. We also ran UE(:windows sirnillation methocls. software using the I.CSPX and the sl'xg and At tliis point, we were still using the VAX 4000 SPXgt graphics options. This quick debugging Moclel 500 console. The breaclboard was then i~sed allowed ;idtlition;rl prototype systems to be bi~ilt to tlebug the Model 90 console code. We disabled immediately and shipped to v:lrious develop- the system support chip, which controls much of ment and verific;rtion groups throughout the the console support h;~rdwarein the Viu; 4000 co m 1x111 y. Model 500, ant1 began using the Motlel 90 console support hardware. A base console th;ct includetl Performance minimal power-ul-, self-test, basic con~mandsup- The VA>istation 4000 Motlel 90 represents the port, and S<:SI boot support was debugged by the fastest ViU workst;~tionever produced. Its <:I'11 per- Model 90 console team. Once the console was filnc- form;ul~cesurpasses previous VtU worltstations and tional, the V&IS group retiirned and clebugged boot is comparable to Digital's RlS<:-based workstations. support for the Model 90 using the breadboard. By utilizing the NVAS CPU chip, the Model 90 When tliis was finishecl, the softw;rre was com- achieves 2.7 times the performance of the Moclel60 pletely debuggctl and ready to be loatled onto the when measurecl against the SPECmark bench- first Model 90 prototype. mark~.~Table 2 gives the CPLl performance of the

Digital Techtriccrl Jor~n~alVol 4 \'0. 3 .St~~~t~net.1992 RW-microprocessor \'AX Systems

Table 2 CPU Performance Comparison the Model 60 ant1 the Model 90. Table 4 cornpiires Workstation SPECmark* Rating the three-dimensional graphics performance of sev- eral of 11igit;il's workstations using standard three- VAXstation 4000 Model 90 32.7 dimensio11:ll nietrics. 111 ;~tltlition,Table 5 gives VAXstation 4000 Model 60 12.0 three-dimensional perform;cnce using the picture- VAXstation 31 00 Model 76 6.8 level benchmark (PLB) suite. DECstation 5000 Model 240 32.4 Szcrnmnr y Note: *SPECmark is a quantitative measurement of performance, The M'I\X (:t'U chip provicles the high pcrform:lnce determined by running a suite of ten benchmark programs. that makes the VLXstation 4000 Model 90 competi- tive in totlay's market. The design met1ioclolog)l usecl during the project allowecl 11sto develop ;und VAXstation ~\llotlel 90 comp;~redto other Digital ship the Model 90 c1uickl.y ;end to provide a si~iipl~ workst;~tions. upgrade path for existing \iiU(station customers. I.CSI'S is the entry-level, two-dimensional graph- ics option offered on the Moclel 90. The perfor- Acknowledgments mance ol this option is better tha~ithe LC:(; option The ;luthors would like to ;~cknonrledgethe fi)llo\v- offerecl on the Model 60 lor most graphics opera- ing people for their contributions to the \ii-\sst;~tion tions. Table 3 compares the L<:SPX graphics perfor- Model 90 tlevelopment: Gregg Bouchartl, I'ai~l mance to Digital workstations using stanclard Campos, Jon Crowell, Ter~yFurman, Dave Ives, Hill two-dimensio11:11metrics. Lapratle, Jim Luntlberg, Curt hliller. Jan Nortlh, Jim SPSg and SI'Sgt are high-pcrformnnce, three- Keille): Bri;ln liost. Mike Sull iwn, Pat Sulliv;in, Mike dirnensio~lalgraphics accelerators offered on both Warren, and Tom Wenners.

Table 3 Two-dimensional Graphics Performance Comparison Two-dimensional Two-dimensional Area Fill Vectors Workstation (Mpixels per second) (Kvectors per second) - VAXstation 4000 Model 90 LCSPX VAXstation 4000 Model 60 LCG VAXstation 31 00 Model 76 SPX DECstation 5000 Model 240 PXG 13.9 263.0

Table 4 Three-dimensional Graphics Performance Comparison Three-dimensional Three-dimensional Polygons Vectors Workstation (Kpolygons per second) (Kvectors per second) VAXstation 4000 Model 90 SPXgt 33 295 VAXstation 4000 Model 60 SPXgt VAXstation 4000 Model 90 SPXg VAXstation 4000 Model 60 SPXg VAXstation 31 00 Model 76 SPX 6 57 DECstation 5000 Model 240 PXG 52 302 Table 5 PLB Graphics Performance Comparison IGPCmark PLBlit Results*I Printed Circuit System Cylinder Workstation Board Chassis Head Head Shuttle - - - VAXstation 4000 Model 90 SPXgt 13.2 11.8 8.5 8.3 13.5 VAXstation 4000 Model 60 SPXgt 12.3 11.1 8.4 8.5 12.5 DECstation 5000 Model 240 PXG 10.0 11.7 14.9 19.2 18.3

Note: 'GPCmark is a quantitative measurement of performance, determined by dividing a normalizing constant by the elapsed time, in seconds, required to perform the test.

References 3. TURBOchannel HU~~ZLILLY~Sf)eciJic~~tltio?l (Palo Alto, CA: Digital Equipment Corporation, 1. (;. Uhler et al.. "The NVAX and NWX+ High- TRI/ADD Program, 1990). perforni;~nceVtLY Microprocessors," Digit~~l Tech~ziculjoc~rnul, vol. 4, no. 3 (Summer 4. Small Colnputer System InterJuce (SCSI) 1992, this issue): 11-23, (New krk: American N;~tional St;~ndartls Institute, ANSI X3.1.31-1986, 1986). 2. J. Crowell et al., "Design of the VtiX 4000 Model 400, 500, aritl 600 Systems:' Digitul 5. Digital's VAXst6ition 4000 Fa~?-zib~Pet-fbr- Ech~zic~~lJoul-rzal, vol. 4; no. 3 (Summer rnance Sc~rntnc~ry.Version 3.0 mayna nard: 1992, this issue): 60-72. Digital Equipment Corpor;rtion, 1992).

Digitccl Tech~iicnlJournal Vol. 4 iVo. ,i SILIIZII?CI.L992 9 I Brian Porter I

VAX 6000 Error Handling: A PragmticApproach

The KVIS operating syste~w3 CPU-clepc~1de11t support of the b:U 6000~fnmil~~of coni- j)Llters imn~~lenzeiitsa co~~zj~le.~~a~zdso/)lnisticaterl set of er'ror-hondli~~groul iilcs. At the st61llr'tOJN L!MS sessiolz, these ~'outi~zcshelp collslr.rrct the ~zeccss~/~:j~S,zr~)zriilorkto s~lppo~.ttile I/()~ub~]~sterr~ 11s the s1ste111 begins to enzerge. For ~rlrrclnoJ a I<\ISses- sior~,these routines then ILI~do~'~nant uvithin the SKSLOil inlr~ge.P~riodicr~ll]: loher? a/'ousec/,/Ihc]~j)eer illto b~a,duwre~e~qisters looki~rgfi)~. si'lzs oftr.ouble. Opn,ull is titell, arid the routines ret~11.11to I~ibe~~nntion.011 tl~oseoccc~sions zilherz the bald zi8ar.e requires assistance, error kandlir~ytokes co~~ipleteco~ltr~ol ofthe s~stein.It bcls but oue ~~zissioil:iclc~iit~~~ the errol; recoller ifl~ossible,Out at all costs e17slli.e that the itzte~~,rityoftlnc sjisten? re117ain.sintuct at~dthat G~LL~LI~S~JI'CJS~TL~~~.

Error hantlli~igis the set of routines that resitles ~ilentof error handling to sup13ortthe CPII ~~ioclules in the CPrr-depencient loaclable irn:~ge known :is and memory controllers that ~nalieup the system SYSl-OA. Each processor motlel that supports thc kernel in the va6000 series. This paper explains VAS system architecture ;lncl VkIS operilting system our en-or-hantlling strategy to not only reduce tlie h:~sits own SYSI.0.A image. Error Ii;uitlling is imple- amount of unique cotling, but also provitle an mented with other common routines like console opportunity to enhance, mature, ancl improve exist- support ant1 secondary processor start-up. Error ing VAX 6000 products. 1l;lntlling is unique for cnch processor motlel. Jndividual processor moclels bring u~iththem ;I Deuelopnzent oJError-handling wealth of error detectors ;ind consistency checkel-s. Routinesfor the VXX GOO0 Platform Each clevice has to be inclependently interrogateel The VAX 6000 platforni provided a unique opportu- ;lntl reset once triggered. nity to tlevelop error-handling routines. As sliown Error hantlling of one form or another resides in Fig~~re1, the s.\.IIb;~ckbo~ie of the system ;~llonis throi~ghoutthe VMS operati~igsystem. In some con- the creation of increasingly powerful systems that texts, trying to eelit a file in a directory structure retain much of their operating chariicteristics. that does not exist can be consitleretl ;In error. This Iticreases in processor capability are g;~inedby jxiper tliscusses only errors that t1c;il with the nierelj, exclial~gingprocessor modules for more ilnclerlying <:I'll :ind memory hardwiire on which powerful models. We decidetl tIi:it error h;incl ling tlie \fMS system is running. It describes tlie tlevelop- shoultl not be any clifferent. On prior systems,

XMl BUS ---. ---. I I I CPU 1 CPU 3 MEMORY 1 MEMORY N 110 1 I I 110N I I I .------I .------1 ElMEMORY 2

P'ig11r.e 1 $j.stclr~Block Ilicl~~~'nrr~ VAX 6000 Error Hundlitzg: A Prugmatic Aflroach

;I complete set of error-1i;lndling routines for e;icIi Objectives c:~rimodel h;ltl to be iriipleniented. We adopted ;in Error handling must identify the er1-or ancl I-ecover ;~pp~-oachto error hantl ling that coiultl be carriecl if possible. Above all, it must guarantee tlie h)rw;~rdfrom one processor to the next with little integrity of the system and the preservation of data. or no change to the initial error-h;~ncllingmoclel. An important project go;~lw;~s to procluce a This approach hanclles identical errors in the same robust :111tl q~~alityprodiuct t1i;lt WOLII~ h;lve pre- way with the same cotle Ix~se. clictable 1,rrforrn:ince. We chose to have ;I single The protocol of the Xkll bus was modified to error-li;~ndlingmotlel that could be implemented allow support of write-b;~cl<:). In ;~lmost;~ll instances, systems can- error contlitions. The other two fir~ictionsare logi- not recover fro111 li;~rcI errors. They inclicate that cally split between single-bit error correction cocle data or machine state h;~sbeen lost. H;lrcl errors are (EC<:) failures and tloi~ble-bitE<:(: fi~ilures. normally f;~t;ul.H;lrd errors are deliveretl through Common error-li;~nclling interfaces ancl routines S<:l3 vector 60 (liexadecim:~l); interrupt priority were established for the VtiX 6000 family of proces- level (IPJ.) js raisetl to 29 tlecitnal. sors. l'he use of common files ancl interfaces ensures that errors ;Ire handlecl in exactly tlie same SoJ Error Illterruf~tsSoft errors. on the other way h)r e;lch CPri motlel. hancl, generi~llysignal an asynchronous condition, with respect to the PC, tlii~thas been correctetl by Full Support of the Symmetric hardware, or that can be overcome with some soft- Multiprocessing Paradigm w:lre intervention. Soft errors are normally always benign to system operation. Soft errors are deliv- The VAS 6000 family of CPrJs :Ire symmetric mill ti- ered through S(:R vector 54 (hexatlecimal); IPL is processing (SMI') systems. The error-h;~utlling raised to 26 clecimal. model assumes that more than one CIJIJ is always active. The synchronization of error hanclling ibjucl7ine Check E-1-ceptior~sMachine check excep- throughout the system has numerous benefits. If an tions are internal processor conditions that are syn- error contlition were detected throughout the chronous to the PC. If the condition can be system, it woultl be a very complicated procedure corrected when the instruction th;~tcausecl the to ensure that all <:PIIS reacteel consistently. Such exception is reexecutetl, the result is the same 21s if errors would clutter the error log with reports the condition had not occurred. M;~ny of the from every CrU ant1 SMIdevice. ~ii;lcliinecheck exceptions that are reported by the vi\X GOO0 h~milyof processors allow recovery so Error Logging Synchronization tli;~t normill operation can continue. Machine In the vrD( 6000 scheme, en-or logging is syn- check exceptions are cleli\rered through SCB vector chronized across tlie system. If an error affects 4;IPL is raiwtl to 31 clecini;il. all nodea, this information is includecl with the NVAX-microprocessorVAX Systems

first CI'II to respond to tlie error. Machine state ant1 tlata. These messages are time-st;~mpetland is createcl that informs other (:I'U nodes that the sent to the systelii console device Ol-'r\O:. event h:is been loggrcl on their behalf. As each CI'u Messages can be output in two tliflerent modes. node responds to the error cotitlition, it can inter- Inters~~ptdriven tiiode is the most common ruid rogate this state. In the event that all error con- uses the stanc1;ird terminal driver h~nctionsof the clitions 1i:ive been logged on behalf of ;I (:PU, the running V>lS session. messages that use this mode error contlition is cleared and the interrupt or describe the tlisabling of some part of the CPlr ker- exception is dismissetl. The one entry in the error nel at system start-up or during tlie curre~~tsession. log for these types of errors clearly indicates that The other mode of output is sy~~chronousand is in other nodes were xctive. Information about the line with error processing. This mode is reserved nodes affected ancl state indicating how tlie node for hardware errors that are nonrecoverable ;~ncl was affectecl is recorded in tlie single error log result j~ia system crash. The message is output just entry. prior to calling the R(J(;(:HE(:K mechanism tIi:~t woultl terminate the current V&1S session abnor- CPU Configuration Data in the ErrporLog mally. Messages are always descriptive of the error A CPU running with some of its 1iardw;ire tlisabletl or exception conclition ant1 contain a11 the n~achirie ~iiayhave operating characteristics that ciiilse other state available at tlie time of tlie error. CP'IJs to incur error conditions of some type. An Formatted messages allow for errors that occur error log entry from a \'AX 6000 CPIJ always as tlie system is being initialized to be reported and includes the configul-ation of other active CHIS tlescribed shoi~ltlthe system fail to boot. The out- on the system. For example, if the CPll at node 6 put of messages is fully synchronizetl between the is running with its backup cache disabled, other primary and secondary (:IJrrs of SMi' systems. Tlie (:PUS inclutle this information with their error log primary CPU outputs 1ness;lges bout errors occur- tlata. Thus, potential error contlitions can be easily ring on secontlary 1,rocessors. identified. Error Rate Checking and Loop Error Log Filtering Detection Some errors that occur at too high a rate are filtered Tlie VS6000 family of CPr Is provides ;I great deal of from the error log. Errors that are delivered by the error detection. Tbe error conditions signaletl in soft error vector are invariably benign to system many cases are benign to the system if the appropri- operation. It is important that they be reportetl ate action is taken. However. blind recovery from because they can indicate an impending fatal error errors can be ;I downfall in itself. It is not uncom- in some subsystem. Howevel; if these errors are mon for so 1ii;11iybenign 1-'ailuresto occur that error occurring too often, only a subset is sent to the handling is the only task being performecl by tlie error log. Tlie algorithm is basetl on an error count system. Error Ii;~ndlingon the VA~6000 fa~iiilj~ over time. If an error is occurring too rapidly, implements a systern of rate checking ant1 loop logging of tlie errors is inhibited. At a later time, log cletection to combat this problem. ging is reenabled. Errors that do not appear in tlie error log are still counted. and tlie accumulated Rate and Loop Detection Tinze Gase totals are tlisplayed by other error conditions that The timing stancl;~rdused by the rate checking and are sent to tlie error log. loop detection subsystenis is the CPl1 TODR register, The TOl>R is independent of soft- Message Facility ware ;u~dincrements every 10 millisecontls. Error l~atitllingon the VAX 6000 has the ~~niclueabil- ity to oi~tpi~tforni;~tted messages. Integral to the Rate Checking of Enors error-handling subsystem is :I message processing Each error condition has an associ:~tetlrate check Pdcility that is composed of specialized routines database. The database tr;icl

\'of.4 No. .j SIIIIIIII~~I,1992 Digilnl Trcl~rricnlJorrr~cfil VAX GOO0 Ewor Handlin~:A Pra~maticAp/)roach

<:I'll can be turned off. When sufficient data about Setup ant1 synchronizatio~i an error has been collected, the error may be clis- Saving of state abled for a period of time. H;~rclwarefeatures such as internal c;lcIie errors can illso be disahlecl. If Parsing of tlat;~ c;~cheerrors occur that are recoverable, but are Processing nnd accounting of state occurring too hst, the cache is disabled. The occur- Error logging rence of multiple errors call indicate a broken structure, whereas a single error can indicate a sin- Error reset and dismissal gle transient event. This logical organization provides flexibility to the implementation being atltlressecl at the time. Loop Detection The parsing of clata step was ;~tlded;it tlie same Multiple errors of different types can also occur frc- cime that support of the VAX 6000 Moclel 400 was quently In this situation, the system is operational, irnplernentetl. but it is continuo~~slyat liigli It'L, servicing error interrupts or exceptions. This operating scenario is Setup and Synchroniz~itio~z detected by the frequency of transitions in and out Synchronization is accomplished by acquiring tlie of error handling. When error-handling code MCHECK spinlock. The use of spinlocks is a VMS threads are enterecl and exited. the TODR value is technique that provides atomic access to cocle saved. During execution of error handling, the threads and tl:~tastructures ;1nd ensilres that only enter TODR value is comparetl to the last exit TODR one CPLJ at a time is in error h:~nclling.7'1lus it is pos- v;ili~e.If the result is too close, a count is incre- sible to compress an error condition occi~rring mentecl. If the close relationship of exit to entry throughout the system into a single error log entry continires to occur, a loop condition is cleclaretl and by the first CI'II to service the error. appropriate action is taken. Most often this means Following synchronization. the S,\.11' sanity and the system is shut clown. spinlock acquisition timers are disabled. If ;in error occurs at the boundary of one of tliese timers, a Error-handling Model false termination of the session call occur clue to The tratlitional approach to error handling it1 tlie the time consi~rnedby the execution of error hnn- VMS operating system Iias been to interrogate regis- tlling. The SMI) sanity ant1 spinwait timers ;ire mecli- ters and act on the data tlirectly in real time. anisms usetl in VMS to ensure that <:Plls active in a Another appro;ich has been to save only a subset of multiprocessor system are interactive with each processor state that has a linkage to the error deliv- other and the synchronization primitives that con- ery vector and then act on this data during a later trol access to various resources. The sanity timer is parse operation. used as a watchdog timer to ensure tli;~tCPlIs When designing the error-handling model for the respond to hardware clock interrupts on ;I regular VAX GOO0 series, we decided to save all CPU state basis. Each CPU active on the system monitors tl1:lt is visible to macro programmers in buffers spe- ;mother CPli for its responx to h;lrdwr~reclock cific to each <:1'11. All interrogations are then made interrupts. The spinwait timer guarantees that one on the data in this buffer. Information on the hard- CPU does not retain ownership of ;I spinlock w;ire state is savetl ;IS well ;IS the current system resource for more than an ;~llottecltime period. time. Any action taken by error handling is also Error handling is always executed at ;In II'L ;tbove recorded in the buffer. This ;lpproach has several which hardware clock interrupts c;tn be serviced. advantages. First, a distinct footprint of tlie last As a result, it defeats the sanity timer meclianism. error is contained in the system image in memory Some of the actions taken by error h:~ndlingcan Should the system fail, the data is savetl when a cause a spinwait timer to expire if the error being crash clump is taken. Secontl. the many execution serviced occurs too close to the timer boundnr): Hy possibilities are madc easier to test and ver- disabling tliese SMP timers, ;I time periotl is startetl ify Fin;~lly,contlitions are easier to diagnose if the over when the error being serviced is clisrnissed origi11;ildata that error liantlling processetl :ind the ;~ntltimers ;Ire reenabled. actions that were taken are recorded in an error log. The buffer associated with the cpu experiencing The error-handling process for the VAX 6000 the error is initialized to zero and is ready to receive series consists of six distinct steps: the latest error state. If tlie error is a machine check,

Digital TecbnfcalJournal Wd.4 iVo. .3 SLIIUIIICI.1992 NVAX-microprocessorjiAX Systems stack space is allocated ;rncl initialized to allow for register read indications can help in the diagnosis of error dismissal and error-hantlling exit. the original error contlition. When mnchine checks or soft error interrupts The recovery blocks used when error state is col- are servicetl, tlie cache subsystem is uncondition- lected have special flags to intlicate that they were ;~llydisabletl. Error hand ling does this to preserve established by error h:.~ntlling.If an error does any error state tliat m;~ybc in the cache. If the error occul; control is retilrnecl to the appropriitte point 1i;tppens to be cache-related, the state can be in the error-hand l ing execution threatl. The origi- extracted at the appropriate time. <:ache-related nal saved state of the first error is not overwritten. errors are not reportetl by hard error interrupts on some VAX 6000 models. When hard error interrupts Parsing of Data itre senricecl on these processor nioclels, the cache The parsing of clat;~step was ;~cldedto support the is not disabletl by software. V/\X 6000 Model 400 ;lnd later models. The tlata col- 'T'lie 111:lchineclieck flow has an atltlitiolial checl< lectetl in the saving of state routines is pilrsed as to determine if tlie error is associated with a recov- a separate step. When error data is parsetl and pro- ery block. llecovery blocks on the VMS system cessecl in a single step, as in the VLY 6000 Model 200 ~?ro\~idekernel-mode macro progr;lninlers with ant1 \/A)(6000 Model 300, it is difficult to make the the ;~bilit! to protect ;In execution thread from necess;lry errors invisible to the error log. When the effects of fatal m;lchine checli exceptions. the error data is p;lrsed and an error mask is pro- Norm;llJy when a macliinc check occurs in kernel clucetl th;~trepresents tlie error conditions present, niotle, the cotle t11re;ld being executetl loses con- it is much easier to tletect if all current conditions trol. Ilnless tlie instruction can be restartetl, the have been serviced. Since it is also easier to detect V&lS session is terminated. The placement of kernel- conditions wit11 only one error present, expectetl mode execution threatls within the context of error conditions c;~nbe processed. The Vku 6000 :I recovery block ensures that a m;lchine clieck Motlel400 and later models have mimy benign error will cause control to be passecl to the end of tlie sy~ltlromesthat Ii;~\/etheir logging filterecl. recovery blocl<, p long with status to indicate tli;~t the machinc clieck has happened. The macro pro- Processing and Accounting gramtiler can select what type of machine check to be protectetl from. In gcneral. this is limited to During the processing and accounting step of error those machine checks catlsetl by references to the handling, the data in tlie CPlr private local buffer is physical atltlress space tliat do not respontl or parsetl and acted Llpon. These routines detect if a retilrn data. specific error contlition is global ant1 sensed If it is determined tli:~t tlie current deliverecl througlio~~ttlie system or loc;~lto the p:lrticular 1n;cchine check is protected by a recover!. block CPU. Global errors include the state from other and tliat error handling establishetl this condition. CPLJs ant1 tlevices in the error log if required. If the error is dismissed irnnletliately without further the global error is the only one present :tnd it is action. More details are given in the following expected, machine state is set to indicate tliat error section. logging should not occur. Shoultl this c:l'rJ be the first to process the global error, it samples cl;~tain registers of the other CP'LJs ;~ntldevices ant1 le;~ves SLL1.1 ilzg of Stzi le state to intlicate the error contlition has been ser- All available <:I'IJ error state is savetl regardless of viced and is expectetl. Consequently, tlie context of error type or delivery mechanism. Machine checks global errors is inclutled into a single entry in the ;ilso save the internal state passed by microcode on error log. XMI parity errors are serviced in this way. the stack. Each register is read into its local storage I.ocal errors reconl only the state from the CPU buffer m~itliinthe context of a recovery blocli. i\ experiencing the error in the error log. v;~litlbit is ;tssoci;~tetlwith e;~clilocal copy register Error handling supports the notion of expectecl cell ;~ccorclingto its status ;IS it exits the recovery errors, or errors t11;lt sometimes occur as a result of block context. ?'his is important because each cell oper;~tionsperforrnetl by error hanclling. These has been initializetl before use. A register value of errors are not reported to the error log. For exam- zero may be significant. and a failed register reatl ple, tluplicate tag prity errors can sometimes \vould allow tlie initialized value to be interpreted occur when the backup c;iche on the VAX 6000 ;IS not having :tny error or state bits set. Failed Motlel 200 and \/AX 6000 Moclel 300 is invalidated.

I/i,l. 4 .\?I. .i SIIIIIIII~~1992 Digital Tecbrricnl Jorrrnal VAX 6000 Crror H~~tzdll~lgA PIZI~IIZ~~IC A/!,~I.OLIC/I

7b cause tliese errors to be invisible to tlie error log, memory subsystem error, a flag to inclic;~teth;rt ;I mask of error bits to ignore is set up m~lienb;lcllS Tlie last step of error hirntlling resets error contli- error log buffer along with the current values of tions tli;rt have been servicetl ;inti tlisrnisses the accounting tl;~t;rfor the <:PIJ. Tlie accounting d;rta is interrupt or exception. Tlie im;rge of the data s;rvecl ;I count of the intlividt~;~lerror conclitions that have is trsed as a mask to reset error contlitions. This occurretl on this <:111 for the current session. An). technique guar;cntees th;rt double error conditions (:pU t1i:rt has some part of its functionality clis;rbled are not lost. inclutles that tl:rta in tlie error log as well. For exani- Registers that recluire initi;~lization ;ire reset ple, a <:pLl that is executing with a tlisabletl cache using the contents that were re;rd when the error may cause errors to occur on other <:lTIs. It is usefill was first serviced. Most error conditions are write- to know that :I <:l'u is running in a degraded mode one-clear. Thac is, to reset the error condition. a when investig;rting problems that are occirrring on mask of tlie error contlitions set h;rs to be written to a system. Tlie error log recortls of all CPUs clearly tlie ;~ppu)l~iateregister to clear the error. Tlie use indicate any <:l't1s operating at reduced cajxrcit) If of the original contents of the register as a ~n;rsk all CPlls are running unimpeclecl, the error log con- gu;rrantees that an error condition occurring clur- tains a flag to intlicate this status. ing the processing of an error cirnnot be lost. lieset The ;rniount of data includetl in tlie error log for of the \'AS 6000 Motlel 200 and VAS 6000 Moclel 300 ;rny given error can be tlifferent. The clata clescrib- error registers inclutles :I later probe of the register ing the <:lW context is the same except in the case for tlie ahsence of error intlications. Slioitld an error of machine checks. These errors also inclutle inter- still be present, the error-liantlling process is nal st;rte passetl by tlie microcotle through the restarted :rnd it treats tlie condition as ;I new error. stack. 1)epcnding on the error condition, context After errors are reset, tlie c:rcIie subs)-stein is inv;~li- from the XiMl bus, the memory subsystem, or an cl;rted and ;I check is matle to tletern~ineil'it slioirltl extern;rl XMI ;rtl;rpter can be inclutlecl. The error be reenabled. I'rocessing of the error coirld deter- data is organized into v;rrious subpackets that are mine tliat the cache or indeecl the Cl'll slioirltl be signaled to he present by ;I fl;rgs fieltl cont;rined taken off-line because of an error r;rte or l'inite in a he:rcler section of the (:PLi context packet. count tli;lr is too high. If ;rll is ulcll, thc c;rclie sub- For ex;r~nple, error can occur th:it describes n system is reenablecl. Tlie WICI-IE<:K spinlock is now krilure of a tr;~nsactionbetween the <:l-'U ancl niern- released zinc1 tlie interrupt or exception is tlismissetl or). If the tlata collected from the rnernory subsys- by executing a retirrn from exception or interrupt tem cluring the processing ;mtl :recounting step (KEI) instri~ction. indic;~tesan error is present, this is included in If the error being dismissetl is ;I m;lchine check, the error log recorcl. If there is no indication of ;I tlie additional storage allocatecl by error hantlling

Digital Tecbriicrrl Jor~vrrcrl 1.01 4 iVo .? .\~II?II?I~I1992 NVLY-microprocessor VAX Systems

ancl error state left by tlie microcotle has to be If ;I condition handler h;rs not been est;rblished, the removed. As shown in Figure 2, tlie ;~clclition;~lstor- VMS process is aborlecl. Kernel-mocle tlire;~tlstliat age is an ;lmy of qu;rtlwords. 'These quadwords rep- expesience f;it;ll machine checks :rlw:+ys res~iltin resent progrglrn counter/processor status longword the termin;~tionof the \fMS session. (P<:/13SI.) pairs that direct control to routines that must be executeti prior to control being returned Support ofthe V?6000 Model 200 and to the exception I)<:. The atltlitional postprocessing VAX GOO0 Model 300 that takes pl;~cefor noncorrectable memory errors The error-Ii;~ntllingsu~port for the VAX 6000 Model and errors that cause a process to be aborted are 200 ;rntl \'..Y 6000 ,\.loclel 300 is identic;il. These two dispatched using this mech;~nism. processor motlels are the same logical <;l3U.'l"l1e VAX &lachine clieck processing takes pl;~ceat 11% 31. 6000 Moclel 300 is ;I selected faster component set F;it;iI menior)zerror recover)r uses the V,\,lS system's of tlic \~LY 6000 &loclel 200. p;lge fault code thre;itls. These t1ire;ids use spin- The VM 0000 illoclrl 200 s)7ste1ii presentccl a locks that ciinnot be ;icquiretl from IPI. 51. Willeil dis- unicliie problem for error lh:~ntlling bec:iuse tlie pri- missing ni;~chinechecks, the P<;/I-'SLpairs are mary cache is internal to the (:I'U chip. Errors fro111 i~iterrogateclto determine if they are nonzero. If the the primary cache clo not c;iuse ;in j~iterrupt or probed (1u:tclwortl is zero, the stack pointer is exception. These errors can never cause a failure or lipdated to unwintl by tlie quatlnlortl ;illocatecl.This wrong result should they occur. Rec;~useall c;iclie continues i111tilall three array elements have been structures on the VAX 6000 Model 200 are write- probed. If the array element is nonzero. the RE1 through, data c;in be both in cache ant1 in memory, passes control to the PC ant1 PSL tlescribetl by that ancl it is always consistent. If a parity error occurs arrxy ele~lient.Synchronization is thus preserved, on either the data or tag section of the primary and spinlock acquisition rules are obeyetl. Even- cache, microcode c;m always fetch another copy of tually the array is traversed, :~ndeach element is the tlatn from Illernor)-. If a prinlary cache tag or removed. State left by the ~nicrocodeis removed, tlat:~error occurs, micrococle sets a status bit to control is piissed b;icli to the origin211exception PC, i~iclicatethe error in an intern~~lprocessor register and instruction retry is attenll~tetl.If error hantlling The internal is private to the determines that tlie execution tlire;icl sliou!d be 1oc:ll (;1'11. Previous (:l'rrs with this type of error sig- aborted, the original exception PC: is rep1;icecl by naling usetl ;I polling techniclue to cletrct these fail- a P<:/PSL pair that returns control to lfMS exception ures. On SM~' systems, only the prinlary c:13r 1 is routines. Fro111 these routines, control is normally interrt~l~tedon ;I regul;tr basis to allow polling row passetl to an appropriate mode condition handler. tines to run.

STACK GROWTH VMS ALLOCATES THREE FROM HIGHER TO 0 ADDITIONAL QUADWORDS. LOWER ADDRESSES t I EACH QUADWORD IS USED UNCORRECTABLE MEMORY AS A PClPSL PAIR FOR EXlT QUADWORD 3 t ERROR EXlT PROCESSING PROCESSING ROUTINES. QUADWORD 2 t ABORT EXlT PROCESSING i QUADWORD 1 C NORMAL EXlT PROCESSING THE STATE IS PASSED +THE STACK IS HERE WHEN BY MICROCODE ON THE MICROCODE STATE 1 VMS GAINS CONTROL STACK AT MACHINE AFTER MACHINE CHECK CHECK EXCEPTION. EXCEPTION.

MICROCODE STATE N

- EXCEPTION PC - EXCEPTION PSL

Fie2 iV1~1chi1leCheck E.~cejltio~zExit P~.oces.sirl~Stack FOI.III~I~ MX 6000 Error Hun~llil-zg:A l'ragmutic Approach

Since we hacl no precedent for rel'erence, we tletection jncorporatecl within each incre;~sedanel tlesignetl ;I system whereby the primr~ryCPrl irseh became more complex. Tlie over;~ll nod el imple- tlie interprocessor interrupt ~iiechanismto inter- mentetl on the \IIUL 6000 Model 200 was main- rupt secontl;~ryprocessors. When the secondary tained, but a step was ;~tltleclbetween the steps of CPU receives the (:I'll-specific interprocessor intel-- saving of state and processing and accounting. Tlie rupt, it reads the :rppropriate intern;~lprocessor new step, parsing of cl;~ta,was previoi~sly:I Imrt of register, places thc tlat;~in a known location, ;ind processing ancl :~ccounting.Error liancl ling sirpport sets ;In intlicator flag. On later poll cycles, the of the on-boartl CPll electric;rlly erasable pro- prini;rry <:I'll sees the indication from the sec- grammable read-only memory (EEI'IK),\.I) w;a also ondary <:l'r 1 2nd interrog;~testhe known location added. The EEPIIOM. was until now used only by <:I'll for ;my error bits. If no errors ;Ire tletected, rach console support. secontlary <:l'lJ is pollecl once every ten secontls. The overall model now became Shoulcl :in error be fou~itl,the secontl;iry (:I'll with the error li;~sits polling frequency increased to Setup ;rntl synclironiz:~tion once every becontl. If ten successive polls indicate Saving of state error contlitions, the secondary <:PC1 is signalecl to disable its primary cache. If this occurs, entries are Parsing of d;rt;l made in tlie error log ant1 to the systeni console by I'rocessing and accounting the primary <:Pl. on beh;ilf of the secondary CPU. During systems integration of the ViD; 6000 Error logging Motlel 200, cert:~inrandom-access memory (&\I) Error reset ant1 dismissal devices i~setlfor the backup c;lcIie exhibited exces- sive parity error Erilures. The problem was so Storage space for n~;~cliinecheck :rntl hart1 error severe that speci;~lerror-handling software and is sharetl in the VAX 6000 Moclel 200 system. :itltlitional <:I>(- hnrtlware functionality were tlevel- However, this support became too complic;~tetlto oped to isol;rte antl cli;rgnose the klilures. This work manage. In the VAX 6000 Motlel 400 ;rnd 1;tter mod- was so successh~lthat tlie 11ardw:ire function;ility els, botli bets of error state are av:ril:~blein cr:rsli was m;~tlea permanent feature of the processor, clumps. ant1 tlie error-Jlandling routines were niatle a per- m:inent part of the SfiI.OA image. The hardware EEPROM Supl,ort fi~octionalityand software routines allowed for the failing dat;~bit in the backuj, cache to be identified The experience gained from systems integrxtion of the VttY 6000 Model 200 showetl that real-time di;~g- at tlie time of the fiiilure. 'T'lie vi\X 6000 Motlel 200 nosis by the operating system has many benefits. experience had :I Iiisting effect on error handling VAX across tlie vt\\x 6000 family. The ability to diagnose However, tlie scheme ilsed by the 6000 Model caclie parity errors to the bit level in the operating 200 was cumbersome and recordecl only the result- system remains a characteristic of error handling ing cliagnostic clata to the error log. -1'lie challenge w;~stwoh)ld: to make tlie mech;inics of cache parity on ;ill vt\X 6000 systems. error di;~gnosiseasier, ;lnd to make the cl;rta more wiclely available. We achievecl both goals by using Support ofthe VRX 6OOOrModel400 and the EEPI

Digilnl Ticb~ricalJoro~rral WJ. 4 .Vo. ,J Stirrrmrt' 19'11 NVAX-microprocessorVAX Systems structures to avoicl special 11;trtlware oper:~ting present. It'vectol- errors :ire detected, the z~ppropri- ~notles.A few simple routines en:~bledeasier cliag- ate support routines-soft errol; 11;lrd error, or nosjs of cache parity error E~ilures;~ntl ;I methoel of machine check- re in.iroked. clis;~l?lingthe C:IC~C' th:~t tloes not tlisr~~pterror Error h;indling supports ;I n~asinlurn of foour state. 'l'he error-h;~nclIingS(:l% vectors were pointed vector processors. If the numbcr of errors or the to the EEPROM so the routines that clis;lblc the c:lche rate of errors becomes too great, vector processors could (lo so without ~nakiogcache references. (On are removed froni use. Error l~ancll ing never the v,\s 0000 hlotlel 500. the c;iche clis:~blingrou- renio\res the only 01-last remaining vector proces- tines ;Ire pl;~ccclin tlie high-speetl ltl\;L1.) sor. Support of errors has the Wllcn an error occurs, control is first p;~ssed same characteristics as support for CPl!-relateel to itlclivitlunl routines t1i;lt reside in I/O space. These error conditions. I'ortions of the vector processor routines disuble the cache subsystem and tlicn are disablecl if the ;~ssoci;iteclerror rate hecomes too return control, to tlie S1SLOA image in main memos!: gre;it. If other errors continue, tlie unit is rernovecl The V,\X 6000 Nloclel 500 has ;in error transition froni use. The notions of expected errors and mocle (L':'l'>I), nlhich ;tllo\vs the biickup c;iche to he errors th:it ;ire invisible to the error log also exist. p;wtinll!; dis;iblecl, Ncw blocks ;ire not ;illoc;~ted \4~Ilenin ET41 nlocle. 1)at:i requests are fillet1 from Support of the VAx 6000 Model 600 the c;iclie. Error interrupts or exceptions on the Error checking and detection on the vhx 6000 VAX 6000 Moclcl 500 clisp;~tchto routines that exe- Model 600 are very complex ]~rocesses.There are cute from I/() sp;ice anel place the write-back well over 160 unique soft and hard error conditions I>;icliup c;lche into rI;M ;uid tlis;tble the write- as c;ttegorized by the software. The actual count tI1ro~1g11prim;tr!- c:~clie. tleclarecl by the Iiartlw;~reis much greiiter. The dis- 'I'he EEPROI\.l on both tlie VhS 6000 Motlcl 400 ;ind parity results fiom the way software group5 error \/AS6000 R4otlel i00 is ;~lso~isetl to store fitilurc conclitions. The \5\S 6000 Moclel 600 error li;i~iclling inform;irion. When errors occur, ;I counter tli:~tis followetl the enh;~ncetlmodel iniplemented on tlie ;issoci:itctl wit11 the specific error condition is VU6000 Model 400. The statc s;ivetl for interroga- increnientecl. The number of error conditions is tion by V&Y 6000 Model 600 error lianclling consists finite ;ind h~llyclescribecl by the error mask pro- of 40 internal :ind XMI-atldress:~ble registers. clucecl I,!. tlie parsing of data routines. Writing to Support of tlie VA~6000 Model 600 ;ilso inclucletl the <:l'r' EEI'ltOhI is ti~ile-cos~~ingcomparecl to support of the on-board CPIJ EEI'ROM for the long- writing to ni:iin memory A byte write to EEI-'ROM term storage of failure information. The support t;ikes on the orcler ol 15 n~illiseconcls.7i) ;ivoicl this of the EEPROkl was extenrlecl to include the history overhc;icl, the LEl'l(! is written one of three states: in\~alid,v:tlicl/re;id-only, or Ixick to t11:it (:1'1-'s EEPRO.\.I. In ;itltlition to error vnlid/written. Multiple caches may hold reatl-or11y inh)rm:ttion, a coi11it of sccontls nln in ;I \/)IS envi- data siniult:meously; written data m:ly be held by ronment is t;~lliecl. only one c:ichc in tlie system ;it ;I time. Write privi- lege for a I~lockmust be obtaineel before rnotlifying the tl;~tain tli;~tblocl<. One sct of routines supports tlie \/AX6000 ,Clotlel Certain I~:ickupc;iche error conditions are severe 100 ;inrl \4\S 0000 ~Motlel 500 vector processors. enough to disable the c;~che.The backup cache may ,. . Ihcsc roiltines :ire org;inizecl in a11 identic;il milnner cont;iin written tl;~tathat is un;ivail;ible elsewhere to the (:PU routines ant1 h)llow tlie s:inie steps in tlie system. To access that data, the 1~;lckupcache rcl:itctl to <:1'[[ error conditions. During the process- is put into LTM. a state which allows written tl;~tato ing ant1 ;iccounting of <:PI: error conclitions, ;I check be :~ccesseclby the cache controller, but dis;~llows clctermines if any vector processor errors ;ire the use ofr~acl-onlydata. VAX 6000 I:',-ror F1~117dli1IS: A ?I.CISIIZLI~~(.il/)/)i.o~~cln

A cache enters ETM as a fii~ictionof either soft- When this occurs, ;III appropriate Inessagc is sent to ware or harclware. Tlie cache is put into ETM by the console terminal during system start-up. hardware only when cache data may have been cor- Tag parity errors that occur in the Vl<: ;~rccli;~g- rupted, or wlien caclie data ni;ly he inconsistent nosetl in ;In uliusual manner. Unlike other c;~clics with tl;it;i in memory Thus, correct;ible bzickup on the 6000 Model 600, the tag store of the \!I(: cache errors do not cause a transition into ETM, contains a virtual atlclress. To tletermine which bit but 1lncorrect;lble errors do. A parity error on the has k~iledwhen a parity error occurs, the tag store NVAX dat;~and address lines (NIIAI,) interface causes is probetl to retrieve the contents of the k~iling tlie cache to enter FI'M because an invalicl;~teor tag location. 'rlie ;~ssoci:lteclcl;~t;~ store loc;~tionis write-back request may have been niissed. A c;lche psobetl to retrieve its contents; each [-)it in the Ixlcl transition into UMoccurs when a request for write tag atltlress is then flippetl in turn. As each bit is privilege or a write back tloes not complete suc- flipped, the range of the resultant virtu;~l;~ddress is cessfillly 011 the VrOC 6000 Motlel 600. The state compared to page tables to tletcrmine its valitlity of the cache is likely to be inconsistent with that of within current context. The \lirtii;~laddress is tr;Ins- memory latecl. and tlie resulting ~~liysic;~l;~tltlress is mal~pccl Three recluirenients govern cache operation di~r- to allow error hanclling to reatl the contents of the ing WM: (I) The state of the cache is preservecl as page. The appropriate contents of the ~iewl!. nli~ch:IS possible to allow software to cliagnose the rnappetl page are comp;~retlto the contents reatl problem. (2) Memory references that hit written from the Vl<: data store. If one anrl only one m;~~cli blocks in the c;iche are processed, since this is the is fountl, the failing tag bit is itlcntified. ,\l;lsks of o1i1y source of tI;lt;l in the system. (3) <:ache failing bits fro111 dl VAX 0000 Motlel 600 c:~che coherency reclilests from the NDAL are processed structures ;Ire storecl in the <,PU EEI'ROM long with ~iormallyso that caclie state remains consistent other failure information. with memory. I'he instructio~~pipeline coml-,licatcs VIU: 6000 Although complex, E'TV allows tlie software to Nlotlel 600 error liantlling. In man!. instnnccs, choose wlien ;11it1 how to clis;~ble the cache. To errors experiencetl ;ire in no w;~!. rclatetl to the cur- rn;tke the process of error handling less cumber- rent instruction being executccl or interruptetl. some, the backup caclie is unconditionally put into When an error tloes occur. care must be taken to ETM by the software when any error contlition is fi~llyunderstantl in what context the error h;ts ;In being servicetl. effect. E(;(; proteccs both tag ant1 data stores on tlie 1);ickup c;iclie on the VtiX 6000 Model 600. Correct- Correctable Memory Errors able errors in tlie backup cache h:lve a recorcl Correctable memory errors are tlata errors tli;~t:u-c of failed synclrornes kept by error-hantlling rou- corrected 1)). the memory controller before tl;~t;~is Lines. Shoilltl tlie same syntlrome fail on more than returnecl to tlie requester. They occur pri~i~;~ril!r one occ:ision in a single V>ls session, the b;lckup bcc;iuse of ;~lpli;~~xirticle r;~cli;~tion, affect only ;i sill- c;~clieis tlisabletl. gle cell. ;incl are transient in nature. (:on-ect:ible If uncorrectable b;lckup cache errors occur. memory errors are completely benign. To drtel-- the error-handling routines determine if tlie block mine if a memory controller reporti~igcorrect;~ble is owned by tlie CPII and attempt to flush the errors hxs real defects, multiple errors nus st bc block back to memory If successfi~lor if tlie block i7ie~/etl. is not o\vned, the bacliiip caclie is tlis;~bletlbefore \'AX 6000 error h;~ndli~~p,implements ;I schenie returning the system to normal operation. If whereby error tlata reportetl by memos!- con- tlie data cannot be recovered, the VMS session is trollers for correctable errors is compressed into :I terrnin;lted. structure c;~lleda footprint. Tlie footprint reduces If tlie b;lckup c;~cheis disabletl by error h;uitlling the dat:~ rej,ostetl into :I form th;~t unicl~~el\~ h)r any reason, that fact is recortletl in tlie CPLJ tleacribes the error t1i:it just occurretl. 'l'hc intent I:EI'RCAM. Ilecords on dis;tbled st:rtus are also kept for of the footprint is to uniquely intlex the source tlie primary cache ant1 virtual instruction caclie component of memory the tlyn;~mic IL\AJ (VIC). Subsequent sessions of VMS interrogate tlie (DIWM). Hence, for a given memory sul~s).sten~. EEPROM ;lncl cause these structilres to remain clis- the number of valicl footprints \voultl cqu;~lrhc ;~bletlif tliey were tlis;~bleclin a pre\~ioussession. nuniber of I)IL4als. Furtllermorc, the footprint WAY-microprocessor VAX Systems block maint;~inetlper footlx-int is used to track tlie The data is not corrected in the storage DRAkls on context (repeat errors, scr~~bbing,etc.) of this error the controller. If the location is read over and over ;IS rvell ;IS other errors tl1:lt 111;1tchthis footprint ID. ag:~in,the silme error ancl correction c!.cle occurs The assumption here is that niost correctable each time. This continues until tlie location is III~III~I-yerrors are ;I result of DIWM component upcl;~teclwith write cl;it;~.An i~lterruptc:in be gener- klults, hence the gr;lnularity of tlie ~~niquel)R.r\,\,l. ated for e;~cherror correction cycle. <:are I~ILIS~be As show11 in Figure 3, the footprint forms ;I 52-bit taken when scr~rbbingtnemory 1oc:itions. The data integer from the ?(MI nocle 111, the Ec:c: syndrome of in any given memory :~tltlresscan be sh;ireel by any the error, ;uid the niemor)rcoutroller batik in error. number of <:l'Us or I/() tlevices. When this is tlie The integer is used to locate other correctable case, a higher-level software protocol is ~ior~ii;~lly errors that h;lr~eoccirrred in ;In internal dat;~b;~se. usecl to synchronize access. Error hiu~~cllingwould Along with the footprint, the atltlress of the cor- not be privy to these protocols. \'r\X 6000 Model rectitble error is passed to n set of routines tliat per- 500 0111clVtLY 6000 hlotlel 600 memory scri~bbingis forms all processing of correctable memory errors. possible bec:~useof the Xk112 bus protocol. Hehm a The clatab;~setracks the range of addresses that luve CPlj can moclify any location in Inernor!; it must be exlxriencecl correct:~bleerrors for the same foot- the exclusive owner of the 32-byte bloclt it1 which print. This aitls in the cliagnosis of row and column the address resicles. Ownership is effected ;it tlie kiilures with tlie IIRAh4s that make up memory con- primitive h;~rclwarelevel and so exclusive access is troller storage. On tlie Vta6000 Model 500 ;inel Vm gi~ar;~nteecl. 6000 &loelel 600 systems, nleliior!r scrubbing st;rtus When a correctable error interrupt occur5 on a is also tracketl. \ii\>c 6000 Modcl 500 or VU6000 Moclel 600 system, Mc~i~oryscri~bbing is a tecliniclue for reducing error hantlling rewrites the hliling loc;~tion\vitli its the number of error i~iterri~ptsfrom locations tliat contents. The ;tbilit!, to C;ILISCan interrupt is tlis- are reporting errors causetl by alpli;~particle clistur- abled in memory controllers that continue to bance. SCI-ul~bi~igremoves trslnsient faults from report errors wit11 tlie same h)otprint or that 1i:n.e the s).stem, which in tur~ireduces the number of not responded to scrubbing. This action occurs interrupts that result from such errors. I11 ;lcltlition, after sufficient data hiis intlic;~tedt1i;it something it helps to differenti;~tetrirnsient errors from pel-- other tli;~n;~lpIia particle clisturbancc has occurred ni;lnent I)l:,\kI coml)onent hults, ;IS captllretl in and tlie memor!~controller ~ii;~),require service. the error log. This inforniation wits previously The rate of correct;tble memory error interrupts unnv:~ilable. is checked to reduce tlie bi~rclen011 tlie system. If When the 6000 memory controller tletects the rate of errors occurring bcco~nestoo high, the a correctable memory error, circuitry in tlie con- ability to interrupt is clisablecl at the problem con- troller corrects the data returned for the request. troller for ;I periocl of time. <;orrect;tble memory error data collected cluring a VhIS session is sent to the error log at the encl of the vMS scssion.

Uncorrectable Memory Errors LOWEST ADDRESS I.ncorrectable memory errors experiencetl by the (;I)[ are reported ;IS m;tcliine clieclts. These machine checks are synchronoirs with the P(: mak- COUNT itlp the reference. Ilncorrectable ~iicmor!~errors occur when cl;ita is lost by the Iilernory controller FLAGS ant1 cxnnot be re-created by its ECC circuitry; hjrtu- 210 natel): these errors seldo~lioccur. Ilncorrect:lble KEY: ~ne~nor!~errors represent a serious problem to the 0 BUSY 1 SCRUBBED execution thread that experiences them. 'l'he: harcl- 2 INHIBIT THE REPORTING OF CORRECTABLE ware cannot ;~ssistin the recovery of this type of MEkiORY ERRORS error; I-ecover) is tot;~llya software function. If tlie page that experiences an uncorrect:~ble error is a process private page tlrat has not been modified, ancl the cocle threatl currently executing VAX 6000 El-/.or HLIlz~/lit?~: A P1'~1,q/lultic Ap]~r+oach is at pageabie priorit): the error is not considered represent synchronization points between ;In fatal. Tlie errol--h;indling routines arrange for the error-1i;lndling execution threi~dand the AITEST page to be re-crc:~tedin ;I different physic;~lpage in simulator. memory by inv;~litlatingthe 11ecess;iry menlor)/ The &]TEST simulator is :I privileged in~ageand management structures. As a result, a translation- consists of a user interface, a number of nonpage- not-valid exception occurs when the instruction able internal buffers, and simulator routines. The that experienced the exception is retried. The page user interkice allows the intern;ll buffers to be fi~ult~iiecli;~nisms of the VMS system do the ;lctual selectetl and 1o;idetl with dat:~patterns of the nser's rc-crei~tion.The origin:~lpage with the error is put choice. The user interface ~lsoallows the user to on a list of bael pages internal to the VMS system. If pass control to the S<:R vectors of the VMS system. the page does not meet the criteria for repli~ce- In our case we iiscd the vectors t1i;lt are the link;~ge ~nent,either the process is deletecl or the VkIS ses- to error-handling routines. Once in control, error sion is terriiin;~tecl.If the process is deleted, the handling woiiltl execute its model until it reaclietl a page is mnrketl "batl" by error li;~ndli~ig,and the J>EBII<;-TRANSFER code segment. The segment process run-clown routines in \'&IS retire the page woulcl cleterrnine that this was an error being simu- to the kltl page list. lated ant1 return control to MTES'J'. MTEST woiild then clecicle if the sync1ironiz;ition point was one Testing for which the user has tlata. The clata woulcl be transferred from the buffer namecl in tlie Early in tlie project. we decitled the ability to test 1)EHIIG-TRANSFER code segment to the adtlress also and verify hat1 to be built into error h;indling to pro- decl;ired in the segment. By judicioiisly placing the cluce ;I predictable, robust. ;~ndclu;~lity protluct. 0EBIIC;-TRANSFER synchronization points and care- Although the vt\r; 6000 family ant1 (:I'Us in general fi~llyselecting an appropriate d;ita pattern, we have a number of fei~turesthat allow errors to be were able to simulate any and a11 error contlitions generated, they telicl not to be gener;~l-purpose.In for the ;ippropriate Cl'lJ. most cases, they are designed for use b) special 111 this way we were able to verify ~iianycomplex diagnostic software that tloes not oper;ite in the ;rlgorithms and code paths that would have been context of an operating system, e.g., the VMs oper- difficult to exercise. We were also :ible to verifi ating system. We chose to implement a scheme error h:lndling and error logging from tlie point of whereby errors would be simulated in software on error to the error log file. MTES'I'c;m he either inter- the target 1iarclm~;ire.This approach give us several active or procedure-driven. This aspect allowed us clear aclv;intages. The most important was that the to maintain a library of procedures that could be ;lpproach coultl be extended as the power ancl com- iised at any time to verify that operation;~lchasac- plexity of (:PI] motlels i11cre;isetl and that complete teristics for inclivitlual errors hat1 not cliangecl when control was with the designers. No special liard- cocle p:iths that ;~ffectedJnany error types were ware eclujpment or <:I'll fentiire woultl be required. moclified. The only precondition was that certain software MTEST was the primary tool we used for testing. irnpIement;~tionguiclelines hat1 to be followed to l>uring the test alitl verification phase, prototype make use of the simulator. h~lrclwarethat had real error conditions became M;rcliilie check test (M'TEST) consists of two available, and we used these prototypes. parts, :I ~ttilityant1 an error-handling implementa- tion methodology. The niethodology consists of using main memory storage ;IS the primary agent Conclusions that is acted upon by error h:inclling. l'J?is method Tlie \/AX: 6000 family now has a robiist and complete ;~lsofit into our lilotlel of retaining clata in memory. set of error-hancl ling routines that accomplishetl Tlie other reqiiirement was the str~tegicplacement our project goals. In fact, many routines were never of the DEUIJ<;-TRANSFER macro. I>ERUG-TRANSFER before part of the VMS system. These routines exp;incls to produce a code segment that detes- include the ability to report complete error context luines if the current error being serviced is :III error to tlie system console and the itbility to group fail- sirnulation or not. If it is, data that rcsides in mem- ures occurring across the system to a single error ory that is being interrogated is modified, in con- log entry hn irnport;lnt SMt) feature is the ability to cert with MTEST, to reflect the error condition recognize ant1 retire failing processors from the being simul;~tecl.DEBIJ<;-TRANSFER code segments active set of a VMS session and ;illow the session to

Digital Technical Jorrrrtnl W)l. 4 No. .Z Srmmer 19%' N\'hX-tnicroprocessor VAX Systems

continue. These routines and others support tlie 6000 <:P[is showetl great discipline in producing entire range of \'r\X 6000 (:PU models. The object- engineering specifications that met the needs of oriented ;~ppro;~cIito error conelitions not on the both liartlw;~re;~nd softw;~re engineering groups. CI-'[J nioclule has ti~aclesupport :~ndintrocluction of The niany hours spent p;~instaki~iglyclescril)ing newer routines easier. The ;tbility to test at will any intricate details of error conditions and the produc- or all error-Ix~ncllingroutines has beeti ;I trernen- tion of parse trees allowecl the structured approach clous ;~d~~;lnt;~ge. we set out to achieve. Special thanks to Mike Uliler for his parse trees ant1 to Nick Carr, who suggested Acknowledgments this p;cper be written. Our silccess resulted from a number of factors, Refer rure inclueling the atlvant;tges of clesigning the ability to test into the procluct. 'T'liere is no substitution for 1. G. IJhler et al., "The WS;lnd N\'r\X+ High- ac~~~;~llyexecuting a cocle t1ire;ld to cletcrn~inetlie performance Vra Nlicroprocessors," Digil~il effectiveness of its clesign goal. The various engi- Ticl?lziturl Joun7~11.vol. 4. no. 3 (Summer neering groups involved in designing the many 1992, this issue): 11-23.

10-4 M'ol. ? i\'o .I SIUIIIIZ~,~19'11 Di'ilnl Ticbnicul Jorrrrrnl ISSN 0898-901 X