<<

NVAX- VAX Systems

Digital Technical Journal Digital Equipment Corporation

Volume 4 Number 3

Summer 1992 Editorial Jane C. Blake, Editor Kathleen M. Stetson, Associate Editor Helen L. Patterson, Associate Editor Circulation Catherine M. Phillips, Administrator Sherry L. Gonzalez Production Terri Autieri, Production Editor Anne S. Katzeff, Typographer Peter R. Woodbury, Illustrator Advisory Board Samuel H. Fuller, Chairman Rid1ard W Beane Richard./. Hollingsworth Alan G. Nemeth Victor A. Vyssotsky Gayn B. Winters

The Digital Technical journal is published quarterly by Digital Equipment Corporation, 146 Main Street MLO l-3/B68, Maynard, Massachusetts 01754-2571. Subscriptions to the journalare $40.00 for four issues and must be prepaid i11 U.S. funds. University and college professors and Ph.D. students in the electrical engineering and science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Technical}ournal at the published-by address. Inquiries can also be sent electronically to 01]@CRL.DEC. COM. Single copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 1 Burli11gton Woods Drive, Burlington, MA 01830-4597.

Digital employees may send subscription orders on the ENET to RDVAX::JOURNAL or by interoffice mail to mailstop ML01-3/B68. Orders should include badge number, site location code, and address. All employees must advise of changes of address.

Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address.

Copyright© 1992 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational illstitutions by faculty members and are not distributed for commercial advantage. Abstracti11g with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.

The information ill thejounwl is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the journal.

ISSN 0898-901X

Documentation Number EY-j884E-DP

The followillg are trademarks of Digital Equipment Corporation: Alpha AXP, DEC, DECchip 21064, DECstation, DECwindows, Digital, the Digital logo, KA50, KA52, KA675, KA680, KA690, MicroVAX, MS44, MS670, MS690, Q- , Q22-bus, ThillWire, TURBOchannel, , VAX, VAX-11/780, VAX 4000, VAX 6000, VAX 7000, VAX 10000, VAXcluster, VAX MACRO, VAXstation, VMS, and VXT 2000.

MACH is a trademark and PAL is a registered trademark of Cover Design Advanced Mkro Devices, Inc. The NVAX microprocessor is Digital's fastest VAXimplementation SPEC, SPECfp, SPECint, and SPECmark are registered trademarks of and the common theme of papers in this issue. Our cover graphic the Standard Performance Evaluation Cooperative. joins the NVAX project code name with an image of speed - the SPICE is a trademark of the University of California at Berkeley. chip's most salient characteristic and the performanceadvantage TPC and tpsA-Iocal are trademarks of the Transaction it brings to a range of new VAX systems. Processing Performance Council.

The cover design is byDeb Anderson of Quantic Communications, inc. Book production was clone by Quantic Communications, Inc. I Contents

9 Foreword Robert M. Supnik

NVAX-microprocessor VAX Systems

II The NVAXand NVAX+ High-performance VAX G. Michael Uhler, Debra Bernstein, Larry L. Biro, john F Brown III,jolu1 H. Edmondson,jeffrey D. Pickholtz, and Rebecca L. Stamm

24 The NVAX CPU Chip: Design Challenges, Methods, and CAD Tools Dale R. Donchin, Timothy C. Fischer, Thomas F Fox, Victor Peng. Ronald P Preston, and William R. Wheeler

38 Logical Verification of the NVAX CPU Chip Design Wa lker Anderson

47 The VAX6000 Model 600 Lawrence Chisvin, Gregg A. Bouchard, and Thomas M. We nners

60 Design of the VAX 4000Model 400, 500, and 600 Systems Jonathan C. Crowell, Kwong- A. Chui, Thomas E. Kopec, Samyojita A. Naclkarni, and Dean A. Sovie

75 The Design of the VAX 4000Model lOOand MicroVAX 3100 Model90 Desktop Systems Jonathan C. Crowell and David W Maruska

82 The VAXstation 4000Model90 Michael A. Callander, Sr., Lauren M. Carlson, Andrew R. Ladd, and Mitchell 0. Norcross

92 VAX 6000 Error Handling: A Pragmatic Approach Brian Porter I Editor's Introduction

paper on the new mid-range VAX 6000 multipro­ cessing system, Larry Chisvin, Gregg Bouchard, and To m Wenners explain the moc.lule design decisions that supported the goals of oOOO -series compatibil­ ity ami time to market. Of particular interest are the schedule and performance benefits derivec.l from developing a routing ami control interface ch ip. The engineers for new low-end c.leskside systems also chose to develop custom chips-a chip, memory moc.lule, and an 1/0 con­ troller. Jon Crowell, Kwong Ch ui, Tom Kopec, Sam Nadkarni, and Dean Sovie c.liscuss the chip func­ Jane Blake C. tions that were key to exceeding the performance Editor goal of three times that of the previous VAX 4000. For the low-end VAX 4000 Model 100 system and The microprocessor is a high-performance, NVt''< the Micro VAX 3100 desktop servers, designers sa\'ed single-chip implementation of the VAX architecture. significant time by "borrowing" existing compo­ It is toc.lay's fastest VA X microprocessor anc.l the CPU nents from proven systems. Jon Crowell and Dave at the heart of the mic.l-range, low-end, and work­ Maruska relate decisions that allowed them to c.lou­ station systems described in this issue of the Digital ble performance and complete the work within the Te ch nical.fournal. extraordinarily short time of nine months.

The NVt'v'Cchip is not only fast, with cycle times as The newest VAXstation , based on low as 11 ns, but also holds a unique position in the l\fVAX, is the Moclel 90. Mike Callander, Lauren Digital family of microprocessors: NVA X is both an Carlson, Andy Ladd, anc.l Mitch Norcross present upgrade path for existing VAX systems and a migra­ their design methodology. Most significant for tion path to Alpha A.XP systems. In their paper on development was the decision to implement new

the NVAX and NVA.X+ chips, Mike Uhler, Debra in programmable technology, which allowed Bernstein, Larry Biro, John Brown, John Edmonc.l­ bug fixes in minutes rather than weeks. son, Jeff Pickholtz, and Rebecca Stamm present an Not about system design but rather error han­ overview of the comp.lex microprocessor designs dling in 6000 systems, Brian Porter's paper and relate how RJSC techniques are used in this CISC describes an approach that reduces the amount of machine to achieve dramatic increases in perfor­ unique coding traclitional.ly required for error han­ mance over previous implementations. dling. He c.letails the development of sophisticated Increases in performance are also attributable to error hand I ing routines that accommodate the the CMOS-4 0.75-micrometer technology in complexity of the symmetric VAX which the NVAX is implemented. In their paper 6000 moc.lels. about the verification of the physical design, Dale The ec.litors thank Mike Uhler of the Semi­ Donchin, Tim Fischer. Frank Fox, Victor Peng, Ro n conductor Engi neering Group, who ensured that Preston, and Bi 11 Wheeler c.lescrihe the methods and the standards of excellence applied to NVAX devel­ the CAD tools created to manage the complexity of opment were appl iec.l to the development of this a chip with 13 million . issue. Also, this issue is notable editorially because The rigorous use of the CAD t.oo.lsancl thorough it is the first in which papers have been formally simulation-based testing resulted in highly hlllc­ refereed. I thank Gene Hoffnagle, editor of the

tional first-pass chips. In his paper about the logical JB,H Systetns .Tournai, for encouraging the use of verification, Walker Anderson discusses the suc­ the referee process in any journal worthy of the cessful strategies used to ensure no "show stopper" name. D1J issues will continue to be refereed so bugs existec.l in the design. Highl ighting major that we may offer engineering and academic read­ strategies, he reviews the behavioral models anc.l ers informative and relevant technical discussions. pseudorandom exercisers at the core of the verifi­ cation effort.

Each system design team chose a different approach to take advantage of NVAXperf ormance and to meet system-specific requirements. In a

2 Biographies I

Walker Anderson Principal engineer Wa lker Anderson is a member of the Models, To ols, and Ve rification Group in the Semiconductor Engineering Group. Currently a co-leader of the logical verification team fo r a future chip design, he Jed the NVAX logicalverifi cation effort. Before joining Digital in 198H, Wa lker was a diagnostic and testability engineer in a CPU development group at Corporation fo r eight years. He holds a B.S.E.E. (1980) fro m Cornell University and an M.B.A (1985) from Boston University.

Debra Bernstein Debra Bernstein is a consultant engineer in the Semi­ conductor Engineering Group. She worked on the CP design and architecture fo r the NVA)( microprocessor and the VAX 8700/8800 systems and is currently co-architecture leader for a future Alpha t\.-'\P processor. Deb received a B.S. in computer science (1982, cum laude) from the University of Massachusetts in Amherst. She holds one patent, has two patent applications pending, and has coautbored several technical papers.

Larry L. Biro Larry Biro joined the Electronic Storage Development (ESD) Group in 1983, after receiving an M.S.E.E. from Rensselaer Polytechnic Institute. While in ESD, he contributed to the advanced development of solid-state disk products. Larry joined the Semiconductor Engineering Group as a custom cir­ cuit designer on the NVAX E-box. Later, as a member of the NVAX+ chip imple­ mentation team, Larry designed the clock and reset logic, coordinated back-end ve rification efforts, and co-Jed the chip debugging. Currently, he is the project leader fo r a fu ture, single-chip VAX implementation.

Gregg A. Bouchard Gregg Bouchard is a senior hardware engineer with the Semiconductor Engineering Group. His current responsibilities include the mod­ ule design of a DECchip 21064 daughter card that contains a CPU-to-bus interface. Previously, Gregg worked on chip design of the NVAX-to-X.Ml bus interface fo r the VA,'\ 6000 Model 600, and field programmable chips fo r the VA.,'\station 4000 Model 90. Gregg joined Digital in 1986 after receiving his B.S. E. E. from the M.S.E. Rochester Institu te of Te chnology. He also holds an E. from Northeastern University and has a patent pending related to hardware queue structure.

3 BiogmjJhies

john F. Brown After receiving an :viS.E E. from Cornell University in 1980, John Brown joined the engineering staffat Digital. At present, he is a consultant en gineer working on Alpha AXP microprocessor advanced development. John "s previous responsibilities include managing the design of the instruction decode section of the NVAX microprocessor. He also made technical contributions to the VAX 6000 Model 200 ami 400 chip sets, and was hardware engineer for the extended floating-point Lnhancement to the VAX-11/780 system . John holds two patents ancl has seven applications pending.

Michael A. Callandet·, Sr. Michael Callander is a principal engineer in Digita l's Semiconductor Engineering Group. Mike was the technical leader for the VAXstation 4000 Model 90 system. His previous experience with Digital includes design and architectural specification for various CPU modules ami sys­ tems, i ncluding the VAX 8200 and th e VAX 6000 ;\'lode! 400 and Model )00 Mike received his B.S E.E. from the University of Massachusetts in 19H2 ancl joined Digital upon gra duatio n. He has authored several technical papers and has a number of patent applications pending.

Lauren M. Carlson A senior hardware engineer in the Semiconductor Engi­ neering Group, lauren Carlson is currenrly working on the design of a periph­ eral chip set for a new microprocessor. Previously, she designee! the VA Xstation 4000 Model 90 CDAI.-to-EDAI. adapter chip (CEAC) gate array, which is part of the I/O subsystem. Lauren also contributed to the design of the VA Xstation 4000 Model 90 system module and another VAX system CPU module. Prior to this, Lauren worked in the Advanced VAX Development Group. She received her BSE.E. from Worcester Polytechnic Institute in 19H6 and joined Di gital in 19H7

Lawrence Chisvin A principal hardware engineer in the Semicomluctor Engineering Group, larry Chisvin is involved in the design of modules and systems basecl on the DE<:chip 21064 microprocessor. Larry also provides tech­ nical support for customers, including Alpha AXP architecture presentations ancl example designs and application notes. Previously, he worked on processor and memory modules lor the VAX (JOOO Model 600. He holds a H.S.E.E. (su mm a cum Iauck) from Northeastern University and an ;\'l.S.Ef. from Worcester Polytechnic Institute. He is a member of the IEEE Computer Society and the AC:vl

Kwong-Tak A. Chui Kwong-Tak Chui is a principal hardware engineer in the Semiconductor Engineering Group . He is working on the design of the l/0 con­ troller section of a new Cl'll. Kwong was the project leader f()r the l/0 controller chip of the NVt\..\': chip set used in the VAX 4000 Model 400, 500, and 600 systems. Since joining Digital in 19H), he has worked on four other VLS! chip design proj­ ects for the V\X :)000an d VAX 4000 series . Kwong holds a B.S. in com­ puter engineering (19H'5) from the University of Illinois at rhana-Champaign ami an : SE. E. (l9H9) from Cornell University. I

Jonathan C. Crowell An engineering manager in the Entry Systems Business Group, jon Crowell was the project leader and system engineer on the VAX 4000 Models 100, 400, 500, and 600 and the MicroVAX 3800, 3900, and 3100 Model 90 systems. He is now working on the design of the next generation of VAX4000 sys­ tems. Previously, jon worked in the Systems Integration Group qualifying Q-bus devices and ossr adapters and storage devices. He joined Digital in 19H6. jon received a B.S.E. E. (1981 ) and an MS.E.E. (1986) from Northeastern University. He hold six patents and is an active member of IEEE.

Dale Donchin Dale Donchin manages schematic entry and layout verifica­ tion CAD tool development in the Semiconductor Engineering Group. He facili­ tated the use of CAD tools for NVAX design, primarily for layout and physical chip verification. Dale is presently performing in a similar capacity for a new micro­ based on the Alpha A.XP architecture. He joined Digital in 1978, and was previously a development manager in the RSX group. Dale holds a B.S.E.E. (1976, honors) and an M.S.E.E. (1978) from Rutgers University College of Engineering and is a member of IEEE and ACM.

John H- Edmondson john Edmondson is a principal engineer in the Semi­ conductor Engineering Group. At present, he is co-architect of a future RISC microprocessor. Previous to this, he was a member of the VAX 6000 Model 600 CPU chip design team. Before joining Digital in 1987, .John designed mini­ computers for five years at Canaan Computer Corporation. He also worked at Massachusetts General Hospital for two years, researching applications of tech­ nology to anesthesia and intensive care medicine. John received a B.S.E.E from the Massachusetts Institute of Technology in 1979.

Timothy C. Fischer Tim Fischer is a senior hardware engineer with the Semi­ conductor Engineering Group. He is working on the design of a floating-point unit for a future high-performance microprocessor based on the Alpha AXP architecture. Prior to this work, Tim was a member of theE-box design team and contributed to the design of global clock generation and distribution for the NVAX microprocessor. He also worked on the design of the bus interface unit on the NV�'(+ chip. Tim joined Digital in 1989 after receiving his M.S. in computer engineering from the University of Cincinnati.

Thomas F. Fox Frank Fox, a consulting engineer in the Semiconductor Engi­ neering Group, co-led the implementation of the NVAXmicroprocessor ancl con­ su.lted with the Advanced Semiconductor Development Group on the design of the GviOS-4 technology. He joined Digital in 1984 and worked on the implemen­ tation of the CVAX microprocessor. Frank received a B. E. degree from University College Cork, National University of Ireland (1974), and a Ph.D. degree from Trinity College, Dublin University (1978), both in electrical engineering. He has published papers on ultrasonic instrumentation, MRJ scanners, and VLSI design.

5 Biographies

Thomas E. Kopec Principal engineer To m Kopec was a member of the Entry Systems Business Group that designed the VA}\: 4000 Models 200 through 600, and the Micro VAX 3500 and 3800 systems. He also led an Alpha AXL' development project. Recently, To m joined the Assistive Te chnologies Group to work on com­ pact speech-synthesis and text-to-voice systems. He received a BS LC.E. (1980, with honors), concentrating in microwave engineering ami signal processing, and an ivi.S.E.C.E. (1985), concentrating in image processing and computer graph­ ics, both from the University of Massachusetts at Amherst.

Andrew R. Ladd Andy Ladd is a principal engineer in the Semiconductor Engineering Group and is the project leader for the VAXstation 4000 Model 90 CPU aml low-cost graphics module designs. Previously, he providecl timing verifi­ cation support fo r the DECchip 21064 and co-designed two bus interface chips h>r the VAX6000 Model 400 CPU module. Andy joined Digital in 1986. He received his B.S. in computer engineering from the University of Illinois ( 1984) ami his MS in computer science and engineering from the University of Michigan (1991 ). Andy is a member of the IEEE Computer Society, Tau Beta Pi, and Eta Kappa Nu.

David W. Marus ka Principal engineer David Maruska is a member of the Entry Systems Business Group and is presently involved in the design of the next generation of VAX 4000 crus. He was the lead designer fo r the KA 50 and KA 52 Cl'lJs ami project leader for the VAX4000 Model 200 system, the KZQSA Q-bus-to­ SC:SI adapter, and the + exerciser. Dave joined Digital in 1982, after receiving a BS. in computer engineering from Boston University. He worked on graphics fo r Mosaic Techno logies and Raster Technologies from 1985 to 1985 and then returned to Digital in 1986.

Samyojita A. Nadkarni Sam Naclkarni, a principal harchvare engineer in the Semiconductor Engineering Group, is currently the project leader for a set of support chips fo r the DEC:chip 21064 processor. She was the project leader for the NMC chip used in the VAX 4000 Model 400, 500, and 600 systems. She also worked on memory controller/bus adapter chips fo r the VAX 4000 ,\1odel :)00 and MicroVAX 3500 systems. Sam joined Digital in 1985 and holds a Bachelor of Te chnology (198.3) from the Indian Institute of Technology ancl an ,viS (J98'S) from Rensselaer Polytechnic Institute.

Mitchell 0. Norcross Senior engineer Mitch Norcross has been designing and analyzing digital subsystems since he joined Digital in 1986. As a member of the Semiconductor Engineering Group, Mitch designed a gate array fo r the VAXstation 4000 Model 90 and contribu ted to the design ami analysis of several system and CPU modules. Prior to joining SEG, Mitch designed a gate array for Digital's first fault-tolerant VAJC system, the VAXft :)000. He received a ll.E. in elec­ trictl engineering (1985) and an M.S. in computer engineering ( 1987), both from Manhattan College. Mitch holds one patent on fault-tolerant system design.

6 I

Victor Peng Victor Peng is a consultant engineer in the Semiconductor Engineering Group. He received his B.S. in electrical engineering from Rensselaer Polytechnic Institute in 1981, and his M.S. in electrical engineering from Cornell University in 1982. Victor joined Digital in 1982. He was a member of the design team for the VAX 8200/8300 memory interface chip and patchable chip. He led the implementation of the VA,'\ 6000 Model 400 float­ ing-point chip and was co-manager of the NVA,'\ chip design team.

jeffrey D. Pickholtz Jeffrey Pickholtz received an A.S (1977) in specialized technology from Penn Technical Institute and a B.S.E.ET (1989) from Central New England College. He joined Digital in 1977 as a technician and worked on a variety of midrange computers. More recently, his responsibilities have included technical contributions to the VAX 6000 Models 400 and 600, and VA,'\ 7000 Model 600 chip sets. Currently, Jeff is a senior engineer in the Semiconductor Engineering Group, leading the implementation of a future CMOS microprocessor chip.

Brian Porter As a consulting software engineer in the Systems Group of VMS Development, Brian Porter was responsible for CPU error handling in the VAX 6000 family. Prior to this work, he was responsible for support of VAX systems and was an author and maintainer of the VMS error log utility SYE. Brian is the author of the original VMS striping driver, which was later developed by others into the VMS striping driver product. He currently works in the Executive Group of VMS Development and is responsible for symmetric multiprocessing. He has two patents pending on memory error handling. Brian joined Digital in 1973.

Ronald P. Preston Ronald Preston is a principal engineer in the Semi­ conductor Engineering Group. Since joining Digital in 1988, he has worked on the design of several microprocessors. Ron was the circuit design and imple­ mentation leader for the E-box on the NVAX microprocessor. He is currently designing the instruction issue logic for a superscalar RISC microprocessor. Prior to joining Digital, he worked on the design of CMOS microcontroJ lers for Signetics Corporation. Ron received his B.S.E.E. in 1984 and his M.S.E.E. in 1988 from Rensselaer Polytechnic Institute. He is a member of Eta Kappa Nu and lEE E.

Dean A. Sovie Design engineer Dean Sovie joined Digital in 1981 and is a member of the Electronic Storage Development Group. He is currently involved in the battery backup laser memory design. Prior to this work, he contributed to the design of the high-performance memory module used in the VA,'\ 4000 Model 400, 500, and 600 systems. Dean also helped design memories for the VAX 6000, VAX 4000 Model 200, and PDP-11 systems. In addition to his present responsibili­ ties, he is pursuing a degree in electrical engineering at Northeastern University.

7 Biographies

Rebecca L. Stamm Rebecca Stamm is a principal hardware engineer in the Semiconductor Engineering Group. She led the design of the backup , the bus interface, and the pin bus for the NVAX CPU chip. Rebecca then led the chip debug team from tapeout through final release of the design to manufacturing. Since joining Digital in 1983, Rebecca has been engaged in microprocessor archi­ tecture and design. She holds two patents and has seven applications pending. Rebecca received a B.A. in history from Swarthmore College ancl a B.S. E.E. from the Massachusetts Institute of Technology.

G. Michael Uhler Michael Uhler is a senior consultant engineer in the Semi­ conductor Engineering Group, where he leads the advanced development effort for a new high-performance microprocessor. As chief architect for the NVAX and REX520 microprocessors, Mike was responsible for the CPU architecture, perfor­ mance evaluation, behavioral modeling, CPU , and CPU and system debug. He received a B.S.E.E. (1975) and an M.S.C.S. (1977) from the University of Arizona and joined Digital in 1978. Mike is a member of IEEE, ACM, Tau Beta Pi, and Phi Kappa Phi and holds eight patents.

Thomas M. Wenners Thomas Wenners is a senior hardware engineer in the Semiconductor Engineering Group. He is responsible for the design of a next­ generation VA,'\workstation CPU module and for CPU module designs of DECchip 21064 microprocessor-based products. Tom's previous work includes the mod­ ule design of the VAX 6000 Model 600, module design and signal integrity sup­ port on ESB products, and analysis ancl evaluation of advanced chip ancl module packaging. Tom joined Digital in 1985. He received a B.S.E.E. (1985, cum laude) and an M.S.E.E. (1990) from Northeastern University.

William R. Wheeler A principal engineer in the Semiconductor Engineering Group, Bill Wheeler performed architectural definition work for the NVAX microprocessor and was project leader of the NVAX M-box. Prior to this work, he designed the !-box unit for the third-generation VLSI VAX implementation. He is currently involved in advanced development work for improving custom design tools and methods. Before joining Digital in 1983, Bill received both his B.S.E.E. (1982) and his M.S.E.E. (1983) from Cornell University. He has coauthored three papers on VLSI designs and holds three patents.

8 I Foreword

of unparalleled performance and quality. The results speak for themselves.

• From its initial shipmen t in October 1991 through today (a year later), NV�'< was (and is) the fastest shipping crsc microprocessor in the world, whether measured by , SPECmarks, or transactions per second.

• NVAX had fewer bugs after design completion, and went from tape-out to production more quickly than any microprocessor in Digital's history. Robert M. Supnik CorporateConsu ltant, • NVAX systems, spanning the range from work­ Vi ce President station through mainframe, all shipped on or Technical Directo1; ahead of schedule, meeting or exceeding pre­ dicted performance. Alpha AXP and VAX Systems Anoutstanding engineering achievement indeed' The roots of NV�'< can be traced back a decade to two distinct engineering programs: the High-end Systems Group's studies and implementations of If, as the popular saying goes, "Once is happen­ highly pipelined VAX systems; and the Semicon­ stance, twice is coincidence, three times is con­ ductor Operations Group's projects in pro cess certed action," then four consecutive instances of development and microprocessor design. outstanding engineering achievement must be even The High-end Systems Group started work on more significant. highly parallel VAX systems in 1979, designing and Since 1985, Digital has designed, developed, and building the V�'< 8600-the first VAX to include shipped four generations of leadership V�'< micro­ overl.apped operand decoding (see the Digital processors and CMOS-based systems: Tecbnicaljournal, August 1985). At the same time, • In 1985, the MicroVAX chip and resulting systems a research team described HyperVAX, a hypotheti­ (such as the MicroVAX II and the VAXstation cal fully pipelined design. Although HyperVAX was 2000) never built, its had a strong influ­ ence on the design of the VAX 9000, Digital's ECL • In 1987, the CVAX chip and resulting systems mainframe (see the Digital Technicaljo urnal, Fall (such as the MicroVAX 3800, the VAX 6000-200, 1990). And the microarchitecture of the V�'< 9000, and the V�'

9 Fo reword

semiconcJucror process development schedule, tion, CAD development fo cused on improvements requiring breakthroughs in concurrent develop­ to productivity and accuracy through design syn­ ment of product and process. And third, the time thesis, electromigration analysis, and capacitance allottccl from chip design completion to system and resistance extraction. shipment was the shortest in Digital's history, The size and complexity of the design, as well as requiring unprecedented accuracy in chip and the stringent schedule constraints, also dictated an system design and verification. early start on verification issues. The verification As in past projects, work in various disciplines­ strategy fo rmed an integral part of the design effort semiconductor process development, chip micro­ from the outset. The verification team developed architecture and circuit design, microprocessor tools and strategies fo r verifying the microarchitec­ design tools, chip and system verification tools, ture, the microcode, the logic, the circuits, the chip ancl system design-cascaded from process as a component, and the chip in a system . through systems. First to start was a team from Lastly, the various system groups-data center Advanced Semiconductor Development (ASD), systems, office systems, workstations- began which designed, simulated, and introduced into designing systems to util ize the NVAX chip's capabil­ manufacturing CMOS-4, Digital's fo urth generation ities. Each group was able to build on the work of CMOS technology (see the Digital Te chnical clone in past VAX systems and designed an NVAX ­ jo urnal, Spring 1992). Building on prior technology based system that functioned both as an upgrade of generations, CMOS-4 contained many fe atures­ past systems and as a fo rmidably competitive new three layers of metal interconnect, salicide, preci­ system in its own right. sion resistors, local interconnect, deep diffusion The work of these project teams dovetailed per­ ring-which cJirectly supported the performance fectly. NVAX completed design and taped-out in late requirements of NVAX . In addition, AS D and Semi­ November 1990, just as the CMOS-4 process was conductor Manufacturing pioneered new tech­ ready fo r chip prototyping. Due to the outstanding niques fo r process transfer and qualification which work of the chip design, system design, CAD, and dramatically sl1ortened the time requirecl to debug verification teams, first-pass parts booted the VMS and qualify the CMOS-4 process. operating system at speed in early March 1991. The In parallel, a design team from the Semicon­ process team qual ified CMOS-4 in October 1991, and ductor Engineering Microprocessor Group initi­ systems using second-pass parts shipped to r rev­ ated microarchitectural and circuit studies. The enue that same month-three months ahead of team started with the VAX 9000, but they quickly schedule-with performance significantly greater discovered that the diffe rence in implementation than the original goaL media (multichip ECL gate arrays tor the VAX 9000, Clearly, the outstanding results from all the NVAX single-chip custom CMOS to r NVt�'l: ) required sig­ engineering projects are neither happenstance nificant changes and new concepts. The micro­ nor coincidence; rather, they represent concerted architecture sub-team used abstract ami detailed action-ream excellence and individual brilliance­ performance models, studies from existing VAX at its finest. Hundreds of people contributed to the systems, and experience with past designs to drive outcome. This issue of the Digital Technical quantitative decisions about fe atures and fu nctions jo urnal is their story. in NVt�'l: . At the sal1le time, the circuit sub-team fo r­ mulated the overall design, circuit, and clocking methodologies fo r the chip and established the fe asibility of the target cycle time, chip size, and lay­ out floor plan. As the microarchitectural concepts solidified. the design team realized that NVAX would be tbe largest and most complex chip ever designed at Digital, and that it would place unprecedented stress on the capabilities of both designers and cJesign tools. Accordingly, they initiated a partner­ ship with the Semiconducror Engineering CAD Group to improve current tools and to develop new tools. In addition to traditional areas like simula-

10 G. Michael Uhler Debra Bernstein Larry L. Biro

jo hn E Brown III jo hn H. Edmondson jeffr ey D. Pickholtz

Rebecca L Stamm

TheNVAX andNVAX + Highperfo rmance �AficrojJrocessors

+ The NVAXand NVAX CPUchips are high-performance VAX microprocessors that use techniques traditionally associated with RJSC microp rocessor designs to dramati­ cally improve VA)( ormancperf e. The two ch ips provide an upgrade path fo r existing

VAX sys tems and a migration path fr om VA X systems to the new Alpha AXP sy stems. The design evolved throughout the project as time-to-market, performance, and complexity trade-offs were made. Sp ecial design fe atures address the issues of debug, maintenance, and analysis.

The NVAX and NVAX+ CPUs are high-performance, on the high-leverage design points and on an single-chip microprocessors that implement unprecedented verification effort.' Digital's VA.'< architecture.1 The I\fVA.'\ chip provides Support fo r multiple system environments, an upgrade path fo r existing systems that use the compatibility with previous VA X products and previous gener::ttion of VAX microprocessors. The systems, and a means to migrate from traditional NVAX+ ch ip is used in new systems that support VAX systems to the new Alpha AXP platforms were Digital's DECchip 21064 microprocessor, which also important design go als. These go als had a pro­ implements the Alpha AX P architecture 2-' These fo und impact on the design of the cache protocols

two NVA.'\ chips share a basic design. and the external bus interfaces. NVA..'C and NVAX+ The high-performance, complementary metal­ engineers worked closely with engineers in ox ide semiconductor (CMOS) process used to Digital's systems groups during the definition of implement both chips allows the application of these operations. pipelining techniques traditionally associ::tted with The paper begins by comparing the basic fe a­

reduced instruction set computer (RISC) CPUs.·< tures of the NVA.'< and NVAX+ chips and then Using these techniques dramatically improves the describes in detail the ch ip interfaces and design performance of the NVAX and NVAX+ chips as com­ elements. This description serves as the fou ndation pared to previous VAX microprocessors and results for the ensuing discussion of design evolution and in performance that approaches and may even trade-offs. The paper concludes with information exceed the performance of popular industry RJSC about the special design fe atures that address the microprocessors. issues of debug, maintenance, and analysis. The chip design evolved throughout the project as the goals influenced the schedule, performance, Comparison of the NVAX and complexity trade-offs that were made. The two and NVAX + Chips primary design goals were time-to-market, without The NVAX and NVAX+ chips are identical in many sacrificing quality, and improved VA.'\ CPU respects, differing primarily in external cache and performance. Our internal goal was fo r the NVAX bus support. NVAX is intended to r systems that use CPU performance to be more than 25 times the previously designed VA.'C microprocessors. The performance of a VAX-11/780 system in a datacenter following systems currently use the NVA.'C chip: system. Achieving these goals required meeting the VA ..'\s tation 4000 Model 90; the MicroVAX aggressive schedules ancl thus concentrating 3100 Model 90; the VAX 4000 Models 100, 400,

Digital Te clmical]ournal Vo l. 4 No. 3 Summer 1992 11 NVA,'(-microprocessor VAX Systems

'iOO, and ()()(); and the VAX ()OOO Model (}()() -" -s ') to provide high performance, are overlapped with NV.\X supports an external write-back cache arbitration fo r fu ture transactions and acknowl­ that implements a directory-based broadcast edgment of· previous transactions. The NDAL bus coherence protocol that is compatible with earlier protocol allows up to two disconnected reads and VAX systems. 1" multiple write-backs to be outstanding at tbe same NVAX+ is designed t< Jr systems that usc the time, using identifiers to distinguish the diffe rent DECchip 21064 microprocessor implementation transactions. External interrupt requests are of the Alpha AX!' architecture and is currently used received through dedicated lines and arbitrated by in the VAX 7000 Model (i OO and the VAX 10000 logic in the C:l'll. Model (}OO S\ stems. NVAX+ supports an external The NVAX+ chip interfaces to an external cache ami bus protoco l that is compatible with that write-hack B-cache implemented with tag ami data of the D!Tchip 210()4 microprocessor. In existing static I�AMs on the module through a port shared S\'stems. '\VAX+ is eonfigureLI to support an exter­ with system contro.l Iogie, as shown in Figure 2. nal write-back cache that implements a conditional Responsibility for controlling the cache port is write-update snoopv coherence protocol.11 shared between NVAX+ and the system environ­ The two Cl'l l chips provide both the means to ment: the NVAX+ chip handles the common cases upgrade installed VAX S\'Stems, thus protecting pre­ of read hit ami exclusive write. ami the system vious investments, and a migration path from a VAX environment provides cache policy control fo r microprocessor to a DECchip 210()4 microproces­ other events. 'The size and speed of the cache can sor in the new Alpha AX!' systems. be configured to allow a range of possible system configurations. Chip Inteifaces The DE<:chip 21064 data and address lines (EDAL) The NVAX chip interfaces to an external write-hack constitute a demultiplexed, bidirectional bus with cache CH-eadle) through a private port with tag 2<) bits of address. 128 bits of data, and the associ­ and data static random-access memories (RA.\ls) ated control signals. This bus operates at one- h alf, on the module, as shown in Figure I. The ize and one-third, or one-fourth the frequency of the CPU speecl of the cache are programmable, allowing from clocks provided lw the CPU. The speed of the the chip to accommodate a range of possible sys­ system clocks can be programmed to accommo­ tem configurations. date various RAM and system speeds. At power-up The 1 AX data and address lines (NDAL) con­ time, initialization info rmation, including RA�I stitute a 64-bit bidirectional external bus with asso­ riming, and diagnostics are loaded from a serial ciated control signals that operates at one-third the read-only memory (RO�'I) into the on-chip cache. frequency of the C:Pll from clocks provided by the ·rhe externalint errupt handling is similar to that of C:l'll. Addresses and daL1 are time-multiplexed and, the NVAX chip.

OSCILLATOR CACHE TAG

ADDRESS B-CACHE TAG AND DATA DATA RAMS

NVAX

NDAL

CONTROL SYSTEM LOGIC INTERRUPTS

SYSTEM CLOCKS

Ngm ·e I NVAX ChifJ Interface Block /)iagmm

. 12 11>1. J No. .l Siiiii!IU!I' I'J'J..! Digital Te chnical jo urnal 17J e NVAX and NVAX + HigiJ-petjonnance VA X Microprocessors

OSCI LLATOR CACHE TAG

SERIAL ROM ADDRESS B-CACHE TAG AND DATA DATA RAMS

NVAX+

EDAL

CONTROL SYSTEM LOGIC INTERRUPTS

SYSTEM CLOCKS

+ Figure 2 JVVAX Chip fn teJface Block gramDia

Electrical and Physical Design Architectural Design Process technology, clocking scheme, clock fre­ The NVAX/NV�"\ + design is partitioned into five quency, and clie specifications are elements of relatively autonomous hmctional units: the instruc­ the electrical and physical design of the NVAX tion fe tch and decode unit (I-hox), the integer ancl and NVAX + chips. Both chips are implemented in logical instruction (E-box), the float­ Digital's fo urth-generation complementary metal­ ing-point execution unit (F-box), the address trans­ oxide semiconductor (CviOS-4) technology. CMOS-4 lation and primary cache interface (M-box), and the is a 0.75-micrometer, 3.3-volt process with support external cache ancl interface (C-box). fo r 5-volt input signals at the pins. The CMOS-4 Queues placed at critical interface boundaries nor­ process is optimized for high-performance micro­ malize the rate at which the units process instruc­ processors and provides short (05-micrometer) tions. A block diagram of the NVAX and NVAX+core channel lengths and three layers of metal inter­ is shown in Figure 3. connect. This robust and reliable process has been used to produce NVAX chips in volume fo r The !- box more than a year and is the same CMOS process used The 1-box fetches and decodes V�'{ instructions, in the DECchip 21064 microprocessor. evaluates operand specifiers, and queues operands NVA)( and i\T VA X+ use a fo ur-phase clocking in canonical form for further processing. included scheme, driven by an oscillator that operates in the 1-box is a 2-kilobyte (KB), direct-mapped vir­ at fo ur times the internal clock frequency. The oscil­ tual instruction cache (VIC) with 32-byte cache lator frequency is divided by an on-chip, finite­ blocks. For reliability, the VIC includes parity pro­ state-machine clock ge nerator; a low-skew clock tection on both tags and data. distribution network is used fo r both internal and During each cycle, the I-box attempts to fe tch external clocks. 8 bytes of instruction data from the VIC and place To meet the needs of the system designer, the tl1is data in an empty slot in the prefetch queue (PFQ). two chips are designed fo r use at various frequen­ A VIC miss incurs a three-cycle penalty if cies. At present, NVAX is used in systems at internal the requested data is fo und in the primary clock frequencies of 83.3 megahertz (MI-lz) cache. PFQ data is then decoded into the next VAX (12-nanosecond [ns] clock cycles), 74 .4 Ml-lz (14-ns instruction component, which may be one of clock cycles), and 62.5 MHz (16-ns clock cycles). the following: operation code () and first NV�'{+ is used in systems at a frequency of 90.9 MHZ specifier or branch clisplacement, subsequent (11-ns clock cycles). specifier, or implicit specifier (an imaginary speci­ Each chip contains 1.3 mill ion transistors on a fier included to improve the performance of die that is 16.2-by-14.6 millimeters in size. NVA)( is some instructions). The 1-box enters the opcode­ packaged in a 339-pin, through-hole pin grid array. related information into the instruction queue, NVAX + is packaged in a 431-pin, through-hole pin the pointers to source and destination operands grid array. into their respective source and destination

Digital Te chnical journal 14>1. 4 No J Summer 1')')2 ___ , E-60X, F-BOX MBOX --,----- oox -- I Z I I

6-CACHEI SYSTEM P I NS

------,- 1 I I I I I C-BOX I LATCH I I '------'r � ------'

Figure 3 Block Diagmm of the NVAX/N VAX+ Core The NVAX and NVAX + Highperfnrmance VA X Micrnprncessors

queues, and the branch-related information into and POPR. A population provides the num­ the branch queue. ber of bits set in a mask and is used in the CALLS, For operand specifiers other than short literal or CALLG, PUSHR, and POPR instructions. In addition, register mode, the 1-box decode logic invokes the microcode can operate the pipelined complex specifier unit (CSU) to compute (ALU) ancl shifter independently to produce two the effective address and initiate the appropriate computations per cycle, which can significantly improve the parallel operation of the complex memory request to the M-box. The csu is similar in fu nction to the load/store unit on many traditional instructions. RISC machines. In addition to normal instruction processing, the The 1-box automatically redirects the program E-box performs aJJ power-up functions and inter­ counter (PC) to the target address when it decodes rupt and exception processing, directs operands to one of the fo llowing instruction types: uncon­ the F-box, and accepts results from the E-box. To ditional branch, jump, and subroutine call and guarantee that instructions complete in instruction return. The branch-taken penalty is two cycles for stream order, the E-box orchestrates result stores any conditional or unconditional branch. To keep and instruction completion between the E-box and the fu ll across conditional branches, the E-box. 1-box includes a 512-bit by 4-bit branch prediction array. The prediction is entered in the branch queue The F- box by the 1-box and compared with the actual branch The F-box performs longword (32-bit) integer mul­ direction by the E-box. If the 1-box predicts incor­ tiply and floating-point instruction execution. The rectly, the E-box invokes a trap mechanism to drain E-box supplies operands, and the F-box transmits the pipeline and restart the 1-box at the alternate results and status back to the E-box . PC. A branch rnispredictincu rs a fo ur-cycle penalty The F-box contains a four-stage, floating-point fo r a branch that is actually taken and a six-cycle and integer-mult iply pipeline, and a nonpipe­ penalty for a branch that is not taken. lined, floating-point divider. Subject to operand availability, the F-box can start a single-precision, Th eE-box floating-point operation during every cycle, and a The E-box is responsible fo r the execution of all double-precision, floating-point or integer-multiply non-floating-point instructions, for interrupt and operation during every other cycle. exception handling, and fo r various overhead func­ Stage 1 of the pipeline calculates operand expo­ tions. Al l functions are microcode-controlled, i.e., nent diffe rence, adds the fraction fields, performs driven by a with a 1,600-word con­ recoding of the mul.tiplier, ancl computes three trol store and a 20-word patch capability. Since the times the multiplicand. Stage 2 performs align­ control store does not limit the cycle time, we ment, fraction mu ltiplication, and zero and leading­ chose to implement a single microcode control one detection of the intermediate resu lts. Stage 3 scheme, rather than hard wire control for the simple performs normalization, fraction addition, and a instructions and provide microcode control fo r the rniniround operation for floating-point add, sub­ remaining instructions. tract, and multiply instructions. Stage 4 performs The E-box begins instruction execution based rounding, exception detection, and condition code on information taken from the instruction queue. evaluation.

References to specifier operands and results are Stage 3 performs a miniround operation on the made indirectly through pointers in the source and result calculated to that point to determine if a full­ destination queues. In this way, most E-box instruc­ round operation is required in Stage 4. To do this, a tion flows do not need to know whether operands round operation is perfo rmed on only the low­ or results are in register, memory, or instruction order three (for single-precision) or six (for double­ stream. precision) fraction bits of the result. If no -out To improve the performance of certain critical occurs fo r this operation, the remaining fraction instructions, the E-box contains special-purpose bits are not affected and the fu ll stage 4 round oper­ hardware. A mask processing unit fi nds the next ation is not req uired. If the fu ll round is not bit set in a mask register and is used in the follow­ required, stage 4 is dynamically bypassed, resulting ing instructions: FFC. FFS, CALLS, CALLG, RET. PUSHR, in an effective three-stage pipel in e.

1992 Digital Te chnical jo untal Vo l. 4 No. 3 Summer 15 NVAX- microprocessot· VAX Systems

TbeM-box packing logic placed at the input ofan eight-entry The M-box is responsible for address translation, quad word write queue. access checking, and access to the primary instruc­ The C-box can accept one instruction read request and one data read request from the M-box. tion ancl data cache (P·cache). The M-box accepts requests from multiple sources and processes Conflict logic in the write queue allows noncon­ these requests in an order that reflects both the pri­ flicting read requests to be processed before ority of the request and the need to maintain queued write requests are performed. Conflicts are instruction stream ordering of memory reads anc.l rt:solved by processing write queue entries until writes. Address translation and cache access are the conflicting write is completed . fu lly pipelinecl; the M-box can start a new request at The C-box supports fo ur B-cache sizes: 128KB. the beginning of every cycle. 256KB, 5l2Kfl, and 2 megabytes (MB). The system The M-box performs address translation and designer can independently select tag ancl clata RAJ\11 access checking by means of a 9o -entry, fu lly asso­ speeds to meet system requirements, regardless of ciative translation buffe r (TB) with parity protec­ the frequency at which the CPU is running. The tion. If a TB miss occurs, the M-box automatically B-cache block size is 32 bytes, and both tag and data invokes a hardware miss sequence that calculates RAMs are protected with error correction code the address of the page table entry (PTE) that maps (ECC) that corrects single-bit errors and detects the page, fe tches the PTE fr om memory, refills the both double-bit errors ancl fu ll 4-bit RA.VI fa ilures. TB, and restarts the reference. TB allocation is per­ The B-cache implements a directory-based broad­ fo rmed using a not-last-used scheme, which is cast coherence protocol in conjunction with a similar to a ro und-robin but guarantees that the memory directory containing one bit per 32-byte most recently referenced entry will not be over­ block. Each memory directory bit indicates if the written. The M-box reports access violations and associated block is valid in memory or has been page faults to the £-box, and £-box microcode pro­ written and exists in a cache. Unwritten blocks may cesses these misses with hardware support from exist in multiple caches in the system. Wr itten the M-box. blocks may exist in exactly one cache. The M-box also translates memory destination An attempt to "" rite to a block that is not both operand addresses provided by the !-box and saves valid and already written in the B-cache causes the the corresponding physical address in the physical C-box to request write permission from memory by address (PA) queue. When the E-box stores a result, means of a sp ecial NDAL bus read command. The the M-box matches the data with the next address memory controller will not respond to any NDAL in the PA queue and converts this data to a normal bus transactions to a block that is written in a write request. The PA queue is also used to check cache. Instead, it waits fo r the CPU, which contains fo r conflicts in read requests to a location in which an updated copy of the block, to write the block nothing has been written. back to memory and then completes the original The P-cache is an 8KB, two-way set-associative transaction. All CPUs in the system monitor the cache with 32-byte blocks and parity protection NDAL bus fo r read and write transactions and com­ on tags and data. The P-cache can be configured pare the address against their B-cache tags. If a to cache instructions, data, or both, and usually match is fo und, the cache block is either written bas the latter configuration. For compatibility with back to memory, invalidated, or both, depending the DECchip 21064 microprocessor, the NVAX+ on the transaction type and the state of the block in the cache. P-cache can also be configured into a direct­ mapped organization. The NDAL protocol fu lly supports multipro­ cessing implementations and uoes not require any special chip variants to construct a multiprocessor Tbe NVAX C-box system. The C-box invokes invalidate or write-back The NVAX C-box maintains the interface to the requests as required to keep the B-cache and NDAL external B-cache and to the bus. The C-box P-cache coherent with NDAL activity. receives read and write requests from the M-box and monitors the NDAL fo r activity that wo uld require an invalidate operation in either cache. Th e NVAX + C-box Consecutive writes to the same quadword (o4 The NVAX+ C-box provides the interface between bits) are merged into a single quadword datum by the internal fu nctional units and the EDAL pin bus

!6 l'n/. 4 No. j Summer 1')')2 Digital Te cbuical jo urnal Th e NVAX and NVAX + Higbper:formance VA X Microprocessors

implemented by the DECchip 21064 micropro­ write-invalidate protocol fo r data that was not cessor. This C-box interface includes the basic recently referenced. inte rface control for the external B-cache and fo r To accommodate the programmable nature of the memory and 1/0 system. The NVAX+ C-box both the system and cache clock frequencies, the receives read and write requests from the M-box. NVA.'<+C-box supports nine different combinations These requests are queued and arbitrated within of cache and system clock frequencies. This sup­ the C-box and resu lt in cache or system access port allows efficient use of the chip in a wide range across the EDAL. The NVA)\+ C-box also maintains of different performance class systems. cache coherency by sending invalidate requests to the M-box when requested by external logic. Pipeline Op eration The NVA.'C+ C-box implementation provides many The NVAX. and NVAX+ chips implement a macro­ of the same features and performance enhancements pipeline. Multiple VA .'\ macroinstructions are pro­ available in the NVA.'\ C-box. Included is support fo r cessed in parallel by relatively autonomous software-programmable B-cache speeds (one-hal f, fu nctional units with queued in terfaces at critical one-third, or one-fourth times the CPU frequenq) boundaries. Each fu nctional unit also has an inter­ and sizes (128KB to 8MB), write packing, write queu­ nal pipel ine (micropipeline) to allow a new oper­ ing, and read-write reordering. In addition, the ation to start at the begi nning of every cycle. The l\TVAX+ C-box supports the newer platforms and pipeline operation can be logically depicted, as increases the degree to which NVA.X+ is compatible shown in Figure 4. with the DECchip 21064 microprocessor. NVAX+ In pipeline segment SO, instruction stream data C-box fe atures include programmable system clock is read from the VIC. The next VAX instruction com­ speeds, I!O space-mapping, and a clirect-mapped ponent is parsed, and queue entries are made in option on the P-cache. segment Sl. For short literal and register specifiers, A major difference between the NVAX and NVAX+ no other processing is required. Requests fo r fm­ implementations is in the B- proto­ ther processing fo r all other specifiers are queued col. Rather than mandate a fixed B-cache coher­ to the CSU pipeline, which reads operand base ence protocol, the NVAX+ implementation al lows addresses in segment 52, calculates an effective systems to tailor the protocol to their particular address, and makes any required M-box request needs. 1\f\'A.'\+ cache coherency is implemented contained in segment 53. If an M-box request is jointly by off-chip system support logic and by the made, address translation and P-cacbe lookup CPU chip, with relevant information passed occur in segments 54 and 55. between the two over the EDAL bus. To al low dupli­ Instruction execu tion starts with an E-box cate cache tag stores (if they exist) to be properly control store lookup in segment 52, fo llowed by updated , the J\TVAX+ C-box provides information to a read of any required operands in off-chip logic, indicating when the internal caches segment 53, an ALU and/or shifter operation in seg­ are updated . External logic notifies the NVA.'C+ ment 54, and a potential result store or register file C-box when an internal cache entry needs to be write in segment 55. If an M-box request is re quired, invalidated because of external bus activity. e.g., fo r a memory store, the request is made in seg­ Existing systems configure the B-cache to imple­ ment 54; address translation or PA queue access ment a conditional write-update snoopy protocol occurs in segment 55; and a P-cache access occurs carried out using shared and written signals on the in segment 56. system bus. Writes to shared blocks are broadcast to Floating-point and integer-multiply instruction other caches fo r conditional update in those execution starts in the E-box, which transfers oper­ caches. A et>l.J that receives a write update checks ands to the F-box. The fo ur-stage F-box pipeline is the J\TVA.'C+P-cache to determine if the block is also skewed by half a cycle with respect to the E-box present in that cache. If the block is present, the pipeline, beginning halfway through segment 54. B-cache update is accepted and written into the The fourth segment of the F-box pipeline is con­ B-cache, and the P-cache is invalidated . If the data is ditionally bypassed if a fu ll-round operation is not present in the P-cache, the B-cache is invali­ not required. The result is transmitted back to the dated. This res ults in a write-update protocol fo r E-box, logically in segment 55 of the pipeline. data that was recently referenced by a CPU (and Pipeline bypasses exist for all important cases hence is va I id in the P-cache) and reduces to a in the !-box and E-box pipelines, so that there are

Digital Te ciJuicaljo urnal Vol. 4 No. 3 Summe·r 1')<)2 17 NVAX-microprocessor VAX Systems

so S1 S2 S3 S4 S5 S6 S7 S8 INSTRUCTION FETCH, DECODE, SPECIFIER EVAL, MEMORY REQUEST

INTEGER, LOGICAL INSTRUCTION EXECUTION

FLOATING-POINT BYPASS I INSTRUCTION EXECUTION ROUND 1

KEY

s SEGMENT ALU ARITHMETIC LOGIC UNIT SPECIFIER EVAL SPECIFIER EVALUATION cs CONTROL STORE INSTN INSTRUCTION EXPON DIFF EXPONENT DIFFERENCE OPERND OPERAND ALIGN, FRMULT ALIGNMENT AND FRACTION MULTIPLY MEMREQ MEMORY REQUEST NORM, FRADD NORMALIZATION AND FRACTION ADDITION TB TRANSLATION BUFFER BYPASS ROUND RESULT ROUNDING AND BYPASS

Figure 4 Pipeline Organiza tion

no sta lls for results feed ing dire ctly into subse­ Number of CP U Ch ips quent operands. The M-box processing of memory 6000 400 CPU The VAX Model core implementation refe rences initiated :�s a result of operand speci­ is a three-chip design: a processor chip, with a fier processing by the !-box is usually overlapped small on-chip primary cache; a floating-point ch ip; with the execution of the previous instruction in and a secondary cache control ler. with internal the E-box, with few or no stal ls occurring on cache tags. The initial attempt at N"V�"X CPll defini­ P-cache hit. tion was a two-chip design. One ch ip contained

the 1-box (with a 4KBVlC). tbe E-hox. the F-box, and Design Evolution and Tra de-offs the M-box (with a I6KB, direct-mapped P-cache) The second chip held the C-box and the B-cache tag The and NVAX + chips are the latest in a line of NV�"X array. The project design goals, especia!Jy time-to­ GviOS VAX microprocessors designed by Digital"s market, Jed to a single-chip solution. rather than engineers and represent a continuing evolution of a two-chip design. architectural concepts from one implementation To condense the design from two chips to one, to the next. The preceding chip design was the we halved the sizes of the VlCand the P-cache and CPU for the 6000 Model 400 system.11 To meet VA}( moved the B-cacl1e tags to external static RAMs, the time-to-market and performance goals, we leaving the B-cache controller on-chip. Later, we had to modify the NVAX/NVAX+ design throughout were able to reduce the penalty of halving the size the project. of the P-cache by making it two-way set associative One of the early vehicles for making design rather than direct mapped. With these changes, the tracle-offs was the NVA.X performance model, perfo rmance model showed performance loss of which predicts CPU and system performance a less than 1.4 percent across all the traces, relative and aids in quantifying the performance impact the two-chip design, with a worst-case penalty of va rious design options. The performance to of 3.9 percent. mode.l is a detailed , trace-d riven model which can There are strong advantages to the single-chip be easily configured by changing any of a variety solution. of input parameters. The model stimuli used were 15 generic timesharing and 22 bench­ • Designing a single chip takes less time. mark instruction trace files that were captured • This design requires the production ami main­ VAX by running actual programs on existing tenance of only one design and one systems. mask set. The fo llowing sections describe the evo lution Latency to the B-cache is shorter. of the chip design, including the number of chips, • the pipel ining technique used, and various cache • An off-chip tag store provides more flexibility in issues. B-cache configurations.

18 Vol. 4 No. 3 Summer 1992 Digital Te ciJuica/jo urnal The NVAXand NVAX + High-performance VAX Microprocessors

Macropipelining unlike the original F-chip implementation, the Run-time performance is the product of the cycle NVAX/NVAX+ control of the F-box allows a fully time, the average time to execute an instruction pipelined operation, which significantly improves ( [CPI]) and the number of floating-point performance over the F-chip design. instructions executed. CMOS process improve­ Although a totally new design would have had ments made it possible to decrease the NVAX/ shorter floating-point latencies, the combination of NVAX+ cycle time with respect to the previous gen­ a fu lly pipelined operation and a final-stage bypass eration of VAX microprocessors, thus improving the allowed us to achieve our performance goal, while first factor in run-time performance. meeting our time-to-market goal. The VJL'( 6000 Model 400 CPU design uses tra­ ditional microinstruction pipelining, i.e., micro­ Cache Coherence pipelining, to achieve some amount of overlap and Performance studies with the previous generation to decrease the CPl. However, using micropipe­ of VAX microprocessors clearly indicate that system lin ing techniques would not reduce the NVJL'(/ bus write bandwidth .limits performance unless an NVJL'\ + CPI to the level required to meet the perfor­ external write -back cache is implemented. In addi­ mance goals of the NVJL'(/NVJL'\+ projects. We tion, the VAX architectme required that we imple­ achieved this reduction by using ruse design and ment the cache coherence protocol in hardware. implementation techniques referred to as macro­ The NV!L'\ implementation uses a directory-based pipelining. In a macropipelined architecture, the coherence protocol fo r compatibility with existing 1-box acts much like a load/store engine, dynam­ and planned target system platforms. The NDAL bus ically prefetching operands prior to instruction supports multiple outstanding read and write execution. Using the macropipeline technique in requests, which allows the microprocessor to uti­ the NVAX and NV�L'\+ crus makes it possible to lize the capability of the system bus to process retire one basic complex instruction set computer these operations in a pipelinecl fashion. We investi­ (CISC) macroinstruction per cycle, as in a simple gated the possibility of implementing both direc­ RISC design. Al though macropipelining introduced tory-based ancl snoopy coherence protocols, but considerable complexity into the NVAX/NVAX+ time-to -market considerations and the opportunity design, this complexity resulted in a significant to optimize the design for performance in existing performance improvement over a traditional system platforms outweighed the desirability of micropipelinecl design. supporting snoopy protocols. For the 1\TVAX+ implementation, the coherence Number of Sp ecifiers per Cy cle policy is determined by hardware external to the The NVAX/NVAX+ 1-box can parse at most one NVAX+ chip, in the given system. The 1\TVA X+ cache opcocle and one VA X specifier per cycle. The 1-box and system interface allows the system environ­ design initiaHy considered was capable of parsing ment to implement a variety of coherence pro­ 21064 two specifiers per cycle. Al though this parsing wcols Compatibility with the DECchip scheme represented significant complexity and cir­ interface definition required limiting 1\T V!L'\+ to one cuit risk, intuitively, it seemed important to quickly outstanding external cache miss. However, this retire specifiers in the 1-box in order to keep the limitation is more than offset by the significantly macropipeline full. However, the performance better main memory access times achieved in target model predicted a maximum performance improve­ systems. ment of less than two percent on our traces, and we One significant advantage of the l\TVAX+ scheme decided to li mit complexity and schedule risk by is that most policies associated with the external parsing only one specifier per cycle. cache are determined by hardware outside the NVAX+ chip (such as the coherence policy), allow­ ing the chip to be used in a wide variety of systems. F- box Design Implementing the DECchip 21064 interface on The 1\TVAX F-box design is highly leveraged from the NV!L'\+ greatly reduces the hardware engineering VAX 6000 Model 400 F-chip design. Rather than investment required to deliver a VAX CPU and an start from scratch, we integrated the existing Alpha AXP CPU in the same system environment. design onto the 1\TVAX and NVAX+ CPU chips and For both the NVJL'( and the l\T VAX+ chips, cache added a final -stage bypass mechanism. In addition, coherence is maintained for the P-cache by keeping

Digital Te e/m ica/ jo urnal \I(J/.4 No . 3 Summer 1992 19 NVA.X-rnicroprocessorVA.'X Systems

it a subset of the external cache. External ly orig­ fi lls can resu lt in a higher P-cache uata stream miss inated invalidate requests are fo rwarded to the rate, because they replace data that is likely to be P- cache only when the block is in the exte rnal re ferencedag ain. We used the performance model cache. This minimizes the number of P- cache cycles with available traces to determine that looking up spent processing invalidate requests. The two-way VIC misses in the P-cache generally resulted set-associative P- cache might have been sl ightly in higher performance. In specific applications, more effective if it were not a subset of the larger higher performance can be achieved by not looking ([irect-mapped external cache. However, this effect up instruction references in the P-cache. As a is fa r less significant than the effect of expending a re sult, we implemented P-cache configuration bits P-cache cycle for every external inval idate event. that allmv system designers to implement either

Virtual caches almost alw::tys have lower latency scheme. By default, NVA.X and NVAX+ systems are than physical caches and usually do not require a configured w enable both instruction and data VAX dedicated translation buffe r. The architecture caching in the !'-cache, bur this may be changed by supports the use of a VIC by allowing the cache to the console software in certain systems to support be incoherent with respect to the data stream, i.e.. prepackaged application systems. not updated with recent writes by the CPU contain­ VIC, CPU. ing the or by any other However, some EYternal Ca che Size mechanism must be defined to make the VIC coher­ Both NVAX and NV,\X+ support multiple external ent with the data stre:ml. fn the VAX architecture. cache sizes to allow sYstem designers full flexihi lit)' the execution of the return from interrupt or VAX in selecting external cache configurations. With exception (l�EI) instruction performs this fu nction. existing static RA.\J technology, smaller external We chose to perform a complete flush of the VIC cache configurations are usually faster than larger as pa rt of the execution of every REI instruction. configurations. Performance modeling ind icated Because an REI always fo llows a process context that many applications, especially some popular sw itch, a flush during an HE! removes the process­ benchmarks, fit entirely in a cache whose size is specific virtual addresses of the previous process and ')12Kil or less, resulting in slightly better perfor­ prevents conflict with (potentially identical) virtual mance. However, many common applications addresses fi >r the new process. We could have also utilize more memory than will fit in such caches VI<: chosen to keep the coherent with the datJ stream and benefit more from an external cache whose and implement per-process qualifiers that would size is I Mil to 4:VIB, ncn with the additional latency have ma(le per-process virtual addresses unique. involved. As a resu lt. our system designs usc larger However, coherence wou ld have required both an hut slightly sl.ower external cache configurations. VIC, inval idate address and :1 control path to the and some ti >rm of backmap to resolve virtual address Block Size aliases. Per-process qualifiers wou ld have required During the analysis of the previous generation a VAX architecture change and significant operating or V, \X microproceo.sors in existing systems, we s\·stem software ch;111ges. To reduce project risk, observed that the 1()-hyte block size was too small we chose to flush the VIC on every REI instruction. to achieve optimal performance on many applica­ tions. As a resu lt, we chose a 32-byte block si1.e fo r Ca che Hierarchy the NVAX and NVAX+ internal caches. This size provides a good balance between fill si;.e and the The NVAX and NVAX+ chips have three levels of : the VIC, the P-cache, and the number of cycles required to do the fill, given R-caclle. The VIC and P-cache are fu lly pipelined H-hyte fill data p;1tl 1:'. an(l have minimum latency, which allows instruc­ For compatibility with installed systems, the size tions to be fe tched and processed in parallel at very of the NVAX external cache block and the cache high rates. fi ll size is :)2 bytes. On i'\TYt'-''\+, the external cache VIC The default P- cache configuration causes block size may be larger and is 64 bytes in the VAX misses to be looked up in the P-cache. Th is lookup 7000 Model 600 and VAX 10000 Model 600 systems. process is advantageous since the VIC typically Because both s\·stems implement low- latency experiences a smaller miss penalt\' because latency menwn· ami high-bandwidth buses, the increase in fo r P-cache hits is roughly one-third that fo r exter­ external cache block size results in better nal cache hits. The disadvantage is that instruction perfo rmance.

Digital Te ciJuical jo urnal 20 liJI. .j .\o ..> Sunliii<'J' I')'Jl TheNV AX wzd NVAX+ H(!!,h-jJeJfonnance VAX .i\llicrojJrocessors

Sp ecial Features the failure. We refined the patch to place additional The NVAX/NVAX+ design includes several features checking into various instructions in the sequence. that supplement core chip fu nctions hy provid ing This refi nement allowed us to isolate the exact acldecl value in areas of debug, system maintenance, instruction th:n was causing the transfer to PC 0. and systems analysis. Among the features are the With this information, we were then able to repro­ duce the problem in simulation and correct the sec­ patchable control store (res) and the perfo rmance monitoring hardware. ond-pass design. Without this diagnostic capability, we probably would have needed weeks or months of additional debug time to isolate the fa ilure. Pa tcbable Control Store In addition to using the powerful diagnostic The base machine microcode is stored in a ROM capability of the PCS, we used patches to correct or control store in the E-box. The 1,600-microword work around the few fu nctional bugs that remained capacity of the E-box controls macroinstruction in the first-pass design. For example, a microcode PCS execution and exception hand ling. The con­ patch was used to correct a condition code prob­ sists of entries that can be configured to replace 20 lem caused by a microcode bug during the execu­ or supplement the microcode residing in R0;\'1 . Each tion of an integer-multiply instruction. Because the res entry contains a content-adtlressable memory E-box is central to the execution of all instructions, (CAM)/RAM pair that stores the patch microworcl we were also able to use patches to correct hard­ aclc.lress and patch microword, respectively. The ware problems in other boxes. In one instance, a ROM control store and the l'CS are accessed in paral­ patch was used to inject a synchronization primi­ lel. Typically, words are fe tched from the ROM tive into the M-box in order to correct an M-hox control store, but if a microword address matches design error. As a result of the simp! icity and ele­ the C\,VI in one of the PCS entries, then the res RA:VI gance of this solution, the final second-pass cor­ fo r that entry suppl ies the microword, and the llOM rection was to move the patch into microcode ROM , output is disabled. rJther than modify the M-box hardware design. Privileged software controls the loading of the res by means of interna l processor registers. In sys­ tem operation, a patch file is normally loaded into Performance Ll1onitoring Environment the res early in the boot procedure, so that any As computer designs increase in complexity, their minimal system capable of starting system boot can dynamic behavior becomes less intuitive . Com­ install patches to the base microcode. This feature puter designers re ly more ami more on empirical presents a way to modify the base NYAX/NVAX+ performance data to aid in the analysis of system chip through software; the majority of engineering behavior and to provide a basis fo r making hard­ change orders (EC:Os) can be accomplished by ware and software design decisions. In addition, releasing new patch files, thus alleviating the need multiple levels of logic integration on VLSI chips to change the hard>vare design and retool fo r the restrict the collection of this performance data, very large- scale integration (VLSI) fa brication. because many of the interesting events are no longer We booted the VMS operating system within I(> visible to external instru mentation. The NVAX/ days of receiving first-pass wafers from fabrication, NVA.,'C+ ch ip design includes hardware a tribute to a very thorough design verification. and counters that can be configured to count any of However, the continuing rigorous testing on proto­ a set of predetermined, internal state changes. type systems revealed several problems with the Two 64-bit performance counters are main­ base microcode and hardware. The PC:Smech: tnism tained in memory fo r each Cl'l i in an NVAX/NVAX+ helped to identify, isolate, and work around many system. The lower 16 bits of each counter are imple­ of the problems during system debug and thus mented in hardware in the Cl'l l, and at specified allowed extensive system testing to continue on points, the quadword counters in memory arc first-pass chips. updated with the contents of the hardware coun­ Fo r example, we used a sequence of PC:S patches ters. Privi leged software can be used to configure during system debug to isolate an obscure fa ilure the hardware counters to count any one of a basic whose symptom was a transfer to virtual address set of internaleve nts, such as cache access ami hit, 0. By patching the main microcode exception hand­ TB access and hit, cycle and instruction retire, and ling routine to check fo r this event, we identified cycle and stall. When the 16-bir counters reach a the instruction stream sequence that was causing half-full state, the performance monitor requests

Digital Te ciJnicaljo urnal Vo l. ·i Xo ..i Sllllllllt!r I'J'Jl 21 NVA."X-m.icroprocessorVAX Systems

an interrupt. The interrupt is serviced in a normal Results and Conclusions way, i.e., between instructions (or in the middle of With a fo cus on time-to-market, we shortened the interruptible instructions) and at an architecturally originally projected NVA.-\ design schedule, from specified interrupt priority level. Unlike other inter­ the start of implementation to the completion of rupts, the performance monitor logic interrupt is the chip design, by 27 percent. We booted the oper­ serviced entirely in microcode and then dismissed; ating system just 16 days after prototype wafers no software interrupt hand ler is required. became available. The use of the PCS allowed us to The microcode component updates the counters quickly debug and work around the few functional in memory when it services the perfo rmance moni­ bugs that remained in the first-pass design. Because tor interrupt. During a counter update, the micro­ of the quality achieved in first-pass chips, we were code temporarily disables the counters, reads and able to shorten the schedule from chip design clears the hardware counters, updates the counters completion to system product delivery. As a result, in memory, enables the counters, and resumes systems were clel ivered to customers fo ur months instruction execution. The base address of the earlier than the originally projected elate. counters in memory is taken from a system vector At the same time, we were able to dramatically table and offset by the specific C!'ll number, creat­ improve CPU performance relative to previous VAX ing a data structure in memory that contains microprocessors by implementing a macro­ a pa ir of64-bit counters fo r each CPU. pipelined design, in which multiple autonomous Combining the use of hardware, software, and fu nctional units cooperate to execute VA X instruc­ the I'CS created a versatile performance monitoring tions. Our internal goal was performance in excess environment-one that goes beyond the scope of of 25 times the performance of the VA.X -11/780 sys­ the basic hardware capabilities. In this environ­ tem. We significantly exceeded this goal as demon­ ment, we can correlate the counts with higher-level strated by the fo llowing Standard Performance system events and change the representation of the Evaluation Cooperative (SPEC:) Release 1.2 perfor­ collected data. For example, microcode can enable mance r:1tings: 15 the counters every time a process context is loaded SPECmark 40.'5 anti disable the counters when a process context is saved. This feature allows us to set up workloads SPECfp 48.8 and gather dynamic statistics on a per-process SPEC int :)0.4 basis. We can also use l'CS patches to modify the memory counter address in order to provide an These ratings were measured on a VAX 6000 additional offset basecl on one of the five VA X pro­ Model 600 system at the initial announcement and cessor operating modes: interrupt. kernel, execu­ are two to three times higher than those fo r the pre­ tive, supervisor, or user. This technique provides vious VAX microprocessor running in the same sys­ a new performance counter data structure that col­ tem. Software and system tuning has subsequently lects statistics on a per-mocle,per -process, per-CPU improved the initial numbers on al l systems. basis. Also, microcode patches can be usecl to add The NVAX/NVAX+ clesign provides an upgrade context checks that filter and count various events. path and system investment protection to cus­ For example, we can patch the VAX context tomers with installed VA X systems, as well as a instruction to count context or patch the migration path from an NVAX+ microprocessor to a interlocked instructions to count the number and DECchip 21064 microprocessor in the new Alpha types of accesses to multi processor synch roniza­ AXP systems. tion locks. The performance monitoring environment is a Acknowledgments powerful tool that we have used to collect the data We wou ld like to acknowledge the contributions of required to analyze hardware and software behav­ the fo llowing people: ior ami interactions, ancl to develop an understand­ W Anderson, E. Arnold. D. Asher, H. Badeau, ing of system performance. We have applied this I. Bahar, M. Benoit, B. Benschneider, D. Bhavsar, knowledge to tune the performance of operating M. Hlaskovich, P Boucher, W BowhiiL L. Briggs, systems and application software. and continue to S. Butler, It Calcagni, S. Carroll, M. Case, R. Cas­ apply the knowledge to improve the design and per­ telino, A. Cave, S. Chopra. M. Coiley, E. Cooper, fo rmance of future hardware and software. R. Cvijetic. R. Davies, M. Delaney, D. Deverell,

22 l'IJ/ . .j No. 3 Stll!l!lter /'J')l Digital Te chnical jo urnal The NVAX andNVAX + High-performance VAX Microprocessors

A. DiPace, C. Dobriansky, D. Donchin, R. Dutra, 9. L. Chisvin, G. Bouchard. and T. Wenners, "The D. DuVa rney, ). ). El lis, ). P. Ellis, H. Fair, B. Feaster, VAX 6000 Model 600 Processor," Digital T. Fischer, T. Fox, G. Franceschi, N. Geagan, S. Goel, Te chnical jo urnal, vol. 4, no. 3 (Summer M. Gowan, ). Grodstein, P. Gronowski, W Gruncl­ 1992, this issue): 47-59. mann, H. Harkness, W Herrick, R. Hicks, C. Holub, 10. A. Agarwal et al., "An Evaluation of Directory ). Huber, A. Jain, K. Jennison, ). Jensen, M. Kantro­ Schemes fo r Cache Coherence," Proceedings witz, R. Khanna, R. Kiser, D. Koslow, D. Kravitz, of the 15th Annual International Sy mpo­ K. Ladet, S. Levitin, ). Licklider, ). Lundberg, sium on (May 1988): R. Marcello, S. Martin, T. McDermott, K. McFadden, 280-289. B. McGee.). Meyer, M. Minardi, D. Miner, E. Nangia, L. Noack, T. O'Brien, ). Pan, H. Partovi, N. Patwa, 11. P. Stenstrom, "A Survey of Cache Coherence V Peng, N. Phillips. R. Preston, R. Razdan, J. St. Schemes fo r Multiprocessors," Computer Laurent, S. Samudrala, B. Shah, S. Sherman, W Sher­ (June 1990): 12-24. wood, J. Siegel, K. Siegel, C. Somanathan, C. Stolicny, 12. H. Durclan et al., "An Overview of the VA X P. Stropparo, R. Supnik, M. Ta reila, S. Thierauf, N. 6000 Model 400 Chip Set," Digital Te chnical Wa de, S. Wa tkins, W Wheeler, G. Wolrich, and Y. Ye n. jo urnal, vol. 2, no. 2 (Spring 1990): 36-51.

References 13. VA X 6000 Datacenter Sy stems Perfonnance Report (Maynard, 1'vl A: Digital Equipment l. R. Brunner, ed., l'l.JX Architecture Reference Corporation, 1991 ). ;V/anual, Second Edition, (Bedford, MA: Digi­ tal Press, 1991).

2. D. Dobberpuhl et al., "A 200MHz 64b Dual-Issue CiVIOS Microprocessor," IEEE Inter­ national Solid-State Circuits Conference, Digest of Te chnical Pap ers, vol. 35 (1992): 106-107

3. R. Sites, ed., Alpha Architecture Reference Manual, (Burlington, MA: Digital Press, 1992).

4. D. Fite, Jr. et al., "Design Strategy for the VAX 9000 System," Digital Te chnical jo urnal, voL 2, no. 4 (Fall 1990): 13-24.

5. W Anderson, "Logical Ve rification of the NVA-,'( CPU Chip Design," Digital Te chnical jo urnal, vo l. 4, no. 3 (Summer 1992, this issue): 38-46.

6. M. Cal lander et al., "The VAXstation 4000 Model 90,'' Digital Te chnical jo urnal, vol. 4, no. 3 (Summer 1992, this issue): 82- 91 .

7. J Crowell and D. Maruska, "The Design of the VA X 4000 Model 100 anclMicr o VAX 3100 Model 90 Desktop Systems," Digital Te chnical jo urnal, vol. 4, no. 3 (Summer 1992, this issue): 73-81 .

8. J. Crowe] I et al., "Design of the VA X 4000 Model 400, 500, and 600 Systems;· Digital Te chnical jo urnal, vol. 4, no. 3 (Summer 1992, this issue): 60-72.

Digita/ Jechuicul jou rnal Vo l. 4 No. .l Sunw1er 1')92 23 Dale R. Donchin

Ti mothy C. Fischer Th omas E Fox Vi ctor Peng

Ro naldP. Preston Wi lliam R Wheeler

TheNVAX CPU Chip: Design Ch allenges, Methods, and CAD To ols

The NV&\' CPU clnp is a 1.3 million transistot; l'l!X micmprocessor designed in Digital's 0. 75 -micrometer CMOS- 4 tecbno/ogy It bas a ty pical CJ'c/e time of

12 ns under worst-case operating conditions. Tb e goa/ of tbe cbip design team was to design a bigh-jJeJformance, robust, and reliable cl.ujJ, U'itbin the con­ straints of a short schedule. Design strategies were developed to achiel'e this goal, including organization of a chip design flow and new implementation and veri­

fi cation methods. Ne u' proprietary CA D tools also played important roles in the chip development.

NVAX CPU The chip is a 1.3 million , VA)( The T-box fetches, parses. and decodes instruc­ microprocessor designed in Digital's 0.7';-micro­ tions, and pre(iicts conJitional branches. The E-box meter fo urth-generation complementary metal­ runs under microprog.r:1mme<.l control and executes oxide semiconductor (CMOS-4) technology. The all instructions. except floating-point and long­ implementation of the 1\rvAX CPU chip posed many word integer multiply instructions, which are exe­ design and complexity management challenges. cuted by the F-box. The M-box processes memory The combination of the chip performance goal, the references, including virtual-to-physical address high level of integration, and tbe small fe ature sizes translation. The C-box controls the offchip backup of the C"'lOS-4 technology increased the severity of cache (the second-level cache fo r clara and third­ on-chip electrical issues and the c.l ifficulty of verify­ level cache fo r instructions) and contains the bus ing the physical design. These challenges were interface unit. The chip also has a direct-mapped intensified by a short design schedule. This paper 2 kilobyte (KB) virtual instruction cache (VIC), a describes some of the design strategies, methods. two-way, set-associative 8KB <.lata and instruction techniques, and proprietary computer-aided design primary cache (P-cache), a 12KB control store read­ (CAD) tools used by the chip design team, which only memory (ROM), a 96 -entry, fu lly associative were instrumental in meeting these challenges. translation buffer, ami a ';12-bit by4-bit conditional branch history random-access memory (RA:Vl) for Chip Overview branch prediction. The chip photomicrograph, In order to appreciate the design chal lenges that with box and large array boundaries outlined, is were faced, it is necessary to understand the size shown in Figure 1. and complexity of the design. The NVAX CPU chip The NYA,"\ chip's layout database is composed of has a macropipelined microarchitecture and imple­ over 4,200 unique custom cel ls, and has a total tran­ ments the VAX instruction set. 1 Because it is a com­ sistor count of 1.:)milli on. It was the first product plex instruction set computer (CJSC) architecture, chip to be implemented in Digital's 0.75-microme­ the VA X architecture implements \'arable length ter, three metal layer, }:)-volt (V) CMOS-4 technol­ instructions with complex addressing modes, and ogy.1 The chip's typical cycle time under worst-case precise traps and exceptions. The chip is composed operating conditions is 12 nanoseconds (ns) or 83. :) of five suhchips, or fu nctional boxes, called the mega hertz (.VIHz), and it dissipates 14 watts (W). A !-box. E-box. F-box, M-box, and C-box. summary of the chip fe atures is given in Ta ble 1.

24 Vol. 4 Nn. 3 St llll/1/er I '.II).! D(r!Jtal Te cbuical jo urunl T/.Je NVAX CPU Chip: Design Challenges, Methods, and CA D To ols

Figure I !WAX Cbip u•itb Boxes and l.mp,e Arrays Outlined

Design Goalsand Constraints previous designs. Due to compet1t1ve marketing Our design goal was to develop a high-performance pressures, this schedule was substantial ly reduced, chip that is electrically robust and rei iable. We had making it significantly more aggressive than fo r pre­ to accomplish this within the Cv!OS-4 proces� vious designs. Our cycle time was constrained to constraints and the design time allorted by the a maximum of 14 ns under worst-case conditions. development schedule. Our initial implementation Electrical rel iability had to be guaranteed fo r cycle schedule was based on scheduling metrics from times down to 10 ns under worst-case conditions.

Digital Te chnical journal .) SIIIJIIIH'I" H11. ·i ,\·o . /'}').! 2'5 NVAX-microprocessor VAX systems

Ta ble Summary Chip Features 1 of In addition to minimizing risks to the schedule, we had to solve several global design and veri­ Transistors 1.3 million fication problems to achieve the cycle time of 14 Die size 16.2 mm by 14.6 mm ns. The design team assumed a 10-ns cycle time Cycle time 12 ns (typical) when it addressed problems that are exacerbated Signal pads 261 by a fa ster cycle time, such as interconnect reliabil­ Supply pads 121 ity and signal integrity. Some of the key concerns Package 339 pin THPGA* were on-chip power, ground, and low skew clock Power dissipation 14 W (average) distribution, and the routing of long signal inter­ at 12 ns connects. Ve ri fication of the custom layout was another chal lenge, particularly given the schedule Note: 'Through-hole pin grid array constraints. The use of CAD tools was a significant benefit in development of the chip. These issues are addressed in more detail in the re maining sections 13aseu on Uvi OS-4 process limits, the chip die size of this paper. was constrained by a maximum diagonal length of millimeters (mm). 21.8 Design Flaw and Management The trade-ot"fs between design time and chip cha­ racteristics, such as performance, area , and htnction, The chip design team was organized into five semi­ were the dominant issues throughout the project. autonomous groups, each of which fo cused on the design of a fu nctional unit (C-box, E·box, F-box, !-box, M-box). To ensure design compatibility and i Design Strategy and Ch allenges consistency across the ch p . each team adhered to To achieve the goal of a 14 -ns cycle time, we the same design guidelines and methods. For exam­ designed custom circuitry and layouts, including ple, box- level interfaces were rigorously defined, a (RTL) dynamic, asynchronous, and diffe rential logic. To consistent register transfer level modeling deal with the size and complexity of the chip, style was used, and circuit noise margins, transistor together with the schedule constraint, our strategy sizing, and other circuit and layout guidelines were called fo r a large design team. Complex, custom observed. "fhe design team fo llo,.ve(l the top-down, hierarchical design flow depicted in Figme but very large- scale integration (Vl.SJ) chip designs 2. there was much overJap between the activities. inherently have high levels of design schedule risk. Complexity was managed by thoroughly reviewing To reduce our exposure to schedule slips, we made and testing the design at each level of abstraction several high-level design decisions early in the proj­ (microarchitecture performance model, RTL model, ect. Wherever possible, we avoided using circuit logic. circuit, and layout), ami by using CAD tools structures that were time-consuming to analyze . to verify that all the design representations were We defined and fo llowed detailed design guidelines consistent across the levels of abstraction. When to ensure design co nsistency. robustness, and re­ making design decisions, we considered the im­ liability We used a top-down design flow and plications across the levels of abstraction. For made extensive use of proprietary CAD tools that example, when we made microarchirectural trade­ were expressly developed tin high-performance offs, we considered the implications for power custom VLSI design. Lastly, we minimized chip area consumption, logic complexity, circuit speed and bY handcrafting layout in sections of the chip cycle time, layout, die size ami . of course, scheduk. where the leverage on reducing overa ll die size was sign ificanr: or when critical path node had to be Performance Model and minimized to satisfy the path timing constraints. The floor plan was accurately monitored Microarchitecture Sp ecification tl1roughout the project. This was essential because The NVA X performance model is a software pro­ initial die estimates indicated that the chip size was gram that models the microarchitectu re of the CMOS-4 close to the maximum size that the tech­ 1\'VA.,'\ chip. The architecture team used the model nology cou ld support. These strategic decisions to study the effe ct of various microarchitectural reduced design time and al lowed us to fo cus on factors on overall CPU performance and to define achieving a fast CPU cycle time without compromis­ the chip's microarchitecture. The performance ing the quality of the design. model was updated as the microarchitecture

4 . Dip,ilal Te chnical jou rnal 26 Vol. .Vf! .> Summer 19').! Th e NVAX CPU Chip: Design Owllenges, Methods, and CA D To ols

detailed box-level floor plans. All floor plans were PERFORMANCE MODEL AND MICROARCHITECTURAL entered directly into the layout database and main­ SPECIFICATION tained throughout the project. Tracking the floor plan at several levels eased the difficulty of integrat­ � ing the box layouts to fo rm the fu l 1-chip composite late in the project. More details on floor plans and RTL MODEL AND CHIP FLOOR PLAN the use of CAD tools are described in the section Floor Plan Techniques. � RTL Ve njication and Schenzatic Design RTL MODEL VERIFICATION AND SCHEMATIC DESIGN We verified the RTL model by running a com­

bination of pseudorandom tests, standard VAX architecture tests, and handwritten implemen­ tation-specific tests. In order to identify flaws SCHEMATIC LOG� IC AND CIRCUIT VERIFICATION AND before time-consuming schematic and layout PHYSICAL LAYOUT DESIGN changes were implemented, we ran regression tests on the model whenever we made changes to the

� design. 'l() track design changes and issues, design­ FULL-CHIP LAYOUT ers posted a description of the changes and issues VERIFICATION along with the ramifications fo r other parts of the design in an electronic bu l letin board . Each new

entry to the bul letin board was mailed electroni­ cally to every member of the team. This tracking Figure 2 NVAXDesign Fl ow procedure helped reduce design iterations caused evolved so that the team could assess the impact of by stale int<>nnation. design changes on performance. We used the RTL model as a specification fo r logic/ The chip microarchitects wrote a textual speci­ circuit design. To synthesize logic directly from the fication of the chip that documented its fu nction RTL model fo r circuits with less critical area and ami microarchitecture. As the details of the design speed constraints, we used a CAD tool, OCCAM. were resolved, the specification was updated and Because these constraints were stringent for a large portion of the chip, engi neers designed most of the expanded to reflect the actual implementation. The fu nctional design verification team used the speci­ circuits. We developed a library of common circuit and layout structures reduce the design and veri­ fication to develop implementation-specific tests. to fication effort. We defined and fo llowed detailed RTL Jl!!odel and Floor Plans design guidelines to ensure design consistency. The design team developed a detailed RTL model More information on the types of circuits used on NVAX 100 MHZ of the chip once the chip specification was the chip is being published in "A VA)( CMOS completed . 'fhis model was written in Digital's Macro pipel ined Microprocessor."'

proprietary 1ms hardware description language. As the BDS code was being written, many logic/circuit Schen1atic Ve rification and Layo ut Design feasibility studies were spawnecl. The model was We held design reviews and used CAD tools to

used to verify that the proposed microarchitecture check schematics for unintended deviations from VA X executed code correctly. It also served to guide the design guidelines. We performed extensive logic implementation and was used by system logic simulation on the schematics. We used SPICE design teams to develop modules based on the fo r accurate critical path timing analyses, and a NVA X microprocessor. 1· ' The lrn. model was up­ staric circuit timing analyzer, NTV, to detect uniden­ dated as the project progressed. tifiecl slow paths c' (NVAX timing verification is The chip floor plan was devised early in the described in a later section.) Figure 3 is a histogram

design to estimate and track the die size and to of slow paths as a function of cycle time that was provide area-impact data fo r microarchitectural generated by NTV about two months before we trade-offs. Once the chip-level floor plan was taped out the first pass of the chip. Because NTV stable, the box design teams developed more does not predict circuit delays as accurately as

Digital Te ciJuical journal I H. 1 ,\'u ..> S!!/11111!!1' I'J'Jl 27 NVAX-microprocessorVAX systems

100 layout design was completed , the fi nal stages of lay­ (f) 90 our verification ensured that the assembled chip I f- <( 80 byout met re liability and integrity guidel i nes fo r Q_ ,----- s global nodes such as power, clocks, and signals. 0 70 _j Most of rile layou t checks were pe rfo rme by CAD (f) d 60 lL tools that used the CMOS-4 layou t design ru les and 0 a: 50 NVAX- specific electrica l ru les. More details on ay w l ­ m 40 out verification are given in the section Layout 2 ::::> z 30 Ve rification To ols.

,----- 20 ,----- Floor Plan Te chniques

10 With 1.) million transistors. nearly half a million oL-������_J�L_� _l��L_� signal nodes, and over lo mi II ion polygons on the 0.90 095 1.00 1.05 1.10 115 1.20 1.25 1.30 i'rVAXdi e. precise monitoring of the floor plan was CYCLE TIME (RELATIVE TO DESIGN TARGET) cririca I. From the earl iesr area esti mares, i r was Note: Dala is normali zed based on 14-ns design target. clear that the chip size was close the maximum The cycle time of 1 0 equals 14 ns. to thar the C\-IC)S-4 technology could support. We had to ensure that rhe die did not exceed the technol­ Ng ure 3 Ea r�)! Fu/1-cbifJ NTV Histog ra m ogy limit.

SPICE. all CJ Uestionable paths f lagged lw NTV were Area Estimation and Pla cement simulatnl usi ng SPICE. Those that were fou nd to he Preliminary estimates of the die area were made by slow were re designed to meet tile target cycle time. extrapolations from previous Cl'l i designs.�" Better Logic and circuit changes resu lting from these area estimates were developed fo r regu l ar struc­ analysl's and the impact of these changes on ot her tures, such as the cache arrays and data paths, by design representations and ve rification tests were creating test layout structures. More acc urate floor tracked on the electronic bul letin hoard. Since many plans of t he control sections of the chip were devel­ people were wo r!.;ingon t e design simultaneously, il oped after the RTL model was written. In most detai led t rac!.;ing of changes and open issues proved cases, the partitioning of the Kif. model was a close crucial to meeting our schedule. A single change approximation to the desired schematic and layou t often required modification a nd reverification of partitioning. To estimate the area of ra ndom con­ the lUI. model, schematics, layo ut, verification tests, trol logic, we used the OCCAM tool, and in some cases the chip textu;tl specification. and in many ctses Digital's proprietary layout syn­ Lmnlt design proceeded on a subbox, or section, thesis tool, CLEO. basis. t\·p ictl section consistl'd of approximatelY A As seen in the ch ip photom icrograph in Figure l, ten related schematics. Each section was checked the clock ge nerator and drivers (CLKGEN) were dis­ fo r conncctivin· ami correctness after irs layou t was triburnl in a narrow north-south channel near the completed. Sections within a box were then inte­ center of the chip. That location was chosen to min­ grated and verified (boxes contained from '5 ro l'i imize clock skew. The 1-box , E-box, M-box, and sectiuns) before the complete chip composite was C-box suhchips are located on the right side of the assembled. chip. Their relative positions were chosen to Schematic \ erification and layout design were accom modate the flow of rhe pipelined data in the performl'CI concurrcnrly during much of the proj­ main clara path. which runs north-south just to the ect. Although his overlap l d design ch nges t e to a right of CLKC ;EN The VIC:, F-box, and P-cache were that increased the amount of layout rework, the placed on the left-hand side of the chip, adjacent to chip development schedule wou ld have been signit� the boxes with which they com municate. icantlv longer if schematic verification and lannt t design had been done serial . design took In terconnect Pla nning and Routing l\· Lt\'OUt After we determined the initial placements of al l about a ,·ca r ro complete. major control sections. we used a new two- level Lc�yout Ve rifica tion block ro uter called GLOW to route rbe layout h>r the

Section- and box- level layout verification was per­ interbox ami intrabox signals. Routing was per­ formed in para l lel with the layout d esign . Once fo rmed at the same time schematics fo r the control

. Digital Te cbuical jo urnal 28 I rJ/. ' ,\u . i Si!JJIJII<'r /'J')l Th e NVAX CPU Chip: Des(Q.Il Challenges, ,Vfethods, and C/4 D 7()()/s

areas were in design. Since layout did not exist Th ird Metal Layer for the control blocks, GLOW was allowed to route The top aluminum interconnect laver (M.'\) in the signals over the blocks with some restrictions, as C:VIOS-4 process was specifically designed to meet well as route signals in cha nnels. We influence([ th e electrical requireme nts of the NVA)( chip. The how GLOW routed signals by specifying some areas third metal layer was designed t( Jr a low sheet resis­ of blocks as opaque (no over-the-block routing per­ tivity and high current ca pacity in order to handle missible) and some as porous (over-the-block rout­ the eJectrical problems associate([ with the power ing is permissible if channels are full) . Typically, gricl, clock distribution, critical signal routing, al1ll blocks that contained regular arrays (such as large array design. progra mmable logic arrays) or critical circuitry were identified as opaque, whereas most ra ndom Power and Ground Distrilnttion control areas were identified as porous. When the NVAX microprocessor is run at maximum The capacitance values of the interbox and intra­ speed, it draws a direct current of about 5 amperes. box signal interconnects were extracted from the Due to CMOS switching tra nsients, the alternating layout and used in circuit simulations of critical current peaks are considerably higher. Distributing paths. In some cases, the placement of sections and power (1;111) and ground (I�) across the chip while th e signal routing were altered to reduce intercon­ keeping power grid voltage drops (IR) under :)00 nect capacitance on critical signals. The use of syn­ millivolts ( 10 percent of minimum y;,")was a major th esis tools such as GLOW , OCCAM, and CLEO challenge. 'To address this constraint and meet allowed a much more detailed floor plan to be interconnect reliability goals, we used tl� e low­ developed than was typical for prior designs. �x T"he resistance M:) layer extensively to distribute 1;"' ability to feed capacita nce information from and I<,· As shown in Figure 4, we designed the floor pla n routing back into SPICE simulations right-hand side of the chip to be covered with proved invaluable. an interdigitated array of alternating �"' and �;,

1 MAIN CLOCK ROUTING (M3) I � POWER � I I I I I OR CLOCK VIC I 0t====l===ti==':I===IIC:::::::::::::I ��-I ' LINES (M3) SUPER BIT � VIC I W+======t=tVIY LINES (M3) r-tj POWER OR I-BOX � Illz ---��j- CLOCK M2 1 STRAPS 8 ��======�I t- :- ,. [lillill] d CONTROL F-BOX L STORE ROM E-BOX:::j M3 POWER -...... :. -�=;:�� tr' .?--- SUPER BIT t1--�.L_, J/Y ..:::: - �J- STRAPS LINES (M3) F-BOX .J. r- - - POWER - lj - � OR CLOCK ����������M-BOX �blJ-� - LINES (M3)

-

C·BOX t-

t- H=\t::::====:::::F=tjt- I I 171I 1 I \ l l J M3 POWER RING \\....M3 POWER SPINE

Figure 4 Jl!l etal 3 Rou ting

Digital Te chnical jo urnal \ "'· 1 ,\"()_ ..i Slllilllter /')')..! NVA.,'X-microprocessorVA., 'X systems

lines, each 17 micrometers wide. Ve rtical metal On-chip Clock Distribution two (M2) Jines are used to strap the power lines In order fo r us to meet the performance goals, it and form a �;lr, grid and a V,.,. grid. The �trt and 1�" was critical to keep clock skews small and edge distribution of the left-hand side of the chip was rates sharp across the chip. As shown in Figure 5, different from that on the right because of the spe­ special attention was given to the clock distribu­ cial layout requirements of the cache arrays and the tion scheme. Differential outputs from an offchip F-box. oscillator were supplied to a receiver located at the Individual cel l layout did not contain M3. The top of the chip. The output of the receiver was power, ground, and clock connections fo r a cell routed to the global clock generator (CI.KCEN), were routed by short vertical M2 lines insicle each which was placed at the center of the chip to cel l. These M2 lines were connected to the M3 reduce clock skew. 'fhe outputs of the gloha l clock grids automatically by a CAD tool. generator were buffe red hy fo ur inverters to

OSCILLATOR_LOW "'- ,.rOSCI LLATOR_HIGH DIFFERENTIAL AMPLIFIER �� /' � UPPER PAD CLOCKS (M3) / �------, -�-/- / ,v\7 t UPPER PAD VIC CLOCK I-BOX E-BOX CLOCK I-BOX VIC CLOCK CLOCKS BUFFER � ;. BUFFER M2 STRAPS : - � - f'.- � -- � E-BOX � ' 1 E-BOX OSCILLATOR DATA ;---!_• >--- E-BOX �I GLOBAL CLOCK : (M3) PATH I CLOCKS CLKGEN -- I (M3) I0 I I I F-BOX F-BOX � CLOCK t v I BUFFER

M-BOX M-BOX --- - CLOCK GLOBAL I I I � BUFFER CLOCKS I - ____.___,_;---- - (M3) I I

P-CACHE C-BOX C-BOX P-CACHE CLOCK CLOCK BUFFER BUFFER

LOWER PAD CLOCK

+ 1....------�LOWER PAD CLOCKS (M3)

Figure 5 Clock (ieneration and Distribution

:)0 Vo l. --i .\'o. 3 Sullllllel' l'J'J2 Digital Te c/Juicaljo urnal Th e NVAX CPU Chip: Design Challenges, Me thods, and CA D To ols

increase their driving capability. The clocks were need for sense ampl ifiers and voltage reference gen­ then distributed, using the low-resistance third erators. This substantially reduced the time metal layer (17 mi lliohms per square), from the top required to design and verify the control store ROJ'•'l. to the bottom of the central clock routing channel that spans the chip. Primary Ca che Clocks were supplied to the diffe rent functional A similar technique was used in the 8KB P-cacbe to boxes by locally tapping off the central clock rout­ ease the timing requirements. The three high-order ing and buffe ring each signal with fo ur inverters to P-cache address bits must be translated and conse­ further increase the signal's driving capability. This quently become valid later than the untranslated buffe ring helps to minimize the capacitive loads lower-order bits. By dividing the P-cache into eight seen by the clock phases in the central routing subarrays, each with its own sense amplifiers, the channel in which the RC delays are held to 30 pico­ cache subarray access can be starred before the seconds (ps). To reduce distribution skew between three translated address bits are va lid. \Vhen the the global clock Jines, loading on each global line last three address bits become valid, the outputs of was balanced by adding dummy loads to the more the subarrays are multiplexed onto the M3 super bit ligh tly loaded lines. The buffe red clock phases Jines, resulting in a faster cache access time. were distributed to the east and west of the central clock routing channel, again using M3 to reduce RC delay. The east-west clock routing was strapped Layout Veri fication To ols with M2 as shown in Figure 5. These straps were Ve rifying the NVAX chip layout presented several not allowed to cross box boundaries. Box-level CAD software challenges. Prior to the NVA)( design, clock skew was reduced by using a common the existing layout verification tools were able to section buffe r design and layout, and by carefully extract fu ll-chip from layout fo r all large tuning the buffer drive capabi lity to the clock load designs in a single batch process. However, the in each section. existing layout extractor cou ld not hand le Finally, before the clocks were used by the logic, designs such as NVAX with over one million transis­ the clock signals were locally buffe red. These final tors. Also, a more accurate capacitance extraction stages of local buffe ring served two purposes: they algorithm was required to calculate side-to-side and reduced the gate loading on the east-west clock fringing capacitance, which came to show signifi­ routing, and they helped to sharpen the clock edges cant effects in the small physical dimensions in seen by the logic. CMOS-4. Furthermore, accurate interconnect resis­ The global clock routing network was spaced so tance extraction was needed fo r NVAX . A combina­ that the RC delays of local clock branches wou ld tion of new CAD tools (see Figure 7) and design never exceed a negligible 125 ps. We calculated the methods was employed to meet the NVAX layout RC delays of local clock branches using the WAWOTH verification requirements. layout interconnect analyzer (described in the section New Proprietary CAD To ols) and, where Pa rtitioning Us ing "Clean Belts" necessary, rerouted branches to meet the 125-ps To address the problem of extracting parasitic design goal. A sample RC plot, generated by capacitance data from such a large design, the NVA)( WAWOTH fo r a section of local clock routing, is given in Figure 6. The clock skews and edge rates chip layout was constructed so that each chip parti­ across this 1.62-centimeter chip are less than 0.5 ns tion could be independently extracted without and 0.65 ns, respectively. introducing inaccuracies in the results. The chip was partitioned into nonoverlapping regions, each of which had a rectilinear annulus or "dean belt" Microcode Control Store around its periphery. A clean belt is a rectangular The design of the 12KB ROiVI control store was sim­ region that contains only metal I ines and satisfies a plified by dividing it into fo ur subarrays. Each subar­ number of layout design rules beyond those set by ray has its own Ml bit lines. The Ml bit lines from the technology. The clean belt layout ru les pre­ the su barrays are cascaded onto low-capacitance vented design rule violations within the clean belt M3 super bit lines that extend over all fo ur subar­ and between adjacent clean belts. The rules also rays. Since the capacitance of the M3 super bit lines ensured that extracting parasitic capacitance from is low, the access time is very fast, obviating the a region enclosed by a clean belt cou ld be done

Digital Te chnical jou rnal Vo l. 4 No. 3 Sum.mer 1992 31 NVAX-microprocessorVA., "X systems

101.6 102.0 Note: nmes are given in picoseconds. 103 2 105.8

105.7

104.0 104.0 103.2 101.8 101 .9 101.6 101.5 1011 1007

104.7

100.7 107.0 106.2 100.6 107.0 106.2 100.9 101.0 102.2

I

...-- ..., ..... � ...... ,...._ ......

Figure 6 W;i W011-fRC Delay Ano�)'sis Resullsji.Jro Clock Node

accurately regard less of the materials that border out cell instances and in many cases needs to the region. Partitioning the chip in this manner extract cell-only definitions. In contrast, the previ­ made it easier to locate global wiring errors. ous net list extractor "flattened'' the layout data into one nonhierarchical cell ;1 11 d, therefore, extracted Hierarchical Ne tlist hxtraction all clara ·wi thout reusing previously extracted cell A new netlist extractor, HI LEX, was used to meet the defi nitions. The actual performance improvement high data c:qncity requirements of the NVAX micro­ realized by HILEX depends on the hierarchy of the processor. HII.EX is more efficient than the previous chip layout design. If very few cel ls are replicated, in-house net list extractor because it recognizes lay- or cells are repl icated in a way that requires the

.1 2 1·1>/. ..J No. 3 Slllllilll'r J')'J.! Digital Te cbuical jourual The NVAXCP U Ch ip: Design Ch allenges, Me thods, and CA D To ols

CUP HI LEX

CAPACITANCE NETLIST

SCHEMATIC WLC REX NETLIST

MISMATCH MATCH FILE FILE

Figure 7 Simplifi ed Layout Ve rification To ol Flow

extractor to explode the cells (i.e., create more flat VMS operating system architecture. To go beyond data) to extract them properly, then minimal per­ the VMS virtualmemory limit, the internal memory HILEX fo rmance improvements are seen. An example of management routines within were modified the latter situation is an array of a repeated pair of to allocate additional heap from the process stack VMS overlapping cells that fo rms one or more transis­ (in Pl space) when the memory allocation tors due to the overlap (one cell contains diffusion routines indicated that PO memory space was areas that become source and drain regions when exhausted. This technique was used to al locate the overlapped with the other cell, which contains 2.5 million virtual pages required for fu 11-chip con­ polysilicon lines that become transistor gates). nectivity extraction. Several layout design guidelines were defined to ensure that perfo rmance improvements from using Netlist Comparison HILEX would be realized. Adherence to the guide­ A utility called WLC was used to verify that netlists lines minimized situations that require HILEX to extracted from layout by HILEX matched netlists explode cells and encouraged the use of hierarchy generated from the chip schematics. Since the in the layout. However, since it was not always pos­ NVAX schematic hierarchy rigorously matched the sible to adhere to these design recom mendations, layout hierarchy only at certain .levels, the con­ HI LEX was designed to handle large amounts of flat nectivity comparison was performed flat. WLC data. employed a graph-build ing and graph-traversing The chip net list was extracted from the complete algorithm that worked well fo r comparing Jess than chip layout prior to tape out. This presented qu ite a 500,000 device connections. However, substantial challenge since even with the use of HI LEX, extract­ paging occurred when comparing larger netlists. ing the chip netlist from the 225MB chip layout file Since the NVAX chip contained 1.3 million devices, in one pass required more than the maximum of the performance of WLC was inadequate for fu ll­ two million virtual pages of memory allowed by the chip netlist verification.

Digital Te chuicaljournal 1992 Vo l. 4 No. 3 Summer 33 NVAJ(-microprocessorVAX systems

To improve the elapsed time of netlist compari­ extraction batch jobs. CUP sectioned the layout son batch jobs, multiprocessing was employed. database into fixed-size stripes, which were HILEX was modified to write the extracted netlist of inserted into the batch queues of multiple crus. the clean belt partitions. Each partition was then This method red uced the data complexity and compared by WLC, in parallel on multiple crus, to allowed as many parallel computations as there its equivalent schematic-generated netl ist. This were processors. During NVAX chip design, capaci­ approach re duced the total elapsed time fo r the tance extraction was partitioned across as many as NVfu'\chip net list comparison from more than three 20 CPUs. Multiprocessing reduced, fo r example, the days to seven hours. Cross -reference files output by NVA _,'\: I-box capacitance extraction from 26 hours to WI.C and the schematic netlists were processed by a just 8 hours using 4 processors. Extraction of the separate program, 1\'lATCH_CHECKER, to verify the E-box took 40 hours using one processor, but only connectivity of nodes that crossed partition bound­ 12 hours with 4 CPUs. Ta ble 2 shows the device and aries. This additional step added only eight minutes node counts of the NVAX boxes (excluding the CUP VAX ro the total elapsed time fo r comparing chip caches), ancl the extraction run times on a net! ists. 6000 Model 500. Each box run resulted in approx i­ mately 500,000 extracted parasitic capacitors. Capacitance Extraction

The small spacing dimensions of the submicron Resistance Extra ction C\·IOS-4 process caused fringing and lateral capaci­ Verifying the NVfu'\ power, ground, and clock net­ tances to contri bute significantly to the total nodal works, and long signal lines requ ired accurate capacitance. The existing layout extraction tool extraction of interconnect resistance from layout. only extracted overlapping parallel plate capaci­ To meet this requirement, the REX resistance tance. Thus, a new layout capacitance extractor, extractor was developed 9 REX processed the HILEX C:Ul', was written ro accurately extract fringing, lat­ output of to produce a series and parallel eral, and area capacitances. combination of resistors that modeled a node's CUP extracted interconnect capacitance from interconnect. The resistor network was generated layout by decomposing interconnect layout into by fracturing the node layout into polygons based pieces of uniform layout cross sections. The geome­ on changes in the layout geometries (width, length, try of each interconnect piece, and its distance bends) of the node or the occurrence of contacts. from layers above, below, and adjacent to it, are The effective resistance of each polygon and con­ used to calculate its area, fringing, and lateral com­ tact, or cluster of contacts, was then determined ponents of capacitance. The empirical fo rmulae from technology parameters and the polygon used to calcu late the capacitive components were geometries. based on cmves of two-dimensional electrostatic The power and ground resistor networks were simulation data of various layout cross sections. extracted fo r individual boxes rather than the entire This technique produced accurate internodal and chip. The resu lting networks were still quite large total interconnect capacitance data. This accuracy due to the fine granularity of the REX extraction 3 resu l ted in CUP being very compute intensive. process. Table shows the extraction times fo r a REX 6000 ')00 Multiprocessing was employed again to reduce job running on a VAX Model and the the elapsed turnaround time fo r capacitance total number of resistors extracted from each box .

Ta ble 2 CUP Parasitic Capacitance Extraction Batch Run-time Data for NVAX Boxes

Single CPU Four CPUs Box Device Count Node Count (Hours) (Hours)

1-box 107, 000 36,830 26 8 M-box 102,000 38,770 29 8 E-box 107, 600 41,760 40 12 C-box 92,400 45,050 42 12 F-box 129,1 50 55,550 45 12.5

Vo l. 4 No. 3 Su11nner 1992 Digital Te chnical jo urnal Th e NVAX CP U Chip: Design Challenges, Methods, and CA D To ols

Ta ble 3 REX Extracted Parasitic Resistance simulation code is designed to execute in a straight­ Data and Batch Run-time Data for line fashion. Due to the high switching event densi­ NVAX Boxes ties we observed on NVAX, 18 percent on average, Extraction Time this straight-line compiled approach to simulation Box Resistor Count (Hours) was more efficient than event-driven simulators, which typical ly fail to compete when event densi­ M-box 602,000 5 ties increase beyond 3 to 5 percent. The CHANGO C-box 621 ,000 5 translation process further optimized the sim­ 522,000 10 1-box ulation by partitioning the simulation according E-box 719,000 10 to signals that should be evaluated during each par­ F-box 1,200,000 36 ticular clock phase. This avoided processing signals during clock phases when signal transitions could not occur. Further, evaluation of a switching event New Proprietary CAD To ols was only performed when the signal could affect

Several other novel CAD tools were specifically the evaluation of some other signal. This prevented designed fo r the NVAX chip. These tools provided simulation of unimportant switching events that practical solutions to verification and analysis prob­ were ignored by the re maining design. Redundant lems that were previously unmanageable or signals (i.e., nodes with the same logical behavior) intractable. were grouped together as a list of synonym signals in order to moclel multiple nodes by only one simu­ lation event. CHANGO Logic Simulator

CHANGO was an important development for NVAX NTV functional verification because it allowed sig­ Ti ming Ve rifier nificantly more simulation cycles and fu nctional NTV is a static timing verification tool developed for verification tests to be performed from the NVAX use on the NVAX microprocessor. 10 NTV processed transistor-level description than was previously 350,000 circuit paths and checked 42,000 timing possible. CHANGO is a two-state gate-level logic constraints on the NVAX design. NTV eliminated the simulator designed to maximize performance need fo r the pattern-dependent dynamic speed through compiled, straight-l ine simulation. Elec­ verification strategy used by other chip designs trical issues such as gate delay and charge sharing and significantly reduced the extensive speed ver­ were not modeled since CHANGO was used ification work needed for SPICE simulations. It fo r functional, not timing, verification. CHANGO's identified critical paths that would have otherwise parallel simu lation capability allowed the simul­ remained undetected clue to rhe complexity and taneous execution of 13 different NVAX model size of the NVAX design. simulations on one CPU, which resu lted in an eight­ NTV read multiple flat transistor ne tlists with or fo ld increase in simulation performance. Overall, without parasitics and automatically identified cir­ CHANGO has been shown to accelerate simulation cuit structures such as complementary, dynamic, one to two orders of magnitude over traditional and cascode gates as we ll as several types of latches. event-driven gate- level simulators. Its high through­ Based on the classification of these structures, NTV put enabled us to boot the VMS operating system identified timing constraints. For example, NTV twice (75 million cycles) prior to tape out. checked that the latch storage nodes become val id To create a CHANGO model, a transistor-level before the latches closed. NTV also read user-speci­ netlist description of the design was input to a pre­ fied timing fo r primary inputs and propagated node processor called GEN_MODEI.. GEN_MODEL trans­ timing throughout the design based on when sig­ fo rmed the netl ist into a logical description of the nals arrived at gate inputs, the drive capability of design, consisting of simple Boolean elements, each gate, and its output loading. D-type latches, and SR . CHANGO trans formed NTV has three delay models that were used fo r this logical description into a highly optimized sim­ calculating gate delay: (1) unit delay was used fo r an ulation stream of VAX assembly code. initial rough timing estimate before real parasitics CHANGO achieved its high simulation through­ were known, (2) a SPICE-calibrated lumped RC put in many ways. Conditional branch latency model was used fo r delay calculation of comple­ penalties were largely avoided because CHANGO mentary gates, and (3) an Elmore-distributed RC

Digital Te chnical jo urnal Vo l. 4 Nu. 3 Summer 1992 35 NVAX-microprocessor VAX systems

model was used fo r other srructu res. 11 NTV flagged derived from average node-switching freq uencies circuits that fa iled to meet rhe identified riming calculated from logic simulation data. Over I OO , R WAWOTH. constraints within a user-specified time tolerance. signal nodes were also analyzed by Like other static riming verifiers, some paths identi­ fied by NTV were ··don't cares" or were logically Conclusions

impossible. The user eliminated these false paths Our design strategies, methods, and CAD tools by deleting timing constraint checks or by specify­ allowed us to complete the i\fVA..'CCPU chip design in ing mutual exclusivity between specified groups of 30 percent less time than our initial schedule bad nodes. required. Ty picaJ parts operate at 8).3 MHz (a 12-ns cycle time) under worst-case conditions fo r tem­ WA WOTH Interconnect A nalyzer perature and power supply. Th is is 2 ns better than our maximum cycle time design constraint. The Trad itional manual techniques for checking RC chip booted the VJ\IS operating system ten days delay, noise, and electromigration (EM) were IR after the first prototype wafers were available, and impractical fo r NVt\..'\ due to the size and complexity booted the lJLTIUX system a few clays later. Mu lti­ of the design. A suite of CAD tools Gllled WAWO"ri-1 processing was running within weeks. Fifteen was developed to perform these checks auromati­ obscure logic bugs were fo und in the first-pass cl lly, more accurately, and in fa r less time than chips, but none of them impeded system debug or wou ld otherwise ba\'e been possible. prototype development. No circuit design bugs Duri ng E.'vl and II{ analysis, current sources rep­ were fo und. Only one design revision was needed resenting gate-switching events were added to a prior to vo lume chip manufacturing. REX-ge nerated resistor network. The magnitude of Carefu l design, complexity management, and each current source was calcul ated based on tl1l' proprietary CAD tools targeted to large custom average switching frequency of the gate, the load it CMOS integrated circuits played cmcial roles in the drove, and whether average or peak current was successful design of the NVAX microprocessor. desired fo r the current analysis mode. The network node voltages were then solved through I.U decom­ Acknawledgments position. Peak voltages were flagged for IR analysis, We wou ld like to acknowledge the contributions of and average and peak current densities were calcu­ the f()IJowing people: Anderson, E. Arnold, lated fo r each resistor element and checked against W R. Badeau, Bahar, M. Benoit, B. Benschneider, EM limits. 1. D. l3ernstein, D. Bhavsar, B ro, M. Blaskovich, During RC analysis, node capacitance was L. i Boucher, Bow hill, Briggs, ]. Brown, S. Butler, proportionately distributed along the resistor net­ P W L R. Ca lcagni, S. Carroll, M. Case, R. Caste lino, work. The resulting RC network was processed A. Cave, S. Chopra, M. Co ey, E. Cooper, R. Cvijet ic, by Carnegie-Mellon's AWE algorithm to generate a il M. Delaney, D. De,·erell, A. DiPace, D. DuVa rn ey. close approx imation of the transfer fu nction fo r R. Du tta, J Edmondson, ). Ell is, H. Fa ir, G. Frances­ the nerwork.12 From this. the step response RC chi, N. Geagan, S. Goel, M. Gowan, J Grodstein, delay was ca lcu lated and the delay response to any Gronowski, Grundmann, H. Harkness, specified edge was fo und through convolution of P W W Herrick, R. Hicks, Holub. Ja in. Khanna, the transferfu nction. C. A. R.

R. Kiser, D. Kos ow. Ladd, S. Levitin. S. ivlartin, Si nce it was neither possible nor necessary to l K. T. McDermott, K. YlcFadden, B. McGee, J Meyer. perform RC and E.\1 analysis on all signal nodes, WAWOTH M. Min;.mli, D. Miner, T. O'Brien, J Pan, H. Partovi, contained a number of tools that identi­ N. Ph illips, .J. Pickholtz, R. Razclan, S. Samudrala, fied only those nodes that would have some chance ). Siegel, K. Siegel. Somanathan, ]. St. Laurent, of fa iling these checks. To decrease run time, we C. R. Stamm , R. Supnik, M. Ta reila, S. 'fhierau f. M. reduced rbe size of the files that were input to Uhler, N. Wa de, S. Wa tkins, and Y Ye n. WAWOTH by eliminating any devices and parasitics that were not related to the node under examina­ References and No te tion. Noteworthy were the la rge data requirements \VAWOTH. WAWOTH NVA X NVAX+ mer by For example, calculated l. G. Uhler et al., "The and High­ the current through the 719,000 resistive elements performance VAX Microprocessors," J) (!!,ital that compose the power and ground nodes of the Te chnical Jo urnal, vo l. 4, no. :) (Summer

E-box. Current sti mulus of the network was 1992, t hi s issue): 11-23.

I b/. .f .\h 3 Sum111er 1')9.! Digital Te cbuica/ jo urunl Th e NVAX CP U ChijJ: Design Challenges, Methods, and CAD To ols

2. B. ZetterJumi,J Farrell, and T. Fox, "Micropro­ Te chnicaljo urnal, vol. 1, no. 7 (August 1988): cessor Performance and Process Complexity 95-108. in CMOS Technologies," Digital Te chnical 8. W Durdan, W Bowhill, Brown, W Herrick, jo urnal, vol. 4, no. 2 (Spring 1992): 12-24. J. R. Marcel lo, S. Samudrala, G. Uhler, and 3. ). Crowell, K. Chui, T. Kopec, S. Nadkarni, and N. Wade, "An Overview of the VA)\: 6000 D. Sovie, "Design of the VAX 4000 Model 400, Model 400 Chip Set," Digital Te chnical 500, and 600 Systems," Digital Te chnical journal, vol. 2, no. 2 (Spring 1990): 36-51. jo urnal, vol. 4, no. 3 (Summer 1992, this issue): 60 -72. 9. J Hwang, "REX-A VLSI Parasitic Extraction To ol fo r Electro migration and Signal Analysis;' 4. L. Chisvin, G. Bouchard, and T. We nners, "The 28th Design Automation Conference (1991): VA)( 6000 Model 600 Processor," Digital 717-722. Te chnical jo urnal, vo l. 4, no. 3 (Summer 1992, this issue): 47-59. 10. J. Pan, L. Biro, .J. Grodstein, B. Grundmann, and Y. Ye n, "Timing Verification on a !.2M­ 5. R. Badeau et al., "A 100 MHZ Macropipelinecl Device Fu ll-Custom CMOS Design," 28th VAX CMOS Microprocessor," Accepted for Design Automation Conference (1991): publication in IEEE Jo urnal of Solid State 551 -554. Circuits, vo.l . 27, no. 11 (November 1992).

6. SPICE is a general-purpose circuit simulator 11. W Elmer, "The Transient Response of program developed by Lawrence Nagel and Damped Linear Networks with Particular Ellis Cohen of the Department of Electrical Regard to Wide-band Amplifiers," Jo urnal of Engineering and Computer Sciences, Univer­ AjJjJliedPh ysics, vol. 19 (1948): 55-63. sity of California, Berkeley. 12. V Ragbavan and R. Rohrer, "A New Nonlinear 7 T. Fox, P. Gronowski, A. Jain, B. Leary, and Driver Model fo r Interconnect Analysis," 28th D. Miner, "The CVAX 78034 Chip, a 32-bit Sec­ Design Automation Conference (1991): ond-generation VAX Microprocessor," Digital 561 -566.

Digital Te cbnical journal Vo l. 1 No. 3 Summer 1<)92 37 Wa lkerAnders on I

Logical Verificationof the NVAX CPU Chip Design

Digital's Nl-ft.X high-pe1jonnance microprocessor has a complex logical design. A rigorous simulation-based verification effo rt was undertaken to ensure that there were no logical errors. At the core of the effo rt were implementation-oriented, directed, pseudorandom exercisers. Tb ese exercisers were supplemented with imple­ mentation-specifi c fo cused tests and existing l'l1Xarch itectural tests. Onf:JI 15 logical bugs, all unobtrusive, were detected in the first-pass design, and the operating systern booted with first-pass chips in a proto�vp e system.

The NVAX CPU chip is a highly complex VAX micro­ Group (SEG) whose primary responsibility was to processor whose design required a rigorous verifi­ detect and eliminate the logical errors in the NVAX cation effort to ensure that there were no logical design. The detection and elimination of timing, errors. The complexity of the design is a result of electrical, and physical design errors were left to

the advanced microarchitectura l features that make separate efforts. 2

up the NVA ..'\ architecture, such as branch predic­ Given the design complexity, the critical need fo r tion, micropipelining and macropipel ining tech­ highly fu nctional first-pass chips, and the fact that niques, a three-level hierarchy of instruction the designers had other responsibilities related caching, and a two-level hierarchy of write-through to the circuit and physical implementation of and write-back data caching. 1 Also, the chip was ini­ the fu ll-custom chip , special attention to logical tially intended fo r two different target system con­ verification was considered a requirement. Every figurations and had to be verified fo r operation in verification engineer approached the verification both. Product time-to-market goals mandated a problem with a different fo cus. Each member of short development schedule relative to previous one group of engineers fo cused on the verification projects, and there was a limited number of verifi­ of a single box, while other engineers fo cused on cation engineers available to pertorm the tasks. fu nctions that spanned several boxes. Certain veri­ The verification team set two key goals. The first fication engineers were avai lable throughout the was to have no "show stopper" logical bugs in the project to test the fu nctions of the chip that first-pass chips and, consequently, to be able to required extra attention. This variety of perspec­ boot the operating system on prototype systems. tives was an important aspect of the verification Meeting this goal would enable the prototype strategy. Most verification engineers to llowed the system hardware and software development teams process described below. to meet their schedules and would allow more intensive logical verification of the chip design to 1. Plan tests to r a fu nction. continue in prototype systems. The second key team goal was to design second-pass chips with no 2. Review those plans with the design and architec­ logical bugs, so that these chips could then be ture teams. shipped to customers in revenue-producing sys­ tems. Meeting this goal was critical to achieving the 3. Implement the tests. time-to-market goals for the two planned NVAX­ basecl systems. 4. Review the actual testing with the design and architecture teams. Te am Org anization and Approach

Logical verification was performed by a team of engi­ 5. Implement any additional testing that was neers from Digital's Semiconductor Engineering deemed necessary.

38 Vo l. 4 No. 3 Surmna 1.992 Digital Te chnical jo urnal Logical Ve rification of the NVAXCP U Chip Design

Simulation and Modeling Methodology (i.e., board) and then system models. The chip veri­

The verification effort employed several models of fication team performed only a limited amount of the fu ll NVAX CPU chip and of the individual design simulation using these module and system models. elements. Each model had its strengths and weak­ These simulations were used primarily to verify nesses, but all models were necessary to ensure a that the chip model fu nctioned correctly in a more thorough logical verification of the design. accurate model of a target system configuration and to better test the multiprocessor support ftmc­ tions in the design. The system development BehavioralModels groups perfo rmed more extensive simulations with The behavioral models of the chip were written by such models. design team members using the DECSJM behavioral modeling language; to achieve optimal si mulation Schematics-derived Models performance, they were written in a procedural Schematics-derived models were created and simu­ style. These models are two-state models that are lated at both the box and fu ll-chip level. The logical ly accurate at the CPU phase clock bound­ CHANGO simulator is a two-state, compilation­ aries. These fa irly detailed behavioral models repre­ driven simulator and, like the behavioral model, is sent logic at the register transfer level (RTL), with accurate only at the CPU phase clock boum.laries.2 every latch in the design represented and the com­ The fu ll-chip CHANGO model linked together the binational logic described in a way similar to the fo llowing: the code that was automatical ly gen­ ultimate logic design. eratet1 from the schematics; C-code models fo r Behavioral simu lations were performed first on chip-internal features such as control store and box-level models, where most of the straightfor­ random-access memories (RAMs); C-code models ward design and model ing errors were eliminated. to perform simulation control fu nctions; and the A box is a fu nctional unit such as the instruction same DECSIM behavioral models for the backup prefetch/decode and instruction cache control cache, main memory, and system fu nctions that unit, the execution unit, the floating-point unit, were used in the fu ll-chip behavioral model. The the memory management and primary cache con­ simulation perfo rmance of the fu ll-chip CHANGO trol unit, or the bus interface and backup cache model was only about one-half that of the unaccel­ . 1 erated, fu ll-chip behavioral model. A.lthough these The box-level models were then integrated into models were not useful fo r electrical or timing veri­ a full-chip behavioral model, which also included a fication because they did not model timing or elec­ backup cache model, a main memory model, and trical characteristics of the design, their simulation models to emulate the effects of several system con­ performance made them extremely useft!l fo r logi­ figurations. The pseudosystem models did not cal verification. model any one specific target system configuration Another fu ll-chip model was created to run on but could be set up to operate effectively like any an event-driven, multiple-state simulator. How­ target system configuration or in a way that exer­ ever, only a limited amount of simulation was per­ cised the chip more intensely than any target fo rmed using this model, because its performance system wou ld. Av ailable early in the project, this was very slow when compared to the CHANGO and model was the primary vehicle fo r logical verifica­ behavioral models. Since it was the only model that tion until the schematics-derived, fu ll-chip, in­ could run on a multiple-state simulator, this third house CHANGO model was created. The fu ll-chip model was used primarily to verify chip power-up behavioral model could simulate approximately 13 VMS CPU and initialization. cycles per VAX second and was used to simulate about one billion CPU cycles. The procedural, ft1 ll-chip behavioral model also PseudorandomExercisers ran on a hardware simulation accelerator where it Early in the project, it became apparent that, given achieved simulation performance of about twice the limited number of engineers, the short sched­ that of the unaccelerated simulation. The simula­ ule, and the complexity of the NVAX chip design, a tion accelerator was used primarily fo r simulating strategy of developing and simulating al l conceiv­ long, automated, noninteractive tests. able implementation-specific test cases wou ld be In addition, the model was encapsulated in an ineffec tive . This strategy would have required the event-driven shell and incorporated into module engineers to implement tedious, handcrafted tests.

Digital Te clmicaljournal Vo l. 4 Nu. 3 Summer 1992 39 NVAX-microprocessor VAX Systems

Instead, the verification team adopted a strategy script files that contain hierarchical text generation that depended heavily on the use of directed, pseu­ templates and implements the basic fu nctions of dorandom tests referred to as exercisers. This strat­ a programming language. egy generated and ran many more interesting test SEGUE provides a notation that allows the user to cases than would ever have been conceived by the specify sets of possible text expansions. Elements verification and design engineers themselves. of these sets can be selected either pseudoran­ The basic structure of an exerciser consisted of domly or exhaustively, and the user can specify the the fo llowing five steps, which were repeated until weighting desired fo r the selection process. For a failure was encountered: example, a hierarchy of SEGUE templates typically comprised three levels. At the lowest level, a SEGUE 1. Set up the test case. template was created to select pseudorandomly a 2. Simulate the test case on either the behavioral or VAX opcode, and another template was created to the CHANGOmodel . select a specifier, i.e., operand. At an intermediate level, the verification engineers created templates 3. Execute the test program on a VA,'\ reference that called the lowest-level templates to generate machine. short sequences of instructions to cause various 4. Analyze the simulation and accumulate data. events to occur, e.g., a cache miss, an invalidate from the system model, or a copy of register file 5. Check the results for fa ilure. contents to memory. At the highest level, these Figure 1 depicts the interoperation of the tools intermediate-level templates were selected pseudo­ used to construct an exerciser and its basic flow. randomly with varied weighting to generate a com­ plete test program. Setup Because the SEGUE tool was developed with Setting up the test case involved generating a verification test generation as its primary applica­ short assembly language test program, activating tion, the syntax allows fo r the easy description of some demons to emulate various system effects, test cases and the ability to combine them in inter­ and selecting a chip/system configuration fo r esting fashions. Using SEGUE, the verification engi­ simulation. neers were able to create top-level scripts quickly The assembly language test programs were gen­ and easily that could generate a diverse array of erated using SEGUE, a text generation/expansion test cases. These engineers considered SEGUE to tool developed fo r this project. This tool processes be a significant productivity-enhancing tool and

SEGUE SCRIPT FILE

SEGUE�

DEMON AND CONFIGURATION ASSEMBLE� /LINK SETUP AND SIMULATION CONTROL r ! BEHAVIORAL OR CHANGO SIMULATION

MEMORY DUMP LOG AREA FILES FILE

c oMPAR E E AM NE ..______---1 ___ x I __ ___, _,....- 1 ._I __.,.__ ___.I l.___s.,.A_v_E_s_A_N_A_L_vs,_ls

PASS/FAIL CUMULATIVE INDICATIONI COVERAGEI DATA

Figure 1 Ve rification To ol Flatu fo r Exercisers

40 Vo l. 4 No. 3 Summer 1992 Digital Te clmicaljow-nal Logical Ve rijication of the NVAX CP U Ch ip Design

preferred using SEGUE to hand-coding many and various internal signals, were created. As the fo cused tests. exerciser test programs executed, various VA.'\. Before simulations were performed, model architectural state information was written period­ demons were set up. Demons were enabled or dis­ ically to a region of modeled memory referred to as abled , and their operating modes were selected the clump area. When the simulated execution of pseudorandomly. Demons may be fe atures of the the test program completed, the contents of the moclel environment that cause some external cl ump area were stored in a file. Also, the test pro­ events such as interrupts, single-bit errors, or cache gram was executed under the VMS operating invalidates to occur at pseudorandom, varying system running on a VA X computer used as a refer­ intervals. Demons may also be modes of operation ence machine. At the end of execution, the con­ fo r the system model that cause pseudorandom tents of the memory dump area were stored in variation in operations such as the chip bus proto­ another file. col, memory latency, or the order in which data is returned. Some demons were implemented to fo rce Analysis chip-internal events, e.g., a primary cache parity A tool called SAVES allows users to create C pro­ error or a . These chip-internal grams in order to analyze the contents of binary demons had to be carefully implemented, because trace files. SAVES was used to provide coverage anal­ sometimes they fo rced an internalstate from which ysis of tests, and to check fo r correct behavior of the chip was not necessarily designed to operate. In chip-internal logic ancl give a pass/fail indication. a pseudorandomly generated test, it is frequently For coverage analysis pur poses, information diffi cult or impossible to check fo r the correct han­ such as the number of times that a certain event dling of an event caused by a demon, e.g., check occurred during simulation or the interval that an interrupt is serviced by the proper handler between occurrences was accumulated across with correct priority. However, simply triggering several simulations. This data gave the verification those events and ensuring that the design did not engineer a sense of the overall effectiveness of enter some catastrophic state proved to be a power­ an exerciser. For example, a verification engineer fu l verification technique. who wanted to check an exerciser that was Chip/system configuration options such as cache intended to test the branch prediction logic was enables, the floating-point unit enable, and the able to use the SAVES tool to measure the number of backup cache size and speed were also preselected branch mispredictions. pseudorandomly. Aside from resting the chip in all Frequently, verification engineers used the SAVES possible configurations, e.g., with a specific cache tool to perform cross-product analysis and data disabled , varying the configuration in a pseudo­ accumulation. For cross-product analysis, the engi­ random manner caused instruction sequences to neer specified two sets of design events to he ana­ execute in very different ways and evoked many dif­ lyzed. The analysis determined the number of times fe rent types of bugs unrelated to the specific con­ that events from the first set occurred simultane­ figuration. Also, specific configurations and demon ously with (or skewed by some number of cycles setups would significantly slow down the simu­ from) events in the second set. For example, one lated execution of the test program, sometimes to verification engineer analyzed the occurrence of the point where intended testing was not being different types of primary cache parity errors rela­ accomplished. To work around this problem, the tive to diffe rent types of memory accesses. verification engineer could fo rce the configuration Analyzing the cross-product of state machine states and demon selection to avoid problematic setups. against one another, skewed by one cycle. allowed state machine transition coverage to be quickly Simulation and VAX Reference Execution understood. SAVES After assembling and linking the test program, it The verification team used this information was loaded into modeled memory, and its execu­ about the exerciser coverage in the fo llowing ways: tion was simulated on either the behavioral or the • To enhance productivity by helping the engi­ CHANc;o model. As the test program simulation neers identify planned tests that no longer rook place, a simulation log file and a binary-format needed to be developed and run because the file, which contained a trace of the state of the pins exerciser already covered the test case

Digital Te cbnica/ jo urual Vo l. ..j ,Vo. .) Summer !<)<).! 41 NVA.X-microprocessor VAX Systems

• To indicate significant areas of the design where CHANGO models, and proved to be very effective at coverage may have been deficient detecting subtle, complex bugs in the design. Each exerciser concentrated on testing a sing! · box, a • To determine how the exercisers might be subsection of a box (e.g. , branch prediction logic), adjusted to become more effective or thorough, or a particu lar global chip fu nction. By adjusting or to fo cus on a particular low-level fu nction of the SEGliE template weightings, preventing or fo rc­ the chip design ing the use of a particular demon, or fo rcing a par­ Pa ss/Fa il Checking ticu Jar configuration parameter, the exercisers could be controlled at a high level to fo cus on low­ Several checking mechanisms were employed to level fu nctions. Ve rification engineers traded inter­ determine wil ·ther tests passed or fa iled. The SAVES esting SEGUE templates among themselves to tool was used to check for correct behavior of the provide each exerciser with a rich and diverse set design . especially where correct behavior was diffi­ of possibilities fo r code generation, while still cult to observe at a VA X architectural level. For maintaining the intended fo cus of the exerciser. example , the verification engineers used SAV ES to check the proper functioning of performance­ Focused Te sts h enhancing features such as branc prediction logic, Several fo cused tests were generated to supple­ pipelines, and caches. ment the exercisers. These were necessary to rest VJ\'IS A command procedure automatically implementation-specific aspects of the design that scanned simulation Jog files fo r error output from could not be checked by comparing resu lts against any of several design assertion checkers built into a VA..'\ reference machine. In some cases, an exer­ the model. These assertion checkers varied widely ciser could have been used to test a part icu lar func­ in complexity. For t:xample, simple assertion check­ tion, but the verification engineer judged it easier ers ensured that unused encodings of multiplexers' to hand-cocle a fo cused test program than to con­ select lines never occurreu . As another example, a trol an exerciser in order to accomplish the testing. more sophisticated and complex assertion checker Focused tests were necessary and particularly chal­ verified that the CP had maintained cache coher­ lenging to create and maintain when very precise ence and proper subsets among the three caches timing of events was required to test a certain sce­ and the main memory. nario of chip operation. This timing could be VJ\IS The same command procedure checked the achieved only by handcrafting an assembly lan­ s imulation log file to verify that the simulation of guage test and running it under carefully controlled the execution of the test program reached the simulation conditions. proper completion . Finally, a sim­ Each of the fo cused tests was run at least once on ple program compared the memory dump area files the fu ll-chip behavioral model and then again generated by the simulation ancl the reference on the fu ll-chip CHANGO model. machine execution to verify that the memory dump areas were identical. Although thc simulated test OtherTe sts program may have fo llowed a diffe rent exccution Several tests that had been used fo r the verification path from the VAX reference execution because it of previous VAX implementations were also usecl was simulated in the presence of demons, the com­ fo r verification of the NVAX C!'LI chip. The use of pletion points of both executions were the same, these tests a.l lowcd the NVA X logical verification and the VA ..'\ architectural state information that was team to fo cus on the implementation-specific com­ compared was identical. plexities of the NVAX clesign anu not expend as

If these checks fo und no errors, the exerciser much effo rt on implemenration-inclependent, VAX looped back to generate another test case. Because architectural verification. this whole process was automated, the verification The HCORE suite of tests can be used to verify engineer could run this test continuously, on a II several permutations of all VAX instructions, as wel l available computing resou rces. as some VAX arch itectural concepts, e.g. , memory management.'�HC ORE was valuable in that it was the Other Aspects of the EYercisers first test used to debug both the fu ll-chip behav­

Till' l·xercisers were the core of the NVAX CPU chip ioral model and the CHANGOmodel . verification effo rt. They were run nearly continu­ Sma 11 portions of the HCOR E suite were used as a ously throughout the project on behavioral and/or nightly model regression test. In general, very little

42 Vo l. 4 .Vo. j 5111111111!1" 1992 Digital Te cbnical juurua/ Logical Ve rification of the NVAX CP U Ch ip Design

regression testing of the NYAX models took place; simulation process and a box-level CHANGO the team believed that using computing resources simulation process under the VMS operating system to run pseudorandom exercisers and other new and then communicating between the processes. tests was of more value to the verification effort Stimulus and response data from the behavioral than consuming resources with extensive, frequent simu lation .isuse d to drive the inputs to and check regression testing. Consequently, the entire HCORE the outputs from the box-level CHANGO model. In suite was run at only a fe w key checkpoints during addition to comparing primary outputs from the the project. box, this technique was used tO compare many AXE is a VAX architectural exerciser that pseudo­ chip-internal points. The POTF technique elimi­ ra ndomly generates single-instruction test cases. 1 nated the need to extract and maintain large pat­ MAX is an extension of AXE that generates multiple­ tern files fro m behavioral simulations and proved instruction test cases with complex data dependen­ to be a straightforward way of comparing the two cies between the instructions. Both tools set up models. Exercisers and fo cused tests were run enough VA X architectural state to prepare fo r a test using the POTF method, and several bugs were case, simulate the test case on a model, execute the quickly and easily isolated. Because a close correla­ test case on a Wv'C reference machine, compare V�'{ tion between the behavioral models and the imple­ architectural state information from the simu­ mentation as represented by the schematics had lation and VAX reference execution, and finally, been maintained, fe w conceptual, logical design report any discrepancies. Each test case may fo rce errors were fo und by the box-level, POTF simula­ some number of exceptions; the �'CEand MAX tools tions. These simulations were, however, extremely ensure that all exceptions are detected and prop­ useful fo r finding simple schematic entry errors. erly handled. MAX The �'

Digita/ TecbllicalJourua/ Vo l. 4 No. 3 Su mmer 1992 43 NVAX-microprocessor VAX Systems

from the system disk image file, ancl wrote data to a information about a bug was col lected on the proto­ small cache of internally maintained disk blocks. To type system, ancl then the fa iling scenario was accelerate disk transfers, the disk model would reproduced on the behavioral model, where the examine cache state and use the system bus for the scenario could be analyzed and better understood. disk transfers only when the data was present in the The chip-internal signals were extremely difficult cache and required a cache invalidate or write back. to observe, but a 12-bit, allowed access In other cases, the data was transferred directly into to one of eight sets of signals from various sections the memory subsystem in zero simulated time. of the chip. The ability to monitor the control store While tracking the progress of the simulation, address bus by means of this parallel port proved to the team identified operating system code that be an essential debugging feature. executed a time-consuming search algo ritbm. To The control store patching mechanism that was limit the amount of time spent in this loop, the part of the chip design helped identify some bugs code was rewritten to implement a much faster in prototype ch ips. The debugging engineers suc­ algorithm. However, because the booting simula­ cessfully used microcode patches to work around tion effort could not be restarted from the begin­ several of the hardware and microcode bugs. In ning, several utilities were developed that allowed cases where a microcode bug was patched, exten­ the code to be replaced in the system disk image sive system testing verified that the planned change fi le and in simulated memory during a pause in the was correct. simulation. To provide the abil ity to restart the simulation Bug Tracking and Design Release effort and move it to any available computing Bug detection was a key status indicator through­ resource, simulation state was saved after every out the logical verification effort and thus 50,000 to 100,000 cycles of simtdation. In total, J\1VAX helped to steer the team's work. Hugs were tracked approximately 25 million cycles were simulated. carefu lly with an on-line system and analyzed each The simulation was stopped at the point where mul­ week to consider trends, successfu l and unsuccess­ tiple processes were created ancl the main start-up ful bug-finding techniques, and bug hot spots, process began executing. Even though this effo rt which required additional attention. The bug detec­ identi.fied no bugs in the design, it did provide a high tion rate was fairly constant throughout the project degree of confidence that the design was ready to at about 22 per month, with the exception of the be released for fa brication of first-pass chips. last month in which the rate dropped to nearly zero. An analysis of the bug-detecting effectiveness Prototyp e Ch ip Ver ification of each testing technique shows that all test tech­ The prototype NVAX chips were verified in several niques were effective and seemed to complement VAX 6000 Model 600 multiprocessor systems. The each other. Ta ble 1 shows the percentage of bugs CPU modu le was the only new hardware com­ detected by each technique. This table includes ponent in the system; the backplane, memory, and data on the ever-valuable, nonsimulation verifica­

110 subsystem were known to be robust, because tion technique of simply reviewing, inspecting, and they were used in the VA.,'\. 6000 Model 500 system. discussing the design and its many representations. One logic analyzer was connected to the system The decision to release the design fo r fabrication bus, and another was connected to the pins of the of first-pass chips was a consensus decision made by the verification, architecture, and design teams. NVAX chip The strategy fo r the early prototype verification From a verification perspective, the design was was to hoot the low-level console user interface, ready fo r release when the bug detection rate run the I-ICORE suite of tests, boot the VMS operat­ remained at zero fo r several weeks and the majority ing system, and then run the User Environment ·rest of the planned tests had been implemented. The Package (UETP) system exerciser. Within 10 days of verification of some areas of the design was receiving the first prototype chips, all these tasks deferred until after the release of the first-pass had been accomplished. Later, the AXE and MAX design. The development team decided that any exercisers were run on the prototype systems. bugs that might be founcl in these areas wo uld not The rigorous testing that continued on prototype have a significant negative impact on the system systems revealed a few logical bugs which had gone development schedule, whereas additional delay in undetected during simulated veri.fication. Ty pically, releasing the design would.

Digital Te clmicaljo urnal 44 Vol. 4 No. 3 Surnmer 1992 Logical Ve ri(ica tiun of the NVAX CPU Ch ip /Jeslj{ll

Ta ble Bug Detection Using 1. One simple bug was detected by running the 1 Va rious Te chn iques HCORE test suite on the prototYpe system with Percent of the floating-point unit (F-box) llisabled. This bug Te chnique Total Bugs Found could have been fo und in the same way through simu lation, but the test suite was not run as a Focused tests 28 final regression test with the F-box disabled . In Directed, pseudorandom ge neral, fo cused tests like HCORE were not run exercisers 23 with varied chip/system configurations. The ver­ Review, inspection, ification team concluded that all foc used tests observation, thought 20 should be run with different chip/system config­ Detection technique unknown 12 urations. At the minimum, a configuration that AXE 8 disables all possible functions should be H::sted. MAX 6 Cl'll HCORE 2 2. Another hug was discovered because the chip generated spurious writes to memory in the Other prototype system. The exercisers prolx1bly did generate the conditions necessary to evoke this bug; however, the spurious writes went unno­ Results and Conclusions ticed. It is extremely diffi cult to verify that a machine does everything it is supposed to do Only 15 logical bugs were fo und in the first-pass and nothing more. Additional assertion checkers CPU chip design, all of which were either eas­ NVAX or monitors in the models might detect such ily worked around or did not impact normal system bugs in the fu ture. operation. The nan1re of the bugs fo und in the first­ pass design ranged from straightforward bugs that 3. A third bug was evoked when a prototype escaped detection to r clear-cut reasons to extremely system exerciser executed a translation bu ffe r complex bugs that required hours or weeks of rig­ invalidate all (TBIA) instruction under certain ormJs prototype system testing to uncover. Some of conditions. On a rea l system, the TBIA instruc­ the bugs escaped detection during simulated verifi­ tion is used only by the operating system. In our cation fo r classic reasons such as: verification effort, the TIIIA instruction was little used by the exercisers that were simulated. • Tes ting of the function was performed just Operations that are performed only by the oper­ before release, in a hurried manner. ating system should not be underemphasized in exercisers. • Simulation performance prohibited running a certain type of test case. 4. One first -pass bug was related to the halt inter­ rupt, which is used only during debugging oper­ • A test was not run in a certain mode due to the ations. The halt interrupt receivecl minimal difficulty of running it in all possible modes. testing and was not tested at all in any type of exerciser. Discovering this bug was especially • It took an exerciser running on a simul ator a annoying because a similar bug had escaped long time to encounter the conditions that detection by the initial logical verification effort would evoke the bug. fo r a previous VAX imrlementation. This turn of events reinforces the be.lief that there is va lue in • A test was inadvertently dropped from the set of exercisers that were run continuously re viewing the escaped bug lists from other proj­ ects. Also, during the verification effort, there Details about five of the more interesting hugs seemed to be a natural, but erroneous, tendency fo und in the first-pass design fo llow. Included to umlertest fu nctions used infrequently or not is info rmation about how the hug was detected, at all during normal system operation. Such a hypothesis on why the bug eluded detection fu nctions sometimes require extra ::tttention, before first-pass chips were fabricated , and les­ because they may be quite complex and may sons learned from the detection and elimination have been given less carefu l thought during the of the bug. design process.

Digital Te cbuica/ journal liJI. ·I .Vo.. i S111nml!r l')')l NVAX-rnicroprocessorVAX Systems

5. A state bit that needed to be initia lized on power­ development and support of tools. 'l'he CHA GO up was not. This problem was noticed during simulator would not have been possible without initia lization simulation but erroneously ratio­ the significant contributions of Kevin Lackl. Will nalized as being acceptable. Design assu mptions Sherwood provided high-qualitv, top-level guid­ and assertions about initialization should be ver­ ance through all phases of the project. 'fhe system ified through simulation or other means. development groups performed system-level simu­ lations and rigorous prototype testing. The VAX Overall, the 1\f\/A X Cl'll chip logical verification Architecture Group A..'<:E/�LAX team, once again, effort was a success. The pseudorandom testing provided and supported an effective verification strategy detected several complex and subtle logi­ tool. Lastly, the success of the project and the final cal bugs that otherwise probably would not have quality of the NVAX chip logical design are as much been detected by simulation. The extensive simula­ a tribute to the work of the '\f\IA X architecture and tion performed on the schematics-derive(! model design teams as they are to the work of the verifica­ of the chip provided a high degree of confidence in tion team. the design. The go als of producing highly functional first­ pass chips and bug-free, second-pass chips were References both met. Neither the bugs in first-pass chips nor 1. G Uhler et al., "The NVAX ami NVAX+ High­ their work-arounds impeded prototype system performance VA X Microprocessors," Digital debugging in any significant way. and first-pass Te clm ical Jo urnal, vol. 4. no. :) (Summer chips with work-arounds were used in prepro­ 1992, this issue): 11 -23. duction, field-test systems. The verification team corrected the 15 first-pass design bugs fo r second­ 2. D. Donchin et al., ·'The NVAX Cl'l f Chip: CAD pass chips, which were shipped to customers in Design Challenges, Methods. and To ols," revenue-producing systems. Digital Te cl.mical Jo urnal, vol. 4. no. 3 (Summer 1992, this issue): 24-.37 Acknowledgments 3. R. Calcagni and W Sherwood, "VAX 6000 The NVAX logical verification effo rt was perfo rmed Model 400 CPU Chip Set Functional Design a SEG by team of engineers from the microproces­ Ve rification," Digital Te chnical Jo urnal, sor verification group. Members of this team vol. 2. no. 2 (Spring 1990) 64-72. included Wa lker Anderson, Rick Calcagni, Sanjay Chopra, and John St. Laurent. 1\f\/A.,.'( architects Mike 4. D. Bhamlarkar, "Architecture J\

46 vbl. 4 No. .l St1111111er 1992 Digital Te cbuical jounwl Lawrence Chis vin Gregg A. Bouchard Th omas M. We nners

The VAX 6000Mo del 600 Processor

The Model 600 is the newest member of the VAX 6000 series of XJ'vf£2-based, multi­ processing computers. Th e Mo del 600 processor integrates easi�y into existing plat­ fo rms. Each processor module provides 40.5 SPECmarks of pet:formance made

possible by the N11.4X CPU chip. The major VLSI interfa ce chip, called NEXJ'vfl, was created using Digital's internalCMOS-3 design and layout process. Th e ability to design andfa bricate the inte1jace chip internal�)' was critical to delivering a work­ ing CPU prototype module on schedule. Th e aggressive module timing goals were met by employing previous module experience in combination with extensive SPICE simulation.

6 Ta ble Comparison of CPU Performance The Model 00 system is the latest addition to 1 Digital's VA)( 6000 fa mily of midrange sym metric VAX 6000 VAX 6000 multiprocessing computers 1 The Model 600 was Benchmark Model 600 Model 500 designed to be integrated cost-effectively into an existing VAX 6000 platform, and to provide a signifi­ SPECmark3 40.5 15.3 cant perfo rmance improvement over the previous SPECint 30.9 14.2 genera tion of systems. Ta ble 1 compares the perfor­ SPECfp 48.6 16.1 mance of the Model 600 with the prev ious genera­ VUPs 30.2 12.4 tion Model 500 on several important benchmarks. Notes: SPECmark is a quantitative measure of performance, NVAX The powerful single-chip microprocessor determined by running a suite of 10 benchmark programs. enables this level of performance.2 SPECint defines the performance on the subset of the tests that are integer intensive, and SPECfp defines the floating-point Design Goals intensive tests. A VA X-1 1 /780 system has a performance of 1.0 SPECmark by definition. The SPEC tests used for the table The primary goal of the project was to de.liver a values are from the System Performance Evaluation Cooperative, Release 1. VUP is a VAX unit of performance. One VUP equals the module that incl uded the appropriate support performance of a VA X-11/780. fu nctions, performance, and VAX 6000 system com­ patibility. Equally important, the module had to be delivered on schedule to prevent an adverse time­ This paper relates the background and design 6 6 to-market impact on the program. This included process fo r the V�'( 000 Model 00 processor the delivery of a working prototype module before module. The module and its system context are

the first NVAX microprocessor chips were available, described, as well as many of the design decisions si nce a V�'( 6000-based platform would be used to that were made du ring the project. The first section debug the initial NVAX CPU chips. Furthermore, the describes the module in general terms, along with prototype module had to allow the VMS operating some of the trade-offs and choices associated with system to be booted and tested. its development. The second section fo cuses on the Our goals were achieved . When the NVAX CPU NEXM!support chip, which is the primary interface chip, the prototype module, and the NEXMI sup­ device positioned between the NVMC CI'U data and port applications specific (ASIC) address Jines (NDAL), the XMI2 system bus, and the were integrated fo r the first time, the hardware support peripheral ROMBUS. It discusses the very worked almost immed iately. The software team large-scale integration (VLSI) design process used to was well prepared, and the li.l ll V;\'! S system boot create and verify the fu nctions of this interface. The took place 11 days after the hardware was put final section details some of the physical aspects of together. Moreover, the first-pass modules were the module design, including the important work used fo r all the system debugging, and were of suffi­ performed to ensure good module signal integrity cient quality to be used fo r system field test. and thermal management.

Summer 19?1 Digital Te clmical]ourual Vo l. 4 No 3 47 NV -AX microprocessor VAX Systems

Description of the Processor Module 6000 600 CPU Figure I shows the VA.,'C Model mod­ ule. Figure 2 is a block diagram of the module, showing the major subsections. The CPU module

contains two VLSJ components: the NVAX micro­ processor and the NEXMI ASIC interface chip. The module also holds the backup cache static random­ access memory (SRAM) devices, the X.VJL2 corner bus interface, and the supporting logic necessary to implement a VA X 6000 processor node.

The NVAX CPU directly controls its external backup cache SRAMs. When the clata is resident in the cache, the NVAX CPU modifies it there, and implements a write-back scheme 2 When the data

misses in the backup cache, or when an !/0 access is necessary, the NVAX CPU places the command on the NDAL, along with any associated control and data information. The NEXM I accepts and processes the command. The NEXMI Vl.SJ chip provides an interface to the Figure 6000 !Vloctel 600 functions necessary to integrate a VAX 6000 proces­ I T� Processor Mo dule sor module into an XMl2-basecl system. In particu­ lar, the NEXMI chip: CPU • Forwards invalidate traffic to the NVA.;x fo r lookup and potential write back • Translates the NDAL bus commands to XMJ2 bus commands • Controls NDAL bus arbitration

X.NU2 • Returns read data from the bus to tbe NVAX The NEXMI contains a programmable interval CPU (via the NDAL) timer, reset logic, halt arbitration logic, and secure

INPUT OR EEPROM OUTPUT K PORT � XMI2 I n BUS � ROM RAM 6 I ADD I I I TAG STORE I I ------,---l -1 -! -0 B s - -> <,----- R M U

t ! XMI CHIP INTERCONNECT NDAL BUS NEXMI ASIC XMI2 OSCILLATOR INTERFACE CORNER BUS - I ��I -�<.------,> CHIP INTERFACE t

Figw·e 2 Block Diagram of the Model GOO i'Vl odule

48 Vo l. 4 No. 3 Su/Jimer 19')2 Digital Te e/mica I jo urnal The VAX 6000 Model 600 Processor

console logic on chip. It accommodates the rest of non-write-back queue services both the XMI2 logic the support functions through the ROMBUS. and the sse logic. Depending on the fu nction, the The ROMBUS is controlled by and interfaces to the sse logic might then send the command on to the NEXMI; it supplies a path fo r the low-speed devices ROMBUS. Each queue is loaded in the NDAL time that are necessary to create a working computer. domain, and a request signal is sent to the appropri­ The devices on the ROMBUS are off-the-shelf parts ate XMI2 or SSC logic for processing. that contribute specific fu nctions, including ROM On read commands, data is returned from the for the boot diagnostic and console program, a XMI2 responder queue or sse control section, and stack RANI fo r storage of diagnostic/console fo rwarded to the NDAL through the NDAL transmit dynamic information, an electrically erasable pro­ section. The NDAL is then requested, and when bus grammable read-only memory (EEPROM) for saved access is granted the data is driven onto the NDAL to state, a universal asynchronous receiver/transmit­ be accepted by the NVAX CPU. ter (UARD for console communication, an input The XMI2 logic also sends potential invalidate and output port, and a time-of-year (TOY) clock. addresses to the NVAX , where the info rmation is The bare module itself is a sophisticated compared with the existing tag address in the printed wiring board (PWB) with the following indexed backup cache block. An invalidate address characteristics: is nothing more than the address associated with a command initiated on the XMI2 by another pro­ • 11.024-inch by 9.18-inch module size cessor. If the cache block matches, and if the NVAX • 10-layer module with 0.093-inch thickness must relinquish control of the data, the block is

• 4 signal layers, 4 power/ground layers and either invalidated or written back. The choice 2 component/dispersion layers depends upon the type of transaction and the state of the cache block. • 0.010-inch vias In the Model 600, the NDAL arbitration is handled • 5-mil etch/75 mil space minimum fo r signals by the NEXMI chip. The NVAX CPU does not imple­ ment the NDAL arbitration on chip because the • 10-mil etch/40 mil space minimum for clocks microprocessor must accommodate many diffe r­ NEXMI SupportChi p ent types of system platforms. A method of arbi­ tration that is fair and efficient on one type of NEXMI The chip is the routing and control interface system (e.g., a single processor workstation with between the three major buses that reside on the several potential NDAL master nodes implemented VAX 6000 Model 600 processor module. A func­ in off-the-shelf programmable devices) might be less tional diagram of the NEXMI is provided in Figure than satisfactory for another type of system (such NDAL, 3. The three major buses are the XM!2, and as the NEXMI). RO MBUS. The NEXMI chip contains the following The NEXMIand the l\TVAX CPU are the only two functions, each fu nction is associated with one or nodes on the NDALin this implementation, so a sim­ more of the three major buses: ple priority scheme is used. The NEXMl always has

• NDAL bus arbitration highest priority, since it is either returning data or fo rwarding XMI2 bus transactions fo r potential • NDAL receive and transmit logic invalidation and write back. In both cases, some • NDAL/XMI2 queues other entity on the bus is actively waiting fo r the information to be returned. • XMI2 data/invalidate responder queue

• XM12 bus control logic Choice of NEXMITe chnology • System support control (SSC) logic The NDAL is the NVAX CPU external interface bus. It The NDAL receive logic latches and decodes com­ is a 64-bit, bidirectional, multip lexed address aml mands and data from the NVAJ( CPU, and routes data bus that runs synchronously with the CPU. them to one of two queues. If the command is Although it is significantly slower than the internal a write back, consisting of an address and 32 bytes CPU speed (an NVAX with a 12-nanosecond [ns] of write data, it is placed in the XMI2 write-back clock cycle has an NDAL with a 36-ns cycle), it is still queue. All other commands (reads and 8-byte aggressive in many respects, and presented a chal­ writes) are placed in the non-write-back queue. The lenge to design using standard parts. The NDAL

Digital Te cbnical]ournal Vo l. 4 No. 3 Summer 1992 49 Vl 0 RECEIVE WRITE-BACK QUEUE DATA � CMD/ADDR WRITE-BACK � COMMAND/ � LATCH ADDRESS F1EK I �D � =1 lL- c l �vDATA XMI XCI NDAL WRITE-BACK CONTROL /'-- RECEIVE � DATA LOGIC <====> LOGIC p ,v"' = N DAL � ��� B us d 1 8� " XMI COMMAND/ COMMAND/ ADDRESS ADDRESS AND NON-WRITE-BACK DATA

NON-WRITE-BACK � COMMAND/ADDRESS NDAL � 1- OTHER NON-WRITE-BACK TRANSMIT � 1- GRANTS QUEUE LOGIC sse ( NDAL COMMAND/ ARB ADDRESS

OTHER REQUESTS REO RESPONDER sse ROMBUS WRITE CTL QUEUE ,A---" CONTROL � XMI BUS DATA DATA/TYPE LOGIC RESPONDER DATA/TY PE <====> QUEUE

LATCH SSC RETURN DATA/TYPE DEC

KEY:

ADR CMP ADDRESS COMPARE PAR GEN PARITY GENERATOR CMD DEC COMMAND DECODER REO CTL REQUEST CONTROL NDAL ARB NDAL BUS ARBITRATION XCI XMI CHIP INTERCONNECT PAR CHK PARITY CHECK

Fig ure 3 Block Diagramof the NEXMI Chip The VAX 6000 Model 600 Processor

arbitration fo r the next cycle happens in parallel direct contact with the VLSI design, process, and with the data transfer fo r the current cycle, and the manufacturing groups. As our design progressed, fu ll request and grant loop must be performed in we had constant communication with the people one NDAL cycle. In addition, the NDAL data path is who wrote and supported the computer-aided bidirectional. An NDAL master must be able to trans­ design (CAD) tools, the people who had designed mit its data to the receiver within a single cycle, the circuits and standard cell library elements, and then allow time fo r the bus to become tristate the people who would eventually fa bricate the before the next master can drive the bus. chips. At any stage of the project, we could deter­ The original product plan was to use several gate mine how to obtain a performance advantage with­ arrays to implement the module control logic. One out risk to either chip yield or product reliability. gate array would have contained the XMI2 interface When a tool problem surfaced, or when a tool did logic, and the other would have control led the not do exactly what we needed, the issue wou ld be system support functions and interface. It became im mediately addressed and resolved. By having easy access to the individuals who had clear early in the project that the external 110 cells of the gate array could not meet the timing man­ detailed knowledge of the circuits, we could attain dated by the NDAL specification. Furthermore, we the maximum possible speed and density from were unable to obtain timely and accurate SPICE the design. We benefited from the experience of models of the gate array output stage, which com­ the internal fu ll-custom design community, and pounded our general difficulty in determining the profited from access to its complete and accurate design trade-offs for the modtde.4 SPICE libraries. Consequently, we were certain of The use of an external commodity gate array how we could obtain a design advantage, yet still implied another design-related drawback. The nor­ allow the chip to perform reliably under worst-case mal gate array design process included the submis­ conditions. sion of our design fo r chip layout after we had The Digital semicustom design process completely verified it for both logical correctness (described in more detail in the section NEXMI and estimated timing constraints. The timing was Design Process) provides constant feedback on cir­ estimated since the actual timing could not be cuit layout. Therefore, our fu nctional and logical known until the ASIC was ro uted . The gate array timing simulations were always up-to-date with rou ting would be performed by the vendor, and accurate gate and wire delays. By the time we were actual delay numbers would be used to verify the ready to freeze the design (after we had veri.fied design again. This sometimes meant changing logic it fo r logic and timing correctness), only one interconnections to fix timing-related violations, regression run was needed on a minor change from especially i.f the design team had been aggressive in the last iteration. This approach prevented last­ either the chip cycle time or gate usage. The design minute surprises, and resulted in a smooth transi­ would then have to be verified again, and sent back tion from design description, through layout, and to the vendor where the process was repeated until into fabrication. everything worked at speed. Consequently, we decided to design the NEXMI NEXMI Design Issues ASIC chip using Digital's CMOS-3 standard cell pro­ During the design of the NEXMI chip, several inter­ cess at the Hudson, MA site. Once the decision had esting problems were addressed. been made to use the internal process, we realized other significant benefits. We believed that we Interblock Co ntrol Signal Sy nchronization The could collapse the design into one package, since clock that drives the NVAX and NEXJVI I chips runs at we could make use of Digital's chip expertise to 36 ns, and is not synchronized to the 64-ns clock create fu ll-custom sections where necessary. We that drives the XMI2bus interface. Without any spe­ would know early in the design cycle if our assump­ cial care, the data that is generated in one clock tions were wrong, since the design flow uses a sub­ domain can cause the target latching device out­ chip approach. The approximate size of each puts to enter a state called "metastable." This state subchip is known as soon as the first pass of the is characterized by oscillations, or an output volt­ structural design is finished. age level that is neither high nor low fo r an Another major advantage of using Digital's inter­ extended period of time. This can cause unreliable nal CMOS design process was that it afforded us system operation.

Digital Te chuical ]ourual Vo l. 4 No 3 Summer 1992 51 NVAX-mi croprocessor VAXSystems

Traditionally, a synchronizer is provided fo r the state machines, interblock control signals, and control signals between two different clock queue head and tail pointers. \XIe then grouped domains to allow communication without causing them into similar units so that related signals were metastability. A synchronizer is a cascaded set of visible within one control group. Internally, the sig­ latches, allowing the first latch to go metastable, nals were segregated within their boxes, and routed but characterized so that it settles down before the so that they did not adversely affect the top-level second latching device is sampled. The drawback chip routing. to a synchronizer is that it increases the latency of communication due to the cascaded latching NEXMI Design Process devices. The NEXMI chip contains 250,000 transistors in a There are no inexpensive and easy remedies fo r die that measures 0.595 inches by 0.586 inches. It is latency of communication when the system goes housed in a custom-designed, 339-pin, ceramic pin from idle to active. However, we decided to reduce grid array (PGA). This section describes the NEXMI the synchronizer latency on a busy system. Our chip design methodology and the unique CAD tools method optimizes the case where information is that were used to design Digital's largest CMOS-:1 in either the NDAL queue or the X;\'112 responder semicustom chip. (The chip is physically larger ancl queue, waiting fo r transmission when the current has more transistors than any previous standard command has been successfully transmitted. For cell produced using the Digital semicustom pro­ these situations, we created a set of round-robin cess.) This section also covers many of the trade­ synchronizers, with a ring of request and done sig­ offs made during the chip design, and explains the nals in each direction. While one request/done pair reasoning behind our decisions. is being serviced, the pipeline overhead fo r the next The design of the NEXMI chip can be character­ pair is hidden by the overlapping synchronizers. ized by three major phases:

• Behavioral modeling NEXMJQue ues For the three major queues in the NEX;vn chip, we determined reasonable queue sizes • Structural design based on our chip space constraints and perfor­ • Physical chip implementation nunce simulations. System testing on the real hard­ ware during our debugging and system integration The three design phases of Digital's internal phase confirmed that our decisions were correct. CMOS design process significantly overlap each None of the queues cause performance degrada­ other. Each design stage is explained ancl analyzed tion on an actual running system. in this section. We decided to make the non-write-back and the XMI responder queues into integrated queues BehavioralMo deling Phase rather than have separate queues fo r each function. The first major design effort fo cused on describing Queuing theory shows that a shared resource is bet­ the chip functions at a high level of abstrac­ ter utilized when only one queue is served by the tion. This is normally referred to as behavioral or next available resource, rather than having a sepa­ fu nctional modeling, and the design team used rate queue fo r each resource.' Digital's internal hardware description language, DECSIM-BDS, fo r this task (' Functional design sec­ Visibili�J' Po rt A problem that always exists with tions were allocated to different design engineers, dense, complex integrated circuits is how to diag­ and interfunctional block boundary descriptions nose problems that happen in an actual running were specified. system that do not show up during simulation. A behavioral modeling strategy was attractive fo r Although no such problems appeared in the NEXMI many reasons. A wel l-defined hierarchical partition chip, the designers wanted to provide visibility to of the design was quickly realized, and smaller sub­ as many internal states as possible. To this end, we sections (or subchips) within the chip were identi­ created a parallel port to provide visibility to some fied . Behavioral models of each subchip were important internal signals. initially developed to prove functional correctness. Given the size of the design, we could provide As the design progressed, these subchips were only the minimum number of signals fo r visibility replaced by fu nctionally equivalent gate-level struc­ We tried to predict the internal signals that might tural models. Using this mixed-mode fu nctional help diagnosis and adjustment, such as the internal simulation strategy, each designer could progress

52 Vo l. 4 No. 3 Stlllllllet· /')')2 Digital Te cbuicaljournal The VAX 6000 Model 600 Processor

at his/her own pace, from behavioral definition tion process, we were also able to take advantage of to structural implementation, independent of the a wealth of fu ll-custom knowledge provided by our status of other subchips. support groups. One of the original reasons fo r The behavioral modeling effo rt identified many using the internal VLSI process was the tight timing architectural problems early in the design cycle. on the NDAL. Therefore, the NEXMI pad ring was Details such as queue sizes and structure, flow con­ custom designed fo r speed, control, and TTL- level trol mechanisms, and interblock communication compatibility. Other major handcrafted sections protocols were all emphasized, and each decision were the dense, multiported queue stmctures. yielded valuable information about the fe asibility of To coordinate our large, multiple-person chip a single-chip implementation. design, we used the organized chip design (ORCHID) Behavioral modeling allowed a functional file management system. The ORCHID system man­ description of the NEXMI chip to be integrated into ages the files created by each design tool fo r every a system-level model quickly. This enabled the veri­ hierarchical subchip . It contains translation tools fication team to write and debug tests early in the that convert the schematic data into file fo rmats design cycle. As each structural subchip was fin­ accep£ed by the simulation, layout, and verification ished, it replaced its previous behavioral counter­ tools. The system al lowed individual designers to part in the verification model, and was tested for work independently on subchips at various stages fu nctional correctness. This step-wise, integrated of development (schematic entry, simulation, floor approach permitted the tests, models, and logic to plans, layout, verification) yet still maintained a be verified incremental ly. coherent hierarchical design database that could be One major advantage of our behavioral modeling shared by all members of the design team. The idea strategy was that it let the design progress without of a shared database facilitated the reuse of com­ targeting a specific technology. The designers mon logic (e.g., counters, parity trees, testability fo cused their attention on logical implementation devices), and designers often borrowed from each rather than on the technology-specific details, such other to avoid primitive design duplication. as the timing and loading constraints imposed by NEX.Ml a choice of technology. The Choice of Semicustom Design Flow Te chnology section described the process of select­ Figure 4 shows the semicustom design process and ing the C.MOS-3 implementation path. The initial the individual tools that were used during the struc­ behavioral phase of the project aUowed some of the tural and physical implementation phases of the design to be finished before the final choice of NEXMI CMOS technology was made. chip design. ALOE, Digital's in-house graphical editor, was our schematic entry vehicle, and was used in conjunc­ Structural Design and Chip tion with the primitive symbols and models from Imp lementation Phases the CMOS-3 standard cell library (SCL3). The tools The next major project design phase involved map­ within the ORCHID system were used to translate ping the behavioral subchip models into their schematics into generic wirelists with estimated equivalent gate-level structural representations. delays. The schematics were then input to other The design methodology chosen fo.llowed directly tools fo r logic and timing verification. Specific from the decision to implement NEX.Ml using design tools included the internal logic simulator, Digital's C.MOS-3, !-micrometer, semicustom pro­ DECSIM, a timing analysis tool, AUTODLY, and the cess. The semicustom process includes a fully spec­ SPICE circuit simulator. ified library of primitive elements, cal led standard Automated logic synthesis was used to convert cells, similar to the cells included in a gate array some behavioral models into structural entities. library. OCCAM , an internal CAD tool, was used to synthe­ The advantage of the semicustom approach is size a gate-level represen tation of the scattered that it gives the designer ful I control over the place­ address decode, and was able to minimize the logic ment and routing of the individual primitives to meet the aggressive bus timing.7 Controllers or groups of primitives (subchips) within the chip. X.MI embedded within the and sse subchips were This allows the engineer to easily take advantage of synthesized using SMD2SIM, a tool developed at special placement fo r speed-critical paths. Because Digital's Boxboro, MA site fo r another project. This we were using the internal tool suite and fabrica- synthesizer allows large programmed logic array

Digital Te chnical]our1lal Vo l. 4 No. 3 Summer 1992 53 NVAX-microprocessor VAX Systems

------, DESIGN ENTRY I I OCCAM ALOE SMD2SIM LOGIC SCHEMATIC STATE MACHINE I SYNTHESIZER EDITOR SYNTHESIZER I I I ---- ��������� � � ���������� ����� ------l GENIEt ORCHID TRANSLATION TOOL SU ITE I GENERIC WIRELIST I GENERATOR I I I FLT I HIERARCHICAL FLATTENER I I I � I , ------l I DLYEST ATLAS I I I I DELAY WLU/CHAS I TOOL SPICE WIRELIST TWEDT I I ESTIMATOR I I I SUITE EXTRACTOR TWOLF EDITOR I I I I I DLYDIST I DELAY I - - - 1---- 1 I - - - DISTRIBUTOR I I TWOLF DLYCALC SPICECNV TIMBERWOLF I I I I ii DELAY WIRELIST PLACE/GLOBAL I TRANSLATOR I I ROUTER I I I I I SCASM I I I I IDECSIM STANDARD CELL I I I I ASSEMBLER NET LIST I DETAILED ROUTER I I TRANSLATOR I I I --I I FAME ______- FLOOR PLANNER I------t ------I SIMULATION _ l AND TOP-LEVEL I I I I ROUTER I I I -- -- I ------I AUTODLY DECSIM SPICE ------., r ------I TIMING LOGIC ANALOG I ANALYSIS TOOL SIMULATOR SIMULATOR I I I I I MEGAN I I VLSI LAYOUT [ ______I EDITOR [ ______I I I I I I HI LEX DRC HILEX/CUP I I HIERARCHICAL DESIGN RULE CAPACITANCE EXTRACTOR CHECKER EXTRACTOR I I I I I � I I IVCMP I I WIRELIST I COMPARISON I I PROGRAM PHYSICAL CHECKS I L ______

Figure 4 Semicustom Design Flow

(PLA) structures to be realized from a simple LISP­ After each gate-level subchip was created, the like behavioral description. During the course of ATLAS tool suite was used to place and route the the design, controller function changes became standard cells within the subchip. The TWOLF edi­ easier to maintain using a textual description. tor (TWEDT) was used to place special cells, such as

54 Vo l. 4 No. 3 Summer 1992 Digital Te chnical journal Th e VAX 6000 Model 600 Processor

clock buffers and timing-critical gates, within the sizes of some of the driving transistors, based upon subchips. The rest of the subchip was then placed potential problems identified by the XREF program. automatica lly and globally roured using the TWOLF Figure 5 is a representation of Digital's CMOS-3 program, which re lies on a simu lated annealing design process. It shows the chip floor plan, along placement algorithm .x Lastly, the standard cel l with an example of the schematics used to drive the assembler known as SCASM was used to assign stan­ layout process, and a timing diagram created by the dard cel l rows and to complete the cletai.led rout­ simulation program. ing.9 SCASM uses both channel routing and "over-the-cel l" routing, which reduced the overall Ve rification of the VAX 6000Mo del 600 size of the subchip by permitting metal routes over Early in the design cycle, our verification team was the underlying cells. Subchip post-layout timing able to integrate a behavioral chip-level model into information was then fed back into the logic and a CPU module model and perform system-level test­ timing verification tools fo r a more detailed analy­ ing of the VAX 6000 Model 600. These tests gave the sis. The ability to analyze post-layout delay informa­ designers timely fe edback on the effects of their tion and to iterate through the layout process early design decisions on the system as a whole. NDAL/XMI2 HEMAR, in the design phase was crucial to meeting our An bus monitor, called the schedule. was written to verify the bus activity between the The top-level floor plan was progressing in paral­ NDAL and XMI2 ports of the chip, and to log valuable lel with the process of subchip placement and rout­ cycle information. The NEAT, an NDAL cycle emula­ ing. FAME, another tool from the ATLAS tool su ite, tor, was also used to create NVAJ\ transactions on was used for the chip fl oor plan and top- level rout­ the NDAL. It was written to reduce simulation test ing. 1o. 11 Info rmation about top-level routing was fed run times and to ease the test verification process. back to the subchip place and route tools to change Most of the tests written were fo cused tests, tar­ the basic shapes and subchip boundary pin place­ geting a specific function within the NEXMI chip. ment information. This was used to keep inter­ However, a random bus exerciser was also written subchip ro uting to a minimum, which in turn to test unexpected combinations, and was instru­ reduced overall load ing and timing delays. Many mental in catching two design flaws that the successive iterations were necessary to achieve an fo cused test cases had missed. In all, over 100 optimal top-level floor plan that wou ld meet our fo cused tests were written and verified against the timing goals and fit onto a single die. Fortunately, simulation model, giving us a high degree of confi­ much of this work could be done by the designers dence that first-pass silicon wo uld be fu nctional. themselves. The quality of the prototype systems is due in large After the final place and route iteration had been part to this complete verification. done on the the entire chip, an exhaustive set of A subset of the functional tests was also used checks was performed to ensure design integrity by manufacturing to generate chip test vectors. and circuit reliability for first-pass silicon. A VLSI Test vectors were automatically converted from design ru le checker was run on the individual sub­ DECSIM-formatted trace files to Ta keda pattern sets chips, then on the en tire chip, to verify the layout through a special program called TEMPEST. Prior to against the CJ.VIOS-3 process design ru les. As a pre­ the return of first-pass silicon, the generated pat­ cautionary measure, the original chip schematics tern sets were converted back into l)ECSI M fo rmat, were extracted to SPICE wire! ists using an internal to verify that the pattern sets wou ld run success­ wirelist tool, and another program, called fu lly on the Ta keda cl1ip tester. This test pattern H!LEX/CUP, did the same thing using the geometric generation and simulation process reduced the information in the fi nal layout database. A wirelist time needed to debug the test patterns when the comparison program, called IVCMP, compared the NEXJ\11chips became available. two wirelists to ensure that the design we had sim­ ulated was the same one we wou ld build. Mo dule Design Issues Finally, a program called XREF used the capaci­ During the design of the VAX 6000 Model 600 tance from each internal chip node and the topol­ processor, many module-related issues were ogy of the routed interconnection database to considered . The next section describes some of the pred ict the cross talk each signal could expect. more interesting issues that we encountered during Changes were made to the signal rou ting and the the physical module design process. In many cases,

1992 Digital Te chnical jo urnal Vol. 4 No. 3 Summer 55 NVAX-microprocessor VAX Systems

Figure 5 Representation of the CMOS-3 Design Flow

we describe the method that was adopted by the The fo otprints fo r the two SRAM cache chips that design team to prevent problems through careful represented the cache size trade-offs were diffe r­ planning and analysis. ent, which made it difficult to create one prototype module to accommodate both sizes. Creating two Backup Cache Size and Sp eed separate modules to test the different combinations We chose to defe r the selection of the backup cache of cache sizes would have burdened the rest of size and speed until late in the design cycle. This the project, so special surface-mount pads were allowed us to make the decision based upon tbe designed to handle the diffe rent SRAI\1 geometries pricing and availabi.lity of the SRAMdevices, and the on the same module. Once the cycle time of the performance to be gained from combinations of NVAX CPU was determined to be 12 ns, we were able these devices. We also wanted to wait until we to make the fi nal choice on the cache size and SRA.M NVA)C CPU. knew the final speed of the The two speeds. The final decision was to provide a 2MB 512 most likely cache sizes were kilobytes (KB) and cache with 20-ns 256KB by 4 SRAI\1 s fo r the data and 2 megabytes (MB). We fe lt that simulation could 15-ns 64KB by 4 SRAMs fo r the tag. This provided the provide insight into the performance trade-otis , best balance between cost and perfo rmance in the and simu lation studies were performed with both Model 600 system. sizes to establish approximate performance num­ bers. We knew, however, that running real modules on a variety of benchmarks under actual workloads ROMB US Decisions was the most accurate way to determine the perfor­ General-purpose computing systems need a core of mance side of the price/performance trade-otT system fu nctions to supplement the computing

Summer 56 Vo l. 4 No. 3 1992 Digital Te clmical jo urnal The VAX 6000 Model 600 Processor

power of the processor. On the VAX 6000 series of Previous VAX 6000 systems used a system sup­ computers, this includes: port chip for most of these functions. After we

decided to combine as much support logic as possi­ • ROM to hold the boot, diagnostic, and console ble into the single ASIC device (NEXMI), we had to code determine what fu nctions to include inside the • EEPROM to hold items such as boot paths and chip, and what fu nctions to locate external to the error information chip. We saw a potential schedule risk ifsome func­ tions, such as the UART and TOY clock, were placed • Console terminal UART inside the NEX.i\1 1. These functions were relegatecl Time-of-year (TOY) clock • to outside the chip since industry-standard, off-the­

• Battery backed up RAM shelf components were available to perform exactly the functions we needed. • Program mable interval timer Once we decided to implement the UARTand TOY

• Time-of-day register clock outside the NEXMI, and added the normally externalRO M, EEPROM, and input/o utput ports, we • Input ports to sense external switches and other found that the NL'(J\'11 pin cou nt was higher than we state could afford. Since all the support devices were • Output ports to drive module light-emitting slow, and each one was byte-wide by its nature, diodes (LEOs) we solved the pin problem by creating a single, slow-speed, bidirectional bus, which we called the • Reset logic for the module RO MBUS. Figure 6 is a block diagram of the ROM BUS • Halt detection and arbitration and its components. The ROivl BUS reduced the

• Secure console logic NEXtvii pin count by 40 pins to r the same fu nctions.

I I ROM 2 EEPROM UART TOY CLOCK r----1 ROM 1 I I � ROM O

• I I I ROM � ADDRESS

ADDRESS GENERATION lv!Di LOGIC OUTPUT PORT 0

ROMBUS OUTPUT CONTROL DEVICE SELECTION PORT 1 r--- LOGIC

.. I INPUT PORT I

RAM ROM BUS I I DATA<7:0>

ROM BUS COMMAND<3:0> NEXMI CHIP

Figure 6 Block Diagra m of the ROMB US

Digital Te chttical]ournal Vo l. 4 No. 3 Summer 1992 57 NVAX-microprocessor VAX Systems

Using the R01'-'I BUS fo r the UART and TOY clock ceramic through-hole PGA on a 100-mil grid was the reduced the risk entailed in designing them inside best trade-off. the semicustom NEXMI ASIC, but the ROJ\'IBUS itself was eventually burdened with 12 separate devices. NDA L and Backup Ca che Simulation The most This presented a large capacitive and direct current important modul.e-level signal integrity analysis load to all the components on the bus. The UART was the extensive characterization of the NDAL and and TOY chip in particular were not well suited to backup cache. As the interconnection bus between NVAX this large load. We attempted to find CMOS-equiva­ the and the NEXMI, the NDAL controls not Jent devices fo r all the TTL components to eliminate only the transmission speed between the compo­ the direct current problem, but were unsuccessful. nents, but also the speed at which the NVAX can run We finally decided to spl it the bus into two sec­ (since the NDAL is synchronous to and scales with tions, and place all the CMOS low-drive-capability the NVAX speed). The speed of the backup cache devices on a separate segment. A transceiver con­ was a perfo rmance concern for obvious reasons. nected the two segments. We determined that estimates of the intercon­ Given that most of the devices on the ROMBUS nect and routing would be insufficient fo r our were non-Digital components, we had the normal aggressive timing goals. Several trial layo uts were problem of obtaining accurate SPICE models to performed to provide accurate input fo r the SPICE determine the ROMBUS timing. Extensive lab testing simulations. The I/O drivers fo r both the f\TVAX and was performed on the devices to characterize their the NEXM! chips were known to be complete and output delays under different capacitive loading accurate, since they were both designed internally. conditions. These loading tests helped generate The timing requirements fo r rei iable operation SPTCE models fo r the RO MBUS simulations. were also well understood for the same reason. The module-level SPICE simulations provided guidelines Signal Integrity Considerations about the length and rou ting rules associated with Module signal integrity analysis was one of the most each high-speed signal trace, such as: important aspects of the project. A system that • Daisy-chain routing of all signals includes components running as fast as the NVAX ami NEXMI can easily see performance degradation • Clock routing scheme (e.g., matched length and or unreliability if the module signals are not care­ termination) fully placed, routed, and terminated where neces­ • Maximum length requirements fo r each type of sary. The past experience of the signal integrity network team played a major role in the eventual success of the process. Several approaches were taken for this • Treeing, or ordering, requirements fo r networks aspect of the design. • Impedance of different networks

Pa ckage Selection SPICE simulations were per­ These requirements were used as design guide­ formed on each level of the design. This included lines. A cross-talk prediction program within the the chip, package, module, and backplane. Even layout tool verified that the coup I ing between sig­ though each particular simulation analysis was nals was within an acceptable range. After the proto­ fo cused on one aspect of the design, we under­ type modules were delivered , measurements were stood that each individual area affected the entire taken of all the critical signals on the module, show­ system. For example, the selection of a package ing excellent correlation with the SPICE results. spanned several important levels of design hierar­ chy. The connections between the die pads and the ThermalMa nagement module signal pins, as well as the connection of the During the early phases of the module design pro­ package to the board, needed to be considered; as cess, the power dissipation of the NVAX chip was did the module layout that accompanied each pack­ estimated to be as high as 20 watts, though the final age size and type. One aspect of the signal integrity figure for the 12-ns component that we shipped might show that a particular package type was with the product was 14 watts. Cooling such a part superior to another, while another aspect might presented a challenge to the module designers. favor a different approach to module component Since the Model 600 was an upgrade option in the 6 interconnection. Electrical SPICE simulation deter­ VA){ 000 fa mily, there was no possibility of system mined that the performance and rel iability of a modifications to improve the thermal design.

Summer Digital Te cbnical]ourual 58 Vo l. 4 No. 3 1992 Th e VA, \' 6000 Model 600 Processor

Figure 1, which is a picture of the Model 600 Engineering and Computer Sciences, Univer­ modu le, shows that the heat sink fo r the NVAX is sity of California, Berkeley. larger than the package. The fi nal dimensions of the 5. S. Ross, Introduction to P1-obability Models heat sink are 3.285 inches by 3.285 inches by 0.325 (Orlando, Florida: Academic Press, Inc., inch, while the PGA package is only 2.2 inches by 2.2 1985). inches. One of our limitations was the maximum component height of 0.420 inch on side 1 fo r any 6. M. Kearney, "DECSIM: A Multi-Level Simlila­ module in a VAX 6000 backplane. This restriction tion System fo r Digital Design," Proceedings fo rced the heat sink to grow wider rather than of IEEE International Conference on Com­ fVAX (1984): higher. The chip was also placed closer to the puter Design: VLSJ in Cmnp uters edge of the board , where the airflow provides max­ 206 -209. imum cooling. The final size of the NVAX heat sink 7 R. Brayto n, ASV, and Wa ng, "MIS: A Multi­ was a compromise between system requirements, A. ple-Level Logic Optimization System," board area, and thermal performance. IEEE Transactions on Compute r-A ided Design, CAD-6, 6 1987). Summary vol. no. (November The decision to use Digital's proprietary tool suite 8. C. Sechen and A. Sangiovanni-Vincentelli, and fabrication process was proved correct by the "Timberwolf 3.2: A New Standard Cell Place­ quality of the VA X 6000 Model 600 and its delivery, ment and Global Ro uting Package," Proceed­ on schedule, fo r NVAX CPU debugging and product ings of the 23 rd Design Automation sh ipment. Access to accuratein formation about the Conference (1986): 432 components allowed decisions to be made early, 9. ). Reed, A. Sangiovanni-Vi ncentelli, ancl M. and entailed less risk. This advantage, coupled with Santamauro, "A New Symbolic Channel a seasoned module development team and exten­ YACR2," Router: IEEE Transactions on Com­ sive functional, timing, and circuit simulation puter-Aided Design, vol. 4 (July 1985): 208. resu lted in a successful project. 10. N. Chen, C. Hsu, and E. Kuh, 'The Berkeley Acknowledgments Building-Block (BBL) Layout System for VLSI Design," Digest of Te clmical Pap ers, The authors would like to acknowledge the fo llow­ IEEE International Confer·ence on Computer­ ing people for their contributions to the VA X 6000 Aided Design (1983): 40. Model 600 Processor: Chuck Benz, Mike Gowan, Chris Houghton, Dave Ives, Ke ith Johnston, Mike 11. M. Marek-Saclowska, "Two-Dimensional Kagen , Diane Kirkman, Doug Koslow, Bill LaPrade, Router fo r Double Layer Layout," Proceedings Don MacKinnon, Lisa Noack, Del Ramey, Sharad of the 22nd Design Automation Conference Shah, Steve Thierauf, Mike Wa rren, and Beth (1985): 117 Zeranski.

References and Note

VA X 6200 1. B. Allison, "An Overview of the Family of Systems," Digital Te chnical]ournal, vol. 1, no. 7 (August 1988): 10-18.

2. G. Uhler et a!., "The NVAX and NVA X+ High­ performance VA X Microprocessors," Digital Te chnical jo urnal, vol. 4, no. 3 (Summer 1992, this issue): ll-23.

3. 3, 4 SPEC Ne wsletter, vol. no. (December 1991 ).

4. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical

Digital Te chnical jo urnal Vo l. 4 No. 3 Summer 1992 59 jo nathan C. Crowell Kwong-Ta kA. Cb ui

Th omas E. Kop ec Sam;yoji ta A. Na dk.arni Dean A. Sovie

Design of the VAX 4000 Model 400,500, and 600 Sy stems

The design of Digital'sNVAX CPU chip prouided the opportuni(J! to bring RJSC-class

pe1jormance to deskside crsc VA)(com puter �J IStems. Tb e new l'l!X 4000 Mo del 400, 500, and 600 low-end systems take fu ll aduantage of the pe1jormance capabilities

of the Nl'i:l.X microprocesso1: The three sys tems offerji·om two to fo ur times the per­ fo rmcmce of the previous top-of-the-line VA X 4000 Model 300 sy stem in the same deskside enclosure. To achieve this increased pe1jormcmce, Digital's systems engi­ neers designed a neu' higb-pe1jonnance memory controller chtjJ as part of the CP U module, whose basic design is shared by the three sy stems. In addition, a high-per­ fo rmance memmy module and a VLSJ bus adapter cbip were designed.

The design of Digital's NVAj"\ high-performance goals. As a result, the new rop-of-the-line VA X 4000 microprocessor offered systems engineers the Model 600 system performance is fo ur times that opportunity to design computer systems with sig­ of the Model 300.

nificantly improved performance. 1 The project Achieving the performance goals required

structured to use the l\TVAX CPU chip to upgrade the designing a new, high-performance memory con­ VA)< 4000 line of low-end deskside computers troller chip called the NVAX data and address lines resulted in a fa mily of three new systems and the (NDAL) pin bus memory controller (NJYIC). The associated CPU modules. These systems share a objectives were to double the memory bandwidth basic CPU module design but offer a range of per· of the VAX 4000 Model 300 system and to provide fo rmance capabilities. This paper fi rst presents a total system memory capacity of 512 megabytes the goals of the development project and then (MB). To support the NMC specifications, a high­ describes the architecture, design, and implementa­ performance memory module called the MS690 4000 400, 500, tion of the resulting new VA)( Model was designed. 2 and GOO systems. To reduce hardware and software development cost and the risk of fa iling to meet project sched­ ules, the system design incorporated all existing Goalsof the VAX 4000Pro ject high-performance 110 adapter chips. These devices Schednk and performance goals were of prime include the second-generation controller concern to the engineers committed to upgrad ing chip (SGEC), the Digital Storage Systems Inter­ the VA X 4000 fa mily of computers. Time-to-market connect (DSS!) shared-host adapter chip (SHAC), was a key goal of the development project. the CVfu"\ Q22-bus interface chip (CQBIC), and the

Consequently, ful ly qual i.fi ed systems were ready to system support chip (SSC) L' ' A very large-scale be sh ipped to customers when the NVAX CPU chip integration (VLSI) bus adapter chip was required to was released ancl available in volume. The initia.l provide CVAX pin (CP) buses to connect these 110 goals fo r performance sp ecified that the new sys­ devices.; The NDAL- to-CP bus adapter chip (NCA) tems provide three times the performance of the was designed to meet this need. 4000 300 The three CPU modules designed to upgrade VA.\: Model system. Ultimately, the perfor­ existing 4000 Model 300 systems retain the mance of the NVfu\: CPU chip exceeded its design VAX

60 Vol. 4 Nu. 3 Su11nner 1')92 Digital Te cbnical jo urnal Design of the VAX 4000 Mode/ 400, 500, and 600 Sys tems

same BA440 system enclosure used fo r these sys­ buses, a thick-wire and ThinWire Ethernet adapter, tems. The upgrade requires that the older MS670 a Q-bus adapter, and the console serial line. The memory module used in the Model 300 be replaced CPU module performance is 16, 24, and 32 times the with the new, higher-performance MS690 memory performance of the well-known VAX-1 1/780 system, module. The new systems had to support all of the fo r the three new VAX 4000 systems, respectively. Q-bus option modules that were supported on the The CPU cycle clock speed and cache sizes deter­ VAX 4000 Model 300. mine the product performance, as discussed in the section CPU-cache Subsystem. Sy stem Overviewof the VAX 4000 The backplane in the system enclosure provides Models 400, 500, and the signal interconnection and power distribution 600 between system components. There are connec­ The BA440 system enclosure shown in Figure l sup­ tors and slots fo r the CPU module, fo ur slots fo r ports the VA)( 4000 Models 300, 400, 500, and 600. MS690 memory modules, and seven Q-bus slots. This pedestal enclosure was designed to operate in The CPU module has a 270-pin connector that an open office environment. To allow the systems receives the module power and connects the CPU to operate quietly, the cooling fans are speed con­ to the NVAX memory interconnect (NMI) bus, the trolled, based on the ambient temperature. The system DSSJ bus, and the Q-bus. The backplane was enclosure power supply provides 644 watts of modified to support tile wider 72-bit clata path of direct current from a standard IS-ampere wall cir­ the new MS690 memory modules. This new back­ cuit. The system was designed and qualified to plane was phased into the BA440, enabling most operate in an environment with a temperature VAX 4000 Model 300 systems to be upgraclecl with­ range of from 10 to 40 degrees Celsius. out requiring a backplane change. The system enclosure supports up to fo ur DSSI or small computer system interface (SCSI) tape inte­ INTEGRATED STORAGE grated storage elements (ISEs). These cableless ELEMENTS bricks support either one 5.25-inch, fu ll-height drive or two 3.5-inch drives. The ISEs are available in variants that support the 2-gigabyte (GB), RF73 DSSI disk drive and dual 85MB, RF35 DSSI disk drives. The single-system pedestal can support six RF35 devices and a tape drive, providing 4.8GB of storage fo r applications that require high I/O rates. This RF35 configuration can provide over 360 queued 1/0s per second fo r random 1/0s. If RF73 drives are used,

FIVE DEDICATED the single-system box can provide 8GB of storage. SLOTS BEHIND There are several ways to expand the base VAJ( CONSOLE MODULE (CPU PLUS FOUR 4000 system; the most common way is to expand to MEMORY) another DSSI-based system ancl create a two- or FANS three-node DSSI VAXcluster. The Q-bus in the V�'< 4000 system can be expanded to provide 10 addi­ KEY: tional Q-bus slots to each system using the B213A CARD CAGE D Q-bus expansion enclosure. The DSSI expansion enclosures together with the Q-bus DSST adapter (KFQSA) can expand the total available clisk storage Figure 1 BA 440 System Enclosure to 28 DSSI disks. Using the RF73 disk allows up to 56GB of disk storage. The new CPU modules, differentiated only by the part numbers KA675, KA680 and KA690, are utilizecl CPU Module as the engines fo r the VA)( 4000 Models 400, 500, The CPU moclule common to the three new VAX and 600, respectively. All three CPU modules, 4000 systems is based on a highly integrated CPU henceforth referred to as the CPU module, provide ancl l!O system built on the single 21.6-by-26.7- the same l!O fu nctionality, including two DSSI centimeter (8.5-by-10.5-inch) module shown in

Digital Te cbnicaljournal Vo l. 4 No. :) Summer 19')2 61 NVAX-microprocessor VAX Systems

Figures 2 and 3. The CPU moclu le printeel wmng (ns) with a slip cycle, i.e., a two-cycle read (20 ns) board (PWB) consists of the fo llowing subsystems: a using 8-ns SRAMs.Thi s write-back caching architec­ central processor and its associated three-level ture significantly reduces the demands on main cache; a pin bus, bus adapter, and memory control­ memory by caching both reads and writes without ler; and an 1/0 system with integrated control.lers fo r the need fo r a memory access. On all previous VAX DSSI and Ethernet buses. The CPU module also con­ 4000 systems, the caches required that all write tains a CQBIC and 512KB of field erasable program­ operations continue through to main memory, i.e., mable read-only memory (FEPROM) fo r console code. write through. The NVAX CP is clocked by a differential emitter­ CP U-cacheSu bsystem coupled logic (ECL) surface acoustic wave oscilla­ The CPU-cache subsystem is built around the single­ tor. This oscillator runs at 250 megahertz (MHz) chip NVAX CPU, which provides a three-level cache (16-ns cycle time), 286 MHZ (14-ns cycle time), or architecture. The first two levels of cache, which 333 MHz (12-ns cycle time) on the KA675, K.A680, are contained on the chip, include a 2KB virtually and KA690 CPU modules, respectively. The NVAX addressed instruction cache and an 8KB physically chip produces a fo ur-phase internal clock directly addressed instruction and data cache. The thircl from this input and generates system clocks at one­ level of cache, the backup cache, is constructed third the internal clock rate. using static random-access memories (SRAMs) on The new CPU module design supports either a the module ancl is completely controlled by the 128-kilobyte (KB) or a 512KB backup cache. (512KB NVAX CPU chip. The backup cache was designed to for the KA690 module and 128KB for the K.A680 and support a CPU cycle time as low as 10 nanoseconds KA675.) The tag store for the two cache sizes can be

NDAL-TO-CP BUS SGEC ETHERNET SHAG DSSI ADAPTER CHIP (NCA) ADAPTER CHIP ADAPTER CHIP

NVAX MEMORY CVAX 0-BUS SHAG DSSI CONTROLLER (NMC) INTERFACE ADAPTER CHIP CHIP (CQBIC)

Figure 2 CP U Module

62 Vo l. 4 No. 3 Summer 1992 Digital Te chnical journal Desig n of the VAX 4000 Model 400, 500, and 600 Sy stems

ETHERNET DSSI BUS

v v

SGEC SHAG DSSI ETHERNET ADAPTER ADAPTER CHIP 128KB OR 512KB CHIP 1 B-CACHE SRAMS J I I NDAL-TO-CP CP BUS 1 ll r------BUS ADAPTER NVAX CHIP (NCA) CPU f-- SYSTEM FPU L SUPPORT P-CACHE r-i[ CHIP (SSC) (\J ,-- B-CACHE NVAX (/)

CONTROL MEMORY co CONTROLLER :::)a. (NMC) u J J '--- CVAX 0-BUS SHAC DSSI ADAPTER OSCILLATOR INTERFACE J �">. '----- CHIP (COBIC) CHIP A A L ,.s- L I BA440 CONNECTOR NVAX 0-BUS MEMORY �DSSI BUS INTERCONNECT (NMI)

Figure 3 Block Diagmm of the CP U Module constructed from SRANis that have the same 24-pin NDA L Interconnect, Memory package with compatible pinouts. This packaging Controller, and Adapter makes it easy to design the PWB to support either The NDAL bus is a synchronous, multiplexed cache . However, the data store, which is construc­ address and data, pended interconnect. Each ted from parts whose packages have incompatible device on the NDAL bus may be a commander pinouts, requ ired a special dual footprint design (request data transfer), a responder (respond to to accommodate either I6K-by-4K or 64K-by-4K commander requests), or both. In the new VAX SRA.\1s. This fo otprint is designed with the decou­ 4000 systems, the 1\rvA X CPU chip is only a comman­ pling capacitors and series edge-limiting resistors der, the NMC is only a responder, and the NCA 1/0 carefully placed in the outline of the fo otprint to adapter is both J commander and a responder. aiiO\v a dense, geometric pack ing of the 18 RANis Arbitration of the NDAL interconnect is per­ necessary fo r the data store. fo rmed by the N.VIC, which is also the default master The backup cache is a write-back cache with responsible fo r driving valid no-operation bus memory coherence maintained through a direc­ cycles when there is no master activity. The NVAX tory-based broadcast coherence protocol. 1 When CPU is responsible fo r watching all NDAL traffic and NVAX CPU the needs to write data to memory, the performing any invalidates or write backs of pri­ data is first transferred into the backup cache with a mary and backup cache data required to maintain request fo r write privilege command. Once a cache cache coherence with memory. row is stored in the cache as "written," any direct memory access (DiVlA) read to that memory address of displacement of that cache row will result in a I/ O Buses release write privilege transaction, even if the data The CPU module uses a custom third-genera­ was never actually written. The cache controller tion complementary metal-oxide semiconductor inside the 1\rvAX CPU is responsible fo r all activities (CMOS-3) process 1!0 adapter (the NCA chip) to related to cache maintenance. interface between the NDAL bus and a pair of 32 -bit

DigiltllTe chnical jo urnal . vbf. 4 No .3 Sunnner l992 63 NVAX-microprocessor VAX Systems

CP buses. Two CP buses are used to prevent long requirements were met by using a strong driver in response latencies on the Q-bus from interfering the NVAX chip and a parallel R-C termination at the with buffer management on the Ethernet interface. far end of two of the three stubs. The R-C termina­ One CP bus, CPl, operates under the synchronous tions absorb some of the incident energy, reducing Cl' bus protocol, and the other, CP2, uses the asyn­ the reflections to an acceptable amount and allow­ chronous protocol. The faster peripherals, the ing incident-wave switching without the reflected Ethernet and the two DSSI adapter chips, reside on wave reentering the threshold region. CPl and can take advantage of optimizations in the The backup cache data store was designed with NCA to reach a peak bus bandwidth of 33MB per strong drivers and incident-wave switching on the second (MB/s). CP2 has only one peripheral DIY!A address Jines. The routing of a representative cache master, i.e., the CQBJC, which also connects the sse address signal is shown in Figure 4. The stubs were and the console FEPROM to the system. The NCA arranged in such a way that the unavo idable reflec­ acts as a master on both CPl and CP2. Since only tion from the far end of the lines was reduced by one of the buses is asynchronous, the system uses a partial reflection from the center junction. Thus, only one CP bus clock (CCLK) chip fo r signal syn­ signals traverse the threshold region fairly cleanly chronization and CP clock distribution. and settle rapidly outside the threshold region, i.e., The two CP buses share the same clocks. Con­ approximately 3 ns elapse from the beginning of sequently, arbitration is performed by a single pro­ the transition at the driver until the time when grammable sequencer, which serves both buses. a valid signal arrives at the receiver. Tl1e sequencer is clocked at 35 ns (KA680 and KA690) or 40 ns (KA675); this is one-half the CP Printed Wi ring Board Physical Design cycle time. The arbitration for CP2 is a simple two­ priority scheme with the CQBlC at the higher prior­ The NVAX CPU chip draws several amperes of cur­ ity. When more than one master is requesting the rent from the 3.3-volt power supply. This current draw has significant high-frequency components. bus, the minimum DMA request deassertion times The integrated decoupling capacitor on the NVAX on both the CQBIC and the NCA effectively make this scheme behave like a round-robin arbiter. The die helps eliminate some of the high-frequency cur­ internal state does not have to keep track of the pre­ rent pulses, but much of this current must be sup­ plied by the module-level clecoupling capacitors.1 vious master. No special treatment is required for Charge stored on the module in these decou­ lock crcJes because the CQI3ICwil l never perform a pling capacitors supplies this current. An y induc­ lock on behalf of the Q-bus. tance in the path of the current reduces the effectiveness of the capacitors by limiting the rate­ Signal Integrity of-change of the current. In addition, the larger the Signal integrity work began very early in the project, physical area enclosed by the current path, the and the effort was a close collaboration between more radio frequency (RF) energy will be radiated the CPU module design team, the design teams fo r into space that must be contained by the enclosure the three VL5 1 devices, and the VA.'\ 6000 Model 600 in order to meet regulatory radiation requirements. CPU module design team. Because some CPU mod­ These two issues led to the exploration of how to ule team members had experience with designing minimize both the inductance of the decoupling CP bus modules, the signal integrity issues on the path and the physical area of the RF current spread. CP buses were generally well understood. The CP The internalPWB standard, as it existed when the bus data lines were routed with only a length con­ new CPU module was being designed, required a straint. The control signals on CPl required signifi­ minimum of 25 mils (0.025 inch) of surface etch on cant analysis and SPICE modeling to meet both any device before a via could be dropped into an settling time and waveform requirements. inner layer. Traditionally, the inductance of this The most critical signals in the backup cache connection was reduced by using a wide (i.e., 25- are the output enable and write enable signals. The mil) etch for this connection. The inductance of output enable deasserting edge must be transi­ surface etch on the module lay-up used on the new tioned quickly to avoid tri-state contention on the CPU module (calculated with two-dimensional cache data lines. The write enable signal must be transmission line [TDTL)) is shown in Table 1. perfectly monotonic through the threshold region, Ta ble 1 provides the data to calculate the total because it is an edge-sensitive signal. Both of these inductance of a pair of 25-by-25-mil etch segments,

64 Vo l. 4 No. 3 Summer 1992 Digital Te cbnicaljournal Design of the VA X 4000 Model 400, 500, and 600 Sys terns

0 0 0 0 00 00 o ooc EXAMPLE OF A o ooop NVAX oo oo oooc SINGLE AD DRESS ---...... __o o o o p oooooo ooooooo o ooo o o o c LINE ETCH p ooo ooo ooooooo oo o 0 0 0( �p o o ooo o ooo o o ooo o 00 0( o o o o o o ooo o o o o o ooo o o o p c � c 000 00000 0000 0 0 0 0 000 0 0 0 c N�CA � c 000( r------0 0 0 c 0 0 0 0 0 -- 00 ,- 00 ..- 00 00 r-f- 0 0 gg 0 0 0 I 00�� �g 00 �00g DUAL C:o00 g0 00 00 00 00 0 gg 0 0 0 0 0 SRAM 0 0 ( 00 00 00 0 FOOTr � 0 00 00 00 g C � r, ( 00 00 goo oo 0 00 ( 00 00 00 00 00 0 cN�MS []00 00 00 00 r00 0 00 00 00 00 I 00 0 oooc 0 0 0 ( - - r-- ,..----- 00 0 ( 0 0 0 ,-- -I- 00 -I- 00oc .--- -r- 0 0 ggoo g� 0 gg_ 000( oo 00 00 �00g �00 �00g gg00 �0 00 oc 00 00 00 00 00 0 0 0 0 0 0 ooo c 00 00 00 00 00 0 00 00 00 00 00 0 00 oc 00 00 00 00 00 0 00 00 00 00 00 0 0 0 0 c []00 00 00 00 00 0 0 0 00 00 00 00 0 000 ( oooc ,..----- ,..----- ,..----- 0 0 0 0 oooc 00 -I- 00 00 r--f- 00 0 o ooc 00 00 -I- 0 0 -I- 00 00 � 0

00 00 0 o ooc 00 00 (gf:;�00 00 00 0 00 00 00 00 oo 0 0 0 0 0 0 00 00 00 00 00 0 00 00 00 00 00 0 00 00 00 00 00 0 00 00 00 00 00 0 []00 00 00 00 00 0 00 00 00 00 00 0

F(!!,ttre 4 Backup Cache Treeing i.e., approximately 0.4 nanohenrys (nH). (This mea­ inductance of the high-quality, 1,000-picofarad surement is approximate, due to the short dimen­ (pF) Rf capacitor used on the CPU module is sions of the segments.) The effective series approximately 1 nH, including the inductance of the vias connecting the capacitors to the power

planes. The reactance as a fu nction of frequency fo r Ta ble 1 Inductance of a Su rface Etch the inductive component of this clecoupling system Etch Width Inductance is shown in Ta ble 2. Because the resulting impedance is still quite 10 mils 12.2 nH/in high at upper frequencies, multiple capacitors are 25 mils 8.1 nH/in used in parallel. The 1,000-pF capacitors are cho­

Note: This table represents the results of the inductance of a 0.5- sen from two different case styles to ensure that ounce copper strip on a 14-mil FR4 epoxy glass laminate the parasitic inductance of the capacitors is not dielectric over a ground plane. identical fo r all the high-frequency decoupling

Ta ble 2 Reactance as a Function of Frequency for Decoupl ing with and without Dispersion Etch

Reactance Frequency With Dispersion Without Dispersion (MHz) (Ohms) (Ohms)

100 0.89 0.63

200 1.76 1.26

400 3.52 2.51

600 5.28 3.77

Digital Te chnical Jo urnal Vo l. 4 No. 3 Summer 1992 65 NVAX-microprocessor VAX Systems

capacitors. This method of selection staggers the moclnle would lu ve scan capabilities. Thus, the frequency of the parasitic resonances. overall module test strategy could not be based entirely on the ]TAG specification. Cy cleDesi gn Goal and Te sting As the teams reviewed the overall module Although the original design goal fo r the CPU mod­ design, certain issues appeared to show promise fo r ulewas a 12-ns CPU cycle time, as knowledge about the application of a scan-based test: Digital's fo u rth-generation complementary metal­ 1. The automatic test equipment (ATE) pin density oxide semiconductor (CMOS-4) process increased, was likely to be very high in the areas of the the design teams investigated the critical paths fo r a three 339-pin pin grid arrays (PGAs) Reducing 10-ns operation. At this time, the CPU module had this high density would improve the reliabilitr of not been routed, so a 10-ns cycle time was set as the the test fixturing. layout goal. The cache loop layout resulted in a mea­ 2. Previous experience with module manufacture sured requirement that RAMs have an access time of in Digital's plants showed that the risk fo r solder approximately 8.5 ns to meet worst-case timing defects would be high in the cache area, because with no added slip cycles. of the J-lead SRA.lVls. A fast way to isolate these RANismeeting this specification were not readily problems would help reduce debug time. available when the CPU module was designee!. However, several vendors are beginning to ship 3. Providing at least a driver or a receiver fo r use by devices at this speed today. The NDAL data lines the JTAG scan ring wou ld eliminate the need to were capable of running at the 30-ns NDAL cycle use continuity structures to verify bonding and that is generated with a 10-ns CPU clock. The point­ solder integrity on the VLSI parts. to-point NDAL arbitration signals are the tightest Eventually, the module and chip teams settled on timing path on the NDAL. The combination of analy­ a subset implementation of the .JTAG scan-based sis by the NMC team and a careful hand-routing of test. The NVAX CPU chip implements scan latches these signals by the module team allowed appropri­ on all data ancl control pads (inpu t and output, in ate NMC speed binning (i.e., sorting the chips based the case of bidirectional pads). Thus, the cache on correct operation at the fastest possible speed) SRAMs can be tested using only the JTAG port, and to meet the timing requirements fo r a 30-ns NDAL the NVAX CPU can act as the driver fo r scan testing cycle goal. of the NDAL interface. The NMC and the "iCAim ple­ Ve ry few NVAX CPU chips were available that ment receive-only scan, so that the NDAL interface wou l.d function at 10 ns over the Ji.I ll range of volt­ could be tested with no ATE pins required. The age and temperature. However, empirical signal­ external tester was able to test the remaining pins delay measurements and limited-range module solely by driving signals that could be scanned out testing have proven that the NMC and the NCA are of the pad latches. This subset implementation pro­ ready to operate on the CPU module at this speed . vided the same module-level coverage as would An NVAX CPU that functions at this 10-ns speed can have been possible using a fu ll J1AG implementa­ operate the cache with no slip cycles using the tion. In addition, the implementation removed faster SRAMs. Future products may be based on an some design obstacles that were causing implemen­ NVA.-'\ running at a 10-ns cycle, as sufficient yields at tation problems in the chips. this speed bin are reached. The ability to use scan-based testing is advanta­ geous to the manufacturing process in the fo llow­ ModuleTe sting ing two areas: The module and chip teams considered more than l. Tl1e scan tester can find open circuit defects one approach when determining what module­ in the cache area where the bed-of-nails tester level testability features to implement in the Vl.SI could not resolve whether tbe problem was a fix­ devices Digital was building for VAX 4000 Model ture contact problem or an actual open circuit. 500 computers. The scan-based Te st Access and Boundary Scan Architecture (JTAG) proposal was 2. The ability to create "virtual test points" on coming into its own, ancl the teams desired to fo l­ scanned nets has allowed the test coverage of low that specification, if scan-based test fe atures the bed-ofnails tester to be expanded without were to be used 6 However, no other devices on the having to purchase an expensive tester upgrade.

66 Vo l. 4 No. 3 Slimmer 1992 Digital Te cbnicaljournal Design of the VAX 4000 Mode/ 400, 500, and 600 .�ystems

Unfortunately, the module and chip teams' previ­ clock. The NMI timing scales with the NDAL clock ous experience with scan-based testing at the mod­ cycle time. ule level had been spotty at best. The ability of this In this section, we describe the architecture of test to reduce the pin density, therefore, was not the NMC, the objectives of the NMC project, and the used to fu ll advantage in the problem areas under results of the effort. the large PGA devices. Based on the experience test­ ing the CPU module fo r the three new VAX 4000 sys­ NMC Architecture tems, fo llow-on products have been able to use this As shown in Figure 5, the NMC is partitioned into testing fe ature to good advantage. The teams now SL'X major sections: the NDAL arbiter, the NDAL inter­ have a firm base of experience on which to base face, the transaction handler, contro.l and status reg­ h1ture test strategies. isters (CSRs), the memory interface, and the O-bit interface. The NMC responds to all memory space The NVAXMemor y Controller addresses when NDAL address bit 29 is equal to 0 The NMC is a 520-hy-500-mil custom chip fabricated and responds to I/O space addresses in its al.locatecl using Digital's !-micrometer CMOS-3 process and range, i.e., 2101 0000 . . 2101 FFFF (hexadecimal). contains 148,000 transistors packaged in a 339-pin The NDAL arbiter gives highest priority to the PGA. The NMC provides the interface between the NMC fo r returning read data. The two l/0 nodes NDAL bus and up to 512MB of main memory by have second priority; their requests are hand led in means of the N.MI. The NDAL bus supports three a round-robin fashion. The CPU bas lowest priority. other nodes on the NDAL-the NVAXCPU and up to The NDAL interface consists of an input section two 1/0 adapters (101 and !02). In the new VAX and an output section. The input section moni­ 4000 systems, the NCA serves as both l/0 nodes on tors the NDAL fo r a new transaction every cycle. A the bus. The NMC contains the arbiter fo r the NDAL valid transaction that has been decoded by the NiVIC and also helps the NVAX CPU maintain cache-mem­ is put into one of fo ur transaction queues ory coherency in the system by interfacing with a (INQUEUEs). There is one queue fo r each of the separate O-bit memory. NDAL nodes: CPU, 101, 102, ancl the fo urth, which The NMI can operate with either a 32- or 64-hit­ stores release write privilege transactions. wide data path and supports single error correc­ Commander nodes on the NDAL initiate release tion, double error detection, and nibble error write privilege transactions to release the write detection, and runs synchronous with the NDAL privilege of blocks in memory. The NMC must

CONTROL CURRENT ADDRESS � DATA � NDAL ADDRESS ADDRESS " TRANSACTION ----v O-BIT HANDLER INTERFACE _;:, � WRITE DATA v ----v -.> DATA "

v

NDAL NDAL � INTERFACE v CONTROL " < CSRS r v ADDRESS-----) � ADDRESS v MEMORY � WRITE DATA v INTERFACE _;:, CSR READ v DATA DATA " " NDAL l =( READ DATA OUT .-1 v v K MEMORY READ DATA

NDAL ARBITER

Figure 5 JWAXMe mory Co ntroller Block Diagram

Digital Te chuicaljourual Vo l. 4 No. 3 Summer 1992 67 NVAX-microprocessor VAXSystems

accept a release transaction from a node, irrespec­ NM C Project Objective and Results tive of the state of its INQUEUE; otherwise, there is a The NMC project objective was to create a high-per­ potential fo r deadlock. Consequently, the NMC has formance memory design that would be com­ a separate queue fo r release write privilege trans­ patible with the VA,'\ 4000 Model 300 memory actions. The output section of the NDAL interface subsystem ancl could provide two to three times buffe rs up to six quaclworcls (i.e., six groups of fo ur the performance of that subsystem. (The VAX 4000 contiguous H)-bit words fo r a total of 384 bits) of Model 300 has a bandwidth of 47.4MB/s, at a cycle read data that must be returned to the NDAL. In a time of 28 ns, using 100-ns DRA.JV!s with a 32-bit normal fu nctioning system, a buffe r depth of six memory data bus.) This goal was achieved by using quadwonls together with an arbitration scheme a 64-bit memory data bus and an interconnect that that gives the NMC the highest priority will never operates at a cycle time as low as 36 ns, using 100-ns result in a fu ll output queue. Therefore, there was DRAMs, and at an NDAL cycle as low as 30 ns, using no need to check for a full queue and to stall while 80-ns DRAMS The asymptotic bandwidth on the loading the queue. NMI using the 100-ns DRAJ\'1 technology and a 64-bit The transaction hand ler arbitrates between the data path is 111.11MB/s, i.e., 2.3 times the bandwidth fo ur JNQUEUEs and stores selected transactions in a of the VAX 4000 Model 300. Using faster 80-ns current transaction buffe r. The current transaction DRAMs, the bandwidth is 133.33MB/s, i.e., 2.8 times buffe r serves as a pipeline stage; this buffer allows the bandwidth of the VAX 4000 Model 300. the corresponding INQUEUE to be loaded with the The NMC interface is efficient from the moment it next transaction while the current transaction is receives a transaction on the NDAL until it starts a being serviced. The NMC can service back-to-back transaction on the NMl. This timing path was transactions with no stal 1 cycles on the NMI. extremely tight and results in real memory read The CSR section of the NMC contains memory bandwidth of 63 6MB/s and 76.32MB/s at 36-ns and configuration registers, error status registers, and 30-ns cycle times, respectively. mode and diagnostic registers. The NMC chip was designed to meet an NDAL The memory interface contains the data path, cycle time of 36 ns, which made the timing very address path, and control fo r up to fo ur memory critical. Most of the NMC chips produced can modules on the :VII.Th e data path contains all the exceed this goal and will run at an NDAL cycle error correction and detection logic. The address time of 30 ns. Future designs based on the 10-ns path contains the row and column address multi­ NYAX chips wiiJ require this cycle time. plexers and a refresh address counter. The control To meet the performance goal, we chose to use is provided by a state machine that can perform a 64-bit memory interface. However, achieving multitransfer read operations, multitransfer write compatibility with the Model 300 memory modules operations, and read-modify-write operations. presented a chalJenge with respect to the ECC The O-bit interface directly controls the O-bit generation and checking mechanism. A simple dynamic random-access memories (DRAMs), which approach would have been to include two separate are housed on the CPU module. For every memory ECC trees, one for 64-bit operation and the other transaction, the N:V!C reads the corresponding O-bit for 32 -bit operation. This design would have been in parallel with the memory access. If the block of very area-intensive, so we chose 64-bit ECC code memory is written, the memory transaction is such that 32-bit ECC was a subset. 64-bit ECC aborted until the corresponding release transaction requires eight check bits, and 32-bit ECC requires is received by the NMC. If the block is unwritten, seven check bits. In our 64-bit code, the eighth the memory transaction is allowed to complete. check bit depends solely on the upper 32 bits. In 32- Initiating the memory transaction in parallel with bit mode, we fo rce the upper 32-bits to a known the O-bit access reduces the transaction latency on value; therefore, that check bit is always a fixed transactions that are not written. Since most mem­ value in 32-bit mode. ory accesses are to unwritten locations, using this The VAX 4000 systems do not al low the use of 32- scheme improves memory performance consider­ bit memory modules, because it is difficult to meet ably. In addition, the system design engineers were Q-bus latency requ irements with the slower mem­ able to use inexpensive DRAJV!s to implement the ory. This system constraint indirectly affected tl1e O-bit memory instead of faster, more expensi\re NMC. The CQBIC and SGEC devices, fo r example, had SRAMs. stringent low latency requirements. The Q-bus

1992 Digital Te cb11ical ]O tll-nal 68 Vo l. 4 No. 3 Summer Design of the VA X 4000 ;Hodel 400, 500, ond 600 �) �ste111s

!�Heney problem had to be solved without causing the G chip (used in the VAX 4000 l'vlodel .1 00 SYs­ SGEC latency problems. Thus, the latency issue was tems), the NCA chip was a new design. To optimize addressed in the fo llowing way: the D.VlA to memory bandwidth and the bus access latency, the NC:t\ provides the two CP bus interface • The NMC has a mode that indicates whether or ports (CP l and CP2, mentioned previously), which not the Q-bus is present in the system. During operate independently. Th is strategy has three this mode, the transaction handler gives the advantages. First, the Q-bus adapter and the system Q-bus node (102) the highest priority. However, read-only memory (ROM)/console are connected to to keep the SGEC latency within required limits, the CP bus separately from the Ethernet and the the transaction handler must service trans­ mass storage devices. This arrangement allows the actions fro m the 10 1 node at strategic times. system to tune and optimize throughput on the • Tb minimize the latency seen by any one node, Ethernetand mass storage devices without degrad­ NDAL the three nodes require separate queues. ing the bus access latency seen on the Q-bus. NDAL The simplest implementation of the input Second, the loading on the two CP buses is reduced. interface would have been to have two queues, Therefore, each bus can operate at a higher fre­ one fo r release write privilege transactions and quency and without externalbu ffers: this also saves one fo r all other transactions. Thus, preserving modu.le area. Third, the dual-bus structure allows NDAL the order of transactions would have been the use of a simpler bus arbitration scheme. very easy. With three queues, it is necessary to In addition to using the dual-bus strategy, the NC:A compare queue addresses to preserve trans­ uses write butler and read prdetch transactions to action ordering. allow DMA devices, particularly the SJ-lAC and SGEC

• The NMC combines the CPU request for write chips, to operate efficiently in double ocraword privilege, the OMA read, and the DMA write into a mode, where a two-octawonl (32-byte) burst of data single transaction before retiring the data to is transferred within a single bus grant. For write

memory. This optimization reduces the latency transactions, the NCA buffers up to two octawords of written transactions. of data. Thus. the bus can operate without stalling. Although the features that were added to the while the NC:A arbitrates fo r the NDAL bus for the buffered write trans:.tctions. For read transactions, NMC chip to reduce Q-bus and sc;Ec latency increased the complexity of the chip, these features the NC:A contains a hexword-size (32-byte) prefetch successfully keep the Q-bus latency below 8 buffe r. \Vhereasthe maximum burst length is only microseconds. an octaword on the C:l' bus, the NCA requests up to a hexword of data during DMA memory read opera­ NDAL-to-CP Bus AdapterChi p tions. The extra data is stored in the prefetch buffer

NCA is a fu ll-custom, high-performance 1/0 con­ and is immediately available if the subsequent Cl' troller chip that provides the electrical and func­ bus read transaction targets the same address as the 110, tional interface between the 64-bit NDAL bus and prefetched data. For NVA X initiated up to fo ur the .1 2-bit CP bus. In the new VAX 4000 systems, the operations can he buffered simultaneously. NDAL The NCA is partitioned into fo ur major sections: supports three chips: the NVAX CPU, the NCA, the NDAL, C:l' I, C:P2, and registers. The NDAL inter­ and the N•'-'IC. On the CP bus, the NCA supports the face and each Cl' bus interface contain the master SHAC:, sc;Ec, CQl:llC,and sse chips. The NCA is fabri­ cated in Digital's Ci\-!OS-:1 process, contains 155,000 and slave sequencer and controls fo r the corre­ transistors, and is packaged in a :H 9-pin I'GA. The sponding bus. The NCA chip also has l/0 queues and design goals for the NCA project were high quality, an internal arbiter to select operations from CP l, improved performance, optimized time-to-market, CP2, and NCA register read transactions to the NDAL and leveraged use of existing CP bus chips. The NCA bus, based on a predetermined priority. team achieved aJJ design goals and completed the The register section contains the control and sta­ project by the scheduled manufacture release date. tus registers and the interval clock timer registers. The interval clock is a software-programmable NCA Architecture Overview timer used by the operating system to account for and Pa rtitioning time-dependent events. Although the original concept of the NCA was moti­ The NCA supports parity check and detection vated by a memory and 1/0 controller chip called NDt\1. on both the and the CP buses. The NCA also

Digital Te chnical jo urnal Vo l. 4 No. 3 Slllllllll'r /')')2 69 NVAX-microprocessor VAX Systems

supports all interrupts defined by the NDAL and CP by up to 66 percent. For an 80-ns CP bus cycle time, bus protocols for other types of error conditions. the maximum bandwidth is 33 33MB/s. When an error occurs, an error status bit is set in the NCA error . Depending on the Te stability type of error, an address may be available fo r diag­ To assist module testing, the .\iCA contains features nostics. The NCA also provides a mechanism to that comply with the IEEE Standard Pl149. 1 .JTAG fo rce a parity error condition on any of the buses to testabilityc' At the pin level, five special pins are help debug the interrupt routines of the operating provided and work in combination with the inter­ system software. nal test access port controller inside the NCA and a bed-of-nails tester to perform short- and open­ Q-bus Latency Support circuit interconnection tests. To reduce latency seen by the Q-bus devices, the NCA provides special logic to gain priority MS690 MemoryModu le from the NDAL arbiter. The NCA informs the arbi­ The MS690 fa mily of CNIOS memory modules was ter of the imminent Q-bus operation, fo r which designed to support the memory requirements set latency is a concern. When a Q-bus is present in fo rth by the f'NA)( memory controller. 2 The NMC the systems, the NCA is programmed to use the two requires the MS690 memory module to provide a IDs mode on the NDAL and to enable the Q-bus two-way, bank-interleaved, 72-bit data path. In addi­ present bit of the control register. Upon detection tion, a self test fe ature is provided that was used on CP of the Q-bus "map" read transaction on the 4000 300 the V�'<: Model memory subsystem. The bus, the NCA immediately asserts a signal to the MS690 module returns a unique board identifica­ NDAL arbiter. The arbiter will not grant the bus tion signature when polled by the NMC. The mod­ to other devices until after the Q-bus read trans­ ule used existing qualified parts and fits on a action is accepted by the NMC or until the signal quad-sized PWB. is deassertecl. Requests from the buffe red write at A common goal of Digital's Electronic Storage the same interface are masked off until the signal Development (ESD) teams is to utilize a single PWB is deasserted. Using this scheme, the Q-bus latency design to accommodate as many memory sizes in the new VAX4000 systems was never more than as possible. The ESD teams ro utinely stretch the 8 microseconds. boundaries of Digital's manufacturing processes to provide world-class memory subsystems. Because Enhanced CVAXPin Bus memory subsystems fo rm the core of the ESD char­ The NCA supports the standard bus protocol in ter, the ESD teams are uniquely tuned into, and both synchronous ancl asynchronous modes. The actively shaping, present and future device specifi­ existing CP bus protocol does not utilize the maxi­ cations fo r aU types of random-access devices. This mum bus bandwidth possible with the standard CP advance and intimate knowledge allows us to build bus protocol. The fastest data transfer rate is two current technology products with the hooks neces­ cycles per fo ur-byte (i.e., 32-bit) longword, because sary to capitalize on the next generation of storage the two primary signals fo r the handshake use the devices. same clock phase for the assertion. When the sent The MS690 options are available in 32 MB, 64MB, signal is detected, it is already too late to generate and 128MB sizes and are selfconfiguring. The the received signal within the same cycle. To MS690 memories communicate with the NMC by achieve the one-cycle transfer rate, a modified pro­ way of the private NMI. AJJ control and clocks sig­ tocol is used. The received signal is changed to rep­ nals originate off-board via the NMI from the NMC. resent a ready-to-receive signal. The received signal Up to to ur memory modules of any density mix may is asserted regardless of whether or not the sent sig­ coexist on the NMI with a maximum memory size nal is asserted. When the sent ancl received signals of 512MB are asserted at the same time, both the master and The MS690 is an ex tension of the existing 39-bit MS670 slave devices know that the data was successfully memory product designee! fo r the VA)\. 4000 transferrecl. Model 300 product line. The DC562 GMX was This protocol works with the ex isting CP bus designed and produced in Digital's Hudson, Mass­ chips and has increased the theoretical bandwidth achusetts, plant for the MS670 32MB memory. This

70 Vo l. 4 No. 3 Summer 1992 Digital Te cbnicaljournal Design of the VAX 4000 Mode/ 400, 500, and 600 Sys tems

GMX is a semi-intelligent, 20-bit-wide, 4-to-1 and of the MS690 memory option and thus help control 1-to-4 transceiver, with internaltest/compare/error costs by reducing product-unique inventory. Al l logging capabilities for its five l/0 ports. The MS670 parts, except the bare PCB, are used on products required eight banks of 39 bits of data, hence the already produced in volume at Digital's Singapore requ irement of two GMX chips per module. and Galway, Ireland, manufacturing plants. The KA670 CPU module used in the VA.,'{ 4000 The MS690-BA memory module, which uses lOO­ Model 300 transfers 32-bit longwords of data. For ns 1M-by-1 M DRAMs, can support NMC cycle times every longword, 7 bits of ECC must be allocated, of 36 ns and 42 ns, respectively, for the VA X 4000 i.e., 8 (32 + 7) 312 The CPU module used Model 400 and 500 systems. The MS690-CA/DA mod­ x == DRAivl s. 64 ules use 80-ns 4M-by-1 M DRAMS and can accom­ in the VAX 4000 Model 500 transfers -bit quad­ words of data. For every quadword, 8 bits of ECC modate 30-ns, 36-ns, and 42-ns NMC cycle times. must be allocated, i.e., 4 x (64 + 8) == 288 DRAMs. The MS690 memory is configured as two inter­ Pe rformance leaved bank pairs, each 72 bits wide (64 bits of data The CPU 1/0 subsystems on all three products pro­ plus 8 bits of ECC); all transactions are 72 bits. The vide exceptiona l performance, as shown in Ta ble 3. memory module supports quadword, octaword, The pair of DSSI buses on the CPU modules for the and he.l(word read/write/read-modify-write trans­ VAX 4000 Models 500 and 600 were tested under actions. Transactions less than 72 bits, i.e., bytes, the VMS operating system performing single-block words, and longwords, are not supported. (512-byte) reads from RF73 disk drives. The read Doubling the data word length is advantageous in rate was measured at over 2,600 1/0s per second two ways: the l/0 bandwidth effectively doubles, with both buses running. The Ethernet subsystem, and 24 fewer DRAMs are required. This last benefit based on the SGEC adapter chip, is also very effi­ resu Its from the fact that only one additional bit is cient. It has been measured transmitting and receiv­ required to protect 64 bits of data as compared to ing 192-byte- long packets at a rate of 5,882 packets protecting 32 -bit data. The available PWB space per second. Packets 1,581 bytes long can be trans­ al lowed room fo r two additional GMXs to handle mitted at a rate of 9.9 megabits per second. the 33 additional data bits. The ability to use the The performance of the CPU subsystem has tradi­ existing GMX integratedcir cuit eliminated the need tionally been measured using a suite of 99 bench­ for a new, 40-bit-wide, <�MX-type VLSI development. marks 7 Scaling the resu lts against the performance Because DRAMs are edge-sensitive devices, mod­ of the VAX-1 1/780 processor and taking the geo­ ule layout, balanced etch transmission I ines, and metric mean yields the VAX unit of performance signal conditioning are extremely important to a (YUP)rat ing. The processor VUP rating for the new quality product. The MS690 design team used VA X 4000 system with the lowest performance, a combined total of 18 years of memory design the Model 400, is twice the VUP rating of the sys­ experience along with extensive use of SPICE tem it is replacing, the Model 300. The two new modeling to determine the optimal P\VB layout.The high-end systems provide three and fo ur times the result was a double-sided, surface-mount P\X-13 performance of the Model 300-an impressive per­ panel that can accommodate all density variations fo rmance increase.

Ta ble 3 Summary of Performance Results for the VA X 4000 Models 400, 500, and 6007

Metric Unit Model 400 Model 500 Model 600

SPEC Release 1.0 SPECmark 22.3 30.7 41 .1 SPECint 17.1 24.9 31 .8 SPECfp 26.6 35.4 48.7 Single User 99 VUPs 16.9 23.8 31 .4 TPC-A tpsA-Iocal 51 .0 62.4 103.0 Integer MIPS 34.2 43.4 64.4 Whetstone Single MIPS 47.6 71 .4 83.3 Whetstone Double MIPS 32.3 45.5 52.6 LI NPACKD (1 00 by 100) MFLOPS 4.8 6.9 9.5 UNPACKS MFLOPS 7.5 10.5 14.7

Summer Digital Te e/mica/ jo urual Vnl. 4 No. 3 1991 71 NVAX-microprocessor VAX Systems

The system performance in multistream and Electrical and Electronics Engineers, Inc., transaction-oriented environments was measured 1990). with TPC Benchmark AH This benchmark, which VAX Sys tems Performance SunuJW1J' (May­ simulates a banking system, genera lly indicates per­ 7 nard, Digital Equipment Corporation, fo rmance in environments that are characterized by MA: Order No. EE-C0376 -46, 1991). CPU concurrent and l/0 activity and that have more than one program active at any given time. The per­ 8. Tra nsaction Processing Performance Coun­

fo rmance metric is transactions per second (TPS). cil, TPC Benchmark A Sta ndard Sp ecification The measured performance of the VA X 4000 Model (Menlo Park, CA: Waterside Associates, 600 system was more than 100 TPS, tpsA-local. As November, 1989). shown in Ta ble 3. the performance of the new VA.'{ 4000 400, 500, 600 Model and systems is impres­ General Reference sive, even compared to RJSC-basedsystems.

DEC STD 03 2- 0 VAX Ar·cbitecture Sta ndard Acknawledgments (Maynard, MA.: Digi tal Equipment Corporation, The authors would like to acknowledge the contri­ Order No. EL-00032-00, 1989). butions of the fol lowing members of the design teams: Matt Baddeley, Roy Batchelder, Abed Chebbani, Harold Cullison, To dd Davis, Joe Ervin, Nannette Fitzgerald, Ava nindra Godbole, Judy Gold, Bryce Griswold, Thomas Harvey, Jim Kearns, John Kumpf, Steve Lang, Dave Lauerhass, Bill .Munroe, Chun Ning, Jim Pazik, Cheryl Preston , Matt Rearwin, Dave Ro uil le, Prasad Sabada, Stephen Shirron, Adam Siegel, Jaidev Singh, Emily Slaughter, Michael Smith, Steve Thierauf, Bob We isenbach, and Pei Wo ng.

References

1. G. Uhler et a!., "The 1\f\!A X and NVAX+ High­ performance VAX Microprocessors," Digital Te chnical Jo urnal, vol. 4, no. 3 (Summer 1992, this issue): 11-23

2. KA680 CPU ;VJodule Te chnical Ma nual (May­ nard, MA: Digital Equipment Corporation, Order No. EK-KA680-TM-001 , 1992).

3. B. Maskas, "Development of the CVAX Q22- bus Interface Chip," Digital Te chnical Jo urnal, vol. 1, no. 7 (August 1988): 129-138.

4. .J. Wi nston, "The System Support Chip, a Multifunction Chip fo r CVA.'{ Systems," Digital Te chnical Journal, vol. 1, no. 7 (August 1988): 121-128.

5. CVAX CPU, VA X P Rubinfeld et al, "The a CMOS Microprocessor Chip," ICCD Proceedings, (October 1987): 148-152.

6. lt'EE Standard Test Access Po rt and Bound­ IEEE 1149.1- my Scan Architecture, Standard 1990 (New Yo rk, NY: The Institute of

72 Summer Digital Te chuical jo-unwl Vo l. 4 No ..3 1992 Jo nathan C Crowell David W.Marusk .a

TheDesi gn of the VAX 4000 Model l()() andMi croVAX 3100 Model 90 Desktop Sy stems

The !'vlicroVAX 3100 Model 90 and VA X 4000 1Vlodel 100 systems were designed to meet the growing demandfo r low-cost, high-performance desktop servers and time­

sharing systems. Both systems are based on the NVAXCPU chip and a set of VIS! sup­ port chips, wbich prm1ide outstanding CPU and f/ 0 performance. Ho used in like desktop enclosures, the two sy stems provide 24 times the CPU petformance of the original VAX-11/780 computer. Wi tb over 2.5 gigabytes of disk storage and 128 megabytes of main memor;� the complete base �ystem fits in less than one cubicfo ot of sp ace. Th e sys tem design was b(f!,b�yleveraged fr om existing designs to help llleet an aggressive schedule.

The demand fo r low-cost, high-performance desk­ found in all previous VA X 4000 systems, i.e., Digital top servers and timesharing systems is increasing Storage Systems Interconnect (DSSI) storage and

rapidly. The MicroVAX 3100 Moclel 90 and VA..'\ 4000 Q-bus expansion. The CPU mother board fo r the Model 100 systems were designed to meet this VA X 4000 Model 100 system is cal led the KA';2. NV1\.'\ demand . Both systems are based on the CPU The .KA';0 and KA52 crus are built from a com­ chip and a set of very large-scale integration (VL'il) mon CPU mother board design; the base CPll support ch ips, which provide outstanding CPU and mother board is configured to create either the

1/0 performance. 1 KA50 or the KA52 product module. The OSSl and Each member of the MicroVAX 3100 fa mily of Q-bus optional hardware is added to the C:l'll systems constitutes a low-cost, general-purpose, mother board to convert a K.A';Oto a KA 52. Also, to multiuser VA X system in an enclosure that fits provide the additional su perset fu nctionality fo und on the desktop. This enclosure supports all the on the KA52 CPU, the diffe rent system read-only required components of a typical system, including memories (ROMs) are added during the manufactur­ the main memory, synchronous and asynchronous ing process. In this paper, the KA ';0 and KA 52 Cl'lls communication lines, thick-wire and ThinWire are referred to as the CPU mother board or module, Ethernet, and up to five small computer system except where differences exist. interface (SCSI)-based storage devices. The system design was highly leveraged from The Micro VA X 3100 Model 90 system replaces the existing designs to help meet an aggressive sched­ Model 80 as the top performer in the line; the new ule. This paper describes the design and implemen­ moclel has considerably more than twice the CPU tation of these systems. power of the previous model.2 The Model 90 system also includes performance enhancements Design Goals to the Ethernet and SCSI adapters, as well as an The design team's primary goal was to develop a CPU increased maximum system memory of 128 mega­ mother board that wou ld provide at least twice bytes (MH). The CPU mother board fo r the MicroVA X the CPU performance of the MicroVAX 3100 Model 3100 Model 90 system is called the KA50. 80, while supporting all of the same J/0 hmct ional­ The VAX 4000 Model 100 system is housed in the ity of the previous systems. This new system would same desktop packaging as the MicroVAX 3LOO leverage the core CPU design from the VAX 4000 Model 90 and provicles the same base functionality. Model 500 system, thus delivering the high perfor­ The VAX 4000 Model 100 adds two key fe atures mance of the NVAX CPU chip to the desktop .\

Digital Te e/mica/ journal 4 Vo l. No. :) Summer 199.! 75 NVAX- microprocessor VAXSystems

The team set additional goals to increase system measures 14.99 centimeters (5.9 inches) high by capability and performance. These goals were to 46.38 centimeters (18.26 inches) wide by 40.00 cen­ timeters (15.75 inches) deep. 1. Increase the maximum system memory from The system enclosure contains two shelves that 72MB to 128MB support the mass storage devices. In the MicroVAX 2. Provide error correction code (ECC) protection 3100 Model 90, these storage locations are cabled to to memory using memory arrays that previously support SCSI disks and tapes. The upper shelf sup­ supported only parity ports three SCSI disks, whereas the lower shelf supports two SCSI devices (any combination of 3. Increase the performance of the Ethernet adapter removable or 3 1/2-inch disks) with access through 4. Increase the performance of the SCSI adapter a door in the front of the enclosure. In the VA X 4000 Model 100, the top shelf is configured to support Early in the project, the team proposed creating a three 3 1/2-inch DSSI disks; the bottom shelf sup­ second CPU design that would have the fe atures of ports two SCSI devices, as in the MicroV�'{ 3100 the larger VA X 4000 systems. This proposal resulted Model 90. in the design of a DSSI adapter option for the CPU The VAX 4000 Model 100 DSSl support is provided mother board, as well as a Q-bus adapter to provide by high-performance DSSI adapter card based on a means to upgrade the CPU power of older a the shared-host adapter chip (SHAC), i.e., a custom Q-bus-based MicroV�'\ systems. VLSI design with an integrated reduced instruction The project to design, implement, and field-test set computer (RJSC) processor.' The system is con­ these systems was accomplished under an aggres­ figured with DSSI as the primary disk storage. The sive schedule. Both designs were ready to ship to DSSI bus exits the enclosure by means of a connec­ customers in just over nine months from the official tor on the back panel . This expansion port can be start of the project. used to connect the system to additional DSSI Sy stem Overview devices, or to fo rm a DSSI-based VAXcluster with a second VAX4000 Mocle1 100 or any other DSSI-basecl The MicroV�'C 3100 Model 90 system supports system. the same 1/0 functionality as the previous genera­ The Q-bus support on the VA X 4000 .Model 100 is tion of systems, the MicroVAX 3100 Models provided by the VLSI adapter chip, i.e , the CVA.,'{ 40 and 80. The fe atures include a SCSI storage adap­ Q22-bus interface ch ip (CQI31C).' There are no ter, 20 asynchronous communication ports, two Q-bus option slots in the system enclosure. The synchronous communication ports, and an Q-bus connects to an expansion enclosure through Ethernet adapter. a pair of connectors at the rear of the system enclo­ The VAX 4000 Model 100 includes the same l/0 sure. Two shielded cables and the H9405 expansion fu nctionality as the Jvi icroV�'{ 3100 Model 90. In module are used to connect the Q-bus to the expan­ addition, the system provides the I/O functionality sion enclosure. The near end of the Q-bus is termi­ of the larger V�'<4000 systems, that is, a high-perfo r­ nated in the system enclosure . mance DSSI storage adapter and a Q-bus adapter port that connects to an external Q-bus enclosure. Both systems provide 24 times the CPU perfor­ CP U Mother Board Design mance of a VA X-11/780 system. The memory subsys­ The design goals presented engineering with con­ tem uses Digital's MS44 single in-line memory straints that fo rced design trade-offs . Some key B 32MB, modules (SIMMs) and thus provides 16M , constraints were (1) fi tting the required fu nctional­ 64MB, HOMB, or 128MB of main memory. ity on a single 10-by-14-inch module; (2) designing As shown in Figure 1, the system enclosure used the system to adhere to the system power and cool­ to house both systems, namely the RA42B, provides ing buclget; and (3) minimizing changes to the fu nc­ mounting to r the CPU mother boarcl, up to five loca­ tional view of the module over previous designs, to tions for disk and tape devices, a 166 -watt power decrease the number of software modifications supply, and fa ns for cooling the system elements. In required fo r operating system support. addition, the enclosure shields the system from The primary way the design team minimized radiated emissions. All 1/0 connections are filtered system development was to leverage as much as

and exit the enclosure through cutouts in the practical from ex isting designs. The CP mother rear panel. The system enclosure is compact and board design used components from the VAX 4000

Digital Te cbuical jo urual 74 Vo l. 4 Nu. 3 Summer 1992 The Design of the VAX 4000 Model lOO and Micro VAX 3100 Mode/ 90 Desktop Sys tems

OPTIONAL DHW42 1/0 MODULE AND LOGIC BOARD

Q-BUS EXPANSION PORTS ------=--.c.,

DSSI PORT ------��-��

SCSI PORT

OPTIONAL DHW42 PORTS ---;--;-c-

ASYNCHRONOUS MODEM ----­ CONTROL PORT

THICK-WIRE ET HERNET CONNECTOR

THINWIRE ETHERNET CONNECTOR ------'

SYSTEM AC POWER CONNECTOR ------'

ON/OFF SWITCH

Figure 1 VAX 4000 Model 100 Sy stem Enclosure Showing the CPU and Connectors

Model 500, MicroVA. '< 3100 Mocle1 80, and VA Xstation The CPU mother board includes a linear regulator 4000 Model 90 systems. Using proven design that generates local 3.3-volt (V) current fo r the components allowed fo r a shorter development CPU core chip set. The voltage is stepped down cycle, smaller design teams, and consequently, a from the 5-V supply. The regulator is necessary higher-quality design, while meeting an aggressive because the 3.3-V direct current (DC) of the system schedule. is not sufficient to meet the ::!:: 3 percent tolerance The design is structured so that both CPU mother regu lation or to supply the required maximum boards can be built using the same printed wiring current. hoard (PWB). The added fu nctional ity fo r the KA52 is provided by a daughter card, additional hardware CPU Core ami cabl ing, and different system RO Ms. The shared The CPU core consists of three chips: the NVA. '< CPU design helped reduce the complexity in testing and chip, the NVA.'< data and address lines (NDAL) qualifying the system design. memory controller (NMC) chip, and the NDAL­ The CPU module contains three major sections: to-CVAX pin (CP) bus adapter (NCA) chip. The the CPU core, the memory subsystem, and the 1/0 NVAX chip directly controls the 128-kilobyte subsystem. Figure 2 is a block diagram of the basic (KB) backup cache. The core chip set is inter­

CPU module fo r the VA)< 4000 Model 100 and connected by means of the NDAL pin bus, as shown MicroVt'v'< 3100 Model 90 systems. Figure 3 is a pho­ in Figure 2. The NDAL bus is 64 bits wide, has a tograph of the module, including the DSSl daughter 42-nanosecond (ns) cycle time, and supports card option. pended transactions -' The peak bandwidth of the

19V2 Digital Te clm icaljotn-llal Vol. 4 No. 3 Summer 75 NVA.'X-microprocessorVAX Systems

r------� CPU CORE I B-CACHE NVAX CPU I 128KB OR CHIP I 512KB

I I NDAL-TO-CP I BUS ADAPTER I CHIP (NCA) NDAL BUS

1- - ..___ _ � ------��------"""'?:""" ____,,-- I 1 I I .--,--_j- ----, I I -- I 1/0 SUBSYSTEM I I I CP BUS 2 CP BUS 1 1 I I 1 I I ,------� I I I I I SHAG DSSI I I FLASH ROM :- ADAPTER CHIP _.. I I I I I �------� I B I MEMORY SIMMS I 16, 32, 64, 80, 128MB SYSTEM I CEAC/SQWF EDAL SUPPORT CHIP I I CHIP BUS (SSC) I � ����:;�------_ ------I I I I DSW42 I I CVAX 0-BUS I SYNCHRONOUS :- SGEC ETHERNET +- INTERFACE CHIP : COMMUNICATIONS I ADAPTER CHIP 1 (CQBIC) I OPTION I I ·------1 I i'"------I 1 DHW42 1 t I I I ETHERNET SCSI ADAPTER .__ ASYNCHRONOUS -.I : COMMUNICATIONS : I I OPTION I ·------1 I QUART I (FOUR SERIAL LINES) I I l ______� Note that - +- indicates a connection to an external bus.

Figure2 CPU Module Block Diagra m

NDAL bus performing 32-byte operations is 1=:>2 cycle clock and NDr\Lbus clocks are generated with megabytes per second (.VI R/s) an on-chip clock generator suppliecl by a 286-mega­ hertz (;VIHZ) oscillator. The CPU is based on a high-performance Chip The i\-' VAX CPU chip is an advanced NVAX NVAX CPU macropipeline(l architecture similar to that of the implementation of the VA ,'\ architecture in Digital's VAX 9000 CPlJ L' The VIC: allows the caching of t

Digital Te cbnicnl jo unwl 76 HJ!. 4 No . .) Sui// Iller I'J'J2 The Des(({ll of the li,4 X 4000 lvlodel 100 and Micro VA X 3JOO Mode/ 90 Desktop Systems

Figure 3 CP U Mother Board

of 148,000 transistors and is the high- speed inter­ interface from the NDAL to the C:P bus 6 The NC:A face to the system main memory. The N;vic is the consists of 1'5'5,000 transistors and supports two CP arbiter fo r the three chips on the NDAL bus, namely, buses. Tbe CP bus used on the CVAX microproces­ the NVAX , the NCA, and the NMC The NMC chip sor family is also used on many of Digital 's custom manages the array of ownership bits that corre­ 1/0 adapter chips , such as the CQBIC, the SHAC. the spond to each 32-byte segment of memory. Each of second-generation Ethernet controller (SCEC), and these segments corresponds to a cache line. The the system support chip (SSC).-'· •-�8 Thus, the hard­ ownership bit indicates whether the valid copy of ware and software designs fo r these I/O functions the data is in memory, in the CPU write-back cache, cou lcl be leveraged from previous efforts. The NC:A or in an l/0 devices buffer. performs direct memory access (DMA) from the l/0 The NMC has fo ur command queues that accept clevices ami supports the cache consistency proto­ read, write, and remove write privilege trans­ col required fo r the NDAL bus. actions from the NDAL bus. Buffers hold the read The NCA was designed to optimize DMA traffic data to be returned to the node that requested the from CP bus devices. In the KA'50 CPU, the Cl' bus data. Tbe NMC and the memory su bsystem provide devices include the sc; Ec Ethernetadap ter, the SSC, the 9'5MR/s of bandwidth shared by the NVAX and the fielcl erasable programmable read-only memory the 1/0 devices. (FEPROlvl) subsystem , the CP-to-EDAL adapter chip (CEAC), and the SCSI quaclword first in, first out NDAL-to-CP Bus Adapter Ch ip The NCA chip, also (SQWF) chip. In addition, the asynchronous com­ implemented in Digital's C:MOs-3 technology, is the munication option is attached to the CP bus . The

Digital Te e/mica/ jo urnal Vo l. 4 No.. > S!IIJ/JJI<'r !<)'J.! 77 NVAX-microprocessor VAX Systems

KA52 C:Pl! also attaches the CQBIC: Q-bus adapter Support fo r the SCSI bus is provided by the 5:) C94 chip ancl the SHAC DSSI host a(lapter chip. SCSI adapter chip.10 'T'he 53C94 chip is interfaced to the system on the EDAL bus ancl uses the SQ\VF chip Memory Subsystem to increase its DMA performance. The SQWF The memory subsystem is controlled by the NMC chip makes it possible to buffe r data moving to the chip. The main memory is implemented using MS44 CP bus. The SCSI bus operates in synchronous mode SIMMs and low-cost gate array (LCGA) chips to pro­ fo r high-performance storage access of 5MR/s. vide an interface between the N1'vi C: and the SIMMs :> The QUART gate array supplies the logic fo r fo ur The SIMMs are used in groups of four to provide built-in serial ports. The QUART, originally used two interleaved banks, each with a 64-bit data path on the DZQ ll Q-bus device, provides the same and eight bits of ECC. This interleaving scheme software interface as that device. The third port increases the bandwidth of main memory by alter­ provides modem control functions by means of nating data between both banks of memory. aclditional logic; the first, second. ancl fo urth ports are data leads only. The ECC provides single-bit error correction and double-bit error detection. The SGEC Ethernet adapter chip was chosen The individual SJMiVIs are available in either 4,vl l:l because it provides higher performance than the or I6rviB variants. Since fo ur SIMMs fo rm a complete Ethernet adapter used on the MicroVA,'\ 3100 Model functional set, sets can be 16MB or 64MB in size. 80. The SGEC is the adapter chip used on all VAX Therefore, because the system supports up to two 4000 systems. In addition, this chip directly inter­ sets of SIMiVIs, the total system memory size can be faces with the C:l' bus. either 16MB, 32MB, 64.-YlB, BOMB, or 128MB, depend­ The limited size of the CPU mother board ing on the combination of SIMM size and the num­ required the DSSI adapter to be added by means of ber of sets. a daughter card. The Q-bus adapter chip and bus To coincide with the cache coherency scheme termination are provided directly on the mother board. used in the NVA,'\ CPU chip, the NMC keeps track of the cache lines that have write privilege reserved by the CPU or 1/0 devices. This state is stored in sep­ Consoleand Diagnostics arate dynamic random-access memories (DRAN!s). The MicroVAX 3100 Models 80 and 90 differ in their 'T'hese DRA:vrs interface directly to the NMC: by console designs and diagnostics. Because the basic means of a private bus. The ownership bits are pro­ CPU core of the MicroVAX 3100 Model 90 and the tected by ECC. VAX4000 Model 100 systems is very similar to that of the VAX 4000 Model '500 system, the design team l/ 0 Sub�J!Stem decided to adopt the console of the Model 500 and Because the MicroVA,'\ 3100 Model 90 was intended add the required commands and functionality. as an upgrade for the Model 40 and 80 systems, the Borrowing proven designs, such as the console of 1/0 subsystem of the earlier systems dictated the another NVA.X-based system, significantly shortened design of the new Model 90. In addition, the 110 the product development schedule. subsystem of the KA52 CP module fo r the VAX One enhancement ro the CPU mother board was 4000 Model 100 supports two fu nctions fo und in the addition of a FEPROM subsystem. If an update is the orher VA,'\ 4000 systems, the DSSI adapter and required, the console and diagnostic code on the the Q-bus adapter9 The 1/0 subsystem includes CPU can be reprogrammed in the field. In contrast, a ThinWire and thick-wire Ethernet adapter, four previous systems required the memories to be in built-in asynchronous terminal lines, a connector sockets and the parts ro be replaced in the field. fo r the asynchronous option, and the CEAC and With FEPROMs, a program is loaded from any SQWF chips. bootable device. This program erases the FEPROMs A bus interface was incorporated in the I/O sub­ and reprograms them with the new ROM image. system to support the DS\Xf42 synchronous commu­ This enhancement serves as an easy mechanism fo r nication option, the SCSI adapter chip, and the updating the ROMs in the field to provide new fe a­ QUART fo ur-port asynchronous control ler chip. The tures or to fix bugs that may be discovered. C:EAC and SQWF chips, which are gate arrays On power-up, the CPU starts executing from the (lesigned fo r the VA,'\station 4000 iV!()(Iel 90, are FEPROM memory and runs the power-up self test to used to create the EDAL bus. help verify that the system is fully operational.

7H Vo l. 4 Nu . .J Slllllll/er I'J'J.! Digital Te cbnicaljourual Tbe Design of tbe VA X 4000 !Vlodel 100 and Micro VA X 3100 JV!ode/ 90 Desktop Sy stems

Upon completion of the execution of this test, the onboard diagnostics. However, as fo r the console, system transfers control to the console program. the philosophy used in developing these diagnos­ Depending on the values configured in nonvolatile tics was to leverage as much of the software design memory, the console program either boots the as possible from existing designs. system with the correct parameters or stops fo r With the advent of larger boot and diagnostic console input. RO Ms, the diagnostic coverage of the power-up self­ In earlier systems, the speed of executing from tests greatly increased, including extensive testing ROM could he more than an order of magnitude of the cache, the memory, and the CPU core. These slower than running from cached main memory. tests help assure the customer that a fa iled compo­ The NVAX CPU chip added the virtual instruction nent will be detected and reported upon power-up. cache, which allows the caching of instruction In many cases, the new tests can help isolate fa il­ stream rekrences from 1!0 space. This feature ures in the individual SIMM or cache chip. This fe a­ greatly increases the performance of the ROtvt code. ture is used extensively in manufacturing, as well as by field service. Console During the power-up sequence, an instruction exerciser (HCORE) is run to test the floating- point The console program gains control of the CPU hardware. This test provides very good coverage whenever the processor halts or performs a restart of the floating-point unit. In the past, HCORE operation such as power-up The console provides has been run as a standalone diagnostic in manu­ the fo llowing services: facturi ng before a system is shipped. The design l. Interface to the diagnostics that test compo­ team fo r the two new desktop systems believed nents of the CPU and system that this test should run on every system power-up selftest. 2. Automatic/manual bootstrap of an operating The CPU core is designed to function over a wide system fo llowing processor halts range of environmental conditions. Some variables 3. An interactive command language that allows of the environment are temperature, voltage, and the user to examine and alter the state of the minimum/maximum component parameters at a processor given clock frequency. Exceeding the wo rst-case design envelope can cause unpredictable resu lts. There are minor diffe rences between the KA '50 For example, to avoid problems caused by a defec­ anu KA '52 consoles. Largely, these differences relate tive main clock oscillator that may be running too to the KA '52 CPU mother board support fo r the DSSI fa st, the diagnostics measure the speed of the CPU bus ami the Q-bus. Although the console is similar cycle clock to determine if it is within the accepted to that fo und on the VAX 4000 Model 500, some new tolerance. If the cycle is faster than the design mar­ commands were implemented to provide function­ gin dictates, an error is reported. ality that exists on previous MicroVAX 3100 systems. These com manus incluue LOGIN ancl SET PSWD (set password), which give support fo r a secure con­ Design To ols sole; SET/SHOW SCSJ_ID; SHOW CONFIGIJRATION; The design of the CPU mother board uniquely SHOW ERROR; and various com mands to support a merged components from several designs. The system exerciser. success of this approach relied on the use of design On the KA '52 CPU, the console supports the DSS1 tools to perfo rm the merge and to verify the cor­ bus and the Q-bus with a set of commands. These rectness of the merge. commands allow polling of the DSSI bus to deter­ The normal design process is to create a set mine what devices are present and to configure the of design schematics and to verify these through internalparam eters of each device. Thesystem can simulation. Once the design is logically verified , be booted from devices on the SCS1, DSSI, or Q-bus the layout process begins. The layout process chips, as well as over the Ethernetport. includes the use of the SPICE simulator to give direc­ tion to the physical layout structure and to check Diagnostics the integrity of the layout.11 After the layout is com­ Diagnostics help isolate fau Its in the system down plete, the database is fed back into logic simula­ to the level of the field-replaceable units. Significant tion, which again verifies tbe correctness of the effort was expended on the development of design database.

Digital Te cbuiutl jo urnal Vo l. 4 No. .J Sllllllller I'J'J.! 79 NVA.,'X-microprocessorVAX Systems

The CPU mother board designers took a diffe rent read operations from RF35 disk drives. The read approach. Since the physical placement of the con­ rate was measured at more than I ,200 l!Os per sec­ nector portion of the module was the same as fo r ond. The SCSI adapter on both CI'Us was measured the MicroVA.,\. 3100 Model 80 module, this design at more than 500 I/Os per second fo r single-block element was used as the starting point fo r the over­ reads. all design. The database was edited using the VA.. '< The Ethernetsu bsystem used on both the KA50 layout system (VLS). and the only components and KA52 modules is very efficient and has been saved were those that were to be used in the new measured transmitting 64 -byte packets at a rate CPU module. This procedure provided the correct of 14,789 packets per second. The measured receive placement for all 1/0 connectors that exited the rate fo r 64-byte packets was 14,78') packets per system enclosure. second. In addition, the VAX 4000 Model 500 CPU core The performance of the CPU subsystem has was used as the basis fo r the CPU mother board etch traditionally been measured using a suite of 99 layout. The Model 500 design has proven to have benchmarks. 12 The results are scaled against the very good signal integrity due to its well-thought­ performance of the VAX-ll/780 processor, and out circuit board layout. To leverage Model 500 the geometric mean is taken. This calculation yields work in the layout of the CPU mother board, the the VA X unit of performance (V1 JP) rating. The pro­ designers extracted the signal cessor VUP rating for both the KA'>Oand KA')2 CPUs routing from the Model 500. This signal routing is 24 l's -more than twice the performance included the CPU core and cache treeing, the most of the MicroVAX 3100 Model 80. Table I presents critical areas. This approach eliminated the need to a summary of the performance resu lts for the VA X model critical signal interconnect in the design and 4000 Model 100 and the Micr(WAX 3100 Model 90 guaranteed that the signal integrity and connector systems. layout would be identical to that of the proven The performance of the system in multistream Model 500. and transaction-oriented environments was mea­ Database comparison tools were used to guaran­ sured with TPC Benchmark AH This benchmark, tee that the schematics matched tbe physical layout which simulates a banking system, generally indi­ database. As a final step, the physical layout data­ cates performance in environments characterized base netlist was used to create a simulation model. by concurrent CPU and l/0 activity and in which DECSIM, Digital's simulation tool, \Vas used to verify more than one program is active at any given time. the final correctness of the design database. The performance metric is transactions per second (TPS). The measured performance of the VAX 4000 Peiformance Model 100 is 50 TPS tpsA-Iocal; that of the Micro\'AX The CPU I/O subsystems on both the MicroVAX 3100 Model 90 is 34 TPS tpsA-Iocal. The diffe rence 3100 Model 90 ami the VAX 4000 Model 100 pro­ in performance between the VAX 4000 Model 100 vide exceptional performance. The DSSI bus on and the Micro VA X 3100 Model 90 is a result of their the KA52 CPU was tested umler the VMS operat­ different disk subsystems, i.e., the DSS! and SCSI ing system perfo rming single-block ('>12-byte) adapter support.

Ta ble 1 Summary of Performance Results for the VA X 4000 Model 100 and the MicroVA X 3100 Model 90 Syste ms13

VA X 4000 MicroVAX 3100 Metric Unit Model 100 Model 90

SPEC Release 1 .0 SPEC mark 30.5 30.5 SPECint 24.3 24.3 SPECfp 35.5 35.5 Single User 99 VUPs 23.8 23.8 Integer VUPs 19.6 19.6 Single VUPs 26.0 26.0 Double VUPs 31 .3 31 .3 TPC-A tpsA-Iocal 51 34

80 Vo l. 4 No. .) S!IIJJJJ/er 1992 Digital Te c/JIIical jo urnal The Design of the VAX 4000 Model 100 and Micro VAX 3100 i\tlodel 90 Desktop Sy stems

Acknowledgments Te chnicalJo urnal, vo l. 1, no. 7 (August 1988): The authors would like to thank all the designers of 79-86. other products whose logic components were used 9. M. Callander, Sr., L. Carlson, A. Ladd, and M. 4000 100 3100 in th � VA .\: Model and MicroVAX Norcross, "The VAXstation 4000 Model 90," Model 90 designs, namely, the VA Xstation 4000 Digital Te chnical jo urnal, vo l. 4, no. 3 Model 90, VA.'\ 4000 Model 500, and MicroVAX (Summer 1992, this issue): 82-91. 3100 Model 80 teams. The authors wo uld also like 10. NCR to acknowledge the many engineers who helped Microelectronics Products Division, NCR design and qualify the VA.'< 4000 Model 100 and 53C94, 53 C95 , 53 C96 Advanced Controller CA: MicroVAX 3100 Model 90 systems on a very aggres­ Data Sheet (Santa Clara, NCR Corpora­ sive schedule. Special thanks to Harold Cullison tion, 1990). and Judy Gold (diagnostics and console), Matt 11. SPICE is a general-purpose circuit simulator Benson and Joe Ervin (module design and debug), program developed by Lawrence Nagel ancl Cheryl Preston (signal integrity), Dave Rouille ELlis Cohen of the Department of Electrical (design verification), Colin Brench (EMC consul­ Engineering ancl Computer Sciences, Univer­ tant), Larry Mazzone (mechanical design), Billy sity of California at Berkeley. Cassidy and Wa lt Supinski (P\X'B layout), and Rita 12. Bureau (schematics). VA X Sy stems Pe 1formance Summary (.Maynard, MA: Digital Equ ipment Corpora­ tion, Order No. EE-C0376-46, 1991). References and Note 13. SPEC: A Ne w Perspective on Comparing 1. NYAX+ G. Uhler et al., "The NVA.V.. and High­ 10 Sy stem Performance, rev. (Maynard, MA: performance VAX Microprocessors," Digital Digital Equipment Corporation, Order No. 4, Te chnical jo urnal, vo l. no. 3 (S ummer EE-EA379 -46, 1990). 1992, this issue): 11-23. 14. Transaction Processing Pe 1formance Coun­ 2. Micro I-'ll X 3100 Models 40 and 80 Te clm ical cil, TPC Benchmark A Standard Sp ecification Information Manual, rev. 1 (Maynard, MA: (Menlo Park, CA: Wa terside Associates, Digital Equipment Corporation, Order No. November 1989). EK-A0525-TD, 1991).

3. KA680 CPU Module Te chnical Ma nual (Maynard, MA: Digital Equipment Corpora­ tion, Order No. EK-K.A680-TM-001 , 1992)

4. B. Maskas, "Development of the CVAX Q22- bus Interface Chip," Digital Te chnical jo urnal, vol. 1, no. 7 (August 1988): 129-138.

5. ]. Murray, R. Hetherington, and R. Salett, "VAX Instructions That Illustrate the Architectural

Features of the VAX 9000 CPU ," Digital Te chnical Jo urnal, vol. 2, no. 4 (Fall 1990): 25-42.

6. .J. Crowel l et al., "Design of the VAX 4000 Model 400, 500, and 600 Systems," Digital Te cl:mical Journal, vo l. 4, no. 3 (Summer 1992, this issue): 60-72.

CMOS 7 P. Rubinfeld et al., "The CVAX CPU, a VA X Microprocessor Chip,'' ICCD Proceedings (October 1987): 148-152.

8. G. Lidington, "Overview of the MicroVA.V.. 3500/3600 Processor Module," Digital

Digital Te chnical Jo urnal Vu l. 4 Nu . .3 Summer 1992 81 Michael A. Cal/.ande1; Sr. Lauren M. Carlson Andrew R. Ladd Mitchell 0. No rcross

TheVAXs tation4000 Mo del90

Tb e VA Xs tation 4000 JV!ode/ 90 is the latest member of tbe VAXstation product line. Based on t!JeNl'l�X CPU, the .Hodel 90 u·as designed as a module upgrade to the VAXstatio11 4000 Model 60 system. Th e ,lfodel 90 has 2. 7 times the CP U pe1jomwnce of the Model 60 and prol'ides IJase-lel 'el, tu ·o·dimensional gmpbics pe1jomwnce of 266, 000 l'ecto rs per second It supports up to 128MB of memoJ:J\ an SCSI- I bus inter­ fa ce, a TURBOcbannel op tion, a �y ncf:n·onous conmlwlication op tion, and seueral graphics op tions. TIJedesign temn used on�v progmnnnable def!ices to implement the new logic designed into tbe svstem. In addition. a breadboard j)ml'ided the basisfo r logic and softu·are uerificatioll.

During th<: summer of 1991, Digital's Semicon­ ment in the Model 60 components. Second. by ductor Engineering Group began p lan ning a new using as many components as possible from the VA}\ workstation based on the NVA.,'\ CPU chip. 1 The Model 60 des ign, we could reduce the hardware development process had three main goals: and software engineering effo rt requ ireJ to pro­ ro increase CPU performance. to maintain an duce the new system. The �lode! 90 system modu le aggressive time-to-market schedule, and to pro­ p rovides a d irect upgrade from the Model 60. The vide upgrade compatibi l i ty with the VAXstation only system component or option on the M()(lel 60 Model GO. that is not sup ported by the Model 90 is the entry­ The pri mary goal of the VAXstation 4000 Model level graphics option. 90 design was to implement a workstation with This paper presents the design methodology we well over twice the CPU perfo rmance of its p rede­ fo llowed to meet our project goa ls. It discusses the cessor, the Model 60. The advent of high-p erfo r­ four major components in the i'vl odel 90 system . It mance workstations based on reJ uced instruction describes the physical design of the system board set computers (RISC) required any new VAX work­ and the breadboard system we used for logic verifi­ station to provide a significant p erfo rmance cation and debugging of software. The paper ends increase over previous VA X workstations to be com­ with a comparison of p erformance data fo r Digital's petitive in the marketplace. The Model 90 met this workstations. goal by ach ieving 2.7 times the p erformance of rhe VA '\station., .Model 60. Design Methodology The second major goal of the project was to The design methodology used during the Mod<:! 90 develop and sh ip the system as qui kly as possible. c project consisted of the fo l lowing approaches : Th is was mandated by comp etitive pressures in the workstation market. We proposeJ an aggressive • Complex logic software. and firmware from best-case schedule which fo recast a breadboard existing designs would be used whene,·er running within three months of the project pro­ possible. posal , prototypes running the VMS system within • All new logic would be implemen ted using pro­ five months, ancl a customer ship elate within grammable technology. eleven months. The develo pment teams achieved almost every project milestone within a few weeks • A brct(lboard would be bu i lt as earl.y as possible of the proposed s hedule. c • Logic wou ld be simulated only if it could not be The final major goal of the project was to design verified with the breadboard . the system such that it cou lJ be offe ree! as a simpie modu le upgrade to the V&'< st :ttion Model 60. There These approaches were influenced ami shapnl were two main reasons for design ing the system as by our aggressive schedule, by the emergence an upgrade. First, it protected the customer's invest- of new programmable technologies, and by the

82 lb/. -1 .\·u. .J Sil/111111!1' /')').! Digital Te c!JIIicnf jo urnal The VA Xsta tion 4000 Model 90

availability of certain VA X system designs that attached to a VA.'( 4000 Model 500 2 Unlike a con­ included some of the subsystems that we planned to ventional prototype, the breadboard logic was use. These influences are discussed in this section. expected to change; therefore we included recon­ 'T'he strategv of using existing hardware anti soft­ figurable connections to the FPGAs. AJso, the bread­ ware components stemmed from our goal to board system did not need to meet any of the deliver the Model 90 as quickly as possible. The physical constraints, such as size and layout, that project schedule did not allow time for the develop­ are normally required of a conventional prototype. ment of any major new pieces of hardware or soft­ An early breadboard system provided the clear ben­ ware. Consequently, we used as much hardware efits of rapid testing and change of hardware and anti software from other VA.'( products as possible. software. Our aggressive schedule also prompted us to Because we could change logic quickly and easily explore diffe rent technologies and verification on the breadboard, the role of simulation on this methods. On our previous projects, we used con­ project fo cused on verifying module interconnect, ventional gate array or standard cell technology, and not on exhaustive logic verification. We main­ and typically we strove for exhaustive logic si mula­ tained a working system-simulation model, with a tion and timing verification prior to releasing chip basic set of regression tests, as a reference fo r logic designs. Our goal was to have fu lly fu nctional first­ changes and as a tool fo r debugging. Logic verifica­ pass silicon. Unfortunately, the first pass of a gate tion was performed on the breadboard to an extent array was rarely fu lly functional. This approach had not possible using simulation. two consequences on the project schedule: (1) First-pass hardware was usually delayed as much as Major Sy stem Components possible to all.ow fo r more thorough logic simula­ Figure 1 is a block diagram showing the primary tion and timing verification, and (2) a second pass components in the Model 90. In this paper we focus was needed if first-pass si licon was not hilly fu nc­ on fo ur distinct components in the system: the tional, adding several months to the overall project core, the memory suhsystem, the 1/0 subsystem, schedule. and the graphics subsystem . At the time of our design, several new pro­ The core chip set is composed of a 74 .4 -mega­ grammable silicon technologies were emerging hertz (MHZ) NVAX CPU. the i'NAX memory con­ that promised perf(>rmance, densities, and package troller (NMC), and the NVAX 1/0 adapter (NC:A). The sizes comparable to gate array technology. As the NVAX CPU also controls a 2'56-kilobyte (KI3) write­ logical design of the system progressed through its back secondary cache that reduces memory read early stages, we evaluated these new technologies latency and decreases memory write traffic. and found that they had matured enough to be used The memory subsystem supports a 64-bit data in the design of the Model 90. We chose Xi I inx field path to main memory that is composed totally of programmable gate arrays (FPCIAs) and AM D single in-line memory modules (SIMMs). Main mem­ PALs to implement large-scale integration, and stan­ ory sizes of up to 128 megabytes (MB) are supported dard PA Ls to implement smaller logic fu nctions. by the Model 90. These programmable technologies allowed us to The 1/0 subsystem comprises two independent build first-pass hardware with the fu ll expectation 32-bit buses that communicate with the various l/0 that we would need to make inevitable changes in ami graphics options of the Model 90. One bus response to logic bugs and timing problems. interfaces to the optional TlJRBOchanneJ adapter, Fortunately, with the new technologies, bug fixes the firmware re

Digital Te cbnical jo unwl Vo l. 4 No. .) Summ<'r /'J'Jl 8:3 NVAX- miCI·opr-ocessor VAX Systems

NDAL BUS

ETHERNET CONSOLE CONTROLLER ROMS

1:-Dv SCSI LCSPX MODULE DMA RAMS

SYNCHRONOUS MEMORY SIMMS COMMUNICATIONS 16 - 128MB

QUAD UART SPXG OR SPXGT MODULE

TOY CLOCK

LED REGISTERS

CONFIGURATION REGISTER

ETHERNET ID ROM

Figure I Block Diagm m ql the VA Xs ta tion 4000 Jl!Jo de/ 90

Finally, the graphics subsystem provides support was completed and stable. The NVAX CPU is a fu lly fo r three diffe rent graphics options. These options custom complementary metal-oxide semiconduc­ include one low-cost gra phics option and two high­ tor (CMOS) CPU fa bricated in Digital's 0.75 -micro­ performance three-dimensional accelerators. meter CMOS-4 process. The NCA and NMC are also The majority of the components used in the fu lly custom GvJOS chips, but are fabricated using CMOS-:1 :\1o tlel 90 had been used in previous VAX systems. Digital's l.O-micrometer process. The three Ta ble I lists the maj or ;\'lode! 90 components anti custom chips communicate with each other over a inclicates the source of these components. 64-bit bidirectional bus named the NDAL. The NVAX CPU con tains a 2KB virtual instruction VAX station 4000Mod el90 Core cache, an HKB write- through instruction/data pri­ NVAX Cl'! l, NCA, The the !\.\IC, the and the backup mary cache, and, on the Model 90, interfaces to cache compose the core of the system modu le. This a 256KH write-back instruction/data secondary core architecture was taken directly from the VAX cache. It contains an on-chip floating-point 4000 Model ')()() system. This architecture was cho­ unit and branch prediction logic. The NVA)( CPU sen because it would meet our performance goals; pipelines instruction execution at the macroin­ 1/0, it provided simple interfaces to our memory, struction leve l as wel l as the traditional micro­ and graphics subsystems; and because the design instruction level.

84 l'\1/. 4 ,Yo . .> 1/JJ/))Jer /')<)1 Digital Te ciJnicnljo urnal Th e VA Xsta tion 4000 Mode/ 90

Ta ble Model Component Source NMC provides most of the memory control signals, 1 90 and only simple bank selection logic is required Component Source external ly. Core chip set VA X 4000 Model 500 The Model 90 memory subsystem implements Ethernet controller VA X 4000 Model 500 two sets of memory using the same :16-hit wide Memory SIMMs VA Xstation 4000 Model 60 SIM:'vls that are used in the Model 60 system. Four SIMMs are required fo r each set. By using either the SPXg and SPXgt VAXstation 4000 Model 60 graphics options 4JVIH or 16MB SIMMs, the Model 90 allows memory TURBOchannel option VAXstation 4000 Model 60 configuration sizes of 16.VIIl, :)2 ,VJ B, 64:\<111, HOMB, or 12HMB. Synchronous VA Xstation 4000 Model 60 communications option Due to module space constraints and cost SCSI controller VAXstation 4000 Model 60 concerns, we investigated alternatives to the fo ur Enclosu re, power VA Xstation 4000 Model 60 GMX memory data path chips used on the VA X supply, cables, brackets 4000 Model ')00 memory modules. We determined Memory transceivers VAXstation 4000 VLC that two low-cost gate arrays designed fo r the LCSPX graphics option Modified VXT 2000 VA Xstation VI.C could he used instead. These mod ule gate arrays provided the same multi plexer and c;:.1x EDAL controller New design transceiver fu nctions fo und in the chips. N.'vl l SPXg and SPXgt New design Because the on tile Model 90 consists of only interface two Joacls, the high-drive capability of the GMX

chips was not required. We used a simple PA L to decode the bank selection signals from tbe NMC The NCA provides direct memory access (DMA) and to generate the control signals required fo r the and programmed 1/0 (PIO) support between the gate arrays. 64-bit NDAL bus and two 32-bit bidirectional COAL Because the Model 90 design uses the NMC, we

buses named CP 1 and CP2. In the Moclel 90 system, received an additional benefit of having error cor­ these buses run at an 80-ns cycle time and interface rection code protection at no additional cost to the

to all the graphics and 110 devices in the system. system. The N\1C implements a single-bit error cor­ The NCA also contains the VAX standard interval rect, double-bit error detect code (SEC :/DED) across timer register as well as many of the l/0 control and every 64-bit word of memory data. The eight bits of status registers. error correction code replaced the eight bits of par­ The NMC services NDAL memory requests using ity used on the Model 60. a 64-bit dynamic random-access memory (DRAM) bus called the NMl, which is protected with an error 1/0 Subsystem correction code. The NMC, as configured in the Given that the Model 90 system was an upgrade to Model 90, supports up to 12HMH of main memory. 1/0 It also supports a directory-based broadcast coher­ the Model 60, a requirement of the subsystem design was to provide support fo r all I/O ence protocol to maintain coherency between the devices/options fo und on the Model 60. The Model write-back cache of the NVAX CPU and the system's 60 1/0 design consisted of an interface to a 16-bit DMA devices. bus known as the EDAL where most of the system 1/0 devices resided. The iV Iodel 60 also supported a Memory Subsystem TlJRROchannel adapter that connected to a ::)2-bit The memory !'iubsystem of the 1VIodel 90 is based on CDAL bus. the design of the V8vV.. 4000 Model ')00 system. In the The main task of the Model 90 l/0 subsys­ memory subsystem, tbe i'\:'viC handles all NDAL tem design was to provide an interface between memory references by transferring them over tbe the two :)2 -bit Cl' buses provided by the i\' CA and 64-bit NMI. The NtviC supports data transfer rates up the 16-bit EDAL bus and the 32 -bit TlJRIIOchannel 5MB NMI to ')8. per second over the when used with adapter option offered on the Model 60. The design a 74 .4-,VI Iiz NYAX CPU. Memory is configured in sets; work necessary included a small PA L design fo r the each set contains two banks of interleaved 64-bit TURROchannel interface on the CP2 bus and the wide memory. External multiplexers and trans­ design of two programmable gate arravs for the ceivers are required to perform interleaving. The interface between the :)2-bit C:P I bus and the 16-bit

Digital Te chnical }oun1lll V" /. 4 1\'o. 3 Summer 1')')1 NVAX-microprocessor VAX Systems

EDAL. The following list describes the Model 90 • Synchronous Communications Option-The I/0 (levices :mel options and explains wlw each Model 90 supports the same multiple protocol­ was chosen. communications option that is offe red by the Model 60. 'l'his interface was implemented on • Ethernet-T'he Model 90 Ethernet interface is the EDA L bus ancl allows use of synchronous implemented with the second-generation wide-area network communication through pro­ Ethernet controller (SGEC), which provides an tOcols such as high-level data l ink control (HDLC) Ethernet connection through a ThinWire or and synchronous data link control (SDLC) . thick-wire cable, selectable by a switch on the rear of the system box. The SGEC, \vh ich con­ • Miscel laneous EDAL Devices-The other devices

nects to the CPl bus and was used on the VA X and registers on the Model 90 16 -bi t EDAL are a 4000 Model 500 system, facilitates scatter/gather 16-bit system configuration register, an 8-bit mapping and dual internal FIFO buffering. We light-emitting diode register, an Ethernetidenti­ chose the VAX 4000 Model ')()() design to imple­ fication ll.OJvl, and a \Vatch chip. Al l of these ment an Ethernet control ler because it required devices also existed on the Model 60 EDAL bus no new logic design. and were accessed in a similar manner.

• Small Computer System Interface-The SCSI bus interface is implemented using the NCR 5:iC:94 CEA C and SQWF Ch ip Designs SCSI controller chip that was used on the Model One of the major pieces of design work required fo r 60. i The NCR 53C94 device connects to the EDAL the Model 90 l/0 subsystem was to interface the

bus and performs DMA operations to and from 32 - bi t CP I bus to all the 1/0 devices that reside on main memory in concert with the two pro­ the 16 -bit EDAL bus. This interface was partitioned grammable gate arrays known as the CEAC and into two tightly cm1pled designs called the CE.r\C SQWF DMA virtual-to-physical address transla­ and SQWF. The CEAC chip is primarily responsible tion is performed by the SQWF chip based on fo r handling control of l/0 register read and write 8, 192 mapping registers implemented in exter­ requests from the NCA to the various clevices on the nal static RAMs. EDAL. The SQWF chip hand lcs D.v!A. transfers and buffering of data between the SCSI controller chip • Serial Lines-The DC708') quad universal asyn­ and the NC:A. chronous receiver/transmitter (UART) chip was The CEAC chip, which was fi rst implemented in a chosen to provide the Model 90 with fo ur serial Xilinx 3090 FPGA and later converted to a conven­ lines fo r the keyboard, mouse, modem, and tional gate array, is a 3,400-gate design and uses L 19 /console ports. The DC7085 provides a 1/0 pins of a 160-pin plastic quacl flat package 64-entry FIFO queue that is shared by all four (PQFP). It performs the CP I bus arbitration receive lines and is implemented in a small exter­ between the SQ\V F for SCSI D.VIA, the sc;E<: fo r nal SRJ\J'vl. Etl1ernet DMA traffic, ami the 1CA fo r l/0 register access. The CL\C responds to NCA 1!0 accesses that • Sound-The Model 90 sound functionality is are directed at internal CTAC/SQWF registers ami implemented using the AMD 79C:i0 sound chip EDAL device registers. Its slave sequencer controls just as it was in the Model 60. The programmed read, write, and chip-sl'lL:ct signals that control 1/0 interface to this device allows both record EDAL CE.\C and playback functions through a jack on the devices. The has C:Pl and EDAL multi­ plexing logic which selects between addresses ami front panel, and provides voice-quality sound. data and is controlled by the slave sequencer. The • TURllOcllanneJ-The Model 90 provides a single CEAC chip contains an interrupt controller which slot into which any TURBOchannel option that is consists of interrupt request and mask registers, supported by the VMS operating system may be priority

86 Vo l. Xu. .) Slll11111(!1' 199.! Digital Te cbukal jourual Tb e VAXstatinn 4000 Model 90

110 1/0 pins of a !60-pin PQFI'. The SQ\\ TF responds and software development cost, we mer with a to requests from the NCR ';;'SC94 SCSI controller number of graphics hardware and software engi­ SCSI neers. We fo und that a new X terminal, the chip to do D:\IA transfers. During D:v!A, VXT the SQWF chip helps to optimize utilization of the 2000, was being developed with graphics based on CP l/NDAL!N.vtl buses by buffering up to eight bytes a cost-reduced version of the SPX graphics module of data in either direction. The SQWF performs byte originally used in the VAXstation 3100 system. This swapping to map the NCR ';3C94's 16 -bit transfers module was close to the Model 60 LCG in both cost to arbitrary main memory byte boundaries. The and performance. In addition, it was designed to SQWF contains a 22-bit main memory address byte interface directly to a COAL bus and was software counter and a direction bit which are accessible as com patible with the VA Xstation 3100 SPX. As a registers in 1/0 space. The SQWF chip also performs result, the module cou ld interface directly to our DMA virtual-to-physical address translation by refer­ CP2 bus with a minimal nu mber of changes to its encing an H, 192-page address map store based in supporting software. SPX externalSH AM. To use the VXT2000 module in the Mo del 90, we needed to lay out the module again to fir the Graphics Subsystem physical constraints of our system. This new mod­ ule was named LCSPX (low-cost SPX). No logic One of the keys to producing a workstation around design work was required on the LCSPX or on the the VA ,'< 4000 Model 500 core was the ability to inte­ system module to support it. A connector on the grate graphics support into the system successfully. CP2 bus provides the interface to the LCSPX module. In addition, maintaining the high level of graphics AJ though the performance of the 2000 SPX performance fo und in the Model 60 was viewed VXT module was cl ose tO that of the LCG on the Model as an important goaL The Model 60 offered three 60, we wanted to extract as much performance out very good graphics options. 'fhe Model 60 low-cost of the i.CSL'X module as possible. To improve the graphics (Lc:c;) option fe atures an inexpensive performance of the LCSPX, we increased the clock frame buffermod ule and two-dimensional graphics speed of the module. A speed analysis of the mod­ acceleration logic cont:�ined within a l:�rge gate ule was performed to determine how much margin array on the system modu le. The other Model 60 ex isted in the design. The original 2000 SI'X graphics options, SPXg and SI'Xgt, are three-dimen­ VXT sional graphics accelerators. The SPXg is an 8-plane module ran at 20 ,v1 Hz, and we determined th:�t by upgrading a number of components, the LCSPX option, and the SPXgt is a 24-plane option. The could run at 25 MHz. As a result of this 25 rercent three-dimensional graphics options simply re place the LCG fra me buffe r in the Model 60. We rea lized increase in speed, the performance of the I.CSPX that the Model 90 system had to support a high-per­ module exceeds the performance of the Model 60 fo rmance, entry-level, two-dimensional gra phics LCG fo r almost all operations. option ami the three-dimensional SPXg and SPXgt options. The first major task in the design of the SPXg and SPXgt Graphics Op tions Model 90 was defi ning the entry-level graphics On the Model 60, the SI'Xg :� nd SPXgt graphics option. options plug into the LCG fra me buffe r port, and a subset of the LCG control logic provides access to LCSPX Graphics Op tion these options. To support SPXg and SPXgt on the From the start of the Model 90 project, we knew· Model 90, a port that emulated the LCG frame buffer we cou ld not support ru ;. The I.CG control logic port was required. The Model 60 supports both :1 on the Model 60 was embedded within a very large PTO :�nd a DMA interface to the SPXg and Sl'Xgt, but gate array that also served as a memory and 1/0 the Model 90 supports only a 1'10 interface. controller. 'T'h is part was not compatible with our We considered a DMA interface fo r the Model 90, core architecture. Redesigning the Model 60 LCG but discarded the idea for several reasons. A DMA logic to fir our system would have been a major interface similar to the Model 60, which supports design task requiring a midsize gate array. This virtual DMA, requires more logic than would fit in was we ll beyond our engineering schedule and the progra mmable technologies we were consider­ resources. ing for the .Model 90. A simpler DMA interface To find a graphics option that wou ld provide would not have been compatible with VMS graphics the desired performance and have a low hardware system software and wou ld have required a large

Digilal Te cbuical jo urua/ Vol. 1 i\'u. j Sullmwr 1<)9.! 87 NVAX-micr oprocessor VAXSystems

number of changes to the software. Final ly, it breadboard allowed quicker ami more accurate appeared that processing on the SPXg :. llld SPXgt hardware verification than logic simulation. In modules, and not data bandwidth, was the limiting addition, the breadboard allowed debugging of fa ctor in performance in the Model 60 system. console and VMS software earlier than a conven­ Based on this analysis. a high-speed PIO interface tiona I prototype. was built. The breadboard system was based on a \1,\..\. 4000 The SPXg/Sl'Xgt interface on the Model 90 simply Model ')00. Logically, the breadboard simply translates C:P2 bus read and wri te commands into extended the Cl' I and CP2 buses of a VA X 4000 fra me buffe r port transactions. The interface is Model 500 system to include the complete 1/0 and pipelined such that it can keep up with the peak graphics subsystems of the Model 90. The Model 90 transfer rate of the CP2 bus. We implemented the breadboard was an eight-layer etch module and majority of SPXg/SPXgt interface logic using two included all the devices on the Model 90 CP I, CP2, AMD :VlACH PALs. One of these large PA Ls contains and EDAL buses. 'fhe breadboard system used a VA,'\ the control sequencer and generates all CP2, frame 4000 Model '500 test backplane that allowed com­ buffe r port, and (lata path control signals. The other plete physical access to both si(l es of the VA_,'{ 4000 l'viACH PA L contains an and address Model 500 CPU module. A socket with pins that data path. A few miscellaneous medium-scale inte­ extended I inch through the back of the module gration components make up the remainder of the was used on the NCA chip of the VA X 4000 Model interface. Performance analysis of the SI'Xg and 500 CPU modu le. The breadboard, wl1ich contained SPXgt modules shows that performance on the the holes fo r the NCA, was then attached to the VAX Model 90 is virtually the same as on the Model 60. 4000 Model ';00 CPU module by solderi ng it to the extended socket pins. CP bus clocks were not Physical Design directly routed to the breadboard logic. To control The physical design of the Model 90 system board clock skew. a phase lock loop (PLL) was used on the presented many challenges. Being a module breadboard to regenerate the CP bus clocks. With upgrade from the Model 60, the Model 90 used a this configuration, the breadboard system was able system board that had many fixed-position obsta­ to run at fu ll speed . cles fo r placement and routing, such as connectors Once the breadboard system was assemblecl , and stand-off post holes. ln addition to the seven we were able to execute console com mands after connectors and the single sw itch along the back of a quick debugging of the system. At this rime, very the unit, seven more connectors scattered about little of the breadboard logic was being used the module had to retain their positio ns . Also. the because the console program was using the VA,'{ i\·lodel 90 had to fit eight SLVL VIS in the same area that 4000 Model '500 I/O devices and not the Model the Model 60 had six SIMMs. Furthermore, pro­ 90 devices. The hardware team began debug­ grammable technologies generally provide logic ging the breadboard logic piece by piece. of less density than conventional gate arrays, and Debugging was quick because a completely func­ therefo re require more module space. To meet tional console and 1/0 system al really existed. these challenges, we eliminated on-board main Simpl e fu nctions, such as register reads and writes, memory (8:VIB were present on the Model 60) and were debugged using the console examine and reduced the size of the secondary cache from the deposit commands. More complex fu nctions, such originally planned ')J2K.B to 256KR. as reading ami writing to an SCSI disk, were tested The Model 90 system board measures 16 inches by writing test programs in VAX :-'LACRO, download­ by 10.5 inches, and has 8 layers of etch. approxi­ ing them into memory, and executing them using mately 100 surface-mount and through-hole com­ the console. ponents, 23 connectors, 'i oscillators, and over 300 After some of the major pieces of functionality discrete resistors and capacitors. All components were verified hy the hardware group, members are mounted on a single side. Figure 2 shows the of the VMS group began to use the breadboard. Model 90 system module. A modified version of the Vi\,IS operating system was used to debug V�VIS device drivers. Drivers Mode/ 90 Breadboard Sy stem fo r the serial lines. LCSPX, SPXg. SI'Xgt, and the SCSI Our logic verification strategy depended on build­ port were debugged. In addition to software debug­ ing a breadboard early in tl1e design cycle. 'fhis ging, this effo rt provided the software to perform

88 Vol. 4 No . . > Slllllll/er 1')')2 Digital Te cbnical jo urnal Th e VAX station 4000 Model 90

CEAC SQWF NVAX 1/0 ROMS CHIP CHIP ADAPTER (NCA)

MEMORY SIMM NVAX CONNECTORS

Figure 2 VAX station 4000 lHodel

extended verification. The hardware group was As soon as we assembled the first Model 90 pro­ able to use graphics test packages running under totype system, we realized the benefits of a.ll DECwinclows software. disk exercisers. system tile work performed using the breadboard system. exercisers. and other tools supported by the VMS During the first day of debugging, we ran the operating system. This prov ided a ve rification envi­ console program and booted the VMS system ronment we could never achieve with traditional with minimal effort. We also ran DECwi ndows si mulation methods. software using the l.CSPX and the SI'Xg and At this point, we were still using the VA)\ 4000 SP:Xgt graphics options. Th is quick debugging Mod el 500 console. The breadboard was then used allowed additional protOtype systems to be built to debug the Model 90 console code. We disabled immediately and sh ipped to various develop­ the system support ch ip, which controls much of ment and verification groups throughout the the console support hardw:�re in the VAX 4000 company. Model 500, and began using the Model 90 console support hardware. A base console that included Pe rformance minimal power-up self-test, bas ic command sup­ The VA.X station 4000 Model 90 represents the port, and SCSI hoot support was debugged by the fastest VAX workstation ever produced . Its Ci'llper ­ Model 90 console team. Once the console was func­ fo rmance surpasses previous VAX workstations and tional, the VMS group re turned and debugged boot is comparable to Digital's RJSC-based workstations. support fo r the Model 90 usi ng the bread board. By utilizing the NVA)( CPU chip, the Model 90 When this was finished, the software was com­ achieves 2.7 times the performance of the .Model 60 plete ly debugged and ready to be loaded onto the when measured aga inst the SPECmark bench­ first Model 90 prototype. marks.' Ta ble 2 gives the CPU performance of the

Digital Te ch11ical jo urnal Vol. -1 .Vo . 3 Sunnner 1')<)2 89 NV AX-microprocessorVAX Systems

Ta ble 2 CPU Performance Compa rison the Model 60 and the Model 90. Ta ble 4 compares Workstation SPECmark* Rating the three-dimensional gr aphics performance of sev­ eral of Digital's workstations using standard three­ VA Xstation 4000 Model 90 32.7 dimensional metrics. In ad dition, Ta ble '5 gives VA Xstation 4000 Model 60 12.0 three-d imensional performance using the picture­ VA Xstation 31 00 Model 76 6.8 level benchmark (PLB) suite. DECstation 5000 Model 240 32.4 Summary Note: The NV AX CPU chip provides the high performance •sPEC mark is a quantitative measurement of performance, determined by running a suite of ten benchmark programs. that makes the \'A,'\station 4000 iV lodel 90 competi­ tive in today's market. The design methodology used during the project allowed us to develop and VA Xstation Model 90 compared to other Digital ship the Model 90 qu ickly and to prov ide a simple workstations. upgrade path fo r existing VAXstation customers. LCSPX is the entry- leveL two-dimensional graph­ ics option offere(l on the Model 90. The perfor­ Acknowledgments mance of this option is better than the LCG option The authors would like to acknowledge the fo llow­ offered on the lV Iodel 60 fo r most graphics opera­ ing people fo r tneir contributions to the VAXstation tions. Ta ble :) compares the LCSPX graphics perfor­ Model 90 development: Gregg Bouchard , Paul mance to Digit

Ta ble 3 Tw o-dimensional Graph ics Performance Compa rison

Two-dimensional Two-dimensional Area Fill Vectors Workstation (Mpixels per second} (Kvectors per second}

VA Xstation 4000 Model 90 LCSPX 18.2 266.0 VA Xstation 4000 Model 60 LCG 14.6 21 6.0 VA Xstation 31 00 Model 76 SPX 14.2 183.0 DECstation 5000 Model 240 PXG 13.9 263.0

Ta ble 4 Three-di mensional Graphics Performance Comparison

Three-dimensional Three-dimensional Polygons Vectors Workstation (Kpolygons per second) (Kvectors per second}

VA Xstation 4000 Model 90 SPXgt 33 295 VA Xstation 4000 Model 60 SPXgt 33 300 VA Xstation 4000 Model 90 SPXg 30 295 VA Xstation 4000 Model 60 SPXg 30 295 VA Xstation 31 00 Model 76 SPX 6 57 DECstation 5000 Model 240 PXG 52 302

Di_q,itnlTe chnical jounwl 90 l'ol. ·-i No.. i S11111111�r I?'JJ Th e VA Xs ta tion 4000 Model 90

Ta ble 5 PLB Graphics Performance Comparison

GPCmark PLBiit Results* Printed Circuit System Cyl inder Workstation Board Chassis Head Head Shuttle

VA Xstation 4000 Model 90 SPXgt 13.2 11.8 8.5 8.3 13.5 VA Xstation 4000 Model 60 SPXgt 12.3 11 8.4 8.5 12.5 .1 DECstation 5000 Model 240 PXG 10.0 11.7 14.9 19.2 18.3

Note:

"GPCmark is a quantitative measurement of performance, determined by dividing a normalizing constant by the elapsed time, in seconds, required to perform the test.

References 3. TURBOchannel Ha rdwa re Sp ecification (Palo Alto, CA: Digital Equipment Corporation, 1. (;. Uhler et al.. '"The f\fVA X and NVA X+ High­ TfU/ADD Program, 1990). perfo rmance VA,'\ Microprocessors:· Digital Te chnical Jo urnal, vol. 4, no. 3 (Summer 4. Small Computer -�:ystem. Interfa ce (SCSI) 1992, this issue): 11-2�. (New Yo rk: American National Standards Institute, 131-19R6. l9R6). 2. J. Crowell et al ., "'Design of the VAX 4000 A.J"J SI X.) Model 400, 500, ancl 600 Systems," Digital 5. Digital 's VA Xs tation 4000 Fa milv Pei/Or­ Te chnical jo urnal, vol. 4, no. 3 (Summer mance Summary Ve rsion 3. 0 (Maynard: 1992, this issue): 60-72. Digital Equipment Corporation, 1992).

Digital Te chuical jo urnal Vo l. 4 No ..'i Summer 1992 91 Brian Po rter I

VAX 6000Erro r Handling: A Pragmatic App roach

The VMS op erating sy stem 's CPUdepeudent support of the V,4X 6000family ot com­ puters implements a complex and sophisticated set of error-handling routines. At

tbe start of a ��\IS session, these routines help cons/met tbe necesscn:rframeu•ork to support tbe f/ 0 subsy stem us tbe system begins to enwge. Fo r IIlllCh of a 1'.1/S ses­ sion, these routines t/.?en fc()' dormant ll'ilbin the SYSWA image. Periodical/); u•ben aroused, !bey peer into hardware registers lookiltgj(Jr signs of trouble. Often, all is tl'e/1, and the routines retum to hibenwtion. On t/.wse occasions tl'hen the hard­ tmre requires assistance, error handling takes co111p!ete control of t!Je sy stem. It bas but one mission: identifJ ' tbe err()}; recol'er if possible, but at all costs ensure tbat the integrity of the SJ!Stem re111ains intact and that data ispre served

Error hand ling is the set of routines that resides ment of error hand I ing to support the CPl l modules in the CP! ·-dependent loadable image known as and memory controllers that make up the system SYSLOA. Each processor model th:tt supports the kernel in the VAX 6000 series. This paper explains VA X system architecture and V.\·!S operating system our error-handling strategy to not only red uce the has its own SYSLOA im:tge. Error handling is imple­ amount of unique coding, but also provide an mented with other common routines like console opportunity to enhance, mature, and improve exist­ support and secondary processor start -up. Error ing VA X 6000 products. hand I ing is unique fo r each processor model. Individual processor models bring with them a Development of Error-handling wealth of error detectors and consistency checkers. Routines fo r the VAX 6000 Platform Each device has to be independently interrogated The VA X 6000 platform provided a unique opportu­ and reset once triggered. nity to develop error-handling ro utines. As shown

Error hand ling of one fo rm or another resides in Figure l, the X.vl l backbone of the system allows throughout the VMS operating system. In some con­ the creation of increasingly powerful systems that texts, trying to edit a file in a directory structure retain much of their operating characteristics. that does not exist can be considered an error. Th is Incre:tses in processor capability are gained by paper discusses only errors that deal with the merely exchanging processor modules fo r more underlying CPU and memory hardware on which powerful models. We decided th:lt error hand ling the VMS system is running. Jt describes the develop- should not be any differen t. On prior systems,

·------.. I I I 1 110 1 I 1 1/0 N I I I I ·------J ·------J

·------.. I I I 1/0 2 I I I ·------1

S) ·stem Figure 1 Block Diagram

92 I vi. i .Vu. 3 Sllllllill!r 1'.192 D(�ital Te cbnical journal VAX GOOO Error Handling: A Pragmatic Approach

a complete set of error-hand ling routines fo r each Objectives CPU model had to be implement ed. We adopted an Error handl ing must identify the error and recover approach to error hand I ing that could be carried if possible. Above all, it must guarantee the fo rward from one processor to the next with l itrle integrity of the system and the preservation of data. or no change to the initial error-hand ling model. An important project goal was to produce a This approach handles identical errors in the same robust and quality product that would have pre­ way with the same code base. dictable perfo rmance. We chose to have a single The protocol of the X1'<1 l bus was modified to error-handling model that could be implemented allow support of write-back caching schemes of the fo r all VAX 6000 Cl'lJ models. We also adopted an VA X 6000 Model 500 and VA,'( 6000 Moclel 600. implementation methodology that included the However, this had no ill effect on the overall error­ capability to allow rigorous testing of the many handling model we decided to use in the support of code paths contained in the various configurations. the VA,'\ 6000 fa mily of processors. To accomplish this goal, we designed the test and verification strategy in conjunction with the over­ VAX 6000 Fa miZJ! Error Delivery all system design of the kernel error-handling sub­ Identical mechanisms were used ro structure error system. In addition, we designed and implemented del ivery on each processor in the VA,'{ 6000 fa mily. an object-oriented code base fo r errors that are Each processor has two system control block (SCI1) common across the platform. Errors are handled in interrupt vectors and a single SCR exception vector. this way when they are associated with main mem­

The inrerrupt vectors deliver hard and soft errors. ory. with XMI bus protocols. or with the support of The exception vector del ivers machine check vector processors. exceptions. Most frequently occurring errors are associated with main memory. The error handling fo r main Hm 'd Error in Hard errors can be ca tego­ terrupts memory is composed of three major fu nctions. The rized in the fo llowing way. Hard errors occur as first handles the complexity of support fo r two dif­ conditions that are not synchronous w the program ferent memory controller types ancl their internal counter (PC). In almost all instances, systems can­ error cond itions. The other two fu nctions are logi­ not recover from hard errors. They indicate that cally spl it between single-bit error correction code data or machine state has been lost. Hard errors are (ECC) fa ilures and double-bit ECC fa ilures. normally fa tal. Hard errors are delivered through Common error-hand ling interfaces and routines SCI! vector (i O (hexadecimal); interrupt priority were established fo r the VAX 6000 fa mily of proces­ level (JPL) is raised to decimal. 29 sors. The use of common files and interfaces ensures that errors are hand led in exactly the same Soj/. Error illterrupts Soft errors. on the other way for each CPU model . band, generally signal an asynchronous condition, with respect to the PC, that has been corrected by Full Support of the Sy mmetric hardware, or that can be overcome with some soft­ Multiprocessing Pa radigm ware intervention. Soft errors are normally always The VAX fa mily of CPUs are symmetric multi­ benign to system operation. Soft errors are del iv­ 6000 processing (SMI') systems. The error-handling ered through SC:R vector 54 (hexadecimal); IPL is model assumes that more than one CPU is always raised to 26 decimal. active. The synchronization of error handling

Machine Check Ex ceptions Machine check excep­ throughout the system has numerous benefits. If an tions are internal processor conditions that are syn­ error condition were detected throughout the chronous to the PC. If the condition can be system, it would be a very complicated procedure corrected when the instruction that caused the to ensure that all CPl Js reacted consistently. Such exception is reexecuted, the result is the same as if errors would clutter the error log with reports the condition had not occurred. Many of the from every CPU and X.'

Digital Te cbuica/ journal Vo l. 4 No. 3 S11 111111er I'J'JJ 93 NVAX- microprocessor VA)( Systems

first CPll to respond to the error. Machine state and data. These messages are time-stamped and is create(! that informs other CPU nodes that the sent to the system console device OPAO:. event has been logged on their behalf. As each CPU Messages can be output in two diffe rent modes. node responds to the error condition, it can inter­ Interrupt driven mode is the most common and rogate this state. In the event that all error con­ uses the standard terminal driver functions of the ditions have been logged on behalf of a CPU, the running V:VIS session. Messages that use this mode error condition is cleared and the interrupt or describe the disabl ing of some part of the CPU ker· exception is dismissed. The one entry in the error nel at system start-up or during the current session. log for these types of errors clearly indicates that The other mode of output is synchronous and is in other nodes were active. Info rmation about the line with error processing. This mode is reserved nodes affe cted and state indicating how the node fo r hardware errors that are nonrecoverable ami was affected is recorded in the single error log result in a system crash. The message is output just entry. prior to calling the suc;cHECK mechanism that would terminate the current VMS session abnor· CP U Configuration Data in the Error Log mally. Messages are always descriptive of the error or exception condition and contain all the machine A CPU running with some of its hardware disabled state available at the time of the error. may have operating characteristics that cause other Formatted messages allow fo r errors that occur CPlJs to incur error conditions of some type. An as the system is being initialized to be reported ami error log entry from a VA)C 6000 CPU always described should the system fa il to boot. The out­ includes the configuration of other active CPUs put of messages is fu lly synchronized between the on the system. For example, if the CPU at node 6 primary and secondary CPUs of SMI' systems. The is running with irs backup cache disabled, other primary CPU outputs messages about errors occur­ crus include this info rmation with their error log ring on secondary processors. data. Thus, potential error conditions can be easily identified. Error Rate Checking and Loop Error Log Filtering Detection The VA X 6000 fa mily of CPlls provides a great deal of Some errors that occur at too high a rate are filtered error detection. The erro r conditions signaled in from the error log. Errors that are delivered by the many cases are benign to the system if the appropri­ soft error vector are invariably benign to system ate action is taken. Howe\'er, blind recovery from operation. It is important that they be reported errors c:.1n be a downfaJ I in itself. It is not uncom­ because they can ind icate an impending fatal error mon fo r so many benign fa ilures to occur that error in some subsystem. However, if these errors are handling is the only task being performed by the occurring too often, only a subset is sent to the system. Error handling on the VAX 6000 family error log. The algorithm is based on an error count implements a system of rate checking ami loop over time. If an error is occurring too rapidly, detection to combat this ]J roblem. logging of the errors is inhibited. At a later time, log­ ging is reenablecl . Errors that do not appear in the error log are still counted, and the accumulated Ra te and Loop Detection Tinze Base totals are displayed by other error conditions that The timing standard used by the rate checking and are sent to the error log. loop detection subsystems is the CPl I TODR register. The TODR is independent of soft· Message Fa cility ware and increments every 10 milliseconds.

Error handling on the VAX 6000 has the unique abil­ ity to output fo rmatted messages. Integral to the Rate Checking of Errors error-handling subsystem is a message processing Each error condition has an associated rate check faci lity that is composed of specialized ro utines database. The database tracks TODH values fo r the and modified versions of several VMS system ser­ three most recent errors. If these errors occur tuo vices. The modified system services include SYSFAO fa st, special action is taken in addition to that ancl S'!"SCVRTI:VI. The message facility provides the required to service the error. This may involve dis­ error-handl ing subsystem with the capability to abling the signaling of the error condition itself. For output fo rmatted messages that contain both text example, some errors that are reported outside the

SIIIIIIJ!er 94 Vol. 4 No. 3 19').! Digital Te chnical jo urnal Error VA X 6000 Ha ndling· A Pragnwtic Approach

• Setup and synchronization CPU can be turned off. When sufficient data about an error has been col lected, the error may be dis­ • Saving of state abled fo r a period of time. Hardware fe atures such as internal cache errors can also be disabled. If • Parsing of data cache errors occur that are recoverable, but are • Processing and accounting of state occurring too fa st, the cache is disabled. The occur­ Error logging rence of mu ltiple errors can ind icate a broken • structure, whereas a single error can indicate a sin­ • Error reset and dismissal gle transient event. This logical orga nization provides flexibi lity to the implementation being addressed at the time. Loop Detection The parsing of data step was added at the same Multiple errors of diffe rent types can also occur fre­ time that support of the VAX 6000 Model 400 was quently. In this situation, the system is operational, implementecl. but it is continuously at high IPL, servicing error interrupts or exceptions. This operating scenario is Setup and S)mchronization detected by the frequency of transitions in and out Synchronization is accompl ished by acquiring the of error hand I in g. When error-hand ling code MCHECK spinlock. The use of spinlocks is a V:'I'IS threads are entered and exited. the TODR value is technique that provides atomic access to code saved . During ex ecution of error handling, the threads and data structures and ensures that only enter TODR value is compared to the last exit TODR one CPU at a time is in error handling. Thus it is pos­ value. If the result is too close, a count is incre­ sible to compress an error condition occurring men ted. If the close relationship of exit to entry throughout the system into a single error log entry continues to occur, a loop condition is declared and by the first CJ>ll to service the error. appropriate action is taken. Most often this means Following synchronization. the S.'vll' sanity and the system is shut down. spinlock acquisition timers are disabled . lf an error occurs at the boundary of one of these timers, a Error-handling Mo del fa lse termination of the session can occur due to The traditional approach to error hand] ing in the the time consumed by the execution of error han­ V;'vlS operating system has been to interrogate regis­ dling. The SiVI l' sanity and spinwait timers are mech­ ters and act on the data directly in real time. anisms used in VMS to ensure that CJ>Us active in a Another approach has been to save only a subset of multiprocessor system are interactive with each processor stare that has a linkage to the error deliv­ other and the synchroniz:Hion primitives that con­ ery vector and then act on this data during a later trol access to various resources. The sanity timer is parse operation. used as a watchdog timer to ensure that CPUs When designing the error-handling model fo r the respond to hardware clock interrupts on a regu lar VA X 6000 series, we decided to save all CPU state basis. Each CPU ::tetive on the system monitors that is visible to macro progra mmers in buffers spe­ another CPU fo r its response to hardware clock cific to each CPU. Al l interrogations are then made interrupts. The spinwait timer guarantees that one on the data in this buffe r. Information on the hard­ CPU does not retain ownership of a spinlock ware state is saved as well as the current system resource fo r more than an al lotted time period. time. Any action taken by error hand ling is also Error handling is always executed at an IPL above recorded in the buffe r. This approach has several which hard\vare clock interrupts can be serviced. advantages. First, a distinct fo otprint of the last As a resu lt, it defeats tbe sanity timer mechanism. error is contained in the system image in memory. Some of the actions taken by error hand I ing can Shou ld the system fa il, the data is saved when a cause a spin wait timer to expire if the error being crash clump is taken. Second. the many ex ecution serviced occurs roo close to the timer boundary By possibilities are made easier to test and ver­ disabling these SMP timers. a time period is starte(l ify. Finally, conditions are easier to diagnose if the over when the error being serviced is dismissed original data that error handling processed and the and timers are reenabled. actions that were taken are recorded in an error log. The buffe r associated with the CPU experiencing The error-handling process fo r the VAX 6000 the error is initialized to zero and is ready to receive series consists of six distinct steps: the latest error state. Jf the error is a machine check,

Digital Te ciJilical ]our11al \'til. ·I No. .) St.tmJJU'I' !')').! NVAX-microprocessor VAX Systems

stack space is allocated and initialized to allow fo r register read indications can help in the diagnosis of error dismissal and error-hand ling ex it. the original error condition. When mach ine checks or soft error interrupts The recovery blocks used when error state is col­ are serviced. the cache subsystem is uncondition­ lected have special flags to indicate that they were ally disabled . Error handling does this to preserve establ ished by erro r hancl l ing. If an error does

any error state that may be in the cache. Jf the error occur, control is returned to the appropriate po int happens to be cache- related . the state can be in the error-handling exec ution thread . The origi­ extraned at the appropriate time. Cache-related nal saved state of the first error is nor ove rwritten. errors are not reported by h<.�rcl error interrupts on

some VA..'< oOOO models. When hard error interrupts Pa rsing of Data are serviced on these processor models, the cache The parsing of data step was added to support the is not disabled by software. VAX oOOO Model 400 and later models. The data col­ The machine ched: flow has an additional check lected in the saving of state rou tines is parsed as to determine if the error is associated with recov­ a a separate step. When error data is parsed and pro­ ery bl ock. lkcovery blocks on the VMS system cessed in a single step, as in the VA)( oOOO Model 200 provide kernel-mode macro programmers with and VAX 6000 Model _100, it is difficult to make the the ability to protect an execu tion thread from necessary errors invisible to the error log. 'Wh en the effects of fa tal machine check exceptions. the error data is parsed and an error mask is pro­ Normally when a machine check occurs in kernel cluced that represents the error conditions present, mode, the code thread being execu ted loses con­ it is much easier to detect if all current conditions trol. Unless the instruction can be re started . the have been serviced . Since it is also easier to detect session is terminated. The placement of kernel­ \f.\;JS conditions with only one erro r present, expected mode execution threads within the context of error conditions can be processed. The VA,'< oOOO a recovery block ensures that a machine check Model 400 and later models have many benign error will cause control to be p:.tssecl to the end of the syndromes that have their logging fi ltered. recovery block. along with status to indicate that the machine check has happened. The macro pro­ Processing and Accounting grammer can select what type of mach ine check During the processing and accounting step of error to be protected from. In ge neral. this is limited to handling, the data in the CPU private local buffe r is those machine checks caused by references to the parsed and acted upon. These routines detect if a physical address space that do not respond or specific error condition is global and sensed return data. throughout the system or local to the particular If it is determined that the current delivered CPU . Global errors include the state from other machine check is protected by a recovery block crus and devices in the error log if required . If and that error handling establ ished this condition. the global error is the only one present and it is the error is dismissed immediately without further expected, machine state is set to indicate that error action. More details are given in the fo llowing logging should not occur. Should this CPU be the section. fi rst to process the global error, it samples data in registers of the other CPlJs and devices and leaves Sewing of State state to indicate the error condition has been ser­ All available CPU error state is saved regard less of viced and is ex pected. Consequently, the context of error type or deliverv mechanism. Machine checks global errors is included into a si ngle entry in the also save the internalstat e passed by microcode on error log. XMT parity errors are serviced in this way. the stack. Each register is read into its local storage Local errors record only the state from the CPU buffer within the context of a recovery block. A experiencing the error in the error log. valid bit is associated with each local copy register Error handling supports the notion of expectecl cel l acco rd ing to its status as it ex its the recovery errors, or errors that sometimes occur as a result of block context. This is important because each cell operations perfo rmed by error han el l in g. These has been in itial ized before use. A register value of errors are not reported to the error log. For exam­ zero may be significant, and a failed register read ple, duplicate tag parity errors can sometimes wo uld al low the initialized val ue to be int erpreted occur when the backup cache on the VAX oOOO as not having any error or state bits set. Fa iled Model 200 and VAX 0000 Model _100 is invalidated .

Digital Te cbuicnl jo urnal 1.-f!f. 4 .\'o ..) S11111111�r 19')2 VA X 6000 Error Ha ndling: A PragmaticAp proach

To cause these errors to be invisible to the error log, memory subsystem error, a flag to indicate that a mask of error bits to ignore is set up when backup no memory errors are present is set. This reduces cache invalidate operations are executed. At the the burden on the error log buffers of the V;v\S same time, a fo rk process thread that ultimately system and reduces the clutter and confusion of clears the appropriate mask bit is queued. If an error registers from a device that does not have an error that would be invisible to the error log error condition present. occurs, the "ignore-this-error" bit is sampled in the Any error that ultimately causes the system to recovery thread fo r the error condition. If the bit is crash is also logged to the system console terminal set, the error is ignored. If the error does not occur. through the SYSLOA message facility. Errors can eventually IPL is lowered until the queued fo rk pro­ occur during start -up before V\'IS error logging is cess runs and clears the bit in the mask. This guar­ available. Errors can also occur and terminate the antees that later real errors that have previously session before the system completes initialization. been expected and have not occurred are not For these reasons, fa tal errors are always output to excluded from the error log. the console terminal before the session is termi­ nated. Errors that occur at start-up of secondary processors are monitored by the primary proces­ Logging Error sor. Any output required is done by the primary The data collected by the error-handling routines is processor. sent to the VMS system's error log after it is tallied fo r size. The amount of record space is allocated by internal VMS routines. The raw data that describes Error Reset and Dismissal the context of the current error is copied to the VJVIS The last step of error handling resets error condi­ error log buffer along with the current values of tions that have been serviced and dismisses the accounting data fo r the CPU. The accounting data is interrupt or exception. The image of the data saved a count of the individual error conditions that have is used as a mask to reset error conditions. This occurred on this CPU fo r the current session. Any technique guarantees that double error conditions CPU that has some part of its fu nctionality disabled are not lost. includes that data in the error log as well. For exam­ Registers that require initialization are reset ple, a CPU that is executing with a disabled cache using the contents that were read when the error may cause errors to occur on other CJ'Us. It is useful was first serviced. Most error conditions are write­ to know that a CPU is running in a degraded mode one-clear. That is, to reset the error condition, a when investigating problems that are occurring on mask of the error conditions set has to be written to a system. The error log records of all CPUs clearly the appropriate register to clear the error. The use indicate any Cl'l!s operating at reduced capacity. If of the original contents of the register as a mask all CPlls are running unimpeded, the error log con­ guarantees that an error condition occurring dur­ tains a flag to indicate this status. ing the processing of an error cannot be lost. Reset The amount of data included in the error log fo r of the VAX 6000 Model 200 and VAX 6000 Model :)00 any given error can be different. The data describ­ error registers includes a later probe of the register ing the CPU context is the same except in the case fo r the absence of error indications. Should an error of machine checks. These errors also include inter­ still be present, the error-handling process is nal state passed by the microcode through the restarted and it treats the condition as a new error. stack. Depending on the error condition, context After errors are reset, the cache subsystem is invali­ from the XMI bus, the memory subsystem, or an dated and a check is made to determine if it should external XMI adapter can be included. The error be reenabled. Processing of the error could deter­ data is organized into various subpackets that are mine that the cache or indeed the C:l'l l should be signaled to be present by a flags field contained taken off-line because of an error rate or finite in a header section of the CPU context packet. count that is too high. If all is well. the cache sub­ For example, an error can occur that describes a system is reenabled . The MCHECK spinlock is now fa ilure of a transaction between the CPU and mem­ released and the interrupt or exception is dismissed ory. If the data collected from the memory subsys­ by executing a return from exception or interrupt tem during the processing and accounting step (REI) instruction. indicates an error is present, this is inclucled in If the error being dismissed is a machine check, the error log record. If there is no indication of a the additional storage allocated by error hand I ing

Digital Te cbnical jo unwl Vt!l. 1 No. 3 Summer 1')92 97 NV�"X-microprocessorVAX Systems

ami error state left by the microcode has to be If a condition handler bas not been established, the removed. As shown in Figure 2, the additional stor­ VMS process is aborLed. Kernel-mode threatls that age is an array of quadwords. These quadwonJs rep­ ex perience fa tal machine checks always result in resent program counter/processor status longworc1 the termination of the VMS session. (PC/PSI.) pairs that direct control to routines that must be executed prior to control being returned Support of the VAX 6000 Mode/ 200 and to the exception PC. 'fhe additional postprocessing VAX 6000 Mode/ 300 that takes place for noncorrectable memory errors The error-hand ling support t( Jr the VAX 6000 tvl odel and errors that cause a process to he aborted are 200 and VAX 6000 ;\·lodel 300 is identical These two VA X dispatched using this mechanism. processor models are the same logical CPU. 'file Machine check processing takes place at II'L 31 6000 Model :)00 is a selected faster component set Fatal memory error recovery uses the V�IS system·s of the VAX 6000 Model 200. page fa ult code threads. These threads use spin­ The VAX (> 000 Model 200 system presented a locks that cannot be acquired from IPL .'H. When llis­ unique problem fo r error hand ling because the pri­ missing machine checks, the PC/l'SL pairs are mary cache is internal to the CPU chip. Errors from interrogated to determine if they are nonzero. If the the primary cache c.lo not cause an interrupt or probed quadworcl is zero, the stack pointer is exception. These errors can never cause a fa ilure or upllated to unwind by the quaclworcl al located . This wrong result should they occur. Because all cache continues until all three array elements have been structures on the VA X 6000 Model 200 are write­ probed. If the array element is nonzero, the REI through, data can be both in cache and in memor�·. passes control to the PC and PSL described by that and it is always consistent. If a parity error occurs array element. Synchronization is thus preserved, on either the data or tag section of the primary and spinlock acquisition rules are obeyed. Even­ cache, microcode can alwaYs fetchanother COJ)\' of tually the array is traversed, and each element is the data from memory. If a primary cache rag or removed. State left by the microcode is removed, clara error occurs, microcode sets a status bit to control is passed back to the original exception PC, indicate the error in an internal . and instruction retry is attempted. If en·or hand ling The internal processor register is private to the determines that the execution thread shou!d be local. CPU. Previous C:J>lls with this type of error sig­ aborted, the original exception PC is replaced by naling used a polling technique to detect these fa il­ VMS SMI' a PC/PSL pair that returnscont rol to exception ures. On systems, only the primary Cl'l l is routines. From these routi nes, control is normally interrupted on a regular basis to allow polling rou­ passed to an appropriate mode conuition hand ler. tines to run.

STACK GROWTH VMS ALLOCATES THREE FROM HIGHER TO ADDITIONAL OUADWORDS. 31 0 LOWER ADDRESSES EACH QUADWORD IS USED AS A PC/PSL PAIR FOR EXIT I UNCORRECTABLE MEMORY OUADWORD 3 - - ERROR EXIT PROCESSING PROCESSING ROUTINES.

QUADWORD 2 - ABORT EXIT PROCESSING -

QUADWORD 1 - NORMAL EXIT PROCESSING - -THE STACK IS HERE WHEN THE STATE IS PASSED VMS GAINS CONTROL BY MICROCODE ON THE MICROCODE STATE 1 AFTER MACHINE CHECK STACK AT MACHINE MICROCODE STATE EXCEPTION. CHECK EXCEPTION 2

MICROCODE STATE N

EXCEPTION PC - - EXCEPTION PSL

Figure 2 1Hachine Check Exception Ex it Processing Stack Fo rmat

9R I i1/ . . 4 So. .J S""'"'"'" J'J'J2 D(r:;itnlTe e/m ica/ jo urnal VA X 6000 Rrror /-la ne/ling: A Pragmatic AjJfJroach

Since we had no precedent fo r reference. we detection incorporated within each increaseu anti

tlesigned a system whereby the primary CPl 1 uses bec.tme more complex. The ove rall model imple­ the interprocessor interrupt mechanism to inter­ mented on the VAX 6000 iYlodel 200 was main­ rupt secondary processors. When the secondary tained, but a step was added between the steps of

CPU receives the CPU-specific interprocessor inter­ saving of state and processing and accounting. The rupt, it reads the appropriate internal processor new step, parsing of data, was prev iously a part of register. places the data in a known location, and processing and accounting. Error hand I ing support sets an indicator flag. On later pol l cycles, the of the on-board CPU electrically erasable pro­ C:l'll (EEI'I{O:'vl) primary sees the indication from the sec­ grammable read-only memory was also ondary Cl'l l and interrogates the known location added. The EEPHO,vl was until now used only by CP! I fo r any error bits. If no errors are detected , each console support. secondary Cl'l! is polled once every ten seconds. The overall model now became

Should an error be found. the secondary CPl l witb Setup and synchronization the error has its pol ling frequency increased to •

once every second. If ten successive polls indicate • Saving of state error conditions, the secondary CPU is signaled to disable its primary cache. If th is occurs, entries are • Parsing of data made in the error log ami to the system console by • Processing and accounting the primary CPl. on behalf of the secondary CPU. During systems integration of the VAX 6000 • Error logging .Model 200, certain r:mdom-access memory (RAM ) • Error reset and dismissal devices used fo r the backup cache exhibited exces­ sive parity error fa ilures. The problem was so Storage space fo r machine check and hard error severe that special error-handling software and is shared in the VA X 6000 Model 200 system. additional Cl'l hardware fu nctionality were devel­ However, this support became too compl icated to oped to isolate and diagnose the fa ilures. TI1is work manage . In the VA X 6000 Model 400 and later mod­ was so successhil that the hard·ware functionality els, both sets of error state are available in crash was made a permanent fe ature of the processor, dumps. ami the error-handling routines were made a per­ manent parr of the SYSI.OA image . The har(l\vare Et'PROM Support fu nctionality and software routines allowed fo r the fa iling data bit in the backup cache to be identified The experience gained from systems integration of at the rime of the fa ilure. The VA X 6000 Model 200 the VA.,'( 6000 Model 200 showed that real-time diag­ experience had a lasting effe ct on error handl ing nosis by the operating system has many benefits. However, the scheme us d by the 6000 Model across the VAX 6000 family. The ability to diagnose e VAX 200 was and recorded only the result­ cache parity errors to the bit level in the operating cumbersome system remains a characteristic of error hand I ing ing diagnostic data to the error log. The challenge was twofold: to make the mechanics of cache parity on all VA,'\ 6000 systems. error diagnosis easier, and to make the data more widely available. We achieved both goa ls by using Supp ort of t he VAX 6000 Jl1odel 400and the EEPROM on the CPU module. The VA X 6000 VAX 6000 Mode/ 500 Model 500 made add itional improvements by using EEPROM. AJthough the CPU chips on the VAX 6000 .Model 400 both on-board and high-speed RAM and and VA X 6000 Motlel 500 are the same. the SYSLOA EEPROM an d RA.ivl structures exist within the images are not. The major diffe rence between the physical address space of the VA X 6000 fa mily. two systems is the write-back cache subsystem These structures are primarily used by the console

implemented by the VAX 6000 Nlodel .:;oo. To facili­ fo r cross-session storage of data. High-speed Rlu\•1 is tate write-back cache strategies, the Xi\-11 bus was used fo r general heap storage by the console. RAM enhanced to support a directory-based broadcast and EEPROM structures have physical addn.:ssl.'sthat coherence prorocol.1 are in the 1/0 region of the address space. Address 'fhe VAX 6000 Model400 and VAX 6000 Model '500 references to 1/0 address space do nor cause cache systems represented a dramatic increase in system lookups. The code threads that perform data complexity f<>rerror handling. The amount of error extraction "·vere placed in the EEPROM and RAM

Digital Te cbuicnl jo unw/ VtJI. " .\'u. j Su111mer 19'J2 NVAX-rnicropi-ocessor VAX Systems

structures to avoid special hardware operating present. It vector errors are detected. the appropri­ modes. A fe w simple routines enabled easier diag­ ate support routines-soft error, bard error, or nosis of cache parin· error failures and a method of mach ine check-are invoked .

disahl ing the cache that does nor disr upt error Error handling supports a maximum of fo ur state. The error-ham! I ing SOl vectors were pointed vector processors. If the number of errors or the to the LIYR0.\1 so the routines that disable the cache rate of errors becomes wo great, vector processors could do so without making cache references. (On are removed from use. Error h�l nd I i ng never th e VAX 6000 .Vl odel "iOO. the cache disabling rou­ removes the only or last remaining vector proces­ tines are placed in the high-speell H,\�1.) sor. Support of errors has the When an error occurs, control is first passed same characteristics as support fo r CPU-related ro individual routines that reside in f/0 space. These error conditions. Portions of the vector processor routines disable the cache subsystem and then are disabled if the associated error rate becomes too returnco ntrol to the SYSLOA image in main memory. great. If other errors continue. the unit is 1-emove(l

The VA X () 000 Model 500 has an error transition from use. The notions of ex pected errors and mode (l:T.\·1). which allows the backup cache to be errors that are invisible to the error log also exist. partially disabled. New blocks are not allocated when in ET.\1 mode . Data re quests are filled from Support of the VAX 6000 Model 600 the cache. Error interrupts or exceptions on the Error checking and detection on the VA X 6000 VAX 6000 Model '500 dispatch to routines that exe­ Model 600 are very complex processes. There are cute from l/0 sp:tee ami place tbe write-back wel l over 160 unique soft and hard erro r conditions backup cache into ET,VJ and disable the write­ as categorized by the software. The actual count tiJrough primar�· cache. declare(! by the hardware is much greater. The dis­ The EEPR0,\·1 on both the VAX 6000 Model 400 and parity results from the way software groups error VAX 6000 IV lodel 500 is also used to store fa ilure conditions. The VAX6000 Model 600 error hand ling info rmation. \Xfhen errors occllr, a counter that is fol lowed the enhanced model implemented on t h e associated with the specific error condition is VA,'( 6000 Model 400. The state saved fo r interroga­ incremented. The number of error conditions is tion by VAX 6000 Model () ()() error hand I i ng consists finite and fu !Jy (iescri bnl by tile error mask pro­ of 40 internal and XMI-acldressable registers. duced b\· the parsing of data routines. Writing to Support of the VAX 6000 ,VIodel 600 also included tile C:l'l . EEI'RO.vl is rime-consuming compare(! to support of the on-board CPU EEPROM fo r the long­ writing to main memory A byte write to EEPROM term storage of fa ilure info rmation. The support L1kes on the order of 15 milliseconds. ']() avoid t h is of the EEPRO.VI was extended to include the history overhead, the EEPROM VMS data actually resides in of the cache subsystem perfo rmance in previous main memory d uring a V.\-IS session. As each Cl'l l is sessions. initialized by the VMS system, the contents of the Like the VAX 6000 Model 500 system, the VAX V.\·lS area arc read into individual Cl'l ! memory 6000 Model 600 implements a write-back backup regions. Updates that are required are made to cache strategy. The VAX 6000 Model 600 backup these regions. When C:PlJs are stopped or when the cache operates using a directory-based broadcast

system is shut down or has crashed, the region of coherence protocoL 1 Each 32-byte cache block is in memorv associated with a particular CPl ! is written one of three states: invalid, vaJ id/read-on ly, or back to that CIT's EEPR0.\-1. In addition to error valid/written. Multiple caches may hold read-only inform:ltion, a count of seconds run in a v:v1 s envi­ data simultaneously; written data may be held by ronment is tallied. only one cache in the SYstem at a time. Write privi­ lege fo r a block must be obtained before modifying Ve ctor Processor Support the data in that block. One set of routines supports the VA.'\ () ()()() .Ylodel Certain backt�p cache error conditions are severe cf ()() ami VAX () 000 Model ')00 vector processors. enough to disable the cache. The backup cache may Thcst: rou tines are orga niznl in an identical. manner contain ''' ritten data that is unavailable elsewhere

to the CPU routines and hlllow the same steps in the system . To access that data, the backup cache relate(! to C:Pl l error conditions. During the process­ is put into cr:v1 , a state which allows written data to ing and :1ccounting of C:P1 error conditions, a check be accessed by the cache controller, but disal lows determines if any vector processor errors are the use of read-only clara.

[(l() l·rJ/. ·i ,\'o . . i Sllll/111<'1' /')').! Digital Te cbuica/ journal VA X 6000 l;"rror Ha ndling: A ?rognwtic AjJjHooch

A cache enters ETM as a fu nction of either soft­ When this occurs, an appropriate message is sent to ware or harthvare. The cache is put into ETt'v! by the console term inal during system start-up. hardware only when cache data may have been cor­ Ta g parity errors that occur in the VIC are diag­ rupted, or when cache data may be inconsistent nosed in an unusual manner. Unlike other caches with data in memory Thus, correctable backup on the VA.'\ (iOOO Model 600, the tag store of the VlC cache errors do not cause a transition into ETM, contains a virtual address. To determine which hit but uncorrectabJe errors do. A parity error on the has failed when a parity error occu rs, the tag sto re NVA.Xdata and address lines (NDAL) interface causes is probed to re trieve the contents of the fa iling tag location. The associated clara store location is the cache to enter ETM beca use an invalidate or write-back request may have been missed. A cache probed to retrieve its co ntents: each bit in the had transition into ET!VI occurs when a request for write tag address is then flipped in turn. As each bit is privilege or a write hack does not complete suc­ flipped, the range of the re sultant virtual address is cessfully on the VAX 6000 Model 600. The state compared to page tables to deter mine its valid it�· of the cache is likely to be inconsistent with that of within current context. The virtual address is trans­ memory. lated. and the resulting physical adtlress is mappnl

Three requ irements govern cache operation dur­ to allow error hand I ing to read the contents of the ing ETM : (l) The state of the cache is preserved as page . Th e appropriate contents of the newlY much :ts possible to allow software to diagnose the mapped page are compared to rhe contents read problem. (2) Memory references that hit writ ten from the VIC data store. Jf one ami only one match blocks in the cache are processed, since this is the is fo und, the failing tag bit is identifieLI. :'1,1asks of

only source of data in the system. (3) Cache fa iling bits from all VA X 6000 :Yi odel 600 cache CPU coherency requests from the NOAL are processed structures are stored in the EEPROM along with normally so that cache state remains consistent other fa ilure info rmation.

with memory. The instruction pipel:ne com plicates VAX 6000 Al though complex, ETM allows the software to Model 600 error hand I ing. Jn many instances, choose when and how to disable the cache. To errors experienced are in no wa\· related to tile cur­ make the process of error handling less cumber­ rent instruction being executed or interru pted . some, the backup cache is unconditionally put into \X1 hen an error cloes occur. care must be taken to ETM by the software when any error condition is fu lly understand in what context the error has an being serviced . effect. ECC protects both tag and data stores on the backup cache on the VAX 6000 Model (J OO. Correct­ Co rrectableMe mory Errors able EC< : errors in the backup cache have a record Correctable memory errors are data errors that are of fa iled syndromes kept by error-h:mdl ing rou­ COITectecl by the memory controller before data is tines. Should the same synd ro me fail on more than returned to the requester. They occur primarily one occasion in a single V,\·IS session. the backup because ofalpha particle radiation, affectonl y a sin­ cache is disabled. gle cei L and are transient in nat ure . Co rrectable If uncorrectable backup cache errors occur, memory errors are completely benign. To deter­ the error-hand ling ro utines determine if the block mine if a memory cont r oller reporting correctable

is owned by the cru and attempt to flush the errors has real defects. multiple errors must be block back to memory. If successful or if the block viewed.

is not owned, the backup cache is disabled before VA X 6000 error handl ing implements a scheme returning the system to normal oper:J. tion. If whereby error data re ported by mt·mory con­ the data cannot be recovered , the V,VIS session is troll ers fo r correctable errors is compressed into a terminated. structu re cal led a fo otprint. The h>otprint reduces

If the backup cache is disabled by erro r handl ing the data reported into a t·I s. Furthermore, the fo otprint

I Digital Te cbnical jo urnal Vol. 4 No. 3 S11111mer I'J'.J2 0 I 1\'"V�"X-microprocessorV�"X Systems

block maintained per fo otprint is used to track the The data is not corrected in the storage DRA.\,ts on context (repeat errors, scrubbing, etc.) of this error the controller. if the location is read over and over as well as other errors that match this footprint m. again, the same error and correction cycle occurs The assumption here is that most correctable each time. This continues lllHil the location is memory errors are a result of DRAM component updated with write data. An interrupt can be gener­ fa ults, hence the granularity of the unique DRA.\·1 . ated fo r each error correction cycle. Care must be As shown in Figure ,1, the fo otprint fo rms a :)2-bit taken when scrubbing memory locations. The data integer from the XMl node lD,the ECC syndrome of in any given memory address can be shared by any the error, and the memory controller bank in error. number of Cl'Us or 1/0 devices. When this is the The integer is used to locate other correctable case, a higher-level software protocol is normally errors that have occurred in an internal database. used to synchronize access. Error hand ling would Along with the footprint, the address of the cor­ not he privy to these protocols. VAX 6000 ��lode! rectable error is passecl to a set of routines that per­ 500 and VAX 6000 Model 600 memory scrubbing is fo rms all processing of correctable memory errors. possible because of the XYII2 hus protocol. Bet<>re a The database tracks the range of addresses that have CPU can modify anv location in memory, it must be experienced correctable errors for the same foot­ the exclusive owner of the :)2-byte block in which print. This aids in the diagnosis of row ancl column the address resides. o·wnership is effected at the fa ilures with the DHA.VIs that make up memory con­ primitive hardware level and so exclusive access is troller sto rage. On t he VAX6000 Model ')00 ancl VA.'\. guaranteed.

6000 Model 600 systems, memory scrubbing status \XI hen a correctable error interrupt occurs on a is also tracked. \�X 6000 Model 500 or \�'\ 6000 Model 600 system, Memory scrubbing is a technique for reducing error handling rewrites the fa iling location \V ith its the number of error interrupts from locations that contents. The abilit\' to cause an interrupt is dis­ are re errors caused by alpha panicle distm­ abled in memory controllers that continue to bance. Scrubbing removes transient faults from report errors with the same fo otprint or that have the system, which in turn reduces the number of not responded to scrubbing. This action occurs interrupts that result from such errors. In addition, after sufficient data has indicated that something it helps to differentiate transient errors from per­ other than alpha particle disturbance bas occurred manent DRA,'vl component faults, as captured in and the memory controller may require service. the error log. This information was previously The rate of correcuble memory error interrupts unavailable. is checked to reduce the burden on the system. If When tbe VAX 6000 memory controller detects the rate of errors occurring becomes too high, the a correctable memory error, circuitry in the con­ ability to interrupt is disabled at the problem con­ troller corrects the data returned for the request. troller fo r a period of time. Correctable memory error data collected lluring a V,viS session is sent to the error log at the end of the VMS session. 31 23 15 0

XMI NODE ID FAILING BANK ECC SYNDROME I I UncorrectableMem oryErrors LOWEST ADDRESS l 'ncorrectable memory errors experienced by the Cl'l are reported as machine checks. These HIGHEST ADDRESS machine checks are synchronous with the PC mak­ COUNT ing the reference. Uncorrectable memory errors occur when data is lost by the memory cont roller FLAGS and cannot be re-created by its ECC circuitry; fo rtu­ I 2 I 1 I 0 nately, these errors seldom occur. Uncorrectable KEY: memory errors represent a serious problem to the 0 BUSY 1 SCRUBBED execution thread that experiences them. The hard­ 2 INHIBIT THE REPORTING OF CORRECTABLE ware cannot assist in the recovery of this type of MEiviORY ERRORS error; recovery is totally a software fu nction. If the page that experiences an uncorrectable Figure 3 1Vlenwi:J' Correctob/e Error error is a process private page that has not been Fo otprintSt ructure modified, and the code thread currently executing

102 lbl. ·i So. .3 St1111111er I'J')l Digital Te c/)llical]ourual Vt'lX 6000 Error Hcmdling: A Pragmatic Approach

is at pageable priority, the error is not considered represent synchronization points between an i\1TEST fatal. The error-hand I ing routines arrange fo r the error-handling execution thread and the page to be re-created in a diffe rent physical page in si mutator. memory by inva lidating the necessary memory The MTEST simulator is a privileged image and management structures. As a result, a translation­ consists of a user interface, a number of nonpage­ nor-valid exception occurs when the instruction able internal buffe rs, and simulator routines. The that experienced the exception is retried. The page user interface allows the internal buffe rs to be fault mechanisms of the V,\1 5 system do the actual selected and loaded with data patterns of the user's re-creation. The original page with the error is pur choice. The user interface also allmvs the user to on a I ist of bad pages internalto the VtvJ S system. If pass control to the SCfl vectors of the VMS system. the page does not meet the criteria fo r replace­ In our case we used the vectors that are the I inkage ment, either the process is deleted or the v.vJ S ses­ to error-hand ling ro utines. Once in control, error sion is terminated. If the process is deleted, the handling would execute its model until it reached a page is marked "bad" by error hand ling, ancl the DEBl iG_TR ANSFER code segment. The segment process run-down routines in VMS retire the page would determine that this was an error being simu­ to the bad page I ist. lated and return control to NITEST. MTEST would then decide if the synchronization point was one Te sting fo r which the user has data. The clara would be transferred from the buffe r named in the Early in the project. we decided the ability to test DEBUG_TRA NSFER code segment to the address also and verify had to be built into error handling to pro­ declared in the segment. By judiciously placing tl1e duce a predictable, robust, and quality product. DEBUG_TR ANSFER synchronization points and care­ Although the 6000 fa mily and CI'Us in general VA)\ fu lly selecting an appropriate data pattern, we !lave a nu mber of fea tures that allow errors to be were able to simulate any and all error conditions generated , they rend nor to be general-purpose. In fo r the appropriate CPU. most cases. they are designed fo r use by special In this way. we were able to verify many complex diagnostic software that does not operate in the algo rithms and code paths that wou ld have been context of an operating system , e.g., the VMS oper­ difficult to ex ercise. We were also able to verify ating system. We chose to implement a scheme error hand I ing and error logging from the point of whereby errors would be simulated in software on error to the error log file. MTEST can be either inter­ the target hardware. This approach gave us several active or procedure-driven. This aspect allowed us clear advantages. The most important was that the to maintain a library of procedures that could be approach cou ld be extended as the power and com­ used at any time to verify that operational charac­ plexity of CPU models increased ancl that complete teristics fo r individual errors had not changed when control was with the designers. No special hard­ code paths that affected many error types were ware equipment or C:Pll fe ature would be required. modified. The only precondition was that certain software MTEST was the primary tool we used fo r testing. implementation guidelines had to be fo llowed to During the test ancl verification phase, prototype make use of the simulator. hardware that bad real error conditions became Machine check test (MTEST) consists of two available, and we used these prototypes. parts, a utility ancl an error-hand ling implementa­ tion methodology. The methodology consists of using main memory storage as the primary agent Conclusions that is acted upon by error handling. This method The VAX 6000 family now has a robust and complete also fir into our model of retaining data in memory. set of error-handling routines that accomplished The other requirement was the strategic placement our project goals. In fact, many routines were never of the DEUUG_TRANSFER macro. DEBUG_TRANSFER before part of the VMS system. These ro utines expands to produce a code segment that deter­ include the ability to report complete error context mines if the current error being serviced is an error to the system console and the ability to group fa il­ simulation or not. If it is, data that resides in mem­ ures occurring across the system to a single error SMI' ory that is being interrogated is modified, in con­ l.og entry. An important feature is the ability to cert with i\HEST. to reflect the error condition recognize and retire failing processors from the DEBUG_TR ANSFER VMS being simulated. code segments active set of a session and allow the session to

Digital Te dmical jo urnal Vol. 4 No. 3 Summer 19')2 103 NV AX-microprocessor VAJ( Systems

continue. These ro utines and others su pport the 6000 CPl!s showed great discipline in producing entire range of VAX 6000 CPU models. The object­ engineering specifications that met the needs of oriented approach to error conditions not on the both hardware and software engineering groups. CPU module has made support and introduction of The many hours spent painstakingly describing newer routines easier. The ability to test at will any intricate details of erro r conditions ami the produc­

or all error-hand I ing ro utines has been a tremen­ tion of p:.�rse trees al.l owecl the structured approach dous advantage. we set our to achieve. Special thanks to Mike Uhler fo r his parse trees ami to Nick Carr, who suggested Acknowledgments this paper be written. Our success resulted from a number of factors, Reference including the advantages of designing the ability to test into the pr

actually executing a code thread to determine the performance Wv'\ iV licroprocessors:· Digital effectiveness of its design goal. The various engi­ Te chnical Journal. vol. 4. no. 3 (Summer neering grou ps involved in designing the many 1992. this issu e): 11-2}

Digital Te e/mica/ journal 104 vbl ..; No . . i Su1111ner I'J'.!l ISSN 0898-901X

Printed in U.S.A. EY-J884E-DP/92 ll 02 19.0 Copyright © Digital Equipment Corporation. All Rights Reserved .