Digital Technical Tournal Digital Equipment Corporation Managing Editor Richard W Beam Edltor Jane C. Dlak Pcoductloa St& Production Editor - Helen 1 Partenon Designer - Charlotte Bell Typographers -Jonathan M. Bohy Macgaret Burdine lllusultor - Deborah Kc~lcy Advisoiy Board Samuel H. Fuller, Chairman Robert M. Glorioso John W. McCredle Mahendra R. Patel F. Grant Saviers William D. Srrcckr Victor A. Vyssutsky The Digital Technical Journal is published by Digital Equipment Corporatloa, 77 Reed Road, Hudson, Magsachu~etts0 1749. Changes of address should be sent to Digital Equipment Corporation. attention: List Maintenance. I0 Forbes Road, Northboro, MA 01532 Please indude the address label wlth changes marked. Comments on the content of any paper arc welcomed. Write to the editor at Mall Stop HL02.3/K11 at the published~bpaddress. Comments can ahbe sent on the BNET to RDVAX: :BIAKEor on the ARPANET to B~%RDVAX.DE~DE~. Copyright @ 1988 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational lnstltutions by faculty members and are nor distributed for commercial advantage. Abstncting with credit of Digital Equipment Corporation's authorship is permltted. Requests for other copies for a Pee may br made to Digiul Press of Digital Equipment Corporation. A11 rights reserved. The information in this journal is subject to change without notice and should not bc construed as a com- mltment by Digital Equipment Corporation. Digital Equipment Corpmtion assumes no responslbllity for any errors that may appcss in this document ISSN 0898.901X Documentatlcm Number EY-6742H-DP The following are wademarks of Digital Gquipmcnt Corporation: ALL.IN-I. DEQNA, HSC70. I. 1 1. MicmVAX, MicroVAX 11, MicroVAX 3000, NMI, NOTES, Qbbus, Q22-, Rh81, R.482, RD54, RQDX3, RX50, TKSO, ULTRIX-32, VAX, VAX-11/780, VAX-11/782, VAX 6200, -.- -.- VAX 6210, VAX 6220, VAX 6230, VM 6240, VAX 8200, @##tWt# @3 Dl'fMf 'J dBrUrCBd Gil#fXS @@&&DmOne VAX 8300, VhX 8650, VAX 8800, VAX 8840, VAXBI, JiW!4mdtnmtanre, m~o~wfa~i~sflrrVAXELN, VAX MACRO, VAX SPM, VAX/vMS, VMS. krsMC# &WH%W4? S&~O?B a WE Of Compu-Share is a trademark of Compu-She. Jnc. W tehaaJogy. 72w 81:@- qfLW f~pqgmmsrb GDS I1 is a uademvk of Calma Corpora~iua. pq&mmtmpaw#& the gptm wEm&W$y Mtbr new SPICE is a trademark of the University of Calliornia at m-q-. berkcley Tektronix and DAS are trademarks of Tehniu. Lor Book produalon was done by Digital's Eduatiml Services Mtdia Communications Group in Bedford. Mh Contents

8 Foreword Robert M. Supnik

CVAX-based Systems

10 An Overview of the VAX 6200 Family of Systems Brian R. Allison

19 The Architectural Definition Process of the VAX 6200 Family Brian R. Allison

2 8 Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus Richard B. Gillett, Jr.

47 The Rok of -aided Engineering in the Design of the VAX 6200 System Jean H. Basmaji, Glenn P. Gamey, Masood Heydari, and Arthur L. Singer 5 7 VMS Symmetric Multiprocessing Rodney N. Gamache and Kathleen D. Morse

64 Perfomance Evaluation of the VAX 6200 Systems Bhagyam Moses and Karen T. DeGregory

7 9 Overview of the Micro VAX 3500/3600 Processor Module Gary P. Lidington

87 Design of the MicroVAX 3500/3600 Second-level Cache Charles J. DeVane

95 The CVM 78034 Chip, a 32-bit Second-generation VAX Microprocessor Thomas F. Fox, Paul E. Gronowski, Anil K. Jain, Burton M. Leary, and Daniel G. Miner

109 Development of the CVAX Floating Point Chip Edward J. McLellan, Gilbert M. Wolrich, and Robert AJ Yodlowski

12 1 The System Support Chip, a Multifunction Chip for CVAX Systems Jeff Winston

129 Developmertt of the CVAX 922-bus Interface Chip Barry A. Maskas

1 39 The CVAX CMCTL - A CMOS Memory Controller Chip David K. Morgan Editor's introduction

In the last paper related to the VAX 6200 system, Rhagyam Moscs and Karen DeGregory describe the development of workloads to measure VAX 6240 performance. As part of their discussion, they include performance measurements and analysis. The second new system based on the CVAX chip set is the low-end MicroVAX 3500/3600 system, which offers three times the performance of its predecessor, the MicroVAX 11. In his over- view of the major sections of the processor mod- ule, Gary Lidington relates how schedule and performance requirements influenced product C. Jane Blake design decisions. Editor Charles DeVane then describes the MicroVAX 3 500/3600 system's two-level cache architecture, The second issue of the Digital TecbnicalJournul with cmphasis on the design of the second-level (March 1986) feat~lred papers on the then cache. He also presents some cache performance recently announced MicroVAX 11 system, a system test results. based on a single-chip VAX implementation. In The high perforn~anceof both the VAX 6200 this seventh issue, we present papers on the sec- family and the MicroVAX 3500/3600 system is ond generation of that chip set, CVAX, the two attributable in great measure to the CMOS VAX new systems that take advantage of its increased family of chips on which these systems are based. performance capabilities, and a new version of Our five final papers address the design and the VAX/VMS operating system for symmetric development of this chip set. Frank Fox, Paul multiprocessing. Gronowski, Anil Jain, Mike Leary, and Dan Miner The new mid-range system based on the CVAX begin the discussion with an explanation of how chip set is the VAX 6200 family of , designers achieved the performance goals for the which utilizes a multiprocessing architecture. single-chip VAX CPU by reducing ticks per The first of two papers by Brian Allison is an instruction and machine cycle time. overview of this highly configurable, expandable A companion to the CVAX CPU, the floating system. Brian's second paper offers insights into point processor chip offers floating point perfor- the architectural definition process for the 6200. mance equal to that of the microprocessor for One of the major decisions made by the 6200 integer operations. The approach taken to attain engineers was to design a new interconnect to this goal and a description of the chip are pre- support the multiprocessor system. Rick Gillett sented by Ed McLellan, Gil Wolrich, and Bob presents an informative discussion of the com- Yodlowski. plexities involved in interfacing a microprocessor Jeff Winston then discusses the development of to a high-speed, multiprocessing bus. the system support chip, which provides a com- To ensure the availability of first-pass func- mon core of peripheral system functions. tional parts, a design verification team of engi- Next, Barry Maskas relates the design efforts of neers worked in parallel with the 6200 module three groups, one in Japan and two in the U.S., designers. Jean Basmaji, Glenn Ga~ey,Masood that resulted in a single-chip interface between Heydari, anti Art Singer discuss the computcr- the CVAX microprocessor and the Q22-bus 1/0 aided engineering and verification principles the subsystem. team instituted for the project. In our final paper, Dave Morgan describes the Rod Gamache and Kathy Morse then describe CVAX memory controller chip, CMCTL, which is the major features of synlmetric multiprocessing optimized for Q-bus-based systems. in the VAX/VMS operating system. Of particular interest is their description of a new synchroniza- tion method implelnented in VAX/VMS version 5.O. Biographies

Brian R. Allison Brian Allison, a consultant engincer for mid-range VAX systems, was the system architect responsible for the coordination of the VAX 6200 system definition and design. Prior to this work, he served as system architect for a project that yiclded several products. including DEBNA, DEBNK, and the KA800. As a member of the VAX-11/750 design team, he wrote various portions of the microcode for that product. Brian holds a B.S.E.E. and a B.S.C.S. from Worcester Polytechnic Institute (1977).

Jean H. Basmaji Jean Basmaji is the technical director of computer-aided engineering and design-verification testing for the VAX 6200 project. A soft- ware consultant engineer, he has also been involved with CAE/DVT planning and scheduling, and has served as CAE/DVT project leader for the VAX 6200 CPU module. Jean joined Digital after receiving his B.S.E E from Lowell Technological Institute in 1077.

Karen T. DeGregory A scnior software engineer in the Systems Perfor- mance Analysis Group, Karen DeGregoly is project leader of systems perfor- mance measurement for the \'AX 8840 and VAX 6240 systems. In addition to planning and implementing the measurements, she helped develop appro- priate workloads for these systems. Prior to this work, Karen was a senior software specialist in the Softwarc Services Backup Support Group. She reccived her B.S. (1980) with honors and distinction and her M.S. (1 98 1) m4 from Cornell University. Charles J. DeVane Charles DcVane is a scnior hardware engineer in the MicroVAX Systems Development Group. For the MicroVAX 3500/3600 pro- ject, he designed the second-level cache on the KA650 CPU module and guidcd the process of module system debug and introduction to manufactur- ing. Before joining Digital in 1981, Charles received a B.S.E.E. from North Carolina State University in Raleigh, North carol in;^. He is a mcrnber of Eta Kappa Nu and Tau Beta Pi engineering honor societies. Biographies

Thomas F. Fox Frank Fox, a principal cngineer in the Semiconductor Engi- ncering Group, workcd on the implementation of the CVAX 78034 CPU chip. He is currently designing a high-performance microprocessor and consulting with the Adv;lnced Semiconductor Development Group on the development of a sub~nicronCMOS proccss. Frank was educated in Ireland and received a B.E. degree from University Collegc Cork (1974) and a Ph.D. degree from Trinity College Dublin (1978). both in electrical engineering. He has published papcrs on ultrasonic instrumentation and magnetic reso- nant imaging (MRI) and has three patents pending.

Rodney N. Gamache A consulting software engineer in the VAX/VMS Soft- ware Development Group, Rod Gamache has been with Digital for 1 1 years. Shortly after receiving a B.S. in mathematics and computer science from the University of New Hampshire, he joined Digital to work on the development of DECnet Phases I11 and IV for both RSX-I IM and VMS. For the last two years, Rod has been project leader for the VMS symmetrical multiprocessing project and has filed two patents on VMS SMP. Rod also serves as a consultant for the architectures of future low-end VAX processors.

Glenn P. Garvey Glenn Garvey is an engineering supervisor presently leading a team in the verification of a new VAX processor. He was the project leader for the system-level verification performed on the VAX 6200 system and has been involved in modeling and verification since coming to Digital in 1982. Glenn was a co-op student at Digital in 1980 and 1981. He holds a B.S.E.E. from Rensselaer Polytechnic Institute.

Richard B. Gillett, Jr. Rick Gillett, a consultant engineer, led the VAX 6200 CPU module project. Prior to his work on the 6200, he served as one of the architects of the XMI bus and was a member of the VAXBI bus project team. Relative to his work on the VAX 6200 design, he has recently filed 13 patcnt applications. Currently, he is system architect for a new VAX system. Rick joined Digital after receiving a B.S.E.E. (summa cum laude) from the University of Ncw Hampshire in 1979. He is a mcrnber of Tau Beta Pi and Phi Kappa Phi.

Paul E. Gronowski Before receiving a B.S.E.E. from the University of Cincinnati in 1984, Paul Gronowski was a co-op student at Digital working on chips for the VAX 8200 and 8.300 systems. Currently a senior engineer in thc Scmiconductor Engineering Group, he has been a codesigner of the E-box for the CVAX 78034 CPlJ chip, designer of the bus interface unit and exponent section for a CMOS tloating point chip, and is now doing advanced devclopment work for a new CMOS microprocessor. Paul is a member of Eta Kappa Rlu Masood Heydari Masood Heydari is an engineering manager rcsponsiblc for the computer-aided design, verification, and testlng of mid-range VAX sys- tems. He is also managing the devclopment of an 1/0 subsystem for VAX products. Since joining Digital in 1981, he has developed the diagnostics for a 36-bit system and has been responsible for functional test pattern genera- tion for several products. Most recently, he was responsible for CAD/DVT on the VAX 6200 project. Masood holds a B.S. and an M.S. in computer engineer- ing from Boston University (1 980). He is a member of Tau Beta Pi.

Anil K. Jain Anil Jain received an M.S.E.E. from the University of Cincin- nati (1980) and a B.S.E.E. from Punjab Engineering College (1978). Upon joining Digital in 1980, he worked on bipolar and CMOS-1 technology while a member of the Device Modeling Group. As a senior engineer working on the CVAX project, Anil designed the bus interface unit for the CPU chip and has a patent pending for the bus interfacc protocol. He is currently working as a project leader on the floating point chip project for vector applications.

Burton M. Leary Mike Leary is a principal engineer in the Semiconductor Engineering Group/Advanccd Developn~entMemory Group and is currently working on the design of advanced memory products. In previous work, he participated in the design of the floating point chips for the MicroVAX and 8200/8300 systems and thc design of a CMOS serial interface controller chip. Mike joined Digital after receiving a B.S.E.E. from the Universityof Mas- sachusetts. He is a member of Tau Beta Phi.

Gary P. Lidington Currently an engineering manager in the MicroVAX Engineering Group, Gary Lidington has served as a system engineer for the MicroVAX 3500/3600 system and as a maintainability engineering project manager for the lMicroVAX I1 and single-board computers with Q-bus multi- processing architectures. Before coming to Digital in 1981, Gary was a co-op student and test cngineer on the L200 project at Teradyne, Inc. He holds a R.S.E.E.(honors) from Tufts Ilniversity and an M.S. in computer engineering from the University of Massachusetts.

Barry A. Maskas Barry Maskas, a consulting engineer with the Semicon- ductor Engineering Group, is the project leader for the development of a cus- tom VLSI memory controller. Prior to his current work, he was the U.S. project leader and architect for the CVAX Q22-bus interfacc chip and co- designer of the MicroVAX IlCPU and memory boards. Barry came to Digital in 197') after receiving a B.S.E.E. from Pennsylvania State University. Edward J. McLellan lit1 McLellan is a principal cngineer in the Scmicon- cluctor Engineering Group. He has worked on thc design of several chips, including thc J- I 1 and the CVAX floating point accelerator chips. He holds one patent for previous work and has made application for two additional patents based on work done for the CVAX floating point chip. Ed joined Digital in 1980 after receiving a H.S. in compilter and systems engineering from Rensscl;~crPolytcclinic Institute. Currently he is project leader for a tloating point chip for a ncw processor.

Daniel G. Miner Dan Miner came to Digital after receiving a B.S. in com- puter engineering from Rensselaer Polytechnic Institute in 1985. A software cngineer in the Semicontluctor Engineering Group, he wrote the debug and di;~gnostictests for the CVAX CPU chip. Dan co-authored and presented a paper on the subject of testability strategy at the 1987 IEEE International Test Conference.

David K. Morgan Dave Morgan is the engineering manager of Advanced Peripheral Development in thc Semiconductor Engineering Group. For his work on several engjnccring projects. Dave has three patents pending. Before joining Digital in 1975, hc was a design engineer at the RCA Solid State Division and holds five piltents for his work on integrated circuit designs. Dave received a B.S.E.E. (1969) from Western New England College, an M.S.E.E. (1972) from Rutgers University, and has pursued doctoral work in solid-state physics.

Kathleen D. Morse As ;I consulting software engineer. Kathy Morse is working on advanced development for the VAX/VMS Development Group. She is one of the VMS SMP architects and consults on enhancements to vari- ous parts of the VlMS executive. Earlier, she implemented the VMS support for the first MicroVAX systems, the first asymmetric multiprocessing VAX sys- tem, and the kIA780 multipart memory. Kathy joined Digital in 1976 after receiving her B.S.C.S. degree from Worcester Polytechnic Institute, where she also earned her M.S.C.S degree in 1985. Kathy is a member of IEEE, thc Professional <:ouncil, and ACM as we1 l as Tau Beta Pi and Upsilon Phi Epsilon.

Bhagyam Moses Bhagy;~rnMoses is the engineering manager of the Mid- range Syste~nsPerformance Analysis Group, a group which she established two and a half years ago. Prior to her current work with the VAX 11000 and VAX 6200 series of systems, she had been involved in the modeling and per- formance measurement of thc VAX 8600 and 8650 systems, PDP- 1 1 systems, IIECSYSTEM-20. and earlier VAX systems. Bhagyam joined Digital in 1979. She receivctl a B.S. degree (honors) in mathematics from Spicer Memorial College and an M.S. it1 applictl mathematics from Howard University. Arthur L. Singer Art Singcr supewiscs the computer-aided design, simula- tion, and timing verific;~tionfor a new I/O adapter product. His previous work includes supervision of the s),stem simulation and timing verification of the VAY 6200 system and microdiagnostic and release support for the VAX 8600 project. Before joining Digital in 1984, Art was cmployed by IPL Sys- tems ;IS a design engineer and manager of a microdiagnostic development group. While at IPL, hc rcccived a patent for the dcsign of an instruction unit. He is a mernbcr of Tau Rcta Pi and Eta Kappa Nu.

Jeff Winston A ~ne~nbcrof the Senliconductor Engineering Group, Jeff' Winaton designed the microsequencer for the VAX 8200 CPU chip and led thc dcsign of two gcner:itions of the MicroVAX System Support Chip. Hc has also contributed to thc dc\.clopment of many CAD tools used in chip design. Beforc joining Digital in 1980.Jeff rcccived his B.S. ( 1979) and M.S. (1 980) in Electrical Engineering from Cornell University. He is currently leading the development of the XIMI interface chip set for a new mid-range VAXCPU.

Gilbert M. Wolrich A consulting engineer in the Semiconductor Engineer- ing Gro~1j>/1-\rchitect~irallyFocused Logic. Gil participated in the J-1 1 con- trol chip design and was project leader for the CVAX C17PA and the J-1 I FPA chip projects. He holds a patent for an AL.U with Carry Lcngth Detection uscd on thc J-l 1 FPA. Gil rcceivcd a B.S.E.E from Rensselaer Polytechnic Institute in 197 1 and an M.S.E.E. from Northeastern University in 1977.

Robert AJ Yodlowski Bob Yodlowski, a principal engineer, is the chip i~nplc~ncntationprojcct lcadcr and lea<[ circuit designer for the CVAX tloating point accelerator. For his work on this project. Bob has three patents pcncling. He was also senior circuit designer on the J- 1 1 floating point chip project. Relative to this project work, he is a co-inventor and patent holder for ALU with Carry Lcngth Dctcction (1987). Before joining Digital in 1977, Bob was a senior rncrnbcr of the technical staff at LFE Corporation in Waltham, MA. He rcceivcd ;I 13,s. in engineering physics from Cornell Univer- sity ( 1968) and an M.S.E.E. from Syracusc llnivcrsity (1 970). Foreword

Multiple process generations related by opti- cal scaling VAX microprocessors as the leading edge chip development projects rn Performance improvements targeted for greater than 50 percent per year 'This program not only provided the LSI Group with an overall structure for its process and chip development projects; it also provided Digital's system groups with a stable, long-term basis for planning system products. Robert M. Supnik The program was also a significant leap of Corporate Consultant, faith. When it was formulated, there was no VLS/ Technology, and MicroVAX business. The MicroVAX I1 system was Group Manager, two years away from shipment. Almost all design Semiconductor Engineering resources in the LSI Group and in the low end sys- MicroprocessorDevelopment tem groups were busy with the MicroVAX chip set and its related systems. Major development In May 1985, Digital introduced the MicroVAX I1 projects in technology, chip design, systems computer system. Based on the MicroVAX proces- design, and manufacturing were required to sor chip set, the MicroVAX I1 system offered bring the program vision to fruition. unsurpassed price, performance, and reliability Work began with development of the under- characteristics. In the three years since then, lying semiconductor technology. Starting in Digital has sold more than 100,000 systems 1983, a team from Semiconductor Manufactur- based on the MicroVAX chip set. There are more ing's Advanced Semiconductor Development MicroVAX-based systems in the field than all (ASD) defined, simulated, and tested CMOS-1, other types of VAX systems combined. Digital's first CMOS process. When first defined, In the same three years, the practice of com- CMOS- 1's key features - N-well base on a p-type puter engineering has advanccd considerably. epitaxial layer, two levels of metal interconnect, Faster processors, bigger memories, quieter pack- 2.0 micron feature size, direct scalability to ages, and more complex software have appeared 1.5 micron feature sizes - were controversial in a steady stream. For Digital to remain compe- within an industry that was still debating NMOS titive, we would need, over time, a second gener- versus CMOS. Over time, these choices have been ation of VISI-based VAX chips and systems. The vindicated. and CMOS- 1 has proven to be a main- chips and systems that constitute the second stream, robust, highly manufacturable process. VLSI-based generation are described in this issue Equally important was development of design of the Digital Technical Journal. methods for larger and more complex chips. The The planning for the second generation began Semiconductor Engineering Computer Aided in 1983. That year, the LSI Group (now Semicon- Design (CAD) Group continuously refined the ductor Operations) formulated a multiyear pro- structured design process first deployed for gram for the development of both semiconductor MicroVAX and V-11. The goals of this effort process technology and leading-edge chip prod- were improved simulation coverage, faster ucts. The key characteristics of this process/ turnaround time, and more extensive automated product plan were verification. One consequence of the increased CMOS (complementary metal-oxide-semicon- use of CAD tools was a dramatic increase in the ductor) process technology (Previous Digital amount of computing power required. This gen- chips were based on NMOS technology.) eration of chip development projects used four times as much computing power as the first VLSI an extensible multiprocessor system. They generation. defined a new system interconnect (supported by The Semiconductor Engineering Microproces- unique chips) to provide unprecedented flexi- sor Group began architectural prework on the bility and extensibility in configuring systems, second-generation chip set (called CVAX) in and new system packaging to support the con- mid- 1984. The overarching goal was simple: cept. However, a general-purpose n~ultiprocessor three times the performance of the MicroVAX system was feasible only if the VMS operating sys- chip set in less than three years - a compound tem could take advantage of the incremental performance growth rate of more than 50 per- power offered bv additional processors. This cent per year. The central processor design requirccl a major restructuring of VMS to support started from the MicroVAX base but drew upon symmetric (all processors equal) multiproces- ideas from other VAX implementations, notably sing. Thus, the definition and implementation of the 8700. The floating point unit design focused the mid-range 6200 system family and of VMS on minimal cxecution flows for the most common symmetric multiprocessing support had to be instructions. Both chips transitioned to imple- closely linked. mentation in 1985. As the engineering development projects pro- The original concept for the CVAX chip set gressed, manufacturing became heavily involved had been to build chip-for-chip analogues of in planning and executing the transition from MicroVAX - a central processor and a floating design to volume product. LSI Manufacturing in point unit. However, as the flexibility of the new Hudson, Massachusetts, introduced CMOS- I into CMOS process, and the efficiency of the CAD multiple fabrication units in order to produce tools, were appreciated by designers, the chip set prototypes quickly and to ramp up to high vol- concept expanded beyond the central processor ume production. System manufacturing groups to include key peripherals. The implementation in Westfield (Massachusetts), Albuquerque (New of these peripheral functions in VLSI chips made Mexico). Puerto Rico, and other sites worked systems faster, more reliable, and less expensive. closely with the system designers to introduce In addition, it allowed peripheral functions to be the new manufacturing processes required for standardized across multiple system implementa- system production. tions and additional functions to be added in The results of these development programs is a modular fashion. The Semiconductor Engineer- family of VAX systems with exemplary price, per- ing peripherals group (now Advanced Develop- formance, and reliability characteristics. More- ment) specified and implemented a memory con- over, the programs leave as residuals a set of troller, a memory driver, a console interface, and VLSI components from which other products can a Q-bus interf;~ce. be built, and base technology from which further After the MicroVAX 11 system shipped in May advances in chip and system design will evolve. 1985, the Low-end Systems Group and the Mid- The initial program vision has been fulfilled, range Systems Group became actively involved even exceeded. Many people, in teams and indi- in the specification of the CVAX chips and in vidually, worked together to bring this about. the definition of new systems utilizing the chip The excellence of the results reflects, in full set. In the low end, the .3500/3600 systems measure, the excellence of the work that they werc defined as evolutionary extensions of the have done . MicroVAX [I. Nonetheless, the performance targets for the new chips posed knotty design problems for a system family bounded by both cost and packaging considerations. In the mid-range, the system designers wished to exploit the CVAX chip set's combination of high performance and low cost by constructing Brian R. Allison 1

An Overview of the VAX 6200 Family of Systems

Digital's VAX 6200 series is a high-performance, expandable family of computer systems that combines low-cost microprocessors with high- performance memory and Z/O subsystems. Based on the CMOS VAX chip set, the VAX 6200 CPU module performs at 2.8 times the VAX-11/70 system; utilizing a multiprocessing architecture, system speeds are available up to 11 times the VM-11/780 system. The memory subsystem utilixes a multi- controller architecture for up to 256MB of total system memory. The XMZ bus, the electrical interconnect for the system, supports the multiple pro- cessors, memory subsystems, and VAXBZ channel adapters. The VAXBI is usedfor all 1/0 devices.

The VAX 6200 family of computer systcms is the with hardware optimized for efficient multipro- most recent ilddition to Digital's line of VAX com- cessor operation. The result is a system that offers puter systems. The VXX 6200 systems. primarily similar performance for a large class of applica- b;~sedon CMOS technology. are mid-range sys- tions at a better price-performance ratio than that tems which cxploit multiprocessing tcchniqucs. offcrctI by traditional single-processor, high-per- 'I'hc VAX 6200 family currently comprises four forn1;rnce conlpiltcr systems. systems, all built from common subassemblies. A primary objective of thc VAX 6200 system ,411). VAX 6200 system may be upgraded to ;lny design is to provide a highly configi~rableand other VAX 6200 system simply b)! adding CPIJ expandable computing environment. To achieve ;ind memory modules to thc existing cabinet. this objective. designers chose a niodular sub- This papcr provides ;In overview of the systcrn assembly design for the tota.1 system. This modu- ;und therefore a contcxt for the five papers that lar design provides for cost-effective basic sys- follow in this issue. Thcse papers describe scv- tems and also allows for system expansion to era1 of thc components in detail, the engineering achieve higher performance. AIl members of the design effort, the pcrforrnance evaluation pro- VAX 6200 family arc housed in the same cabinet cess, and some of the multiprocessing aspects of and use the same basic subassemblies. Thc only the operating system. di ffercnce is the number of processors, amount of In thc past. CMOS-bascd microprocessor tech- mcmon, and number of 1/0 devices. Table 1 nology has been used primarily to build low-cost details the configurations of the VAX 6210, systems. Today, by using multiples of these low- VAX 6220, VAX 6230, and VN( 6240 systcms. cost microproccssors, wc are presented a unique opportunity to produce a high-performance com- System Architecture puter systcrn when thc microprocessors are cou- pled with high-perform;~ncememory and I/O All VAX 6200 systems consist of CPll(s), mem- subsystcms. Although this type of system archi- ory. and 1/0 channel adapters connected to a tecture will not directly result in faster execution common system interconnect known as the XMI. of a single task, it docs result in grcatcr systcrn 'rhc VAXBI is ~tscdas the interconnect to all 1/0 throughput in applici~tions that have sever;~l devices in the system.' All memory and 1/0 simultaneously computable tasks. The architcc- devices are equally accessible by all CPUs in the turc couplcs thc effectiveness of the VMS opcrilt- systcrn Figure I shouis a block-level diagram of ing systcln in multiprogrammed environments the VAX 6200 systcm.

Digfrrrl Technical Journal No. 7 Arrgrrsl I988 CVAX-based Systems

Table 1 VAX 6200 Family System Configurations

VAX 6210 VAX 6220 VAX 6230 VAX 6240 -- Number of processors Main memory VAXBl channels CPU cycle time Cache size (per CPU) Free XMI slots Performance (times one VAX-111780 system) Maximum CPUs Maximum memory Maximum VAXBl channels

UP TO 256MB

4 CPUs MAXIMUM UP TO 11 X VAX-11/780

MEMORY

XMI 100MB/SECOND

VAXBl 6

OPTIONAL VAXBl EXPANDER CABINET

Figure I VAX 6200 System Block Diagram-

Digifal Technical Jour?~d 11 ,Vo ' AII~I~SI1988 An Overz~iewof the VAX 6200 Family of Systems

The primary goal of the VAX 6200 system is the number of I/Os available to VLSl chips and to allow higher levels of system performance the amount of logic that can be put on a module. through multiprocessing. To simplify software Surface-mounted components are used exten- design and to be consistent with previous multi- sively throughout the system. Further, many pas- processor architecture, it was essential to pro- sive components and a limited number of active vide a shared memory resource. All system mem- surface-mounted components reside on side 2 of ory is a global resource accessible through the the modules. All VAX 6200 modules limit the use same address space from each processor and from of surface mount to 50-mil lead pitch compo- all 1/0 devices. A sophisticated multilevel cache nents with vias on 100-mil centers. Across the contained locally in each CPU minimizes mem- modules in the system, there is a mixture of small ory acccsscs on the XMI. Cache coherency is outlinc integrated circuit (SOIC) , plastic leaded maintained totally by hardware. chip carrier (PLCC), and cerquad surface-mount packages. Technology All VAX 6200 XMI modules interface to the The VAX 6200 systems are based on a number of XMI through a set of eight semicustom parts. differenr CMOS technologies. The VAX CPU chip These eight chips are physically mounted on a set and the system interconnect transceivers are section of the module known as the "XMI cor- implemented entirely in Digital's full custom ner." This section of the module is approximately CMOS process featuring a size of 1.5 microns.* 12.7 cm (5 inches) by 3 cm (1.2 inches) and is The interface between each module and the located by the A, B, and C connectors of the mod- system interconnect is implemented in channel- ule. (See Figure 2.) The XMI interface area is less 1.5-micron CMOS gate arrays. Thc number of identical on all modules so that a common elec- gates used in these arrays varies from 18K to 50K trical load is presented to all slots on the XMI. gates. The interface to the VAXBI and the XMI The XMI corner has four 44-pin cerquad pack- arbitration system is implemented in 1.5-micron ages on side 1 of the module and four 44-pin channeled arrays. The on-board CPlJ caches are cerquad packages on side 2. In addition, approxi- implemented with 4 5-nanosecond (ns) 64K-by-4 mately 100 surface-mounted-device (SMD) sig- CMOS static random-access memories (SRAMs) nal termination resistors and bulk power capaci- and industry-standard CMOS cache tag chips. tors are divided evenly across both sides of the All VAX 6200 XMI and VAXBI modules are module in the XMI corner. connected to their respective backplanes by a Figure 2 is a photograph of the three VAX 6200 300-pin zero insertion force (ZIF) connector. All XMI modules. Note that all three modules have modulcs use 10-layer controlled impedance the identical components in the lower right cor- printed circuit boards. All cables from the mod- ner and a similar gate array directly above the ules arc connected through the backplane to XMI corner. improve reliability and to minimize the task of replacing modules. VAX 6200 CPU Module The VAX 6200 XMI backplane is a 14-layer As noted earlier, the VAX 6200 CPU (KA62A) is control lcd impedance printed circuit board. Side based on the CMOS VAX chip known as the CVAX. 1 consists entirely of surface-mount contacts for The KA62A is a single module that implements the ZIF connector. Side 2 consists of plated a full CPU subsystem. Included on the KA62A through holes for power strips and I/O pins, and module are surface-mount pads for resistors. These surface- The CVAX chip, which includes a 1 kilobyte mount resistors form the termination network for (KB) on-chip cache thc XMI signal lines. VAX 6200 XMI modules use a printed circuit An external 256KB cache board very similar to the VAXBI printed circuit w A floating point accelerator chip (CFPA) board. XMI modules have the same finger pin design as the VAXBI, but the module size is w Console support hardware 28 crn (1 1.025 inches) deep instead of An interface to the XMI 20.38 cm (8.025 inches) deep. The VAX 6200 modules make use of advanced Figure 3 shows a block diagram of the KA62A module technology features to maximize both module.

Dfgftal Tecbnical Journal No. 7 August 1988 CVAX-based I Systems

Figure 2 Three VAX 6200 XMI Modules

DIAGNOSTIC

128KB CONSOLE CACHE 256KB l:LYp, 1-1 ;:1 CACHE 1KB CACHE / ;gB 1

CDAL

CPU/XMI SUPPORT GATE ARRAY 1

I XMI -INTERFACE CHIPS I

XMI

Figure 3 VAX 6200 CPU Module (KA62A) Block Diagram

Digital Tecbnical Jourtral No. 7 A~rgusl 1988

CVAX-based Systems

System Interconnect, the XMI The XMl has 30 address bits, and the smallest l'hc XMI. the primary clcctric:il interconnect in adclressablc entity is a single byte. >. Block Diagram An Ovc.r~~ic.ulof the VAX 6200 Fumiljl of Systems

The XMI multiplexes data and address informa- Table 2 XMI Bandwidth Based on tion onto the 64-bit data path. Data transactions Transaction Size are initiated with a "command and address" cycle, followed by multiple data cycles. The max- Transaction Interconnect Size in Bandwidth imum length for an XMI transaction is 32 bytes of Bytes in MB/second data. The XMI cycle time is 64 ns. The effective bandwidth of thc XMI is a function of the data transfer size, as shown in Table 2. The XMI architecture allows for three distinct classes of devices. Processor Nodes Each processor node contains a CPU that exe- out conditions that may be used to detect and cutes instructions and manipulates data con- diagnose failures. tained in XMI memory. The processor node can execute any instruction set compatible with the VAX-style byte addressing and memory locking VAX Console mechanisms. A processor node will have a cache The VAX 6200 system implements the standard that must force all written data back to main VAX console functionality by means of software memory. Any cached processor module must also that conditionally executes on each of the KA62A monitor write traffic on the XMI and invalidate CPIJ modules. Each KA62A CPU module contains any location in its own cache that is written into a serial-line interface, 256MB of read-only mem- main memory. Processor nodes must also be ory (ROM), 32MB of electronically erasable ROM capable of responding to interrupt requests gen- (EEROM), and 512 bytes of RAM Control is erated either by other processors or by 1/0 passed to the console software upon any one of nodes. the following occurrences:

I/O Nodes System power-up 1/0 nodes generally respond to 1/0 space refer- ences either by mapping the data onto another Initialization bus or by interpreting data as a command. An 1/0 node can also become a commander on the Receipt of a control-P character from the con- XMI and access global XMI memory. 1/0 nodes sole terminal may generate interrupt sequences directed toward processor nodes. However, 1/0 nodes do Execution of the HALT instruction not respond to commands directed toward mem- ory space. Some severe error conditions

Memory Nodes Each KA62A CPU has access to console termi- Memory nodes act only as responders on the XMI. nal transmit-and-receive lines carried on the sys- They respond to read and write requests directed tem backplane. Upon power-up, control of the toward memory address space. These requests are system console terminal is dynamically allocated generated either by processor or 1/0 nodes. to one of the CPUs present in the system. This CPIJ, known as the "boot" processor, provides Data Integrity the system interface to the console terminal as The XMI contains a number of features to well as to the switches and lights located on the enhance the integrity and reliability of the system control panel. interconnect. First, all XMI information transfer On receiving commands from the console ter- lines are parity protected, and XMI command minal, the boot processor may run diagnostics or confirmation signals are ECC protected. The XMl boot an operating system. This processor commu- protocol is sufficiently robust to permit detection nicates with other processors by means of a struc- and recovery of all single-bit error conditions on ture maintained in memory known as the console these signals. Additionally, the XMI defines time communications area (CCA) .

Digitd Tecbnicd Journal No. 7 August 1988 CVAX-based I Systems I

Also considered as part of the console sub- system. a TK5O tapc drive is included in each VAX 6200 system. The tape drive is connected to the system by means of a TBK5O controller mod- LIIC loc:ited on a VAXBI 1/0 channel and is used for the following purposes: Saving all volatile parameters for the console subsystem Loading the VhY Diagnostic Supervisor (VDS) when no disk is av;iilable or functional in the system Distributing operating system and layered soft- ware 'l'hc TK5O tapc drive is also available under oper- ating system control as a general-purpose data interchange device Built-in Self-test I:xtensivc bui It-in self-test is used by all modules contained within the VAX 6200 systems. Upon power-up. all modules within the system, with the exception of the DWMBA, perform a self-test in parallel. After self-test is complete, the CPU modules cxatnine each other's status; the one in the lowest slot number that passed self-test is selected as the boot processor. The boot proces- sor then continues to execute an additional test to ensure rnemoqr accessibility and tinally exe- cutcs a test of the DWMBA. Physical Packaging All VAX 6200 systems are housed in the same cabinet, which is 78 cm (30.5 inches) wide by 154 cnl (60.5 inches) tall by 76 cm (30 inches) Figure 6 VAX 6240 System, Front Door deep. The cabinet contains one 14-slot XMI back- Removed plane, two 6-slot VAXBl backplanes, and all nec- essary powcr and cooling to sustain a wide range of configiir;~tions.Figure 6 shows a VAX 6240 The VAX 6200 systems all contain two 6-slot with the front door removed. VAXBI backplanes, which are configured as inde- 'l'he XMI is physically implemented in a pendent channels. The first slot of each VAXBI 14-slot backplane assembly containing ZIF mod- backplane is occupied by the DWMBA/B module, ule connectors, sign:~lterminating networks, and leaving 5 slots for standard VAXBI interfaces. All :I ccntralizcd clock and arbitration system. Mod- systems contain a DEBNK TK50 tape controller ules are loc;~tedon 2 cm (0.8 inch) centers. The and a DEBNA controller as standard XhI I backplane is supplied with + 5 volts 01 for equipment. The two VAXBI backplanes are sup- puieral logic, a separate + 5 V supply for mem- pliedwith +5V, ?12V, -5.2V,and -2V. ory, k 12 \I for the console terminal line drivers, iintl - 5.2 V/- 2 V for emitter-coupled logic Summary (1:CL). Prcscntly none of the VAX 6200 XMI The VAX 6200 family of systems merges the modules utilizes the ECL voltages, but ECL is CMOS VLSl VAX chip, which is used in a number included for potential future use. of Digital's products, with a very high perfor-

Digital TecbnicalJournal No. 7 Atcgrrst 1988 An Or~ervieul qf the VAX 6200 Fanzily of .(;?,stems mance memory and 1/0 subsystem. This hard- References ware, combined with the new fully symmetric I. P. Wade. "The VAXBI Bus - A Randomly multiprocessing capabilities of VMS version 5.0, Configurable Design," Digital Technical allows very high system throughput previously Journal (February 1987): 81 -87. achievable only with ECL technology. Moreover, the extensive use of CMOS technology results in 2. T. Fox. P. Gronowski. A. Jain, B. Leary, and physically smaller systems. These smaller sys- D. Miner, "The CVAX 78034 Chip, a 32-bit tems consume less power and are more reliable Second-generation VAX Microprocessor," due to the lower component count and lower Digital Technical Journal (August 1988. power consumption this issue): 95- 108.

18 Digital Technical Journal No. 7 August 1988 Brian R. Allison I

The A rcbitectural DeBnition Process of the VAX 6200 Family

Tbe architectural definition of Digital's VAX 6200family was governed by a twofold goal: to build a system with higher throughput than previous CMOS, Q-bus-basedsystems at a cost lower than ECL-based systems. Deci- sions made during the @nition process were infruenced byjrm schedule guidelines. Further, the very nature of the multiple processor system imposed its own requirements, particularly in the dejnition of the XMI bus. This new 64-bit-wideinterconnect is spectjcally designed to meet the memory and I/O needs of the symmetric multiprocessor system. Through- out the architectural dejinition process, engineers continually evaluated the interdependency of one design decision upon another and against the project and schedule goals. By this process, the total dejnition of the sys- tem - the XMI bus, the processor module, memory module, console sub- system, and packaging - was achieved.

Definition of thc VAX 6200 family of systems Once the decision to build a multiprocessor began in March 1985. The engineers' intent was was made, the next question was how many to design a follow-on prodi~ctto the VAX 8200/ processors to include. Several small computer 8300 family of systems. still in development at manufacturers were building 8- to 32-processor that time. This paper discusses the system archi- systems at the time. Our belief was that the mar- tectural definition process that took place during ket for systems with numerous processors was 1985 fairly sniall because few applications would run Like the VAX 8200/8.300 family before it, the effciently on these systems. Therefore, wc VAX 6200 family provides a system environment decided to design the VAX 6200 as a 4-processor for a VLSI VAX chip set. This new family of sys- system. with the possibility of expansion to tcms is a mid-range VAX implementation. In this 8 processors. This arrangement would allow us to context, a mid-range system is defined as a pro- sti ll co~~figi~recost-effective 1- to 2-processor sys- duct with more capability than the Q-bus-based tems. If we found a significant number of applica- systems and lcss capability than the emitter-cou- tions could bencfit from the larger number of pled logic (ECL) based systcms. processors, we could expand to 8 processors. Building an cflicient multiprocessor system Project Goals would necessitate optimization of both hardware The primary goal of the VAX 6200 program was and software functionality. The VMS asymmetric twofold: to build a system with greater system rnultiprocessing code (VMS versions 2 through throughput than the CMOS. Q-bus-based VAX sys- 4) that supported the VAX-1 1/782, VAX 8300, tems, and to cnsure system cost was lower than and VAX 8800 systems worked wcl l for compute- that of high-performance ECL-based systems. bound, dual -processor systcms. However, asym- Designers woulcl achieve this goal by designing ;i metric operating system software would not be system architccturc that allows a moderate nunl- acceptable for larger scale n~ultiprocessors.In ber of low cost CMOS VAX microprocessors to the existing VMS asymmetric multiprocessing share a common system environment. Such an design, most operating system code was exe- efficient multiprocessor system environment cuted on the processor designated as the "pri- would offcr higher throughput for a large num- mary" processor. Whenevcr ;I process needed to ber of applications and at a cost lower than a perform I/O or invoke most of the VMS system high-performance single processor. services, the process would have to be scheduled

Digital Technical Journal No. 7 Augusl 1988 The Architeclrrral Dc$nition Process of the VAX 6200 Furnily on the primary processor. The task of making To maintain predictable nlcmory access time, VMS more symmetric in its handling of 1/0 and we decided that the should not be VMS system senrices was undertaken to support run over 75 percent utilized. thcVAX 8840 and theVAX 6200 farnilies.' I)iscussion of how we chose to optimize the LJsing the worst-case anticipated bandwidth VAX 6200 hardware begins in the section 'I'hc needs, 80MB per sccond of peak bus bandwidth System Interconnect. would be required to support 8 processors. necause of the tig.ht schedule and our aware- Schedule ness of the significant amount of time needed to In March 1985 the design of the CVAX chips was design a new system bus, we first looked into thc already well under way. These chips would be feasibility of using an existing bus. We consid- delivered in rime to allow systems to ship in late ered but rejected the existing VAXBI bus, the 1987. Basctl on the CVAX chip set schedule, we primary interconnect for the VAX 8200/8300 established the fol lowing schedule for the devel- system, because of its limited 1.3.3MB per second opment of the VAX 6200 system: bandwidth. We also rejected the NMI bus, the VAX 8500/8700/8800 family interconnect, Six months of architectural definition because this bus uses ECL technology. At one Twelve n~onthsof design/simulation point we even considered using the SBI from the VAX-1 1/780 system with a 64-bit data path Three months to build and test approximately instead of its existing 32-bit data path. After five first-pass prototypes extensive analysis, however, we decided a new Six months to build approximately 70 second- system bus would have to be engineered for the pass prototypes product to meet its goals. Although we would have to define a new bus Three months for final testing and manufactur- for processor-to-memory communications, the ing introduction schedule did not allow us to design a full com- 'l'his two ;~nda half year schedule significantly plement of 1/0 interfaces for the new bus. Since a influenced thc definition of the system architec- large number of 1/0 interfaces would be avail- turc as well as the selection of implementation able on the VAXr131, the design team decided to technologies. (Actual implementation took three use the VAXBI as the interconnect to all 1/0 years. The design/simulation phase took three dcviccs. The new system interconnect, the XMI, months longcr than expccted, and thc first-pass would be used only to connect processors, mem- prototype phase took three months longer than ories. and VAXBI channel adapters. Therefore, expected.) in addition to the requirements listed above, the XMI architecture would also allow multiple The System Interconnect VAYBL channel adapters to optimize 1/0 through- The first order of business was to define a new sys- put where necessary for large systems. Use of the tem interconnect. This interconnect would have VAXRl for 1/0 adapters also had the positive the bandwidth required to support the memory effect of minimizing the number of electrical and 1/0 needs of the multiple processors. Vck interconnects to the XMI; thc physical length of outlined three requirements that would affect the the XMI would consequently be shorter and the design of the new system interconnect. total capacitancc lower. Further discussion of the channel adapters is presented in the section We estimated that each CVAX processor n~oiild VAXBI Channel Adapters. require between 3 megabytes (MB) and 6MU In June 1985 a team of 11 senior-level engi- per second of data to/from memory. This rate neers was assembled to produce the architectural would depend on thc clock rate of the pro- ant! clcctrical specification for the XMI bus and cessor, the selected cache architecture, and the VAX 6200 system. In addition to architectural the cache "hit" rate of the program being and clcctrical experts, this team included one executed. representative from each of the anticipated mod- We also estimated that each processor cot~ld ule clcsign teams. Almost all members had previ- require peaks of 1 MB to 1.5MB per second of ously worked on projects involving the VAXBl 1/0 bandwidth. bus. It was understood that the XMI would be

Digital Technical Journal NO. 7 August 1988 CVAX-based Systems

used solely for the VAX 0200 family of systems, chip woultl be required to supply thc logic to unlike the VAXBI. which would be used across implement the bus-level protocol. many different applications. A strict adherence to As the electrical design of the XMl progressed, this premise greatly helped the specification a bus cycle as fast as 64 ns appeared feasible. tcam to put technical trade-offs in perspecrivc. Although not entirely necessary to support the stated system performance goals, the faster XMI XMI Electrical Inter-ace Definition cycle time was strongly pursued to gain extra Since most of the VAX 6200 system is CMOS and margin in the system design. Furthermore, this transistor-transistor logic (TTL) based, we imme- fast cycle timc would allow the possibility of sys- di;~telydecicled the XMI could not be implc- tem upgrades to faster processors in the future. rncntccl in ECI.. lo maintain a TTL-level bus and Consequently, 64 ns became the stated goal for to achieve the desired bandwidth, the data path the XMl cycle timc; 80 ns was the fall back strat- clearly would h:we to be 64 bits wide. Further, to egy if thc design complexity of a 64-11s cycle meet our goal of 8OMB per second bandwidth. time began to place the overall project schedule thc XMI would have to transfer 64 bits of infor- at risk. mation evenr 80 nanoseconds (ns). (This transfer Logic design across the entire system was done rate assumes a protocol in which address and data assuming a 64-ns cycle time. Eventtially 64 ns arc multiplexed. and up to 32 bytes of data can became the actual speed of the bus as the CMOS be transferred per address cycle.) process was characterized and the first parts were Several electrical alternatives were considered sampled and found to contain sufficient margin to for the XMI. 11 scheme using the commercially support the hstcr cycle time. available components was seriously considered. However, we rejected this scheme XMI Protocol Dejinition because a large number of components would be XMI protocol definition took place in parallel necessary to implement the 64-bit data path. with the electrical definition of the bus. It was The lack of commercially available compo- clear from the start that the bus would cycle sev- nents to drive a 64-bit bus at the required speed eral times faster than the memory subsystem. 'This tinally led us to a decision. We would design a differencc in cycle times immediately led us to bit-sliced custom CMOS bus interface chip set. the decision that the XJMIwould run a "pended" Each chip would transceive I I lines, and seven bus protocol. A pended bus protocol allows con- chips would be i~scdfor the entire data path. trol of the XMI to be relinquished between Although the "sliccd" bus interface woultl use a "read" command and the return of the data from more nlodule real estate than a larger chip, thc the memory subsystem. With mill tiple processors sliced bus design greatly simplified the chip and multiple memory controllers. several read packaging problems. Each chip would fit into a commands could be outstanding at a timc. standard 44-pin cerquad package. A sliccd XMI To optimize data traffic on the XMI bus, intcrfi~cealso allows each chip to dissipate under we needed to define data transfer commands 0.5 watt (W), which enhances reliability and of several lengths. Since VAX instructions may rclievcs the need for heat sinks on the part. With- write as littlc as 1 byte of data, a 64-bit write out heat sinks. the XMI interface parts can be command was defined. (There is a mask field mounted on both sides of each n~odule.This associated with the write command that allows arrangement saves 50 percent of the real estate single bytes to be written.) Since the VAXBI necessary to inrcrfacc to the XMI. bus already had commands to transfer I6 bytes 'To simplify thc design of the full custom Xhll of data per address, it was essential to allow interface parts, we would keep the functional similar commands on the XMI bus to mini- requirements for the parts as simple as possible. mize the interface complexities to the VAXBI. The XMI interface chips have little knowledge of Eventually we added a 32-bye read command the XMI protocol and serve only as the electrical to allow processors to prefetch larger amounts interface. Due to the divergent needs of pro- of data upon cache misses. A 32-byte write cessor, memory, and I/O interfaces, designers command was not implemented, because it already knew that each module would need a would be too great a burden for the memory different VLSI chip for XMI interface functions. controller to buffer multiple 32-byte write We decided, therefore, that each module VLSl commands.

Digital Technical Joumul No. 7 Aug~rsl I988 The Architectural IIeJinition Process of the VAX GPO0 Family

In many cases the protocol of the XMI is simi- a module the same size as the existing VAXBI lar to th;~tof the VAXBI. In part this similarity module 20.32 crn (8.0 inches) by 23.33 cm resulted because the designers of the XMI were (9.187 inches). In addition. 32MB of memory very familiar with the VAXBI. The similarity could fit on the same size module. between protocols was also deli berately chosen System packaging was another factor to con- because it greatly reduced the complexity of sider in selecting the module size. From the very interfacing the XMI to the VAXBI for I/O pur- start of the 'VAX 6200 program, it was not clear poses. what type of sjrstem-levelpackaging was optimal. The bus arbitration scheme is one area where Designers knew, however, that the larger systems designers had to deviate from the method used would primarily be placed in computer-room by the VAXBI bus. The VAXBI uses the main environments. For these applications, a standard bus data path for arbitration, which requires 1 53.67-cm (60.5-inch) tall cabinet would be extra bus cycles. This approach was not feasible necessary. What was not clear was if office-type for the XMI pended protocol, since two arbi- packaging or rack-mount-type packaging would tr;ttions are necessary for each read transac- be required. Since VAXBl formfactor pedestal and tion. Further, the VAXBI arbitration scheme also rack-mount box packages were both available, requires ;I grcat deal of duplicated logic in evely designers found it very attractive to use the same module. Due to thc large number of allowable formfactor module for the XMI to ease thc devel- XMI nodes. it wa not feasible to implement opment of these packages if necessary. Bascd on an arbitration mechanism located on an XMI the functionality fit and the desire to potentially module. 7b implement arbitration on any XMI reuse existing packaging, we decided to adopt module would have required ;I great number of the VAXBI module size for the XMI. signal pins. The solution was to implement a cen- Another advantage to using the VAXBI module tralized arbiter The XMI uses ;I module physi- size was the opportunity to use the VAXBI zero c;illy atti~chedto the rear of the backpl;tne as a insertion force (ZIF) backplane connector. His- centralized arbiter as well as the source of the torically, developing new backplane connector master clock. technologies has proven difticult and time- The subject of data integrity on the XMI was of consuming. The VAXBI uses a five-segment, great concern to the designers. Initially carrying 60-pin-per-segment connector. Of the 300 pins, error-checking and correction (ECC) bits on the 120 pins are assigned to the VAXBI signals and bus was considered. However, this scheme was '1 80 pins to each module for 1/0 use. Since the rejected because additional encode/decode tim- XMI has 32 more data-path bits than the VAXBI, ing would have been required, and because addi- designers chose to allot an extra 60 pins for the tional bits would have to be carried on the bus. XMI signals. This leaves 120 pins for general Eventually a robust protocol was implemented module use. Designers believed the arrange- based on p;~ritydetection and hardware/software ment to be acceptable, since there are no 1/0 retries when errors ;Ire detected. All transient modules for the XMI. The only use for the 1/0 single-bit errors on the XMI arc recoverable. pins is to connect to the VAXBI card cages. The 120 available pins are more than adequate for XMI Physical DeJinition this function. The physical definition of the XMI was a difficult To meet the cycle time goals for the XMI bus, task. There were a grcat number of interdepen- the length of the XMI would have to be limited to (lent tr:idc-offs for module size, module spacing, bout 0.3 meters (1 2 inches) and the numbcr of number of backplane slots, and cabinet size loads limited to approximately 16. The XMI pro- To minimize design complexity, we had tocol assumed a maximum of 16 devices would decided at a very early stage that each modulc interface to the XMI bus. Eventually the number within the VAX 6200 system would implement a of slots in an XMI backplane became 14 for two single function. Thus the task of designing each different reasons. First. 14 slots would allow a module was simplified and the diagnosability of system to have 8 processors, four memory arrays, the system enhanced. Initially, the size of the and two VAXBI channels. Second, a 14-slot XMI XMJ module was largely governcd by the space backplane would be very similar in size to the needs of the processor. Analysis showed that a pair of (,-slot VAXBls that already existed in the processor based on the CVAX chip set could tit on VAXBI pedebtal and rack-mount box packages.

Digital Technical Journal No. 7 AugrrsI 1988 CVAX-based Systems

XMI module spacing of 2.03 cm (0.8 inches) alternatives were explored but ultimately is the same as that on the VAXBI bus. We chose rejected because of the immaturity of thcir CAD this spacing to allow for heat-sink components tools. on side 1 of the module. Enough height would Discussions with LSI Logic Corporation led us remain to allow non-heat-sink. surface-mounted to consider their newly developed 1.5-micron components on side 2 "Sea of Gates" array, which offers up to 50.000 About 18 months into the program, the niodu le routable gates. Although this array did not give us designs were complete. and both the pro- the mature technology we were seeking, it did cessor module and memory module were experi- appear to offer the flexibility needed by all XMI encing great difficulty during printed circuit designs. We ultimately chose the LSI Logic board layout. Although all components could LL10000 hrnily of gate arrays because all designs be placed within the area available, the very could use the same technology. Moreover, we high pin-count gate arrays in use (223 pins) could focus our CAD tool development on a were causing considerable routing problems. single technology. To lower the schedule risk to the program. The 64-bit-wide XMI data path forced the pin designers decided to lengthen the module by count of a single interface chip to be 200-plus 7.62 cm (3 inches). l'he impact to the computer- pins. The LSI Logic LLlOOOO array was offered room packaging was minimal because a 76-cm in a 223-pin pin-grid-array (PGA) package which (SO-inch) cabinet clcpth could accommodate appeared suitable. Although most of the logic the change. However, the change in modulc on each modulc was implemented in surface- length made impossible the adaptation of the mounted components, we did not pursue a existing VAXBI pedestal and rack-mount packag- 223-pin, surface-mount package. We wanted to ing to the XMI. At this time the pedestal-based avoid the manufacturing problems presented by strategy for the MicroVAX 3500/3600 systems components with 25-mil pitch leads. was clear, thus reducing the need to package the VAX 6200 family of systems for office use. meProcessor Module Further, extreniely low sales of rack-mounted The VAX 6200 processor module uses the CVAX VAY 8200/8300 systems led us to the decision chip set to implement the VAX instruction set. that a rack-mount package was not immediately Due to an uncertainty about the final CVAX chip necessary. speed. the CPU module was designed to operate over a range of 70 ns to 100 ns. The intcnt was to XMI Interfc~ceTechnology use "binned" parts in the VAX 6200 system. and The decision to implement the XMI electrical to use the "nominal" parts in the lMicroVAX interface in simple full custom CMOS parts dic- 3500/3600 systems. (Chip manufacturing pro- tated that each module have additional logic to cesses yield parts of different speeds; "binning" complete the XMI interface and to supply mod- refers to the proccss of testing the chips over a ule-specific logic. To simplify both the design of range of speeds.) For the CVAX chip set, the nom- the XMI interface parts and the CAD tools, we inal parts run at 90 ns, and the binned parts run decided that all module-to-XMI interfaces would at 80 ns. be implemented in the same technology. Given A major system-wide architectural issue, which the aggressive design schedule, we would need a primarily affected the processor module, was technology that was mature as well as easy to whether the cache should be write-back or write- design for. through. Although a write-back cache could We initially focused on a family of 2-micron potentially reduce the number of processor CMOS gate arrays available from LSI Logic and writes on the XMI by 50 percent, such a cache Toshiba. However, it quickly became clear that was complicated and had never before been array limitation of approximately 10,000 gates designed for a multiprocessor VAX system. Our would force us to place multiple chips on each final decision was based on the need to reduce module. The use of multiple chips was highly overall risk to the program. Therefore. we would undesirable from the perspective of design implement the more straightforward write- resources, module real estate, and cost. A search through cache design and build the extra band- was started to locate a suitable alternative. To get width into the XMI to handle the additional write the desired logic density, several semicustom traliic.

Digital Technical Journal 2 3 ,Vo. 7 AU~NSI1388 The Archilecturnl Dejinilion Process of the W4X 6200 Family

Once the decision to implement a write- could not be broadcast into each of the CVAX through caclie was made, the major architectural chips. The possibility of maintaining a duplicate issue for the processor module became the cache external tag store proved to be very difficult to organization. The CVAX chip contains an internal implement. Consequently, we chose the alterna- 1 -ki lobytc (KB), two-way set associative cachc tive to store only I-stream data within the internal accessible to the internal micro engine in one CVAX cache. cycle. Due to the long latency to main memory, a A similar problem was knowing when to invali- second-level cache on the processor module WAS date data in the external cache. In this case it was imperative. The size of the second-level c;iche feasible to implement a duplicate tag store. The was determ~nedby the available static ranclom- second-level cache has two tag stores. One is access memory (SWV) technology. The newly located on the CDAL and is used for cache look- available high-speed 64K-by-4 SRAMs would up by the CVAX chip. The second tag store is provicle ;I 256KB cache with only eight parts. located within the XMI interface and is used to Although no accurate simulation was av;~ilahlcto determine if XMI writes hit the second-level indicate the effect of this large cache, the effects cache. When hits are detected, a request is were ;~ssumed to be positive. Therefore we queued to invalidate the entry within the second- decided that the higher cost of the SRAMs was a level cache. worthwhile trade-off given the potential gains in Another problem to be solved on the processor systeln performance. module was the issue of combining writes into A third major issue relative to the caches on the larger blocks before issuing them to the XMI. processor moclule was the invalidation scheme. Since the CDAL data path is only 32 bits wide, the In the past. V/U( processors have managed cachc CVAX chip is incapable of generating a write invalidation, since processors and 1/0 devices cornrnancl any larger than 32 bits. The 64-bit data have alwaj8s shared a conimon memory subsys- path of the XMI would need larger writes to oper- tem. 'l'he issue of caclie invalidation became ate efticiently. The solution to this problem was much more important to our program bec;iuse to implement a "write buffer" in the XMI inter- of the multiprocessor nature of the VAX 6200 face of the processor module. The write buffer system. This type of system could cause large takes advantage of the fact that writes generated amounts of stale data as ;I process migrates from by VAX processors are often sequential. The processor to processor. write buffcr will buffer up to four sequential The .I KO cache contained within the CVkY 32-bit writes and combine them into a single chip c;ii~secl the largest problem. If it were XMI write transaction.' allowed to caclie data that could become stale. every write to memory would potentially have Tbe Memory Module to be inv:ilidated wirhin the CVAX cache. The systeln design goal was to provide the capac- This mc:int choosing one of two approaches: ity for 15MB to 30MB of memory per processor. (1) broadcasting even8write in the systern onto As mentioned earlier, the module size was par- the CDAL bus of every processor, or (2) finding ;I tially governed by the need for 32MB of memory way to maintain a duplicate tag store of tags per memory module. .The number of slots in the within the CVAX chip and only passing writes XMI backplane was also partially determined by known to reference cached data within the CW the desired amount of system memory. onto the COAL. Another alternative was to cache The wide range of possible VAX 6200 con- only instruction-stream (]-stream) data wirhin figurations dictated the need for an expandable the internal cache. This alleviates the need to memory subsystem. Since full memory band- invalid:itc, because I-stre;im data is defined to be width would only be necessary for very large read-only by the VAX architecture. We projected configurations, it was decided to adopt a dis- this altcrn;~tivecould cause a 3 to 5 percent tributed niemory architecture. An individual dcgr~d;irionin CPU performance. memory controller could be made simpler if it Analysis of the cache-invalidate problem did not have to supply fill l XMI bandwidth. Full proved \,enr difficult. bec;iuse we did not know XMI bandwidth could be achieved by inter- what percent;ige of data would be shared in this leaving multiple memory controllers. class of m~~ltiproccssorsystem. With the poten- With the module size and number of slots tial for 8 processors, it was clear that all writes determined, the first architectural decision to be

24 Digital Technical Journal iVo 7 August 1988 CVAX-based I Systems

madc for the memory was internal organization of signal on the XMI that would inhibit all new the memory. The 64-bit width of the XMI made it commands from being issued on the XMI. Unfor- desirable to have a 64-bit data path internal to tun;~tel),,due to the pipeliricd natilrc of the pro- the memory module. The very tight module real cessor and tbe memory array, three additional estate made it very attractive to consider imple- commands could possibly be received by the menting a 64-bit data path to reduce the number rncmory controller after it had determined the of required ECC check bits. A 04-bit-wide data nccd to stop additional recluests. Since the depth path was also attractive given that the processor of the command queue was four. the memory module would issue a read for 32 bytes whenever array would need to "stall" the bus after receiv- there W;IS ;I cache miss. ing onl!? a single command. Sincc this effectively The negative side of a 64-bit internal memory eliminated the command queue, we decided to organization was that any write of less than lengtlicn the depth of the comrnantl queue to 04 bits in width woulcl result in 21 read-modify- cight cntrics. writc operation to calculate the proper ECC 'I'hc VAX architecture forces the use of a hard- codc. An analysis of the expected write traftic ware-based memory lock to control access to through t he processor's writc buffer showed that shared data structures. The memory lock is used approximately 50 percent of all writes would be by some intelligent 1/0 adiiptcrs as well as a ftull 64 bits in width. Further analysis showed processors. that as long as there was at least one memory System performance suffers whcn there is controller for every 2 processors, there would contlicr over different lock variables that acquire be sufficient memory bandwidth for the system. ;I common hardware lock. Given that Digital had Given the performance characteristics of the never built a fully symmetric multiprocessor sys- CVAX processor, it seemed reasonable to require tcm and that major changes wcrc being made to a 32MB memory array for every 2 processors. We VMS, we did not know what tlic lock traftic pat- thcreforc decided to implement the 64-bit mem- tern would look like in a large system. We did ory internal organization. know, however, that the existing VAXBI 1/0 Sincc it was very difficult to tlcsign the mernory ad;~ptersand the CVAX processor coi~ldnot hold modulc to accommodate the full b;~ndwidthof morc than a single hardware lock at one time. thc XMI, designers used mcmon interleaving to Based on this. we designed the memory con- provide an aggregate mcrnory bandwidth com- troller to have up to 16 locket1 locations. This p;itiblc with the speed of the XMl bus. The inter- number seemed morc than adequate given a leave size of 32 bytes was determined by maximum of 8 processors and only three existing the protocol of the XMI, which allows reads of VAXDI 1/0 adapters that i~scrncmory locks 32 bytes per address cycle. (Ethernet, C1, and TK50). 'lhe granularity of 'l'he multiprocessing design of the systcm each lock is 32 bytes to simplify the memory con- madc it possible for a single memory controller trollers' handling of 32-byte read requests. Lock to be the object of several simultaneous requests. congestion is still possible if niultiplc lock vari- 'I'o avoid rejection of processor traffic. we ables arc allocated within the same 32-byte designed the memory controller with an input region of memory. An examination of VMS codc queue. This queue accepts memory access shows lock congestion to be very r;lrc rcclilcsts and services them in a first-in, first-out (FIFO) order. VAXRI Channel Adapter Initially the memon controller was designed We had decided right from the start to design an with ;I four-command qucue that would rcject XM I-to-VAXBI channel adaprer to handle all I/O. ncw requests once the queue was full. As the To mcct the desired maximum 1/0 rates of dcsign progressed, we realized that with our XMI 1 ~Mllto I .5MB per second for each processor, we ;~rbitrationscheme, a processor or VAXBI channel woultl jnclude multiple Xitll-to-VAXBI adapters. ;~daptcrmight possibly be denied memory access. Althoi~gh two VAXBI channels would allow A processor or channcl adapter might be denied 1.5MU per second per processor. it was decided access for indeterminate periods of time if the to ;11low up to eight VAXBl channels to be con- rncmory array was allowed to reject commands ncctccl to the VAX 6200 system. 'l'hc dcsign was whcn its qucue became full. 'li)avoid this prob- not madc more complcx by the change from two Icni, the memory array was allowed to assert a to eight VAXRl channel adapters.

Digital Tecbnical Journal No. 7 August I988 The Architectural Definition Process of the VAX 6200 Family

Designers wanted to optimize data transfer reconfiguration, all processors would have to be from the VAXBI into XMI memory, since statisti- allowed access to the physical console terminal cally more data is read from 1/0 devices than is as well as the physical front control panel of the written. A double-buffered system. This access is accomplished by busing (D1M.A) data path from the VAXBI to the XMI the signals that interface to the console terminal allows transfers at the full 13.3MB per second and front control panel across the XMI. However, VAXBI data rate. we needed a mechanism to ensure that only one For reads of XMI memory, it was known that of the processors would actually respond to the full bandwidth could not be maintained due to console terminal and front control panel. This the memory read latency through the VAXBI mechanism is a protocol whereby the processor channel adapter and the XMI memory subsystem. in the lowest XMI slot that passes self-test Since most 1/0 transfers are sequential, we con- assumes control of these external resources. The sidered building prefetch buffers into the VAXBI processor that takes control of the console termi- channel adapter. Because transfers could be in nal is known as the "primary processor." 'The pri- progress to several VAXBI nodes at once, multiple mary processor communicates with all other pro- prefetch buffcrs would be needed. Since prefctch cessors by means of a message passing protocol buffers can architecturally be considered to be through system memory. small caches, the VAXBI channel adapter would It is necessary for the console subsystem to also have to monitor all XMI traffic for potential have access to a mass storage device. Such access invalidate conditions. Eventually the need for is needed for distribution of software and for large amounts of buffer storage and the compli- loading of diagnostics. The TK50 was selected cation of XMI monitoring decided us against because of its high density and the availability of building prefetch buffers. This decision was a preexisting VAXBI interface (the DEBNK). The influenced by other factors as well. No single TK70 was not used because there was no VAXBI existing VAXBJ 1/0 adapter could read at full interface to it. The only other alternative was the bandwidth, and multiple 1/0 devices could be RX50, which has superior access time but a data spread across several VAXBI channels to achieve a capacity of only 400KB. The longer time to run higher aggregate XMI read bandwidth. diagnostics from the TK50 was unimportant To ease physical implementation, the VAXBI since the system can be diagnosed largely by channel adapter was implemented on two mod- means of diagnostics contained in CPU read-only ules that were interconnected by four 60-pin memory (ROIM) and by the built-in self-test con- cables between their 1/0 pins. Unlike the tained in all VAXBI I/O adapters. Further, the VAX 8500/8700/8800 VAXBI channel adapter, TK50 makes an excellent software distribution the VAX 6200 could not use a single XMI module device and allows VAX 6200 systems to be to connect to multiple VAXBI buses. The 6200 is configured without nine-track magtape drives. restricted by the 120 1/0 pins available on an In previous systems dependent on ROM-based XMI module. console programs and ROM-based diagnostics, code updates have been a problem. To alleviate Console Subsystem the need to physically change the ROM, each The console function in a VAX 6200 system is VAX 6200 processor contains a 32KB clectroni- performed by code run on the CVAX CPUs. This cally erasable ROiM (EEROM). Most console and use of the main CPU-based console contrasts with diagnostic code is accessed by means of an thc more traditional use of a dedicated front-end address table contained in the EEROM. In the processor, which has access to all system event that a code bug needs to be corrected, the resources. We chose to use a main CPU-based address table is rewritten to point to a replace- console primarily because we had no way to ment routine that is also written into the EEROM. externally access the internal state of the CVAX The console program implements a routine that processor. Furthermore, we did not want to add can patch the EEROM image from a database the cost of a dedicated console processor. distributed on TK50 tape. A side benefit to a system design that employs multiple processors, rnemories and 1/0 adapters, Power and Packaging is the opportunity to design in extra availability As noted earlier, we projected that the VAX 6200 by reconfiguring the system in the event of a sin- system would be used as a large system generally gle component failure. To accommodatc for located in computer-room environments. An

Digital Tecbnicd Journal No. 7 August 1988 CVAX-based Systems

important goal was to dcsign for a high degree of tion during the design phase. Careful balancing flexibility and configurability in the system. The of technical complexity with the necessary decision to use a 14-slot XMI backplane had been minimum functionality yielded an architecture based on desired maximum configurations and that could be implemented with a manageable the size of existing pedestal and rack-mount amount of risk in a bounded amount of time. packages. In addition to housing the XMI backplane, the Acknowledgments computer-room package would need to house The initial VAX 6200 system definition and archi- VAXBI backplanes to accommodate 1/0 adapters. tectural group was made up of the following peo- The VAXBI backplane is manufactured in cas- ple: Brian Allison, Charlie Barker, Frank Bomba, cadeable 6-slot segments. It seemed that two Darrel Donaldson, Rick Gillett, Dave Hartwell, 6-slot VAXBI backplanes would provide adequate Dave Ives, Jim Stegeman, Pat Sullivan. Mike 1/0 adapters for most systems. To ensure that all Uhler, and Doug Williams. customers' 1/0 requirements could be met, a design was also initiated for a VAXBI expander References 6-slot cabinet that could house four additional 1. R. Gamache and K. Morse , "VMS Symmetric VAXBI backplanes. Multiprocessing," Digital Technical Jour- To avoid developing a new power subsystem, nal (August 1988, this issue): 57-63. we looked into modifying an existing power sys- tem. Although we could find no preexisting per- 2. R. Gillett, "Interfacing a VAX Microproces- fect match, we did locate a previously designed sor to a High-speed Multiprocessing Bus," 5 volt (V) regulator specified at 100-ampere (A) Digital Technical Journal (August 1988, output. We respecified this design to 120 A by this issue): 28-46. using slightly higher power components. The VAXBI requires f 12 V, - 5.2 V, and - 2 V in addition to the main + 5 V channel. To accommo- date the VAXBI requirements, a new regulator was designed. The XMI backplane is supplied with two of the 5 V regulators (one for main logic and one for memory). Although not required by any current designs, one of the k 12 V, -5.2 V, and -2 V regulators also supplies the XMI for potential future designs. The two VAXBI back- planes are supplied by one +5 V regulator and one of the k 12 V, - 5.2 V, - 2 Vregulators. Conclusion The design of a complex system like the VAX 6200 is much more than making well- informed engineering decisions based on hard data. Engineers based the initial system definition on their perceptions of the needs for future com- puting systems. The definition was further shaped by what was technically feasible with a defined degree of risk. Throughout the architec- tural specification phase, many trade-offs were made with only partial data and the intuitive insight of very experienced engineers. Thc design process for the VAX 6200 system was extremely smooth, and the product was designed within six months of the initial engi- neering goals. Due to the large degree of built-in configuration flexibility, the product definition never changed enough to force a change in direc-

Digital Tecbnical Journal 2 7 No. 7 August 1988 Richard B. Gillett,Jr. I

Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus

Tbe design decisions involved in interfacing a microprocessor (CVAX) to a bigb-speed, sbared-memory multiprocessing bus (XMI) are more complex tban tbme encountered in designing a single-processor sys- tem. Altbougb tbe same basic interface architectures are used, the sign&- cantly dzrerent multiprocessing environment requires a much more complex implementation. In particular, the performance of a multiproces- sor system is very dependent on the eflciency of its main memory inter- face. To achieve the desired system performance, appropriate compro- mises between design complexity and performance must be made. In tbe case of tbe VAX 6200 system, performance simulations made early in tbe project guided tbe complexity/performance trade-08s. Actual system performance results bave largely confirmed tbe validity of tbe design trade-om.

The primary goal of the VAX 6200 design was to ing system.) We expected its availability could provide a general-purpose, high-performance, be the critical path to product shipment. mid-range VAX computing system. Further, this The operating system software would probably system design would take advantage of Digital's not stabilize in time for us to discover and fix proprietary CMOS technology and VMS version any major hardware problems and still stay on 5.0 symmetric multiprocessing capabilities. VMS our original schedule. Unfortunately, until the version 5.0 has dramatically changed the way we operating system stabilizes, testing for complex approach mid-range system design; no longer do bugs is difficult. This concern about complexity we design a system to support just one or two relative to the schedule affected several design processors. With the ability to effectively utilize decisions. the power of four or more processors within the On the VAX 6200 CPU module, the design same system came the need to design signifi- challenge was to interface a custom CMOS VAX cantly higher performance interconnects to tie microprocessor (called CVAX) to a high-speed these processors together. multiprocessor bus (called XMI) . The trade-offs The VAX 6200 was to be Digital's first CMOS made during the design of a multiprocessor sys- multiprocessor system. The designers were there- tem are more complex than those made in fore strongly motivated to provide the best per- designing a single-processor system. For a single- forming product they could within reasonable processor system, the performance trade-offs are time and complexity constraints. Complexity was relatively straightforward. The goal is to design of particular concern since the product schedule the highest performance single-processor system did not allow for the production of second-pass that is practically possible. For a multiprocessor parts prior to the first shipment to customers. system, the goal of maximum single-processor Complex multiprocessor interfaces give ample performance must be tempered to obtain maxi- opportunities for the kinds of elusive design bugs mum system throughput (i.e., multiprocessor that can be very difficult and time-consuming performance). to exercise and diagnose. In addition, unlike The foundation of the CPU interface is the other recent VAX systems, the VAX 6200 system cache subsystem, which reduces the effective required a major new release of VMS. (In many read access time to main memory. By reducing ways thc new release represented a new operat- the processor's need to access main memory, a

Digital TecbnfcalJournal No. 7 August 1388 CVAX-based Systems

XMl BUS

I TO VAXBI N _ I TO VAXBl 0

Figure 1 VAX 6200 System Architecture cache improves both single-processor and multi- Summary of VAX 6200 System Architecture processor system performance. This paper dis- The basic architecture of the VAX 6200 sys- cusses the complexities involved in choosing the tem shown in Figure 1 is no different from optirnal cache design and the simulation tech- architectures used on recent VAX systems.' The niques used to ensure informed design decisions. architecture most closely resembles that of the One of the biggest problems in cache design is VAX 8800 series. Processors and memories reside choosing the correct set of workloads to charac- on a single, high-speed interconnect called terize the cache performance. Cache perfor- the XMI bus. All memory is shared and equally mance can vary tremendously with different accessible by all processors. Adapters to the workloads. Therefore, we chose a set of work- VAXBI bus also attach to the XMI. 1/0 devices, loads that spanned a wide range of system activi- in turn, are attached to the VAXBI buses. The ties. Toward the end of this paper, we present XMI supports a total of 14 slots, which can actual cache performance results that largely be populated with modules to provide a wide confirm the legitimacy of our approach. range of system configurations. These con- We also examine one of the more complex figurations can range from small single-processor aspects of multiprocessor designs, which is systems with 32 megabytes (MB) of memory ensuring cache coherency across the entire sys- and a single 1/0 channel to a large multipro- tem. Cache coherency refers to the maintenance cessor system with 256MB of memory and multi- of a sufticiently consistent memory state from the ple 1/0 channels. One of the primary system perspective of all processors and 1/0 devices design goals was to support up to eight pro- within the system. cessors with very good multiprocessor perfor- The designers also went to great lengths to mance. This goal guided the performance ensurc maximum system reliability. As part of decisions concerning the bus, memory, and pro- this effort, we generated a set of error-detection cessor designs. and response rules. These rules ensure that the The heart of the system, the XMI bus, is largely operating system software can easily recover a hybrid of the VAX 8800 NMI and VAXBI buses. from almost all transient cache or bus failures. The XMI is a synchronous bus that runs with a Thesc rules are discussed. 64-nanosecond (ns) cycle time. The data path is The following section is an overview of the 64 bits wide, and the maximum transfer rate is VAX 6200 system architecture. It provides a basis 100MB per second. The protocol supports for the subsequent discussions on the challenges "pended reads" (as does the SBI on the of multiprocessor design, the VAX 6200 CPU VAX- 1 1/780 system and the NMI on the 8800). responses to those challenges, the performance In a pended read transaction, the CPU that wishes simulation environment, cache coherency and to read a location requests use of the bus. When error handling, and finally, real performance the request is granted, the CPU transmits the results. address of the desired location. The appropriate

Digital Technical Journal No. 7 August 1988 Interfiicin~N VAX Microprocessor to a High-speed L14~iltiprocessingBus

PENDED PROTOCOL (EXAMPLES NMI. SBI, XMI)

CYCLE 1 I CYCLE N CPU TRANSMITS CYCLES 2 TO N-l MEMORY RETURNS READ ADDRESS ARE AVAILABLE FOR READ DATA TO ON BUS OTHER DEVICES CPU

NONPENDED PROTOCOL (EXAMPLES: VAXBI. Q-BUS)

I CYCLE 1 I CYCLE N CPU TRANSMITS CYCLES 2 TO N-l MEMORY RETURNS READ ADDRESS ARE BUS STALLS READ DATA TO ON BUS (NO OTHER ACTIVITY CPU

Figure 2 Pe1zclt.d 11ersl.i~Noripewded Protocols

memory controller latches the address into an Challenges of Multiprocessor Design input queue and begins a read access to tlic The major challenges faced by the multiproces- specif etl location. In the n~cantimc,bus owncr- sor system designer result primarily from one ship is relinquished by the CPU. and the bus simple system characteristic. The intimate inter- may be used by other devices. When the memon face between processor and memory that most has completed the look-up and has the data. it single-processor systems enjoy must be broken, makes a request for the bus. When granted the and main memory must be shared among a large bus, the Incrnory drives the req~~esteddata on numbcr of devices. This sharing has several the huh, which is latchet1 by the CPU that effects: originally rccluested the data. Pended protocols are contrasted with nonpended protocols in Main memory access time is significantly Figure 2. incrcased . Pended protocols are a big advantage when the Bandwidth to main memory becomes a pre- bus cycle time is significantly less than the mcm- cious commodity that determines overall sys- ory access time. As a c;~scin point, a memory tem performance. read on the XMI bus requires about 500 ns (roughllr 8 XMI cycles). without a pended proto- Complexity results from increased bus traffic col. thcsc 8 cycles on reads would result in and parallel activities. wasteful bus stalls. Another advantage of pcntlcd protocols is that they allow multiple mcmor) In the following sections, we expand on each of controllcrs to be used to advantage. In the case of these cffccts in relation to the VAX 6200 system. the VAX 6200, it was not pr;~cticalto build ;I sin- gle memory controller that could keep up with ;I Increased Main Memory Access Time saturated XMI bus. But it was relati~el!~eas!' to In a single-processor system, main memory is construct a memory controller that could com- generally closely coupled to the CPU. An exam- fortably run at about one third the bus maximum. ple of this closel!. coupled architecture is shown With four interleavecl mcrnory controllers on the in Figure 3 Clearly, this architecture provides XMI, memory controller bandwidth is greater the potential for low-latcncy and high-bandwidth than XML bandwidth. CPU-to-memorytransactions.

Digital Technical Journal No 7 Aug~dsl 1988 CVAX-based Systems

sentcd bclow in the section on the multiproces- sor environment. MEMORY CPU --dl Table 2 reinterprets the data in Table 1 in terms of bandwidth instead of latency. For exam- ple, the MicroVAX 3600 system fetches 8 bytes 110BUS of data from memory on a cache miss. which requires 8 (90 ns) processor cycles. or 720 ns. This corresponds to 8 bytes of data every 720 ns, 110 110 or 1 1.1MB per second. In comparison, the DEVICE DEVICE VAX 6200 system fetches 32 bytes of data on a cache miss, which corresponds to 16 7MB per Figwe J Typical Single-Processor second (32 bytes of data every 1920 ns). Architecture

Table 1 Comparison of MicroVAX 3600 In the VAX 6200 multiprocessor system, mem- and VAX 6200 Memory Latency ory must be shared by several devices and there- fore cannot be closely coupled to a single proces- VAX 3600 VAX 6200 sor. The result is a significant increase in main Cache 1 memory access tjnic. Since the MicroVAX 3600 (CVAX internal cache) 1 (90 ns) 1 (80 ns) and VAX 6200 systems are both CVAX-based. a Cache 2 comparison of the main memory access times for (Second-level cache) 2 (180 ns) 2 (160 ns) thc two systems illustrates this point. Table 1 Main Memory shows the access time in processor cycles for First longword 5 (450 ns) 14 (1120 ns) the two-level cache subsystem and the main Second longword 8 (720 ns) 15 (1 200 ns) memory Third longwo;d na 19 (1520 ns) Table 1 shows that the VAX 6200 takes three Fourth longword na 20 (1 600 ns) times as many processor cycles to access the first Fifth longword na 21 (1680 ns) longword in memory as does the MicroVAX 3600 Sixth longword na 22 (1760 ns) system. The main reason for this difference is that Seventh longword na 23 (1840 ns) the MicroVAX 3600 memory controller actually Eighth longword na 24 (1920 ns) resides on the CPlJ module. Therefore, the sys- tcm architecture is optimized to provide mini- mum access timc for processor accesses to main memory. On the VAX 6200, system memory is a Table 2 Comparison of Processor Read shared resource equally accessible by all CPUs Bandwidths on MicroVAX 3600 and and 1/0 devices. The price of this equality is VAX 6200 Systems (in MB per second) increased latency on all memory references. Note however that although latency has increased, the MicroVAX 3600 VAX 6200 VAX 6200 can support almost ten times more Cache 1 memory bandwidth (the time required per unit (CVAX internal cache) 40.0 50.0 of data transferred). Cache 2 As will be later presented, the VAX 6200 sys- (Second-level cache) 20.0 25.0 tem uses memory bandwidth to compensate for Main Memory increased memory latency. Trading bandwidth First longword 8.8 3.6 for latency is one of the fundamental tools of the Second longword 11.1 6.7 multiprocessor designer. Cache memory systems Third longword na 7.9 essentially convert increased memory bandwidth Fourth longword na 10.0 (manifested as a larger fill size) into lower aver- Fifth longword na 11.9 age read latency (due to the decreased miss rate in the cache resulting from the larger fill Sixth longword n a 13.6 Seventh longword na 15.2 size). This explanation is an oversimplification; details of the trade-offs in cache design are pre- Eighth longword na 16.7

Digital Tecbnical Journal 3 1 No. 7 Augtrst 1388 Interfkcing a VAX Microprocessor to n High-speed Multiprocessing Bus

Limited Bandwidth to Main Menzory of accesses that will never result in data being In a single-processor system such as the returned to the processor. (The second-level MicroVAX 3600, the performance is generally cache will probably "hit" on more than 80 per- limited by the CPU itself and not by the main cent of these references.) This behavior is desir- memory subsystem. The oppositc is generally the able for many single-processor systems but case on large multiprocessor systems where a would be inappropriate for a multiprocessor large number of processors can create a bottle- design in whlch main memory bandwidth is neck to the main memory subsystem. A major precious. goal of the multiprocessor designer is to mini- In the multiprocessor system, main memory mize the bandwidth required to support a given bandwidth is shared by all processors and 1/0 level of CPU performance. In that way, the main devices. Table 3 compares the system bandwidth memory bus can support more processors; in the MicroVAX 3600 and VAX 6200 systems. therefore, the system can attain higher total Since the VAX 6200 uses a pended bus that sup- throughput. For example, assume a processor ports 1 to 8 memory controllers, we present two recluires an average of 20 percent of the total sets of bandwidth numbers for the VAX 6200 bandwidth available to main memory to run a memory subsystem: one for a single memory con- given workload. Based just on bus bandwidth troller and another for a four-way interleaved, considerations, the total systcm performance four-memory controller subsystem. would not exceed five times the single-processor The data makes a strong argument for large performance if thc system is simultaneously run- transfer sizes to achieve high bandwidths on the ning that workload on all processors. For a num- VAX 6200. A large cache fill size can be used to ber of reasons, systems are rarely designed such assure high read bandwidth, and a write buffer that the bus must be saturated to meet its perfor- can bc used to provide longer length write trans- mance goals. This same method of calculating actions. Note that longword writes are particu- performance can be used to estimate perfor- larly inefficient in the memory controller; nine mance at some lower level of bus utilization. A cycles are required for a longword write com- bus utilization level of 75 percent is often used; pared with only five cycles for a quadword write. in that case, the system performance would be This inefficiency results from the implementation limited to 3.75 times the singlc-processor system. of the error-correcting code (ECC) across a quad- This example reveals one of the main com- word on the VAX 6200 memory. (VAX systems promises multiprocessor system designers must have traditionally implemented ECC across a make: increased bandwidth, which would reduce longword .) This implementation improved the the main memory access time seen by a single memory module capacity at the cost of forcing processor, is traded off to reduce the total band- all longword writes to be a read-modify-write width consumed by a single processor and sequence in the memory. thereby increase total system throughput. Band- width is really not the characteristic we are trying Increased System Bus Traflc to minimize; the real goal is to reduce the Another challenge to the multiprocessor designer number of bus and memory cycles used to sus- is the increased memory traffic in the system due tain a given level of performance. As we will to the increased total system performance. For a demonstrate, the efficiency of the transfer gen- given workload, it is fairly accurate to assume crally increases as the transfer size increases. that the traffic to main memory increases linearly Therefore the system can fetch twice as much with the total performance system. Therefore, a data from memory without using twice as VLY 6240 (a four-processor 6200 system) would many bus and memory cycles. This characteristic have roughly four times the main memory traffic is important when evaluating various cachc of the VAX 6210 (a single-processor 6200 sys- alternatives. tem). Since processors must monitor main mem- Again looking at the MicroVAX 3600 design, ory traffic to maintain cache coherency, this the CPU actually starts accessing main memory increase in main memory traffic has to be consid- once the first-level cache has determined a miss ered when looking at cache invalidate implemen- occurred but before the look-up in the second- tations. Again the single-processor system has a lcvel cachc has completed. This overlap means much less severe problem. The single processor the memory controller will start a large number has to monitor only the traffic from 1/0 devices,

Digital Technical Journal No. 7 August 1988 CVAX-based Systems

Table 3 MicroVAX 3600 and VAX 6200 Main Memory Bandwidth (in MB per second with corresponding number of cycles in parentheses) MicroVAX VAX 6200 VAX 6200 3600' XMl Bus Memory (90-11s cycles) (64-ns cycles) (64-11s cycles)

Reads 1 Memory 4 Memories Longword (48) 8.8 (5) 31.2 (2) 10.4 (6) 41.6 Quadword (8B) 11.1 (8) 62.2 (2) 20.8 (6) 83.2 Octaword (16B) na 83.3 (3) 31.2 (8) 124.8 Hexword (32B) na 100.0 (5) 38.5 (13) 154.0

Writes Longword (48) Full Masked Quadword (8B) Full Masked Octaword (16B) Full Masked

These numbers represent a CPU perspective. I/O dev~ceson the Q-bus can use longer transfer lengths.

which typically generate about one-tenth the VM6200 CPU Design Alternatives traflic generated by a single CPU. Extending this This section presents an overview of the argument, it appears to indicate that a VAX 6240 VAX 6200 CPU architecture, followed by a dis- system must handle invalidate look-ups at a rate cussion of the various implementation alterna- morc than 30 times that of thc MicroVAX 3600 tives that we considered during the design pro- system. (The VAX 6200 CPU has to handle invali- cess. We conclude with a list of specific design dates from three other CPUs and for about four alternatives and a discussion of our performance times as much 1/0 traffic.) simulation environment, which we used to exam- The increased systcm bus traffic is a symptom ine these alternatives. of the large number of parallel activities that characterize a ~nultiprocessorsystem. The abun- dance of queues in a multiprocessor system results in a more complex system. The section on Table 4 Summary of Differences cache coherency in this paper discusses several between Single-processor manifestations of this increased complexity. and Multiprocessor Systems Table 4 summarizes the major differences between the single-processor and milltiprocessor Single- Multi- processor processor systems. Characteristic System System This discussion has dcrnonstrated that the per- formance of a n~ultiprocessorsystem is very Memory latency Low Medium dependent on the designers making the right Performance CPU Memory decisions about the CP1J interface. [n the next bottleneck bandwidth section, we discuss the basic architecture of the Invalidate rate Low High VhX 6200 CPU and specific aspects of the multi- Level of parallel Low High activity processing environment.

Digital Technical Journal No. 7 August I988 Interfacing u VAX Microprocessor to a High-speed Multiprocessing Bus

SECOND-LEVEL WITH 1KB CACHE SECOND-LEVEL 256KB TAG STORE f C

4 + C CVAX CDAL + I t +

DUPLICATE INTERFACE STORE GATE ARRAY EEPROM

CONSOLE SERIAL LINE

XCI t t

t t XMI

Figure 4 VAX 6200 CPU Block Diagram

The VAX 6200 CPU is a single-board VAX pro- clock source, and the XMI bus has a single clock cessor based on the CVAX chip set designed and source that provides synchronous clock signals to built by Digital. The 6200 CPU has a CVAX cycle all XMI nodes. time of 80 ns (as compared to the MicroVAX The XMI corner represents a standard set of 3600 90-ns CVAX cycle time); its nominal per- interface components and a physical intercon- formance is 2.8 times the VAX-11/780 system nect that ensure all XMI devices meet the timing (slightly more than three times the MicroVAX 11). and electrical characteristics required by the XMI A block diagram of the module is shown in specification. The XMI corner components inter- Figure 4.Threc major buses are associated with face to the rest of the logic on the module over the module. The CVAX processor chip set com- the XMI chip interconnect (XCI). A duplicate tag municates over the CVAX data and address bus store also attaches to the XCI bus. (CDAL).2'3The SSC chip connects to the CDAL As outlined in the previous section, several bus and provides such functions as read-only specific challenges must be addressed by the memory (ROM) address decoding, time-of-year multiprocessor designer. At the CPU level the clock support, and console terminal interfa~e.~ design responses are as follows: The CVAX chip contains the first-level cache. Also connected to this bus is the second-level cache Implement an effective cache to reduce the data store and tag store logic. The path to the XMI effective access time and the total traffic to bus is provided entirely by the XMI interface gate main memory array and the XMI corner. This gate array provides all necessary synchronization between the CVAX Implement a write buffer to decouple and and XMI. Each CPU module has its own CVAX rcduce write traffic.

Digital Tecbnicd journnl No. 7 August 1988 CVAX-based Systems

Implement a duplicate tag store to reduce the conligilration w;ls cxpcctetl to be the ~iiostread- overhead ;~nd complexity of maintaining ily available. Since the CVAX has a 32-bit data cachc coherency. p:~th,eight 64K-by-4 parts would naturally pro- vide a 256KB direct-mapped cilche (four times Cache Subssystem the sizc of any prcvious VAX). This configuration \Ve will first look at the issues associated with also provided the optimal one-output-load per designing an effective cache. The main character-

Digital Technical Journal ivo 7 A18gtisl 1988 lnlerfacing LI VAX Microprocessor to a High-speed Mt~ltiprocessing Bus

cycles in which the CVAX is stalled waiting for references are made per instruction 6 and an the second-level cache to become available average instruction on the CVAX requires again after a cache miss increases as the fill between 9 and 10 cycles. This would seem to size increases. The CVAX internal cache indicate that the performance penalty would rcmains accessible while the second-level be about 8 percent (0.8 references divided by cache is being filled. 9.5 cycles), assuming the D-stream miss rate in the internal cache is 0 percent. With an The MB per second to main memory required expected more-typical 40 percent miss rate, to support a given level of performance the penalty would be about 5 percent. increases. If twice as much data is fetched on a cachc miss, the miss rate does not drop by ;I CVAX stalls will increase for references that factor of two.5 Therefore, as fill size increases, occur while the second-level cache fill for a the MB per second required to support a given previous reference is still not complete. This level of performance increases. increase results because the CVAX will need to access the second-level cache on all D-stream Thc "available MB per second" of the bus references. increases. The efficiency of buses that do not have separate address lines (such as the XMI) Assuming a low frequency of RE1 instructions, increases as the transfer size increases. Basi- the I-stream miss rate should improve since cally, the required address cycle can be amor- there will be no contention for cache blocks tized over more data cycles. between the I and D streams. (REIs will cause the I-stream-only cache to flush.) 'The "available MB per second" of the memory controller increases. The memory controllers The module space needs will be less because in the VAX 6200 can deliver more MB per sec- there will be no need for an extra duplicate tag ond if more data is fetched for a given fetch to track the CVAX internal cache. Since the address. CVAX intcrnal cache has two sets, it cannot be practically "followed" by a simple sccond- Based on our significant experience with VAX sys- level, direct-mapped cache. tems, we knew that either the 16-byte or 32-byte fetch would be the right choice. The results from Looked at another way, we could afford to simulation would be used to select the final devote more logic to making the second-level value. cachc more effective if we did not support Another major cache design issue was the CVAX D-stream caching. configuration of the CVAX IKB internal cache. = The complexity lessens with one less cache to This cache can be configured to run in a convcn- keep coherent with hardware. We also had t ional instruction and data stream write-through more flexibility in implementing error-recov- mode. In this mode, the cache must be invali- ery mechanisms and would not have to imple- dated when writes occur to a stored block. Alter- ment a complex mechanism to suppress the natively, the cache can be run in I-stream-onllr generation of XMI write transactions when the mode in which the cache does not have to be invalidate queue was at risk of overflowing. invalidated on writes. Instead, the cache is auto- matically flushed on VAX Return from Exception We planned to use the simulation environment to or Interrupt (REI) instructions. The methods wc quantify the performance penalty that results used to ensure the success of this cache from running the CVAX cache in I-stream-only coherency mechanism are discussed in the sec- mode. tion Maintaining Cache Coherency and Hancll ing Write-Buger Subsystem Cache Error Conditions. Assuming all other things remain equal. Conventional write-through cachesgreatly reduce there is a pcrformance penalty for choosing thc reatl trafhc to main memory but do not reduce I-stream-only mode. If we select I-stream-only the write traffic. Therefore, although the mix of mode, the following occurs: reatl and write references from the CPU itself is weighted heavily toward reads, the traffic down- All D-stream references will require a mini- stream of a write-through cache is primarily mum of two cycles instead of one. Generally. writes. Other cache architectures offer the potcn- for VAX CPUs an average of 0.8 D-stream tial to reduce write traffic. A write-back cache

5 6 Digital Technical Journal No. 7 Augusl 1988 CVAX-based Systems

might be considered the obvious approach. By word XMI transaction (if the buffer contains no caching writes as well as reads, a write-back more than an aligned quadword). CVAX CPU cache offers the potential for the highest perfor- reads (unless interlocked or made to 1/0 spacc) mance multiprocessor system. Nevertheless, the are allowed to bypass the write buffers after tirst complexity is significantly higher than a write- being checked for an address match with the through design. Industry experience is that very write buffer. few write-back caches work on first-pass, and Either a read address comparison match or an their bugs arc very difficult to fix. Another risk interlocked or 1/0 space transaction forces the with write-back caches is in the area of error write buffer to be purged. There are several other recover),. It is much more difficult to recover conditions under which the write-buffer must be from transient cache errors with a write-back pro- flushed. These conditions are discussed in the tocol. To avoid the increased complexity and section Maintaining Cache Coherency and Han- resulting scheclule risk, we decided to pursue a dling Error Conditions. hybrid approach. We would implement a write- We believed the write buffer could provide through cache with a write buffer design very about half the bandwidth benefit of the write- similar to that of the VAX 8800 cache.' back cache but with little more complexity than A write buffer resides between a write-through a simple write-through design. As an added cache and the system bus. A write buffer is actu- benefit, the buffer architecture was already ally a simple, very effective form of a write-back implemented and running with very good perfor- cache. A write buffer takes advantage of the local- mance resu Its in a VAX multiprocessor (VAX 8800 ity of write transactions to reduce the number of family). We planned to use performance simula- write references to main memory by combining tions to confirm that the write buffer was ade- several small write references into a single larger quate to meet our performance goals. transaction to main memory. This behavior has three main advantages. First, almost all buses Duplicate Tag Store (including the XMI) increase in efficiency as the As noted earlier, a multiprocessor environment transfer size is increased. This efficiency results puts significant strain on thc cache coherency because every transfer generally requires the logic. The rates at which write addresses on the transmission of an address cycle before the data. system bus must be checked against the addresses This address cycle is basically fixed overhead that stored in the cache require that a different archi- can be more effectively amortized as the transfer tecture be used for servicing invalidates. size is increased. The transfer sizes and relative The 2K-by-9 tag chips used to implement the efficienciesof the XMI bus are shown in Table 3. main tag store are also used to implement a Second. as previously mentioned. the VAX 6200 duplicate tag store. The duplicate tag store runs memory does not efficiently process longword synchronously with the XMI bus and permits write transactions. The write buffer converts filtering of invalidates, so the CPU would stall significant numbers of longword write transac- only on an XMI write hit. It is not uncommon to tions into full quadword and octaword transac- have ratios of 100 to 10,000 to 1 between dupli- tions that arc processed with many times greater cate tag misses and duplicate tag hits. efficiency. The operation of the duplicate tag store is dis- Finally, the buffer helps to reduce the fre- cussed in the section Maintaining Cache Coher- quency of processor "write stalls," that is, pro- ency and Handling Error Conditions. cessor cycle slips due to writes to main memory We have now defined the basic architectural that back up. The buffer largely decouples the issues that needed to be resolved and have indi- processor from the main memory write timing; cated the alternatives we would like to pursue. In the processor perceives that most writes are com- the next section we present the results of our per- pleted in minimum time. formance simulations. The VAX 6200 write buffer accumulates write The following list summarizes what we exam- data until a memory write address falls outside ined in our simulation environment: the addrcss range of the current block. When this occurs, an alternate octaword buffer begins Determine the loss in performance that would filling. The first buffer is emptied either with an result from running the CVAX internal cache octaword XMI transaction (if the buffer contains in I-stream-only mode instead of combined more than an aligned quadword) or with a quad- 1- and D-stream mode.

Digital TecbnfcalJournal No 7 August 1988 Interfucing a K4X Microprocessor to a High-speed Multiprocessing Bus

Investigate octaword (16-byte) versus hex- we know actually exists) could not be demon- word (32-byte) fill sizes for both I-stream and strated when the model ran with a flush every D-stream. Further, examine the relative miss 5,000 instructions. rates, MBs pcr unit of performance, bus cycles To more accurately model the benefits of large per unit of performance, memory cycles per caches, internal studies of complex timeshare unit of performance, and absolute perfor- loads were undertaken. Multiuser program traces mance. Look at a large multiprocessor system's were run against a cache model. subjecting the sensitivity to main memory access time. cache model to the context-switch behavior of a real system. The cache performance results of Determine thc effectiveness of a write-through that run were compared wjth single jobs run with write-buffer cache architecture. In other against a cache model that was flushed after vari- words, can the writes be reduced sufficiently ous numbers of instructions had been executed. to avoid write-back in the chosen architecture. The results indicated that similar cache perfor- Examine the benefits of a two-way, set-associa- mance results could be obtained in simulation by rive cache over a simpler direct-mapped using a single job trace and complete cache design. flushing every 35,000 instructions. The number 35,000 applies only to a 256KB cache; smaller Performance Simulation caches would have a smaller context-switch The basis of the simulation environment was a interval. We decided to simulate the VAX 6200 high-level performance model of the CVAX chip. with the 256KB cache flushed every 35,000 Written in PASCAL., this model was interfaced to a instructions; the IKB CVAX internal cache would configurablc second-level cache, write buffer, be flushed at the more traditional 5,000 instruc- and memory subsystem. The model accepted tion rate. All simulations would represent a sin- instruction traces for input. At the time the per- gle-processor system; main memory access times formance modeling was done, seven standard would be minimum. The performance results benchmarks were available: DIRECTORY, EDT, would generally be presented as a set of relative FORTRAN, LINKER, MAIL, RUNOFF, and SORT. numbers comparing the alternatives. All instruction traces were captured from a Table 5 summarizes all the cache characteris- VAX-1 1/780 system. Since each trace was for a tics we would simulate. single process, one of the major issues was detcr- mining how to correctly niodel the effect of CVAXInternal Cache Conjiguration timesharing on cache performance. The first aspect examined was the CVAX cache The very nature of timesharing has a negative configuration. As shown in Table 6, the I- and cffcct on cache performance as compared with D-stream design offered an average increase in single process runs. Ideally, the cache would be performance of 5 percent over the I-stream-only dedicatcd entirely to holding instructions and cache. We concluded 5 percent average perfor- data associated only with a single process. In mance co~ild be sacrificed in return for the rimcshare systems, processes are not initiated and reduced complexity of the I-stream-only design. then run nonstop to completion; instead the CPU is constantly switching from process to process. This switching requires the cache resources to be Table 5 Cache Characteristics Simulated distributed across a number of processes and Second-level therefore reduccs the effectiveness of the cache. CVAX Cache Cache A VAX- 1 1/780 study0 indicates that the average -- - - number of instructions between context switches Associativity 2-way Direct-mapped/2-way on a VAX system is about 5.000 instructions. A Configuration I & Dl1 only I & D traditional and vcry conservative approach to Size 1 KB 256KB simulating the effect of context switches is to Block size 88 64B flush the entire cache every 5,000 instructions. Fill size 88 1681328 Flushing the cache every 5,000 instructions Tags 1 K 4 K was not a big penalty for small caches that Simulated 5,000 35,000 could quickly refill thenlselves after a flush; context instructions instructions switch rate however, the advantage of larger caches (that

3 8 Digital Tecbnical Journal No. 7 Augrrsl I988 CVAX-based Systems

Table 6 CVAX I-stream and I- and D-stream Octuword versus Hexword Fill Size Relative Performance Choosing an octaword or a hexword fill size was I-stream I- & D-stream the next and probably the most complex major issue. The results are shown in Table 7. In all Average 1.OO 1.05 cases, relative numbers are used with the charac- Minimum 1.OO 1.03 teristics of the octaword machine as the refer- Maximum 1.00 1.07 ence.

Table 7 Octaword versus Hexword Fill Size Results Relative Fill Size Performance

Octaword All 1.OO Hexword Average 1.01 Minimum 1.01 Maximum 1.02 Relative Miss Rates Relative MB/sec Fill Size I-stream D-stream All Reads I-stream D-stream All Reads

Octaword All 1.OO 1.OO 1.OO 1.OO 1 .OO 1.OO Hexword Average .56 .84 .71 1.12 1.68 1.42 Minimum .54 .81 .68 1.08 1.61 1.36 Maximum .57 .87 .76 1.14 1.74 1.52

Percent XMI Percent Memory Fill Size I-stream D-stream All Reads I-stream D-stream All Reads

Octaword All Hexword Average .93 Minimum .90 Maximum .96 1.45 1.27 .95 1.41 1.27

Relative Relative Percent Percent XMI Memory Fill Size Utilized Utilized

Octaword All 1.00 1.OO Hexword Average 1.08 Minimum 1.06 1.03 Maximum 1.09 1.07

Digital Tecbnical Journal 39 No 7 Arrgusl 1988 interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus

A summary of the results in Table 7 follows: more bus cycles and 16 percent more memory cycles to support read traffic to main memory. The fill size has a negligible effect on perfor- Eighteen percent and 16 percent may seem mance (less than 1 percent difference). The like a big increase, but it is important to look hexword alternative delivered an average of at overall bus bandwidth. On a write-through 1 percent better performance. interconnect, the writes generally dominate It is important to keep in mind that the simula- the traffic. tion was performed assuming the minimum The overall bus traffic (taking into account delay from main memory. In a multiprocessor writes) increased by only about 9 percent. system, the alternative with the lower miss Overall memory controller cycles increased by rate increases in performance relative to the even less - only about 4 percent. The low other alternatives as the main memory access increase resulted because the ratio of write time increases. cycles to read cycles is higher in the memory Hexword fetches dropped the overall miss rate controller than on the XMI bus. by almost 30 percent. (As expected, the Based on this data, we chose the hexword I-stream miss rate improvement was much fill alternative. We felt the potential for higher - almost 50 percent.) significantly more consistent performance in The megabytes per second required to main- large multiprocessor configurations (due to tain a given performance level increased by decreased cache miss rate) was worth the esti- about 40 percent overall for the hexword mated 9 percent increase in bus utilization fetch. Write Bufler Eflectiveness and Overall As mentioned earlier, we were not as con- Bus Utilization cerned about megabytes per second as much as We were pleased to find that the write buffer was the percentage of the bus and memory con- about as effective as we had predicted. The data troller cycles per second. In this light the hex- in Table 8 compares the XMI write traffic gener- word alternative required about 18 percent ated with and without a write buffer. The data is quite consistent. On average, the write buffer Table 8 Write Buffer Effectiveness reduced the number of write cycles on the bus by Ratio With Write Buffer/ slightly less than half (45 percent) and reduced Without Write Buffer' the memory controller cycles by slightly more than half (5 1 percent). Write Buffer XMI Memory Table 9 shows the bus utilization by the Miss Rate Utilization Utilization VAX 6200 CPU running the test benchmarks. Average 47.1 '10 .55 .49 Using the average bus utilization number of Minimum 40.4% .50 .42 6.27 percent still yields only 50 percent for a full Maximum 54.9% .64 .58 eight-processor system; the 7.25 percent maxi- mum value yields 58 percent utilization. These The ut~l~zat~onnumbers are expressed as rattos between the ut~l~zat~onw~th a wrtte buffer and the ut~l~zationwithout the write figures are well within our 75 percent utilization buffer. design goal, and we decided to implement the write-buffer instead of the write-back design. Another more conservative way to look at the Table 9 XMI Bus Utilization per CPU data is to assume that we may not have the worst- I-stream D-stream case cnvironment covered in any single bench- Reads Reads Writes Total* mark. Therefore we should look at the "sum of maximums" to determine whether the design Average .89% 1.39% 4.41% 6.27% goal is met. Using the sum of maximums Minimum .24% 1.26% 3.57% 5.27% approach, we require 9.72 percent of the XMI Maximum 1.65% 2.1 0% 5.97% 7.25% per processor, or about 78 percent for eight pro-

' The numbers In thls column are averages of the total XMI bus cessors. This figure is sufficiently close to our ut~l~zat~onacross the seven workloads These numbers are not design goal of 75 percent maximum utilization sums of the ~nd~vldualutil~zation percentages In each column. to be acceptable.

Digital Tecbnical Journal No 7 August 1988 CVAX-based Systems

Eflect of Associativity Table 10 Direct-mapped versus Two-way We next explored the benefits of associativities Cache Performance greater than one. Implementation of a cache All Reads other than a direct-mapped cache was probably Relative Relative not practical. However, we wanted to examine Miss Rates Performance the performance results. Direct- Two- Direct- Two- The results given in Table 10 indicate that a mapped way mapped way two-way, set-associative cache could reduce the overall miss rate by 13 percent, whereas the per- Average 1.OO .87 1 .OO 1.01 formance gain was negligible (1 percent). This Minimum 1.OO .74 1 .OO 1.OO improvement in miss rate is fairly significant; Maximum 1.00 .95 1.OO 1.02 but we determined it was not practical from a module real estate and electrical timing perspec- tive to implement other than a direct-mapped meaning of coherency and the methods employed scheme. To implement a fast two-way cache, two by the VAX 6200 project engineers to ensure separate RAM arrays must be supported. This coherency. We also describe our techniques for implementation requires roughly twice the mod- supporting recovery from all single-bit transient ule area of a directed-mapped approach. A two- cache errors. way cache can be implemented with a single For this discussion, we divide the cache sub- RAM array (cannot start the RAM look-up until system of the VAX 6200 into three sections. Fig- the proper set has been identified), but this ure 5 shows the three major subsystems in the would force the access time to increase by a VAX 6200 cache: cycle. Increasing the access time to the second- level cache would be particularly undesirable to The CVAX internal I-stream-only cache the VAX 6200 designers since we had already decided to configure the CVAX cache in I-stream- The 256KB I-and D-stream cache only mode. (With an additional cycle, all D-stream references would then require a mini- The 16-bytewrite buffer (a form of write-back mum of three cycles.) Board area constraints and cache) increased cache access time are the two most common reasons for rejecting the miss reductions CVAX I-stream-only Cache of the multiway cache in favor of the simplicity The first cache, contained within the CVAX chip and the practical, fast access time of the direct- itself, is configured for I-stream-only operation. mapped cache. In that mode, the CVAX flushes the entire con- tents of the cache whenever a VAX RE1 instruc- Maintaining Cache Coherency and tion is executed. Motivated originally by the Handling Cache Error Conditions potential problems with instruction prefetch As mentioned in the introduction, a major chal- buffers, the VAX architecture defines rules for lenge to a multiprocessor designer is to imple- software to assure that writes to I-stream data ment a reliable scheme for cache coherency. produce predictable results.* In all cases, if the Coherency is a term somewhat difficult to define. rules are not followed, stale data may be read In this section, we give some insight into the from the cache and cause unpredictable results.

1 KB 256KB 16B CPU - I-STREAM I- AND D-STREAM - WRITE -TO MAIN MEMORY CACHE CACHE BUFFER

Figure 5 VAX 6200 Cache Subsystems

Digital Technical Journal 4 1 No. 7 Augusl 1988 -Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus

Second-level I- and D-stream

256KB Cache STORE STORE The second-level cache is architecturally similar 1 El to caches used on most VAX systems. With a write-through design, the cache stores both

I- and D-stream data. Coherency is maintained by XMI INTERFACE monitoring all writes from other devices to main memory and invalidating cached locations that correspond to any of the monitored writes. The processor does not generate invalidates for its own writes to main memory since the cache is EIGHT-ENTRY write-through; a write by the processor itself that INVALIDATE QUEUE hits in the cache immediately updates the appro- priate location. The VAX 6200 second-level cache coherency logic is shown in Figure 6. A duplicate tag store is located on the multiplexed XCI bus. This store contains a duplicate copy of the 4,096 cachc tag XCI BUS entries, which are in the second-level cache located on the CDAL. The duplicate tag store tracks the primary tag store on allocates by moni- s toring XMI read transactions. Whenever an XMI XMl CORNER memory space read is initiated, the CPU allocates I I the cache block that corresponds to the read address. The duplicate tag store also monitors all XMI XMl BUS write transactions and performs a duplicate tag store look-up. If a hit occurs and the write was Figure 6 Second-level Cache Coherency Logic not from this CPU, then the duplicate tag location is invalidated. The address is then loadcd into an flow the CPU's invalidate queue. Instead of eight-entry invalidate queue implemented in the adding significant complexity to the invalidate XMI interface gate array. Cache invalidates are controller to suppress the generation of XMI not performed in response to an individual CPU's write commands when the invalidate queue is at own writes since the write-through second-level risk of overflowing, the overflow condition is han- cache always contains the most recent data. dled as an exception condition. (This subject is When an entry has been loaded into the invali- discussed in the section Handling Second-level date queue, the CDAL interface logic arbitrates Cache Error Conditions.) For this alternative to for thc CDAL and invalidates the full 64-byte be practical, we had to ensure that invalidate block in which the write address was located. queuc overflows would be very rare; we felt this The use of a duplicate tag store reduces CDAI, was ensured by the depth of the invalidate queue traffic to only necessary invalidate transactions. (eight entries) and the optimized design of the After performing an invalidate, the XMI interface invalidate control ler. gate array checks for any additional invalidates that may have accumulated while thc previous The Write Bufler invalidate was being serviced. If another invali- A write buffer design offers the designer oppor- date request exists, then it is serviced prior to tunities to break cache coherency rules. The release of the CDAL. This procedure ensures that VAX 6200 CPU follows several rules to maintain invalidates are serviced as quickly as possible. coherency. The VAX 6200 hardware automati- The CVAX bus interface ensures that the invali- cally flushes the write buffer under the following date logic is given an opportunity to use the conditions: CDAL between every CVAX bus operation. Though occurring very infrequently, the XMI In response to a write that misses the currently bus could issue writcs quickly enough to over- active write buffer. The current write buffer is

4 2 Digital TechnicalJournal No. 7 August 1988 CVAX- based Systems

flushed while the new write is accepted by tine responds by logging the error and then the alternate buffer, thus write ordering is flushing and reenabling the cache. maintained. The following error conditions cause thc XCP hardware to disable the second-level cache. Thc Before an XMI 1/0 space rcad or write refer- errors are of two forms. The first two are error ence is performed 1/0 references could result conditions that potentially result in a n~issed in the initiation of an l/O operation that may cache update on a write-through; the last three require the data from rhc write buffer. deal with conditions under which an invalidate is Before an interlock read or unlock write refer- potentially missed: ence is performed. Interlock sequences are Subblock valid bit parity errors - The the primary means for synchronization VAX 6200 CPU supportsa doubly-redundant set between processors and must always force all of subblock valid bits. On a cache look-up, outstanding writes to main memory. ifthe twocorrespondingvalid bitsdonot match. Bcfore an interprocessor interrupt is per- then the hardware reports a parity error and formed. ks with interlocks, interprocessor forces a cache miss. If this error occurs on a interrupts are used for synchronization be- write-through that should have hit in the cache, tween processors and must always force all then the cache state is no longer consistent. outstanding writes to main mernory. Cache tag parity errors - The tag chips used Before issuing an XMI read to a location that on the VAX 6200 CPU support parity on the inclucles the data contained in the write full tag address. As with valid bit errors, a tag buffer. The write buffer contents are flushed to parity error can result in a missed write- main mernory and then the XMI read is issued. through. Reads that miss the write buffer do not force a XMI inconsistent parity error - If the CPU write buffer tlush ("write buffer bypass"). detects an XMI cycle that has bad parity and Following the assertion of the CVAX clear- that cycle is acknowledged by another proces- write-buffer pin, the CPU flushes the write sor, then the worst-case assumption is that the bufir to main memory. This form of write duplicate tag logic just missed a write transac- buffer flushing is primarily used to associate tion that should have resulted in an invalidate. failed writes with a given process. If no associ- Duplicate tag store parity error - As with the ation could be made, then the operating sys- previous error, the processor has to assume the tcm would always have to crash the entire sys- parity error resulted in a missed invalidate. tem on every failed write transaction. Invalidate queue overflow - Again, this con- Handling Second-level Cache Error dition is similar to the one above except that Conditions this condition does not require a transient Onc of the major goals of the VAX 6200 design error in the system. Instead, an invalidate was to provide improved system reliability. One queue overflow is the result of a very rare com- method we used was hardware-enforced soft bination of XMI writes that result in a queue failover in response to many error conditions, backup and the potential loss of invalidates. combined with efficient software recovery proce- The system responds to this condition just as it dures. This method was used extensively when would for all other cache errors. dcaling with all types of second-level cache Actual System Performance Results errors. In general, the individual processors have the We were very interested in determining how well responsibility to recover from potential cache our simulation results matched real -world opera- coherency failures. When errors occur that tion. We decided to focus on several key aspects may leave thc second-level cache incoherent, the of the system to bound the task of correlating sim- VAX 6200 processor hardware automatically dis- ulation with the real world. Specifically, wc ables the cache. Disabling the cache ensures that planned to the system can continue to run "safely," albeit at Confirm that the VAX 6200 CPU performs as reduced performance. The processor then posts a expected relative to the MicroVAX 3600 sys- "soft" error interrupt. The interrupt service rou- tems. If the cache subsystem behaves as

Digital Technical Journal No. 7 Aug~rsl 1988 Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus

expected, then the VAX 6200 performance that our simulation indicated that the percentage should exceed that of the MicroVAX 3600 XMI consumed would be 6.27 percent. (See systems by the clock rate improvement minus Table 7.) the penalty for running the CVtLY cache in I-stream-onlymode. Multistream Performance on Compute- Confirm that the simulation traces adequately intensive Benchmarks "stress" the memory interface such that extra- It is beyond the scope of this paper to present the polation to real workload performance is multiprocessor simulation data that was gener- valid. The percentage of the XMI bus con- ated prior to design. That data indicated that the sumed would be the basis for this comparison. VAX 6200 system performance on compute- This characteristic includes all the effects of intensive benchmarks would be nearly linear references per instruction and miss rates and when running from one to eight processors. illtirnatcly cletermines the performance of a Tests to date have confirmed our high expecta- multiprocessor machine tions. On compute-intensive workloads, a four- processor system consistently provides better Confirm that the cache subsyste~nsupports than 3.95 times the throughput of the single- very effective utilization of multiple proces- processor system (less than 2 percent degrada- sors. VAX 6200 multistream throughput mea- tion). Limited configuration testing on systems surements form the basis of this verification. with up to eight processors indicates that Compare the results from the simulation tests compute-intensive workloads continue to per- with similar workloads run on real machines. form very well. An eight-processor system per- formed at 7.75 times the single-processor (less Comparing VAX 6200 and than 5 percent degradation). Micro VAX 3 600 Systems Fully Characterized Workloads Due to the similarities between the two systems, our first approach was to compare the perfor- We also instrumented a VAX 6200 system to mea- mance of the VAX 6200 to the MicroVAX 3600 sure a number of processor characteristics, systems by running a set of 100 compute-inten- including bus utilization We wanted to deter- sive benchmarks. The VAX 6200 has a 12 percent mine how much the real workload runs varied cycle time advantage (70 ns to 80 ns), but it is from the simulated runs. The test methodology somewhat handicapped by the I-stream-only limi- was quite simple. tation placed on the internal cache. Recall that Command files were created that executed a our performance simulation indicated this single benchmark. These individual bench- penalty would average about 5 percent. (See marks were designed to correspond with the Table 6.) On average then, we expected the simulation traces listed at the beginning of the VAX 6200 CPU to be about 7 percent faster than Performance Simulation sect ion. the MicroVAX 3600 CPIJ. The compute-intensive benchmarks basically confirmed this number; The Digital Command Language (DCL) com VAX 6200 performance averaged 6 percent faster mand files were of the following form: than the MicroVAX 3600 CPU. Multiprocessor Bus Bandwidth @flushcache ! initially f lush the cache Utilization - Real and Simulated estarthardwaresample! start the Workloads measurement hardware We have run several forms of multiuser time- ggetcputirne ! get the initial CPU time sharing workloads on thc VAX 6200 system. These workloads include Digital's standard run benchmark ALL-IN- I workload, an order processing bench- mark (Compu-Share), an electrical CAD work- egetcput ime ! get the final CPU time load, and a software development workload \n @stophardwaresample! stop the all cases, the average percentage of the XiMJ used measurement hardware per processor ranged from 3 75 to 5 0 Recall

Digital TecbnfcalJournal No. 7 August 1988 CVAX-based Systems

The measurement hardware consisted of two Table 11 Simulated versus Actual XMI Bus Tektronix DAS 9200 Logic Analyzers; one mon- Utilization itored the processor bus, and the other was Simulated/ attached to the XMI. The start-measurement Simulated Actual Actual command file simply referenced a specific I-stream I-stream Ratio XMI 1/0 space address on which the DAS 9200 analyzers would trigger and start taking mea- Average 0.84% 0.32% 2.6 surements. Similarly, the stop-measurement Minimum 0.24% 0.17% 1.4 command file would reference another XMI Maximum 1.65% 0.52% 3.2 I/O space address that w-ould cause the logic Simulated/ analyzers to stop acquiring data. Simulated Actual Actual D-stream D-stream Ratio This technique made the measurement process Average 1.63% 0.74% 2.2 simple and repeatable. The overhead of the com- Minimum 1.26OI0 0.26% 4.8 mand file was measured by running the command Maximum 2.10% 1.10% 1.9 file with the "run benchmark" line removed. This overhead was then subtracted from the results Simulated/ obtained from benchmark runs. Run-to-run con- Simulated Actual Actual Writes Writes Ratio sistency was better than k 10 percent. The logic analyzers captured the data neces- Average 4.46% 4.86% 0.92 sary to determine the total number of XMI read Minimum 3.57% 3.84% 0.93 and write references that occurred during the Maximum 5.97% 5.75% 1.04 execution of the command file. This data was Simulatedl used to calculate the total number of XMI cycles Simulated Actual Actual used by the processor. To derive the percentage Overall Overall Ratio of the XMI utilized, the total XMI cycles were reduced by the command file overhead, and the Average 6.09% 4.86% 1.25 result was divided by the benchmark CPU time. Minimum 5.27% 3.84% 1.37 This method ensures that the XMI percentage is Maximum 7.25% 5.75% 1.26 not artificially low due to the inclusion of null time elapsed while the processor is waiting for 1/0 activities associated with the benchmark to complete. The results are shown in Table 11. The data indicates that the simulation traces ments indicated that the simulated traces used required significantly more XMI read bandwidth 25 percent more bandwidth than the actual (on average more than double) than the similar workloads. actual benchmarks. This result is not unexpec- ted, since the simulation runs were designed to Conclusions and Future Work simulate a worst-case timeshare workload. (This The VAX 6200 design experience has demon- goal influenced the choice of 35,000 instruc- strated that trace-driven simulation is a power- tions for the cache flush interval.) The real ful tool in the design of a multiprocessor bus workloads were run on standalone systems, and interface. Because the designers were able to therefore the cache performance was expected to make informed trade-off decisions, the design be higher. We are currently studying the effect of met or exceeded all performance goals; and the heavy timesharing in multiprocessor systems on reduced design complexiry helped bring the cache performance. Initial results indicate that system to market on schedule. It is a tribute to our simulation runs are still conservative. the team's appropriate control of complexity and The results for writes, which are unaffected by to the rigorous verification process'0 that the context switch rates, matched the actual bench- first-pass VAX 6200 CPU printed circuit design marks quite closely. The actual benchmarks and XMI interface gate array are currently ship- required about 4 percent to 8 percent more ping in VAX 6200 systems. At Digital, this level bandwidth than the equivalent simulation trace. of success is unprecedented for a machine of this Combined read and write bandwidth require- complexity.

Digital Technical Journal No. 7 Augusl 1988 Interfacing a VAX Micropro~essorto a High-speed Multiprocessing Bus

The continuing trend toward multiprocessing Chip," Digital Technical Journal (August and faster processors will force increasing depen- 1988): 109-120. dence on complex cache subsystems to deliver 4. J. Winston, "The System Support Chip, a the desired system performance. It follows that Multifunction Chip for CVAX Systems," minimizing the complexity of the cache sub- Digital Technical Journal (August 1988, system will help support ever decreasing time- this issue): 121-1 28. to-market schedules. Accurate cache simulation techniques will be required to select the imple- 5. A. Smith, "Line (block) Size Choices for mentation that meets the performance goals and CPU Cache Memories," ZEEE Transactions is minimally complex. on Computers, vol. C-36, no. 9 (September 1987): 1063-1075. Acknowledgments 6. D. Clark and J. Emer, "A Characterization of The author would like to acknowledge the Processor Performance in the VAX- 1 1/780," work of the following individuals who con- Proceedings of the 11th Annual Sympo- tributed significantly to the VAX 6200 CPU archi- sium on Computer Architecture (June tecture and design: Brian Allison, Giao Dau, 1984). Ron Desharnais, Glenn Herdeg, Dave Ives, 7. J. Fu, J. Keller, and K. Haduch, "Aspects of Bimal Sareen, Simon Steely, and Doug Wil liams. VAX 8800 C-Box Design," Digital Technical Journal (February 1987): 4 1-5 1 . References 8. T. Leonard, ed., VAX Architecture Refer- 1. B. Allison, "The Architectural Definition ence Manual (Bedford: Digital Press, Order Process of the VAX 6200 Family," Digital NO. EX-3459E-DP, 1987). Technical Journal (August 1988, this issue): 19-27. 9. B. Moses and K. DeGregory, "Performance Evaluation of the VAX 6200 Systems," 2. T. Fox, P. Gronowski, A. Jain, B. Leary, and Digital Technical Journal (August 1988, D. Miner, "The CVAX 78034 Chip, a 32-bit this issue): 64-78. Second-generation VAX Microprocessor." 10. J. Basmaji, G. Garvey, M. Heydari, and Digital Technical Journal (August 1988, A. Singer, "The Role of Computer-aided this issue): 95-108. Engineering in the Design of the VAX 6200 3. E. McLellan, G. Wolrich, and R. Yodlowski, System," Digital Technical Journal (August "Development of the CVAX Floating Point 1988, this issue): 47-56.

Digital Technical Journal No. 7 Au~usl1988 Jean H. Basrnaji Glenn P. Garvey Masood Heydari Arthur L. Singer meRole of Computer-aided Engineering in the Design of the VAX 6200 System

The success of the VM6200 design ispartly attributable to the deuelopment and implementation of a total verification plan. The goal of this plan was to shorten the total system design cycle; the approach was toperform su!- cient ueri@ation to ensure thatfirst-passparts would boot and run VMS at speed. The team responsible for achieving the goal began implementing the vert@ationprocess on availability of thefirst design specz@ation. The team's eflorts continued concuwent with those of the module design team. Milestones for the process reflect the um@ation team's top-downfunc- tional approach, proceeding from architectural-levelvenfiation through logic, timing, and system vertfication, and concluding with vector genera- tion. Review and reporting methods established for the project ensured all functions were tested and veriJied.

"l'his paper presents an overview of the computer- The heart of the system is a new interconnect aidcd engineering (CAE) and CAE-based design called the XMI. This interconnect was specifi- verification test (DW) approach to the dcvclop- cally designed to serve as the processor-to- rncnt of the VAX 6200 system. Our intent is not memory interconnect in the VAX 6200 system to givc a step-by-stcp description; therefore, few and its derivatives. Optimizations of and tradc- dctails of the implcmcntation are given. Thc offs in the design of the XMI were made w~ththat CAE/DVT Group dcvclopcrs believe that projcct- function foremost in mind. The kcy featurcs of spccifc problems arc generally best solved by the interconnect are as follows. project-specific solutions. Instead, we offer a bro;ld overview of CAE which includes the engi- The pended bus design allows multiplc trans- necring principles established for the VAX 6200 actions to be in progress at the same time; thus projcct and which we believe will be of use to waste of bandwidth is minimized, for instance. those planning a task of similar scope. during memory read accesses.

A Brief VAX 6200 System Overview The XMI implements the concept of comman- No discussion of CAE or DVT methodologies can der nodes ant1 responder nodes. A commander takc place without a description of the task to node initiates a bus transaction to which a which these methods are applied. For our pur- responder node must respond. poses, the overall task was to engineer, proto- type, debug, and release for manufacture the The XMI is a centralized arbitr~tionintercon- VAX 6200 mid-range computer system. nect. Arbitration logic, resident on the back- The VAX 6200 multiprocessor architecture plane, grants bus mastership according to a is implemented with CMOS technology.'.' The modified round-robin scheme. There is a system is housed in a 156 by 79 by 76 cm cabi- higher priority responder round-robin queue net, which contains a system bus backplane, two and a lower priority cornmandcr round-robin 6-slot VAXBI backplanes, a TK50 tape drive, queue. space for future rack-mount devices, power sup- plies, and blowers. Bus width is 64 bits.

Digitd Tecbnicd Journal No. 7 Augusf 1988 The Role of Computer-aided Engineering in the Design of the VAX 6200 System

Cycle time is 64 ns The XMI supports reads of quadword, octa- word, and hexword length, and writes of quad- word and octaword length.

Raw bandwidth is 125 megabytes (MB) per XMI second. The XMI supports three module types in the VAX 6200 system: the CPU module (KA62A), a DWMBA DWMBA 32MB memory array (MS62A), and an XMI-to- 57 VAXBI adapter module set (DWMBA). Q The CPU module is based on the CMOS VAX (CVAX) chip set, which includes a MicroVAX architecture microprocessor (CVAX) , a floating point accelerator (CFPA) chip, and a system sup- Figure I XhUModule Connections on a port chip (SSC). The module supports full VAX VAX 6220 System capabilities, excepting only PDP- 1 1 compatibil- ity mode. In addition to a two-way associative needed to bring a product to market. The defini- I-stream cache in the CVAX chip, the module tion of CAE and the way engineers use CAE to contains a 256 kilobyte (KB) direct-mapped accomplish this goal differs from project to pro- cache. Performance is approximately 2.8 times ject and even within a single project. Neverthe- that of a VAX- 1 1/780 processor. less, two principles are preeminent. 'The MS62A is a 32MR memory array module 1. CAE should provide the tools, the methods, with an on-board controller. Modules may be and perhaps most importantly, the discipline interleaved irp to eight ways to decrease laten- that together enhance an engineer's productiv- cies. Each module has an eight-deep command ity without unduly restricting his or her cre- queue. The arrays are fully error-correction code ativity. (ECC) protected. The DWMBA is an adapter module set which 2. CAE should provide a continual check to allows the 6200 system to access 1/0 devices on ensure that the engineer's product meets the the VAXBI bus. The DWMBA/A module, which needs of the project in terms of both function resides in a single XMI slot, is connected by cable and quality. to the DWMBA/B module, which resides in a sin- The role of the CAE/DVT Group on the gle VAXBl slot. The DWMBA can support up to VAX 6200 project was different from the tradi- full VAXBI bandwidth of 13.3MB per second on tional CAE role in one significant respect. The write transactions and approximatcly 5.5MB per group's primary responsibility would not be the second on rcad transactions. dcvelopment of CAE tools and processes. Instead, Figure 1 illustrates how these system elements its responsibility was the delivery of first-pass interconnect in a two-processor system with two hardware that was functional at speed. Explicitly, VAXBI channels. our goal was to ensure that the system would Because the VAX 6200 system backplane has boot the operating system (VMS) and run soft- 14 slots, many system configurations are possible ware the first time the system was powered up. with differing numbers of processors, memory The only tools and processes developed were modules, and I/O channels. those specifically necessary to fulfill that goal. In the scctions following, we describe the engi- The project team felt that objective simula- neering process employed in the design of these tion and verification of the hardware and its logic elements. performance by the CAE/DVT Group would (1) enhance the chances of first-pass functional- CAE VeriJicationChallenges ity, and (2) reduce the overall design cycle by and Organizational Struchire paralleling the CAE and the design efforts. Con- The overriding goal of any CAE effort is always sequently, the CGE engineers were active con- the same: to shorten the development time tributors to the architecture and participated in

Digifal Tecbnical Journal Nn 7 As,ot,ct IORR CVAX-based Systems

choosing alternatives, effecting compromises, The DVT specification must begin as early and implcrnenting details of the design. The CAE as specification of the hardware functionality Group was responsible for the correctness and begins. Working the two specifications in parallel quality of the designs and not just for the delivery ensures functional verification of the system. Fur- of tools to accomplish that correctness. To ther, the DVT specification should be treated achieve this goal, tasks traditionally performed with the same formality as the hardware specifi- using DVT methods would be accomplished cation; that is, it should be reviewed, and all using CAE n~ethodology. reviewers must agree upon its completeness. By formalizing the specification review, project CAE Tasks members are in effect establishing its value to the Given the charter described above, the CAE/DVT project. The DVT specification defines what is to Group outlined the following tasks: be simulated; therefore, superior design tools and modeling cannot substitute for the assurance Select a tool suite of design accuracy that the specification affords. As to the verification of the hardware, the m Create a process for the CAE effort responsibility of the CAE team was to ensure bug- Maintain the databases free and operable component, module, and sys- tem designs. Team members ran the simulations, Construct a CAE environment (models and isolated the bugs, and ensured designs were cor- computes) rected by the design team. Simulations were not done exclusively by the CAE team, however. The Generate test cases to run against the environ- environment was available equally to all design ment team members. To the extent that each team felt Isolate and report bugs was appropriate, designers initially debugged their designs before passing them to the CAE Verify the hardware team for more formal debug. In this way, obvious bugs were found more quickly. Design develop- Generate test vectors for outside vendors ers did excellent work in this regard and greatly Generate test vectors for manufacturing eased the burden on the CAE team. Further discussion of the VAX 6200 hardware Fault grade the test vectors verification is presented in the section Verifica- tion Milestones. Define exit criteria for committal of design to hardware Modeling Approach Enforce compliance with exit criteria Hardware verification done in software is by nature a slow process. The major factor con- Though the list is long and has some interesting tribut ing to the slowness of the verification is the tasks, two items constituted the largest portion of size of the design. The size is not simply the num- the work: generation of test cases to run ber of logic elements in the design, but the col- against the environment, and veriJication of lective size of the models of each of the elements the hardware. in the logic network. The generation of test cases is the most time- We used two types of models for the VAX 6200 consuming, least glamorous, and most often over- project, behavioral and structural (or gate level). looked task; yet the test cases are the single most Behavioral models, in general, were more important piece of a superior CAE effort. A suc- abstract and efficient in terms of increasing cessful specification of the test cases (the DVT overall simulation performance as compared to specification) to be run against a CAE environ- detailed structural models. ment requires a lengthy period of development. Behavioral models of many of the components The development time for the KA62A, MS62A, used in the system were generated early in the and DWMBA DVT specifications was approxi- design cycle. As the design progressed and mately 6 man-months each. Moreover, the speci- detailed logic schematics became available, how- fication is not static and must be kept current ever, the behavioral models, in most cases, gave with the evolving design. way to detailed structural models. The exception

Digital Tecbrrical Joumul 4 9 No. 7 Au~lis11988 The Role of Computer-aided Engineering in the Design of the VAX 6200 System was the behavioral models of the CVAX chip set. an additional cluster of eight VAX 8800 systems These detailed models were used throughout the was used. verification process. Given the size and complex- ity of these components, simulation with struc- Ver3cation Milestones tural models, for all practical purposes, was Key milestones were established for the verifica- impossible. tion team throughout the VAX 6200 design In general, our objective throughout the verification process. In December 1985, we verification process was to ensure accuracy and began with the first XMI interconnect verifica- not speed. The slow speed of the more accurate tion; we proceeded to performance evaluation, models was addressed by applying more computc logic verification, timing verification, system power to the task at hand. verification, and vector generation. These milestones were derived as part of CAE StaBng and Resources our functional top-down verification approach. The CAE group was divided into small teams, We selected this approach based on our deter- each responsible for the verification of a VAX mination that if a function works correctly, then 6200 subsystem. The size of the teams varied. all of its component logic must be working The KA62A gate array and module team had four correctly. CAE engineers. Four CAE engineers worked on We therefore chose to model our different the DWMBA gate arrays and modules. The MS62A design objects in the largest reasonable forms and gate array and module was assigned one CAE engi- then functionally test these models. Every step neer. As it turned out, these numbers represented naturally lead into the next task of the system nearly a one to one ratio with the hardware design. This approach was later extended to the designers. As senior, experienced engineers, system as a whole; the system simulation com- team project leaders were responsible for the bined the logic and ran code to exercise the ovcrdll coherence of the DVT plan and its quality, entire system. and were responsible as well for tracking and resolving problems. Architectural Ver@cation Each team included a diagnostic engineer who At the architectural level, the simulations focused was also working on design verification test. This on the verification of the new system intercon- arrangement provided the diagnostic engineers nect, the XMI. As noted earlier, this interconnect, early training and also facilitated testing. More- specifically developed for the VAX 6200 system, over, diagnostic engineers were in a position to is a memory interconnect bus with a new arbitra- easily evolve some of the DVT tests into self-tests tion scheme and a defined bus interface protocol. and ROM-based diagnostics for the VAX 6200 Both the arbitration and the protocol are imple- product. mented in CMOS semicustom technology. The educational background of the CAE team Once the design for the bus protocol and arbi- was a mix of electrical engineers, computer engi- tration was established in a specification form, we neers, and software engineers. Their levels of immediately transformed the specification into experience varied from new college hires to high-level behavioral models: the arbiter chip those with 10 or 15 years of work experience. model, and an XMI commander transactor model. The level of relevant hardware experience in this The behavioral arbiter model represented a group is ~ndicativeof the group's tasks, as com- generic, round-robin arbitration scheme; the pared with othcr CAE groups that are more commander model represented a generic XMI involved in tools generation. commander design. The commander model con- Our computer resources consisted of a cluster tained a flexible user interface to allow the of eight CPUs, including one VAX 8800 system, specification of any desired well- or ill-formed one VAX 8650 system, and six VAX-11/780 sys- transaction to be generated on the bus. Further, tems. All four modules (KA62A, DWMBA/A, the commander transactor model was designed to DWMBA/B, and MS62A) and their associated selectively self-check for any protocol violations. gate arrays were verified throughout most of The two models were the basis for all XMI the project on this cluster. During final regres- design verification. This first level of verification sion testing of each module, in which the full provided feedback to the architecture team set of DVT tests was run against the design, quickly and answered questions about the inter-

Digital Technical Journal No. 7 August I988 CVAX-based Systems

face protocol and arbitration scheme. As a result, Clearly, an architectural verification that con- the arbitration was enhanced and the protocol centrates on a new bus leaves out many other was refined to satisfy the design goals. Specifi- arcas of architectural interest. A severe restriction cally, a few new signals were added, and the arbi- of the scope of the VAX 6200 system's architec- tration was changed from true round-robin to a tural verification was deemed necessary because modified round-robin. of the lack of schedule time and because of the At the next level of architectural verification, immaturity of the art. Nevertheless, architectural we modeled an XMI responder node and incorpo- verification is a key area where much work should r~teclthis motlcl into the simulation environ- be done for the development of the next system. mcnt. The team developed a behavioral XMI memory model ;ind completed a high-level sys- Performance Evuluation tcm model. This model was still totally behav- The next verification task was performance evalu- ioral and represented a system with generic XMI ation. Again, work was concentrated into two commanders and responders. well-defined areas, that is, the bandwidth perfor- Two pieces of tcst code were generated and mance of the XIMI, and the processor perfor- verified on that model. The environment mod- mance in the multiprocessing environment. eled a fully loaded XMI interconncct. The first A model of the CVAX processor was obtained picce of tcst code was structured in such a way from the Semiconductor Engineering Group that every notlc on the XMI generated its own dcsjgn team. We enhanced this model to include trathc. Cornmandcrs generated all possible com- an XMI interface with a memory port. The stimu- mander sequences, and responders generated all lus for this model proved difficult to generate possible respondcr sequences. The goal of this because multiprocessing benchmark traces were tirst test code was to ensure that the protocol was not available. The traffic patterns had to be sound. By protocol soundness, we mean that deduced from single-stream benchmark traces commanders ;und responders can coexist on the and extrapolated for VAX 6200 symmetric multi- XMI and can generate traffic sequences without processing. loss of data. 'Ihc results of this verification gave We ran several benchmarks. We then used the the team sufhcient confidence in the protocol to results to make decisions about the appropriate ;illow the tlesign of the XMI interface compo- trade-offs in the area of the processor cache and nents to procecd. write buffer algorithms. 'These trade-offs dealt The second picce of test code was verified on specifically with cache and write buffer depth the samc environment. Every commander gener- versus performance gaincd. ated the samc sequence of traffic on the XMI. The Other tools were created to decompose XMI goal of this tcst was to verify arbitration fairness tr;uI%c into histograms and to generate reports on and to guarantee that all XMI nodcs got their fair bus bandwidth for the different types of traftic. share of the XMI. The absence of phenomena Eventually, XMI memory design latency targets such as lockouts was also verified. were incorporated into the XMI behavioral mem- This architectural verification proved to be a ory model. These system performance simula- tremendously valuable exercise. First, feedback tions were used to establish such design criteria to the architecture team was accomplished as the memory controller input command queue quickly. Second, this architectural verification depth and the command queue processing for the VAX 6200 project established design algorithm. verification tools that can be used for all future XMI designs. Logic Verzjication In time, the behavioral model of thc arbiter was The next major task was logic verification. The replaced with a structural model derived from main objective of module verification was to the chip design database. We enhanced the accu- ensure that the implementation conformed to all racy of the bchavioral models of the XMI com- design goals documented in the system specifica- mander and memory by incorporating structural tion. (11 other words, the goal nras not to ver~fy models of the XMI interface components once what the design was, but what the design was their gate-level tlesigns were complete. These supposed to be. tools are now in use throughout the corporation Members of the CAE team were assigned to by rrumerous XMI design teams. each design object; each member would work in

Digital Technical Journal 1b. ' Alrgrrsl I088 The Role of Computer-aided Engineering in the Design of the VAX 6200 System

a team with the designers. The verification teams with problems. A number of problems were required complete and coherent specifications found and resolved as a result of this testing. for each design object. These specifications had The timing verification of the module gate to be sufticiently complete to support 130th arrays was performed in a similar fashion. Pat- design implementation and logic verification. terns of functions were extracted from logic Moreover, all functions had to be documented in verification and then applied to the standalone a specification. This documentation served two chip timing models. As each pattern was applied, purposes: (I) to ensure that the function the timing verifier would run a complete check received the proper attention during the verifica- of the gate array and generate a list of violations. tion phase, and (2) to give the responsible CPLE These violations would then be checked by the engineer the information needed to understand designer. If they were valid, logic changes would the functions without referring to logic schemat- be made. The reason that just the gate arrays were ics or meeting with the designer. verified, and not their complete modules, was With the functional specification as the founda- that each module contained some logic for which tion, team members generated a verification no structural model existed (for example, the working document for every design object. This CVAX' chip set on the KA62A module). The lack DVT specification, as mentioned earlier, guided of a complete module-level timing verification the verification work and constituted the primary model was rectified by requiring the module hardware-submittal exit criteria. design team to thoroughly analyze its module. The logic verification was grouped into three This approach was possible only because of the categories: highly bus structured nature of our technology. Basic functional verification. Basic functional System VerzJicntion tests exercised each function as a standalone Once every design object met its exit criteria and piece of the design. This testing isolated obvi- satisfied the specified testing, the next milestone ous bugs. was the start of system simulation. Our task was Interaction sensitivities. Interaction sensitivity to verify the actual design in a system environ- exercised the design as a whole, making sure ment. We constructed a model consisting of mul- that f~~nctionscould interact with each other tiple processors, memories, and 1/0 modules. and could occur in series without cumulative This model contained structural representations fault mechanisms. Testing of function interac- of the actual designs wherever possible. Where tion included any boundary conditions, margin there were multiples of a design object in the sys- testing, and back-pressure on different key tem simulation environment, one instantiated points in the design. copy of the model would be the detailed (and slow to simulate) structural model; the other Error handling. Error handling verification instantiations were the faster yet less accurate tested that portion of the design created speci- behavioral models. fically for error detection and recovery mecha- In addition to actual design objects in this sys- nisms. tem model, we included different types of trans- actor and traffic generators on both the XMI and Timing Verz9cntion the VAXBI buses. Timing verification was performed separately The stimulus for this environment had to be upon all key components in the VAX 6200 sys- specific enough to ensure that every type of traffic tem. All of this work was performed by applying pattern was generated during simulation. The functional patterns to timing models for each of stimulus attempted to stimulate every node and the module gate arrays and the XMI arbiter logic. function concurrently. In a system simulation in 'This work was done using AUTODLY, an internal which the simulation rate is so slow, as much as Digital tool. possible must be achieved in every single simu- The XMI components were tested first. Testing lated clock tick. consisted of applying all possible XMI bus cycles Key to making a system simulation successful against this logic while allowing the timing is to start the simulation only after the constituent verifier to analyze the logic for any timing paths pieces of the system have been very thoroughly

5 2 Digital TecbnfcalJournal No. 7 August I988 CVAX-based Systems

verified in isolation. Givcn the complexity of the retlect any changes in the design as a rcsult of system model and its slow running rate, finding hardwarc debug. The purpose of this on-line soft simple dcsign bugs at this stage is a wastc of representation of the design was twofold. First, schedule time. Instead, the model should idcn- the representation would aid in the isolation of tify the system interaction problems and assure any problems discovered in the lab. Sccond, the developers that the base logic verification was database could be used to investigate any suspi- thorough. cious problem areas that could not easily be trig- Several logic problems were found during our gered in the hardware. system simulation dealing with complex intcrac- tions. some after a few microseconds of simula- Review and Reporting Methods tion. If undetected. these problems would have Throughout the dcsign verification proccss. a seriously impcded progress toward our goal of rne;ins to cnsurc covcragc was established for providing functional first-pass hardware. each phase. At the projcct outset. DVT speci- tication covcragc of functions was assured by Vector Generation se~vrallevels of team rcview. As the simula- The start of system simulation rakcs place, blr tions progresscd, thc project leaders were given definition, near the end of thc logic vcrification the responsibility of ensuring bugs were consis- process. At about that time, we began to prepare tently reported and corrected. Vcctor extraction for submittal of the designs for fabrication. and grading of our gate arrays provided a strong Therefore, in parallel with system sinlulation. measure of the completeness of the vcrifica- test pattern generation was started. tion of these chips. Additionally, the internal Test vectors were nceded at this time, primarily controllers to the gate arrays were measured for to test chips coming off the fabrication line. complete state and product term coverage. Therefore we generated test vectors for the very Lastly, before being released for manufac- large channel-less arrays contained on each of our ture. the design was checked against our own modules. The basic criterion for approval was exit critcria to ensure that thc vcrification attainment of 99 percent internal node toggle was complete. coverage of the gate array logic. In addition to the This section presents details of these mcthods 99 percent internal node toggle critcria, we also and tools for ensuring all functions were tested included the much more stringent criteria of and verificd. 95 percent stuck-at coverage as measured by a fault grading mcchanisni. Thc mcthods uscd to Functional Coverage clctcrniinc covcrage are discussecl in the scction The VAX 6200 projcct team chosc the functional Problem Reporting and Resolution. verification approach to verify all VAX 6200 The vectors were cxtractcd from a strategic sub- tlesigns. One problcrn with this approach is that set of our function;~lDVT simulation and graded there is no method of measuring functional cover- on a hardware accelerator/fault evaluator. age. Since all vcrification is based upon the DVT We set a goal that the vector count should not spccification, functional coveragc will be a exceed the chip's gate count; that is, a chip with reflection of the completeness of this document. 25K gates should have no more than 25K vectors Therefore, the DVT specification becomes the to exercise its logic. The vectoring process. vehicle by which the functional coverage of the including extraction, grading, and complemcnt- verification is to be measured. This specification ing. took an avcragc of one month per gate array. must be made as comprehensive as possible. As is true of architectural verification, vector Therefore, the specification underwent many lev- generation is an area where work remains to bc els of rcview by a wide audicncc, including the done. If we had been able to include some testa- entire design team. bility features in these very dcnsc chips, we could have saved this rnonth of schedule time. Problem Reporting and Resolution Anothcr means used to cnsurc covcragc was the Follow- throzdgh problem-resolution and bug-reporting mccha- Even beyond the prototyping phase, the sinlula- nism. Every design vcrification team project tion database was maintained and ~~pdatedto leader was responsible for tracking bugs in

Digital Technical Jountal :Vo. 7 rl~ig~ist1988 The Role of Computer-aided Engineering in the Design of the VAX 6200 System

the designs and ensuring these bugs were cor- areas of logic were found that had not previously rected and the correction was verified. Communi- been tested. This additional logic yielded addi- cation for this tracking was through VAX NOTES tional bugs. conferences. The fault grading process also provided an For each design verification team, two confer- additional degree of confidence in the coverage ences were created. The first was for bug report- of the functional verification test cases. The ing and bug-fix resolution. Only verification team 95 percent fault coverage goal was achieved with members could write notes in the bug confer- patterns derived from a subset of those test cases. ence. Every entry indicated the date, model revi- It should be mentioned that the hardware fault sion levels, test case number, failure symptom, evaluator was used extensively during this phase and any assessment of the problem. Replies to of the project and proved to be an irreplaceable each entry were entered, either by the project tool. leader or the CAE team member responsible for the failing test, to indicate when the bug was State Machine Coverage verified as being fixed and the model revision lev- Tools were developed that would analyze traces els at the time of verification. If the problem generated from the internal gate-array controllers remained unresolved, the reply would indicate and sequencers. Traces were collected while the any action taken or patches made. functional tests were being simulated and The NOTES conference review ensured that verified. All traces were later analyzed, and cover- all bugs were given the proper attention and age was ensured for every state and product term. visibility. This mechanism was put in place and automated, The second conference was informational. so that after each regression, coverage could be Using this conference, engineers could learn rechecked. about key aspects of the design as the verification After every regression run of all test cases, the progressed. Fore example, they could obtain results were analyzed to ensure that no product information on undocumented features on which terms or states were missed as a result of test certain verification tests were based. modification or bug fix. Additional test cases were generated to find specific and hard-to-activate Fault Grading conditions. Another process, which was implemented to measure functional coverage of the component Exit Criteria patterns, was the fault-grading mechanism. In Before a design is sent to manufacturing, the this approach, all component patterns for the design must meet the exit criteria. These criteria large compacted arrays were generated at the are as follows: functional level. The simulation environment for pattern capture was the same one used for func- All the specified test cases have been gener- tional verification. The stimulus generated was ated and have run bug-free against the latest driven by high-level functions. The test patterns design. were captured at the chip's boundaries while the The system simulation has run bug-free for two chip was being exercised on the module. continuous weeks. Traditionally, component patterns are gen- erated by simulating the chip standalone and In other words, if bugs still exist in the design, driving hand-crafted stimulus through the chip the design is not yet ready for manufacture. simulation. Due to test overlap, the approach As judged by the nearly bug-free condition taken by the VAX 6200 team did not ensure the of the implemented hardware, these design- optimum number of patterns for the maximum completion criteria and coverage metrics were stuck-at coverage. However, the approach proved appropriate for the VAX 6200 development to be very beneficial. Ranging from 20K to 50K effort. patterns for each gate array, the patterns were The VAX 6200 project's tremendous success generated in the very short time of approximately has established the process for future systems one month. In reaching our goal of 95 percent verification and for engineering quality measure- fault coverage with these test patterns, additional ment.

Dlgifd TecbnfcalJournal No. 7 August 1988 CVAX-based Systems

Results Attained arose, we became strongly aware that future The development cycle for the VAX 6200 system generations of hardware will be much more was quite short, and therefore the need to pro- dependent on the type of verification used for duce functional first-pass hardware was very the XMI. Although timing verification was strong. The first XMI specification was released in performed on all gate arrays on the VAX 6200 December 1985. Eight months later, all of the system, this verification. In the future, we feel it XMl parts had been designed and simulated and is important to perform timing verification on the were being manufactured. Two months later, the design during early development. Thus we can parts were up and running. identify and solve the timing problems before During this time, specifications were released they become too entrenched in the design to be for the KA62A. MS62A, and DWMBA, and logic fixed easily. design was begun. Concurrently, test specifica- Since the wire delays for gate arrays can only be tion and test generation also began. In the late estimated until gate layout has taken place, all summer of 1986, all logic design was completed. verification must be repeated once the actual tim- and verification began. Two to three months aftcr ing numbers are returned. Additionally, floor design completion. verification was completed planning of the gate array can have a significant for each module. With a complete and verified effect on the performance and specific wire design, one month was used to generate all gate- delays. On the VAX 6200 project, the layout and array test vectors and then submit the gate arrays final wire delay calculations were performed by for manufacture. our gate array vendor and then sent back to us for In February 1987 - 14 months after the first reverification. These steps can take quite a long complete XMI specification - the DWMBA was time in the design cycle of a gate array. To reduce manufactured. powered on, and run with first- the wait for real wire delays, we plan to perform pass hardware. One month later. the KA62A was all floor planning and preliminary layout opera- powered on and running. Twv weeks later, with tions at the design site. Additionally, this will functional MS62&, the first VAX 6200 system was allow us much more input to the floor plan and powered on. Two weeks aftcr that, on April 1, the layout. first VAX 62 10 system booted VMS with all first- pass functional parts. Summary AJthoilgIi a few bugs were later to be found and The success of the VAX 6200 verification effort fixed, the goal of using simulation to generate can be attributed mainly to the decision to begin hardware that works at spced the first time was verification at the same time as the design and to attained. In fact, many of those original parts are continue verification and design as parallel being shipped with the VAX 6200 systems today. efforts. This decision was implemented by assem- bling verification teams at the same time design Opportunitiesfor Improvement teams were being built. Although our verification process proved to be Verification was performed during each stage quite successful, we plan to make a few changes of development - from initial concept to system in this process for future projects. integration. The architectural verification con- Architcctur;ll verification, in so far as that firmed the XMI architecture and arbitration means an effort to discover system-level inade- algorithms. Performance verification helped quacies or bottlenecks. is in its infancy. We con- define the processor and memory architectures sider this a wide open area where much can I>e and ensured that these architectures could take acconlplished. full advantage of the new XMI. The logic of all As module designs call for increases in speed, XMI modulcs, their gate arrays, and the XMI arbi- timing verification and signal integrity verifica- tration logic was verified against their speci- tion will make a much larger contribution to the fications. not against the designs themselves. total verification effort. Although the XMI inter- Lastly. the entire VAX 6200 system was simulated connect was verified to all circuit, signal, and in a multiprocessing environment, proving that timing specifications. signal integrity was not the different component modules could function emphasized to the same degree in the modules together as a system. Verification from system themselves. Although no significant problems architecture to gate arrays, modules, and then

Digital Technical Journal 5 5 No. 7 Atrgusl 1988 The Role of Computer-aided Engineering in the Design of the VAX 6200 System back to a complete system again, throughout the designs are to be released for manufacture. To life of the project was the only way to assure the make this work successfully, the necessary main verification goal - first-pass, functional resources must be allocated for the verification hardware. effort. Furthermore, project teams must develop During logic verification, attempts were made and follow through with complete verification to perform verification using the smallest detail, strategies. These strategies focus on verification while still keeping the scope of the logic under as a part of the total design process rather than as test large enough to allow system-level testing. a process that takes place after designs are com- By performing all testing at these much higher plete. The VAX 6200 project was proof that this levels, a greater number of functions and more philosophy can be made to work. global functions can be tested at one time. The only drawback to testing at this level is simula- References tion speed. The trade-off of speed for accuracy is 1. B. Allison, "An Overview of the VAX 6200 a good one, for without accuracy the costly alter- Family of Systems," Digital Technical native is to design and manufacture multiple Journal (August 1988, this issue): 10- 18 . passes of hardware. In conclusion, the most important outcome of 2. T. Fox, P. Gronowski, A. Jain, B. Leary, and our verification effort was a management philoso- D. Miner, "The CVAX 78034 Chip, a 32-bit phy that, in the end, verification is as important as Second-generation VAX Microprocessor," logic design. With this understanding, verifica- Digital Technical Journal (August 1988, tion criteria now determine when and whether this issue): 95- 108.

56 Digital Technical Journal No. 7 August 1988 Rodney N. Gamache Kathleen D. Morse

VMS Symmetric Multiprocessing

The symmetric multiprocessing features of VMS version 5.0 effectively utilize the greater computing power of Digital's multiple CPU systems. Key to the SMP design is an innovative mechanism, called a spinlock, that provides a high degree of parallelism for kernel-mode code. Werefor- merly VMS soJttvare used interrupt priority levels (ZPLs) to synchronize processes, VMS now uses spinlocks. Because each VMS resource can be protected by a spinlock, this design provides more synchronization levels than could ZPLs alone. Spinlock granularity directly affects system performance.

This paper describes the major feati~resof sym- share state information that provides load bal- metrical multiprocessing (SIMP) in the VAX/VMS ancing capabilities. operating system. These enhancements are included in VAX/VMS version 5.0. Although it is An interprocessor interrupt capability that impossible to present details of every aspect of enables one CPU to interrupt all other CPUs or the SMP dcsign in these few pages, this paper a single CPU provides an overview of the key mechanisms The set of interlocked instructions (DDSSI. devclopcd for VMS SMP. BBCCI, ADAWI. INSQxl, and REMQxl), which Technology Developments are part of the VAX architecture and thus present in every VAX system Over the last several years advances in computer technology, especially in VLSI, have yielded Cache coherency maintained by the hardware, greater co~npiitingpower in increasingly smaller without software assist;~ncc piickages. VI.SI CPU chips have made possible multi-CPU, single-board computers. These multi- One CPU, known as the primary CPU, that ple CPU systems arc having an increasing impact must have access to all 1/0, console sub- on the gcncral-purpose computing environment. system, and timekeeping hardware The net result is that recent technology trends With these hardware fcatiires. VMS can provide haw redirectetl the challenge of building multi- symmetric multiprocessing support for any VAX processing systems from the hardware engineers system. All code executing in user. supervisor, or to the systems softw;irc engineers. Systems soft- executive mode can execute on ;in). CPU without ware engineers must now design effective ways to restriction. Most (if not al I) kernel-mode code utilize systems with six, eight, or even more can execute on any CPLJ without restriction. The CPUs. only restricted code is that small amount of VAX Hardware Features Required by kernel-mode code that requires access to the time-of-day internal processor register or to the the VMS Operating System console terminal and the console block storage The VMS SMP design requires that certain fun- device. damental features be implemented in VAX multi- The SMP design has no requirement regarding processing hartlware. These features are as the system topology or intcrconncct joining the follows: multiple processors. It supports systems imp1.e- The ability to sharc common memory among ~nentedby means of a single bus architecture. all CPlJs in the system such as the VAXBI bus. as easily as systems that use a cross-bar connection. This shared memory allows all CPUs to execute Therefore, the VIMS SMP design is flexible a single copy of the operating system and to enough to support current VAX systems and VMS Symmetric Multiprocessing

been a nonscal;~blesolution; if more CPLJs were ;~ddeclto the system, system performance would not increase due to blocking for the single lock. A more ambitious yet costly design was to provide a high degree of parallelism for kernel- node code. With this kind of parallelism, many XMI processors are allowed to execute different portions of the executive at the same time. For example. a process adding a system-wide logical name should be able to execute on one CPU whilc another CPU handles a device interrupt for completion of a disk 1/O request, etc. This design would require creation of numerous locks and careful design of the interactions between the critical regions rhat use those locks. This KEY: design approach was the one finally chosen by XMI - SYSTEM-TO-MEMORY INTERCONNECT the VMS engineers, and is discussed in the follow- VAXBI - BACKPLANE INTERCONNECT (FOR 110) ing sections. Figtire I VAX 6200 System Block Dingrum Synchronization in VMS: Raising IPL, Mutexes, and Spinlocks future VAX systems that take advantage of advanc- '.[he original VMS version 1.0 design used two ing technologies and architectures. types of synchronization: (I) raising interrupt priority level (IPL) and (2) mutual exclusion New Multiprocessing Hardware semaphores (mutexes). The VAX architecture The design of recent VAX systems, such as the provides 3 1 IPLs; 1 through 15 are dedicated VAX 8800 and the VAX 6200 series of computers, for use by software, and 16 through 3 1 are offers an elegantly simple, symmetric hardware reserved for hardware. (IPL 0 is not really an configuration. Central to the design of these sys- IPL but rather thc level at which user, supervisor, tems are two new bus architectures - the XMI and cxecutivc mode programs execute.) VMS bus and the VAXBl bus (Figure 1). The VAXBI blocked diffcrcnt types of system events by architecture provides a protocol that allows (I) raising IPL to or above the level at which multiple processors to issue device requests, and that event occurred. For example, process (2) operating system software to spccify which rescheduling was done by means of an IPL 3 processors a device controller will interrupt. software intcrrupt. Code threads rhat modified a The symmetry of these 1/0 subsystems pre- process's context always executed at IPL 3 (or sented a new challenge to the VMS SMP design- higher) to prevent a reschedule. Another exam- ers: to provide an 1/0 database design that woultl ple is the manipulation of device controller regis- make possible simultaneous execution of inter- ters. These registers were always manipulated at rupt handlers, thus taking advantage of these new the device's hardware interrupt level; thus other hardware features. system activity of a lesser importance was blocked out while the time-critical code path me Development of VMS SMP was executed. Critical to SMP was a new method, used through- The second synchronization met hod, mutexes, out the VMS kernel, to synchronize multiple pro- was used to lock purely software constructs, such cessors. One possible SMP design would have as global section descriptors. Mutexes provided a been to create a single lock for kernel-mode meclianism for defining many locks without operations and allow any processor to acquire assigning a unique software IPL to each lock. A that lock. However, the VMS engineers believed mutex was acquired by the operating system on that such a design would not havc provided behalf of a process and was considered "owned" sufficient parallelism to achieve good system by that process. Rescheduling could occur whilc throughput for systems with more than a few pro- a process "owned" a mutex; however, process cessors. This single-lock method wou Id have dcletion could not occur. Lock requests made by

Digital Technical Journal No. 7 Augusf 1988 a process of higher priority for an already owned with an interlocked test-and-set memory opera- niutcx were handled by placing the requesting tion, the BBSSI (Branch on Bit Set and Set Inter- process into a wait state, thus avoiding dead- locked) instruction. The spinlock interlock bit is locks. contained in a separate byte within the spinlock In ;I multiprocessing system, each VAX CPU structure. Unlocking a spinlock is done with the has its own interrupt priority level, independent inverse BHCCl (Branch on Bit Clear and Clear of the others. Thus raising IPL would synchronize Interlocked) instruction. These interlocked oper- on a single CPU but not across the entire system. ations arc atomic memory transactions across all IPLs. then, could not be used to synchronize all processors in a VAX multiprocessor configura- CPUs. Neither were mutexes a viable solution. tion. Furthermore, since memory is common to sincc they could only be used within proccss all processors, the interlocked memory test-and- context and at low IPLs. Therefore. the SMP team set operations providc a sufficient mcthod of created a ncw VMS mechanism that they termed a extending synchronization to all processors "spinlock." Anywhere VMS code had previously within a multiprocessor system. synchronized by raising IPL, the codc would now The use of multiple IPLs as a synchronization acql~irea spinlock; wherever VIMS codc had low- method in VMS provides the capability to sched- ered IPL. it would now release a spinlock. Use of tile events in a prioritized fashion. The inclusion mutexes remained unchanged save that the code of IPLs in the spinlock structure allows the SMP to acquire and release mutexes was protected by synchronization mechanism to appear as an a spinlock. added dimension to 1PLs. Moreover. this SMP The dcsign for spinlocks incli~cleda number of mechanism preserves the ability to schedule critical concepts. First. a spinlock is "owned" bjr events in ;I prioritized manner. a CPIJ, not by a process (as niutexes are). Second. For uniprocessor systems, the SMP dcsign also each spinlock is acquired and released at a par- includes the ability to optimize the routines that ticular IPL that is associated with the spinlock. acquire and release spinlocks. For example, on a Raising IPL when a spinlock is acquired prevents single CPU system. the spinlock acquire-and- other activities from interrupting time-critical release routines are never called. Instead, only a code. Third. CPUs "spin-wait" when blocked mo.cle-to-processorregister (MTPR) instruction is from obtaining a spinlock resource held by executed. thus raising IPL. System performance another CPU, since spinlocks are only assigned to of a single CPU has been measured as only a tiny time-critic;il resources that cannot be locked for percentage less than VMS version 4 performance. long periods of time. Lastly. the dcsign of spin- Mutex synchronization is still the xcond syn- locks includes a mec hanis~nfor deadlock preven- chronization method used in VMS. In the SMP tion or detection since the debugging of "hung" design. mutexes are used for locks that arc held systems is too costly. Therefore, each spinlock for long periods of time and for situations in is a55igned a rank. I3ccause spinlocks must bc which the IPL has to be lowered. Mutexcs are still acql~ircdin order of rank, deadlocks are thus pre- owned by processes. not by CPUs, under the SMP vented. Further. a debugging aid was built into design. the spinlock design. A part of each spinlock data structure is set aside to hold the last eight pro- Spinlock Granularity, Devicelocks gram counters ( PCs) that acquired or releasecl One aspect of the SMP dcsign that directly affects each spinlock. When cn;tbled, these consistency system performance is the granularity of the spin- checks provcd invaluable in determining interac- locks. A coarse granularity (fewer spinlocks) is tions between different components in the VMS easy to implement and debug; however. ;I coarse executive. such as memory management ant1 granularity provides fewer synchronization scheduling. points, and thus processors are blocked for Thc VMS engineer5 implemented routines for longcr periods. A finer granularity (more spin- acquiring and releasing spinlocks rather than locks) provides more parallelism and thus scatter in-line code through the VJMS kernel. The shorter blocking times; however, a finc granular- first step in acquiring ;I spinlock is to synchronize ity is much more complicated to dcsign and the local processor by raising to the IPL of the implement, and requires more synchronization spinlock, just as if it were a uniprocessor system. points. An important concept to rcmcmber is 'Thc actual locking of a spinlock is accomplishetl that, while the system is in a noncontending

Digiful Technical Jounzul No A1lfius1 1988 VMS Symmetric Multiprocessing situation, a synchronization point only adds un- of code. The spinlock design, therefore, has the necessary overhead. That is, if there is never any advantage of providing more synchronization lev- possibility of processors contending for the same els than could be provided by lPLs alone. Hence, resource, then synchronization is not required. the granularity of spinlocks can be much finer Therefore, the SMP team decided that a manage- than that allowed by software IPLs alone. This able n~~mberof spinlocks for the initial design finer granularity in turn provides more concur- was no more than 32. The SMP design provides rency of execution in the VMS kernel. dcsigners the ability to create a finer granularity For example, IPL SYNCH had protected a large of locks in future releases of VMS as performance number of resources and thus would be a good measurements identify time-critical resources. candidate for a finer granularity of spinlocks. As the SMP development evolved, it became Where VMS code had previously raised IPL to clear that a finer granularity of spinlocks for SYNCH, the SMP team had to determine which the I/O subsystem would be easy to implement. spinlocks had to be acquired and then perform With multiple VAXBI buses, multiple CPUs could the conversion. handle different device interrupts simulta- In summary, IPL SYNCH became the following neously. This further improved the parallelism of spinlocks: the system and resulted in a new characteristic FILSYS File system structures (such as for spinlocks: dynamic versus static spinlocks. A file control blocks) static spinlock protects those resources common to all VAX/VMS systems. Therefore, static spin- IOLOCK8 Fork. IPL 8 (map registers, data locks are assembled into the VMS source code. paths and System Communication Dynamic spinlocks synchronize device-specific Services resources) code and so are created at boot time, depending TIMER Timer queue upon the 1/0 configuration of the particular VAX system. Thus the number of dynamic spin- MMG Memory management, page de- locks varies from system to system, whereas the scription database, swapper, and number of static spinlocks is consistent across modified page writer all systems. The dynamic spinlocks used to lock particular devices were named "devicelocks" to JIB Portions of the job information block differentiate them from static spinlocks. A devicelock is used wherever device-specific code SCHED Process control blocks, schedul- previously raised IPL to a device's IPL to block ing database, acquisition/release interrupts. of mutexes Identifying Resources Requiring Per- CPU Context Areas and Spinlocks Interrupt Stacks One of the first SMP development tasks was to Another development task was to identify the identify each VMS resource that needed syn- context that had to be maintained for each pro- chronization and then determine the proper lock- cessor - independent of the general system ing mechanism - spinlock, mutex, interlocked structures. This "per-CPU" context area had to queue, etc. Once this work was complete, the include such items as identification of the current added dimension provided by spinlocks allowed process, a unique CPU identification field, and multiple resources to be protected by a single CPU-specific work queues. In addition, design IPL. For example, IPL 8 (SYNCH) had protected requirements specified that a processor be able the following resources: memory management, to locate its private CPU context area with mini- scheduling, the 1/0 database, the file system, and mal overhead. the timer queue. By adding a new dimension, The easiest solution would have been to namely spinlocks, each of these resources could include an internal processor register (IPR) into be protected by a different spinlock but share the which software could load the virtual address of same IPL. Therefore, in a multiprocessor configu- the context area. Since IPRs are part of the pro- ration, it was now possible to run more than one cessor hardware, each CPU could have pointed to processor at the same IPL. However, the proces- its own context area without confusion. How- sors must be executing different critical regions ever, such a processor register did not exist in the

Digital TecbnicalJournal No. 7 August 1988 CVAX-based Systems

VAX architecture. Therefore another solution was PTE. in case the PTE is cached in the translation needed in order for SMP to execute on existing buffer. This notification is called a translation VAX systems. buffer invalidation request and is accomplished A creative alternative to inventing a new IPR by a write to an IPR. Since PTEs can be cached on was to find a method to use an existing IPR any processor in a multiprocessor system, one for ~ni~ltiplepurposes. The VAX architecture possible implementation would bc for all CPUs includes an interrupt stack pointer (ISP) which to perform a translation buffer invalidation softwarc loads with thc virtual address of the request when any PTE is changed. Since transla- interrupt stack. Since each processor must have tion buffer invalidation must be carefully coordi- its own stack for handling interrupts, this area nated among all CPUs, however, this simple was already CPU-specific. Under the SMP design, approach would have significantly affected sys- the interrupt stack area and the CPU context area tem performance if left unmodified. are treated as one virtually contiguous context Two other features of VAX/VMS memory man- block. When the virtual address of this new con- agement play significant roles in the design for text area is rounded to an appropriate power of translation buffer invalidation in the SMP envi- two, a simple clcaring of the low order bits of the ronment. First, a user-process address space can- virt11;ll address of the ISP yields the base address not be executing on multiple processors simulta- of thc privarc CPU context area. ncously. Second, the cached user-process PTEs This solution provided two similar ways to find are invalidated when a LDPCTX (load process the privi~teCPII context area: context) instruction is executed as part of pro- cess rescheduling. MFPR *PR S-ISP.Rx Using these features, engineers optimized thc RlCL *mask.Rx design to require system-wide translation buffer invalidation only for system address space and not for user address space. Since system addresses DICL3 "niask.SP.Rx (when running on the intcrrupt stack) change less frequently than user space addresses, this new design allowed for a major reduction in Both nicthods return ttic virtual address of the the interprocessor communication traffic. private CPlJ context area. However, the latter case provides the faster mechanism. Process Afln ity Certain operations in a multiprocessor system Translation RI~ffer Invalidation - must execute on particular CPUs. The VMS SMP A Form of Cache Coherency designers termed the binding of a process to a As was already mentioned, the VMS SMP design particular CPU as "process alfinity ." Aftinity for a required that cache coherency be maintained in process is implemented by means of a 32-bit the hardware. However, the VAX architecture mask (one bit per CPU) in the process control includes one hardware cache that is maintained block (PCB). Once a process is assigned affinity, by software, the translation buffer. The transla- the process may only execute on CPUs for which tion buffer caches page table entries (PTEs) to it has affinity. Process affinity is enforced by the speed up address translation froni virtual to phys- VMS scheduler during a reschedule event. (Note ical nicniory ;tddresses. that only for real-time priority processes does Software monitoring of the translation buffer is VMS SMP guarantee to run the N-highest priority appropriate for two reasons. Since page table processes on an N-processor system.) pages are only "virtually contiguous" and not The VMS SMP design currently implements two "physically contiguous" portions of VAX main levels of process affinity: hard affinity and capa- memory, monitoring changes to the PTEs would bilities. Hard affinity forces selection of a single be difficult for hardware. Also, since modification CPU in the affinity mask. This level of affinity is of page table contents is usually an infrequent used when a process must be guaranteed execu- event, this cache is more suitably maintained by tion on a particular CPU, which is specified the software. by the CPU identification field in the PCB. Therefore, as part of its monitoring function, Specifically, hard affinity is used to iniplelnent thc operating system software must notify the CPU diagnostics and to halt a CPU. Whcn hard processor whenever it changes the contents of a affinity is being enforced, the process i~ninity

Digital Technical Jounral No. A~r~~rsl1988 mask is reduced to a single bit, which represents VMS software that access the hardware itself the one CPli on which the process may execute. (such as dcvicc driver routines that alter control 71'licselection of hard affinity is a ven static opcr;l- ant1 status registers) must execute on one of thc tion The selection of which CPIJ to run on is <:Pl[s in the device affinity set for that devicc. For determined prior to scheduling the process, and example. most of the initial processing of a SQIO thc selection remains enforced until othcnvisc rcquest can execute on any CPU. The driver code recluested. actually starts the 1/0 transfer by controlling the Capabi lities providc a logical mapping of pro- tlcvicc by means of the control and status regis- cesses to services. These services may only be ters. Only this portion of the driver code must available on certain CPUs in thc SIMP cnviron- execute on a member of that device's affinity set. mcnt; for example, primariness is a logical capa- Ilndcr the SMP design, all forking and postpro- bility. A capability may be scnriccd by one or cessing occur on the same CPU that received morc CPIls in the SMP environment. For cxani- the device interrupt. The device affinity implc- ple, prin~;~rinessis a capability that is only mcnration uses a "trickle down" method that offereti by ;I[ most one CPU in the SIMP cnviron- rccli~iresno aftinity checks for any of thc queues. menr. Insteacl. fork threads are queued to the appropri- When a process requires capabilities, the pro- arc CPIJ in the first place. The SMP implcmenta- cess intlicates the desired capabil.ities jn a 32-bit tion c~i~c~rsthefork threads by replicating the mahk in the PCB. When the process is scliecluled. I/O l,ostproccssing queue and the fork qi~eucs a comparihon is macle of the current rcqucstctl h)r cach CPU in the per-CPU context area. Thus cap;~hilitiesand the capabilities offered by the c;ccli CPLI can process its own fork and 1/0 post- CPII being rescheduled. If the CPU has thc processing queue without acquiring the various requirctl capabilities, then the process is cxe- sj~inlocksthat would be required for system-wide cured; otherwise, the process is ignored and qi~c~~csFi~nlier, under this design, the set of another process is chosen for execution, An), CPlls to which a particular device is bound under active CPlJ offcring a particular capability m:iy tlcvicc affinity is a proper subset of the CPUs that service any proccss rcquiring that capability can service interrupts for that device. Once the capability is no longer required by a 'I'lic ;~ffinityfield for a device is storcd as a bit proccss, thc capability bit in the PCB is cleared mask in the unit control block (in the field ;~nclthe process can execute on any CPU in the I1CBSI.-AFFINITY). This bit mask represents multiprocessing system. Thus, capabilitics offcr a those CPUs that are allowed to access the much morc dynamic load-leveling of processes specified device. The default value for across the CPUs in the system than does hard CJCI3SL-AFFINITY is - 1, allowing access from affinity. any CPU to the device. As already mentioned, the console subsystem devices are accessible Device Aflnity only from the primary CPU; thereforc, the The VMS SMP design requires that the priman. IJCBSL-AFFINITY mask for these devices is ini- CPC! have ;lccess to all 1/0 devices on the system tialized to the primary CPU only. Due lo hardware asymmetry for certain devices in The affinity field for a device is checked on some existing multiprocessing systems, the VMS entry to only two of the seven driver entry points: SMP design also had to include provisions for dcvicc affinity. Ibr example, usually both devices in the console s~~bsyste~n- the console terminal and the consolc block storage device - can only bc accessed by the primary CPU. This is cspc- If the affinity check fails, the 1/0 request packet cially evident on 8.300 systems, where a physical (IRP) is queued as a fork block to another CPU backplane cablc connection from one of thc from which access is allowed. The fork block in VhYDl slots (us~~allyslot 2, which contains the the CDRP portion of the IRP is used to fork the primary CPU) limits access to thc console sub request to ;inother CPU. The fork block is queued system to the primary CPU. to ;I work request queue in the selected CPU's Device afhnity models the hardware asymmcrr?r per-CPU context area. An interprocessor inter- by allowing only a subset of the processors to rupt is then delivered to notify the CPIJ that work acccss these I/O devices. Onl.)~thc portions of is now present in its work request queue.

Digital Tecbnical Journal No. 7 Atrgt~sl I988 CVAX-based Systems

All other entry points into device drivers are tion. Therefore, each section of code protected serviccd by the primary CPU, which must be by a spinlock can be executed by only one pro- guaranteed access to all devices. These entry cessor at a time. If two processors attempt to points are normally called only during device ini- access the same section of code (termed a critical tialization and include the following entry region), then only one processor will proceed points: while the other(s) spin-waits. To restate Amdahl's Law: You cannot get more than one CPU's worth of work out of any synchronization UNITINIT point. The ability to increase the number of spinlocks CONTROLLER INIT should prove invaluable in future enhancements to SMP, as performance measurements indicate which spinlocks need to change their granularity. UNIT DELIVERY Process affinity is used to provide the device affinity requirements for the $CANCEL system service. When the SCANCEL request is serviced, the UCBSL-AFFINITI' field may not allow access from the CPU on which the request was initiated. If access is not allowed, then the process affinity is changcd to force the process to execute on a CPLJ compatible with the affinity requirements of the de\.icc. Somc VMS routines are always called when 1/0 completes on the same processor that ser- viced the device and fork level interrupt dis- patching. Therefore, device affinity is implicit for thcsc routines, and no aftinity checks are made prior to calling the routines REGISTER DUMP and MOIJNT VER1FICA'I"ION. Future Investigations 'I'he initial VMS SMP design is finished, but many intcrcsting areas invite further investigation 'These include Pcrforniance improvements, perhaps finer granularitjfspinlocks Enhancements for parallel processing Pro\.isions for higher availability The key to the VMS SMP design is the new syn- chronization primitives, that is, spinlocks. The flexibility of the spinlock design will be irnpor- tant in future enhancements to SMP, as already proven in the evolution from static to dyna~nic spinlocks. Gran~11;lrityis another important attribute of spinlocks. which are synchronization points. All s!nchronizat ion points must be factored into the design of any multiprocessor system. Each spin- lock represents at most a single thread of execu-

Digital Technical Journal 63 NO. ' All~1491 I088 Bhagyam Moses Karen DeGregory I

Performance Evaluation of the - VAX 6200 Systems

Performance evaluation is an essential element in the deuelopment of a computer system. An eJort was made to accurately evaluate the perfor- mance of the VAX 6200 system under workloads that represent real customer environments. Workloads were developed to represent three major target markets - Engineering/ScientiJic, Commercial, and Cen- era1 Timesharing. Tbese workloads were wed to drive the VAX 6200 sys- tem. and thus to evaluate system performance in these environments. Performance measurement results indicate that the VAX 6200 system is a well-balanced multiprocessor system and that the multiprocessor perfor- mance isfairly linear across these u~orkloads.

Introduction tion continually debated is how well the bench- The VAX 6240 system is a tightly coupled multi- marks and workloads represent current user envi- processor system basecl on the CVAX niicro- ronments. Since there are many differcnt kinds of processor. Thc system consists of four proccssors computing environments and both the applica- sharing memory through a single, high-speed tions and computing styles are continually chang- bus. This paper describes the process by which ing. it is very difficult to develop representative performance of the VAX 6240 system was evalu- worklo;~dsaccurately. The approach taken here ated under various workloads that represent was to first survey the current customer popula- target markets. Thc method used to develop and tion and identify a few major target markets. verify thcse -workloads is discussed along with Table 1 consists of three surveys obtained from the evaluation of system performance. Wc use different sources, with n being the sample size. the multiproccssor efficiency measure. defined as thc rclative throughput obtained by the addi- tion of each processor, to characterize multi- Table 1 Survey of Customer Environments processor performance. Measurement of the Environment Survey 1 Survey 2 Survey 3 \'AX 6240 systcni indicates that the multiproces- n=llO n=200 n=55K sor ethciency measure is directly dependent on the contention for shared resources generated by Engineering/ 46% Scientific a workload. Commercial 4 0'10 Workloud Development Education 8 O/O One of the major issues in evaluating the perfor- Software 6% Development mancc of a computer system has been in the Miscellaneous -- workload area. In the context of this paper. workloads are software tools used to create inter- active multiuser environments in which the interactive throughput and responsiveness of the Table 2 Distribution of Customer system are the kc17 performance metrics. Con- Environments versely, benchmarks are either single or multiple Engineeringlscientific 4 0'10 copies of programs run in batch mode; the Commercial 40% amount of time to complete execution of thcse General Timesharing 20% programs is the performance metric. The clues-

Digital Technical Journal No. 7 August 1988 CVAX-based Systems

PROCESSOR 1 PROCESSOR 2 PROCESSOR 3 PROCESSOR 4

1 STREAM 1 1 1 STREAM 2 1 1 STREAM 1 1 1 STREAM 4 1

STREAMS RUNNING SIMULTANEOUSLY ON THE PROCESSORS t

THROUGHPUT = NUMBER OF JOBS COMPLETED WITH MULTIPLE PROCESSORS AS COMPARED TO ONE

Figure I Executiorr of klultiple Programs Run in Parallel

Clearly, Engineering/Scientific and Commer- oil reservoir simulation, flight sinlulation, and cial environments dominate the market, with linear equation solvers. Education. Software Dcvclopment, and General Thc scientific stream contains simulation Tirncsharing applications accounting for thc rest. programs that use Monte Carlo techniques Further examination of the Software Dcvclop- ment and the Education environments showcd to track particle movement, along with commonly uscd routines froni national labora- much similarity in function, cxcept that Software tories. Development is slightly more compute intensive. Thus we further simplified the application cate- 'The commercial stream contains the activities gories. as shown in Tablc 2. done by a personnel department to support We identified typical environments in each of salary planning. these categories by evaluating system rcsource consumption in these eovjronments rather than 'The general timesharing stream represents thc by evaluating what an end user does on thc sys- activities donc in a software development or tcn~.Thus we could simplify the number of cducation environment. parameters to CPU, memory, and 1/0 resource Multiple copies of this stream wcrc run simulta- utiliz;~tions.Having identified these typical envi- neously to take advantage of multiprocessor com- ronments. wc collected or developed bench- pute resources (Figure 1). To capture the maxi- marks ant1 workloads to represent them. mum throughput, we ensured that all of the processors were 100 percent busy while the mul- Single Stream tiple streams were running on the system. Acquiring single stream benchmarks was not as difficult ;IS tlcveloping multiiiser workloads. Most Multiuser Workload Development of Digital's customers have benchmarks that Thc overall process of workload development is represent thcir environments. Therefore, we shown in Figure 2. Our goal was to represent typ- acquired a collection of benchmarks to represent ical timesharing environments for the different Engineering/Scientific. Commercial, ancl General target markets. 'The entire strategy consisted of Timesharing from various customer sites. 'These benchmarks are used to evaluate thc single- Identifying typical real sites processor sl,cc

Digital Technical Journul No. 7 Afr~~rsl1988 Performance Evaluation of the VAX 6200 .xj,stems

SYSTEM

RESOURCE UTILIZATIONs -TERMINAL ACTIVITY DATA -USER CHARACTERISTICS -USER MIX I

-VAXRTE SCRIPTS DATARESOURCE '\UTILIZATION 1 ~AcTERISTICs-USER MIX

STANDALONE SYSTEM

Figure 2 Interactive Multiuser Workload Development

In the following sections, we describe how we wide average over a week's accumulation of data. used this strategy to develop two multiuser Based on this comparison, we selected a typical workloads: the engineering workload, which rep- day and a typical system. One hour was chosen resents an Electronic Computer-Aided Engineer- from the typical system on a typical day during ing environment (ECAE); and the Software Devel- the period of peak interactive use to characterize opment Environment Workload (SDEW). the system at full load. Further, based on the user profiles, we Data Collection classified users according to computer usage, Tbo Digital sites were chosen to represent the that is, heavy or light computing (for ECAE ECAE and SDEW environments. Intcrnal sites workload) and heavy, medium, or light comput- were chosen initially to facilitate the data collec- ing (for SDEW workload). We then used the tion process. Both sites had clustered environ- image accounting data and user log files to clas- ments that consisted of a variety of VAX systems sify users according to the type of activity they along with some workstations. performed. We collected information on these clustered Once several user classes were identified, the systems to capture their behavior under the load number of users in each class, or user mix, was generated by the environment over a period of determined. We defined the user mix by looking one week. VAX SPM software was used to collect at (1) the number of users in each class at the resource utilization data (CPU, 1/0, and memory utilization) on all the systems at both user level and system level. VMS Image Accounting was Table 3 ECAE and SDEW User Mix used to obtain resource utilization data on an image basis. Using the SET HOST/LOG Digital ECAE User Mix Command Language (DCL) command, we col- Type of User No. of Users lected log files of user sessions to study user habits. Other user characteristics, such as think Engineer: Heavy 3 time and type rates, were obtained through inter- Engineer: Light 3 views and observations. SDEW User Mix

Data Analysis Type of User No. of Users The performance team studied the cluster-wide Heavy software development 1 resource utilization profilcs in order to select the Light software development 3 time when the intcractive activities were pre- Secretary 1 dominant. We compared resource utilization Technical writer 1 profiles of individual systems against the cluster-

Digital Tecbntcal Journal N,, 7 A.ro.,rr lORR CVAX-based Systems

one-hour peak, and (2) the organization struc- Table 4 User Resource Utilization for Real ture at the real sites. Table 3 shows the user mix Internal System and ECAE Workload for ECAE and SDEW workloads. In addition to CPU interactive users, these workloads also have batch minutes/hour DlO/second jobs running in the background. Userclass Real ECAE Real ECAE Developing the Workload Heavy 1.6 1.5 0.3 0.4 Having identified the user classes and activities, Light 0.5 0.5 0.2 0.1 we then developed an intermediate workload Batch 42.8 48.5 0.0 0.1 using DCL command procedures. This inter- mediate step allowed easier translation to the within 0.1 DIO per second of the values mea- final workload, which was based on VAXRTE sured from the real site. (VAX/VMS Remote Terminal Emulator) scripts. System-level validation - For system-level val- Individual user scripts were developed and vali- idation, we compared the system-level usage of dated. We then packaged the entire workload by CPU, disk I/O, and memory for the 1-hour ECAE integrating all of the user scripts and the batch test experiment to the peak hour of the real jobs. Once development was complete, the system. Figure 3 shows that the CPU was used workload was validated at both system and user 100 percent of the time on the real system during levels against the real internal site. Further vali- the 1 hour; whereas the CPU utilization in the dation was done at the user level against Digital's workload tended to vary slightly more, but was customer sites. always between 70 percent and 100 percent sat- urated. The average CPU utilizations of the real Workload Validution system and the EMworkload are very close at This section describes the workload validation 100 percent and 93 percent, respectively. process using the ECAE workload as an example The DIO utilization over a 1-hour period for of the validation methodology. the two systems is compared in Figure 4. For both Validation against "real" internal site - The systems there is significant variability in the DIO workload was tested using the same hardware rate over the 1 hour period. The ECAE workload configuration as the real system. For the ECAE was slightly more bursty, but the average D10 workload, a VAX-1 1/780 system with 32 mega- rates for the real system and the ECAE workload bytes (MB) of memory, RA81 disks, and six inter- were very close at 3.3 and 3.0 Dl0 operations per active users was tested. The purpose of this test second, respcctively. was to compare the resource utilization of the Memory utilization on the two systems did not workload in an hour-long experiment to the vary substantially over the 1-hour period. How- resource i~tilizationof the real system during the ever, total average memory usage with the typical hour. System- and process-level resource workload. 23MB, was less than on the real sys- utilization data of several different resources tem, 29MB, as depicted in Figure 5. were compared. Although the workload validated very well for User-level validation -To validate the work- CPU and DIO resource utilization, the workload load at the user level, we compared the average used 20 percent less memory than was used at CPU and direct 1/0 (DIO) utilizations computed the real site. This was in part due to the fact that for 1 hour for the different user classes. The during the development of the workload the CPU results are shown in Table 4. and disk 1/0 utilization of subprocesses was CPU utilization for all three user classes vali- added to the resource utilization of the parent dated to within approximately 10 percent, process. Although the workload represents thc which was considered to be well within accept- work done by those subprocesses and the load able limits. Validation of the DIO rate was made placed on CPU and disk 1/0 resources, the somewhat difficult because (1) the DIO rate on a workload does not represent the additional mem- per-user basis was very low (0.3 DIO per second ory required by those subprocesses. As will be for the heavy user), and (2) measurement of the described in subsequent sections, the lower Dl0 rate is only accurate to 0.1 DIO per second. memory utilization of the workload did not con- For all three user classes, the workload came to stitute a problem.

Digital Tecbnical Journal No. 7 Augrrst 1988 - uuuu UU U U u u UU UUUUUUUUUUUU u U U UU U UUUUUUUuUUUU REAL SYSTEM uuuuuuu ECAE UUUuUuuUUUUU UUUUUUU uUUUUUuUuUUU CPU UTILIZATION (PERCENT) U U u u uu U CPU UTILIZATION (PERCENT) - uUuUUuUUUUUU VERSUS TlME OF DAY UUUUUUU VERSUS TlME OF DAY UuUUUUUuUuUU UUUUU uuuu U UUUUUuuuUuuu FROM: 6-NOV-1986 10:14:06.45 UuUUUUUUUUuu uUUUUUUUuUUU FROM: 2-FEB-1987 21:26:13.08 TO: 6-NOV-1986 11:14:39.91 UUUUUUUUUuUu TO: 2-FEB-1987 22:30:13.07 uuuuuUUuuuUU UUUUUUUUUUUu - UUuuuUUUuUUU uUUUUuuUuuUuU UUUUUUUUUuUU EACH COLUMN = 300 SECONDS uuUUUUUUUUuUU EACH COLUMN = 300 SECONDS uUUUUUUUUuUU (5 MINUTES) UuUuUUUUUUUUu (5 MINUTES) uUUuUUUuUuuu UUUuUuuuUUUuu UUuuuUuUuUuU UUUUUUuuUuUuu - UUuuuuUUuUUu CPU IDLE uUUUUUUUuuuuu CPU IDLE UUUUUUUUuUUu uUUUUUuUUUUUU UUUUUUUUuuUu TOTAL PAGE SWAP PAGE &SWAP UUUuUUUUUUUUU TOTAL PAGE SWAP PAGEBSWAP UuUUUuUUuUuU IDLE WAlT WAlT WAlT UUUUUUuUUUUUU IDLE WAlT WAlT WAlT UUUUuuuuUuUu 0.1% 0.0% 0.0% 0.0% uUUUUUUUuUUUU 6.7% 6.5% 0.0% 6.5% - uuUUUUUUUUuu UUUUUuUUUUUUu uuUUUUUuUuUU UUUuUuuUUUUuu uuUUUUUUUUUU UUUUUUuuUUUuu UUUuUUUUUUUU CPU BUSY UUUUUUUUUUuUu CPU BUSY UUUUUUUUUUuU uUUUUUUUUuuUU - INTER uuUuUUUUUUUUU INTER - uuuUuUUUUUUU KERNEL KERNEL STACK UUUuUUUUUUUUu STACK uuuUUuUUUUUU 11.2% 14.4% UuUUuuUUUUUU 6.4% UUUEUUUUUUUUU 2.8% UUUUUuUUUUUU UUUEUUUUUUUUU UUUEUUUUEUUUU UUUUUUUUUuuU EXECUTIVE SUPERVISOR EXECUTIVE UUUEUUUUEUUUU SUPERVISOR - uuUUUUUUUuuU 4.0% 1.9% 2.8% 0.1% uuuuUUUUUUUU - UUUKUUUUEUUUU UUUUUUEUUUUU UUUKUUUUEUUUU UUUKUUUUEUUUU UUUUUUEUUUUU USER COMPATIBILITY USER COMPATIBILITY UUUKUUUUKUUUU UUUUUSEUUUUU 76.6% 0.0% 73.2% 0.0% - UUUSUSEUUUUU UUUKUUUUKUUUU UUUSUEEUUUUU UUUKUUUUKUUUU EUUEUEKUSUUU SYSTEM TASK UEUKUUUUKUUUE SYSTEM TASK EUSEUKKUESUU 21.5% 78.4% UEUKUUUUKUUUE 20.0% 73.3% KUEEUKKUEEEU UKUKUUUUKUUUK - KUEKUKKUKEEU UKUKUUUUKUUUK KUKKEKKUKKKU UKUKUEUUKUEUK KSKKKKKSKKKU UKUKUKUEKUKUK KSKKKKKEKKKU KEY: UKEKUKEKKKKUK KEY: KEKKKKKKKKKU UKEKUKKKKKKEK I INTERRUPT KKKKKKKKI KKKK I - KKKKKKI KKKKU - - INTERRUPT KKK1 KKKKI KKKK I KKKKI I KKI KS E EXECUTIVE - KKK1 KKKKI KKKK E - EXECUTIVE IKIIIIIIIIIE U USER KI U USER lllllllllllK - I I KKKKI KKKK - IIIIIIIIIIII K - KERNEL KI I I KI I I I KI KI K - KERNEL 1 I 1 S - SUPERVISOR I I L S - SUPERVISOR 10:14 10:39 11:04 11:26 21:51 22:16

Figure 3 CPU Utilization for Real Internal System and ECAE Workload for I Hour CVAX-based Systems

- m 0z - mr. 0- 0 0 OO gp " z 22 W 0 23 u3 $> ZX "Z W m 4: 02 IT 0 49 a +u mm 11 ffl mm W 2; -- z + mm 2 IT4: '"2 ww 3 ?F- kz 6 ? - + . 0 5, 6 Lg zp 5

Digital Technical Journal 69 No 7 A~rrgust 1388 PerJ'orrnance Evnlualion of the VAX 6'200Systems

mr. 99 mm 7 r :;wo NN

Digital Technical Journal No. 7 August I988 CVAX-based Systems

A summary of the comparisons of the average Table 5 System-Level Resource Utilization resource utilizations for the real system and the for Real Internal System and workload is prese~tedin Table 5. ECAE Workload Validation against customer sites - This vali- Resource ECAE Real System dation of the workload against the internal system was followed by validation against customer CPU busy 9 3'10 10O0/o systems. The goal of this additional validation DlOIsecond 3.0 3.3 was to determine if the workload was representa- Memory 23MB 29MB tive of the load placed on systems by Digital's customers. Two semiconductor manufacturers in Califor- Table 6 Comparison of Resource Utilization nia were used as validation sites for the ECAE on Customer System and workload. Initially, it was determined that there in ECAE Workload were significant differences between the work Resource Utilization Customer ECAE performed at these customer sites and the work per Hour Sites Workload performed at the internal Digital site. The Digital internal VAX systems were used for logic design CPU (minutes/hour) 3.8-4.5 5.0 of gate arrays, circuit boards, and systems; Dl0 operationslsecond 1.4-2.3 1.8 whereas at the external sites, the VAX systems Memory (MB) 0.7-0.8 0.8 were used for the design of integrated circuits. Specifically, the work differed in the following ways: customer sites. This workload will put a 10 per- cent heavier load on the system, making the per- DECSIM is used extensively within Digital, formance numbers slightly conservative for the whereas SPICE is the predominant simulation computer-aided electrical engineering market. software used by external semiconductor developers. DECSIM simulations require very Performance Measurement and large amounts of memory as compared to the Analysis SPICE simulations done by customers. This section discusses the performance of the sys- Design rule checking is both a time-critical tems in three major applications: Engineering/ and disk I/O-intensive task done by seniicon- Scientific, Commercial, and General Time- ductor designers. Design rule checking and sharing. In each of the environments, single the load it places on the I/O subsystem were stream, multistream, batch, and multiuser work- not executed at the internal Digital site at the loads were tested. time resource utilization data was collected. Single-Stream Performance As a result, we modified the ECAE workload to The first step in evaluating the performance of a include the load placed on the system by design n~ultiprocessorsystem is to establish the base- rule checking and replaced the use of DECSIM level performance of the uniprocessor relative to with SPICE. a well-known system such as the VAX-11/780. A System resource utilization data was collected large number of single-user benchmarks were on VAX 8800 systems for one week at these cus- used to establish this base level. tomer sites. In a manner very similar to the pro- cess used for the initial development of the Single-User Performance workload, the data from these sites was reduced Single-user performance was evaluated by using to a typical peak period. Table 6 presents the traditional synthetic benchmarks, well-known comparison of resource utilization on a per-user industry standards, and real application programs basis in the workload and at customer sites. from engineering, scientific, commercial, and The ECAE workload falls within the range of general timesharing environments. Most of the utilizations observed at these customer sites for synthetic benchmarks are in FORTRAN; industry both disk and memory utilizations. The workload standards are Whetstones, Dhrystones, Linpack, is slightly (approximately I0 percent) more CPU and others. The real applications, as mentioned, intensive on a per-user basis than was observed at represent four environments.

Digital Techrrical Journal No. 7 A~rji~rst1988 Performance Evaluation of the VAX 6200 .~~~slcvns

VAX 6200 PERFORMANCE RELATIVE TO VAX-111780 (VAX-11/780 SYSTEM = 1)

Figure 6 Frecjrzrency Distrihrrtion o/'lhe VAX 6210 Performance on the Single- User /Ic~rtchrrr~~rkSet

These benchmarks were used to cv;~luatc segments can be separated into parallcl threads uniprocessor speed compared to a VAX-1 1/780 that can be run independently. Then the program system. A frequency distribution of the spccdup is decomposed and run, either niariually or factors on all these benchmarks was plotted, ant1 through directives. The program is initiated as a the ccntral tendency was examined. (Scc Fig- single job; then the segments of the program that ure 6.) A high percentage of the benchmarks fell lend themselves to decomposition arc divided between 2.2 and 2.8. into subprocesses and executed in parallel on Table 7 summarizes the performance of the different processors. In the manual decomposi- VAX 62 10 in the single-user environment relative tion method, the optimal number of subpro- to a VAX-1 1/780 system. The perform;incc aver- cesscs for various levels of multiprocessor sys- age of the VAX 62 10 system, across all thcsc tems is evaluated by varying the number of benchmarks. is 2.8 times the performance of a subprocesses and calculating the speedup fac- VAX- 1 1/780 system. tors. In the directive decomposition method, thc compiler rakes care of various optimization fac- Decomposed Single-user Performance tors. These programs were run standalone with VAX 6200 performance on dccomposctl pro- no interference from any other programs on the grams was ev;~luatedthrough thc use of manual systcm. Figure 7 illustrates the decomposition and directed decomposition techniques. To process. begin with, a program is evaluated to see if some The benchmark dcscription is as follows. To evaluate the maximum speedup factors that can be achieved through decomposition, code seg- ments were selected. Such segments as matrix Table 7 Performance of the VAX 6210 in the multipl~cationand convolution are widely used Single-User Environment in enginecring/scientific applications. Different array sizes (from 100 to 1000) were used with Synthetic Benchmark Set: various arithmetic data types such as integer, and Single-user set 2.5 single and double precision. Industry-standard Benchmarks: hi image processing program and the Lin- Whet-s & -d 2.3 pack 1 OOOD program were used to represent real Linpack-s 2.7 application programs, where only certain seg- Linpack-d 3.2 men ts can be decomposed. Dhrystone 2.8 l'hc performance results are as follows. The Real Application Benchmark Set: multiprocessor efficiency measure, defined as the Engineering set 2.8 relative speedup obtained by the addition of each Scientific set 2.6 procehsor, is the key metric used here to evaluate

Digital Tecbnical Journal No. 7 August 1988 CVAX-based Systems

ONE PROGRAM DECOMPOSED INTO PARALLEL CODE

I MAlN I

RUNNING ON: (OBJECT)

PROCESSOR 1 PROCESSOR 4

SUBPROCESS 1 SUBPROCESS 2 SUBPROCESS 3 SUBPROCESS 4

SPEED = TIME TAKEN TO COMPLETE THE JOB MAIN PROCESS (END)

I;iRzue 7 Prograrn I~ecompositlonProcess performance. As seen in Figure 8, the multipro- was recorded and used to evaluate the multipro- cessor efticicncy measure on the program kcrnels cessor batch throughput performance. It is is fairly linear. Multiprocessor synchronization is important that all the streams run simultaneously minimal in this computing environment. The and share resources equally Large differences in performancc was very close to the theoretical the completion times of streams would imply that maximum. A speedup of 3.9 times the uniproces- maximum concurrency was not achieved because sor performi~nce was achieved on the four- of some bottleneck in the system. processor 6240 system. The performance on the Multiprocessor performance on multistrcam image processing program is slightly lower than batch jobs was very close to linear across all envi- what was obscrved on the program kernels. Thus ronments. Results for the commercial stream, performance gained by decomposition depends representing personnel administration, werc only directly on the amount of code that can be run in parallel. (Note: On the Linpack I OOOD program, directed decomposition was used; whereas on the other programs, manual decomposition was used .) Multistream Batch Performance Measurement and Analysis The multistrcam jobs were used to measure the system-level batch performance on the multipro- - cessor systems. As shown in Figure 1 , these multi- W5:::: , , , ple streams were run in parallel to allow concur- 0 rency in the execution of these streams. VAX 6220 VAX 6230 VAX 6240 Maximum concurrency is achieved since each of these streams is identical. No single stream runs KEY: any faster; however, the numher of jobs com- 0 IMAGE PROCESSING pleted increases almost linearly with the addition A MATRIX MULTIPLICATION of processors. Adequate memory was allocated to 0 CONVOLUTION the jobs to avoid unnecessary paging and swap- X LINPACK ping. In addition, sufficient 1/0 resources werc present on the system to preclude 1/0 bottle- Figure 8 Mtrltiprocessor Eflciency through necks. The eI;~pscdtime to complete thesc jobs Parallel Processing

Digital Technical Journal 7 3 No. 7 August 1988 Performance Evaluation of the VAX 6200 Systems slightly lower - probably because of the higher major tasks executed in this workload are com- amount of J/O on this stream. (See Figure 9.) pi le-l ink-execute-debug cycle using FORTRAN, BLISS, and MACRO; utilities used include CMS, Interactive Multiuser Performance RUNOFF, and text editors. Measurement and Analysis Hardware/software setup - Table 8 summa- In the interactive multiuser environment, the rizes the hardware and software configurations. system must support the activities of a substan- tially higher number of users and their frequent interaction with the system. The number of users - on the system increases the amount of context switching, and the contention for shared - - resources is also much higher in this environ- ment. The test methodology included the use of a remote terminal emulator, VAXRTE, to create the interactive multiuser environment. The VAXRTE generated the input for the system under test and I received consequent output. The VAXRTE also VAX 6220 VAX 6230 VAX 6240 logged and time-stamped all interactions and maintained the job mix throughout the experi- KEY: ment. To run a multiuser experiment, the system 0 GENERAL TIMESHARE under test and the VAXRTE system were booted A COMMERCIAL and running. Using scripts, every few seconds the SCIENTIFIC VAXRTE logged a user on to the system under X ENGINEERING test. After all logins were completed, sufficient time was allowed for the system to reach a steady Figure 9 Multistream Eflciency Measures state. The experiment was then run long enough to execute the longest script cycle for the specific workload. While the experiment was Table 8 Summary of Hardware and Software running, VMS monitor and other monitoring tools Configurations were used to capture the resource utilization Hardware Configuration data. When the experiment was completed, data was reduced and analyzed. Processor VAX 6240 Workload description - Three interactive Memory 128MB multi~~serworkloads were used to evaluate the Disk controller 2 HSC70 multiprocessor performance in the three major Disks (Disk configurations differed for each workload; see below.) environments: Engineering, Commercial, and General Timesharing. Number of RA82 Disks per Workload The Engineering environment was represented -- - - Dedicated Compu- by an EWworkload. This workload consists of Use ECAE Share SDEW the types of tasks done by design engineers devel- oping electronic circuits: circuit simulation, System 1 1 1 design rule checking, schematic file transfers Pagelswap 1 1 1 from workstations, and tasks supported by VMS Library - - 1 utilities. Interactive 2 2 4 The multiuser Commercial (Compu-Share) Batch 3 - 2 workload is based on the Compu-Share Order Database - 6 - Processing software package. This workload con- Note: Where necessary, software was distributed sists of three major types of transactions: order over multiple disks to avoid disk bottlenecks. entry, order inquiry, and accounts receivable Software Configuration reporting. The General Timesharing SDEW represents the VMS V5.0 - FT2.1 (A single-processor system was run with multiprocessing turned off.) types of tasks done by software engineers. The

74 Digital TecbnfcalJournal No. 7 August 1388 CVAX-based Systems

Performancc metric - The multiprocessor effi- 4 ciency measure is defined to be the relative multi- W processor interactive throughput compared to 5 the uniprocessor throughput. V) Also considered in this metric is system respon- $ siveness, based on acceptable service criteria for &z * light, medium, and heavy tasks. This metric is w used to evaluate the number of users supported I at peak throughput while the system maintains the service time criteria. System resources o required to support each application are also o VAX 621o VAX 6220 VAX 6230 VAX 6240 identified. Performancc results - The multiprocessor KEY: efficiency measure is very close to linear in both 0 ECAE the ECAE and SDEW environments (Figure 10). COMPU-SHARE This result shows that even in the multiuser inter- A SDEW active environments near-linear performance can be expected if the system is well balanced in Figure I0 Multiprocessor EDciency Measure terms of processor speed and the memory-to-pro- for All Multiuser Workloads cessor bus speed. It also indicates the efficiency of the VMS SMP software. In the Compu-Share cnvjronment, the performance was slightly lower At the peak throughput levels, response time because of the high amount of disk and terminal criteria were maintained in each workload. 1/0 generated by this workload. The perfor- Table 9 compares users supported and resources nlancc of the multiprocessor systems under sym- used by each of these workloads. The maximum metric multiprocessing (SMP) depends directly number of users supported on the VAX 6240 are on the amount of I/O. It is important to note that 38, 120, and 126 users for ECAE, Compu-Share, evcn with high amounts of I/O, the multiproces- and SDEW, respectively. sor cfticiency measure is well over three for the In terms of resource utilization, it should be four-processor system. noted that the multiprocessor synchronization

Table 9 Summary of Workload Resource Utilizations Multiuser ECAE Compu-Share SDEW Number of users supported at the peak 10,20,28,38 30,60, -, 120 36,66,90,126 Resource utilization Number of users 38 120 126 CPU - 6240 Percent utilized 100% 100% 100%

Interrupt 2 O/O 6% 4%

Kernel 12% 29% 20 O/O Executive 3% 7% 7%

MP synch 1O/O 7% 2% User 82% 5 1'10 67% 110 Disk I/O profile Bursty Uniform Bursty Average disk I/O per second 24 113 68 Average buffered I/O per second 82 112 76 Memory Maximum used (MB) 32 60 57

Digital Technical Journal No. 7 Augrrsl I988 Performance Er~uluationof [he VAX 6200 Systems under SMP is handled by spinlocks. A spinlock is a bit in shared memory that is accessible by means of interlocked instructions by all pro- cesses through mutual agreement. Mutual agree- ment implies that a process can set the bit and gain access to the scheduler database if no other process has access to it. If a process tries to set the bit and the bit is already set, then the process continues to "spin" using a sequence of instruc- tions to continue checking to see if the bit is clear. MP synch is the amount of CPU time spent waiting to change the bit or acquire the spinlock and thus gain access to thc scheduler database. MP synch is 1 percent for ECAE. 7 percent for the Compu-Share workload, and 2 percent for the SDEW workload. Since MP synch is the CPU time KEY: spent waiting to acquire spinlocks and indicates IDLE the amount of spinlock collisions, it shows the level of contention for shared resources experi- USER enced by SMP under each workload. For the KERNEL Compu-Share workload, this level is significantly EXECUTIVE higher. The Compu-Share workload generates the 0MP SYNCH most clisk 1/0 compared to the other workloads, 0 INTERRUPT which may be the reason for a higher amount of time spent by this workload in MPsynch. Figure 12 CPU Utilization over Time - The following three graphs, Figures 1 1, 12, Compu-Share Workload 13, present the CPU modes usage profiles. The

0 10 20 30 40 50 60 TlME OF EXPERIMENT IN MINUTES TlME OF EXPERIMENT IN MINUTES KEY

KEY: CPU IDLE USER USER m KERNEL KERNEL (EXECUTIVE EXECUTIVE 0INTERRUPT 0INTERRUPT 0MP SYNCH 0MP SYNCH

Figure I I CPU Utilizulion over Time - Figure 13 CPU Utilization over Time - ECAE Workload SDE W Workload

7 6 Digital Technical Journal No. 7 Augtrsr I988 CVAX-based Systems

Compii-Share workload shows higher but more iinifor~ninterrupt and kernel mode activities. Compii-Sharc's use of databases, which generates licavy 1/0 ancl local locking, is manifested in the heavy kerncl and interrupt mode activity. SDEW does a fair amount of file manipulation. ECAE has much lower 1/0 activity than both Compu-Share and SDEW. The next thrcc graphs in Figures 14, 15, and 16 compare the 1/0 profiles. The disk 1/0 on ECAE and SDEW is very bursty, and it is interest- TlME OF EXPERIMENT IN MINUTES irlg to note that their relativc CPU node profilcs correlate wcll, showing a relationship between Figure I6 DiskI/OUtilizalionoverTime- the two. 'l'he 1/0 on Compu-Share is high but not SDE IV Workload as bursty. Comparing thc disk 1/0 generated by the multiuser environment on thc multiprocessor workloads and the effect it has on CPU utiliza- systems shows that VblS SMP is working cffi- tion. Compi~-Shareputs the heaviest load on thc ciently and that the VAX 6240 system is a wcll- multiprocessor system. However, even with all balanced system in terms of the processor and the synchroniz;ition necessary on this workload, bus speeds. the multiprocessor efficiency measure is fairly high (3.3).'l'hc ECAE and SDEW workloads show Application Characteristics Aflecting high multiprocessor efficiency measures of 3.8 Multiprocessor Performance and 3.9. rcspectively. This level of gain in the This section discusses some of the characteristics in applications that directly affect multiprocessor performancc. Memory-to- Processor TraDc Since these multiprocessor systems share mem- ory, contention to access memory could be a factor that affects multiprocessor efficiency. Therefore applications that generate lower mem- ory-to-processor traffic do perform better, assum- ing there are no other bottlenecks in the system. One way to reduce this traffic is to organize the TlME OF EXPERIMENT IN MINUTES data to improve locality of reference. Data that is accessed togcther should be placed together. Figure 14 Ijisk I/O Utilizulion over Time - ECAE Workload Disk I/O Operations With the symmetric multiprocessing software, 1/O operations can be handled by each of the processors. As a result, the I/O-intensive appli- cations perform much better on the symmetric multiprocessor systems as compared to the asym- metric multiprocessing systems. However, the 1/0 devicc interrupts are still hand led by the pri- mary processor, even under SMP. By reducing the rate at which device interrupts are made, any 0 1 I I I 1 I 0 10 20 30 40 50 60 contention for the primary processor can be TlME OF EXPERIMENT IN MINUTES reduced. To reduce the number of 1/0 inter- rupts, larger block transfers may be better in I/O- Figure 15 I~isk//OUtilizutionoverTime- intensive applications. Thus, an application that Compzt-Share Workload will lend itself to making larger block transfers

Digital Technical Journal 7 7 NO. 7 A~glrsl1988 Overview of the MicroVAX 3500/3600 Processor Module

With minimum bus cycle time down by more single chip, the CVAX designers selected a subset than a factor of two and dynamic random-access of the full VAX instruction set and data types. memory (RAM) access time remaining relatively The implementation includes 175 instructions constant, the opportunity arose to increase per- and six data types (also implemented by the formance by using static RAM to add a cache. MicroVAX I1 system), plus 6 additional string Static RAMS with 35-11s access times and 64-kilo- instructions: CMPC3, CMPC5, LOCC, SCANC, bit (Kb) densities could be used for this purpose SPANC and SKPC. The CVAX also provides micro- at reasonable cost. code support for emulation of 53 additional instructions (six less than the MicroVAX 11) and Design Partzrtztztzoningand five data types. When any of these instructions is Functionality decoded, an emulated instruction exception is To facilitate implementation of the processor generated. This exception causes a set of instruc- module using custom VLSI, the design was parti- tion-specific parameters to be pushed on the tioned into seven major parts: the central process- stack and control to be passed to operating sys- ing unit with first-level cache, the floating point tem emulation routines by the emulated instruc- unit, the second-level cache, the memory con- tion vector in the system control block. As in the troller, the Q22-bus interface, the system sup- MicroVAX 11, the remaining 70 instructions and port functions, and the clock circuitry. Each of three data types are handled by the CFPA chip. these partitions was implemented by a single The CVAX implements the following registers: chip, with the exception of the second-level Sixteen, 32-bit, general-purpose registers cache. This cachc was implemented by pro- grammable array logic (PAL) anal static RPLMs. Twelve VAX standard internal processor regis- Five of the parts are connected directly to a tcrs to support memory management, process 3 2-bit multiplexed address/data (CDAL) bus: control, interrupts and system identification the central processing unit with first-level cache (SBR, SLR, MAPEN, TBM, TBIS, TBCHK, PCBB, (CVAX), floating point unit (CFPA), second- SCBB, IPL, SIRR, SISR, and SID) ' level cache, memory controller (CMCTL), and Five internal processor registers specific to the Q22-bus interface (CQBIC). To reduce loading, CVAX to support the interval clock, first level the chip containing the system support functions cache, error reporting and console emulation (SSC) connects to a buffered version of this bus, (ICCS, CADR, MSER, SAVPC, and SAVPSL)~ the BCDAL. The clock circuitry (CCLK) was sepa- rated from the processor chip to conserve pins as The CVAX also provides a means for accessing six well as to allow designers more flexibility in additional VAX standard internal processor regis- choosing a clock rate. ters to support the time-of-year clock, console To maximize performance, the CVAX, CFPA, serial line, and 1/0 bus (TODR, RXCS, RXDB, second-level cache, and CMCTL operate synchro- TXCS, TXDB, and IORESET) .' These registers are nously from a four-phase clock generated by the implemented in the SSC. CCLK. The SSCand CQBICoperate asynchronous1 y The registers in the SSC are referred to as on a 40-MHz oscillator. The processor module "external" internal processor registers and are was designed to allow the CCLK to be fed either accessed by software in the same manner as other from the 40-MHz oscillator or from a separate internal processor registers, that is, by means of oscillator. The separate oscillator allowed the MTPR and MFPR instructions. However, the CVAX central processor and memory subsystems to be chip generates a special cycle on the CDAL bus sped up when it was determined that the CVAX, with the register number as an address. The SSC CFPA, and CMCTL chips were capable of running responds to these cycles by either supplying the ten percent faster than originally projected. CVAX with the register contents (MFPR) or per- Each of the major parts of the processor forming the register update (MTPR). Accesses to module is described in following sections. other unimplemented VAX internal processor registers will also cause these cycles to be gener- The Central Processing Unit and ated, but the cycles will terminate with an error First-level Cacbe condition. (The cycles are timed out after four The CVAX chip is a microcoded 32-bit VAX CPU. microscconds by a CDAL bus timer in the SSC.) To implement the entire VAX architecture using a When a register write is made to an unimple-

Dfgftal Tecbnical Journal No. 7 August 1988 CVAX- based Systems

mented internal processor register, the CVAX The Floating Point Accelerator ignores the error signal; the result is a long The CFPA chip works in conjunctiorl with the no-operation. When a register read of an unimple- CVAX chip to process floating point instructions mentetl internal processor register is attempted. and to accelerate the execution of some integer the results are undefined. instructions (MULL, DIVL, and EMUL). The CVAX Also ljkc the MjcroVAX 11 system, the CVAX decodes the instructions and sends the CFPA processor implements a memory management control and opcode information by means of a unit. Thc unit supports full \'AX demand-paged dedicated eight-line . The CFPA gets virtual mcrnory. with single-level page tables for its operands from the CDAL bus. Unlike the system space addresses and double-level page MicroVAX 11, all operands do not have to come tables for process space addresses. In addition, from the CPU. Operands come from the CVAX four lcvcls of access protection are supported by only if they reside in the general-purpose regis- the mcrnory managcment unit. A 28-entry, fully ters or first-level cache. If the operands reside associati\r address translation buffer is provided in the second-level cache or main memory, the for storing recent virtual-to-physical address CFPA takes thcm directly off the CDAL bus. When translation (as opposed to an 8-entry translation the CFPA has completed the operation, it returns buffer in the MicroVAX 11). condition codes and exception status by means of Unlike the MicroVAX 11 system, the CVAX in- the control bus, and the unaligned result by the cludes an on-chip (first-lcvcl), physical instruc- CDAL bus. One, two, or three longword transfers tion and data cache. Because chip area was at a may be required to transfer the result, depending premiurii, a 1 KI), two-way set associative organi- on the type of operation. The CVAX aligns and zation was chosen. In contrast to the second-level sends the result to its ultimate destination. To cache, this organization achieves a high hit rate improve DMA latency, the CVAX will grant the for the available chip area through increased con- CDAL bus requests while waiting for the CFPA to trol logic complexity instcad of increased storage return the result." array sizc. The extra control logic complexity of the first-level cache is more efficiently imple- The Second-level Cacbe mentcd in custom VLSI, whereas thc large storage The second-level cache sits directly on the CDAL arrays of the second-levcl cache are morc cfi- bus and bridges the 4-microcycle gap in access ciently implemented with off-the-shelf parts. time between the first-level cache and main Since the first-level cachc organization yields a memory. The project goal for the second-level set sizc equal to the mcrnory page size, cache cache was to maximize system performance look-up and virtual-to-physical address transla- without placing the schedule at risk. Conse- tion can bc overlapped. Thus a cache cycle time quently. designers chose to use large storage equal to the processor microcycle time is arrays to achieve the desired level of performance achieved. (hit rate) rather than complex control logic. By The first-levcl cache is look-through; that is, keeping the control logic simple, the cache cache hits on scad cycles result in no activity on could be implemented in PALS rather than cus- the CDAL bus, thus preserving its bandwidth for tom VLSI. Thus the chance of design errors was DMA transfers. The block size is one quadword so reduced as well as the time needed to correct any that cache misses on cacheable read cycles cause errors found during design qualification. the CVAX to generate a quadword transfer on the The large storage arrays were easily imple- CDAL bus. This transfer results in two longwords mented using off-the-shelf static RAMS. The of data being returned in responsc to a single resulting design was a 64KB, direct-mapped, address. The minimum transfer time is two physical instruction and data cache with write- microcycles for the first longword and one for the through. The inlplementation called for six PAL5 second, which increases the effective CDAL bus for control logic, eight 16K-by-4 static RAMS and bandwidth. Further, the first-level cache is write- four 16K-by-1 static RAMS for the data store, and through. However, to improve performance, the three 16K-by-4 static RAMS for the tag store. CVAX also contains a longword write buffer In keeping with the philosophy of simple con- which allows the CPU cxccute out of the first- trol logic, the second-level cache is look-aside; level cachc while the write operation is being that is, address decoding occurs in parallel in completed..' the cache controller and the memory controller.

Digital Technical Journal A'o. 7 Atrgtrsl 1988 Olwrlliew of the MicroVAX 3 500/3600 I-'rocessor Module

Therefore, the cache does not have to regenerate seq~rentiallongwords in less time. Both the CVAX CDAL bus cycles in the event ofa cache miss. The and the CQBIC utilize this feature to improve second-level cache control logic simply performance. The CVAX generates quadword transfers to fill cache blocks on a cache miss; and 8 Watches the CDAL bus cycles the CQBIC generates quadword, hexaword, or Rcturns data to the CVAY on cacheablc read octaword transfers on block-mode DMA by cycles that miss the first-level cache but hit the devices on the Q22-bus. The combination of mul- seconcl-level cache tiword transfers and the look-through fi rst-level

8 Allocates a block on cacheable quadword cachc made the added complexity of dual pons CVkY rcacl cycles that miss both caches (as i~sedin the MicroVAX 11) unnecessary. To work effectively with the look-aside second-level Updates an entry on CVAX write cycles that hit cachc, the CMCTL must monitor the CDAL bus the second-level cache after starting a memory operation. If the second- Invaliclates a block on DMA write cycles that level cache responds with the data first, the hit the second-level cachc CMCTL aborts its operation before completion. To support a range of CVAX microcycle times Ignorcs DMA read cycles and also maintain the performance advantage of Because the second-level cache stores the same synchronous operation, the CMCTL incli~desa typcs of references as the first-level cache, very programmable wait-state bit. This bit controls the littlc control logic is required to determine number of CPU microcycles used to access the which CVAX references arc cacheable. 'The CVAX RAM array. Moreover this bit allows the same will only generate quadword CDAL bus cyclcs on array modules to be used for processors with cachcablc CPU references that miss the first-level different microcycle times." cache. l'herefore, the second-level cache control The memory controller was not designed to logic only considers qu;tdword read cycles support battery back-up because of the added cacheablc. design complexity and cost. For those applica- To respond within the n~inirnumCVAX bus tions that require support during power outages, cyclc timc (one microcycle for the second long- standby uninterruptable power supplies are a word of a quadword cycle), the secontl-level better solution and are available for small systems cache control logic uses an overlap scheme. The at low cost. second-level cache overlaps the address gcncra- tion and the tag look-up for the second lo~lgword fie Q22-busInterface portion of the cycle with the data access for the Thc CQBIC interfaces the CDAL bus to the first longword portion of the Q22-bi1.4. This chip provides address transla- tion between the 26-bit CDAL bus and 22-bit 7be Memory Controller Q22-bus. In addition, CQBIC handles data The CMC'rL chip is the interface between the buffering between the 32-bit synchronous/asyn- CDAL bus and the Inenlory array. The chip is a chronous CDAL bus and the 16-bit asynchronous full 32-bit, single-portcd, synchronous memory Q22-bus. Q22-bus addresses are translated to controller with 7-bit error-correcting code CDAL bus addresses by a programmable mapping (ECC) and supports up to four memory array function (scatter-gather map), which is software modi~lcs(two more than thc MicroVAX II) compatible with the MicroVAX I1 system. This The CMCrL longword write buffer minimizes function gives the CPU the capability to map any the effect of write operations on CPlJ perfor- page of the 4 megabyte (MB) Q22-bus address mance. (Both caches are write-through.) 'The space to any page of the main memory address CMCTL also supports rnultiword transfers on the space. Thus Q22-bus DMA devices can transfer CDAL bus. On these transfers, the CMCTL utilizes directly to or from discontiguous pages of main page mode in the dynamic RAMS to achieve the memory. CDAL bus addresses are translated into perforn~anceof an eight-way interleaved memory (12 2-bus addresses by a direct mapping function. subsystem without the use of additional banks or This function maps the 4MB Q22-bus memory interconnect complexity. The size of thc transfer space and the 8KB Q22-bus 1/0 space into the is encoded in bits 31 through 30 of the physical VAX 1/0 space. Thus the CPU can directly access address (up to four longwords). Thus with only a Q22-bus memory or device registers by means of single address, the memory controller can fetch two ranges of 1/0 page addresses.

- -- Digital Tecbnical Journal No. 7 August I988 CVAX- based Systems

DMA write refcrenccs are buffered in two natu- design ensures that thc operating system does rally aligned octaword buffers and transferred not attempt to use the block of memory where to nuin memory by the most efficient combina- the scatter-gather map resides. The on-board tion of multiword transfers. The two octaword firmware does not include these pages in a list of buffers allow an entire block-mode transfer (up good n~emorypages that is passed to the Operat- to 16 words) to be buffered by the CQBIC. After ing system at boot time. An interesting side effect the first buffer has been filled by the Q22-bus of putting the scatter-gather map in main memory device, it is emptied into main memory while thc was that the relatively long latency on some Q22-bus device fills the second buffer. Since tbc Q22-bus DMA cycles uncovered latent design CDAL bus is faster than the Q22-bus, the first bugs in several Q22-bus DMA devices. The buffer is enipticd and ready for input from the designs of these devices had been verified by Q22-bus device before the second buffer has been empirical testing with existing processors rather filled. This arrangement allows the interface to than by testing to the Q22-bus specification. providc sustained throughput at maximum To maintain software compatibility with the Q22-bus transfer rates with no additional latency. MicroVAX I1 system, the scatter-gather map is ref- Q22-bus block-mode DMread references are erenced through a 32KB block of 1/0 space translatcd into quadword transfcrs on the CDAL addresses. The CQBIC responds to writes in this bus. Thc four words are buffered in a single quad- address range by buffering the data so the CVAX word buffer and supplied to the DMA device on cycle can complete, updating the cache if there demand. Before the buffer is emptied, the next is a hit, requesting the CDAL bus, and updating quadword is prefetched. This prefetch elinli- the entry in main memory. If any DMA operations nates additional latency on all but the first trans- are pending, they are completed before CQBIC fer. To keep the latency of the first transfer at a gives up the CDAL bus. This prevents multiple minimum, the CQBIC responds to the DMA successive map updates by the CPU from locking device after receiving the first longword of a out DMA activity long enough to cause Q22-bus quadword CDAL bus cycle, rather than waiting devices to timeout (in 10 microseconds). for the cntire quadword transfer to complete. On reads to this address range that miss the To tit the entire Q22-bus interface in a single cache, the CQBIC has to latch the address and chip, some changes had to be made to the bus force the CVAX to retry the cycle. In this way, interface architecture of the MicroVAX I1 system. CQBIC can acquire the CDAL bus to fetch the On the MicroVAX 11, the scatter-gather map was entry from main memory. When the CQBIC relin- stored in a dedicatcd 32KB static RAM array quishes the CDAL bus, thc CVAX retries the cycle. within the bus interface. On the CQBIC, not and the CQBIC provides the processor with the enough space was available to implement this requested map entry. This retry mechanism is storage array internal to the chip. Moreover, not also used to implement the interlocked instruc- enough pins were available to provide a dedi- tions in the VAX instruction set. cated bus to an external static RAM array. The On all interlocked instructions, the CVAX gen- sol~~tionwas to storc the scatter-gather map in a erates one or more sequences of a read-lock cycle 32KB block of main memory and to implement a followed immediately by a write unlock cycle. 16-entry fully associative cache for map entries The CVAX identifies these special locked cycles in thc CQBIC. The cache functions in the same by placing a unique code on the parity lines at manner as an address translation buffer. When address time. The CQBIC recognizes the read- translating a Q22-bus address, the cache is lock code and forces the CVAX to retry until the checked for the appropriate map entry. If the CQBIC can become master of the Q22-bus. Once entry is found, the translation takes place at maxi- the CQBIC has mastership of the Q22-bus, mem- mum spced. If the cntry is not found, then there ory is effectively locked and the cycle proceeds. is a delay while the entry is fetched from main The CQBIC releases the Q22-bus (unlocking memory. The translation is then performed. This memory) on the next CVAX bus transaction even delay is elinlinated on DMA transfers that cross a if it is not a write unlock cycle. This release pre- page boundary, because the entry that maps the vents memory from staying locked if the CVAX next page is prefetched when the DMA operation has to abort the instruction due to an error en- reaches a page boundary. On most DMA transfers, countered on the read-lock cycle. this delay is negligible because it is amortized Like the MicroVAX I1 Q22-bus interface, the over a large numbcr of Q22-bus transfers. The CQBIC gives the CPU the highest rather than the

Digital Technical Journal No. 7 August 1988 Ouerzk?w of the MicroVAX 3500/3600 Processor Module

lowest priority when arbitrating the Q22-bus. time-of-year clock. (The VAX standard serial line This priority assignment reduces interrupt replaces the serial line chip used as the console latency, since the processor is delayed for a maxi- on the MicroVAX 11. The clock replaces the mum of one DMA transaction before being off-the-shelf clock chip.) Since the console con- grantcd the bus to acknowledge the intcrrupt. trol/status registers (RXCS and TXCS), console Because the CPU accesses memory over a dedi- data buffers (RXDB and 'IXDB), and the time-of- cated interconnect rather than through the year clock (TODR) are VAX internal processor Q22-bus,CPU references to the Q22-bus are very registers, they are accessed by means of special infrcqucnt. Therefore this priority schcme does CDAL bus cycles as described in the section The not have a negative impact on DMA performance. Central Processing Unit and First-Level cache.' To support a range of CVAX microcycle times To save board space and cost, the SSC provides and fixed Q22-bus timing, the CQBIC was two programmable address strobes for decoding designcd to run at a fixed clock rate, asynchro- additional board-level registers. These address nously to the CPU/memorysubsystem. This design strobes decode the second-level cache control made it easier for engineers to optimizc perfor- register (CACR) and the MicroVAX II-compatible mance of the slower asynchronous Q22-bus boot and diagnostic register (BDR).~ (where bandwidth is at a premium). These opti- To prevent the processor from "hanging" on mizations are made at the expense of lowcr per- unanswered CDAL. bus cycles the SSC provides a formance on the faster CDAL bus (where there is programmable watchdog timer for the CDAL extra bandwidth) due to synchronization d~la~s.~bus. The timer starts at the beginning of a CDAL bus cyclc. If the timer expires before the System Support Functions cycle completes, the SSC asserts the error line, The SSC contains all those functions required to causing the CQBIC or CVAX to abort the cycle. support the on-board firmware, the time-of-year This timer could not be used for all CDAL bus clock, and the console serial line. The chip pro- cycles. To do so, the timer would have to be set to vidcs the logic necessary to interface the two a value greater than the Q22-bus timeout value 64 KB read-only memories (ROMs) containing the (10 microseconds) so that CPU accesses to thc firmware with the BCDAL bus. Since the ROMs are Q22-bus would not be timed out prematurely. organized as a 64K by 16-bit array, the SSC must Moreover, the timer would have to be set to a generatc two ROM cycles to satisfy each 32-bit value much less than the Q22-bus timeout value CDAL bus cycle. 'This ROM unpacking function so that unanswered CDAL bus cycles would not saves board space as well as the costs related to a cause Q22-bus timeouts during DMA. Since 32-bit-wideROM array. the CQBIC contains a 10-microsecond Q22-bus The SSC assists in the firmware emulation of a watchdog timer, the CDAL bus timer was set to VAX console processor by providing two address 2 n~icroseconds(greater than the longest CDAL spaces through which the ROM may be bus cycle) and disabled on all Q22-bus refer- accessed - the halt-mode ROM spacc, and the C'11ces. run-modc ROM spacc. Any I-stream rcad from the To support a range of CVAX microcycle times, halt-mode ROM space protects the processor the SSC was designed to run at a fixed clock rate, from external halt conditions and extinguishes asynchronously to the CPU/memory subsystem. the front panel run light. Any I-stream rcad out- Since the performance of the functions in the SSC side the halt-mode ROM space, including reads was not critical, the performance impact was not from the run-mode ROM space, enables external a concern.8 halt conditions. Under this condition, thc front panel run light is illuminated. The firmwarc is Hardware Interrupts organized so that console emulation code is exe- The interrupt logic is spread among three chips: cutcd from the halt-mode ROM spacc, and diag- CVAX, SSC, and CQBIC. The CVAX provides four nostics and boot code are executed out of the interrupt request pins that correspond to stan- run-mode ROM space. 'The SSC also provides the dard VAX hardware interrupt request levels firtnware with 1 KB of battery-backed up RAM for 14 through 17. The CVAX does not provide an storage of data structures and stack space, and a intcrrupt-acknowledge pin. The CVAX acknowl- register for controlling four diagnostic LEDs. edges interrupts when the processor's priority The SSC also contains a VAX standard console level is below the interrupt level by generating serial line and a VAX standard battery backed up an interrupt acknowledge cycle on the CDAL bus.

Digital TecbnfcalJournal No. 7 Augus! 1988 CVAX-based I Systems

The "address" used is the lcvel of the interrupt A lKD, 90-ns. first-level cache request being serviced. 'The data read is the offset A 64KB, 180-ns, second-level cache (instead of the vector within the system control block. of 1 MB of memory) The SSC contains the interrupt-acknowledge pin. The SSC responds to interrupt-acknowledge Multiword transfers (longword, quadword, cycles whenever it has an interrupt pending at hexaword, and octaword versus longword) the level being acknowledged. If the SSC does not CPLJ write buffers (one longword) in the CPU, have an interrupt pending at that level, it asserts memory controller and Q22-bus interface the interrupt-acknowledge signal. The CQBIC passes interrupt-acknowleclge cycles on to the 1 Larger DMA buffers (16 words versus 2 words (222-bus only when the SSC asserts the interrupt- for writes, 4 words versus 2 words for reads) acknowledge signal. This interrupt-acknowledge A 1 (,-entry scatter-gather map cache scheme saves a CVAX pin, at the expense of requiring the devices in the SSC to have the The combination of reduced cycle times and highest interrupt priority at their level (IRQ 14). architectural mechanisms produced a CPU per- The CQBIC uses all four CVAX interrupt formance 3.2 times that of the MicroVAX I1 (as request lines to support the four Q22-bus inter- measured by the mean of the distribution of rupt request Icvels. (BR4 through BR7 are con- results from over 150 CPU benchmarks). Addi- nected to the pins corresponding to IRQ levels tionally, a slight increase in maximum 1/0 band- 14 through 17.) Since the Q22-bus has only width was achieved (as measured by simulation one interrupt-acknowledge line, it is possible with an ideal Q-bus master) for a level 7 (17) device to steal an interrupt- acknowledge cycle intended for a level 4 (14) Reliability device. (This "steal" can occur if the Level 7 Both the MicroVAX II design and the MicroVAX device is closcr to the processor and posts an 3500/3600 design were subjected to extensive interrupt after the lcvel 4 interrupt was acknowl- thermal analysis. This analysis contributed to a edged but before the acknowledgment reached board layout and chip packaging scheme that it.) To prevent this situation from causing a level would minimize junction temperatures, thereby 7 (1 7) device driver from running at a lower IPL, improving reliability. Both designs also ensure a the CQBIC sets a bit that is returned along with high level of reliability by using preconditioned the vector offset. This bit causes the CVAX to set components that have passed a rigorous quali- the processor IPL to 17 before passing control to fication program. the driver. If the bit is not set, the processor IPL is Because of its increased complexity, the set to the level at which the interrupt request MicroVAX 3500/3600 was designed to be more was received. The CQBIC also adds an offset of tolerant of intermittent and transient failure 200 (hex) to the vector returned by the Q22-bus mechanisms. ECC rather than parity is used to device so there is no conflict with existing VAX protect main memory, and the data path between system control block entries. the CPU and main memory (including both caches) is protected by byte parity. There are Performance Relative to the also four timers (three for the Q22-bus and one MicroVRX II Processor Module for the CDAL bus) to detect unanswered bus cycles. The CVAX can detect four types of CFPA The reduction in gate delays due to the new chip technology allowed the processor microcycle errors, four types of memory management unit time to be reduced to 90 ns (versus 200 ns for errors, one type of interrupt error and one type of MicroVAX 11) and the minimum bus cycle time microcode error. Errors that are detected syn- chronous to CPU execution are reported by to be reduced to 180 ns (versus 400 ns for means of a machine check on the same cycle on MicroVAX 11). The increase in the number of tran- sistors made available by the new technology which the errors are detected. (Comparatively, allowed the following architectural mechanisms the MicroVAX I1 reports the errors on the subse- to be used to increase performance: quent cycle.) Unique machine check frames or hardware error flags are provided so that the A larger prcfctch buffer ( 12 versus 8 bytes) proper error recovery routine can be invoked. The recovery routines typically log the error, A larger translation buffer (28 versus 8 entries) clear the error condition, retry the operation a

Digital Technical Journal No. 7 August 1988 Ouerr~ieulof the MicroVAX 3500/3600 Procc.ssor Module specified number of times, and continue if suc- without the assistance of another device on the cessful. If tlic routine is unsuccessful and the Q22-bus.2 faulty hardwarc can be disabled. the system runs in a degraded mode until repaired. Othenvisc. Summary the system will crash. Errors detected asyn- Having (net performance goals, MicroVAX 3500/ chronously to CPU execution are reported by a 3600 systems were shipping in volume within high priority interrupt and arc logged, but in three years of the first shipments of MicroVAX 11. most cases arc nonrecoverable. Errors that arc At that time, two system packages, over twenty corrcctetl by hardware are reported via a lowcr mass storage and communications options, three priority interrupt, so they can be logged. operating systems, and over 200 software prod- Data from reliability qu;ilitication testing ucts (for VMS alone) had been qualificd and were verified that the predominant failure mode was available from Digital. Scores of hardware and intcrmittent, suggesting that the error recovery software products were also availablc from third- capabilities built into the systcm would signifi - parry vendors. This offering would ncver have cantiy increase the uptime of the system. been possible without the level of compatibility that results from strict adherence to existing CPlJ Testability (VAX) and 1/0 bus (Q22-bus) specifications. Most of the arcliitectural mechanisn~s used to increase the speetl of compi~tcrsystems (such References as caches and special purpose buffers) present 1. 'r. Leonard, ed.. VAX Architecture Refer- testability problems. These mechanisms are ence Manual (Bedford: Digital Press, Ordcr almost always tlcsigned to be software transpar- NO. EY-3459E-DP. 1987). cnt. which makes them invisible to diagnostic 2. KAGSO-AA CPU Module Reference Manual software. To solve this problem, special diagnos- (Maynard: Digital Equipment Corporation, tic mocles are provided for the both the first- and Ordcr No. EY-KA650-UG, 1987). second-level caches. The first-level cache cliag- nostic mode provides a way for the CPU to explic- 3. T. Fox, P. Gronowski, A. Jain, B. Leary, itly write the tag store and clear the valid bits by and D. Miner, "The CVAX 78034 Chip. using selected instructions. l'he second-level a 3 2-bit Second-generation VAX Micro- cache diagnostic mode provides explicit acccss processor," IXgital Technical Journal to both the tag and data stores through two (August 1988, this issue): 95-1 08. blocks of 1/0 addresses (the cache diagnostic 4. E McLellan, G. Wolrich, and R. Yodlowski. spacc and the cache tag diagnostic sp:~cc). "Development of the CVAX Floating Point Through the cache diagnostic space, the data Chip." Digital Technical Journal (August store can be read or written, the tag store can 1988, this issuc): 109-1 20. be written and the valid bits can be cleared. Whcn not in diagnostic mode, cache appears in 5. C. DeVane, "Design of the MicroVAX 3500/ this space as high speed RAiil. During power-up 3600 Second-level Cache." Digital Techni- self-test, diagnostic code is transferred from ROM cal Journal (August 1988, this issue): to this RAhl to allow fast execution of thc code 87-94. without requiring that main niernory be fi~nc- 6. D. Morgan, "The CVAX CMCTL. - A CMOS tional. Through thc cache tag diagnostic spacc. Memory Controller Chip," L)igital Techni- the state of the cache tag bits, parity bits, valitl cal journal (August 1988, this issue): bits, and several points within the cache control 139-143. logic can be read. The MicroVAX 3 500/3600 processor modu lc 7. B. Maskas, "De\7elopment of the CVAX Q22-bus Interface Chip," Digital Techni- dcsign also provides a diagnostic mode for main cal Journal (August 1788. this issue): memory and a means of writing to main memory 129-1 38. through the Q22-bus interface. The main mem- ory rliagnostic mode allows memory test times to 8. J. Winston, "The System Support Chip, be significantly reduced. Further, writing to main a Multifunction Chip for CVAX Systems," memory through the Q22-bus interface allows Iligital Technical journal (August 1988. the scatter-gather map functionality to bc tcstctl this issue): 12 1-128.

Digital Technical Journal No. 7 Atrgzrsl 1388 CharlesJ. DeVane I

Design of the Micro VRX 3500/3600 Second-level Cache

The MicroVAX 3500/3600 processor module, the KA650, is a CVM-based uniprocessor that incorporates an unusual cache architecture: a two-level cache. TheJirst level is a small fast cache on the CPU chip, and the second level is a large, somewhat slower cache on the processor module. Along with high quality and high performance, time-to-market was a crucialgoal for this third-generation Micro VAX system product. Consequently,project engineers adhered to a philosophy of design simplicityfor the second-level cache. Cacheperformance measurements support their design decisions.

Tbe Micro VAX3500/3600Project ules in parallel with the CVAX projcct. ;I process Thc primary go;~lof the MicroVAX 3500/3600 that relied hcavily on simulation. In turn, the projcct was simple. The chip designers in the lMicroVAX projcct team provided t he initial Semiconductor Engineering Group (SEG) were debug testbed for CVAX: CVAX first bootcd VMS working on a new single-chip VAX. CVAX.' The in a MicroVAX 3500/3600 system. chip would have its own on-chip cache and was projected to achicvc ;I performance level tl~rcc Overuiew of tbe times thc original MicroVAX chip used in the KA 650 Processor Module MicroVAX 11 system. The MicroVAX Development Thc systcrn functional partition (Figure I) shows C;roup would work in concert with the SEG efort. Iio\i/ the KA650 processor module fits into the Our goal was to ship a high-quality, high-perfor- cntire computer system. The processor module m;lncc CVAX-based uniprocessor, which would communicates to mass storage, communication, be upward compatible with MicroVAX I1 systems. ant1 other 1/0 devices over the Q22-bus. Main This ncw product must be available as soon as memory connects to the processor on a private CVAX chips coil Id be produced in voliime. mcn1or-y bus which uses both the backplane and Given the objectives of high quality and "over-the-top" ribbon cable. A console panel car- MicroVAX I1 system compatibility, the remaining rics bit rate and configuration switches, a single- design goals were carefully prioritized as listetl digit hexadecimal display. a connector for thc below: consolc serial line. and a NiCd battery for the 1. 'I'imc to market processor's time-of-year (TOY) clock. 'l'hc module functional partition in Figure 2 2. Raw compu t;ctional performance shows the basic parts of the KA650 processor 3. Memory expansion rnotlulc. All memory tranic tlows over the CDAL bus (CVAX data/address lines). Only I/O space 4.Direct-mcmon access (DMA)/real-time per- registers reside on the BCDAL bus (buffered formancc CVAX tlata/addrcss lines). 5. System cost and price 'rhc memory controller subsysteni and the 022-bus interface subsystem are each single 6. Adtlitional functionality chips: the CM<:TL (CVAX memory controller) Thc importance of quickly delivering the 2nd CQBIC (CVAX Q22-bus interface chip).l.j MicroVAX .)500/.3600 to market led to a close Most of the systcni support functions are con- working relationship betwecn thc engineers in tained in another chip, thc SSC (systcni support SEG ant1 MicroVAX Development. We designed chip)." Each of these was designed in parallel and built thc ~MicrovAXCPLr and memory mod- with CVAX. as part of a complete CVAX chip set.

Digital Tecbnical Journal .Vo. ' Augf#sI 1988 - - -I>esigtz oJ the fMicroVAX 3500/36OO .Trcond-lerjel C'nche

CONSOLE SERIAL LINE + CONSOLE PANEL MEMORY DATA

I I MEMORY ADDRESS AND CONTROL ------It 4 I DISK TAPE ETHERNET I 1 COMMUNICATIONS 1 I OTHER 110 1

Figure 1 Micro VAX 3 500/36OO System Functional Partition

The primary problern left to the KA650 niod- CVAX designers included an on-chip cache that ule designers was to balance two key goals: to could be accessed in one microcycle, and made design the board-level cache for the highest per- the cache as large as practical. From the module formance possible and to do so without endan- perspective, CVAX can run a bus cycle as quickly gering the project's time-to-marketgoal as two microcycles. However, the memory system requires five microcycles to acccss main memory. Two-levelCache Architecture To cornpcnsatc for this second gap, the module Description designers included a module level cache that The KA650 is Digital's first cornrncrcially avail- could be accesscd in two microcycles, and made able processor to incorporate a two-level cache. the cache as large as practical. The first level is a small cache on the CPU chip with a cycle time of one microcycle, or First-level Cache 90 nanoseconds (ns). 'l'he second level is a large The first-level cache is a 1 kilobyte (KB), two-wdy cache on the processor module with a cycle time set associative cachc with a quadword block size. of two microcyclcs, or 180 ns. In comparison, Thc cache is organized as 64 rows, each row con- the cycle time of main memory system is five taining two sets, and each set containing 8 bytes. microcyclcs, or 450 ns. 'l'wo bits in the cache disable register (CADR) The goal of each level of cache is to reduce select whether the first-level cache stores effective memory access time on processor read I-stream only, D-stream only (ordinarily used cycles. At the chip level, the CVAX processor only for diagnostics), or both I-stream and would prefer to use just one microcycle to access D-stream references. The cache allocates a block memory. However, the CVAX bus interface unit whenever a cacheable read reference misses the (B11J) requires two microcycles to access mem- cache. The CVAX RIU then generates a quadword ory off the chip. To compensate for this gap, the read cycle to fi 11 the allocated block.

Digital Tecbnkal Journal No. 7 August 1988 CVAX- based Systems

f Ac16:2s

A h A I. SYSTEM ICI CONSOLE SECOND- BCDALc31 :O> SUPPORT SUBSYSTEM PANEL r A*.31:2> LEVEL CACHE f Y

,! 1 Av BCDAL TRANSCEIVERS ADDRESS LATCH 1 7

A J \ u - CPU AND FPA 022-BUS CDALc31 :O> INTERFACE 022-BUS P YV -

MAlN MEMORY CONTROLLER i I MAlN MEMORY INTERCONNECT

Figure 2 MicroVAX 3500/3600 Module Functional Partition

The CVAX BIU waits to determine whether a the first-level cache for three microcycles per read referencc hits in the cache before starting qi~adwordb.lock and six microcycles for an octa- thc bus cycle to access memory. This wait helps word. However, these delays stall CPU execution free the processor bus for use by DMA devices, but only if the CPU requires access to the cache dur- rcquires faster RAMS in the second-level cache. ing those microcycles. The processor writes directly through the cache to memory. Therefore, when a cache block Second-level Cache is allocated, the block being replaced need not The second-level cache is a 64KB direct-mapped be written back to memory. The CVAX BIU also cache, which like the first level, also has a quad- incorporates a write buffer to support dump- word block size. This cache is organized as and-run writes by the processor. If the CDAL bus 8K rows, each row containing one set of 8 bytes. is busy when CVAX needs to write, the BIU will The second-level cache allocates a quadword buffcr one write cycle. The buffering allows the block whenever CVAX reads a quadword that processor to continue execution, reading from misses the second-level cache. (Quadword reads the first-levcl cache. Thus, some write cycles are ordinarily the result of allocation in the first- require only one microcycle. level cache. Unusual bit settings in the CADR, When DMA devices write to main memory, the however, can cause the CVAX BIU to generate cachc must be updated to reflect the change in quadword cycles on reads without actually milin memory. Cache data that is no longer con- enabling the first-level cache.) Thus, the second- sistent with the contents of main memory is level cache will include the same kind of data as called stale data. To prevent stale data from accu- the first-level cache: I-stream only, D-stream mulating in the cache when DMA devices write to only, or I- and D-stream references. memory, the cache will check and invalidate one Instead of waiting to determine whether a read or two blocks as necessary. Invalidation ties up reference hits in the cache, the memory con-

Digital Technical Journal No. 7 Augusr 1988 Design of /he MicroVAX 3500/3600 Second-lei~elCnche

troller bcgins accessing memory in parallel with schcclule. At the beginning of the project, we the tag look-up in the second-lcvel cache. If the doubted that 256-kilobit (Kb) static RAMS with reference hits in the cache, the memory con- 45-11s access time would be available soon troller will abort its response to CVAX (although enough. However, we expected 64Kb RAMS to be the control cycle to the memory modules com- mature when Manufacturing would need produc- pletes normally). tion volumes of the parts for the MicroVAX 3500/ L.i kc the first-level cache, the second-level 3600 system. cachc also writes directly through to memory. The 64Kb RAMS were avajlable in three organi- The mcmory controller will perform a dump-and- zations: 64K by 1, 16K by 4, and 8K by 8. We run write if the write is an i~n~naskedlongword. coulcl have arranged these to form a 256KB cache Therefore many write cycles can complete in two (using 32 64K-by-1 RAMS), a 64KB cache (using microcycles. This completion time assumes the 8 16K-by-4 FWMs) or a 32KB cache (using 4 8K- memory modules are not busy completing a pre- by-8 RAMS). The 256KB cache would not have vious dump-and-runwrite, aborted read cycle. or even fit on the module. and so was not consid- refresh cycle. ered The 64KB cache would fit (requiring only During DMA the second-level cache will also slightly more module space than the 32KB check and invalidate one or two blocks as neces- cache) and was actually cheapcr than thc 32KB sary. These chccks prevent stale data from accu- cachc. So naturally we chose the 64KB cache. We mulating during DMA write cyclcs to memory. then added four I6K-by-l RAMS for byte parity. Design of the Cache Organization KA650 Second-level Cacbe We cluickly ruled out organizing the cache with The importance of minimizing timc taken to more than onc set. More than one set would deliver the product to market made simplicity a either require too much logic or run too slowly. high priority. For most major design decisions. To get data fast enough from the correct set on we chose the simplest implementation. a read hit would require a multiplexer and a separate set of RAMS for each set. Thjs additional Cache Speed logic would take more space than we had avail- The cache speed was determined by the fastest able. hiother possibility was to ~lsea "select set" CVAX bus cycle. CVAX can read or write a single signal generated from the tag-store match signals longword in two microcycles ( 180 ns) arid read a as an address bit into the data store RAMS. This naturally aligned quadword in three microcycles organization, however, would run too slowly. (270 ns). Each added wait state costs another The cache performance simulation data avail- microcycle (90 ns). For examplc, a typical quad- able to us ;~ssumedthe cache was flushed on word read from main memory requires five every contcxt switch. We felt this assumption microcycles for the first longword and three might be overly pessimistic for caches as large as ~nicrocyclesfor the seconcl longword - ;I total of 64KI). Furthermore, we expected that more real- 720 ns. Therefore the goal of the second-level istic data would not show a large performance cachc was to allow CVAX to exccute from mem- advantage for a two-way set associative cache ory with no wait states. Prelinlinary timing dia- ovcr a direct-mapped cache. We therefore chose grams determined that to keep up with a 100-ns the simpler direct-mapped organization. CVAX the cache would require 45-11s static WMs. When later in the project KA65O module dcsign- Block Size ers changed the clock speed from 100 ns to When choosing the block size for the second- 90 ns, they also replaced the 45-ns cache RAMS 1cvc.l cache, we again decided in favor of simplic- with 35-11s RAMS. ity. We chosc to make the second-level cache use the same size block as the first-level, which was Cache Size already set at a quadword. At quadword block Increasing a cache's size always improves its size, the second-level cache can allocatc a block performance. Since high performance was a simultaneously with the first-level cache. The major priority, choosing the cache size was sin- second-level cache simply captures the data from ply a matter of finding the largest RAh4 that would the quadword read as it comes from memory over run fast enough, fit on the board, and not risk the thc CDAL bus.

Digital Tecbnical Journal No. 7 Aug~tst 1988 CVAX-based Systems

We had several additional reasons for not cycle, the memory controller begins accessing choosing either a longword block size or a sizc main memory in parallel with the tag check in larger than a quadword. lJse of a longword block the second-level cache. If the cyclc misses the size in the second-level cache complicates the cache, then main memory is prepared to respond control logic and potentially degrades perfor- as quickly as possible. If the cycle hits the cache, mance. To respond to a CVAX quadword read, the the memory controller senses the hit and aborts cache would require two separate tag look-ups. If its response to the bus cycle. A drawback of this the first look-up hit but the second look-up scheme is that the memory controller must still missctl, the cache would have to retry the bus complete the control cycle to the dynamic RAMS cycle. The retry would invalidate the block in the of main memory. Consequently, the controller first-level cache and waste bus bandwidth. On the cannot respond as quickly as it had initially other hand. use of a block size larger than a quad- if the cache hit is immediately followed by a word would require extra data path and control cache miss. We expected this penalty to be to perform block fill operations. insignificant. The alternative to a look-aside architecture Tag Store Organization would be to place the memory controller on a Once we knew the data store sizc (64KR), orga- separate bus. The bus cycle would pass to the nization (direct-mapped), and block size (quad- controller only after the cycle missed the second- word), we could determine the 0rg;lnization of level cache. This design would have improved the tag store. the efficiency of main memory usage. However, The tag store requires one row for each of the this design requires additional data path and con- 8.192 clu;idword !>locks of the data store. Of the trol to create the separate mernory bus, and CVLK $0 bit physical address. 13 address bits reduces processor performance by adding at least (bits I5 through 3) are used to select the quad- one additional niicrocycle to the penalty for a word block and associated tag store row. Each tag cache miss. row must store a parity bit, a valid bit, and enough of the memory address to specify where Handling of Write Cycles in main memory the cluadword block of data We chose a simple write-through design for the came from. Since the U650 would architec- second-level cache instead of a more complex turally support no more than 64MB, address write-back design. The penalty of not using bits 29 through 26 would always be zero to write-back is reduced by the CMCTL dump- access main memory. This left 10 address bits and-run write feature. When CVAX writes ;in (bits 25 through 16) to be stored in the tag unmasked longword to main memory, the CIMCTL row. 'I'herefore the tag store would require latches the address and data and terminates the 8,192 words of RAM, each word consisting of bus cycle before the write to main memory is 10 tag bits plus a valid bit and a parity bit. actually completed. If write cycles occur back to To make this 8K-by-12 array, we used three of back (which is common for VAX processors), the same I GK-by-4 RAMs used in the data store. then the second write will be delayed while We did cxamine the special 2K-by-9 tag-store the first one completes. However, many write RAMs being developed by some memory vendors. cycles can still complete in the minimum two We concluded that these RAMS were too small microcycles. and their availability too risky for the U650. DMA Access to the Cache Look-aside Architecture To maintain design simplicity, we decided not to The dcsign of the first-level cache keeps most allow DMA to read or write the sccond-level of the processor memory traffic off the CDAL cache. This section discusses several of the bus. Instead of this "look-through" design, the possibilities we considered and rejected. These secontl-level cache uses a "look-aside" architec- include DMA reads, DMA write-through, and a ture which simplifies the bus data path and con- cache without valid bits. trol and improves performance on cache misses. First, we considered allowing the CQBIC In the look-aside architecture, both the (which is the only DM4 device on the CDAL bus) second-level cache and the memory controller to read from the second-level cachc. However, resitlc on the same bus. When CVLK starts a read the cache control logic is synchronous with the

Digital Technical Jortnral 9 1 No 7 ANRIISI 1988 Ilesign of the MicroVAX 3500/3600 Second-level Cache

CVAX clocks. The control logic design would CVAX. Such a bit would allow a machine check have been significantly complicated if that logic handler to isolate a failing bit in the data store. had to respond to the CQBIC, which runs asyn- Without this control register bit, software can at chronous to the CVAX clocks. best determine in which byte the error resides; if Second, the cache must recognize DbM writes multiple bytes have errors, only one byte can be to memory to prevent stale data from accumulat- identified. ing in the cache. We considered letting DMA Tag Store Parity - The tag store parity must be write cycles write through the cache, but again generated and checked by the tag store itself. concluded the timing was too complex to be Two separate parity trees are used: practical. (Bus parity was also a concern, which The predictive parity tree is discussed in the section Cache Parity.) Instead, the second-level cache latches the address and The error-checking tree simply invalidates one or two blocks if the The predictive parity tree generates the parity address hits in the cache. of the tag field of the address. This tree predicts Finally, while considering DMA write-through, what the parity stored in the RAM must be for the we thought about designing the cache without bus cycle to hit in the cache. Predictive parity is any valid bits. Power-up routines in the read-only fast because the parity is calculated while the tag memory (ROM) code could initialize the cache RAMS are looking up the tag. This scheme does to match main memory. The cache would then not delay the tag comparison and is sufficient to remain consistent with memory unless an uncor- guarantee that bad parity stored in the tag RAMs rcctable ECC (error correcting code) error was will force a cache miss. However, it is not encountered in main memory. When that error sufficient to determine whether the parity in the occurred, the cache would simply disable read RAMs is actually bad. Thus, a second parity tree, hits until the operating system could restore con- the error-checking tree, is needed. sistency with main memory by writing the quad- The error-checking tree identifies bad parity in word block containing the error. Of course once the cache tag RAMs. The output of this second we decided against DMA write-through, we had tree is checked after the hit/miss decision is to include valid bits. made, to determine whether a miss was caused by bad parity. If bad parity is detected, the cache Cache Parity control register error bit is set, the cache-enable 1'0 improve the integrity of the second-level bit is cleared, and an interrupt is posted to the cache, both the data store and the tag store of the processor. Since the bad parity forced a miss, no second-level cache are protected by parity. state is corrupted, and a process or system crash Data Store Parity - Data store parity was sini- is averted. plified by taking advantage of the CDAL bus par- Second-level cache tag parity covers both the ity supported by CVAX and CMCTL. The data 10 tag bits and the valid bit to protect against store simply stores and returns parity captured erroneously set valid bits. off the bus, and asserts CDPE (CVAX data parity cnable) to have CVAX check the parity. Cache Diagnostic Space This parity checking scheme was another rea- Early in the project we recognized the value son we rejected DMA write-through, since CQBIC of being able to directly access the cache as neither generates nor checks CDAL bus parity. 64KB of fast RAM. Thus we created "cache diag- One drawback to this simple scheme is that the nostic space" in the 64MB address range from processor cannot easily determine the source of a 1000 0000 to 13FF FFFF. In cache diagnostic CDAL bus parity error. A CDAL bus parity error space, the cache RAM appears as 1,024 copies of can be caused by a cache failure, a CMCTL fail- the 64KB of cache. The cache responds to all ure, or an actual bus fault (such as open etch). CVAX read and write cycles in this address range, 'T'his lack of isolation makes error diagnosis effectively forcing a cache hit. For simplicity, difficiilt or impossible when CVAX detects a DMA access to cache diagnostic space is not per- CDAL bus parity error. mitted. One useful feature we did not think to include During power-up self-test, some diagnostics was a control register bit to disable the assertion are relocated from the boot/diagnostic ROM of CDPE and the subsequent parity checking by to cache diagnostic space for faster execution.

Digffd Technical Journal No. 7 August 1988 CVAX-based Systems

Cache diagnostic space was also useful at initial bus cycle) to collect a total of 268 million debug of thc CVAX chip set. We were able to sequential bus cycles of interest. For example, to downline-load diagnostic programs through the study the read hit rate, the four counters simulta- console serial line and execute them from the neously collected: cache diagnostic space. With the diagnostic pro- The total number of cacheable quadword read gmnisin this cache space, we could continue cycles debug work on the module without relying on either main memory or the Q-bus interface. The number of cacheable quadword read Writing to cache diagnostic space could cor- cycles that hit in the second-level cache rupt normal cache operatjon by creating stale data in the cache. To prcvent this, write cycles to The number of cacheable quadword read cache diagnostic space normally invalidate the cycles that missed the second-level cache tag for that ;~ddress.This invalidation also pro- (Counting both the cache hits and misses pro- vides a simple means for flushing all or part of the vides a useful error check.) cache. To simplify diagnosis of cache faults, a The number of cachcable quadword read diagnostic mode bit in the cache control register cycles that hit in the cache, or that would have can be set to cause writes to cache diagnostic hit if the valid bit had been set space to set the valid bit instead of clearing it. Setting the diagnostic mode bit also clears the Since CVAX gives no external indication when cache enablc bit. Thus normal allocation and a memory read is satisfied by the internal cache, DhLA invalidation are prevented from acciden- only reads that miss the first-level cache (and tally upsetting a diagnostic pattern being written therefore generate a bus cycle) can be directly into the cache. These features simplify the task of measured. Thus, it is very important to note that putting the cache in a specific state for diagnostic the read hit rate of the second-level cache alone is purlmses. not the same as the read hit rate of both caches taken together as a single whole (which is beyond Performance Measurements the scope of this paper). This is not a problem for ~Mcasurements of sccond-level cache perfor- write cycles because the first-level cache is write m;ince bear out that the fundamental architec- through. turd decisions were sound. The rneasurcments were performed on a small Test Results system consisting of a KA650 CPU with 16~~of For the workloads tested, the read hit rate was main memory, an RQDX3 disk controller with an typically 85 percent and ranged between 82 per- RD54 hard disk, and a DEQNA Ethernet interface. cent and 91 percent. This is what we intuitively Thc CPU module was modified with additional expected: the large size of the cache would kecp circuitr). to dctcct various kinds of cacheable bus the hit rate high, even though the first-level cache cycles. Thc systcm ran VMS version 4.7A. To heav- tends to strip off much of the memory access ily load the system with reasonably realistic locality. workloads. we uscd varying combinations of We measured the read hit rate of the second- thrcc basic tasks: level cache with the first-level cache turned off, just to get an idea of how well a simple but large h\scmbling and linking a large program writ- cache can perform. The memory read hit rate ten in VAX MACRO ranged between 96 percent and 99 percent when Running a CAD program that compares the the first-level cache was turned off. This demon- topology of two largc net lists strates that even a simple direct-mapped cache performs we1 I if it is large enough. However, note Copying large files (greater than 8,000 that turning on the first-level cache tends to radi- blocks) across the nctwork cally alter the bus traffic seen by the second-level Four 1 (,-bit counters and a logic analyzer were cache. Therefore a direct comparison between uscd to log the occurrence of particular bus hit rates with and without the first-level cache cycles. For each measurcment, the cache perfor- can be misleading. mance was monitorcd continuously for 5 to 30 The "would have" hit rate is a measure of what minutes (depending on the workload and type of the read hit rate would have been if DM4 write

Digital Technical Journal 9.3 No. ' A~rgtrsl I088 Oesigjl (?/'thr Micro VA X J 5(1O/j(i00 .Teconrl-leoel C~che cycles wrote through the cache instead of invali- Table 1 Comparison of Benchmark Results dating the cachc. The niotlificatioris to the CPIJ for First- and Second-level Caches module included an extra tag comparator that Second- First- ignores the valid bit. Once the cache has initially level level filled. valid bits are clcarcd only by DMA invali- Bench- Neither Cache Cache Both MicroVAX II dates. If the tag matches but the valid bit is mark Cache Only Only Caches cleared, then thc cachc miss was causctl by a HANOI 0.45 0.70 1 .OO 1.00 0.42 DMA invalidate and would have been a hit if thc PRIME 0.68 0.81 0.97 1.00 0.24 DiMA cycle had written through. FFT45 0.52 0.69 0.91 1.00 0.28 The "would have" hit rate showed the benefit JACOB1 0.47 0.65 0.93 1.00 0.27 of DMA write through would have been ncgligi- CAE2 0.51 0.69 0.95 1.00 0.31 blc. 'The incremental improvement in hit ratc was typically 0. 1 percent, though in one casc it rose to about 1.3 percent (copying large files over the Each cache provides a significant pcrform;~nce network, with no other computational tasks). boost, birr performancc with the first-level cache This improvement is lost in the noise when com- alone is bctter than performance with the second- pared to the normal task-to-task variation in hit level cache alone. The faster cyclc time and ratc. Again, this is what we intuitively expected: two-way associativity of the first-level cachc out- DMA tends not to write into memory currently in weighs the large size of the second-level cache. use by the processor. Clearly we made the right An extreme example of this is the Towers of decision to avoid the added complexity of DMA Hanoi benchmark, where the performance of writc through. both caches together is no bctter than that of the Memory write cyclcs wcre also measurccl for first-level cache alone. the same tasks as mcmoqrreads. However, instead of mc;~suring the "would have" hit ratc, we Conclusions counted the nu~nbcrof cycles that took longer At the project close, we had met our fundamental than two ~nicrocyclesto complete. This gives us goals. The MicroVAX 3500/3600 CPU is compat- some measure of the cffcctiveness of the CMCTL ible wirh the MicroVAX I1 but delivers three dump-;tnd-runwrite b~rffcr. time5 the performance - performance attribut- The memory write hit rate ranged between able in part to the two-level cache. And because 77 percent and 89 perccnt. Of all memory writc we adhered to a simple design approach, the new cyclcs, 46 pcrcent to 63 percent took longer than system was ready to ship as soon as CVAX chip two microcycles (the rninimum write cyclc sets wcre available in production volumes. timc); and 37 percent to 44 pcrcent took longer than two microcycles and hit in the cache. References We had hoped more cyclcs could takc advan- 1. T. Fox, P. Gronowski. A. Jain, B. Lcary, and tage of the clump and run write buffer in the D Miner, "The CVAX 78034 Chip, a 32-bit ClMC'1'1.. However, this performance is still good Second-generation VAX Microprocessor," for the relative simplicity of thc CMCTL writc I)igital Technical Journal (August 1988, buffcr. Also remember that the CVAX internal tli~sissue): 95- 108. writc buffer helps shield CPU performancc from the dela)~sof many write cycles. The complexity 2. D. Morgan, "The CVAX CMCTL - A CMOS ant1 schedule risk of adding another writc bufter Memory Controller Chip," Digital Techni- or designing the cache for write-back operation cal Journal (August 1988, this issue): would not have been justifiable. 139-143. Ib examine the relative impact of the two-level 3. B. Maskas, "Development of the CVAX cachc on processor performance, we ran bcnch- Q22-bus Interface Chip," Digital Technical marks with both caches enabled, each cache Journal(August 1988,thisissue): 129- 138. alone, and both caches turned off. Table I shows sornc typical results normalized to the perfor- 4. J. Winston, "Thc System Support Chip, a mance of the KA650 wirh both caches turned on. Multifilnction Chip fgr CVAX Systems," Pcrforn~;~nccof the MicroVAX 11 is shown for Lligital Technicul Journal (August 1088, comparison this issue): 121-128.

9 4 Digital Technical Journal No. 7 August 1988 Thomas 1E Fox Paul E. Cronowski Anil K. Jain Burton M. Leary Daniel G. Miner me CVAX 78034 Chip, a 32-bit Second-generation VRX Microprocessor

The MicroVAX 78034 chip - also known as CVAX - is a second-genera- tion single-chip VAX microprocessor. A primary project goal was to develop a chip with three times the performance of thejrst single-chip VAX processor, the MicroVAX 78032. Therefore, architecture and circuit design eJorts were directed toward decreasing ticks per instruction (TPI) and machine cycle time. The designers reduced the TPI by 27 percent and achieved a 90-nanosecond (m) cycle - a signzjkant improvement over the 200-ns cycle time of the first-generation chip. Implemented in a 2-micron CMOS process, the chip comprises six major functional units. Tbese include the instruction queue, execution unit, memory management unit, bus interface unit, microsequencer and control store, and a unique on-chip cache. l'hc WAX 780.34 CPU chip is a second-gcncra- ~mplemented.In the CVAX chip, both the 'TPI tion, single-chip VAX microprocessor. This chip ;tnd the machine cycle time wcre improved to is the CPU of the MicroVAX 3500 and 3600 com- meet the performance goal. puter systems, which have approximately three Much effort went into reducing the TPI. By timcs the pcrforn~anceof the MicroVAX I1 com- way of comparison, the MicroVAX I[ system, pute~-syste111. I" The VAX 6200 family of systems which is based upon the MicroVAX 78032 chip, uscs slightly fi~stcr80-11s (speed-binned) CVAX performs at approximately 1 1.5 'T'PI; whereas (:PI: chips in a multiprocessor configuration. In the iMicroVAX 3600 system, which uses the this paper, we describe the CVAX chip and CVAX 78034 chip, performs at approximately cxplain how thc increase in performance was 8.4 TPI. The TPI was lowered mainly by reducing achieved. the average number of cycles rcqi~iredto access memory. This rcduction in the number of cycles Project Coals was achievcd by thc inclusio~iof the following 'l'hc primary projcct go~lw;is to devclop a singlc- architectur;~lfcati~rcs in the system: chip <:PIJ that implementctl the VAX architecture and tlcliverctl three timcs the performance of the A 1-kilobyte (KB), on-chip instruction and rMicroVAX 780.32 CPlJ chip used in the MicroVAX data stream cache, which is capable of a long- I1 computer systcms. Of thc sevcral clements in w-ortl read each cyclc this go~l.pcrforrnancc presented the greatest A 64KB, seconcl-level cache on the board, tlcsign challcngc. wh~chis capablc of a longword read or writc 'l'hc performance of a CPlJ is inversely propor- 111 two cycles and a quadword read in thrce tional to the product of ticks per instruction cycles ('I'PI) and the machine cycle time. TPI depends on the perform;rnce of the system architecture. A 28-entry translation buffer (TB). which 'I'hc minimum machine cyclc time depends on achieves a high hit rate for virtual-to-physical circuit speccl ;~nd011 how the architecture is atldress translation

Digital Technicul Journal 95 A'IA ' A~rgrrsl 1988 The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor

Table 1 CVAX lnstruction Set Architecture cation. The first-generation chip, the MicroVAX 78032 CPU, has a 200-ns cycle time and was Instruction Type Number implemented in a 3-micron NMOS process. In Implemented Fully by CPU comparison, the CVAX 78034 CPU chip had a goal of a 90-11s cycle time and was implemented Integer/logical in a 2-micron CMOS process. However, only Address 60 percent of the improvement in the CVAX Bit field cycle time results from the fabrication process. Control The remainder results from architectural and cir- Procedure call cuit innovations, which are described in the sec- Miscellaneous tion Internal Organization. Queue The section following presents an overview of System support the CVAX architecture. Character string Subtotal CVAX Architecture Implemented by Floating Point Chip The CVAX 78034 CPU chip implements the VAX architecture, which has 16 general-purpose reg- F floating 24 isters, the processor status longword, and 18 mis- D floating 23 cellaneous privileged registers. All 304 VAX G floating 23 4 instructions are supported by the system. The Subtotal 70 chip fully executes 181 instructions and pro- Implemented Partially by CPU vides microcode operand parsing for 2 1 instruc- tions that are emulated with macrocode. The Character string 3 chip passes 70 F, D, and G floating point instruc- Decimal 16 tions to a companion floating point chip. The Edit 1 remaining 32 instructions are fully emulated in CRC 1 macrocode. Table 1 summarizes the instruction Subtotal 21 set architecture. Implemented Fully in Macrocode The chip memory management hardware and microcode provide a demand-paged virtual mem- H floating ory environment. The virtual memory size is Octaword 4 gigabytes, and the physical address space is Subtotal 1 gigabyte. Total External Interface The CVAX bus provides a flexible interconnect Other factors influencing the lower TPI are as protocol between all CVAX family members. The follows: primary data bus is 32 bits wide and is time mul- tiplexed to share addresses and data. Up to four More efficient m~crocodewas implemented for longwords can be transferred with each address. some instructions. In general, most complex Strobes provide timing information for syn- ~nstructions,such as WLx, RET, PUSHR, chronous and asynchronous devices. Direct mem- POPR, and INSV, were coded for speed rather ory access (DMA) request and grant signals are than for space used to control arbitration of the data and address Six additional instructions were implemented line (DM) bus between the CPU and peripheral in microcode. These instructions are CMPC3, chips. CMPC5, LOCC, SKPC, SCANC, and SPANC. Shown in Figure I, the CVAX 78034 CPU chip is a synchronous device on the CVAX bus. In addi- The instruction decode section decodes all tion to supporting the CVAX bus protocol, eight specifiers instead of relying on the microcode dedicated pins support a floating point coproces- to decode some specifiers. sor interface. These pins are time multiplexed The machine cycle time reduction was deter- between the CPU chip and the coprocessor chip mined in part by the technology chosen for fabri- to transfer control and status information.

Digital Tecbnjcd Journal No. 7 August 1988 CVAX-based Systems

- - - DS -HALT DS - - AS --PWRFL AS 4 A - INTERRUPT -CRD BM <3:0> -- CONTROL BYTE MASK BM<3:0> -MEMERR LATCH -INTFIM -IRQ<3:0> - CONTROL STATUS , CS<2:0> , LATCH CVAX 78034 CENTRAL PROCESSOR UNIT

PARITY DP<3:0> CS DP<3:0> TRANSCEIVERS * b

- - T W R W R 7, - b - DBE - DEE DMA +DMR CONTROL ( - -DMG L)AL<~~:00>

BA<31:00

CACHE MEMORY - CCTL - c -AS AND WRITE BUFFER ( -- CONTROL -CWB DAL<31:00>

CPSTA r rn CPSTA< 1 :O> CPDAT<5:0> * c CPDAT<5:0> - - - DMG -DMG -RDY - CLKA A CLK.4 ERR c CLKB --cCLKB 781 34 - RESET FLOATING-POINT RESET ACCELERATOR CLKA - - RDY -CLKB RDY - - - ERR RESET ERR -

Figure I CVAX 78034 External lnterface

Digital Technical Journal No. 7 August 1988 The CVAX 78034 Chip, a 32-hit Second-generation VAX Microprocessor

A clock chip generates pairs of 180-degree Execution unit (E-Box) phase-shifted clock signals that are distributed to Memory management unit (M-Box) all synchronous MOS components in the system. The clock also generates auxiliary pairs of clocks Bus interface unit (BIU) that can be used by any non-MOS components in Cache thc external interface. Separation of the clock~ng for MOS and non-MOS elements provides better Microsequencer and control store skew control for the critical MOS clock signals. The photomicrograph in Figure 2 and the block diagram in Figure 3 illustrate all functional units Microarchitecture on the chip. The CVAX 780.74 CPU chip has some pipelining and is microprogrammed. The chip comprises six Internal Organization major functional ~nits:~.~.' This section describes the six major functional units of the chip. As noted earlier, the emphasis Instruction decode and prefetch queue here is on those aspects of the design that en- (I - BOX) hanced the machine's performance. In addition,

Figure 2 Photomicrograph of the CVAX CPU Chip

98 Wgital Tecbnkal Journal No 7 Auuust 1988 TO

IMERRUF?PINS

CONTROLSTORE

Figure 3 CVAX CPU Block Diagram The CVAX 78034 Chip, a 32-hit Second-generation VAX Microprocessor

at the end of this section we discuss the design longword and send it to the instruction prefetch approach taken to build in chip testability. queue. The flow of information between all functional When a microinstruction that loads the pro- units on the chip is synchronized by four on-chip gram counter register is detected, for example, clock phases of nominally equal duration. All cir- during a branch instruction, the prefetch queue cuits werc designed to function with the partial is flushed. A new instruction must then be phase overlap or undcrlap that can result from fetched before the processor can proceed. external clock skew and variations in the fabrica- IJp to three prefetched longwords of instruc- tion process. tion stream data can be queued by the prefetch queue. In addition, the prefetch queue rotates Instruction Decode and the instructions to bring the opcode to the front Prefetch Queue and extracts in-line instruction stream data for The instruction decode and prefetch queue, the use by the E-Box. I-Box, controls macroinstruction sequencing and instruction stream prefetching.%uring a micro- Execution Unit (E-Box) cycle, the I-Box determines what the next micro- The main functional blocks in the execution unit, code dispatch will be, based on the instruction the E-Box, are the register file, program counter stream data and the current proccssor state. (PC), constant generator, shifter, and arithmetic The CVAX I-Box is designed to generate the and logical unit (ALU). The data path has two microcode dispatch address for every specifier precharged 32-bit read buses, called the A and B flow. This design differs from the MicroVAX CPU buses, and a static write bus, called the W bus. 78032 chip design; there, the I-Box provides the The functions performed by the E-Box during a dispatch address for just the first two specifiers of cycle are determined by the current microin- a macroinstruction and relies upon the micro- struction and internal state. Following are code to generate the dispatch address for addi- descriptions of each of the main functional tional specifier flows at a performance cost of one blocks. microcycle per specifier. The register file contains 3 1 single-read-port/ Primary subsections of the CVAX 78032 I-Box single-write-port registers and 8 dual-read-port/ include the instruction decode read-only memory single-writc-port registers. The register file is (ROM), the dispatch programmable logic array used in the data path where compact layout is (PLA), and the prefctch queue. especially important. Therefore, to save chip area The instruction decode ROM (IROM) contains the register file cell was designed using an NMOS the information about VAX macroinstructions pass gate rather than a full transmission gate. that is required to parse the instruction stream. The 32-bit PC is located in the data path along The IROM determines the number of specifiers with the program counter adder. This adder is for an instruction, the sizes of its operands, and a used to increment the PC as macroinstructions partial microaddress for the execution micro- are parsed. code of the instruction. Literals can be introduced into the data path by The dispatch PLA examines I-Box state, instruc- conditionally discharging the precharged A or B tion stream data, and other microprocessor states bus lines. to predict the next hardware-supplied microad- The shifter function is implemented as a data dress for the microsequencer. This PLA is self- extractor rather than a full shifter, which would timed and evaluates in slightly under one clock require more hardware. The extractor can extract phase. 32 contiguous bits from a 64-bit field. When the The I-Box instruction prefetch queue operates values on the input buses are identical, the high- in parallel with the instruction execution hard- order bits appear to wrap around to the low-order ware on the chip. Whenever a longword in the positions, thus mimicking a full shifter. instruction prefetch queue is empty, the I-Box The shifter has the two 32-bit precharged read issues a request to the M-Box to read the next buses (the A and B buses) as inputs and a 32-bit aligned longword in the instruction stream. If the output. The shifter is implemented using NMOS M-Box and BIU are not doing some other read or transistors. ?'he control diagonals are run in write operation, they will fetch the requested polysilicon strapped by metal at both ends.

Digital Technical Journal No. 7 August I988 CVAX-based Systems

Because the RC delay in asserting the control occurred. Therefore, all access protection fields lines is long, the control lines are driven before in the TB are simultaneously compared to the the input data is valid. The inputs are then condi- current access type while the translation is in tional ly pulled low, discharging the outputs. progress. This scheme requires that the access The ALU in the data path is capable of addition, protection field be fully decoded before it is subtraction, and a variety of logic operations. The stored in the TB. ALU also includes a 1-bit left/right shifter and In addition to interacting with the cache, the additional logic to support multiply and divide M-Box interfaces with the BIU and the I-Box. operations. The ALU is implemented using a The M-Box contains three registers: the virtual carry-lookahead scheme with propagate and gen- address (VA) register, the virtual address prime erate logic. 0)register, and the virtual instruction buffer The ability to read the register file, do an ALU address (VIBA) register. After a data read or or shift operation, and write the result back into write using VA or VAP, VAP is loaded with the the register file all in one cycle is important to most recently used address plus four. In this way, the machine's performance. This critical path VAP can quickly generate sequential longword was alleviated by partially overlapping the regis- addresses. During a memory operation, the ter file write with the next register file read. The M-Box sends the address to the cache and BIU. partial overlap introduces a race between the The M-Box will forward data from the E-Box write and the read, but the circuit delay in assert- during the next cycle if the operation is a write, ing the read select line is sufficient to ensure that or capture data for the E-Box if the operation is the race is always won without extending the a read. cycle time. Whenever there is space available for a long- word in the I-Box prefetch queue, the I-Box Memo y Management Unit requests instruction stream data. If the iv-Box When memory management is enabled, the M-Box does not decode a memory read or write request uses a fully associative translation look-aside from the current microinstruction, it services the buffer (TB) to translate virtual addresses to physi- instruction stream read request using the virtual cal memory addresscs. The major design goal for address stored in the VIBA register. After a the M-Box was to achieve a TB miss rate that was prefetch reference succeeds, the VIBA register is one third that of the lMicroVAX 78032 CPU chip. incremented by four in preparation for the next Consequently, we increased the size of the TB prefetch. from 8 to 28 page table entries (PTEs). Further- more, we used a more efficient microcode routine Bus Interface Unit to reduce the number of cycles required to fetch The bus interface unit, the BIU, controls external a PTE on a TB miss. A PTE is composed of the chip operations, internal cache access and higher order bits of the physical address, the refresh, and arbitration for the internal data and access protection field, and other memory man- address bus. The BIU contains two state agement information. In the MicroVAX 78032 machines. CPU chip, a least-recently-used algorithm was employed to replace the PTE on a TB miss. How- 'The internal state machine controls the arbi- ever, the implementation of this algorithm tration for the internal data and address bus requires complex circuits and a large amount of (IDALs) . chip area as the TB size is increased. For this rea- The external state machine, controls the arbi- son, we implemented a simpler but almost tration for the external pins and DALs. equally eflicient not-last-used algorithm in the CVAX 78034 CPU chip. The design goal was to achieve a single-cycle 'To realize a single-cycle cache read operation, read operation for hits to the internal cache and a both a virtual-to-physical address translation and two-cycle write operation for an ideal memory a check of the access protection field of the subsystem. In addition, better system reliability PTE must occur in just two clock phases. How- is achieved by providing parity protection on all ever, there is not enough time to check the the external data transfers and internal cache access protection field after the translation has read/write operations.

Digital Technical Journal 101 No. 7 August 1988 The CVAX 78034 Chip, u 32-62'1 Second-generation VAX Microprocessor

To accomplish a single-cycle read operation, requests. If the read reference misses the cache, the two state machines were implemented as self- it is queued and serviced only after the write timed PLAs that require just one phase to evalu- operation completes. This overlapping of read ate. The separation of control operations between and write operations reduces the number of the two state machines allowed the PLAs to oper- memory stall cycles, resulting in a lower TPI. ate in different phases. Read/write-related, inter- To facilitate support for multiprocessor appli- nal time-critical signals are generated by the cations and DMA activity, the BIU provides a pro- internal state machine. This state machine evalu- tocol for internal cache coherency. To activate ates first, stalls the CPU if necessary, controls the this function, an external device first gains own- cache, and sets states for the external state ership of the external address and data bus by machine. Time-critical external strobes are con- means of the DMA request and grant protocols. trolled by the external state machine. The exter- The device then presents an address, qualified by nal state machine operates next, controls the ter- certain strobes, to the processor. The processor mination of external operations, clears the latches the address and then performs a cache internal state machine flags, and grants control of look-up. If a cache hit occurs, the matching external buses and strobes to external devices. cache entry will be invalidated. On a cache miss, the external state machine Eight pins are dedicated to the floating point unconditionally drives the external read data to interface. To optimize the operand transfer rate the M-Box or the I-Box, and a phase later the state between the CVAX 78034 CPU and its floating machine validates the data. This scheme made it point processor, both chips read the floating possible to service the next microinstruction point operands from memory simultaneously. while the previous one was completing. The BIU also controls all memory transactions. Cache A memory read operation is performed in one The goals for the design of the internal cache cycle if there is a hit in the internal cache and no were twofold: to reduce the memory access time cache parity error is detected. However, when a to one rnicrocycle for data that is resident in the cache miss occurs during a read operation, a two- cache; and to minimize the number of cache ref- longword block in the cache is allocated to store erences that miss the cache. the data, which must now be read from memory. To achieve the one-microcycle access time, the The BIU stalls the CPU until the first longword of internal cache is designed to perform the cache data is received. The BIU initiates the external look-up in parallel with the translation buffer read cycle, sending the address of the first long- look-up. This scheme uses the 9 virtual address word to the external memory system. When the bits that do not change during the address transla- first longword of data is received, the BIU sends it tion process to index into the array. Because the to the cache and E-Box or I-Box, and unstalls the cache look-up and translation buffer look-up are CPU. The fetch of the second longword is over- performed in parallel, the data for the selected lapped with other chip activity to minimize the cache entry is ready when the translated address effective memory access time. The second long- is being latched into the tag comparator. The word of data is written into the alternate long- cache tag is then compared to the translated word in the allocated quadword (two longword) address. If a match occurs, the data is driven onto cache block. The cache block is validated only if the IDAL before the end of the cycle. both longwords in the block are fetched success- To achieve our second goal - minimization of fully. the number of cache misses - we used a two- The BIU contains a longword write buffer way set associative cache with a block size of which supports a dump-and-run write mecha- 8 bytes. This two-way set associative cache was nism. Chip activity, including cache reads, can designed to meet both performance and chip size proceed in parallel while the BIU is waiting for requirements. First, a random replacement algo- the completion of a write operation. The BIU may rithm was selected to reduce circuit complexity have up to three different operations in progress with a minimal impact on cache performance. at once: a write to memory, a read from memory, With reference to chip size, we determined that and an internal cache entry invalidation. Descrip- a cache size of 1KB was the largest that could be tions of these operations in the BIU follow. used. In addition, the cache is designed so that While a write to memory is awaiting cornple- it can be configured by software to act as an tion, the internal state machine can service read instruction-only cache or as an instruction and

Digital Tecbnical Journal No. 7 August 1988 CVAX-based Systems

data cache. The instruction-only option was pro- 200 rows. Selection of a row causes all eight vided to simplify hardware in multiprocessor words to bc driven onto the precharged bit lines systems where the designers do not want to deal which form the inputs of an 8 to 1 multiplexer. with DMA invalidates. The three remaining microaddress bits, 3 Thc cell chosen to implement the cache array through 1, choose one of these eight microwords is a one-transistor (IT) dynamic RAM. The 1'T to be driven onto the microinstruction bus. The cell, illustrated in Figure 4, was chosen because final value of bits 3 through 1 can be modified by of its small area. A comparable array design with values on the microtest bus. This 3-bit bus con- either a four-transistor dynamic RAM or a six-tran- veys state information from other sections of the sistor static RAM cell would have required 2.4 to chip to the microsequencer. In this way, various 3 times as much area. The storage capacitance of processor states may be polled to enable up to an the IT cell is 1 10 femtofarads, resulting in a bit- eight-way microcode branch. line to cell-capacitance ratio of 8 to 1. With a The primary function of the microsequencer is folded bit-line structure and the use of a dummy to supply microaddresses to the control store. cell (which stores half thc charge of the storage The microsequencer selects a microaddress cell), a voltage differential of 200 millivolts was based on microcode control and external control realized at the sense amplifiers. Because of the from the testability logic. In addition to generat- dynamic nature of the IT cell, a refresh counter, ing microaddresses, the microsequencer receives composed of linear feedback shift registers, was exception request lines from other sections, pri- designed to control which row is refreshed dur- oritizes these requests, and generates base ing idlc cache cycles. addresses for microcode exception senrice rou- We designed byte parity into the cache to tines. These base addresses can be modified by detect data corruption resulting from either soft the section signaling the exception by means of or hard errors. A study was done to determine thc the microtest bus. soft error rate of the cell. The soft error rate for The microsequencer contains a last-in-first-out the cache array was found to be 10 FITS, wherc (LIFO) queue of eight microaddress entries I FIT is equal to 1 failure in one trillion operat- called thc microstack. A latched copy of the ing hours. To protect against data corruption due microaddress bus is stored on the microstack to minority carrier injection, the array is sur- when a microcode exception occurs. Once the rounded by a deep N-type implant ring. exception has been serviced, this latched copy The CVAX CPIJ chip is the first microprocessor allows reexecution of the microinstruction that in the industry to include an on-chip dynamic caused the exception. In the case of a microcode 11~'ccllcache. subroutine call, the current microaddress is incremented and stored on the microstack. This forms the address when returning from the sub- Control Store and Microsequencer routine. The operations and interactions of the five func- tional blocks described so far are all controlled by microcodc in the control store. The micro- Testability Issues scqucncer supplies the microaddress to the As a complex microprocessor chip, the CVAX control store. The control store contains 78034 CPU chip has some difficult testability 1,600 words of read-only memory. Each 4 1 -bit issues. A large number of internal state bits and word is divided into a 28-bit field, which controls buses arc not normally visible at the pins of the the execution sections of thc chip, and ;I 13-bit chip. Early in the design process, techniques ficlcl, which controls the microsequencer. Con- such as Icvel-sensitive scan design (LSSD) and trol store access is achicvcd in less than three built-in self-test were eliminated as possible clock phases. testability strategies. Both of these strategies 'The control store is organized into 200 rows of would have had a significant impact on chip area 8 words each. H-shaped cells, 7 by 8 microns in and performance. Instead, an ad hoc method of size, arc used to implement the array. design for testability was developed ." A microaddress is supplied to the control store Thc design for testability strategy has two main by thc microsequencer by means of the 11-bit themes: ( 1 ) make maximum use of existing hard- micro;~dtlressbus (bits 10 through 0). Eight of ware for tcst obscrvability and controllability, thcsc bits, 10 through 4 and 0, select one oC the and (2) add special tcst hardware to those areas

Digital Tecbnical Journal No. 7 AitgusI I988 The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor

BIT LlNE iMz (METAL 1) BIT LlNE POL) WORD LlNE

METAL WORD WORD LlNE cN+l> LlNE S

/ VDD (POLY PLATE)

WORD LlNE cN>

WORD LlNE iN-lr

,VDD

WORD LlNE cN-2>

ACCESS DEVICE

1rl1l1lll1llllllllI1111I11I11111 EACH DIVISION = 1.Opm

KEY:

N+ SOURCE/DRAIN DIFFUSION 0METAL 1 0METAL 2 POLYSlLlCON METAL 1 CONTACT

Figure 4 Cache ITDynamic KAM Cell (Four Cell Shown)

Digital Tecbnkd Journal No. 7 August 1988 CVAX- based Systems

of the chip where observability or controllability the data to a single-bit output stream. In this would not otherwise be possible. manner, all of the LFSRs may be observed at once. The chip already had some important features In addition, all of the LFSR outputs are fed into a that could be exploited. multiplexer that allows any one of the registers to be observed. The chip is controlled by the microcode con- The test logic requires only one dedicated test rained in the control store. Thus, it is an obvi- pin to select test mode and uses less than 2 per- ous candidate for controlling the chip when in cent of the chip area. Moreover, inclusion of this test mode. logic does not affect chip performance. When in Many of the internal registers are readable and test mode, 3 to 15 other pins are redefined for writable from the internal buses. By transfer- test functions. A 4-bit test-mode configuration ring this read and write data to the main bus register selects which of the LFSRs is to be that connects to the pins (the DALs), much of observed, whether the LFSRs will be in scan or the internal state can be observed and compress mode, and whether or not test broad- modified. cast mode is enabled. The interface for the floating point coproces- The Role of Simulation and Modeling sor chip contains a mode that broadcasts a value from the internal cache or register file to Complexity was managed and detailed circuit the pins. This mode is also used during test for behavior was predicted through the use of mod- cache and register file observability. els and simulation During the design, the chip was modeled at five levels of abstraction. As the These features alone were not enough, however, design progressed from concepts to implementa- and some specialized test hardware had to be tion, the level of abstraction was refined to reflect added. the increasing detail of the design. To make use of the chip microcode in test mode, it is necessary to be able to externally Choosing the Microarchitecture choose the addresses of the microword to be The performance model was the earliest and he executed. l'hus, a test mode was added to the most abstract of all the models. The performance microsequencer. In this mode, the micro- model was used to predict the machine's perfor- sequencer ignores its normal choice for the mance and to quantify the speed advantage of the microaddress and uses the value from a group various microarchitectural options under consid- of pins. eration. Written in PL/I, the performance model was driven by trace files. These files consisted of The cache is dithcult to test in its normal oper- streams of opcodes and operand specifiers ating mode. To overcome this, a special cache derived by running typical VAX applications pro- diagnostics mode was developed. grams. The psuedo-microcode contained in the Some special test microcode was added to model approximately modeled memory request allow more ethcient testing of some areas. patterns and microinstruction counts for each type of VAX instruction. As we had planned, the A few major internal buses were not observ- performance model did indeed help predict the able. Dual mode linear feedback shift registers machine's TPI. Moreover, the model also helped (LFSRs) were added to these buses: the output identify performance bottlenecks in the micro- of the I-Box instruction decode ROM, the architecture. microinstruction bus, and the microtest bus. As noted in the section Project Goals, perfor- The cache refresh address counter is also mance is inversely proportional to the product of implemented as an LFSR. the TPl and cycle time. Specifically, the cycle time depends on the delay through the critical The dual mode LFSRs allow the data bus to be speed circuits. Therefore, to identify the critical captured and scanned out serially. Alternatively, circuits and determine the propagation delays the data can be compressed every cycle using the through the circuits, we carried out cycle time linear feedback technique. The outputs of the feasibility studies. SPICE, a circuit-level simula- LFSRs are inputs to another LFSR that combines tor, was used in these studies. With the chip die

Digital Technical Journal No. 7 August I988 The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor

size as a given requirement, we dctcrmined the Verification of Logic - Transistor microarchitecture of the machine by selecting Level those fcatures that minimize the product of TPI The DECSIM simulation tool also supports MOS and the cycle time. transistor-level modeling. We used this tool as a VeriJication of the Microarchitecture switch-level simulator, that is, we modeled tran- sistors as open or closed switches. The model was Once the microarchitecture was defined, a automatically generated from the schematic data- detailed specification was written for each sec- base. tion of the chip. Next, an abstract behavioral This level of modeling reflected the true behav- model was written to verify that the specification ior of the schematics with greater subtlery than described a VAX CPU. Much more detailed than the schematic-level behavior model. However, the performance model, this model was con- this model was not nearly as computationally trolled by microcode, ran real VAX code, and efficient as the behavioral model. closely modeled the major chip buses, global DECSIM MOS modeling identified sequencing signals, and clocks. The model was written in errors, charge sharing problems, sneak paths, and Digital's DECSIM behavioral modeling language. race conditions that the more abstract models had Many microcode and microarchitecture bugs failed to detect. were identified and fixed as a result of this behav- ioral model testing. Pbysical Technology Logic and Circuit Design The CVAX 78034 CPU chip is implemented in a P-EPI, N-well CMOS (complementary metal- The detailed logic and circuit design began while oxide-semiconductor) process developed in- the abstract behavioral model was being written. house. The process has two layers of aluminum During this phase of the design, SPICE simula- interconnect and a single layer of polysilicon. tions were used extensively to predict circuit The critical process dimensions and chip charac- behavior. Because SPICE simulates transistor teristics are summarized in Table 2. behavior in detail, it requires a large amount of 'The chip contains 180,000 transistor sites computer resources. Consequently only critical with 134,000 actual transistors, and measures circuits were simulated and these were often sim- 0.7 mm by 9.4 mm on a side. (See Figure 2.) It plified to contain only the essential elements. is packaged in an 84-pin surface-mountable Circuit simulations typically involve tens of tran- ceramic chip carrier with 50-mil leads, uses a sistors rather than hundreds or thousands. single +5 volt supply, and has a worst-case Verification of Logic - Gate Level power dissipation of 1.5watts. The abstract behavioral model had been used to verify the specification. Now it was necessary to verify the implementation of the specification To Table 2 CVAX Chip Process make this verification, we wrote a schematic-level behavioral model that captured the logical and Fabrication Process timing characteristics of every schematic. Almost Fabrication process CMOS every node was modeled explicitly. This essen- Gate oxide 300 A tially gate-level model was also written in the Substrate N-well in P-EPI DECSIM language. The model identified many Device types N-channel enhancement logic and timing bugs, especially between sche- MOSFET; matics designed by different engineers. P-channel enhancement The schematic-level behavioral model was sub- MOSFET jected to intensive verification because it offered a good compromise between implementation Interconnect Pitches (Linelspace Drawn) detail and simulation efficiency. This model of Polysilicon 2 micron12 micron the CVAX 78034 CPU chip was used by the sys- Metal 1 4 micron12 micron tem designers in other design teams to model the Metal 2 5 micron12 micron interaction of the CPU with other chips in board Contacts 2 micron12 micron designs.

Dfgftal TecbnfcalJoud No. 7 August 1988 CVAX-based I Systems

PERFORMANCE RATIO MtcroVAX 3500/3600 to MlcroVAX II

Figure 5 Micro VAX 3500/3600 and Micro VAX II System Benchmark Comparison

Summary Acknowledgments The CVAX 78034 CPU chip met the project The authors wish to acknowledge the technical design goals. Depending on the benchmark or contributions of D. Archer, D. Bhavsar, W. Bider- application program being run, the performance mann, S. Carroll, D. Deverell, J. Kccfe, S. Martin, of the MicroVAX 3500/3600 systems is 2.6 to A. Olesin, S. Persels, J. Reinschmidt, L. Rozek, 4.1 times that of the MicroVAX I1 computer. P. Rubinfeld, M. Schenstrom, D. Schurnacher, (Refer to Figure 5.) This performance increase B. Supnik, J. St. Laurent, T. Thrush, and was achieved by reducing both the TPI and the B. Worster. machine cycle time. The main factors influencing TPI are the lKB, on-chip cache; the 64KB on-board cache; Notes and References and the 28-entry virtual-to-physical address trans- lation buffer. 'l'he cycle time was reduced as a 1. D. Dobberpuhl, et al., "The MicroVAX result of the advanced process technology chosen 78032 Chip, A 32-Bit Microprocessor," and the architectural and circuit innovations Digital Technical Jot~rnal(March 1986): made by the design team. 12-23.

Digital Technical Journal No. 7 AugusI 1988 The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor

2. J. Beck, et al., "A 32-Bit Microprocessorwith 6. P. Rubinfeld, et al., "The CVAX CPU, A On-Chip Virtual Memory Management," CMOS VAX Microprocessor Chip," Proceed- IEEE International Solid-State Circuits ings of the 1387 IEEE International Conference Digest of Technical Papers Conference on Computer Design: VLSI in (1984): 178-179. Computers and Processor (October 1987) : 148-152. 3. To compute clock ticks per instruction (TPI), typical application programs and 7. D. Archer, et a].,"A CMOS VAX Microproces- benchmarks are first run. Then the number sor with On-Chip Cache and Memory Man- of clock cycles required to execute these agement," IEEE Journal of Solid State programs is divided by the total number of Circuits, vol. SC-22, no. 5 (October 1987): instructions executed. Thc result of the 849-852. computation is the TPI. 4. VAX Architecture Handbook (Maynard: 8. D. Archer, "The Instruction Parsing Micro- Digital Equipment Corporation, Order No. architecture of the CVAX Microprocessor," EB-19580, 1981). Proceedings of the 20th Annual Workshop on Microprogramming (December 1987) : 5. D. Archer, et al., "A 32-Bit Microprocessor 147-153. with On-Chip Instruction and Data Caching and Memory Management," IEEE Inter- 9. D. Bhavsar and D. Miner, "Testability Strat- national Solid-State Circuits Conference egy for a Second Generation VAX Micro- Digest of Technical Papers (February processor Chip," International Test Con- 1987): 32-33.329. ference (September 1987): 8 18-825.

108 Digild Technical Journal No. 7 August 1988 Edward J. McLelfan Gilbert M. Wolrich Robert AJ Yodfowski

Development of the CVAX Floating Point Chip

The CVAXJloatingpoint accelerator (CFPA) chip is a CMOSfloating point coprocessorfor the CVAX system. The purpose of the CFPA project was to provide gains injloating point perjormance equal to those of tbe CVAX CPU for integer pet$ormance. Combined witb an aggressive schedule, the primary go& required tbe CFPA cbip toperform at tbree times tbe leuel of tbe prmevr0zcsgeneration MicroVAXJloating point unit (FPU) and to be complete two years a@er delivery of tbe MicroVAX II system. Designevs obtained a performance gain of only 25percent tbrougb base technology improvements. Consequently, most gains are achieved through the use of a multiplier array, improved arithmetic algorithms, and a fast and ewent interface with the CPU.

Functional Overview Table 1 CFPA Physical Characteristics The CFPA VLSI chip is the companion floating Number of transistors 65,000 point processor for the CVAX CPU. The chip's Package 68-pin surface-mountable hardware structures and algorithms provide high chip carrier with 50-mil lead overall system performance. In all, the chip exe- spacing and heat sink cutes 76 instructions. Die size 7.3 mm x 9.1 mm The CFPA supports Power dissipation 1 W Three VAX floating point data types: Fabrication process 2 micron drawn, N-well, dual aluminum CMOS F-floating, D-floating, and G-floating Floating point calculations, which include a polynomial evaluation instruction Integer multiply and dividc instructions equal the central processor chip's expected per- Conversion between integer and floating point formance level for integer operations, and (2) to data types adhere to the same development schedule set for the CVAX CPU chip. Specifically, these goals Complete detection of all exception condi- required instruction execution times to be three tions times faster than the MicroVAX FPU on average. The CFPA operates synchronously with the Further, the schedule allowed little time to CPU at speeds of 80 and 90 nanoseconds (ns) per achieve these significant performance gains; the cycle. Opcode, control, and status information is design would have to be completed only two communicated between the coprocessor and the years after the MicroVAX I1 system design. CVAX by means of a dedicated 8-bit bidirectional In order to improve computer performance, coprocessor bus. the clock frequency and/or the amount of work Table 1 lists the CFPA physical characteristics. completed in a cycle must be increased. The CVAX CPU uses the improved speed characteris- CFPA Project Goals tics and greater density of the CMOS process to The two main goals of the CFPA chip design pro- reduce the clock cycle time from 200 ns in the ject were ( 1) to provide the CVAX system with an MicroVAX I1 design to 80 or 90 ns. A pipelined improvement in floating point performance to architectural approach was necessary to achieve

Digital Tecbnical Journal 109 No. 7 August 1988 Development of the CVAX Floating Point Chip

this reduction. In particular, while the arithmetic and logic unit (MU) operates on one microin- struction, the register file is free to access data for the next microinstruction. This improvement allows more work to be completed in each micro- fCVAX BUS cycle and offcrs a reduction in the cycle time as well. The previous generation floating point design, used in the J-11 FPA as well as the MicroVAX I1 SECOND-LEVEL and VAX 8200/8300 systems, already pipelincd register file access with ALU operations. This pipelining was necessary to allow a 100-nscycle Figure 1 CFPAExampleSystem time - twice the frequency of the companion ConJguration CPUs - in the ZMOS process technology. Since the pipelined register/ALU operation was already achieved, the improvement in cycle time for After supplying operands to the CFPA, the the CFPA is limited by the speed of the ALU and CVAX relinquishes control of the coprocessor bus does not benefit from additional pipelining. The to receive the result status of the floating point improved technology allowed for an ALU imple- operation. Control of the coprocessor bus, how- mentation that provides a 20 percent decrease ever, does not imply control of the CVAX system in cycle time, matching the CVAX microcycle. bus. The CFPA ensures availability of the CVAX Therefore, the necessary performance increases system bus by monitoring the direct memory for the CFPA would not be created by scaling the access (DMA) grant signal from the CVAX. If a cycle time. Instead, CFPA designers would make DMA has been granted, the floating point result improvements in the amount of work done per status will be retransmitted until the DMA opera- microcycle and in the interface between the pro- tion is complete. Receipt of the floating point cessor and the floating point chip. This interface status while the DMA grant signal is deasserted is described in the following section. guarantees availability of the CVAX system bus An overview of the chip's overall performance for the next cycle. Control of the coprocessor bus is presented in the section CFPA Performance at is returned to the CVAX after successfully driving the end of this paper. floating point status. The CFPA drives the result data on the CVAX system bus one cycle later, Processor-to-bus Interface completing the operation. In addition to the CVAX system bus used to trans- Floating point instruction latency comprises fer floating point data, a dedicated 8-bit bidirec- overhead devoted to opcode, operand and result tional coprocessor bus is used to communicate transfer, and actual computation, or execution betwccn theCVAXand theCFPA.Anexampleof the time. Due solcly to improvement in CVAX cycle CFPA system configuration is shown in Figure 1. time - from 200 ns in MicroVAX systems to The CFPA normally monitors the coprocessor bus 80 or 90 ns in CVAX systems - overhead times for opcode and operand information until it is are improved by factors of 2.5 or 2.2, respec- ready to drive a result back to the CVAX. After tively. Designers achieved additional improve- decoding an opcode, the CFPA monitors control mcnts in the interface by reducing the actual signals on the bus that indicate the presence of an number of cycles required for these overhead operand. Operands may come from a CPU general transfers. As compared to the MicroVAX I1 sys- registcr, internal cache location, or from the tem, the CVAX system requires fewer cycles to memory system. When operands arc transferred access and transmit register and internal cache from CPU general registers or internal cache operands located on the chip. Moreover, external locations, the data is transmitted directly cache and memory operands are input directly between the CVAX and the CFPA. Operands from from the CVAX system bus as opposed to being external memory or cache locations are indicated fetched by the CPU and later retransmitted to the on the coprocessor bus at the start of thc external FPU as in the MicroVAX I1 system. The resulting memory access. The CFPA then monitors the interface improves performance by a factor of CVAX system bus and latches the returning data approximatcly 2.5 (90-ns cycle) to 2.8 (80-ns without CVAX intervention. cyclc) over the MicroVAX I1 system.

110 Digital Tecbnlcal Journal No. 7 August 1988 CVAX-based Systems

Despite these improven~ents,more than half until the final sum is formed through the use of the cycles required to execute a floating point carry save adders. Multiplier arrays consist of instruction in the CVAX system can still be rows of carry save adders which add in a new mul- attributed to overhead costs. The possibility of tiple of the multiplicand at each row. The carry pipelining macroinstructions - overlapping the save adders produce a result, or partial product, operand fetches of the next instruction with exe- consisting of two outputs, the carry and the sum; cution of the current instruction -as well as if added, the two outputs represent a single num- operand forwarding was studied. In such a system ber equivalent to the partial product at that the effective instruction time is determined by step obtained using full propagation addition. By the longer of the operand transfers or the actual deferring the final summation of the sum and floating point execution time. Instruction time is carry words, the comparatively time-consuming not determined by the additive effect of the inter- carry propagation addition need be performed face and execution. The one-instruction macro only once to produce the result. pipeline interface was rejected due to the risk The only drawback to the multiplier array is and coniplexity of the design. Moreover, perfor- the large percentage of chip area devoted to this mance goals had already been met and develop- one operation. Nevertheless, the magnitude of ment time was at a premium. performance gain warrants the use of an array in any high-performance computation unit. Another common method used to improve the Although the interface figures prominently in the processing of multiplications involves multiple- achievement of overall performance targets, most bit Booth encoding. This method, which requires of our design efforts were focused on the actual significantly less hardware, is aimed at reducing execution unit. To maintain and even increase the number of partial products needed to be the benefits gained by the interface design formed. The multiplier operand is encoded - or improvements, we needed an equal or greater recoded -as a control pattern used to deter- improvement in cxecution times. Since the most mine a sequence of shift and add or subtract oper- important instructions for a floating point unit ations on the multiplicand. Multiple bits of the are addition/subtraction, multiplication, and to a multiplier can then be retired in a single opera- lesser extent division, designers set about opti- tion. This method of reducing the number of mizing these instructions. The remainder of multiplication steps can be employed either with instructions implemented by the CFPA also or without an array structure. benefit from the shift, multiply, and divide opti- The previous generation MicroVAX FPU exe- mizations and demonstrate performance gains cutes n~ultiplicationusing a fixed, 3-bit-per-cycle relative to the MicroVAX I1 FPU as well. Finally, Booth algorithm without the use of a multiplier all instructions gain from microcode improve- array. Single-precision multiplication requires 8 ments in atypical case handling and from faster cycles to compute 25 product bits; D-floating code entry and exit techniques. and G-floating double-precision formats require 19 and 18 cycles to produce the necessary 57 or Multiplication 54 product bits. Additional cycles are needed to Floating point multiplication consists of multipli- set LIPthe multiply loop, calculate the initial par- cation of the fractional, or mantissa, portions of tial product based upon the multiplier least- the operands and the summation of the corre- significant bit (LSB), and round and normalize sponding exponents. Many multiplication tech- the final product. niques have been developed and implemented to The CFPA multiply algorithm takes advantage increase the speed of this frequently executed of the greater density and transistor count instruction. Perhaps the best technique for VLSI afforded by the CMOS process. The CFPA imple- implementation at this time is the multiplier ments a multiplier array, which consists of four array. The array is particularly well suited for rows of 65 carry save adders. The multiplicand VLSI implementation due to the array's regularity select logic associated with each row of the array of circuit connections which allow for a very as well as the interconnect between the rows is compact and repeatable cell design. configured to implement a 2-bit Booth encoding. The process of multiplication involves a series As a result of this configuration, 8 product bits of additions. It is possible to delay the carry prop- are completed per pass through the array. Single- agation necessary to complete these additions precision multiplication requires three passes

Digital Technical Journal No. 7 A~rgltsr 1988 Development of the CVAXFloating Point Chip

through the array, and double-precision requires array of 4 rows of 2-bit-per-row retirement re- seven passes to complete. quiring only 1 ..3 mm of chip height. The result is The array can be evaluated twice per cycle. a three and one-half to four times gain in the over- Therefore, single-precision multiplication re- all performance of multiplication. quires one and one-half cycles, and double-preci- sion D-floating and G-floating formats require three and one-half cycles of processing in the Floating point addition involves a series of steps array. Before running the array, one-half cycle is needed for set up and initial product calculation. 1. The exponents are subtracted to determine After the multiplier array completes, a cycle is the shift amount necessary to align the frac- used to complete the full carry propagate add, tions. which combines the final carry and sum outputs 2. The fraction operand with the smaller of the array. This cycle is followed by a normal- exponent is shifted into alignment and ization cycle during which valid status is added or subtracted. returned to the CVAX. When we compare the MicroVAX I1 system to 3. The result is shifted back to the normalized the CFPA, the number of cycles required to com- form (5 result < 1.0). Normalization plete a MULF instruction has becn reduced from shifting is accompanied by exponent adjust- 14 to4 (aratioof3.9 to 1 at90ns, 4.4 to 1 at80 mcnt. ns); to complete MULD or MULG instructions, 4. The result is rounded and checked for the reduction is from 26 to 6 (4.8 to I at 90 ns, overflow or underflow conditions. 5.4 to 1 at 80ns). If we include operand transfers and count each interface cycle of the MicroVAX I1 Typically, the shifting operations and their con- system as equivalent to two CVAX cycles, how- trol consume large amounts of chip area and ever, the reduction in the total number of cycles potentially a large portion of the total calculation for MULF is from 27 to 9 (3.3 to 1 at 90 ns, 3.8 to time. An analysis of these operations was used to 1 at 80 ns); and for MULD, from 43 to 14 (3.4 to guide trade-offs in the design of the CFPA. ' It was 1 at 90 ns, 3.8 to 1 at 80 ns) for register-mode noted that although large shifts arc sometimes instructions. When operands are read from or necessary to compute the final result, their fre- written to memory, the overhcad support per- quency of occurrence is very small. Furthermore, centage becomes an even greater factor; and thc a small shifter, capable of covering the vast impact of the actual CFPA multiplication speed js majority of cases in a single operation provides reduccd the benefit of a small control circuit that can be To further increase performance, we consid- more casily optirnizcd for speed. It was decided ered an array of sufficient size to complete single- that thc speed and area advantages gained by precision multiplication in a sjngle pass and designing for the most frequently occurring cases double-precision multiplication in two passes. providcd the best solution under project con- However, such an array would require thrcc straints. times the chip area for a 2-bit algorithm. A 3-bit- Specifically, a small shifter that is capable of per-row multiply would require 8 rows to com- left-four to right-seven bit shifts proved to have plcte single-precision multiplication in one pass adequate range for most alignment and normal- and 9 or 10 rom7s to complete double-precision ization shifts. In up to 80 percent of the cases, multiplication in two passes, as well as an adder additional cycles arc not needed for alignment to calculate the multiplicand factor of 3. Either of shifting. Larger alignment shifting utilizes the these alternatives, if feasible, would save only multiplier array for a shift capability of 16 bits one cycle in single-precision (a reduction from 9 per cycle. The array minimizes the worst-case to 8, or 11 percent) and two cycles in double- shift time without requiring a large shifter. precision multiplication (14 to 12, or 14 per- Although it rarely requires additional cycles, nor- cent). Ln addition to the area requirements, the malization shifting may cause a longer latency. circuit design difficulty and risk involved to Additional cycles, however, are not necessary for implement a larger array were deemed much too normalization in 93 percent of the cases. great for the limited gains. We therefore chose to To reduce the shifter control complexity, a trade off these smaller gains in favor of a partial modified ALIJ calculates the absolute value of the

112 Digital Tecbnical Journal No. 7 August 1988 CVAX-based Systems

cxponent difference. The modificd ALL1 docs not method calls for shifting over, or normalizing, require additional calculation timc to accom- multiple leading bits when the partial remainder plish this calculation. The absolute value result is small. A partial remainder with multiple lead- simplifies control logic to enable the alignment ing ones indicates a small negative remainder, shifter to con~pletein the next clock phase. Only whereas leading zeros indicate a small positive one additional generate tcrm is needed to enable remainder. Multiple quotient bits can be deter- two carry chains executing simultaneously; one mined for cycles in which the magnitude of the calculates A minus B, the other B minus A. The partial remainder is small. Shift operations most significant bit (MSB) of the first carry chain replace arithmetic operations on unnormalized determines the sign of the operation. To produce remainders, reducing the number of ALU cycles the absolutc value or positive result, the MSB of needed to develop the final quotient. This thc first carry chain is ilscd to select the final our- method of division is called normalizing, non- put from the two carry chains. In addition, the restoring division and is also used in the MSD is uscd to select the fraction requiring MicroVAX FPU. The differcnce between the two alignment. implementations is in the normalization shift 'I'hc CFPA completes addition or subtraction range provided for partial remainder and quo- opcrations in three cycles for most cases. This tient developmcnt. minimum execution timc is exceeded for only Of course, this algorithm is quite data sensi- 25 percent of all addition or subtraction opcra- tive. A division that results in a partial remain- tions, almost all of which require only one addi- der of all ones or all zeros can be completed tional cycle. in a minimum amount of time; whereas, if a l'he major improvement over the MicroVAX 11 string of alternating ones and zeros is produced FPIJ in the addition/si~btractionalgorithm is the at each ALlJ operation, the process degener- elimination of no-operation cycles necessary for ates to a one-bit-per-cycle pace. The observed control eva luation preceding the alignment and average rate for an algorithm that allows normalization steps. The resultant reduction as unlimited shift range is 2.66 bits per cycle. compared to the MicroVAX I1 FPU is from eight Unfortunately, the shift range chosen implies a cyclcs to thrcc for both single- and double-preci- control structure directly between the shift sion additions/subtractions in the actual floating and ALU operations. The time between these point unit calculations (3 to 1 at 90 ns, 3.3 to opcrations is critically important to the over- 1 at 80ns). all cycle of the chip. We chose 4 bits as the The overall performance gain in equivalent left shift rangc for the CFPA to reap the maxi- cyclcs is 20 to 8 for single-precision (2.8to I at muln benefit from the technique without intro- 00 ns, 3.1 to 1 at 80 ns) and 26 to 11 for doublc- ducing inordinately difficult control paths precision addition/subtraction (2.6 to 1 at 90 ns, between the shift and ALU operations. This 3.0 to 1 at80ns). amounts to an increase of 2 bits of shift range over the MicroVAX FPU . Correspondingly, the Division average number of quotient bits developed Floating point division consists of division of each cycle increased from 1.5 to 2.4. Expand- thc fraction or mantissa and subtraction of the ing the shifter beyond a range of 4 for this exponents. Division presents a more intractable method provides a diminishing improvement, as problem than multiplication when designing for shown in Table 2. high-speed pcrformancc. The difficulty arises due to the fact that the partial remainder at each stcp must be cxarnincd bcfore thc ncxt operation Table 2 Average Quotient Bits per Cycle can Ile dctcrn~incd.Various algorithms have been proposed to reduce thc number of arithmetic Shifter Range Average Speed steps,, but no single solution seems to optimize 2 1.5 both perforn1;lnce and sizc constraints. 4 2.39 Thc CFPA uses a method of division that offers 6 2.54 an improvement over single-bit division algo- 8 2.64 rithms, which perform an arithmetic operation Unlimited 2.66 to produce a single quotient bit per step. The

Digital Techt~icalJournal 113 No. 7 Augrrsl 1988 Oeuelopment of the CVAX Floating Point Chip

Lncreasing the number of quotient bits devel- 2.5 bits per cycle as compared to 1 bit per cycle oped per cycle from 1.5 to 2.4 res~~ltsin on the MicroVAX FPIJ. increased speeds in the CFPA divide loop relative to the MicroVAX FPIJ: 1.8 times greater for Microcode Control Structure 90-ns cycles, and 2.0 times greater for 80-ns The control structure for the CFPA is influenced cycles. The overhead cycles involved in sctting by two opposing constraints. The complicated up the divide sequence and normalizing the quo- requirements of instructions such as extended tient are reduced from 7 to 2. As a result, the multiply and integerize (EMOD) and polynomial CFPA rcalizes a performance greater than the evaluation (POLY) req~iirethe flexibility offered MicroVAX 11 FPU in terms of number of cycles by a microcoded approach. Performance goals, reduced for division. including the processor-to- however. require the speed of hardwired control FPU interface cycles, the number of cyclcs for structures to avoid costly delays incurred during single-precision division is reduced from 37 to microcode branch handling. The final irnple- 18 cyclcs (2.3 at 90 ns, 2.6 at 80 ns); for mentation combines a small control PLA (pro- D-floating double-precision division, 6 1 to 3 5 grammable logic array) to provide the flexibility (1.94 at 90 ns, 2.2 at 80 ns). of microcode control with hardware control Comparatively, this method of division is very structures for speed critical paths. These control efficient, especially when we consider the small structures are enabled through the microcode to amount of control circuitry and data path area emulate complete hardwired control for impor- required. Designers can increase performance tant instructions. The structures provide support additionally by using algorithms that cniploy for alignment, normalization, multiplication and multiples of the divisor, or by implementing a division steps. Standard microcode control sup- divider array structure. The use of multiples of ports the Less critical instructions. the divisor requires both additional registers to Functions are performed under more straight- hold the multiples (3/4, 1, 3/2) and further forward microcode control when the code does expansion of the left shift capability to take not penalize the instruction performance. This advantage of the longer normalizations created by trade-off simplifies critical circuitry in some this approach (3.6 bits per cycle with left shift instances. 'The only exception to this rule is in range expanded to 6). In addition, the control the handling of exception conditions. If an logic rcquired to support the selection of the exception condition can be isolated from the nor- proper multiple is more complex and would be mal instruction flow, it is also processed in much more difficult to implement in the con- microcode rather than through the more expen- strained cycle tinic. The other alternative of exe- sive hardware control. cuting the divide step in an array structure for The use of hardware structures reduces performance capable of 3 to 4 quotient bits per the total number of microcode terms needed cycle involves an even greater cost in hardware to implement the instruction set. This reduc- and is not consistent with the project goals. tion is important to ensure that the microcode Integer division docs not automatically bene- PLA can be implemented with an access time fit from hardware devoted to floating point divi- of one half cycle. Instructions generally use sion. Since floating point division relies on one code tlow for all data types. In addition, the normalization of the operands, integer divi- similar instructions merge sections of flows to sion must either convert operands to the normal- further minimize terms. For example, the add- ized form or accept a slower one-bit-per-cycle compare-and-branch (ACB) instruction, which is algorithm. The CFPA design for integer division one of the more complicated instructions imple- normalizes both the divisor and dividend in mented by the chip, required only three addi- order to use the 2.4-bit-per-cycle divide algo- tional terms beyond the addition and compare rithm. Normalization of the divisor and dividend instruction flows. Despite this effort, almost proceeds at 5 bits per cycle. The number one third of the code was devoted exclusively of quotient bits needcd to complete the integer to two instruction types, EMOD and POLY. division operation is determined by the differ- By splitting, or "folding," the PLA into two ence between the normalization shift amounts half-height interleaved arrays, the target speed of the divisor and dividend. Consequently, was met with a penalty of only a few dupli- integer divides arc typically executed at cated terms. In total, 76 VAX floating point

114 Digital Technical Journal No. 7 August 1388 CVAX-based Systems

as well as integer multiply and divide instruc- can be returned prior to determination of the tions are implemented in the CFPA. In compari- final result. son to the MicroVAX FPU, the total number of microcode states was reduced by 20 percent, to CFPA Implementation only 159. After deciding on a set of basic algorithms that appeared to meet the project goals, the develop- Microprogramming ment effort proceeded to actual implementation. As mentioned earlier, the use of hardware Individual algorithms can sometimes result in a support contributes to improved performance proposed hardware solution that requires modifi- for most instn~ctions.However, since the CFPA cations to either the hardware or to the algorithm cycle time during execution is very similar to in order to be implemented within design con- that of the MicroVAX FPU (80 or 90 ns versus straints. Merging the requirements of several 100 ns), we needed further improvement to meet algorithms can create implementation conflicts the project goals. Algorithmic improvement in throughout the physical design. Care must be the convert-floating-to-integer (CVTFI) and taken to consider the opposing requirements EMOD instructions provides between three and while incorporating the necessary features in a four times the performance of the MicroVAX FPU single design. The algorithms for the CFPA were for the same instructions. But these gains would chosen with a single hardware microarchitecture harclly translate to improved overall performance in mind. That architecture evolved as the design when considering the frequency of use for progressed, but the architecture maintained the these instructions. Therefore, to reduce cycles basic structure that was used as a framework for for all instructions, we examined transitions early circuit design and feasibility study. The fol- during code entry and exit with internal proces- lowing section outlines the overall hardware sing. Since the CFPA always receives the opcode ~nicroarchitecturefor the CFPA. This section is in i~dvanceof the operands, it is possible to followed by explanations of the more interesting reduce the execution time for all instructions circuit design issues. by performing the first step of each operation rcpei~tccllyin anticipation of receiving the last Microarchitecture operand. In this way, as soon as the interface recognize that the operand is valid and the The CFPA contains two main functional units: control sequencer is able to act on that informa- The execution unit, which performs all arith- tion. the first step of the instruction is already metic calculations complete. In the CVAX system, as in the blicroVAX I1 sys- The bus interface unit (BIU), which controls teni, floating point status must be returned before all 1/0 operations data can be received. One reason for this return of status is that it prepares the write path back to A block diagram of these units is shown in the general-purpose register file located on the Figure 2. CPll chip. Status conditions must be checked The execution unit consists of two main data before the result register is written; the register paths and their associated control logic. The update can thus be inhibited in the case of an 65-bit fraction data path contains an integral error or exception condition. Latency was multiplier array and also processes integer data. reduced on almost all instructions by transmit- Also included in the fraction data path are a small ting the result statils back to the CVAX CPU in the 4-bit left to 7-bit right shifter, a general-purpose sanic cycle as the last step of execution. This is ALU, scratch register, ROM constants, and quo- ;~ccomplishedby checking the result prior to the tient register and shifter. The second data path, last normalization or round operation in order to the exponcnt data path, is 1 3 bits wide and con- determine if the possibility of an exception con- tains a modified ALU design used to calculate dit ion exists. Since F-floating and D-floating absolute values needed in floating point addition. formats use an exponent with a range of 256 val- The exponent data path operates in parallel with ues, and G-floating format increases that range the fraction data path and may be controlled inde- to 2,048 possible values, the exponcnt is in pendently or conditionally based upon results range for most results, and a no exception status from the fraction data path. A 160-term PLA,

Digital Technical Journal 115 No. 7 Augrrsl I988 I)e~lelopmc~ntof the CVAX Flocltirig Point Chip

f 32-BIT I/O BUS

KEY: BIU - BUS INTERFACE UNIT DAL - DATAIADDRESS LINES

I:i~~lre2CFPA RlockDia~r~~rn which accesses a single 44-bit microword each latches clocked on nonconsecutive phases, which cycle. controls the execution illlit. are nonoverlapping. 'I'hc RJU controls the intcrhce bctwccn the CPU and memory system. A 70-term PM in the unit controls all 1/0 transactions between the As notctl in the scction Multiplication, it was rec- CVAX ant1 CFPA. The BIlJ also controls the test- ognized early in thc chip design that the multi- mode logic to allow visibility to thc data paths plier array would be key to meeting the desired ancl execution unit PLA during operation. perforniance. The CFPA i~nplementsmultiplica- Figure 3 illustrates the physical layout of these tion by using an array of carry save adders with structures on the CFPA die. partial product wraparound. The wraparound enables rhc array to be cycled as many times as Circuit Design necessary. The final carry and sum addition is csecutc~lin the fraction MU. A static irnplernen- Clocking tation of the carry save adders is necessary since 'l'hc CFPA chip employs a four-phase overlapping data propagates through multiple rows of the clocking scheme which provides timing resolu- array. tion. Much of the control circuitry design calls To build the carry save adders, we used a four- for combinational circuits that opcratc between transistor XOR. This approach allowed for mini-

Digital Technical Journal No. 7 A14gust 1988 CVAX-based Systems

Figure 3 CFPA Yh.ysical Lrr)~out mum tlcla), and rcquirctl the least ;rniount of chip worst-c;~scdevices, a half cycle takes 45 ns. An area. As ;I result of SPICE simulation, we found array size of four rows takes 26 ns to propagate that doubling the minimum size of the transistors through the array, allowing 19 ns for latching, in the multiplier ;lrray could provitlc a 20 pcr- return of partial products, and control switching. cent spcctl incrc;isc. Since the cell ;lrca was con- For typic;il devices four rows complete in 18 ns, strainctl by the nccessiiry interconnect in the allowing 22 ns in an 80-11s cycle for the metal layers, the device sizes were increased wraparound path. without affecting the cell size Further device size increases, however, would have forced us to Control /'LA increase the cell size and would not have We also recognizccl the fraction shift control PLA irnprovctl speed appreciably due to incrc;tsed ;IS a possible speed limiti~rion.The shift control self-loading With the approach we chose. SPLCE PLA was the largest PLA in the control section and simulation showctl a worst-case delay of 0.5 ns had to evaluate in a single clock phase. Because per row and a typical delay of 4.5 ns. no clock signals wcre available to control evalua- 'So obtain the desired multiplication pcrfor- tion of the PLA, we used ;i "dummy" AND array niancc and minimize thc area necessary for the term to start evaluation of the OR array. A multiplier array, we used ;I technique in which "dummy" OR line controls output clocking, mak- the array is cycled twice per microcycJe. 1L)r ing the P1.A self-timed. Bcc;~usethis PLA could be

Digital Technical Journal 117 NO. ' Atr~lrsl 1988 Development of the CVAXFloating Point Chip

evaluated in a single clock phase, both alignment path controls, exponent data path, exponent data and normalization operations were able to elimi- path controls, microsequencer, bus interface nate an unnecessary wdit cycle present on the unit, and clock generator. Consistent divisions MicroVAX FPU. We were also able to expand the and global signals between these major sections divide algorithm to 4 bit shifts per cycle. were maintained in both the behavioral and tran- As; we had suspected, the limiting factor in the sistor modeling levels as well as in the final mask final chip cycle time was the multiplier array. artwork. This approach allows maximum possi- The ALUs and the large control PLAs in both the ble checking to be carried out on each section, microcode control section and the BIU easily met independent of the state of other sections of speed requirements in the CMOS I process. the chip. Upon con~pletionof the initial design concep- Design Methodology tion, a behavioral model was written in the As VLSl technology improves, both chip area and DECSlM simulation language. This model helped density increasc, allowing much larger and more us to refine the algorithms and further define the complicated designs to be attempted. Critical to data path and control structures. We rewrote the any large project, the ability to predict and adjust model several times to improve detail and incor- the design according to the most current infor- porate design changes. From early in the develop- matjon plays an important role in achieving a suc- ment. the behavioral model was merged with cessful project outcome in a minimum of time. the CVM CPU chip model and a small system envi- This section describes the various phases and ronment to provide a platform for more extensive feedback paths of the design process for the CFPA testing. Existing diagnostic programs were there- and some of the unique aspects of VLSl design. fore able to be run on the model to provide early In the first phase of design, we defined the checks on the design integrity. Additional tests major sections and the necessary global signals were written to verify specific features of the communicating between them. The major out- CFPA implementation before we began the puts of this phase were hand-drawn sets of notes detailed circuit design for critical sections. on the necessary functions of each section and Throughout the development phase, we used the preliminary sketches of possible implementa- VAX Architectural Exerciser (AXE) extensively tions. Early in the design, we recognized that cer- to test instruction compatibility with existing tain subsections would be critical to meeting the VAX implementations. Despite a degradation of desired performance goals. These particularly approximately 1M : 1 while using the simulator critical sections were to run test code, well over 500,000 test cases were run on the behavioral model before the The ~nultiplierarray design was considered ready for fabrication. 'I'he exponcnt input path Using the DECSIM MOS device simulation sys- tem, we created a transistor-level model from The fraction shiftcr controls final schematics as they were completed. By col- We therefore generated more detailed prelimi- lecting test patterns from the appropriate signals nary designs for all of these sections. Moreover in the behavioral model, the team could begin to we tested their feasibility with SPICE circuit sini- debug the schematic in complete sections as ulations. The MSB and LSB logic in the multiplier other sections were still being designed. To do was also verified with an APL language simulation this efficiently, the DECSIM group modified their of the multiplier array. simulator to allow designers to write a binary One of the hazards in the early stages of a pro- state file and reload the file for examination. This ject is the tendency to spend too much effort per- facility gave logic designers a very efficient means fecting one small piece of the design If the origi- to debug the transistor-level logic. Designers nal requirements are modified at a latcr dare. could run thcir simulations in batch mode over much time is wasted. The design team, therefore, night, examine the resulting patterns for mis- made a conscious effort to keep all parts of the matches with the behavioral model results, and design at similar levels of detail at all times then "back up" to the area before the failure test throughout the project. point to find the underlying cause. They could For purposes of design checking and chip perform all these steps without rerunning the implementation, we divided the CFPA into seven entirc simulation each time they wanted to go major sections: fraction data path, fraction data back in time to look at another signal.

Digital TecbnfcalJournal No. 7 August 1988 CVAX-based Systems

As each section of the transistor-level sche- Test Features matic was developed to a satisfactory level of To aid the debugging process and provide more xccuracy, the third phase of the design - complete test coverage, the BIU contains test creation of the physical layout artwork - began logic, This logic allows visibility to both data on that section. To create the artwork, a Cal~ua paths or to the main PL4. A simple test load GDS interac~iveediting system was used. Over sequence allows one of 16 possible test modes to the course of the project. three layout designers bc selected. Various groups of internal data path werc employed full time. Toward the end of the and control bits and two test-drive timing options layout phase. up to four additional designers are allowed. The test mode can be enabled or dis- were working on various parts of the chip. Each abled at any time by asserting a single test pin. section was checked with the interconnect Certain test modes are available while operating verification (I\') wirelist extraction tool and a at full speed in a system configuration. design rule checker (DRC) program. As. all the sections were drawn and global inter- CFPA Performance connect wiring was added to the chip layout, the Although there is no absolute measure of perfor- fourth phase of the design - the back end mance in computer system design, the floating checks - began. The 1V program was used to point performance of the CVAX system is com- extract actual capacitance values for all nodcs on pared at approximately three times the perfor- the chip. We used these capacitance values in mance of the MicroVAX 11 system. Using some of two ways to check the design. Fjrst, they were the more widely publicized benchmarks of compiled into the DECSIM MOS transistor-lcvcl floating point system performance, the CVAX sys- sirnulator. The timing feature of this tool was tem with CFPA running at 25 MHz shows better used to quickly check for gross timing problcnis than three times the speed of the MicroVAX 11 over the entire chip operating as a whole. Once with FPU. The system calculates 3,105K single- we idcntifietl ;in area ;IS having a possible timing precision Whetstone instructions per second and probleni and for those areas where we believed 1,996K double-precision Whetstone instructions the DECSIM MOS siniul:ttion was inaccurate, we pcr second. Linpack performance of 0.68 Mflops created and rdn SPICE circuit sinlulations. In a single-precision and 0.45 Mflops double-preci- second use of thc extracted capacitance values, a sion demonstrate over four times the perfor- program c;tlled PATH was written in the SCAN mance of the previous generation MicroVAX compi lcr generator language. PA'I'H allowed the implementation. circuit designers to easily and accurately create Table 3 lists the typical cycle counts for regis- wirelists representing critical paths for submis- tcr-to-register execution of floating point addi- sion to SPICE. The program extracts a circuit tion, subtraction, multiplication, and division. path description from the much larger wirelists gcncrated from either the IV tool or the chip- wide schematics. Wirelists created by the IV pro- gram include interconnect and capacitance infor- mation directly from layout artwork. Table 3 CFPA Cycle Counts for Optimized Although the chip design process appears in Instructions this cliscussion to be a neat progression, thc Opcode/ various aspccts of the actual project quickly ovcr- CFPA Operand Total lapped one another. Almost all phases were tak- Instruction Cycles Transfers Cycles ing place simultaneously on the various sections ADDFISUBF 3 5 8 of the chip. -1-0 keep track of all these activities MULF and continually update the project completion 4 5 9 DlVF date, we used a spreatl-sheet program as a track- 13 5 18 ing tool. ADDDISUBD 'I'he design team of I I people completed the MULD projcct in 2 1 months, including 6 months for DlVD product conception and 15 months for imple- mentation. Duc to the extensive modeling ant1 ADDGISUBG 4 8 12 MULG simulation prior to device fabrication, initial 6 8 14 DlVG parts werc functional at speed. 26 8 34

Digifal Techtricul Journul No AI~~LISI1088 Development ofthe CVAX Floating Point Chip

Acknowledgments Technical Journal (August 1988, this issue): In addition to the authors, members of the CFPA 95- 108. team were Roy Badeau, John Kowaleski, Omar E Mcl,ellan, G. Wolrich, et al., "The CVAX FPU, A Malik, and John Ellis who contributed to the logic CIMOS VAX Floating Point Unit," ICCD Proceed- and circuit design as well as Katie Alexandrowicz, ings (October 1987): 153- 156. Mike Benoit, Larry Commodore, Sharon Crafts, Bob Hicks, Dennis Hodges, Ellen Kagan, Marc P. Rubinfeld, et al., "The CVAX CPU, A CMOS Schenstrom, Tim Thrush, and Peter Wilcox who VAX Microprocessor Chip," fCCD Proceedings created the layout for the chip. Test, product (October 1987): 148-152. engineering, and additional technical contribu- W. Bidermann, et al., "The MicroVAX 78132 tions were made by Dilip Bhavsar, Shlomo Floating Point Chip," Digital Technical Journal Daniely, Larry Harada, Charlie Hilman, Vinotl (March 1986): 24-36. Rai, and Nigel Scott. G. Wolrich, et al., "A High Performance Floating Reference Point Coprocessor," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5 (October 1984): I. D. Sweeney, "An Analysis of Floating Point 690-696. Addition," IBM Systems Journal, vol. 4 (1965): 31-42. T. Leonard, ed., VAX Architecture Reference Munuul (Bedford: Digital Press, Order No. General References EY-34 59E-DP, 1987). T. Fox, P. Gronowski, A. Jain, B. Leary, and 0. L. ~MacSorley, "High-speed Arithmetic in D Miner, "The CVAX 78034 Chip, a 32-bit Sec- Binary Comp~iters,"Proceedings of the IRE, vol. ond-generation VAX Microprocessor," Digital 49 (1961): 67-91.

Digital Technical Journal iVo 7 Augusf 1988 The System Support Cbip, a ~ultifunctionCbip for CVM Systems

Developed as a general-putpose companion to tbe new CMOS VAX VLSI chips, the System Support Chip (SSC) contains a common core of peripb- era1 system functions which are required to support a MicroVAX system environment. These functions imlhtimers, VAX console support, and standby RAM.In addition, the SSCprovides system designers with '%ooksn to other systemfunctions. With these peripheral functions integrated on a single cbip, system designers can substantially reduce the number of com- ponents on a module and add features previously not considered cost eflective. Primarily used witb tbe CVAX CPU cbip, tbe SSC is also compat- ible witb tbe NMOS MicroVAX CPU cbip.

Background and Goals We decided a chip that provided the common In 1984, as the VAX 8200 and MicroVAY I1 chip core of these peripheral functions would be a sets entered production, Digital's Semiconductor strategic component for Digital products. This Engineering Group (SEG) directed its attention singlc chip would integrate many of the pcriph- toward defining thc next generation of MicroVAX era1 functions usually required on MicroVAX CPU systems. This paper describes the project modules. Consequently, a system designer could history and functionality of one of this new gener- substantially reduce the number of components ation's peripheral chips, the MicroVAX System on a CPLJ module and add features that previ- Support Chip (SSC). Developed over a period of ously would not have been cost effective. More- 18 months beginning in late 1984, the SSC was over, the chip would allow him to add features designed as a general-purpose companion to thc without lengthening the project schedule or CVAX CPU. As such, the chip is used in the VAX requiring extra resources. As a result, the system 6200 family ant1 in the MicroVAX 3000 family.*.' designer cou Id produce a more competitive As part of the definition of the new CMOS VAX Digital product at little additional cost. Family of VLSl chips, SEG looked at the periph- From the system designer's viewpoint, the chip eral functions that surrounded the existing would MicroVAX 11 CPU. We observed that, to build a Fully implement many functions used identi- marketable product, each system group had cally across different MicroVAX systems, such added a collection of timers, decoders, and other as timers, ROM support, and standby RAM low- and mid-complexity functions to their respcctivc n~odules.A high level of similarity Provide the "hooks" to support other func- from module to module was apparent in the tions that would be implemented differently in makeup of thesc functions. the different system enviro~l~nents In addition to examining these existing mod- Thus each system group would no longer need to ules, we talked with the system designers to learn design, implement, and debug these important what additional functions should be included on peripheral functions from scratch. Instead, they thc next generation of systems. Again, we found could use a readily available part that had been that the various systems under development debugged and qualified. Further, since the SSC would have a significant number of overlapping would use custom CMOS VLSI, this chip would functional recluirements. contain some additional useful functions, such as

Digital Technical Journal No 7 AU~I~S/1088 The System Support Chip, a Multifunction Chip for CVAX Systems general-purpose timers, that are expensive to Table 1 SSC Physical Characteristics implement in off-the-shelf or gate array tech- nology. Total device count 84,000 (approx.) With these goals outlined, we began develop- Die size 8.0 mm x 7.5 mm ment of the SSC. The following section presents Power dissipation Less than 1.0 W, worst case an overview of the chip In the balance of the paper, we describe the chip functions in detail Packaging 84-pin surface mount and discuss the trade-offs made and problems Clock 40 MHz external; 20 MHz encountered during development. internal; 25.6 kHz for time- of-year clock SSC Overuiew The SSC incorporates onto a single chip a com- mon core of functions required to support the nents on the module and thus the product cost. VAX system environment. Table 1 lists the essen- The MicroVAX 3000 uses two 64 kilobit (Kb) tial physical characteristics of the chip. Figure 1, ROMs in parallel, forming a 16-bit ROM word. a photograph of the chip, shows the major sec- The VAX 6200 system uses two 64Kb ROMs in tions. Grouped into three main categories, these series. sections are To provide data-width compatibility between the 32-bit-wideCVAX bus and the narrower ROMs, Support for power-up booting and the VAX the SSC includes packing support for 16-bit console word-wide or 8-bit byte-wide external ROM. Clock and timing functions With packing support, the SSC performs multiple reads of the narrow ROM word, assembles a Features required by the VMS operating system 32-bit longword, and sends the longword back to and those commonly required on a VAX CPU the microprocessor. The SSC performs this func- module. tion by directly driving the output enable and We begin our detailed discussions of the chip address lines 1 and 0 of the ROM. (See Figure 2.) functions with the SSC console and boot code The ROM's other address pins are driven by an support. external address latch, and the data lines of the ROM drive the CVAX bus directly. Console and Boot Code Support To pack a ROM, the SSC asserts output enable, The peripheral support described in this section drives the appropriate combinations of ROM includes ROM packing, halt-protection, the address pins 1 and 0, and receives the narrow UARTs, and standby RAM. data across the CVAX bus in consecutive ROM acess cycles (unbeknownst to the microproces- ROM Packing sor). The SSC then deasserts output enable, puts When a MicroVAX CPU is powered up, it begins the packed longword on the CVAX bus, and com- executing code from read-only memory (ROM). pletes the read transaction. To properly communicate with an off-the- shelf ROM, the microprocessor requires addi- CPU Halt-request Protection tional interfacing logic. The SSC provides this System designers requested that the SSC help logic by generating the signals needed for the prevent an undesired condition in the halt logic. ROM-to-microprocessor interface. The SSC also When the halt pin is asserted on the micro- provides the packing support for data-width processor, it executes a special trap to console compatibility between the ROM and the micro- code stored in the ROM. A second assertion of processor. the CPU's halt pin (typically generated when At project outset, SSC designers assumed the someone repeatedly presses the halt button on module designers would use four ROMs in paral- the system front panel) causes a second such lel to provide a 32-bit-wide ROM word to the trap, overwriting the pointer needed to return CPU. However, with ROMs becoming denser to program code upon leaving console mode. every year, it is now possible to put all boot, con- Without this pointer, normal operation of the sole, and diagnostic code in one or two 8-bit-wide machine cannot be resumed without booting. ROMs. System designers therefore chose to use Obviously system designers wanted to prevent fewer ROMs, decreasing the number of compo- this condition.

Digital Tecbnkal Journal No. 7 August 1388 CVAX-based Systems

;ADDRESS DECODE iAND I INTERRUPT LOGIC

-.., .. ------Y PROGRAMMABLE W TIRAFRS

Figure I SSC Photograph Showing Major Sections

The SSC prevents the second call by monitoring CPU CVAX BUS the addresses of all instruction reads and by inter- cepting all external halt requests made to the DATAcl5:O> CPU. During normal CPU operation, the SSC passes an initial halt request to the microproces- LATCH OUTPUT sor. The microprocessor immediately begins to ENABLE execute from halt-protected space, which is a special address space programmed into the SSC CONTROL by the user at boot time. When the CPU reads thc first instruction from Figure 2 SSC ROM Packing Connection console code, the SSC detects this console code Diagram address and masks further halt requests. These

Dfgftal Technical Journal No. 7 August 1988 The System Support Chip, a Multifunction Chip for CVAX Systems

requests are masked as long as the microproces- ports within the VAX archite~ture).~As a further sor is executing ROM console code. The console simplification, we limited the number of baud code can then run uninterrupted by halts. rates to eight power-of-two choices (300 to During console code execution, the SSC con- 38,400 baud). tinues to monitor all instruction addresses. When Our most significant improvement to the an address outside halt-protected space is DLART design was the addition of hardware detected, the SSC re-enables halt requests to control-P break-detection. Control-P entered on a the CPIJ. VAX console is interpreted as a halt request. Before deciding on the design described above, Thus, the UART must pick out this special we considered implementing a software-con- keystroke from the normal character stream and trolled bit that would enable and disable halts. then signal the CPU to take appropriate action. This scheme would require the software to set Formerly, this function was performed by cum- the bit upon entering halt-protected space and to bersome firmware. However, the SSC hardware clear the bit upon re-entering normal operation. continuously watches for this character and, Although apparently simpler, this scheme proved when it senses control-P, automatically signals to be flawed because two conditions might occur the microprocessor. that would prevent the user from halting the sys- The console code may configure the SSC tem: (I) the bit could be accidentally set by non- such that a break is defined as a control-P or as boot code, or (2) a software error in the boot 20 spaces; the latter is a definition still used in code could cause the microprocessor to start exe- some console applications. At one point, we had cuting nonsystem code. planned to use the chip timebase to define a With the plan we chose, control is automati- break as a space lasting a fixed number of mil- cally returned to the user as soon as the software liseconds instead of 20 spaces. However, users completes execution of the assumably bugfree advised us that this new idea, although more halt-protected boot code. The system designer elegant, would make the UART more confusing can, however, provide software control of the to 11se. halt-enable function by aliasing the boot ROM Other improvements include better notifica- into two adjacent spaces, where only one copy is tion of overrun and framing errors, and secure halt protected. The software can then control console support. Console security is effected by a halts by jumping between copies of the code. pin. When grounded, the pin prevents a break (This method is used on the MicroVAX I1 and from halting the CPU. This pin is typically con- MicroVAX 3500/3600 systems.) nected to a key switch on the computer's front panel. Using the switch, the user can lock out UARTs console-induced halts. Although it was clear from the beginning that the Further, the SSC allows the CPU to directly SSC should provide UARTs, the best choice for access the UARTs, time-of-year clock, and bus number and design was not immediately clear. reset register by means of the VAX external pro- We had two choices at the time the chip was cessor register protocol. Using this protocol, the defined: microprocessor can address system registers located outside the microprocessor by register Double-buffered DEC DLARTs (DC-3 19), number rather than by complete address. The which were in wide use, although a few SSC understands this protocol and is capable of problems with this design had recently decoding the register number and generating surfaced the desired response. Previously, VAX module designers using off-the-shelf UARTs had to imple-

8 Silo designs, which were becoming popular, ment a substantial amount of external logic to though large in size decode the register addresses and enable the UARTs to respond to this protocol. To conserve chip area, the SSC team settled on Finally, the UARTs support break transmit and a design very similar to the DEC DLART design, loopback, and properly respond to VAX inter- making a few improvements in response to user rupts. In products containing the SSC, one UART requests. To keep from unduly con~plicatingthe is used as the system console; the other is used design, we also decided to limit the number of for auxiliary functions, such as remote diagnos- UARTs to two (the number supported as console tics. or is disabled.

Digital Tecbnlcal Journal No. 7 August I988 CVAX-based Systems

Standby RAM terminated, error-handling code reads the SSC When a VAX system is powered off, the operating internal status flags and takes the appropriate systcm must storc some information in non- action. volatile memory until the system is powered up The timeout interval may be programmed in again. This stored information describes the sys- I -microsecond increments up to 16 seconds. tem configuration and contains pointers to restart The larger values are used to time out system data stored on the disk. On the MicroVAX 11 CPU self-test. module. a watch chip provided 50 bytes of stor- age for this purpose. System designers indicated Interval Timer this amount was inadequate; 500 to 1000 bytes The VAX architecture specifies a complete inter- was desirable. val clock which the operating system uses to To meet this standby storage need, the SSC pro- schedule time-critical system functions at regular vidcs 1 kilobyte (KB) of battery backed-up ran- intervals. On MicroVAX CPUs, logic for the clock dom-access memory (RAM), organized as 256 by is simplified to reduce the amount of circuitry on 32 bits. This RAM is also used as a system the microprocessor chip. On these microproces- "scratch pad" during power-up test. sors, only an interrupt-enable bit is imple- Additional standby support featurcs are de- mented. The timer source is generated externally scribed in thc section Standby Features. and is driven onto an input pin of the micropro- cessor chip. When the interrupt-enable bit is set, Timers an intcrrupt request is generated on the falling edge of the timer source, which is a 100-Hz sig- The SSC timers scrve to improve system reliabil- nal on MicroVAX systems. ity. meet architecture requirements, and save The SSC eliminates the need for the module module space. These timcrs include the pro- designer to place another oscilJator on the CPU grammable bus timeout, the interval timer, gen- module by providing a 100-Hz output suitable eral-purpose timers, and the time-of-year clock for driving the interval timer input to the discussed in this section. microprocessor chip. Bus Timeout General-purpose Timers Sincc the CVAX bus is a handshake bus, incom- Early in the SSC development, many potential plcrc bus transactions can hang the system. Some users voiced a need for general-purpose timers on older VAX sysrcms permit this condition; when future MicroVAX modules. However, no one had those systems wcre designed, the high cost of specific recommendations on how such function- implementing a timeout in external logic could ality should be implemented. Some users nor be justified in relation to the rarity of this requested four timers; whereas others reasoned event. However, the SSC improves system reli- that onc timer supported with software could do ability by providing a programrnablc bus timeout the work of four or eight timers. at no additional systcm cost. After some design attempts, we decided to If a transaction lasts longer than a user- copy, bit for bit, the VAX standard interval programmed intcrval, the chip clock. We reasoned that it was prudent to select a design that was already well thought out and Signals the microprocessor that a bus error has in general use. We did add onc control bit to occurred provide a one-shot capability. Our decision to Terminates thc transaction includc two timers was based on the amount of available chip area and a desire for some Sets certain internal status flags based on the redundancy. type of transaction that timed out Each timer provides scheduled interrupts with 1-microsecond resolution. The maximum The status flags differentiate the two types of interval between interrupts is 1.2 hours. In one- timcouts: (I) unexpected timeouts of read or shot mode, the timer stops upon generating write transactions. and (2) permissible timeouts its first interrupt. In single-step mode, a count caused by some unimplemented external proces- can be caused only by writing to a specific sor registers or by certain interrupt-acknowledge control bit. The interrupt vector is user-pro- transactions. After the timcd-out transaction is grammable.

Digital Tecbnicul Jourt~d No 7 AI~~IAS~I088 The System Support Chip, a Multifunction Chip for CVAX Systems

These timers are not used by the CPU module. grammed address and transaction type are but are available to the end user. We expect them matched. to be very helpful to users designing embedded, The strobes can be programmed either to time-sensitive applications. provide a hook for external logic or to complete a transaction after a delay. When the SSC is pro- Time-of-yearClock grammed to provide a hook, the strobe might 'I'he VAX architecture requires a battery-backed- be used to drive an external address decoder or to up time-of-year clock with a resolution of enable another chip. After asserting the output 10 milliseconds (ms). When the MicroVAX 11 strobe, the SSC takes no further action, permit- CPU module was designed, the best method for ting another device to complete the bus trans- providing this feature involved the use of a BCD action. watch chip, approximately one-half gate array of Alternatively, a strobe can be programmed to logic to interface the chip to the MicroVAX bus, complete the transaction after a delay that per- and some specially written operating system mits an external device several hundred nanosec- code. Even then the clock provided a resolution onds to respond. When configured in this way, of only 1 second in standby mode. the strobe is usually programmed to respond to The SSC provides a much more desirable solu- reads of a single longword address. The strobe is tion. Its 32-bit VAX-standard time-of-year clock, then wired to enable three-state drivers which driven by an external 25.6 kilohertz (KHz) oscil- drive module data onto the CVAX bus. This data is lator, increments every 10 ms. As with all SSC- often made up of external registers, or of dual in- internal registers, the microprocessor can access line package switches that indicate baud rate the time-of-year clock without using any external selection and other module-specific information. logic. To further minimize cost and module space Output Port usage in systems where battery-backed-up clock Four pinson the chip function as an output port. operation is not required, the user may simply The port is written as a register and is capable of ground the 25.6 KHz input pin on the SSC. driving simple output devices. This output port During normal operation, the time-of-year clock is another general-purpose feature that system will automatically derive its timebase from the designers need to implement various module- chip's UART timebase, removing the need for the specilic functions. Some designers use the port 25.6 KHz oscillator on the module. pins to drive LEDs, which are then flashed in a particular sequence to indicate progress of self- Other Support Features test. In other applications, system designers have used these signals to control external multiplex- Programmable Address Strobes ers and to provide simple modem control. As noted in the section Background and Go;~ls, the SSC is designed to provide system designers Bus Reset with "hooks" to other system functions. One of The VAX architecture requires a reset of the 1/0 these hooks is the SSC programmable address system when the CPU issues a write to a particu- decode strobe function, which adds user lar external processor register. This specification flexibility and also saves module space. requires support from both decoding logic and Virtually every CPU module needs logic that 1/0 system reset logic. In the past, each module watches the bus for particular addresses and designer had to implement both logic blocks in asserts signals when these addresses are sensed external hardware. SSC designers saw another This function is typically embedded in gate array opportunity to simplify the CPU module by plac- logic or in dedicated programmable array logic ing some of the consistently required logic on (PAL) chips. the SSC. The SSC has two programmable address decode Although the 1/0 system reset logic varies strobes. The user may program each strobe for a among systems, the decoding logic is the same in particular address of Is, Os, or "don't cares." The each MicroVAX system. The SSC provides this user can also program selectively for read or core logic, taking three actions. First, the chip write transactions. When a strobe channel is decodes the external processor register number. enabled, the corresponding output pin will assert Then it asserts an output pin in response to the during an37 bus transaction for which the pro- external processor register write. Finally, it

Digital Technical Journal No. 7 August 1388 delays the completion of the write transaction for detector latch will operate for arbitrarily slow several hundred nanoseconds, so that module- supply transitions. In addition, the latch's reset specific logic, triggered by the pin assertion, may input includes internal filtering for protection proceed to take the proper action to complete against fast supply transitions or power-up noise. the 1/0 system reset. If either the external circuits assert the SSC input pin or the special power-up latch indicates Standby Mode: Power-sensing Features a loss of power, the SSC sets an internal flag bit at When powered down, VAX systems are required boot time. The bit, which indicates that the clock to maintain a running real-time clock for at least and RAM are not valid, is read by the micropro- 100 hours. Retention of some memory is also cessor during boot. desirable. As noted in the section Standby RAM, System reliability is improved by the SSC's abil- the SSC satisfies these requirements by providing ity to determine the integrity of its standby logic a standby operating mode. In this mode, the and to notify the CPU in a software-accessible power supply to the module and to the chip pad fashion. Moreover, this feature saves design time, drivers is turned off and most internal logic is dis- since designers need not individually create this abled. However, the SSC RAM and time-of-year tricky but necessary logic. clock are powered by three NiCad batteries sup- plying between 4-3.1 V and + 4.5 V at approxi- Flexible Addressing mately 150 microamperes. The batteries also The designers of the SSC determined that the chip power the 25.6-kHz external low-power CMOS should fit into any VAX system environment with oscillator, which provides the time-of-year clock a minimum of external address decoding or sys- timebase. Within the SSC, special logic guaran- tcm incompatibility. As a result, the SSC control tees smooth transitions from normal operation to and status registers and internal RAM are all situ- standby mode. ated within a relocatable 2KB address space. This As part of providing standby operation, the SSC arrangement eliminates the need for an external must reliably report at boot time whether standby chip-enable pin and the external decoding logic power was continuously maintained during the that would be needed to properly assert such a standby period. The task of determining whether pin. The power-up boot code programs the base battery power had remained stable during the address of the registers by writing a 2KB-aligned standby period was a difficult challenge for the value to the SSC base address register. SSC designers. There are two ways power can be The SSC base address register is located at a sin- lost during standby: The batteries may run down, gle fixed address, chosen in cooperation with our or someone may replace the batteries. In either major users. The SSC RAM and registers can then case, the SSC detects loss of power and reports be addressed by adding their specified offsets to such loss to the CPU during the next boot. the value in the base address register. A system Except for external logic used for voltage mea- designer can therefore situate the SSC registers surement, this entire f~~nctionis implemented and RAM (together) anywhere in a system's 1/0 within the SSC as follows. space map. Whcn the batteries run down, the unaccept- ably low voltage can be detected during boot. Initialization However, our CMOS process is not optimized for To make the SSC especially easy to use, most of the design of logic that can accurately measure the SSC configuration bits are grouped in a single intermediate voltages. Thus, external circuits register. These bits include setup for the UARTs, are used to detect whether battery voltage is programmable address strobes, ROM packing, currently below a minimum level. If voltage is and halt-protection features. Thus, during system below minimum, these circuits assert an SSC initialization, most SSC features can be config- input pin dedicatcd to this function. Howcver, ured with a single write. these external circuits cannot detect temporary power losses that occur during standby mode, for Micro VAX and Multi-speed example, when the batteries are replaced. To provide for these cases, a special latch on the Compatibility chip, which powers up in a preferred state, Although targeted primarily as a companion to detects the interruption of battery power during the CMOS VAX CPU, the SSC is also compatible standby or initial power-up. This power-up with the older NMOS MicroVAX CPU used in the

Digital Technical Jounad No. 7 August I988 The System .fupport Chip, a Multifunction Chip for CVAX Systems

MicroVAX 11. Thus, new low cost or low perfor- References mance designs ming the older microprocessor 1. T. Fox, P. Gronowski, A. Jain, B. Leary, and chip can also take advantage of the high integra- D. Miner, "The CVAX 78034 Chip, a 32-bit tion and extra firnctionality provided by the SSC. Second-generation VAX Microprocessor," The SSC is also compatible with modules that Digital Technical Journal (August 1988, have either high or low cycle times. Originallj~ this issue): 95-108. designed for a 100-ns microcycle, the CVAX microprocessor runs at 90 ns in the MicroVAX 2. B. Allison, "An Overview of the VAX 6200 3000 system and at 80 ns in the VAX 6200 Family of Systems," Digital Technical system. Early in the developnlent of the CVAX Journal (August 1988, this issue): 10- 18. chip set, we decided that chips that were not 3. G. Lidington, "Overview of the MicroVAX performance-critical, like the SSC, would run at 3 500/3600 Processor Module," Digital just one speed (100 ns), but would be capable of Technical Journal (August 1988, this interfacing to a faster-running microprocessor. issue): 79-86. Speed conformability would simplify develop- ment, manufacturing, and field support because 4. T. Leonard, ed., VAX Architecture Refer- one SSC could be used across all MicroVAX ence Manual (Bedford: Digital Press, Order systems No. W-3459E-DP, 1987). Accordingly, the SSC bus interface, running at a 100-ns microcycle, accommodates microproces- sors running at microcycles from 100 ns to 60 ns. Summary The SSC project yielded a CVAX microprocessor companion chip that provides a high degree of functionality, flexibility, and integration. Com- prising console support, timers, decoders, and other programmable features on a single chip, the SSC permits system designers to develop smaller, more integrated modules at lower cost. Moreover, improvements made to the generalized features, such as halt protection and break detec- tion, contribute to increased system reliability without reducing system design flexibility. The utility of the SSC is evidenced by plans to include the chip in over a dozen different Digital products, such as the MicroVAX 3000 systems, the VAX 6200 systems, many XMI adapter boards, and various controller products. Acknowledgments The author wishes to acknowledge Brian R. Allison, Barry Maskas, Robert McNamara, Jay T. Nichols, Michael H. Phipps, and Robert M. Sup- nik, who contributed to the development of the SSC functional specification. Further, the author acknowledges the consider- able efforts of the SSC design and manufacturing team: Robert A. Anselmo, Nannette M. Fitzgerald, Robert J. Flanagan, James D. Gorr, Dennis E. Hodges, Keith D. Johnston, Balakrishna Joshi, Joseph R. Mantos, John W. May, Michael J. Saldana, and Nicholas D. Wade.

Digital Tecbnkal Journal No. 7 August 1988 Barry A. Maskas I

Development of the CVAX Q22-busInterface Chip

Tbe CVM Q22-bus interface chip (CQBIC) is a highly integrated, single chip that serves as the interface between the CVAX microprocessor and the Q22-bus I/O subsystem. Tbe CQBIC VLSI design is the first produced by Digital's Japan Research and Development Center in coordination with teams in the U.S. Before implementing the interface &sign, team members built a test chip to ensure thefeasibility of a CMOS Q22-bus transceiver and to test various design alternatives. Also as part of their research eflort, they examined alternative designs for several functions, including the scatter-gather map cache and the data buflering functions. Project designers then implemented the CQBIC using a mix of fd custom and semicustom design databases. A description of themmajor functional sections ispresented in thispaper.

The CVAX Q22-bus Interface Chip (CQBIC) is an Table 1 CQBIC Physical Characteristics evolutionary step in functionality and integration Process 2-micron drawn, N-well, from the MicroVAX I1 CPU module's Q22-bus dual aluminum CMOS intcrface design. The MicroVAX 11 CPlJ module's Number of transistors 40,900 (approx.) Q22-bus interface comprises 18 discrete chips Die size 9.2 mm x 9.4 mm and ;I gate array; the module design employs Power consumption 1.5W linked sequential controllers.' The advanced Packaging 132-pin surface-mountable CQBIC design integrates these controllers and all chip carrier with 25-mil other interface functionality in a single chip and lead spacing and heat sink retains the linked controller design. Power supply +5 V Specifically, thc CQBIC provides the electrical and functional intcrface between the 3 2-bit CVAX microprocessor and the 16-bit Q22-bus 1/0 sub- system. Integrated on the chip are the complete era1 reasons. Primarily, such a chip would reduce Q22-bus interface, data buffering, the CVAX component costs and system module size, and bus,' direct memory access (DMA) interface, a increase system reliability as compared with the scatter-gather (S/G) map cache, and complex MicroVAX I1 CPU module's Q22-bus interface. control logic. Table 1 lists the chip's physical Therefore, the primary goal of the CQBIC pro- characteristics. ject was to develop a highly integrated chip as an Begun in February 1985, the two-year CQBIC interface between the CVAX microprocessor and project was a joint venture for three of Digital's the Q22-bus. This chip would ease the task of groups: Japan Research and Development Center, Digital's system designers by standardizing the Largc Scale Integration URDC/LSI); Semi- interfacing to the Q22-bus and by providing the conductor Engineering Advanced Peripherals same or improved 1/0 bandwidth performance as Development (SEG/APD); and Micro Systems the MicroVAX 11 CPU module Q22-bus interface. Dcvelopment (MSD) .3 Achievement of this performance goal was complicated by the single-port memory architec- Project Goals and Organization ture of the first planned CPU module and its two- A highly integrated, single-chip, CVAX bus to level instruction and data, direct-mapped cache Q22-bus adapter was a desirable product for sev- scheme. In comparison, the MicroVAX 11 CPU

Digital Technical Journal 129 No. 7 August 1088 Development of the CVAX Q22-busInterface Chip module has a dual-ported memory architecture be presented in the context of the five other VLSI with no caching. However, the DW single-port chips being designed by the SEG groups with a architecture was required for the new two-level focus on the CPU module product. The US.- cachc architecture; with a single-port organiza- based project leadership had to provide budget, tjon. DMA addresses can be viewed by the caches scliedule, and task coordination for JRDC, MSD. so that thc caches can invalidate valid entries dur- and for other organizations within SEG. ing I/O-to-memory write transactions. Conse- As the initial customer, MSD performed three quently, to both accomnlodate this architecture major specification reviews. This group continu- and mcct its performance goals, CQOIC had to ally provided direction concerning design be designed to consume little CVAX bus band- trade-offs, and requested specific functionality width while performing DMA transactions. Such revisions to tailor CQBIC more to their CPU a design would not grcatly degrade CVAX application. microprocessor performancc. Digital's Engincering Nctwork was the primary A second important projcct goal was to pre- means of transferring written communications serve 1/0 performance and operating system between groups. We also exchangcd information software compatibility.4 Therefore, CQBIC by sending facsimile copy and by mailing mag- would provide the same Q22-bus virtual to CPIJ netic tapes and documents. At times telephone physical memory address translation as contained discussions and personal visits were necessary. on the MicroVAX I1 CPU. Specification development began with a two- In addition to meeting these goals, the CQBIC week visit to the JRDC facility in Tokyo. At that project would also serve to demonstrate the feasi- timc, we wrote the first draft with key members bility of a remote VLSI design center for the SEG of the JRDC team. This draft specification laid organization. Moreover, through this project the the foundation for subsequent architecture and JRDC/LSI Group would havc an opportunity to filnctionality research, and served as a communi- demonstrate its VLSI design capabilities. cation medium. The draft specification was then Further conlplicating the challenges presented maintained by the JRDC tcam and SEG and was by rhc design goals, thc distance between the frequently revised and rcviewed. working groups, the cultitral and work style The following section presents the project differences, and the language barricr was the research conducted to ensure the feasibility of newness of the JRDC team. Many of thc JRDC project goals and to resolve major questions team mcrnbers could read and write English, but raised by the draft specification. had somc difficulty speaking and listening to English. Also, t he Japanesc language was com- Project Research pletely foreign to MSD and SEG. Written English Project research focused on two arcas. First, we senred as the primary form of communication wantcd to evaluate the risks involved in the throughout the project. Further, the JRDC team implementation of a CMOS Q22-bus transceiver. members had to learn not only about Digital's For this purpose, SEG team members implernen- products and architectures, but also the Q22-bus, red a test chip. Second, we wantcd to determine the other five chip specifications under develop- the best means to achieve our stated performancc ment, the SEG semicustom and custom chip goals. The tests and studies which we conducted design tool suites, and Digital's CMOS technol- and their resu Its are described below. ogy. To help with this stccp learning curve, experts from each of thesc areas Facilitated the Q22-bus Transceiver Test Chip training and information flow. Thcsc experts To clctcrnline whether or not a CMOS Q22-bus provided answers to spccific questions and transceiver could be implemented, several stud- helpecl to solve spccific problems as follow-up to ies wcre performed by SEG circuit designers formal training sessions. responsible for the cell library. These stitdies Based on the MicroVAX I1 CPU design experi- showed feasibility, with two major implementa- ence in SLG and MSD, SEG provided leadership for tion risks: both the cliip specification development and the project. This role involved conveying to the JRDC The proposed differential comparator to be team the chip functional definition and detailed used as the receiver required a stable voltage behavior specifications. This information had to refcrence.

130 Digital Tecbnical Journal No. 7 Augusl 1988 The 33 100-milliampere (rnA) peak, 70-mA proceeded under a plan that included integral steady-state sink current Q22-bus transccivers transceivers. were to be on the same substrate as complex control circuitry. Three problems could Architecture and Performance Studies result: As the octal transceiver tcbt chip was being devel- oped, MSD, JRDC and SEG conducted architec- - CMOS latch-updueto charge injection from ture and performance studies. These studies input signal overshoot would answer questions about the organization of - Excessive noise due to substrate current the S/G mapping function, the data buffering transients required to meet the performance goals, and the sequential controllers partitioning and clocking - Excessive localized power dissipation to manage the two asynchronous buses and the With several dcsign alternatives available to us, internal functions. we needed more experimental data to determine the better alternatives. To obtain this data, a S/G Mapping Q22-busoctal transceiver test chip was designed, A RAM structure was first proposed to implement fabricated, and packaged by SEG circuit dcsign- the S/G mapping functionality. 'The MicroVAX I1 ers. Available after seven months, this packaged CPU design had used such a structure, with two octal transceiver test chip was tested in a 8K-by-8 static RAMS. This proposal, however, was MicroVAX 11 CPU module and performed well rejected since not all of the RAM would tit on a under system conditions. single chip with all the other required circuitry. The test chip experiments showed that CMOS Increasing the chip size was not an option. The latch-up due to worst-case overshoots below chip size was limited for cost reasons as well as ground did not occur. These results matched our packaging cavity size reasons. The chip's cost is expectations. We were not concerned with over- directly proportionate to its size, and the design shoots above thc + 5 volts (V) bias because of the of a new package was outside the scope of the Q22-bus termination voltage of 3.4 V. Tests also project. Moreover, implementation of a portion showed that special care would be required in of the RAM would have introduced a system soft- the allocation of dedicated ground pins for the ware incompatibility with MicroVAX I1 and Q22-bus transccivers to avoid noise coupling would have reduced the planned performance. from substrate bounce and package power-lead As the problem of S/G mapping functionality inductance. Also, in the chip layout, we would was studied, it became clear that system memory have to use many parallel traces of metal inter- was adequate. Further, CQBlC could not imple- connect to prevent metal migration when sinking ment the full 8192-entry RAM on a chip size that 100 rnA of peak current. Finally, due to low could be fabricated with reasonable yield. Also, a channel resistance of the Q22-bus driver output capability to prefetch S/G map entries based on pull-down device, the power dissipation of the expectation was considered necessary to sustain test chip was shown to be within reliable opera- peak, as opposed to average, performance. We tion limits. Therefore, CQBlC power dissipation looked to the Q22-bus DMA devices which per- was not a concern in terms of thermal characteris- form transactions with incrementing addresses. tics of the planned packaging. In particular, Q22-bus devices are designed to The test chip results did lead to a compromise utilize the Q22-bus block-mode data transfer concerning the stable voltage reference. Because protocol. This protocol transfers data packets of of large variations in CMOS process materials, a eight-word blocks. With this protocol available, precision off-chip or external resistor would bet- we could design the CQBIC to cache the S/G map ter sen1e to establish the required voltage than entries from system memory on demand and on would some risky process-desensitized structure expectation. in CMOS. The next two problenls were how to imple- Prior to these tests, we designed CQBIC to ment the cache and how many entries to include facilitate the i~seof either integral transceivers or in the cache. A 16-entry cache provided the bal- off-chip transceivers. Fortunately, the test data ance we sought between several factors: appro- demonstrated the feasibility of a single chip with priate chip area, implementation complexity, integral Q22-bus transceivers, and the project design risk, and DMA 1/0 performance impact.

Digital Technical Journal No 7 August 1988 Developn~enrof the CVAX Q22-bus fnterfucc. Chip

Dcrta Buffering cache hit rate result in an insignificant DMA 1/0 CVAX bus cycle times were targcted to be four or impact on CVAX CPll performance. Given the more times greater than typical Q22-bus cycle buffering and control organization and optimiza- times. Also, the CVAX bus was being designed to tions descri bed above, performance difference support DMA rnultidata transfers. This design was between the single-port and the dual-port mem- consistent with the Q22-bus block-mode data ory designs cannot be detected by a Q22-bus transfer protocol. To bridge the bandwidth gap device. The result is improvement in Q22-bus between the two buses and to minimize the use read and write throughput over the MicroVAX I1 of CVAX bus bandwidth, data buffering tech- CPU. The CQBIC maximizes Q22-bus perfor- niques were investigated to optimize for QL2-bus mance and minimizes CVAX bus usage. Moreover, block-mode throughput for read and write trans- CQBIC can sustain Q22-bus block-mode transfer actions. These investigations resulted not only in write data rates of 3.3 megabytes (MB) per sec- the determination of buffer sizes but also in a ond and read data rates of 2.5 MB per second. decision on how to control the buffers to opti- Finally, to optimize the CVAX 1/0 write pcrfor- ~r~izesustained throughput and minimize initial rnance, a dump-and-run buffer was to be imple- latency. mented in CQBIC. 'l'his buffer is used to avo~d The MicroVAX 11 CPlJ is capable of sul,pllring tying up the CVAX bus while the slower Q22-bus read data to the Q22-bus with a very consistent transaction corupletes and while deadlock situa- access time because memory arbitration is not t ions are resolved. required. To achieve MicroVAX 11 average read performance, read data prefetching was consid- Controller Purlition ered neccssary to compensate for the memory Given these buffering functions, the control of arbitration time. For CQBIC. thc first read of a the data path and of the two major bus interfaces Q22-bus transaction would be time delayed by was naturally partitioned into live linked con- thc DMA request and grant tinie, to obtain master- trollers and a prioritization function. Each bus ship of the CVAX bus, and by the subsequent sys- interface was partitioned into a master and a slave tem menlory access time. The delay would alw;~ys controller. I'he S/G map cache also required a be longer than MicroVAX I1 read latency. which controller. Then to assist in coordination of con- had only memory access time read latency to trol flow decisions, ;I priority resolver function consider. We determined that two quadword read was needed. buffers would be sufficient to sustain the 'This partition allows the Q22-bus and the required throughput becausc read data is CVAX bus to operatc in parallel while all dead- prefetched based on cxpcctations of the Q22-bus lock conditions are resolved. Fortunately the block-mode protocol. Low latency was achieved CVAX chip team implemented a bus transaction by provicling a response to the Q22-bus as the retry capability. This retry capability proved first longword of the quadword read data was cssc~itialto our partition and implementation of obtained from system memory. CQBIC control functionality. Pipelining the buffered write data coultl be achieved with two buffers, each eight words Clocking deep. An octaword block is the packet size of the 'Two primary factors led us to select a 50-nano- Q22-bus block-motlc protocol and is thc maxi- sccond (ns) two-phasc nonoverlapped internal mum multitransfer block size of the CVAX bus. clock scheme. First, thc MicroVAX 11 CPU rnod- The control logic would be designed to allow one ulc's 50-11s single-phase clocking scheme was buffer to be unloaded to system memory while a pro\.en approach and mapped well to the fixed thc other was being filled. The latency wo~~ldbe Q22-bus minimum asynchronous timing specifi- bctter than that of the MicroVAX 11 CPlJ motlule, cations Second, we expected synchronous CVAX since the CQBlC data was packet1 into fast octa- bus c!,clc r[~nirigto vary with CMOS technology word buffers. The average throughput woi~ldbc irnprovemcnts The variable CVAX cycle time and sustained by the four times or greater bandwidth four-phase overlapped clocking scheme could of the CVAX bus, as compared to the Q22-bus, by not be used to generate the fixed Q22-bus tim- the use of pipelined data buffers. ing. Also. having two clocking schemes in one The CQBlC buffering and transaction optimiza- chip was determined to be a design too complex tions in conjunction with the CVAX CPU internal to manage.

Digital Technical Journal h'o 7 August 1988 CVAX-based I Systems

'I'hc implication of the sclccred CQBIC clock- pc.ns:ltc for the long latcncy that could occur, for ing scheme was that, with reference to all intcr- cxample, when the look-up misses the cache and nal controllers, the CVAX bus ancl the 022-bus requires an S/G map memory read access. were asynchronous. The standard cell library was rejected because it did not offer a content addressable memory Research Results Summary (CAM). which is the structure required to facili- Thc result of thc research was a single chip t;~tcFast address look-ups. In addition. the use of design that would achieve the sratcd project goals standard library cell latches and exclusive OR by providing gates was estimated to almost double the desired look-up time on the 16 cached entries. Again to contain chip size and also to meet con- A 16-entry map cache, with prefetching trol performance, custom programmable logic 'l'wo octaworti Q22-bus write buffers array (PIA) sections were rcquired. The PLA structures offcred by the standard cell library Tbvo quadword (222-bus read buffers. with were too slow and required a clocking scheme prcfetching different from the CQBIC two-phase clocking A longword CVAX write buffcr scheme. This decision to implement custom PLA structures is credited as the reason performance Transaction partitioned sequential controllers. goals were achieved. In hct, performance goals which are optimized for look.ahead data could not have been achieved without custom buffering control and for i~tilizationof multi- I'LA structures. ple-transfer transactions to minimize CVAX At the timc logic and circuit design began, the bus and Q22-bus usage standard library cells available for this design The research results were documented in the were found to be inadequate. Many necessary form of a revisetl chip specilication and a bcliav- functions were missing or were not tailored for ioral model. The chip was implemented from the the specific application. Also, in many cases the revised specification with a process which was performance of library cells did not match the unique and unproven. performance required by the two-phase clocking schcme. Hence the JRDC tcam developed its own Implementation Process extensions to the standard cell library. The com- CQBlC was implemented using a mix of standard mon logic structures such as NAND, NOR, flip- librxry cells, custom library cells, and full cus- flop, and latch were used from the standard cell tom 1;1yout sections. At the timc. SEG could not library as much as possible. since these struc- okra formal design tool suitc to deal with such tures reduced the risk of circuit problems. Cus- a mix of full custom and semicustoni design data- tom structures. such as counters, multiplexers, bases. So the JRDC team standardized by select- latched pad transceivers, synchronizers, PLA AND ing the methocls of thc semicustom tool suite for plane drivers, and PLA OR plane receivers, were logic and circuit design. The semicustom sche- designed and made available to the library. matic editor and wire lister were used to design The JRDC team accurately modeled the chip all thc logic. This wire lister facilitated interhc- based on the specification at the behavioral and ing to SPICE and other checking tools and most the MOS levels of abstraction using Digital's importantly to the layout tools. For layout, no DECSIM simulator. automation of floor planning and cell placement Initially. the JRDC team developed a behav- ancl routing could he employetl. This layout was ioral system environment model based on their all done by hand, as were the full custom dcsigns. understanding of the CVAX bus and the Q22-bus Interconnect verification and design rule check- specification. This environment model was ing \wre completed using the tools from the cus- layered around the CQBIC behavioral model to tom dcsign suite. verify the design. As the design progressed, a A full custom layout section was requirecl to more accurate behavioral chip model replaced implement the S/C; map cachc because of the the initial model after correlation. chip-size and latency constraints. A part of the Further, as other CVAX behavioral, structural, latency is due to the Q22-bus address look-up in or MOS chip models matured, MSD incorporated the cache. The S/C; latency had to be small to cotii- them into the CPU system model. This model was

Digital Technical Journal 133 No. ' Atrg~tst 1988 Development of the CVAX 822-bus Interface Chip

then used to test the CQBIC further in the con- CQBIC provides the power-up, initialization, text of the application system. System simulation power-fail, and power-down protocols to the proved that all CVAX bus specifications which system and performs Q22-bus and CVAX bus were communicated were understood and imple- address decoding. Further, the chip performs mented correctly. The system simulation served the page address S/G mapping function for DMA as an independent test of the CQBIC design. devices by using its 16-entry S/G address map Although no CQBIC problems were found by cache. MSD during system simulation, the testing did This cache contains a copy of the most recently prove that the system would operate. We later used S/G pointers, which are located in system learned that several bugs could have been found memory. The cached pointers are used to map had more time-varied events been scheduled 22-bit Q22-bus virtual to 29-bit CVAX bus physi- with the system simulation test cases. cal addresses. CVAX bus and Q22-bus transac- When completed, the CQBIC MOS model was tions are optimized by using a CPU dump-and-run correlated to the behavioral chip model. The write buffer, two pipelined Q22-bus octaword MOS chip model was then placed in the MSD sys- write buffers, and two pipelined Q22-bus quad- tem model for regression testing. word read buffers. The chip performs transparent When we were confident that the CQBIC address and data alignments, and packing and design was complete, that is, when no new bugs unpacking of internal buffers. were found after thorough testing, the chip CQBIC is composed of five global control sec- was released to SEG for a final design review and tions. A block diagram of the chip control sec- submittal for fabrication. The database was tions is shown in Figure 1. copied over the Engineering Network from the Each section contains an independent sequen- JRDC facility in Tokyo to the Hudson, Massachu- tial controller: setts, plant. After completing a final design review and subsequent problem fixes, the chip The Q22-bus arbiter was submitted for fabrication. Eight weeks later The S/G map first-pass parts were probed and found to be func- tional. Packaged parts were run in the MSD CPU The Q22-bus master module. This testing revealed several timing bugs related to events from both buses occurring at The Q22-bus slave and CVAX bus master the same time. After extensive testing, the bugs The Q22-bus electrical interface. were fixed, and a second revision was releasecl for fabrication. When the second pass part was tested A photomicrograph showing the floor plan of in the CPU module, another timing problem the control sections is shown in Figure 2. related to coincjdent transactions from both Each section shown in the Figure 1 block interfaces surfaced. This particular bug was diagram is described next. obfuscated by a pass 2 bug. A third revision was prepared and fabricated. This third pass was Q22-bus Arbiter Section available in time for the first customer shipments. As a Q22-bus arbiter, the CQBlC is the default The final chip functionality is briefly described Q22-bus master and the highest priority below. requester. The arbiter accepts requests from Q22-bus DMA devices and from the master sec- The CQBIC Functional Organization tion, and grants mastership with first priority to CQBIC is an asynchronous CVAX bus device and the master section. In response to a master requires a fixed 40-megahertz oscillator input to request, the arbiter exercises a demand master- dcrive Q22-bus timing. The oscillator input is ship protocol to Q22-bus devices to ensure used to generate a two-phase, nonoverlapped low-latency interrupt vector or data reads. In clock which is distributed to all chip sections. response to interrupt requests from the The CVAX bus interface was designed to accom- Q22-bus, the arbiter receives the requests and modate transaction cycle times from 100 ns to passes them to the CPU. When the CPU acknowl- 40 ns. This design anticipated a CVAX CPU tech- edges the request, CQBIC reads a vector from nology change and subsequent performance the Q22-bus device and supplies an acknowledge improvement. signal.

Digital Technical Journal No. 7 August 1988 CVAX-based Systems

KEY 3 ADDRESS AND DATA PATHS CONTROL PATHS

Figure I Control Section Block Diagram

When multiple CQBIC chips are connected to Either as arbiter or as an auxiliary device, the the Q22-bus, they take on different functions. arbiter function performs the system power- 'I'he first chip operates as Q22-bus arbiter; the up, initialization, power-fail, and power-down others operate in auxiliary mode. As an auxiliary sequences. mode device, a CQBIC chip does not perform Q22-bus arbitration. Instead, the chip behaves as S/G Map Section a typical Q22-bus DMA device that is a default The S/G map consists of 8,192 longwords allo- Q22-bus slave. Therefore, when the CPU initiates cated from system memory. Each map entry con- aQ22-bustransaction, itsCQBICrequestsQ22-bus sists of a 20-bit page pointer, a 3-bit descriptor mastership. The arbiter CQBIC serves as Q22-bus which CQBIC ignores, and a valid bit. The low arbiter and grants the bus accordingly to auxil- 9 bits of a Q22-bus address pass through as an iary mode CQBICs and other DMA devices. interpage offset; the upper 13 bits select the con-

Digital Technical Journal 135 No. 7 Augz~st 1988 Development of the CVAX Q22-busInterface Chip

i-..,! b-- , I i : h-- . I - I L ., I C: .'.I 1.- ., I I... I

Figure 2 Photomicrograph of CQBIC

tents of one of the 8,192 S/G map locations. The A.s noted in the section Project Research, we CPU informs the CQBIC of the S/G map location selected a map cache size of 16 entries. The re- by writing a base address into the CQBlC map search of Q22-bus DMA device transfer sizes and base register. This write flushes the valid bits of the number of devices active in a dynamic system the cached map entries. showed that 16 entries were sufficient to avoid To avoid map cache coherency problems, the thrashing on entries. The effects of the Q22-bus CPU accesses the S/G map through a VAX 1/0 fair arbitration scheme were used to show that address range decoded by the CQBIC master sec- the simple first-in-first-out (FIFO) replacement tion. The slave section then performs the S/G algorithm selected did not waste performance map memory transaction. This indirect approach and was consistent with incrementing DMA prevents the CPU from directly modiFying the device addresses. As a DMA device transfer S/G map memory independent of the 16 cached address incremented to a page boundary, the next pointers. A CPU to S/G map write invalidates the map entry would be prefetched, and the previous cached map entry as the slave section performs map entry was not used unless the current 1/0 the memory write. CPU to S/G map reads return request completed and another was requested. the cached copy if it was cached or return the We found that the operating system's allocated S/G pointer from system memory. map entries for 1/0 requests to Q22-bus DMA

Dfgftal Technical Journal No. 7 Augztst I988 CVAX-based Systems

devices from a frcc pool list maintainecl in a last- CPU then retries thc same transaction when bus clcallocatcd, first-allocated manner. The overhead control is returned.) S/G map transactions have a of one extra read for a map cntry per page was higher priority than Q22-bus slave transactions. found to be insignificant. The slave section therefore performs S/G map transactions in parallel with Q22-bus slave trans- Q22-bus Master Section actions. When the master tries to access the The mastcr section contains two configuration Q22-bus and it is busy. the arbiter attempts to registers and three status-and-error reporting reg- gain mastership. Until mastership is obtained. isters in addition to all the control circuitry. the slavc can perform a retry to satisfy the The mastcr section's function is to decode all Q22-bus transactions. the CVAX bus addresses and cycle status codes. This decoding determines which of two types of Q22-hzcs Mastership actions is rcquired: When thc master acquires Q22-busmastership, it A transaction to an internal register, the S/G sequences the tr;rnsaction. A special case of the map, or the Q22-bus sequence occurs when the 1/0 memory segmcnt address maps back to system memory through the Q22-bus mastership prior to completion of the slave and map cache. In this case a retry is used, transact ion and the slave gives the data to the master. Each of these actions is described in the text The CPU writes to the Q22-bus arc accepted below. by the master in a dump-and-run manner to 1f a decoded address requires no CQBIC improve performance. response, a signal pin is asserted to external logic for control of buffers and timeout counters. Q22-bus Slave Section The slave section design implemented the two Trunsaction to an Internal Regisler quadword read buffers and the two octaword When the master sectlon detects a CVAX bus write buffers. This section was the key to realiz- address for one of the two control or three ing the performance goals established for the address registers in CQBIC, it returns or writes chip. Thc slave has to respond to all Q22-bus the data. The master sectlon also facilitates a transactions by checking the address in the S/G memory lock for the CPU to perform a read-lock map arid then sequencing the CVAX bus to put or and write-unlock operation. First, the master get data. The slave must coordinate its intentions detects a CVAX bus interlocked transaction and with all other chip sections to avoid deadlock then performs a retry until Q22-busmastership is conditions. This coordination is realized in a pri- obtained. Q22-bus mastership is held until an oritization circuit which receives state inputs unlock transaction or an exception occurs. As from all sections of the chip and outputs status long as other Q22-bus devices follow this proto- codes to the slave and master sections to trigger col, memory that is mapped to the Q22-bus can actions. be shared. The slave watches for master or Q22-bus trans- action requests. When the slave receives Q22-bus Transaction to S/G iMup addresses, it passes these to the map cache for As noted in the S/G iMap section, S/G map trans- validation. If the S/G entry is not cached, the map actions arc controlled by the master section. The cache signals the slave to acquire a new S/G map master requests the slave and map cache sections pointer from system memory. The map cache to complete the memory and cache transactions. will cache this new entry if the valid bit is set. If To construct the memory address for the slave and the valid bit is cleared, then an exception is map cache, the master uses the significant low taken. When the address is validated, the slave 13 bits of S/G map 1/0 address as an offset from proceeds to sequence the transaction to or from a the map base register. buffer and system memory. During slave writes to the system memory, the CVAX is signaled to Transaction to Q22-bus invalidate its internal cache. To avoid deadlocks, the master utilizes the CVAX The slave maintains two octaword write buffers CPU retry transaction (CVAX CPU relinquishes to optimize Q22-bus octaword block-mode trans- CVAX bus control to the CQBIC slave section. 'The actions. By using a CVAX bus multitransfer burst,

Digital Technical Journal 137 No. 7 August 1988 Dez~elopmentof the CVAX Q22-busInterface Chip the slave can unload one buffer to memory while We learned how to manage efforts from a distance filling the other octaword buffer. and to coordinate and communicate complex For each new Q22-bus read request, the slave technical information around the globe. prefetches four words from memory. This pre- fetch is done in anticipation of block-mode trans- Acknowledgments actions. These four words are buffered and sent to The author wishes to acknowledge the technical the Q22-bus master. As the third word is contributions of S. Kyu, S. Akanuma, A. Abe, unloaded, the slave prefetches four more words. H. Hayashi, M. Hasegawa, M. Taiji, S. Iida, As either a Q22-bus block-mode read or write M. Kikutani, K. Koga, L. Walker, D. Grondalski, transaction nears a page address boundary, the J. Lipcon, and R. McNamara. slave performs an S/G map entry prefetch of the next entry. The slave then passes the prefetched References entry to the map cache. 1. B. Maskas, "Developing the MicroVAX I1 An additional function of the slave section is a CPU Board ," Digital Technical Journal Q22-bus addressable interprocessor doorbell (March 1986): 37-47. register. This register accommodates arbiter and auxiliary mode operation by supplying to the 2. P. Rubinfeld, et al., "The CVAX CPU, a CPU a memory access semaphore, an interrupt CMOS VAX Microprocessor Chip," ICCD request, and a vector address. Proceedings (October 1987): 148- 152. Q22-bus Electrical Interface Section 3. G. Lidington, "Overvjew of the MicroVAX 3500/3600 Processor Module," Digital Thc Q22-bus is a 120-ohm transmission line Technical Journal (August 1988, this with near- and far-end parallel termination. The issue): 79-86. length of the Q22-bus can vary from 25 to 60 cen- timeters and is subject to reflection and crosstalk 4.T. Leonard, ed., VAX Architecture Refer- noise. CQBIC contains 33 transceivers and ence Manual (Bedford: Digital Press, 9 receivers which connect directly to the Order No. EY-3459E-DP, 1987). Q22-bus. The open-drain outputs and filtered inputs were designed to operate reliably in the Q22-bus environment. The input filter rejects crosstalk and reflection noise by staging a low pass RC filter. The filter is constructetl with an n-diffusion resistor and p-type field effect transistor (PFET) capacitor with a differential amplifier receiver which main- tains a narrow noise immunity region. The open-drain output driver controls the edge rates. 'This control minimizes transmission-l ine reflections and crosstalk for ac load variation from 30 to 330 picofarads, and dc termination variation of 240 to 60 ohms at 3.6 volts. To satisfy the 100 mA sink current possible on each of 3.3 outputs without excessive heating, low inter- nal power dissipation was achieved by low steady-state "on" resistance. A disable control allows the output to power down without affecting the Q22-bus. Conclusion A single chip Q22-bus interface was realized and is being shipped in Digital's systems as the result of the successful venture for JRDC, SEG, and IMSD.

Digital Technical Journal No. 7 August 1988 David K. Morgan I

me CVM CMCTL - A CMOS Memory Controller Chip

Tbe CMCTL -part of the CVM family of chips - is a high-performance ECC memory controller for single-processor systems. Implemented in Digital's CMOS technology, the CMCTL is optimized to satisfy Q-bus-based system requirements. Tbe CMCTL operates as either a synchronous or an asynchronous interface between the CVM bus at cycles from 60 to 100 nanoseconds and the private memory interconnect. For memory read or write operations, the CMCTL supports the CVAX multiple-transfer proto- col. Data parity and memory error checking is implemented for all data transfers. Tbe chip's high performance is achieved in part by a high-speed, page-mode accessprotocol.

'I'he decision to design a CVAX memory controller had to interfdce directly to the CVAX bus and (CMCI'L) was made in July 1984. The primary handle the memory transactions originating from goal of thc CVAX CMCTL project was to dcsign a the CVAX CPU chip. Located in the CVAX CPU high-performance, single-chip, error-correcting chip is an integral primary write-through 1 -kilo- code (ECC) memory controller for a single- byte (KB) cache. The size of this cache can be processor system. This chip would be part of a optionally expanded with a second-level cache CVAX family of core peripheral functions. function on the CVAX bus. Consequently, the Several systems being developed at that time CMCTL-to-CVAX bus interface had to work with utilized thc MicroVAX I1 CPU chip, the predeces- or without the optional second-level cache. Fur- sor to the CVAX CPU chip. Because company rev- thermore. the primary cache and the optional enue for Q-bus-bascd systems such as the sccond-level cache use byte parity for memory MicroVAX I1 is significant and a performance error detection. Therefore, the CMCTL bus inter- benefit could be gained from a custom chip face was required both to generate and to check design, the memory controller design goals were byte parity. For CVAX-based systems operating at focused to satisfy the requirements of a Q-bus- 100-nanosecond (ns) and 60-11s CVAX bus cycles based system. The initial system requirements for and implementing a second-level cache, the per- the CMCTL were determined by studying the formance goals were respectively 2.5 and memory controller specifications and by dis- 4.0 times the performance of the MicroVAX I1 sys- cussing requirements with key members of the tem. These goals governed the CMCTL bus mem- project team for the existing MicroVAX I[ system. ory performance, or memory cycle time, require- In addition, the Electronic Storage Development ments described later in this paper. Since (,ESD) Group was consulted on the requirements memory size requirements are proportional to of a memory controller. CPU chip performance, the CMCTL had to sup- Let us now examine the key aspects of the port a memory size larger than that of the CVAX CPU chip that influenced the system MicroVAX 11. The MicroVAX I1 CPU memory sys- requirements for the CMCTL. First, the CMCTL tems have a byte-parity, memory error-detection scheme. To meet the reliability requirements for A shortcr version of this paper first appeared in the Procesd- larger memory systems, the CMCTL was designed ings of the 1987 ICCD: VLSI Cornpulers and Processors. primarily as an ECC memory controller. October 1987 entitled "The CVAX CMCTL. A CMOS Memory Since a direct memory access (DM) function Controller Chip" by D. Morgan. K. Chui, J. Clouser. S. Nadkarni, and R. Strouble. Copyright 1987, The Institute can also become the bus master on the CVAX bus, of Glectrical and Electronic Engineering, Inc. the system requirements for the CMCTL were

Digital Technical Journal 139 No. 7 August 1988 The CVAX CMCTI. - A CiMOS Memory Controller Chip intluenccd by thcsc functions also. Because the write transaction on the CVAX bus that signals the CVAX CPLJ chip pcrforrns only synchronouh trans- termination of the interlock instruction. actions and a DMA function could be either syn- Certain base technology constraints influcnced chronous or asynchronous. the CMCTL is designed the CMCTL specification. First, the high perfor- to run as a synchronous or asynchronous slave on mance requirements for memory in a system that the CVAX bus. Further, the CVAX CPU chip can docs not implement a second-level cache deter- handle only two of the possible four types of data mined that the CMCTL be implemented in a sin- transfer lengths on the CVAX bus. However, a gle custom chip. At the time, it was not possible Q-1x1s DM4 function (CQBIC) needed to gener- to implement a memory controller with the ate all four possible data transfer lengths in order required speed in a commercially available gate to efficiently hanclle data transfers between the array that would run synchronous with the CVAX Q-bus and the CVAX bus which havc dara widths CPlJ chip. Furthermore. in a Q-bus-based system, of 16 bits and 32 bits, respectivcly. The require. memory expansion occurs in the Q-bus back- mcnt to work with the Q-bus DMAfunction meant plane. Therefore, a single memory controller that th:it the CMC'TI. needed to handle all four dat;~ resides on the CPIJ module and controls the transfer lengths. In addition, since a DMA func- memory by means of signals on the backplane is tion could optionally generate and check parity. the simplest and most quickly implementcd sys- the CMCT'I. had to be flexible in this rcgard as tem solution. Another factor that influenced the well. Finally, the CVAX CPIJ chip cxecutcs intcr- single-chip alternative solution was the limited locked instructions which must have the cffcct of space available on the CPlJ module that imple- "locking" or "unlocking" the memonr from ments a second-level cache. Taken together, these DMA read-modify-write transactions. lntcrlocked factors ruled out the possibility of designing a memory transactions are not defined in the Q-bus slower memory controller using commercially protocol. Therefore, interlocked memory trans- available memory controller components for sys- actions are handled with a bus interlock scheme. tems that irnplernent a second-level cache. The In this scheme, the CQnlC stalls, i.e., RETRY, availability of CMOS-I technology in Digital's the CVAX CPlJ chip memory re;~dlock bus trans- Hudson. Massachusetts. facility in 1984 drove action on the CVAX bus until it becomes thc the design technology choice. Q-bus master first - locking out 1/0 to mem- onr - before the CVAX can perform interlocked System Overview instructions. RETRY is a slave response to a bus The CVAX CMCT'L is the core control function of master on the CVAX bus that tells it to retnzthe a single CVAX CPLJ memory system. This chip bus cyclc because it cannot complcte the serves as the interface between dcvices on the rcqucstcd operation. The CQBIC rele:~scs thc CVAX bus and a CMOS private memory intcrcon- Q-bus after it sees a CVAX CPIJ chip memory nect (PMI). Figure 1 shows the major interfaces

MEMORY MEMORY MEMORY MEMORY

CMCTL '7 PMI CVAX BUS

I/O BUS

Figure I Major Inlerface Conneclions of the CMCTL Chip

1-40 Digital Tecbnical Journal No. 7 Augrisl 1988 CVAX-based Systems

Figure 2 Photomicrograph of CMCTL Showing Major Sections

of [he CMCTL in a CVAX system, and Figure 2 The CMCTL connects directly to its other major shows the major sections of the chip. Table 1 lists interface, the PMI. The PMI consists of control, the physical characteristics of the chip. address, and data signals which interconnect the This section presents a brief overview of CMCTL and the memory array modules. Through the CMCTL chip's two major interfaces, data transfer support, and error-checking and notifica- Table 1 CMCTL Summary Characteristics tion features. Process 2-micron drawn, N-well, dual CMCTL Major Interfaces aluminum CMOS process As interface to the CVAX bus, the CMCTL responds Number of 20,000 as either a synchronous or an asynchronous slave transistors device. When the CVAX CPU chip is bus master, Die size 7.6 mm x 8.0 mm the CMCTL responses are synchronous. When a Power dissipation 1.5 W worst case DMA device is bus master, a bus-mode signal Packaging 132-pin surface-mountable chip carrier with 25-mil lead spacing determines whether the chip responds as a syn- Power supply +5 V chronous or asynchronous device.

Digital Technical Journal No. 7 August 1988 The CVAX CMCTL - A CMOS Memory Controller Chip these interconnections, the chip controls up to CMCTL Performance four memory modules, each containing one, two, The CMCTL achieves its performance in part by or four banks of dynamic random-access memory using a high-speed, page-mode RAM access pro- (DRAM). Each memory module is required to tocol on the PMI. DRAMS that run in page mode buffer all the PMI signals. can perform data transfers in approximately one- Data Transfer half the cycle time of those run in nonpage mode. The CMCTL responds to CVAX single-transfer The CMCTL fully supports the CVAX bus multi- memory write or read operations within two or ple-transfer protocol and can perform one to four four CVAX bus cycles, respectively. During a data transfers on a memory read or write opera- memory read operation, the CMCTL starts a mem- tion. Each data transfer can have up to four bytes ory read access in parallel with an optional cache of data. Since ECC is generated across four bytes, to increase memory read performance. If the write data with less than four valid bytes will memory read address hits in the external cache, cause the CMCTL to do the actual memory write the CMCTL aborts the read operation. The on the PMI as a read-modify-write cycle. Other- CMCTL performs memory write transactions as wise, the write data goes directly to memory. dump-and-run. Error Checks and Notz9cation Table 2 lists the memory operations and the corresponding performance for synchronous data The CMCTL performs two error-checking func- transfers with 4 bytes of data. Two numbers are tions: shown for multiple-transfer memory operations. CVAX bus data parity error checks The first is the time in CVAX CPU bus cycles to Memory error checks complete the first transfer; the second, the time to complete subsequent transfers. In order to To assist with the error checking of data transfers tune the memory performance across different on the CVAX bus, the CMCTL checks data parity CVAX bus speeds, the CMCTL provides a pro- on memory writes. The chip generates parity grammable mechanism for varying PMI transac- with the data on memory reads. tion timing. For CVAX bus cycle times less than For data transfers on the PMI, the CMCTL has 100 ns, the CMCTL can be programmed to add two memory error-checking modes: 7-bit ECC, slip cycles to memory read operations in incre- and single-bit parity. In ECC memory error mode, ments of the CVAX bus cycle time. The asyn- the CMCTL detects double-bit uncorrectable chronous performance of the CMCTL can be memory errors and detects and corrects single-bit estimated by adding one bus cycle to the syn- memory errors. In parity memory error mode, the chronous data transfer numbers in Table 2. CMCTL can detect single-bit memory errors. The CMCTL memory read access time is very The CMCTL uses four outputs to notify the important for systems that do not have a second- CVAX bus master of four error conditions. These level cache. For example, a 90-ns CVAX bus cycle error-condition notices are as follows: with a 5/3 CMCTL memory read access with a The bus transaction was successful and com- second-level cache results in CPU performance pleted with no errors. 3.0 times that of the MicroVAX 11. Without the second-level cache, the CPU performance is The memory data transfer resulted in an uncor- rectable ECC or parity error. The memory data transfer resulted in a cor- Table 2 CVAX CMCTL Read and Write Perfor- rectable memory error. mance (in Numbers of Bus Cycles) The CVAX CPU chip-initiated memory write Memory Operation CVAX Bus Cycles had a parity error. (4 Bytes of Data) 100ns 90ns 60ns In addition to these four outputs, the CMCTL pro- Single read 4 5 6 vides an output that indicates when the CMCTL is Multiple read 412 513 613 not going to respond to either a memory or an CPU single write 2 2 2 1/0 operation. This output reduces the number DMA single write 3 3 3 of external components required to detect Multiple write 312 312 312 addresses not implemented in a system.

Digital Technical Journal No. 7 August 1388 CVAX-based Systems

reduced by 15 percent, or to 2.5 times the uses single-bit parity to detect memory errors. MicroVAX 11. If the CMCTL memory read access The data path transport delay for a memory read was fixed at 6/3 without the second-level cache, or write is one-half the cycle time of the CVAX the CPU performance would be reduced another bus. This performance measure includes module- 10 percent, or to 2.0 times the MicroVAX 11, at a level interconnect delay. 90-11s CVAX bus cycle. Therefore, the ability to program the CMCTL memory read access time as Memory Control an integral multiple of the CPU bus cycle is a very The PMI interface provides 20 signals. These sig- important feature that helps maximize the CPU nals comprise all the control strobes and memory performance. address signals needed to control DRAMS. A fast memory access time is achieved by detecting a CMCTL Functions valid memory address and starting a memory The CMCTL was designed to integrate both the access within 25 percent of a CVAx bus cycle control and data path functions required to con- time. trol the data flow to and from memory. The CMCTL has an integral refresh counter for refreshing memory. Registers The CMCTL contains two registers: Summary A status register The CVAX CMCTL is the core control function of a complete memory subsystem. The chip provides A control register the control for a flexible memory subsystem that How each functions within the CMCTL and the functions at CVAX bus cycles from 60 to 100 ns. system is described below. The status register is loaded with important Ack;nowledgments information when the CMCTL detects an error. The author wishes to acknowledge the technical The system error-handling software uses this contributions of F. Aires, M. Benoit, K. Chui, information to log the error. The CMCTL has a J. Clouser, N. Fitzgerald, J. Gerde, B. Griswold, memory error status register that captures the N. Murthy, S. Nadkarni, J. Siegel, R. Strouble, failed memory address along with the type of K. Steward, and K. TenHuisen. memory error (bus parity error or memory error) and error syndrome. In ECC mode, the error syndrome is a 7-bit encoded number. For correctable errors, this number indicates which data bit was corrected. In parity mode, the error syndrome has no useful meaning. The chip's control register serves several func- tions. First, the control register regulates a diag- nostic test mode. Second, this register controls the PMI cycle tuning. Third, memory error detec- tion and correction can be turned on or off to facilitate the testing of the CMCTL error-check- ing functions and memory module RAMS by mem- ory diagnostic software. Finally, a refresh opera- tion can be forced for high-speed refresh testing. Data Path In ECC error detection mode, the data path uses a modified Hamming code to detect double-bit errors and to detect and correct single-bit errors. The PMI interface has 39 signals; 32 are used for the memory data, and 7 are the memory check bits. In parity error detection mode, the data path

Digital Technical Journal No. 7 August 1988