CVAX-based Systems

Digital Technical Journal Digital Equipment Corporation

Number 7

August 1988

Managing Editor Richard W. Beane

Editor Jane C. Blake

Production Staff
Production Editor - Helen L. Patterson
Designer - Charlotte Bell
Typographers - Jonathan M. Bohy, Margaret Burdine
Illustrator - Deborah Keeley

Advisory Board
Samuel H. Fuller, Chairman
Robert M. Glorioso
John W. McCredie
Mahendra R. Patel
F. Grant Saviers
William D. Strecker
Victor A. Vyssotsky

The Digital Technical Journal is published by Digital Equipment Corporation, 77 Reed Road, Hudson, Massachusetts 01749.

Changes of address should be sent to Digital Equipment Corporation, attention: List Maintenance, 10 Forbes Road, Northboro, MA 01532. Please include the address label with changes marked.

Comments on the content of any paper are welcomed. Write to the editor at Mail Stop HLO2-3/K11 at the published-by address. Comments can also be sent on the ENET to RDVAX::BLAKE or on the ARPANET to BLAKE%RDVAX.DEC@DECWRL.

Copyright © 1988 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. Requests for other copies for a fee may be made to Digital Press of Digital Equipment Corporation. All rights reserved.

The information in this journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

ISSN 0898-901X

Documentation Number EY-6742E-DP

The following are trademarks of Digital Equipment Corporation: ALL-IN-1, DEQNA, HSC70, J-11, MicroVAX, MicroVAX II, MicroVAX 3000, NMI, NOTES, Q-bus, Q22-bus, RA81, RA82, RD54, RQDX3, RX50, TK50, ULTRIX-32, VAX, VAX-11/780, VAX-11/782, VAX 6200, VAX 6210, VAX 6220, VAX 6230, VAX 6240, VAX 8200, VAX 8300, VAX 8650, VAX 8800, VAX 8840, VAXBI, VAXELN, VAX MACRO, VAX SPM, VAX/VMS, VMS.

Compu-Share is a trademark of Compu-Share, Inc.

GDS II is a trademark of Calma Corporation.

SPICE is a trademark of the University of California at Berkeley.

Cover Design
Systems based on Digital's advanced CMOS technology are featured in this issue. The graphic on our cover includes the lattice structure of the silicon crystal, a basic element of this technology. The expansion of the image expresses the performance growth and the system extensibility of the new CVAX-based systems.

Tektronix and DAS are trademarks of Tektronix, Inc.

The cover was designed by Barbara Grzeslo and Jacquie Hockaday of the Graphic Design Department.

Book production was done by Digital's Educational Services Media Communications Group in Bedford, MA.

Contents

Foreword
Robert M. Supnik

CVAX-based Systems

An Overview of the VAX 6200 Family of Systems
Brian R. Allison

The Architectural Definition Process of the VAX 6200 Family
Brian R. Allison

Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus
Richard B. Gillett, Jr.

The Role of Computer-aided Engineering in the Design of the VAX 6200 System
Jean H. Basmaji, Glenn P. Garvey, Masood Heydari, and Arthur L. Singer

VMS Symmetric Multiprocessing
Rodney N. Gamache and Kathleen D. Morse

Performance Evaluation of the VAX 6200 Systems
Bhagyam Moses and Karen T. DeGregory

Overview of the MicroVAX 3500/3600 Processor Module
Gary P. Lidington

Design of the MicroVAX 3500/3600 Second-level Cache
Charles J. DeVane

The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor
Thomas F. Fox, Paul E. Gronowski, Anil K. Jain, Burton M. Leary, and Daniel G. Miner

Development of the CVAX Floating Point Chip
Edward J. McLellan, Gilbert M. Wolrich, and Robert A. J. Yodlowski

The System Support Chip, a Multifunction Chip for CVAX Systems
Jeff Winston

Development of the CVAX Q22-bus Interface Chip
Barry A. Maskas

The CVAX CMCTL - A CMOS Memory Controller Chip
David K. Morgan

Editor's Introduction

The second issue of the Digital Technical Journal (March 1986) featured papers on the then recently announced MicroVAX II system, a system based on a single-chip VAX implementation. In this seventh issue, we present papers on the second generation of that chip set, CVAX, the two new systems that take advantage of its increased performance capabilities, and a new version of the VAX/VMS operating system for symmetric multiprocessing.

The new mid-range system based on the CVAX chip set is the VAX 6200 family of computers, which utilizes a multiprocessing architecture. The first of two papers by Brian Allison is an overview of this highly configurable, expandable system. Brian's second paper offers insights into the architectural definition process for the 6200.

One of the major decisions made by the 6200 engineers was to design a new interconnect to support the multiprocessor system. Rick Gillett presents an informative discussion of the complexities involved in interfacing a microprocessor to a high-speed, multiprocessing bus.

To ensure the availability of first-pass functional parts, a design verification team of engineers worked in parallel with the 6200 module designers. Jean Basmaji, Glenn Garvey, Masood Heydari, and Art Singer discuss the computer-aided engineering and verification principles the team instituted for the project.

Rod Gamache and Kathy Morse then describe the major features of symmetric multiprocessing in the VAX/VMS operating system. Of particular interest is their description of a new synchronization method implemented in VAX/VMS version 5.0.

In the last paper related to the VAX 6200 system, Bhagyam Moses and Karen DeGregory describe the development of workloads to measure VAX 6240 performance. As part of their discussion, they include performance measurements and analysis.

The second new system based on the CVAX chip set is the low-end MicroVAX 3500/3600 system, which offers three times the performance of its predecessor, the MicroVAX II. In his overview of the major sections of the processor module, Gary Lidington relates how schedule and performance requirements influenced product design decisions.

Charles DeVane then describes the MicroVAX 3500/3600 system's two-level cache architecture, with emphasis on the design of the second-level cache. He also presents some cache performance test results.

The high performance of both the VAX 6200 family and the MicroVAX 3500/3600 system is attributable in great measure to the CMOS VAX family of chips on which these systems are based. Our five final papers address the design and development of this chip set. Frank Fox, Paul Gronowski, Anil Jain, Mike Leary, and Dan Miner begin the discussion with an explanation of how designers achieved the performance goals for the single-chip VAX CPU by reducing ticks per instruction and machine cycle time.

A companion to the CVAX CPU, the floating point processor chip offers floating point performance equal to that of the microprocessor for integer operations. The approach taken to attain this goal and a description of the chip are presented by Ed McLellan, Gil Wolrich, and Bob Yodlowski.

Jeff Winston then discusses the development of the system support chip, which provides a common core of peripheral system functions.

Next, Barry Maskas relates the design efforts of three groups, one in Japan and two in the U.S., that resulted in a single-chip interface between the CVAX microprocessor and the Q22-bus I/O subsystem.

In our final paper, Dave Morgan describes the CVAX memory controller chip, CMCTL, which is optimized for Q-bus-based systems.

Jane C. Blake
Editor

Biographies

Brian R. Allison Brian Allison, a consultant engineer for mid-range VAX systems, was the system architect responsible for the coordination of the VAX 6200 system definition and design. Prior to this work, he served as system architect for a project that yielded several products, including DEBNA, DEBNK, and the KA800. As a member of the VAX-11/750 design team, he wrote various portions of the microcode for that product. Brian holds a B.S.E.E. and a B.S.C.S. from Worcester Polytechnic Institute (1977).

Jean H. Basmaji Jean Basmaji is the technical director of computer-aided engineering and design-verification testing for the VAX 6200 project. A software consultant engineer, he has also been involved with CAE/DVT planning and scheduling, and has served as CAE/DVT project leader for the VAX 6200 CPU module. Jean joined Digital after receiving his B.S.E.E. from Lowell Technological Institute in 1977.

Karen T. DeGregory A senior software engineer in the Systems Performance Analysis Group, Karen DeGregory is project leader of systems performance measurement for the VAX 8840 and VAX 6240 systems. In addition to planning and implementing the measurements, she helped develop appropriate workloads for these systems. Prior to this work, Karen was a senior software specialist in the Software Services Backup Support Group. She received her B.S. (1980) with honors and distinction and her M.S. (1981) from Cornell University.

Charles J. DeVane Charles DeVane is a senior hardware engineer in the MicroVAX Systems Development Group. For the MicroVAX 3500/3600 project, he designed the second-level cache on the KA650 CPU module and guided the process of module system debug and introduction to manufacturing. Before joining Digital in 1981, Charles received a B.S.E.E. from North Carolina State University in Raleigh, North Carolina. He is a member of Eta Kappa Nu and Tau Beta Pi engineering honor societies.

Thomas F. Fox Frank Fox, a principal engineer in the Semiconductor Engineering Group, worked on the implementation of the CVAX 78034 CPU chip. He is currently designing a high-performance microprocessor and consulting with the Advanced Semiconductor Development Group on the development of a submicron CMOS process. Frank was educated in Ireland and received a B.E. degree from University College Cork (1974) and a Ph.D. degree from Trinity College Dublin (1978), both in electrical engineering. He has published papers on ultrasonic instrumentation and magnetic resonance imaging (MRI) and has three patents pending.

Rodney N. Gamache A consulting software engineer in the VAX/VMS Software Development Group, Rod Gamache has been with Digital for 11 years. Shortly after receiving a B.S. in mathematics and computer science from the University of New Hampshire, he joined Digital to work on the development of DECnet Phases III and IV for both RSX-11M and VMS. For the last two years, Rod has been project leader for the VMS symmetrical multiprocessing project and has filed two patents on VMS SMP. Rod also serves as a consultant for the architectures of future low-end VAX processors.

Glenn P. Garvey Glenn Garvey is an engineering supervisor presently leading a team in the verification of a new VAX processor. He was the project leader for the system-level verification performed on the VAX 6200 system and has been involved in modeling and verification since coming to Digital in 1982. Glenn was a co-op student at Digital in 1980 and 1981. He holds a B.S.E.E. from Rensselaer Polytechnic Institute.

Richard B. Gillett, Jr. Rick Gillett, a consultant engineer, led the VAX 6200 CPU module project. Prior to his work on the 6200, he served as one of the architects of the XMI bus and was a member of the VAXBI bus project team. Relative to his work on the VAX 6200 design, he has recently filed 13 patent applications. Currently, he is system architect for a new VAX system. Rick joined Digital after receiving a B.S.E.E. (summa cum laude) from the University of New Hampshire in 1979. He is a member of Tau Beta Pi and Phi Kappa Phi.

Paul E. Gronowski Before receiving a B.S.E.E. from the University of Cincinnati in 1984, Paul Gronowski was a co-op student at Digital working on chips for the VAX 8200 and 8500 systems. Currently a senior engineer in the Semiconductor Engineering Group, he has been a codesigner of the E-box for the CVAX 78034 CPU chip, designer of the bus interface unit and exponent section for a CMOS floating point chip, and is now doing advanced development work for a new CMOS microprocessor. Paul is a member of Eta Kappa Nu.

Masood Heydari Masood Heydari is an engineering manager responsible for the computer-aided design, verification, and testing of mid-range VAX systems. He is also managing the development of an I/O subsystem for VAX products. Since joining Digital in 1981, he has developed the diagnostics for a 36-bit system and has been responsible for functional test pattern generation for several products. Most recently, he was responsible for CAD/DVT on the VAX 6200 project. Masood holds a B.S. and an M.S. in computer engineering from Boston University (1980). He is a member of Tau Beta Pi.

Anil K. Jain Anil Jain received an M.S.E.E. from the University of Cincinnati (1980) and a B.S.E.E. from Punjab Engineering College (1978). Upon joining Digital in 1980, he worked on bipolar and CMOS-I technology while a member of the Device Modeling Group. As a senior engineer working on the CVAX project, Anil designed the bus interface unit for the CPU chip and has a patent pending for the bus interface protocol. He is currently working as a project leader on the floating point chip project for vector applications.

Burton M. Leary Mike Leary is a principal engineer in the Semiconductor Engineering Group/Advanced Development Memory Group and is currently working on the design of advanced memory products. In previous work, he participated in the design of the floating point chips for the MicroVAX and 8200/8300 systems and the design of a CMOS serial interface controller chip. Mike joined Digital after receiving a B.S.E.E. from the University of Massachusetts. He is a member of Tau Beta Pi.

Gary P. Lidington Currently an engineering manager in the MicroVAX Engineering Group, Gary Lidington has served as a system engineer for the MicroVAX 3500/3600 system and as a maintainability engineering project manager for the MicroVAX II and single-board with Q-bus multiprocessing architectures. Before coming to Digital in 1981, Gary was a co-op student and test engineer on the L200 project at Teradyne, Inc. He holds a B.S.E.E. (honors) from Tufts University and an M.S. in computer engineering from the University of Massachusetts.

Barry A. Maskas Barry Maskas, a consulting engineer with the Semiconductor Engineering Group, is the project leader for the development of a custom VLSI memory controller. Prior to his current work, he was the U.S. project leader and architect for the CVAX Q22-bus interface chip and co-designer of the MicroVAX II CPU and memory boards. Barry came to Digital in 1979 after receiving a B.S.E.E. from Pennsylvania State University.

Edward J. McLellan Ed McLellan is a principal engineer in the Semiconductor Engineering Group. He has worked on the design of several chips, including the J-11 and the CVAX floating point accelerator chips. He holds one patent for previous work and has made application for two additional patents based on work done for the CVAX floating point chip. Ed joined Digital in 1980 after receiving a B.S. in computer and systems engineering from Rensselaer Polytechnic Institute. Currently he is project leader for a floating point chip for a new processor.

Daniel G. Miner Dan Miner came to Digital after receiving a B.S. in computer engineering from Rensselaer Polytechnic Institute in 1985. A software engineer in the Semiconductor Engineering Group, he wrote the debug and diagnostic tests for the CVAX CPU chip. Dan co-authored and presented a paper on the subject of testability strategy at the 1987 IEEE International Test Conference.

David K. Morgan Dave Morgan is the engineering manager of Advanced Peripheral Development in the Semiconductor Engineering Group. For his work on several engineering projects, Dave has three patents pending. Before joining Digital in 1975, he was a design engineer at the RCA Solid State Division and holds five patents for his work on integrated circuit designs. Dave received a B.S.E.E. (1969) from Western New England College, an M.S.E.E. (1972) from Rutgers University, and has pursued doctoral work in solid-state physics.

Kathleen D. Morse As a consulting software engineer, Kathy Morse is working on advanced development for the VAX/VMS Development Group. She is one of the VMS SMP architects and consults on enhancements to various parts of the VMS executive. Earlier, she implemented the VMS support for the first MicroVAX systems, the first asymmetric multiprocessing VAX system, and the MA780 multiport memory. Kathy joined Digital in 1976 after receiving her B.S.C.S. degree from Worcester Polytechnic Institute, where she also earned her M.S.C.S. degree in 1985. Kathy is a member of IEEE, the Professional Council, and ACM as well as Tau Beta Pi and Upsilon Pi Epsilon.

Bhagyam Moses Bhagyam Moses is the engineering manager of the Mid-range Systems Performance Analysis Group, a group which she established two and a half years ago. Prior to her current work with the VAX 8000 and VAX 6200 series of systems, she had been involved in the modeling and performance measurement of the VAX 8600 and 8650 systems, PDP-11 systems, DECSYSTEM-20, and earlier VAX systems. Bhagyam joined Digital in 1979. She received a B.S. degree (honors) in mathematics from Spicer Memorial College and an M.S. in applied mathematics from Howard University.

Arthur L. Singer Art Singer supervises the computer-aided design, simulation, and timing verification for a new I/O adapter product. His previous work includes supervision of the system simulation and timing verification of the VAX 6200 system and microdiagnostic and release support for the VAX 8600 project. Before joining Digital in 1984, Art was employed by IPL Systems as a design engineer and manager of a microdiagnostic development group. While at IPL, he received a patent for the design of an instruction unit. He is a member of Tau Beta Pi and Eta Kappa Nu.

Jeff Winston A member of the Semiconductor Engineering Group, Jeff Winston designed the microsequencer for the VAX 8200 CPU chip and led the design of two generations of the MicroVAX System Support Chip. He has also contributed to the development of many CAD tools used in chip design. Before joining Digital in 1980, Jeff received his B.S. (1979) and M.S. (1980) in Electrical Engineering from Cornell University. He is currently leading the development of the XMI interface chip set for a new mid-range VAX CPU.

Gilbert M. Wolrich A consulting engineer in the Semiconductor Engineering Group/Architecturally Focused Logic, Gil participated in the J-11 control chip design and was project leader for the CVAX CFPA and the J-11 FPA chip projects. He holds a patent for an ALU with Carry Length Detection used on the J-11 FPA. Gil received a B.S.E.E. from Rensselaer Polytechnic Institute in 1971 and an M.S.E.E. from Northeastern University in 1977.

Robert A. J. Yodlowski Bob Yodlowski, a principal engineer, is the chip implementation project leader and lead circuit designer for the CVAX floating point accelerator. For his work on this project, Bob has three patents pending. He was also senior circuit designer on the J-11 floating point chip project. Relative to this project work, he is a co-inventor and patent holder for ALU with Carry Length Detection (1987). Before joining Digital in 1977, Bob was a senior member of the technical staff at LFE Corporation in Waltham, MA. He received a B.S. in engineering physics from Cornell University (1968) and an M.S.E.E. from Syracuse University (1970).

Foreword

Robert M. Supnik
Corporate Consultant, VLSI Technology, and Group Manager, Semiconductor Engineering Microprocessor Development

In May 1985, Digital introduced the MicroVAX II computer system. Based on the MicroVAX processor chip set, the MicroVAX II system offered unsurpassed price, performance, and reliability characteristics. In the three years since then, Digital has sold more than 100,000 systems based on the MicroVAX chip set. There are more MicroVAX-based systems in the field than all other types of VAX systems combined.

In the same three years, the practice of computer engineering has advanced considerably. Faster processors, bigger memories, quieter packages, and more complex software have appeared in a steady stream. For Digital to remain competitive, we would need, over time, a second generation of VLSI-based VAX chips and systems. The chips and systems that constitute the second VLSI-based generation are described in this issue of the Digital Technical Journal.

The planning for the second generation began in 1983. That year, the LSI Group (now Semiconductor Operations) formulated a multiyear program for the development of both semiconductor process technology and leading-edge chip products. The key characteristics of this process/product plan were

• CMOS (complementary metal-oxide-semiconductor) process technology (Previous Digital chips were based on NMOS technology.)

• Multiple process generations related by optical scaling

• VAX microprocessors as the leading edge chip development projects

• Performance improvements targeted for greater than 50 percent per year

This program not only provided the LSI Group with an overall structure for its process and chip development projects; it also provided Digital's system groups with a stable, long-term basis for planning system products.

The program was also a significant leap of faith. When it was formulated, there was no MicroVAX business. The MicroVAX II system was two years away from shipment. Almost all design resources in the LSI Group and in the low end system groups were busy with the MicroVAX chip set and its related systems. Major development projects in technology, chip design, systems design, and manufacturing were required to bring the program vision to fruition.

Work began with development of the underlying semiconductor technology. Starting in 1983, a team from Semiconductor Manufacturing's Advanced Semiconductor Development (ASD) defined, simulated, and tested CMOS-I, Digital's first CMOS process. When first defined, CMOS-I's key features - N-well base on a p-type epitaxial layer, two levels of metal interconnect, 2.0 micron feature size, direct scalability to 1.5 micron feature sizes - were controversial within an industry that was still debating NMOS versus CMOS. Over time, these choices have been vindicated, and CMOS-I has proven to be a mainstream, robust, highly manufacturable process.

Equally important was development of design methods for larger and more complex chips. The Semiconductor Engineering Computer Aided Design (CAD) Group continuously refined the structured design process first deployed for MicroVAX and V-11. The goals of this effort were improved simulation coverage, faster turnaround time, and more extensive automated verification. One consequence of the increased use of CAD tools was a dramatic increase in the amount of computing power required. This generation of chip development projects used four times as much computing power as the first VLSI generation.

The Semiconductor Engineering Microprocessor Group began architectural prework on the second-generation chip set (called CVAX) in mid-1984. The overarching goal was simple: three times the performance of the MicroVAX chip set in less than three years - a compound performance growth rate of more than 50 percent per year. The central processor design started from the MicroVAX base but drew upon ideas from other VAX implementations, notably the 8700. The floating point unit design focused on minimal execution flows for the most common instructions. Both chips transitioned to implementation in 1985.

The original concept for the CVAX chip set had been to build chip-for-chip analogues of MicroVAX - a central processor and a floating point unit. However, as the flexibility of the new CMOS process, and the efficiency of the CAD tools, were appreciated by designers, the chip set concept expanded beyond the central processor to include key peripherals. The implementation of these peripheral functions in VLSI chips made systems faster, more reliable, and less expensive. In addition, it allowed peripheral functions to be standardized across multiple system implementations and additional functions to be added in modular fashion. The Semiconductor Engineering peripherals group (now Advanced Development) specified and implemented a memory controller, a memory driver, a console interface, and a Q-bus interface.

After the MicroVAX II system shipped in May 1985, the Low-end Systems Group and the Mid-range Systems Group became actively involved in the specification of the CVAX chips and in the definition of new systems utilizing the chip set. In the low end, the 3500/3600 systems were defined as evolutionary extensions of the MicroVAX II. Nonetheless, the performance targets for the new chips posed knotty design problems for a system family bounded by both cost and packaging considerations. In the mid-range, the system designers wished to exploit the CVAX chip set's combination of high performance and low cost by constructing an extensible multiprocessor system. They defined a new system interconnect (supported by unique chips) to provide unprecedented flexibility and extensibility in configuring systems, and new system packaging to support the concept. However, a general-purpose multiprocessor system was feasible only if the VMS operating system could take advantage of the incremental power offered by additional processors. This required a major restructuring of VMS to support symmetric (all processors equal) multiprocessing. Thus, the definition and implementation of the mid-range 6200 system family and of VMS symmetric multiprocessing support had to be closely linked.

As the engineering development projects progressed, manufacturing became heavily involved in planning and executing the transition from design to volume product. LSI Manufacturing in Hudson, Massachusetts, introduced CMOS-I into multiple fabrication units in order to produce prototypes quickly and to ramp up to high volume production. System manufacturing groups in Westfield (Massachusetts), Albuquerque (New Mexico), Puerto Rico, and other sites worked closely with the system designers to introduce the new manufacturing processes required for system production.

The result of these development programs is a family of VAX systems with exemplary price, performance, and reliability characteristics. Moreover, the programs leave as residuals a set of VLSI components from which other products can be built, and base technology from which further advances in chip and system design will evolve.

The initial program vision has been fulfilled, even exceeded. Many people, in teams and individually, worked together to bring this about. The excellence of the results reflects, in full measure, the excellence of the work that they have done.

Brian R. Allison

An Overview of the VAX 6200 Family of Systems

Digital's VAX 6200 series is a high-performance, expandable family of computer systems that combines low-cost microprocessors with high-performance memory and I/O subsystems. Based on the CMOS VAX chip set, the VAX 6200 CPU module performs at 2.8 times the VAX-11/780 system; utilizing a multiprocessing architecture, system speeds are available up to 11 times the VAX-11/780 system. The memory subsystem utilizes a multicontroller architecture for up to 256MB of total system memory. The XMI bus, the electrical interconnect for the system, supports the multiple processors, memory subsystems, and VAXBI channel adapters. The VAXBI is used for all I/O devices.

The VAX 6200 family of computer systems is the most recent addition to Digital's line of VAX computer systems. The VAX 6200 systems, primarily based on CMOS technology, are mid-range systems which exploit multiprocessing techniques. The VAX 6200 family currently comprises four systems, all built from common subassemblies. Any VAX 6200 system may be upgraded to any other VAX 6200 system simply by adding CPU and memory modules to the existing cabinet.

This paper provides an overview of the system and therefore a context for the five papers that follow in this issue. These papers describe several of the components in detail, the engineering design effort, the performance evaluation process, and some of the multiprocessing aspects of the operating system.

In the past, CMOS-based microprocessor technology has been used primarily to build low-cost systems. Today, by using multiples of these low-cost microprocessors, we are presented a unique opportunity to produce a high-performance computer system when the microprocessors are coupled with high-performance memory and I/O subsystems. Although this type of system architecture will not directly result in faster execution of a single task, it does result in greater system throughput in applications that have several simultaneously computable tasks. The architecture couples the effectiveness of the VMS operating system in multiprogrammed environments with hardware optimized for efficient multiprocessor operation. The result is a system that offers similar performance for a large class of applications at a better price-performance ratio than that offered by traditional single-processor, high-performance computer systems.

A primary objective of the VAX 6200 system design is to provide a highly configurable and expandable computing environment. To achieve this objective, designers chose a modular subassembly design for the total system. This modular design provides for cost-effective basic systems and also allows for system expansion to achieve higher performance. All members of the VAX 6200 family are housed in the same cabinet and use the same basic subassemblies. The only difference is the number of processors, amount of memory, and number of I/O devices. Table 1 details the configurations of the VAX 6210, VAX 6220, VAX 6230, and VAX 6240 systems.

System Architecture

All VAX 6200 systems consist of CPU(s), memory, and I/O channel adapters connected to a common system interconnect known as the XMI. The VAXBI is used as the interconnect to all I/O devices in the system.¹ All memory and I/O devices are equally accessible by all CPUs in the system. Figure 1 shows a block-level diagram of the VAX 6200 system.


Table 1 VAX 6200 Family System Configurations

                                   VAX 6210   VAX 6220   VAX 6230   VAX 6240

Number of processors               1          2          3          4
Main memory                        32MB       64MB       64MB       128MB
VAXBI channels                     2          2          2          2
CPU cycle time                     80 ns      80 ns      80 ns      80 ns
Cache size (per CPU)               1KB        1KB        1KB        1KB
                                   256KB      256KB      256KB      256KB
Free XMI slots                     10         8          7          4
Performance                        2.8        5.5        8.3        11.0
  (times one VAX-11/780 system)
Maximum CPUs                       4          4          4          4
Maximum memory                     256MB      256MB      256MB      256MB
Maximum VAXBI channels             6          6          6          6
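As a quick reading aid for Table 1, the sketch below compares each system's quoted performance with ideal linear scaling of the single-processor figure. The performance values are copied from the table; the function name and the notion of "scaling efficiency" are illustrative assumptions, not terms used by the author.

```python
# (number of processors, performance in multiples of one VAX-11/780), from Table 1
PERFORMANCE = {
    "VAX 6210": (1, 2.8),
    "VAX 6220": (2, 5.5),
    "VAX 6230": (3, 8.3),
    "VAX 6240": (4, 11.0),
}


def scaling_efficiency(table):
    """Ratio of quoted performance to (processors x single-processor performance)."""
    base_cpus, base_perf = table["VAX 6210"]
    per_cpu = base_perf / base_cpus      # single-processor baseline
    return {name: perf / (cpus * per_cpu) for name, (cpus, perf) in table.items()}


if __name__ == "__main__":
    for name, eff in scaling_efficiency(PERFORMANCE).items():
        print(f"{name}: {eff:.1%} of ideal linear scaling")
```

By this measure the quoted multiprocessor figures all stay within about two percent of linear scaling.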

[Block diagram labels: memory, up to 256MB; CPUs, 4 maximum, up to 11 x VAX-11/780; XMI, 100MB/second; VAXBI channel adapters, 6 maximum; VAXBI 4, 5, and 6 in the optional VAXBI expander cabinet]

Figure 1 VAX 6200 System Block Diagram


The primary goal of the VAX 6200 system is to allow higher levels of system performance through multiprocessing. To simplify software design and to be consistent with previous multiprocessor architecture, it was essential to provide a shared memory resource. All system memory is a global resource accessible through the same address space from each processor and from all I/O devices. A sophisticated multilevel cache contained locally in each CPU minimizes memory accesses on the XMI. Cache coherency is maintained totally by hardware.

Technology

The VAX 6200 systems are based on a number of different CMOS technologies. The VAX CPU chip set and the system interconnect transceivers are implemented entirely in Digital's full custom CMOS process featuring a size of 1.5 microns.² The interface between each module and the system interconnect is implemented in channelless 1.5-micron CMOS gate arrays. The number of gates used in these arrays varies from 18K to 50K gates. The interface to the VAXBI and the XMI arbitration system is implemented in 1.5-micron channeled arrays. The on-board CPU caches are implemented with 45-nanosecond (ns) 64K-by-4 CMOS static random-access memories (SRAMs) and industry-standard CMOS cache tag chips.

All VAX 6200 XMI and VAXBI modules are connected to their respective backplanes by a 300-pin zero insertion force (ZIF) connector. All modules use 10-layer controlled impedance printed circuit boards. All cables from the modules are connected through the backplane to improve reliability and to minimize the task of replacing modules.

The VAX 6200 XMI backplane is a 14-layer controlled impedance printed circuit board. Side 1 consists entirely of surface-mount contacts for the ZIF connector. Side 2 consists of plated through holes for power strips and I/O pins, and surface-mount pads for resistors. These surface-mount resistors form the termination network for the XMI signal lines.

VAX 6200 XMI modules use a printed circuit board very similar to the VAXBI printed circuit board. XMI modules have the same finger pin design as the VAXBI, but the module size is 28 cm (11.025 inches) deep instead of 20.38 cm (8.025 inches) deep.

The VAX 6200 modules make use of advanced module technology features to maximize both the number of I/Os available to VLSI chips and the amount of logic that can be put on a module. Surface-mounted components are used extensively throughout the system. Further, many passive components and a limited number of active surface-mounted components reside on side 2 of the modules. All VAX 6200 modules limit the use of surface mount to 50-mil lead pitch components with vias on 100-mil centers. Across the modules in the system, there is a mixture of small outline integrated circuit (SOIC), plastic leaded chip carrier (PLCC), and cerquad surface-mount packages.

All VAX 6200 XMI modules interface to the XMI through a set of eight semicustom parts. These eight chips are physically mounted on a section of the module known as the "XMI corner." This section of the module is approximately 12.7 cm (5 inches) by 3 cm (1.2 inches) and is located by the A, B, and C connectors of the module. (See Figure 2.) The XMI interface area is identical on all modules so that a common electrical load is presented to all slots on the XMI. The XMI corner has four 44-pin cerquad packages on side 1 of the module and four 44-pin cerquad packages on side 2. In addition, approximately 100 surface-mounted-device (SMD) signal termination resistors and bulk power capacitors are divided evenly across both sides of the module in the XMI corner.

Figure 2 is a photograph of the three VAX 6200 XMI modules. Note that all three modules have the identical components in the lower right corner and a similar gate array directly above the XMI corner.

VAX 6200 CPU Module

As noted earlier, the VAX 6200 CPU (KA62A) is based on the CMOS VAX chip known as the CVAX. The KA62A is a single module that implements a full CPU subsystem. Included on the KA62A module are

• The CVAX chip, which includes a 1 kilobyte (KB) on-chip cache

• An external 256KB cache

• A floating point accelerator chip (CFPA)

• Console support hardware

• An interface to the XMI

Figure 3 shows a block diagram of the KA62A module.


Figure 2 Three VAX 6200 XMI Modules

Figure 3 VAX 6200 CPU Module (KA62A) Block Diagram
(Blocks shown: CPU chip with 1KB cache, FPU chip, 256KB cache with cache tag, console ROM 128KB, diagnostic ROM 128KB, EEPROM 32KB, console support chip, CPU/XMI gate array containing a read queue, write buffer, invalidate queue, and duplicate cache tag, and the XMI interface chips connecting to the XMI.)


Using the CVAX processor with an 80-ns cycle time, the KA62A CPU module performance is approximately 2.8 times that of the VAX-11/780 system. For a total system performance up to 11 times greater than the VAX-11/780, up to four KA62A CPU modules may be configured in a VAX 6200 system.

The KA62A CPU module contains a two-level cache to reduce memory access time. The primary cache is 1KB in size and resides inside the CVAX chip. This cache contains only instruction data to eliminate the need to invalidate this data as other processors write to cached data locations. (The VAX architecture provides strict rules for modification of instruction type data.) The secondary cache is 256KB in size and contains data as well as instructions. The KA62A monitors write transactions on the system interconnect and invalidates any cached locations written by another CPU or I/O device.

Memory

The VAX 6200 memory subsystem is made up of memory controller/array modules and is known as the MS62A. The MS62A module, shown in Figure 4, contains a memory controller chip and 32 megabytes (MB) of 1-megabit (Mb) dynamic RAMs (DRAMs). The MS62A maintains a 64-bit data path between the memory controller chip and the RAMs, and implements an 8-bit error-correcting code (ECC) for each 64-bit word. The MS62A contains hardware to implement up to 16 "lockable" memory locations per memory array. These memory locks are used extensively by processors and I/O devices to ensure singular access to data structures in a shared-memory multiprocessor system.

The greater memory bandwidth required by multiple processors and I/O channels is achieved by memory interleaving. The MS62A allows interleaving on 32-byte boundaries. As long as memory addresses are randomly distributed across the lower 6 address bits, the bandwidth of the total memory subsystem can be increased linearly with the addition of interleaved memory controllers.
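As a rough illustration of interleaving on 32-byte boundaries, the sketch below maps a physical address to one of several interleaved memory modules. The function name and the round-robin rotation of consecutive 32-byte blocks across modules are assumptions made for illustration; the text does not give the address decoding actually used by the MS62A.

```python
BLOCK_BYTES = 32  # interleave granularity stated in the text


def interleave_target(address: int, ways: int) -> int:
    """Return the index of the module that would serve `address`.

    Hypothetical mapping: consecutive 32-byte blocks rotate across modules.
    """
    if ways not in (1, 2, 4, 8):
        raise ValueError("expected no interleaving (1) or 2-, 4-, or 8-way")
    return (address // BLOCK_BYTES) % ways


if __name__ == "__main__":
    # Walk eight consecutive blocks through a four-way interleaved system.
    for addr in range(0, 8 * BLOCK_BYTES, BLOCK_BYTES):
        print(hex(addr), "-> module", interleave_target(addr, 4))
```

Under this assumed mapping, a sequential stream of 32-byte requests spreads evenly over the modules, which is the behavior the linear bandwidth scaling depends on.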
The MS62A memory modules may be interleaved two, four, or eight ways. The interleave factor is automatically determined by the system upon power-up or system initialization. However, designers have given the user the ability to manually specify the interleave characteristics of the memory subsystem. Up to eight MS62A memory modules may be configured in a VAX 6200 system.

Figure 4 VAX 6200 Memory Module (MS62A) Block Diagram
(Blocks shown: four 8MB banks, each 64 bits x 1M words with 8-bit ECC, a memory controller gate array with lock table, data queue, and command queue, and the XMI interface chips connecting to the XMI.)

I/O Channels

The VAX 6200 system uses the VAXBI bus as the interconnect for all I/O devices. The system interface to the VAXBI is a two-module set called the DWMBA. Figure 5 shows a block diagram of the DWMBA modules. The DWMBA/A module is connected to the XMI, and the DWMBA/B module is connected to the VAXBI. These two modules are interconnected with a 120-wire cable assembly which may be up to 4.6 meters (15 feet) long.

The DWMBA allows VAXBI devices to read system memory at up to 5.5MB per second and to write system memory at up to 13.3MB per second. Any VAXBI-compatible device may be connected to the VAX 6200 systems through the DWMBA. All VAX 6200 systems contain a minimum of two VAXBI channels and may optionally contain up to six VAXBI channels.


System Interconnect, the XMI

The XMI, the primary electrical interconnect in the VAX 6200 family of computer systems, encompasses

• The protocol observed by a node on the XMI

• The electrical environment of the XMI

• The backplane

• The logic used to implement the protocol

The XMI can support multiple processors, multiple memory subsystems, and multiple I/O channel adapters.

The XMI has 30 address bits, and the smallest addressable entity is a single byte. XMI address space is divided into two halves by bit 29 of the address. When bit 29 equals zero, an address is said to fall into memory space. When bit 29 equals one, the address is said to fall within I/O space. This arrangement matches the maximum physical address as defined by the VAX architecture and allows up to 512MB of physical memory to be addressed. The XMI architecturally allows up to 16 nodes, but is physically and electrically constrained to 14 nodes.

XMI nodes may be classified as commanders or responders, depending on their role in a given transaction. A commander is a node that is initiating an XMI transaction. A responder is the node that must act upon the transaction. A processor node usually acts as a commander. (However, a processor node may become a responder if another node reads a control/status register on the CPU.) Memory nodes, on the other hand, are always responders since they cannot initiate an XMI transaction. I/O nodes may act as either commanders or responders, depending on the type of I/O operation. The functions of these nodes are further explained in sections below.

Because the XMI is a pended interconnect, several transactions can be in progress simultaneously. When an XMI commander initiates a request for a read or to solicit an interrupt vector, an identifier code is also transmitted to the selected responder. At this point, control of the XMI is relinquished, and other transactions are allowed to take place while the responder fetches the requested read data or interrupt vector. The responder then arbitrates for control of the XMI and returns the requested data or vector along with the identifier code. By monitoring the identifier codes, the initial commander is able to receive the correct data and continue.

Arbitration and data transfers occur simultaneously over a multiplexed set of address and data lines, and a separate set of arbitration lines. The XMI supports quadword, octaword, and hexword reads to memory, as well as quadword and octaword memory writes. In addition, the XMI supports longword-length read and write operations to I/O space. These longword operations implement byte and word modes required by certain I/O devices.

Figure 5 VAX 6200 VAXBI Channel Adapter Block Diagram
(Blocks shown: on the VAXBI interface module, the BIIC, a gate array, and IBUS transceivers connecting to the VAXBI; a bus cable up to 15 feet in length; on the XMI interface module, IBUS transceivers, a gate array, and the XMI interface chips connecting to the XMI.)
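The split of the 30-bit XMI address space by bit 29 can be sketched in a few lines. The function and constant names here are hypothetical and chosen only for illustration; the bit positions and the 512MB figure come from the text.

```python
ADDRESS_BITS = 30   # width of an XMI address, per the text
IO_SPACE_BIT = 29   # bit 29 selects I/O space when set

# Half of the 30-bit space is memory space: 2**29 bytes.
MEMORY_SPACE_BYTES = 1 << IO_SPACE_BIT


def classify_address(address: int) -> str:
    """Return "memory" or "io" for a 30-bit XMI byte address."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 30 bits")
    return "io" if (address >> IO_SPACE_BIT) & 1 else "memory"


if __name__ == "__main__":
    print(MEMORY_SPACE_BYTES // (1024 * 1024), "MB of memory space")
    print(classify_address(0x1FFFFFFF))  # highest memory-space address
    print(classify_address(0x20000000))  # lowest I/O-space address
```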


The XMI multiplexes data and address information onto the 64-bit data path. Data transactions are initiated with a "command and address" cycle, followed by multiple data cycles. The maximum length for an XMI transaction is 32 bytes of data. The XMI cycle time is 64 ns. The effective bandwidth of the XMI is a function of the data transfer size, as shown in Table 2.

The XMI architecture allows for three distinct classes of devices.

Table 2 XMI Bandwidth Based on Transaction Size

Transaction Size    Interconnect Bandwidth
in Bytes            in MB/second
4                   31.25
8                   62.50
16                  83.33
32                  100.00
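The figures in Table 2 can be reproduced with a short calculation. The sketch below assumes one command/address cycle followed by one data cycle for each 8 bytes (or fraction thereof) on the 64-bit path, and takes 1MB as 10^6 bytes; the text does not spell out this cycle count, but it is the assumption under which the tabulated values come out exactly. The function name is hypothetical.

```python
CYCLE_NS = 64          # XMI cycle time stated in the text
DATA_PATH_BYTES = 8    # 64-bit data path


def effective_bandwidth(transfer_bytes: int, cycle_ns: int = CYCLE_NS) -> float:
    """Effective bandwidth in MB/second for one transaction (assumed model)."""
    # Ceiling division: a 4-byte transfer still occupies one data cycle.
    data_cycles = -(-transfer_bytes // DATA_PATH_BYTES)
    total_cycles = 1 + data_cycles  # one command/address cycle, then data
    # bytes per nanosecond times 1000 gives MB/second (1MB = 10**6 bytes)
    return transfer_bytes * 1000 / (total_cycles * cycle_ns)


if __name__ == "__main__":
    for size in (4, 8, 16, 32):
        print(f"{size:>2} bytes: {effective_bandwidth(size):.2f} MB/second")
```

The overhead of the command/address cycle is amortized over more data cycles as the transaction grows, which is why the 32-byte transfer reaches the highest effective bandwidth.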

Processor Nodes

Each processor node contains a CPU that executes instructions and manipulates data contained in XMI memory. The processor node can execute any instruction set compatible with the VAX-style byte addressing and memory locking mechanisms. A processor node will have a cache that must force all written data back to main memory. Any cached processor module must also monitor write traffic on the XMI and invalidate any location in its own cache that is written into main memory. Processor nodes must also be capable of responding to interrupt requests generated either by other processors or by I/O nodes.

I/O Nodes

I/O nodes generally respond to I/O space references either by mapping the data onto another bus or by interpreting data as a command. An I/O node can also become a commander on the XMI and access global XMI memory. I/O nodes may generate interrupt sequences directed toward processor nodes. However, I/O nodes do not respond to commands directed toward memory space.

Memory Nodes

Memory nodes act only as responders on the XMI. They respond to read and write requests directed toward memory address space. These requests are generated either by processor or I/O nodes.

Data Integrity

The XMI contains a number of features to enhance the integrity and reliability of the interconnect. First, all XMI information transfer lines are parity protected, and XMI command confirmation signals are ECC protected. The XMI protocol is sufficiently robust to permit detection and recovery of all single-bit error conditions on these signals. Additionally, the XMI defines time-out conditions that may be used to detect and diagnose failures.

VAX Console

The VAX 6200 system implements the standard VAX console functionality by means of software that conditionally executes on each of the KA62A CPU modules. Each KA62A CPU module contains a serial-line interface, 256KB of read-only memory (ROM), 32KB of electronically erasable ROM (EEROM), and 512 bytes of RAM. Control is passed to the console software upon any one of the following occurrences:

• System power-up

• Initialization

• Receipt of a control-P character from the console terminal

• Execution of the HALT instruction

• Some severe error conditions

Each KA62A CPU has access to console terminal transmit-and-receive lines carried on the system backplane. Upon power-up, control of the system console terminal is dynamically allocated to one of the CPUs present in the system. This CPU, known as the "boot" processor, provides the system interface to the console terminal as well as to the switches and lights located on the system control panel.

On receiving commands from the console terminal, the boot processor may run diagnostics or boot an operating system. This processor communicates with other processors by means of a structure maintained in memory known as the console communications area (CCA).


Also considered as part of the console subsystem, a TK50 tape drive is included in each VAX 6200 system. The tape drive is connected to the system by means of a TBK50 controller module located on a VAXBI I/O channel and is used for the following purposes:

• Saving all volatile parameters for the console subsystem

• Loading the VAX Diagnostic Supervisor (VDS) when no disk is available or functional in the system

• Distributing operating system and layered software

The TK50 tape drive is also available under operating system control as a general-purpose data interchange device.

Built-in Self-test

Extensive built-in self-test is used by all modules contained within the VAX 6200 systems. Upon power-up, all modules within the system, with the exception of the DWMBA, perform a self-test in parallel. After self-test is complete, the CPU modules examine each other's status; the one in the lowest slot number that passed self-test is selected as the boot processor. The boot processor then continues to execute an additional test to ensure memory accessibility and finally executes a test of the DWMBA.
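The boot processor selection rule described above (lowest slot number among the CPUs that passed self-test) can be sketched as follows. The function name and the dictionary representation of self-test results are hypothetical; the text does not describe how the CPU modules actually exchange status.

```python
from typing import Mapping, Optional


def select_boot_processor(self_test_passed: Mapping[int, bool]) -> Optional[int]:
    """Pick the boot processor from a map of CPU slot number -> self-test result.

    Returns the lowest slot number whose CPU passed self-test, or None if
    no CPU passed.
    """
    passing = [slot for slot, passed in self_test_passed.items() if passed]
    return min(passing) if passing else None


if __name__ == "__main__":
    # The CPU in slot 1 failed, so the next lowest passing slot is chosen.
    print(select_boot_processor({1: False, 2: True, 3: True}))
```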

Physical Packaging

All VAX 6200 systems are housed in the same cabinet, which is 78 cm (30.5 inches) wide by 154 cm (60.5 inches) tall by 76 cm (30 inches) deep. The cabinet contains one 14-slot XMI backplane, two 6-slot VAXBI backplanes, and all necessary power and cooling to sustain a wide range of configurations. Figure 6 shows a VAX 6240 with the front door removed.

Figure 6 VAX 6240 System, Front Door Removed

The XMI is physically implemented in a 14-slot backplane assembly containing ZIF module connectors, signal terminating networks, and a centralized clock and arbitration system. Modules are located on 2 cm (0.8 inch) centers. The XMI backplane is supplied with +5 volts (V) for general logic, a separate +5 V supply for memory, ±12 V for the console terminal line drivers, and -5.2 V/-2 V for emitter-coupled logic (ECL). Presently none of the VAX 6200 XMI modules utilizes the ECL voltages, but ECL is included for potential future use.

The VAX 6200 systems all contain two 6-slot VAXBI backplanes, which are configured as independent channels. The first slot of each VAXBI backplane is occupied by the DWMBA/B module, leaving 5 slots for standard VAXBI interfaces. All systems contain a DEBNK TK50 tape controller and a DEBNA Ethernet controller as standard equipment. The two VAXBI backplanes are supplied with +5 V, ±12 V, -5.2 V, and -2 V.

Summary

The VAX 6200 family of systems merges the CMOS VLSI VAX chip, which is used in a number of Digital's products, with a very high performance memory and I/O subsystem. This hardware, combined with the new fully symmetric multiprocessing capabilities of VMS version 5.0, allows very high system throughput previously achievable only with ECL technology. Moreover, the extensive use of CMOS technology results in physically smaller systems. These smaller systems consume less power and are more reliable due to the lower component count and lower power consumption.

References

1. P. Wade, "The VAXBI Bus - A Randomly Configurable Design," Digital Technical Journal (February 1987): 81-87.

2. T. Fox, P. Gronowski, A. Jain, B. Leary, and D. Miner, "The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor," Digital Technical Journal (August 1988, this issue): 95-108.

Brian R. Allison

The Architectural Definition Process of the VAX 6200 Family

The architectural definition of Digital's VAX 6200 family was governed by a twofold goal: to build a system with higher throughput than previous CMOS, Q-bus-based systems at a cost lower than ECL-based systems. Decisions made during the definition process were influenced by firm schedule guidelines. Further, the very nature of the multiple processor system imposed its own requirements, particularly in the definition of the XMI bus. This new 64-bit-wide interconnect is specifically designed to meet the memory and I/O needs of the symmetric multiprocessor system. Throughout the architectural definition process, engineers continually evaluated the interdependency of one design decision upon another and against the project and schedule goals. By this process, the total definition of the system - the XMI bus, the processor module, memory module, console subsystem, and packaging - was achieved.

Definition of the VAX 6200 family of systems began in March 1985. The engineers' intent was to design a follow-on product to the VAX 8200/8300 family of systems, still in development at that time. This paper discusses the system architectural definition process that took place during 1985.

Like the VAX 8200/8300 family before it, the VAX 6200 family provides a system environment for a VLSI VAX chip set. This new family of systems is a mid-range VAX implementation. In this context, a mid-range system is defined as a product with more capability than the Q-bus-based systems and less capability than the emitter-coupled logic (ECL) based systems.

Project Goals

The primary goal of the VAX 6200 program was twofold: to build a system with greater system throughput than the CMOS, Q-bus-based VAX systems, and to ensure system cost was lower than that of high-performance ECL-based systems. Designers would achieve this goal by designing a system architecture that allows a moderate number of low cost CMOS VAX microprocessors to share a common system environment. Such an efficient multiprocessor system environment would offer higher throughput for a large number of applications and at a cost lower than a high-performance single processor.

Once the decision to build a multiprocessor was made, the next question was how many processors to include. Several small computer manufacturers were building 8- to 32-processor systems at the time. Our belief was that the market for systems with numerous processors was fairly small because few applications would run efficiently on these systems. Therefore, we decided to design the VAX 6200 as a 4-processor system, with the possibility of expansion to 8 processors. This arrangement would allow us to still configure cost-effective 1- to 2-processor systems. If we found a significant number of applications could benefit from the larger number of processors, we could expand to 8 processors.

Building an efficient multiprocessor system would necessitate optimization of both hardware and software functionality. The VMS asymmetric multiprocessing code (VMS versions 2 through 4) that supported the VAX-11/782, VAX 8300, and VAX 8800 systems worked well for compute-bound, dual-processor systems. However, asymmetric operating system software would not be acceptable for larger scale multiprocessors. In the existing VMS asymmetric multiprocessing design, most operating system code was executed on the processor designated as the "primary" processor. Whenever a process needed to perform I/O or invoke most of the VMS system services, the process would have to be scheduled


on the primary processor. The task of making VMS more symmetric in its handling of I/O and VMS system services was undertaken to support the VAX 8840 and the VAX 6200 families. Discussion of how we chose to optimize the VAX 6200 hardware begins in the section The System Interconnect.

Schedule

In March 1985 the design of the CVAX chips was already well under way. These chips would be delivered in time to allow systems to ship in late 1987. Based on the CVAX chip set schedule, we established the following schedule for the development of the VAX 6200 system:

Six months of architectural definition

Twelve months of design/simulation

Three months to build and test approximately five first-pass prototypes

Six months to build approximately 70 second-pass prototypes

Three months for final testing and manufacturing introduction

This two and a half year schedule significantly influenced the definition of the system architecture as well as the selection of implementation technologies. (Actual implementation took three years. The design/simulation phase took three months longer than expected, and the first-pass prototype phase took three months longer than expected.)

The System Interconnect

The first order of business was to define a new system interconnect. This interconnect would have the bandwidth required to support the memory and I/O needs of the multiple processors. We outlined three requirements that would affect the design of the new system interconnect.

• We estimated that each CVAX processor would require between 3 megabytes (MB) and 6MB per second of data to/from memory. This rate would depend on the clock rate of the processor, the selected cache architecture, and the cache "hit" rate of the program being executed.

• We also estimated that each processor could require peaks of 1MB to 1.5MB per second of I/O bandwidth.

• To maintain predictable memory access time, we decided that the bus should not be run over 75 percent utilized.¹

Using the worst-case anticipated bandwidth needs, 80MB per second of peak bus bandwidth would be required to support 8 processors.

Because of the tight schedule and our awareness of the significant amount of time needed to design a new system bus, we first looked into the feasibility of using an existing bus. We considered but rejected the existing VAXBI bus, the primary interconnect for the VAX 8200/8300 system, because of its limited 13.3MB per second bandwidth. We also rejected the NMI bus, the VAX 8500/8700/8800 family interconnect, because this bus uses ECL technology. At one point we even considered using the SBI from the VAX-11/780 system with a 64-bit data path instead of its existing 32-bit data path. After extensive analysis, however, we decided a new system bus would have to be engineered for the product to meet its goals.

Although we would have to define a new bus for processor-to-memory communications, the schedule did not allow us to design a full complement of I/O interfaces for the new bus. Since a large number of I/O interfaces would be available on the VAXBI, the design team decided to use the VAXBI as the interconnect to all I/O devices. The new system interconnect, the XMI, would be used only to connect processors, memories, and VAXBI channel adapters. Therefore, in addition to the requirements listed above, the XMI architecture would also allow multiple VAXBI channel adapters to optimize I/O throughput where necessary for large systems. Use of the VAXBI for I/O adapters also had the positive effect of minimizing the number of electrical interconnects to the XMI; the physical length of the XMI would consequently be shorter and the total capacitance lower. Further discussion of the channel adapters is presented in the section VAXBI Channel Adapters.

In June 1985 a team of 11 senior-level engineers was assembled to produce the architectural and electrical specification for the XMI bus and the VAX 6200 system. In addition to architectural and electrical experts, this team included one representative from each of the anticipated module design teams. Almost all members had previously worked on projects involving the VAXBI bus. It was understood that the XMI would be


used solely for the VAX 6200 family of systems, unlike the VAXBI, which would be used across many different applications. A strict adherence to this premise greatly helped the specification team to put technical trade-offs in perspective.

XMI Electrical Interface Definition

Since most of the VAX 6200 system is CMOS and transistor-transistor logic (TTL) based, we immediately decided the XMI could not be implemented in ECL. To maintain a TTL-level bus and to achieve the desired bandwidth, the data path clearly would have to be 64 bits wide. Further, to meet our goal of 80MB per second bandwidth, the XMI would have to transfer 64 bits of information every 80 nanoseconds (ns). (This transfer rate assumes a protocol in which address and data are multiplexed, and up to 32 bytes of data can be transferred per address cycle.)

Several electrical alternatives were considered for the XMI. A scheme using the commercially available FutureBus components was seriously considered. However, we rejected this scheme because a large number of components would be necessary to implement the 64-bit data path.

The lack of commercially available components to drive a 64-bit bus at the required speed finally led us to a decision. We would design a bit-sliced custom CMOS bus interface chip set. Each chip would transceive 11 lines, and seven chips would be used for the entire data path. Although the "sliced" bus interface would use more module real estate than a larger chip, the sliced bus design greatly simplified the chip packaging problems. Each chip would fit into a standard 44-pin cerquad package. A sliced XMI interface also allows each chip to dissipate under 0.5 watt (W), which enhances reliability and relieves the need for heat sinks on the part. Without heat sinks, the XMI interface parts can be mounted on both sides of each module. This arrangement saves 50 percent of the real estate necessary to interface to the XMI.

To simplify the design of the full custom XMI interface parts, we would keep the functional requirements for the parts as simple as possible. The XMI interface chips have little knowledge of the XMI protocol and serve only as the electrical interface. Due to the divergent needs of processor, memory, and I/O interfaces, designers already knew that each module would need a different VLSI chip for XMI interface functions. We decided, therefore, that each module VLSI chip would be required to supply the logic to implement the bus-level protocol.

As the electrical design of the XMI progressed, a bus cycle as fast as 64 ns appeared feasible. Although not entirely necessary to support the stated system performance goals, the faster XMI cycle time was strongly pursued to gain extra margin in the system design. Furthermore, this fast cycle time would allow the possibility of system upgrades to faster processors in the future. Consequently, 64 ns became the stated goal for the XMI cycle time; 80 ns was the fallback strategy if the design complexity of a 64-ns cycle time began to place the overall project schedule at risk.

Logic design across the entire system was done assuming a 64-ns cycle time. Eventually 64 ns became the actual speed of the bus as the CMOS process was characterized and the first parts were sampled and found to contain sufficient margin to support the faster cycle time.

XMI Protocol Definition

XMI protocol definition took place in parallel with the electrical definition of the bus. It was clear from the start that the bus would cycle several times faster than the memory subsystem. This difference in cycle times immediately led us to the decision that the XMI would run a "pended" bus protocol. A pended bus protocol allows control of the XMI to be relinquished between a "read" command and the return of the data from the memory subsystem. With multiple processors and multiple memory controllers, several read commands could be outstanding at a time.

To optimize data traffic on the XMI bus, we needed to define data transfer commands of several lengths. Since VAX instructions may write as little as 1 byte of data, a 64-bit write command was defined. (There is a mask field associated with the write command that allows single bytes to be written.) Since the VAXBI bus already had commands to transfer 16 bytes of data per address, it was essential to allow similar commands on the XMI bus to minimize the interface complexities to the VAXBI. Eventually we added a 32-byte read command to allow processors to prefetch larger amounts of data upon cache misses. A 32-byte write command was not implemented, because it would be too great a burden for the memory controller to buffer multiple 32-byte write commands.


In many cases the protocol of the XMI is similar to that of the VAXBI. In part this similarity resulted because the designers of the XMI were very familiar with the VAXBI. The similarity between protocols was also deliberately chosen because it greatly reduced the complexity of interfacing the XMI to the VAXBI for I/O purposes.

The bus arbitration scheme is one area where designers had to deviate from the method used by the VAXBI bus. The VAXBI uses the main bus data path for arbitration, which requires extra bus cycles. This approach was not feasible for the XMI pended protocol, since two arbitrations are necessary for each read transaction. Further, the VAXBI arbitration scheme also requires a great deal of duplicated logic in every module. Due to the large number of allowable XMI nodes, it was not feasible to implement an arbitration mechanism located on an XMI module. To implement arbitration on any XMI module would have required a great number of signal pins. The solution was to implement a centralized arbiter. The XMI uses a module physically attached to the rear of the backplane as a centralized arbiter as well as the source of the master clock.

The subject of data integrity on the XMI was of great concern to the designers. Initially carrying error-checking and correction (ECC) bits on the bus was considered. However, this scheme was rejected because additional encode/decode timing would have been required, and because additional bits would have to be carried on the bus. Eventually a robust protocol was implemented based on parity detection and hardware/software retries when errors are detected. All transient single-bit errors on the XMI are recoverable.

XMI Physical Definition

The physical definition of the XMI was a difficult task. There were a great number of interdependent trade-offs for module size, module spacing, number of backplane slots, and cabinet size.

To minimize design complexity, we had decided at a very early stage that each module within the VAX 6200 system would implement a single function. Thus the task of designing each module was simplified and the diagnosability of the system enhanced. Initially, the size of the XMI module was largely governed by the space needs of the processor. Analysis showed that a processor based on the CVAX chip set could fit on a module the same size as the existing VAXBI module, 20.32 cm (8.0 inches) by 23.33 cm (9.187 inches). In addition, 32MB of memory could fit on the same size module.

System packaging was another factor to consider in selecting the module size. From the very start of the VAX 6200 program, it was not clear what type of system-level packaging was optimal. Designers knew, however, that the larger systems would primarily be placed in computer-room environments. For these applications, a standard 153.67-cm (60.5-inch) tall cabinet would be necessary. What was not clear was whether office-type packaging or rack-mount-type packaging would be required. Since VAXBI form-factor pedestal and rack-mount box packages were both available, designers found it very attractive to use the same form-factor module for the XMI to ease the development of these packages if necessary. Based on the functionality fit and the desire to potentially reuse existing packaging, we decided to adopt the VAXBI module size for the XMI.

Another advantage to using the VAXBI module size was the opportunity to use the VAXBI zero insertion force (ZIF) backplane connector. Historically, developing new backplane connector technologies has proven difficult and time-consuming. The VAXBI uses a five-segment, 60-pin-per-segment connector. Of the 300 pins, 120 pins are assigned to the VAXBI signals and 180 pins to each module for I/O use. Since the XMI has 32 more data-path bits than the VAXBI, designers chose to allot an extra 60 pins for the XMI signals. This leaves 120 pins for general module use. Designers believed the arrangement to be acceptable, since there are no I/O modules for the XMI. The only use for the I/O pins is to connect to the VAXBI card cages. The 120 available pins are more than adequate for this function.

To meet the cycle time goals for the XMI bus, the length of the XMI would have to be limited to about 0.3 meters (12 inches) and the number of loads limited to approximately 16. The XMI protocol assumed a maximum of 16 devices would interface to the XMI bus. Eventually the number of slots in an XMI backplane became 14 for two different reasons. First, 14 slots would allow a system to have eight processors, four memory arrays, and two VAXBI channels. Second, a 14-slot XMI backplane would be very similar in size to the pair of 6-slot VAXBIs that already existed in the VAXBI pedestal and rack-mount box packages.
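The connector and slot budgets described above are simple arithmetic. The following minimal Python sketch restates them so the figures can be checked against one another; the variable names are illustrative only and are not taken from any XMI or VAXBI specification.

```python
# Pin budget for the five-segment, 60-pin-per-segment ZIF connector.
SEGMENTS = 5
PINS_PER_SEGMENT = 60
total_pins = SEGMENTS * PINS_PER_SEGMENT            # 300 pins in all

# VAXBI allocation quoted in the text.
vaxbi_signal_pins = 120
vaxbi_io_pins = total_pins - vaxbi_signal_pins      # pins left for module I/O

# The XMI allots an extra 60 pins to bus signals for its wider data path.
xmi_signal_pins = vaxbi_signal_pins + 60
xmi_io_pins = total_pins - xmi_signal_pins          # pins left for module use

# Slot budget for the largest configuration named in the text.
slot_budget = {"processors": 8, "memory_arrays": 4, "vaxbi_channels": 2}
total_slots = sum(slot_budget.values())

print(total_pins, vaxbi_io_pins, xmi_signal_pins, xmi_io_pins, total_slots)
```

The sketch only confirms that the quoted numbers are mutually consistent; it says nothing about how individual pins are assigned.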


XMI module spacing of 2.03 cm (0.8 inches) is the same as that on the VAXBI bus. We chose this spacing to allow for heat-sink components on side 1 of the module. Enough height would remain to allow non-heat-sink, surface-mounted components on side 2.

About 18 months into the program, the module designs were complete, and both the processor module and memory module were experiencing great difficulty during printed circuit board layout. Although all components could be placed within the area available, the very high pin-count gate arrays in use (223 pins) were causing considerable routing problems. To lower the schedule risk to the program, designers decided to lengthen the module by 7.62 cm (3 inches). The impact to the computer-room packaging was minimal because a 76-cm (30-inch) cabinet depth could accommodate the change. However, the change in module length made impossible the adaptation of the existing VAXBI pedestal and rack-mount packaging to the XMI. At this time the pedestal-based strategy for the MicroVAX 3500/3600 systems was clear, thus reducing the need to package the VAX 6200 family of systems for office use. Further, extremely low sales of rack-mounted VAX 8200/8300 systems led us to the decision that a rack-mount package was not immediately necessary.

XMI Interface Technology

The decision to implement the XMI electrical interface in simple full custom CMOS parts dictated that each module have additional logic to complete the XMI interface and to supply module-specific logic. To simplify both the design of the XMI interface parts and the CAD tools, we decided that all module-to-XMI interfaces would be implemented in the same technology. Given the aggressive design schedule, we would need a technology that was mature as well as easy to design for.

We initially focused on a family of 2-micron CMOS gate arrays available from LSI Logic and Toshiba. However, it quickly became clear that the array limitation of approximately 10,000 gates would force us to place multiple chips on each module. The use of multiple chips was highly undesirable from the perspective of design resources, module real estate, and cost. A search was started to locate a suitable alternative. To get the desired logic density, several semicustom alternatives were explored but ultimately rejected because of the immaturity of their CAD tools.

Discussions with LSI Logic Corporation led us to consider their newly developed 1.5-micron "Sea of Gates" array, which offers up to 50,000 routable gates. Although this array did not give us the mature technology we were seeking, it did appear to offer the flexibility needed by all XMI designs. We ultimately chose the LSI Logic LL10000 family of gate arrays because all designs could use the same technology. Moreover, we could focus our CAD tool development on a single technology.

The 64-bit-wide XMI data path forced the pin count of a single interface chip to be 200-plus pins. The LSI Logic LL10000 array was offered in a 223-pin pin-grid-array (PGA) package which appeared suitable. Although most of the logic on each module was implemented in surface-mounted components, we did not pursue a 223-pin, surface-mount package. We wanted to avoid the manufacturing problems presented by components with 25-mil pitch leads.

The Processor Module

The VAX 6200 processor module uses the CVAX chip set to implement the VAX instruction set. Due to an uncertainty about the final CVAX chip speed, the CPU module was designed to operate over a range of 70 ns to 100 ns. The intent was to use "binned" parts in the VAX 6200 system, and to use the "nominal" parts in the MicroVAX 3500/3600 systems. (Chip manufacturing processes yield parts of different speeds; "binning" refers to the process of testing the chips over a range of speeds.) For the CVAX chip set, the nominal parts run at 90 ns, and the binned parts run at 80 ns.

A major system-wide architectural issue, which primarily affected the processor module, was whether the cache should be write-back or write-through. Although a write-back cache could potentially reduce the number of processor writes on the XMI by 50 percent, such a cache was complicated and had never before been designed for a multiprocessor VAX system. Our final decision was based on the need to reduce overall risk to the program. Therefore, we would implement the more straightforward write-through cache design and build the extra bandwidth into the XMI to handle the additional write traffic.
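To make the write-through versus write-back trade-off concrete, the toy Python model below counts bus writes for a short access trace under each policy. It is only a sketch under stated assumptions: a direct-mapped, one-word-per-line, write-allocate cache and a made-up trace, none of which comes from the VAX 6200 design. It does not attempt to reproduce the 50 percent estimate quoted above.

```python
def bus_writes_write_through(trace):
    """Write-through: every processor write is forwarded to the bus."""
    return sum(1 for op, _ in trace if op == "w")


def bus_writes_write_back(trace, num_lines=4):
    """Write-back: a write only marks the cache line dirty. A bus write
    happens when a dirty line is evicted, plus a final flush of any
    lines that are still dirty at the end of the trace."""
    tags = [None] * num_lines       # address currently held by each line
    dirty = [False] * num_lines
    bus_writes = 0
    for op, addr in trace:
        index = addr % num_lines    # direct-mapped placement (assumption)
        if tags[index] != addr:     # miss: replace whatever the line holds
            if dirty[index]:
                bus_writes += 1     # write the evicted dirty line back
            tags[index] = addr
            dirty[index] = False
        if op == "w":
            dirty[index] = True
    return bus_writes + sum(dirty)  # flush remaining dirty lines


# Hypothetical trace of (operation, word address) pairs.
trace = [("w", 0), ("w", 0), ("w", 1), ("r", 4), ("w", 1), ("w", 5)]
through = bus_writes_write_through(trace)
back = bus_writes_write_back(trace)
print(through, back)
```

Repeated writes to the same location are what the write-back policy absorbs; the price is the dirty-line bookkeeping that made such a cache the riskier choice for a multiprocessor.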


Once the decision to implement a write-through cache was made, the major architectural issue for the processor module became the cache organization. The CVAX chip contains an internal 1-kilobyte (KB), two-way set associative cache accessible to the internal micro engine in one cycle. Due to the long latency to main memory, a second-level cache on the processor module was imperative. The size of the second-level cache was determined by the available static random-access memory (SRAM) technology. The newly available high-speed 64K-by-4 SRAMs would provide a 256KB cache with only eight parts. Although no accurate simulation was available to indicate the effect of this large cache, the effects were assumed to be positive. Therefore we decided that the higher cost of the SRAMs was a worthwhile trade-off given the potential gains in system performance.

A third major issue relative to the caches on the processor module was the invalidation scheme. In the past, VAX processors have managed cache invalidation, since processors and I/O devices have always shared a common memory subsystem. The issue of cache invalidation became much more important to our program because of the multiprocessor nature of the VAX 6200 system. This type of system could cause large amounts of stale data as a process migrates from processor to processor.

The 1KB cache contained within the CVAX chip caused the largest problem. If it were allowed to cache data that could become stale, every write to memory would potentially have to be invalidated within the CVAX cache. This meant choosing one of two approaches: (1) broadcasting every write in the system onto the CDAL bus of every processor, or (2) finding a way to maintain a duplicate tag store of tags within the CVAX chip and only passing writes known to reference cached data within the CVAX onto the CDAL. Another alternative was to cache only instruction-stream (I-stream) data within the internal cache. This alleviates the need to invalidate, because I-stream data is defined to be read-only by the VAX architecture. We projected this alternative could cause a 3 to 5 percent degradation in CPU performance.

Analysis of the cache-invalidate problem proved very difficult, because we did not know what percentage of data would be shared in this class of multiprocessor system. With the potential for eight processors, it was clear that all writes could not be broadcast into each of the CVAX chips. The possibility of maintaining a duplicate external tag store proved to be very difficult to implement. Consequently, we chose the alternative to store only I-stream data within the internal CVAX cache.

A similar problem was knowing when to invalidate data in the external cache. In this case it was feasible to implement a duplicate tag store. The second-level cache has two tag stores. One is located on the CDAL and is used for cache look-up by the CVAX chip. The second tag store is located within the XMI interface and is used to determine if XMI writes hit the second-level cache. When hits are detected, a request is queued to invalidate the entry within the second-level cache.

Another problem to be solved on the processor module was the issue of combining writes into larger blocks before issuing them to the XMI. Since the CDAL data path is only 32 bits wide, the CVAX chip is incapable of generating a write command any larger than 32 bits. The 64-bit data path of the XMI would need larger writes to operate efficiently. The solution to this problem was to implement a "write buffer" in the XMI interface of the processor module. The write buffer takes advantage of the fact that writes generated by VAX processors are often sequential. The write buffer will buffer up to four sequential 32-bit writes and combine them into a single XMI write transaction.[2]

The Memory Module

The system design goal was to provide the capacity for 15MB to 30MB of memory per processor. As mentioned earlier, the module size was partially governed by the need for 32MB of memory per memory module. The number of slots in the XMI backplane was also partially determined by the desired amount of system memory.

The wide range of possible VAX 6200 configurations dictated the need for an expandable memory subsystem. Since full memory bandwidth would only be necessary for very large configurations, it was decided to adopt a distributed memory architecture. An individual memory controller could be made simpler if it did not have to supply full XMI bandwidth. Full XMI bandwidth could be achieved by interleaving multiple memory controllers.

With the module size and number of slots determined, the first architectural decision to be made for the memory was the internal organization of the memory. The 64-bit width of the XMI made it desirable to have a 64-bit data path internal to the memory module. The very tight module real estate made it very attractive to consider implementing a 64-bit data path to reduce the number of required ECC check bits. A 64-bit-wide data path was also attractive given that the processor module would issue a read for 32 bytes whenever there was a cache miss.

The negative side of a 64-bit internal memory organization was that any write of less than 64 bits in width would result in a read-modify-write operation to calculate the proper ECC code. An analysis of the expected write traffic through the processor's write buffer showed that approximately 50 percent of all writes would be a full 64 bits in width. Further analysis showed that as long as there was at least one memory controller for every two processors, there would be sufficient memory bandwidth for the system. Given the performance characteristics of the CVAX processor, it seemed reasonable to require a 32MB memory array for every two processors. We therefore decided to implement the 64-bit memory internal organization.

Since it was very difficult to design the memory module to accommodate the full bandwidth of the XMI, designers used memory interleaving to provide an aggregate memory bandwidth compatible with the speed of the XMI bus. The interleave size of 32 bytes was determined by the protocol of the XMI, which allows reads of 32 bytes per address cycle.

The multiprocessing design of the system made it possible for a single memory controller to be the object of several simultaneous requests. To avoid rejection of processor traffic, we designed the memory controller with an input queue. This queue accepts memory access requests and services them in a first-in, first-out (FIFO) order.

Initially the memory controller was designed with a four-command queue that would reject new requests once the queue was full. As the design progressed, we realized that with our XMI arbitration scheme, a processor or VAXBI channel adapter might possibly be denied memory access. A processor or channel adapter might be denied access for indeterminate periods of time if the memory array was allowed to reject commands when its queue became full. To avoid this problem, the memory array was allowed to assert a signal on the XMI that would inhibit all new commands from being issued on the XMI. Unfortunately, due to the pipelined nature of the processor and the memory array, three additional commands could possibly be received by the memory controller after it had determined the need to stop additional requests. Since the depth of the command queue was four, the memory array would need to "stall" the bus after receiving only a single command. Since this effectively eliminated the command queue, we decided to lengthen the depth of the command queue to eight entries.

The VAX architecture forces the use of a hardware-based memory lock to control access to shared data structures. The memory lock is used by some intelligent I/O adapters as well as processors.

System performance suffers when there is conflict over different lock variables that acquire a common hardware lock. Given that Digital had never built a fully symmetric multiprocessor system and that major changes were being made to VMS, we did not know what the lock traffic pattern would look like in a large system. We did know, however, that the existing VAXBI I/O adapters and the CVAX processor could not hold more than a single hardware lock at one time. Based on this, we designed the memory controller to have up to 16 locked locations. This number seemed more than adequate given a maximum of eight processors and only three existing VAXBI I/O adapters that use memory locks (, CI, and TK50). The granularity of each lock is 32 bytes to simplify the memory controllers' handling of 32-byte read requests. Lock congestion is still possible if multiple lock variables are allocated within the same 32-byte region of memory. An examination of VMS code shows lock congestion to be very rare.

VAXBI Channel Adapter

We had decided right from the start to design an XMI-to-VAXBI channel adapter to handle all I/O. To meet the desired maximum I/O rates of 1MB to 1.5MB per second for each processor, we would include multiple XMI-to-VAXBI adapters. Although two VAXBI channels would allow 1.5MB per second per processor, it was decided to allow up to eight VAXBI channels to be connected to the VAX 6200 system. The design was not made more complex by the change from two to eight VAXBI channel adapters.
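The write buffer described for the processor module combines up to four sequential 32-bit writes into one XMI write transaction. The Python sketch below illustrates that behavior in the simplest possible form. The class name, the flush policy (flush when the next write is not sequential or when four words are already held), and the transaction representation are assumptions made for illustration; the text does not describe the actual hardware's alignment or flush rules.

```python
class WriteBuffer:
    """Toy write-combining buffer: merges up to four sequential
    32-bit (4-byte) writes into one outgoing bus transaction."""

    MAX_WORDS = 4
    WORD_BYTES = 4

    def __init__(self):
        self.base = None          # address of the first buffered word
        self.words = []           # buffered 32-bit data values
        self.transactions = []    # (base_address, [words]) sent to the bus

    def write(self, addr, data):
        # Address the next write must have to count as sequential.
        next_addr = None
        if self.words:
            next_addr = self.base + len(self.words) * self.WORD_BYTES
        # Flush first if the write is not sequential or the buffer is full.
        if self.words and (addr != next_addr or len(self.words) == self.MAX_WORDS):
            self.flush()
        if not self.words:
            self.base = addr
        self.words.append(data)

    def flush(self):
        """Emit whatever is buffered as a single bus transaction."""
        if self.words:
            self.transactions.append((self.base, list(self.words)))
        self.base, self.words = None, []


buf = WriteBuffer()
for value, addr in enumerate([0x100, 0x104, 0x108, 0x10C, 0x110, 0x200], start=1):
    buf.write(addr, value)
buf.flush()
print(buf.transactions)
```

In this hypothetical trace, six processor writes leave the buffer as three bus transactions, the first of which carries four combined words.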

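Three of the memory-module rules described above reduce to small calculations: which controller serves an address under 32-byte interleaving, how early a controller must assert its stall signal, and when two lock variables fall under the same hardware lock. The sketch below expresses them in Python. The function names are hypothetical, and the modulo mapping of 32-byte blocks to controllers is an assumption; the text gives only the interleave size, not the address decoding.

```python
BLOCK_BYTES = 32   # interleave size and lock granularity given in the text


def controller_for(addr, num_controllers):
    """Assumed mapping: consecutive 32-byte blocks rotate across controllers."""
    return (addr // BLOCK_BYTES) % num_controllers


def stall_threshold(queue_depth, commands_in_flight=3):
    """Queue occupancy at which the stall signal must be asserted so that
    the commands already in the pipeline (three, per the text) still fit."""
    return queue_depth - commands_in_flight


def share_hardware_lock(addr_a, addr_b):
    """Two lock variables contend if they lie in the same 32-byte region."""
    return addr_a // BLOCK_BYTES == addr_b // BLOCK_BYTES


print(controller_for(0x40, 4))             # third 32-byte block
print(stall_threshold(4), stall_threshold(8))
print(share_hardware_lock(0x1000, 0x101C), share_hardware_lock(0x1000, 0x1020))
```

With a four-entry queue the threshold is a single command, which matches the observation that the queue was effectively eliminated. The corresponding figure for an eight-entry queue follows from the same reasoning and is not stated in the text.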

Designers wanted to optimize data transfer from the VAXBI into XMI memory, since statistically more data is read from I/O devices than is written. A double-buffered direct memory access (DMA) data path from the VAXBI to the XMI allows transfers at the full 13.3MB per second VAXBI data rate.

For reads of XMI memory, it was known that full bandwidth could not be maintained due to the memory read latency through the VAXBI channel adapter and the XMI memory subsystem. Since most I/O transfers are sequential, we considered building prefetch buffers into the VAXBI channel adapter. Because transfers could be in progress to several VAXBI nodes at once, multiple prefetch buffers would be needed. Since prefetch buffers can architecturally be considered to be small caches, the VAXBI channel adapter would also have to monitor all XMI traffic for potential invalidate conditions. Eventually the need for large amounts of buffer storage and the complication of XMI monitoring decided us against building prefetch buffers. This decision was influenced by other factors as well. No single existing VAXBI I/O adapter could read at full bandwidth, and multiple I/O devices could be spread across several VAXBI channels to achieve a higher aggregate XMI read bandwidth.

To ease physical implementation, the VAXBI channel adapter was implemented on two modules that were interconnected by four 60-pin cables between their I/O pins. Unlike the VAX 8500/8700/8800 VAXBI channel adapter, the VAX 6200 could not use a single XMI module to connect to multiple VAXBI buses. The 6200 is restricted by the 120 I/O pins available on an XMI module.

Console Subsystem

The console function in a VAX 6200 system is performed by code run on the CVAX CPUs. This use of the main CPU-based console contrasts with the more traditional use of a dedicated front-end processor, which has access to all system resources. We chose to use a main CPU-based console primarily because we had no way to externally access the internal state of the CVAX processor. Furthermore, we did not want to add the cost of a dedicated console processor.

A side benefit to a system design that employs multiple processors, memories, and I/O adapters is the opportunity to design in extra availability by reconfiguring the system in the event of a single component failure. To accommodate reconfiguration, all processors would have to be allowed access to the physical console terminal as well as the physical front control panel of the system. This access is accomplished by busing the signals that interface to the console terminal and front control panel across the XMI. However, we needed a mechanism to ensure that only one of the processors would actually respond to the console terminal and front control panel. This mechanism is a protocol whereby the processor in the lowest XMI slot that passes self-test assumes control of these external resources. The processor that takes control of the console terminal is known as the "primary processor." The primary processor communicates with all other processors by means of a message passing protocol through system memory.

It is necessary for the console subsystem to have access to a mass storage device. Such access is needed for distribution of software and for loading of diagnostics. The TK50 was selected because of its high density and the availability of a preexisting VAXBI interface (the DEBNK). The TK70 was not used because there was no VAXBI interface to it. The only other alternative was the RX50, which has superior access time but a data capacity of only 400KB. The longer time to run diagnostics from the TK50 was unimportant since the system can be diagnosed largely by means of diagnostics contained in CPU read-only memory (ROM) and by the built-in self-test contained in all VAXBI I/O adapters. Further, the TK50 makes an excellent software distribution device and allows VAX 6200 systems to be configured without nine-track magtape drives.

In previous systems dependent on ROM-based console programs and ROM-based diagnostics, code updates have been a problem. To alleviate the need to physically change the ROM, each VAX 6200 processor contains a 32KB electronically erasable ROM (EEROM). Most console and diagnostic code is accessed by means of an address table contained in the EEROM. In the event that a code bug needs to be corrected, the address table is rewritten to point to a replacement routine that is also written into the EEROM. The console program implements a routine that can patch the EEROM image from a database distributed on TK50 tape.

Power and Packaging

As noted earlier, we projected that the VAX 6200 system would be used as a large system generally located in computer-room environments. An important goal was to design for a high degree of flexibility and configurability in the system. The decision to use a 14-slot XMI backplane had been based on desired maximum configurations and the size of existing pedestal and rack-mount packages.

In addition to housing the XMI backplane, the computer-room package would need to house VAXBI backplanes to accommodate I/O adapters. The VAXBI backplane is manufactured in cascadeable 6-slot segments. It seemed that two 6-slot VAXBI backplanes would provide adequate I/O adapters for most systems. To ensure that all customers' I/O requirements could be met, a design was also initiated for a VAXBI expander cabinet that could house four additional 6-slot VAXBI backplanes.

To avoid developing a new power subsystem, we looked into modifying an existing power system. Although we could find no preexisting perfect match, we did locate a previously designed 5 volt (V) regulator specified at 100-ampere (A) output. We respecified this design to 120 A by using slightly higher power components. The VAXBI requires ±12 V, -5.2 V, and -2 V in addition to the main +5 V channel. To accommodate the VAXBI requirements, a new regulator was designed. The XMI backplane is supplied with two of the 5 V regulators (one for main logic and one for memory). Although not required by any current designs, one of the ±12 V, -5.2 V, and -2 V regulators also supplies the XMI for potential future designs. The two VAXBI backplanes are supplied by one +5 V regulator and one of the ±12 V, -5.2 V, -2 V regulators.

Conclusion

The design of a complex system like the VAX 6200 is much more than making well-informed engineering decisions based on hard data. Engineers based the initial system definition on their perceptions of the needs for future computing systems. The definition was further shaped by what was technically feasible with a defined degree of risk. Throughout the architectural specification phase, many trade-offs were made with only partial data and the intuitive insight of very experienced engineers.

The design process for the VAX 6200 system was extremely smooth, and the product was designed within six months of the initial engineering goals. Due to the large degree of built-in configuration flexibility, the product definition never changed enough to force a change in direction during the design phase. Careful balancing of technical complexity with the necessary minimum functionality yielded an architecture that could be implemented with a manageable amount of risk in a bounded amount of time.

Acknowledgments

The initial VAX 6200 system definition and architectural group was made up of the following people: Brian Allison, Charlie Barker, Frank Bomba, Darrel Donaldson, Rick Gillett, Dave Hartwell, Dave Ives, Jim Stegeman, Pat Sullivan, Mike Uhler, and Doug Williams.

References

1. R. Gamache and K. Morse, "VMS Symmetric Multiprocessing," Digital Technical Journal (August 1988, this issue): 57-63.

2. R. Gillett, "Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus," Digital Technical Journal (August 1988, this issue): 28-46.

Richard B. Gillett, Jr.

Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus

The design decisions involved in interfacing a microprocessor (CVAX) to a high-speed, shared-memory multiprocessing bus (XMI) are more complex than those encountered in designing a single-processor system. Although the same basic interface architectures are used, the significantly different multiprocessing environment requires a much more complex implementation. In particular, the performance of a multiprocessor system is very dependent on the efficiency of its main memory interface. To achieve the desired system performance, appropriate compromises between design complexity and performance must be made. In the case of the VAX 6200 system, performance simulations made early in the project guided the complexity/performance trade-offs. Actual system performance results have largely confirmed the validity of the design trade-offs.

The primary goal of the VAX 6200 design was to provide a general-purpose, high-performance, mid-range VAX computing system. Further, this system design would take advantage of Digital's proprietary CMOS technology and VMS version 5.0 symmetric multiprocessing capabilities. VMS version 5.0 has dramatically changed the way we approach mid-range system design; no longer do we design a system to support just one or two processors. With the ability to effectively utilize the power of four or more processors within the same system came the need to design significantly higher performance interconnects to tie these processors together.

The VAX 6200 was to be Digital's first CMOS multiprocessor system. The designers were therefore strongly motivated to provide the best performing product they could within reasonable time and complexity constraints. Complexity was of particular concern since the product schedule did not allow for the production of second-pass parts prior to the first shipment to customers. Complex multiprocessor interfaces give ample opportunities for the kinds of elusive design bugs that can be very difficult and time-consuming to exercise and diagnose. In addition, unlike other recent VAX systems, the VAX 6200 system required a major new release of VMS. (In many ways the new release represented a new operating system.) We expected its availability could be the critical path to product shipment. The operating system software would probably not stabilize in time for us to discover and fix any major hardware problems and still stay on our original schedule. Unfortunately, until the operating system stabilizes, testing for complex bugs is difficult. This concern about complexity relative to the schedule affected several design decisions.

On the VAX 6200 CPU module, the design challenge was to interface a custom CMOS VAX microprocessor (called CVAX) to a high-speed multiprocessor bus (called XMI). The trade-offs made during the design of a multiprocessor system are more complex than those made in designing a single-processor system. For a single-processor system, the performance trade-offs are relatively straightforward. The goal is to design the highest performance single-processor system that is practically possible. For a multiprocessor system, the goal of maximum single-processor performance must be tempered to obtain maximum system throughput (i.e., multiprocessor performance).

The foundation of the CPU interface is the cache subsystem, which reduces the effective read access time to main memory. By reducing the processor's need to access main memory, a cache improves both single-processor and multiprocessor system performance. This paper discusses the complexities involved in choosing the optimal cache design and the simulation techniques used to ensure informed design decisions.

One of the biggest problems in cache design is choosing the correct set of workloads to characterize the cache performance. Cache performance can vary tremendously with different workloads. Therefore, we chose a set of workloads that spanned a wide range of system activities. Toward the end of this paper, we present actual cache performance results that largely confirm the legitimacy of our approach.

We also examine one of the more complex aspects of multiprocessor designs, which is ensuring cache coherency across the entire system. Cache coherency refers to the maintenance of a sufficiently consistent memory state from the perspective of all processors and I/O devices within the system.

The designers also went to great lengths to ensure maximum system reliability. As part of this effort, we generated a set of error-detection and response rules. These rules ensure that the operating system software can easily recover from almost all transient cache or bus failures. These rules are discussed.

The following section is an overview of the VAX 6200 system architecture. It provides a basis for the subsequent discussions on the challenges of multiprocessor design, the VAX 6200 CPU responses to those challenges, the performance simulation environment, cache coherency and error handling, and finally, real performance results.

Summary of VAX 6200 System Architecture

The basic architecture of the VAX 6200 system shown in Figure 1 is no different from architectures used on recent VAX systems.[1] The architecture most closely resembles that of the VAX 8800 series. Processors and memories reside on a single, high-speed interconnect called the XMI bus. All memory is shared and equally accessible by all processors. Adapters to the VAXBI bus also attach to the XMI. I/O devices, in turn, are attached to the VAXBI buses. The XMI supports a total of 14 slots, which can be populated with modules to provide a wide range of system configurations. These configurations can range from small single-processor systems with 32 megabytes (MB) of memory and a single I/O channel to a large multiprocessor system with 256MB of memory and multiple I/O channels. One of the primary system design goals was to support up to eight processors with very good multiprocessor performance. This goal guided the performance decisions concerning the bus, memory, and processor designs.

Figure 1 VAX 6200 System Architecture

The heart of the system, the XMI bus, is largely a hybrid of the VAX 8800 NMI and VAXBI buses. The XMI is a synchronous bus that runs with a 64-nanosecond (ns) cycle time. The data path is 64 bits wide, and the maximum transfer rate is 100MB per second. The protocol supports "pended reads" (as does the SBI on the VAX-11/780 system and the NMI on the 8800). In a pended read transaction, the CPU that wishes to read a location requests use of the bus. When the request is granted, the CPU transmits the address of the desired location. The appropriate memory controller latches the address into an input queue and begins a read access to the specified location. In the meantime, bus ownership is relinquished by the CPU, and the bus may be used by other devices. When the memory has completed the look-up and has the data, it makes a request for the bus. When granted the bus, the memory drives the requested data on the bus, which is latched by the CPU that originally requested the data. Pended protocols are contrasted with nonpended protocols in Figure 2.

Figure 2 Pended versus Nonpended Protocols

Pended protocol (examples: NMI, SBI, XMI)
  Cycle 1: CPU transmits read address on bus
  Cycles 2 to N-1: available for other devices
  Cycle N: memory returns read data to CPU

Nonpended protocol (examples: VAXBI, Q-bus)
  Cycle 1: CPU transmits read address on bus
  Cycles 2 to N-1: bus stalls (no other activity can occur)
  Cycle N: memory returns read data to CPU

Pended protocols are a big advantage when the bus cycle time is significantly less than the memory access time. As a case in point, a memory read on the XMI bus requires about 500 ns (roughly 8 XMI cycles). Without a pended protocol, these 8 cycles on reads would result in wasteful bus stalls. Another advantage of pended protocols is that they allow multiple memory controllers to be used to advantage. In the case of the VAX 6200, it was not practical to build a single memory controller that could keep up with a saturated XMI bus. But it was relatively easy to construct a memory controller that could comfortably run at about one third the bus maximum. With four interleaved memory controllers on the XMI, memory controller bandwidth is greater than XMI bandwidth.

Challenges of Multiprocessor Design

The major challenges faced by the multiprocessor system designer result primarily from one simple system characteristic. The intimate interface between processor and memory that most single-processor systems enjoy must be broken, and main memory must be shared among a large number of devices. This sharing has several effects:

• Main memory access time is significantly increased.

• Bandwidth to main memory becomes a precious commodity that determines overall system performance.

• Complexity results from increased bus traffic and parallel activities.

In the following sections, we expand on each of these effects in relation to the VAX 6200 system.

Increased Main Memory Access Time

In a single-processor system, main memory is generally closely coupled to the CPU. An example of this closely coupled architecture is shown in Figure 3. Clearly, this architecture provides the potential for low-latency and high-bandwidth CPU-to-memory transactions.


Figure 3 Typical Single-Processor Architecture

In the VAX 6200 multiprocessor system, memory must be shared by several devices and therefore cannot be closely coupled to a single processor. The result is a significant increase in main memory access time. Since the MicroVAX 3600 and VAX 6200 systems are both CVAX-based, a comparison of the main memory access times for the two systems illustrates this point. Table 1 shows the access time in processor cycles for the two-level cache subsystem and the main memory.

Table 1 Comparison of MicroVAX 3600 and VAX 6200 Memory Latency

                                  MicroVAX 3600    VAX 6200
Cache 1 (CVAX internal cache)     1 (90 ns)        1 (80 ns)
Cache 2 (Second-level cache)      2 (180 ns)       2 (160 ns)
Main Memory
  First longword                  5 (450 ns)       14 (1120 ns)
  Second longword                 8 (720 ns)       15 (1200 ns)
  Third longword                  na               19 (1520 ns)
  Fourth longword                 na               20 (1600 ns)
  Fifth longword                  na               21 (1680 ns)
  Sixth longword                  na               22 (1760 ns)
  Seventh longword                na               23 (1840 ns)
  Eighth longword                 na               24 (1920 ns)

Table 1 shows that the VAX 6200 takes roughly three times as many processor cycles to access the first longword in memory as does the MicroVAX 3600 system. The main reason for this difference is that the MicroVAX 3600 memory controller actually resides on the CPU module. Therefore, the system architecture is optimized to provide minimum access time for processor accesses to main memory. On the VAX 6200, system memory is a shared resource equally accessible by all CPUs and I/O devices. The price of this equality is increased latency on all memory references. Note, however, that although latency has increased, the VAX 6200 can support almost ten times more memory bandwidth (the amount of data transferred per unit of time).

As will be later presented, the VAX 6200 system uses memory bandwidth to compensate for increased memory latency. Trading bandwidth for latency is one of the fundamental tools of the multiprocessor designer. Cache memory systems essentially convert increased memory bandwidth (manifested as a larger fill size) into lower average read latency (due to the decreased miss rate in the cache resulting from the larger fill size). This explanation is an oversimplification; details of the trade-offs in cache design are presented below in the section on the multiprocessor environment [4].

Table 2 reinterprets the data in Table 1 in terms of bandwidth instead of latency. For example, the MicroVAX 3600 system fetches 8 bytes of data from memory on a cache miss, which requires 8 (90-ns) processor cycles, or 720 ns. This corresponds to 8 bytes of data every 720 ns, or 11.1MB per second. In comparison, the VAX 6200 system fetches 32 bytes of data on a cache miss, which corresponds to 16.7MB per second (32 bytes of data every 1920 ns).

Table 2 Comparison of Processor Read Bandwidths on MicroVAX 3600 and VAX 6200 Systems (in MB per second)

                                  MicroVAX 3600    VAX 6200
Cache 1 (CVAX internal cache)     40.0             50.0
Cache 2 (Second-level cache)      20.0             25.0
Main Memory
  First longword                  8.8              3.6
  Second longword                 11.1             6.7
  Third longword                  na               7.9
  Fourth longword                 na               10.0
  Fifth longword                  na               11.9
  Sixth longword                  na               13.6
  Seventh longword                na               15.2
  Eighth longword                 na               16.7
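The conversion from the latencies in Table 1 to the main-memory bandwidths in Table 2 is simple arithmetic, and a short sketch makes it explicit. The function and constant names below are illustrative only, and 1MB is assumed to mean 10**6 bytes, which is the reading that reproduces the published figures.

```python
# Cumulative main-memory access times from Table 1, in nanoseconds.
# Entry i is the time at which longword i+1 of a cache fill has arrived.
LONGWORD_BYTES = 4
MICROVAX_3600_NS = [450, 720]
VAX_6200_NS = [1120, 1200, 1520, 1600, 1680, 1760, 1840, 1920]


def read_bandwidth_mb_per_s(cumulative_ns):
    """Return the effective MB/s after each longword of a fill arrives.

    Assumes 1 MB = 10**6 bytes, so bytes per nanosecond times 1000
    gives MB per second.
    """
    rates = []
    for count, elapsed_ns in enumerate(cumulative_ns, start=1):
        bytes_moved = count * LONGWORD_BYTES
        rates.append(bytes_moved / elapsed_ns * 1000.0)
    return rates


if __name__ == "__main__":
    for name, times in (("MicroVAX 3600", MICROVAX_3600_NS),
                        ("VAX 6200", VAX_6200_NS)):
        formatted = ", ".join(f"{r:.2f}" for r in read_bandwidth_mb_per_s(times))
        print(f"{name}: {formatted}")
```

Under these assumptions the sketch matches the main-memory rows of Table 2 to within rounding (the first MicroVAX 3600 entry computes to about 8.89, which the table lists as 8.8). It also shows why the longer VAX 6200 fill recovers bandwidth: the 1120-ns wait for the first longword is amortized over eight longwords instead of two.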


Limited Bandwidth to Main Memory

In a single-processor system such as the MicroVAX 3600, the performance is generally limited by the CPU itself and not by the main memory subsystem. The opposite is generally the case on large multiprocessor systems where a large number of processors can create a bottleneck to the main memory subsystem. A major goal of the multiprocessor designer is to minimize the bandwidth required to support a given level of CPU performance. In that way, the main memory bus can support more processors; therefore, the system can attain higher total throughput. For example, assume a processor requires an average of 20 percent of the total bandwidth available to main memory to run a given workload. Based just on bus bandwidth considerations, the total system performance would not exceed five times the single-processor performance if the system is simultaneously running that workload on all processors. For a number of reasons, systems are rarely designed such that the bus must be saturated to meet its performance goals. This same method of calculating performance can be used to estimate performance at some lower level of bus utilization. A bus utilization level of 75 percent is often used; in that case, the system performance would be limited to 3.75 times the single-processor system.

This example reveals one of the main compromises multiprocessor system designers must make: increased bandwidth, which would reduce the main memory access time seen by a single processor, is traded off to reduce the total bandwidth consumed by a single processor and thereby increase total system throughput. Bandwidth is really not the characteristic we are trying to minimize; the real goal is to reduce the number of bus and memory cycles used to sustain a given level of performance. As we will demonstrate, the efficiency of the transfer generally increases as the transfer size increases. Therefore the system can fetch twice as much data from memory without using twice as many bus and memory cycles. This characteristic is important when evaluating various cache alternatives.

Again looking at the MicroVAX 3600 design, the CPU actually starts accessing main memory once the first-level cache has determined a miss occurred but before the look-up in the second-level cache has completed. This overlap means the memory controller will start a large number of accesses that will never result in data being returned to the processor. (The second-level cache will probably "hit" on more than 80 percent of these references.) This behavior is desirable for many single-processor systems but would be inappropriate for a multiprocessor design in which main memory bandwidth is precious.

In the multiprocessor system, main memory bandwidth is shared by all processors and I/O devices. Table 3 compares the system bandwidth in the MicroVAX 3600 and VAX 6200 systems. Since the VAX 6200 uses a pended bus that supports 1 to 8 memory controllers, we present two sets of bandwidth numbers for the VAX 6200 memory subsystem: one for a single memory controller and another for a four-way interleaved, four-memory controller subsystem.

The data makes a strong argument for large transfer sizes to achieve high bandwidths on the VAX 6200. A large cache fill size can be used to assure high read bandwidth, and a write buffer can be used to provide longer length write transactions. Note that longword writes are particularly inefficient in the memory controller; nine cycles are required for a longword write compared with only five cycles for a quadword write. This inefficiency results from the implementation of the error-correcting code (ECC) across a quadword on the VAX 6200 memory. (VAX systems have traditionally implemented ECC across a longword.) This implementation improved the memory module capacity at the cost of forcing all longword writes to be a read-modify-write sequence in the memory.

Increased System Bus Traffic

Another challenge to the multiprocessor designer is the increased memory traffic in the system due to the increased total system performance. For a given workload, it is fairly accurate to assume that the traffic to main memory increases linearly with the total system performance. Therefore, a VAX 6240 (a four-processor 6200 system) would have roughly four times the main memory traffic of the VAX 6210 (a single-processor 6200 system). Since processors must monitor main memory traffic to maintain cache coherency, this increase in main memory traffic has to be considered when looking at cache invalidate implementations. Again the single-processor system has a much less severe problem. The single processor has to monitor only the traffic from I/O devices,


Table 3 MicroVAX 3600 and VAX 6200 Main Memory Bandwidth (in MB per second with corresponding number of cycles in parentheses)

                          MicroVAX 3600*   VAX 6200 XMI Bus   VAX 6200 Memory (64-ns cycles)
                          (90-ns cycles)   (64-ns cycles)     1 Memory      4 Memories
Reads
  Longword (4B)           8.8 (5)          31.2 (2)           10.4 (6)      41.6
  Quadword (8B)           11.1 (8)         62.2 (2)           20.8 (6)      83.2
  Octaword (16B)          na               83.3 (3)           31.2 (8)      124.8
  Hexword (32B)           na               100.0 (5)          38.5 (13)     154.0

Writes
  Longword (4B)   Full    11.1 (4)         31.2 (2)           6.9 (9)       27.6
                  Masked  6.3 (7)          31.2 (2)           6.9 (9)       27.6
  Quadword (8B)   Full    na               62.2 (2)           25.0 (5)      100.0
                  Masked  na               62.2 (2)           13.9 (9)      55.6
  Octaword (16B)  Full    na               83.3 (3)           31.2 (8)      124.8
                  Masked  na               83.3 (3)           16.7 (15)     66.8

* These numbers represent a CPU perspective. I/O devices on the Q-bus can use longer transfer lengths.

which typically generate about one-tenth the traffic generated by a single CPU. Extending this argument, it appears to indicate that a VAX 6240 system must handle invalidate look-ups at a rate more than 30 times that of the MicroVAX 3600 system. (The VAX 6200 CPU has to handle invalidates from three other CPUs and for about four times as much I/O traffic.)

The increased system bus traffic is a symptom of the large number of parallel activities that characterize a multiprocessor system. The abundance of queues in a multiprocessor system results in a more complex system. The section on cache coherency in this paper discusses several manifestations of this increased complexity.

Table 4 summarizes the major differences between the single-processor and multiprocessor systems.

Table 4 Summary of Differences between Single-processor and Multiprocessor Systems

Characteristic                Single-processor System    Multiprocessor System
Memory latency                Low                        Medium
Performance bottleneck        CPU                        Memory bandwidth
Invalidate rate               Low                        High
Level of parallel activity    Low                        High

This discussion has demonstrated that the performance of a multiprocessor system is very dependent on the designers making the right decisions about the CPU interface. In the next section, we discuss the basic architecture of the VAX 6200 CPU and specific aspects of the multiprocessing environment.

VAX 6200 CPU Design Alternatives

This section presents an overview of the VAX 6200 CPU architecture, followed by a discussion of the various implementation alternatives that we considered during the design process. We conclude with a list of specific design alternatives and a discussion of our performance simulation environment, which we used to examine these alternatives.
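Two back-of-envelope estimates from the preceding discussion of multiprocessor challenges carry over into the design choices that follow: the bus-limited ceiling on system throughput, and the invalidate load seen by one CPU. The sketch below restates both as arithmetic. The function names are hypothetical, and the model is deliberately crude: traffic is measured in units of one CPU's main-memory traffic, and one system's worth of I/O is taken as one-tenth of that unit, as the text suggests.

```python
def bus_limited_speedup(bandwidth_share_per_cpu, bus_utilization=1.0):
    """Upper bound on throughput, in units of one processor.

    bandwidth_share_per_cpu: fraction of total bus bandwidth one
    processor needs for the workload (0.20 in the text's example).
    bus_utilization: fraction of the bus the design is willing to use.
    """
    return bus_utilization / bandwidth_share_per_cpu


def invalidate_load(other_cpus, io_load_units, io_to_cpu_ratio=0.1):
    """Traffic one CPU must check for invalidates, in CPU-traffic units.

    io_load_units counts I/O load relative to a MicroVAX 3600 system;
    each unit is assumed to be io_to_cpu_ratio of one CPU's traffic.
    """
    return other_cpus + io_load_units * io_to_cpu_ratio


if __name__ == "__main__":
    print(bus_limited_speedup(0.20))         # saturated bus
    print(bus_limited_speedup(0.20, 0.75))   # 75 percent utilization
    # VAX 6240: three other CPUs and about four times the I/O traffic.
    # MicroVAX 3600: no other CPUs, one unit of I/O traffic.
    print(invalidate_load(3, 4) / invalidate_load(0, 1))
```

With these inputs the ceiling is five processors' worth of throughput on a saturated bus and 3.75 at 75 percent utilization, and the invalidate ratio comes out near 34, consistent with the "more than 30 times" figure quoted above.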


[Block diagram; labeled elements: CVAX with 1KB cache, CFPA, clock, second-level cache (256KB), second-level cache tag store, latch/buffer, CVAX CDAL, A<31:00>, SSC, 256K EPROM, 32K EEPROM, console serial line, duplicate tag store, XMI interface gate array, XCI, XMI corner (XCLOCK, XLATCH), XMI]

Figure 4 VAX 6200 CPU Block Diagram

The VAX 6200 CPU is a single-board VAX processor based on the CVAX chip set designed and built by Digital. The 6200 CPU has a CVAX cycle time of 80 ns (as compared to the MicroVAX 3600 90-ns CVAX cycle time); its nominal performance is 2.8 times the VAX-11/780 system (slightly more than three times the MicroVAX II). A block diagram of the module is shown in Figure 4. Three major buses are associated with the module. The CVAX processor chip set communicates over the CVAX data and address bus (CDAL) [2, 5]. The SSC chip connects to the CDAL bus and provides such functions as read-only memory (ROM) address decoding, time-of-year clock support, and console terminal interface. The CVAX chip contains the first-level cache. Also connected to this bus is the second-level cache data store and tag store logic. The path to the XMI bus is provided entirely by the XMI interface gate array and the XMI corner. This gate array provides all necessary synchronization between the CVAX and XMI. Each CPU module has its own CVAX clock source, and the XMI bus has a single clock source that provides synchronous clock signals to all XMI nodes.

The XMI corner represents a standard set of interface components and a physical interconnect that ensure all XMI devices meet the timing and electrical characteristics required by the XMI specification. The XMI corner components interface to the rest of the logic on the module over the XMI chip interconnect (XCI). A duplicate tag store also attaches to the XCI bus.

As outlined in the previous section, several specific challenges must be addressed by the multiprocessor designer. At the CPU level the design responses are as follows:

• Implement an effective cache to reduce the effective access time and the total traffic to main memory.

• Implement a write buffer to decouple and reduce write traffic.


• Implement a duplicate tag store to reduce the overhead and complexity of maintaining cache coherency.

Cache Subsystem

We will first look at the issues associated with designing an effective cache. The main characteristics of a cache are size, associativity, fill size, and block size. Size is simply the size in bytes of the data store section of the cache. As the size of the cache increases, the effectiveness of the cache also increases. Associativity refers to the number of sets in the cache. A cache with a single set can store data with a particular tag address in only a single location. (A single-set cache is often referred to as a direct-mapped cache.) A two-set cache has two locations capable of storing data with a particular tag address. As associativity is increased, the likelihood of cache "thrashing" decreases. (Thrashing occurs when two pieces of data cannot simultaneously be in the cache due to an insufficient number of sets.) The likelihood of thrashing also decreases as the cache size increases. Therefore it follows in most cases that as the cache size increases, the benefits of increased associativity decrease. Fill size defines the amount of data that is fetched from main memory on a cache miss and loaded into the cache. Over the range of cache sizes of interest, the miss rate decreases as the cache fill size increases. Block size refers to the size of the data block covered by a single tag address. In a direct-mapped cache, the block size is equal to the cache size divided by the number of tags. The fill size is equal to or less than the block size.

A major issue facing the designer of any computing system is the amount of variation in performance that can be accepted over a wide range of workloads. Since we were concerned about our ability to accurately model the effect of large caches, we wanted to err on the side of conservatism. This meant we would choose the largest cache size practical. The state-of-the-art technology static random-access memories (SRAMs) available to the VAX 6200 team were expected to be 256-kilobit (Kb) parts with speeds down to 35 ns. We determined that a pipelined cache design with 35-ns SRAMs could support CVAX cycle times down to 60 ns. This cycle time was comfortably beyond our product goal, which was to support a range of 70 ns to 100 ns, depending on the success we had speed-binning CVAX parts. We tentatively decided to use 64K-by-4 SRAMs for the data store, largely because the 64K-by-4 configuration was expected to be the most readily available. Since the CVAX has a 32-bit data path, eight 64K-by-4 parts would naturally provide a 256KB direct-mapped cache (four times the size of any previous VAX). This configuration also provided the optimal one-output-load per data line. We also examined configurations with increased associativity to confirm our belief that the benefit of set sizes greater than one is small for caches in the range of 256KB.

Having selected a very large cache, we next considered block size and fill size. The XMI bus supports only 8 (quadword), 16 (octaword), and 32 (hexword) byte transfers to memory. Therefore, the fill size would have to be one of these three sizes. The block size can be larger than the fill size if the design supports what are called subblock valid bits. Ideally the fill size and block size would be the same. With a very large cache, however, providing sufficient tag storage can be a real problem. Again in an attempt to be conservative, we looked into state-of-the-art, tag-integrated circuits. The best we found in the required 25- to 30-ns speed range was a 2K-by-9 part. With two of these parts, we could implement a 2K tag store subsystem. A 256KB data store with 2K tags would have a 128-byte block size. Subblock valid bits would be needed to identify which subblocks are actually valid. We decided it would be practical to choose a larger tag store size in which four tag chips would be used to implement a 4K tag store subsystem.

Choosing the ideal fill size was expected to involve an interesting compromise between several characteristics. As the fill size is increased, several things happen.

• The cache miss rate drops. Over reasonably large ranges, the miss rate can be reduced by requiring that more data be fetched on a cache miss. This is not true when the likelihood of using the new data is less than the likelihood that bringing in the additional new fill data will force the flushing of other cache data more likely to be used. This will not occur with cache and fill sizes in the range considered for the VAX 6200.

• CPU stalls per miss increase. In VAX 6200 CPU architecture, as the second-level cache is being filled, the CVAX cannot access it. On a second-level cache miss, the XMI interface does return the actual requested data item to the CVAX first and then completes the remainder of the cache fill. Therefore, the number of


cycles in which the CVAX is stalled waiting for the second-level cache to become available again after a cache miss increases as the fill size increases. The CVAX internal cache remains accessible while the second-level cache is being filled.

• The MB per second to main memory required to support a given level of performance increases. If twice as much data is fetched on a cache miss, the miss rate does not drop by a factor of two. Therefore, as fill size increases, the MB per second required to support a given level of performance increases.

• The "available MB per second" of the bus increases. The efficiency of buses that do not have separate address lines (such as the XMI) increases as the transfer size increases. Basically, the required address cycle can be amortized over more data cycles.

• The "available MB per second" of the memory controller increases. The memory controllers in the VAX 6200 can deliver more MB per second if more data is fetched for a given fetch address.

Based on our significant experience with VAX systems, we knew that either the 16-byte or 32-byte fetch would be the right choice. The results from simulation would be used to select the final value.

Another major cache design issue was the configuration of the CVAX 1KB internal cache. This cache can be configured to run in a conventional instruction and data stream write-through mode. In this mode, the cache must be invalidated when writes occur to a stored block. Alternatively, the cache can be run in I-stream-only mode in which the cache does not have to be invalidated on writes. Instead, the cache is automatically flushed on VAX Return from Exception or Interrupt (REI) instructions. The methods we used to ensure the success of this cache coherency mechanism are discussed in the section Maintaining Cache Coherency and Handling Cache Error Conditions.

Assuming all other things remain equal, there is a performance penalty for choosing the I-stream-only mode. If we select I-stream-only mode, the following occurs:

• All D-stream references will require a minimum of two cycles instead of one. Generally, for VAX CPUs an average of 0.8 D-stream references are made per instruction and an average instruction on the CVAX requires between 9 and 10 cycles. This would seem to indicate that the performance penalty would be about 8 percent (0.8 references divided by 9.5 cycles), assuming the D-stream miss rate in the internal cache is 0 percent. With an expected more-typical 40 percent miss rate, the penalty would be about 5 percent.

• CVAX stalls will increase for references that occur while the second-level cache fill for a previous reference is still not complete. This increase results because the CVAX will need to access the second-level cache on all D-stream references.

• Assuming a low frequency of REI instructions, the I-stream miss rate should improve since there will be no contention for cache blocks between the I and D streams. (REIs will cause the I-stream-only cache to flush.)

• The module space needs will be less because there will be no need for an extra duplicate tag to track the CVAX internal cache. Since the CVAX internal cache has two sets, it cannot be practically "followed" by a simple second-level, direct-mapped cache.
  Looked at another way, we could afford to devote more logic to making the second-level cache more effective if we did not support CVAX D-stream caching.

• The complexity lessens with one less cache to keep coherent with hardware. We also had more flexibility in implementing error-recovery mechanisms and would not have to implement a complex mechanism to suppress the generation of XMI write transactions when the invalidate queue was at risk of overflowing.

We planned to use the simulation environment to quantify the performance penalty that results from running the CVAX cache in I-stream-only mode.

Write-Buffer Subsystem

Conventional write-through caches greatly reduce read traffic to main memory but do not reduce the write traffic. Therefore, although the mix of read and write references from the CPU itself is weighted heavily toward reads, the traffic downstream of a write-through cache is primarily writes. Other cache architectures offer the potential to reduce write traffic. A write-back cache


might be considered the obvious approach. By caching writes as well as reads, a write-back cache offers the potential for the highest performance multiprocessor system. Nevertheless, the complexity is significantly higher than a write-through design. Industry experience is that very few write-back caches work on first-pass, and their bugs are very difficult to fix. Another risk with write-back caches is in the area of error recovery. It is much more difficult to recover from transient cache errors with a write-back protocol. To avoid the increased complexity and resulting schedule risk, we decided to pursue a hybrid approach. We would implement a write-through cache with a write buffer design very similar to that of the VAX 8800 cache [7].

A write buffer resides between a write-through cache and the system bus. A write buffer is actually a simple, very effective form of a write-back cache. A write buffer takes advantage of the locality of write transactions to reduce the number of write references to main memory by combining several small write references into a single larger transaction to main memory. This behavior has three main advantages. First, almost all buses (including the XMI) increase in efficiency as the transfer size is increased. This efficiency results because every transfer generally requires the transmission of an address cycle before the data. This address cycle is basically fixed overhead that can be more effectively amortized as the transfer size is increased. The transfer sizes and relative efficiencies of the XMI bus are shown in Table 3. Second, as previously mentioned, the VAX 6200 memory does not efficiently process longword write transactions. The write buffer converts significant numbers of longword write transactions into full quadword and octaword transactions that are processed with many times greater efficiency.

Finally, the buffer helps to reduce the frequency of processor "write stalls," that is, processor cycle slips due to writes to main memory that back up. The buffer largely decouples the processor from the main memory write timing; the processor perceives that most writes are completed in minimum time.

The VAX 6200 write buffer accumulates write data until a memory write address falls outside the address range of the current block. When this occurs, an alternate octaword buffer begins filling. The first buffer is emptied either with an octaword XMI transaction (if the buffer contains more than an aligned quadword) or with a quadword XMI transaction (if the buffer contains no more than an aligned quadword). CVAX CPU reads (unless interlocked or made to I/O space) are allowed to bypass the write buffers after first being checked for an address match with the write buffer.

Either a read address comparison match or an interlocked or I/O space transaction forces the write buffer to be purged. There are several other conditions under which the write buffer must be flushed. These conditions are discussed in the section Maintaining Cache Coherency and Handling Error Conditions.

We believed the write buffer could provide about half the bandwidth benefit of the write-back cache but with little more complexity than a simple write-through design. As an added benefit, the buffer architecture was already implemented and running with very good performance results in a VAX multiprocessor (VAX 8800 family). We planned to use performance simulations to confirm that the write buffer was adequate to meet our performance goals.

Duplicate Tag Store

As noted earlier, a multiprocessor environment puts significant strain on the cache coherency logic. The rates at which write addresses on the system bus must be checked against the addresses stored in the cache require that a different architecture be used for servicing invalidates.

The 2K-by-9 tag chips used to implement the main tag store are also used to implement a duplicate tag store. The duplicate tag store runs synchronously with the XMI bus and permits filtering of invalidates, so the CPU would stall only on an XMI write hit. It is not uncommon to have ratios of 100 to 10,000 to 1 between duplicate tag misses and duplicate tag hits.

The operation of the duplicate tag store is discussed in the section Maintaining Cache Coherency and Handling Error Conditions.

We have now defined the basic architectural issues that needed to be resolved and have indicated the alternatives we would like to pursue. In the next section we present the results of our performance simulations.

The following list summarizes what we examined in our simulation environment:

• Determine the loss in performance that would result from running the CVAX internal cache in I-stream-only mode instead of combined I- and D-stream mode.


• Investigate octaword (16-byte) versus hexword (32-byte) fill sizes for both I-stream and D-stream. Further, examine the relative miss rates, MBs per unit of performance, bus cycles per unit of performance, memory cycles per unit of performance, and absolute performance. Look at a large multiprocessor system's sensitivity to main memory access time.

• Determine the effectiveness of a write-through with write-buffer cache architecture. In other words, can the writes be reduced sufficiently to avoid write-back in the chosen architecture?

• Examine the benefits of a two-way, set-associative cache over a simpler direct-mapped design.

Performance Simulation

The basis of the simulation environment was a high-level performance model of the CVAX chip. Written in PASCAL, this model was interfaced to a configurable second-level cache, write buffer, and memory subsystem. The model accepted instruction traces for input. At the time the performance modeling was done, seven standard benchmarks were available: DIRECTORY, EDT, FORTRAN, LINKER, MAIL, RUNOFF, and SORT. All instruction traces were captured from a VAX-11/780 system. Since each trace was for a single process, one of the major issues was determining how to correctly model the effect of timesharing on cache performance.

The very nature of timesharing has a negative effect on cache performance as compared with single-process runs. Ideally, the cache would be dedicated entirely to holding instructions and data associated only with a single process. In timeshare systems, processes are not initiated and then run nonstop to completion; instead the CPU is constantly switching from process to process. This switching requires the cache resources to be distributed across a number of processes and therefore reduces the effectiveness of the cache.

A VAX-11/780 study⁶ indicates that the average number of instructions between context switches on a VAX system is about 5,000 instructions. A traditional and very conservative approach to simulating the effect of context switches is to flush the entire cache every 5,000 instructions.

Flushing the cache every 5,000 instructions was not a big penalty for small caches that could quickly refill themselves after a flush; however, the advantage of larger caches (that we know actually exists) could not be demonstrated when the model ran with a flush every 5,000 instructions.

To more accurately model the benefits of large caches, internal studies of complex timeshare loads were undertaken. Multiuser program traces were run against a cache model, subjecting the cache model to the context-switch behavior of a real system. The cache performance results of that run were compared with single jobs run against a cache model that was flushed after various numbers of instructions had been executed. The results indicated that similar cache performance results could be obtained in simulation by using a single job trace and complete cache flushing every 35,000 instructions. The number 35,000 applies only to a 256KB cache; smaller caches would have a smaller context-switch interval. We decided to simulate the VAX 6200 with the 256KB cache flushed every 35,000 instructions; the 1KB CVAX internal cache would be flushed at the more traditional 5,000 instruction rate. All simulations would represent a single-processor system; main memory access times would be minimum. The performance results would generally be presented as a set of relative numbers comparing the alternatives.

Table 5 summarizes all the cache characteristics we would simulate.

CVAX Internal Cache Configuration

The first aspect examined was the CVAX cache configuration. As shown in Table 6, the I- and D-stream design offered an average increase in performance of 5 percent over the I-stream-only cache. We concluded 5 percent average performance could be sacrificed in return for the reduced complexity of the I-stream-only design.

Table 5 Cache Characteristics Simulated

                                 CVAX Cache           Second-level Cache
Associativity                    2-way                Direct-mapped/2-way
Configuration                    I & D/I only         I & D
Size                             1KB                  256KB
Block size                       8B                   64B
Fill size                        8B                   16B/32B
Tags                             1K                   4K
Simulated context switch rate    5,000 instructions   35,000 instructions
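The flush-based approximation of context switching can be illustrated with a small sketch. This is not the PASCAL model described above: it is a minimal Python illustration that assumes a direct-mapped cache, counts memory references rather than instructions, and uses a synthetic looping trace (a hypothetical 128KB working set) in place of the captured benchmark traces. The function name and parameters are invented for the example.

```python
def simulate_miss_rate(trace, cache_bytes, block_bytes, flush_interval):
    """Return the miss rate of a direct-mapped cache over an address trace.

    The whole cache is invalidated every `flush_interval` references to
    approximate a context switch; 0 means the cache is never flushed.
    """
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks          # one entry per cache block
    misses = 0
    for count, address in enumerate(trace):
        if flush_interval and count > 0 and count % flush_interval == 0:
            tags = [None] * num_blocks  # simulated context switch
        block = address // block_bytes
        index = block % num_blocks
        if tags[index] != block:        # miss: allocate the block
            misses += 1
            tags[index] = block
    return misses / len(trace)


# Hypothetical trace: a loop that sweeps a 128KB working set repeatedly.
trace = [(i % 2048) * 64 for i in range(200_000)]

for interval in (5_000, 35_000, 0):
    rate = simulate_miss_rate(trace, 256 * 1024, 64, interval)
    print(f"flush interval {interval:>6}: miss rate {rate:.4f}")
```

With this synthetic trace the 256KB cache spends a large share of every 5,000-reference interval refilling itself, which is the effect that hid the benefit of the larger cache; the longer interval leaves most references as hits.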


Table 6 CVAX I-stream and I- and D-stream Relative Performance

           I-stream   I- & D-stream
Average    1.00       1.05
Minimum    1.00       1.03
Maximum    1.00       1.07

Octaword versus Hexword Fill Size

Choosing an octaword or a hexword fill size was the next and probably the most complex major issue. The results are shown in Table 7. In all cases, relative numbers are used with the characteristics of the octaword machine as the reference.

Table 7 Octaword versus Hexword Fill Size Results

Fill Size              Relative Performance
Octaword   All         1.00
Hexword    Average     1.01
           Minimum     1.01
           Maximum     1.02

                       Relative Miss Rates              Relative MB/sec
Fill Size              I-stream  D-stream  All Reads    I-stream  D-stream  All Reads
Octaword   All         1.00      1.00      1.00         1.00      1.00      1.00
Hexword    Average     .56       .84       .71          1.12      1.68      1.42
           Minimum     .54       .81       .68          1.08      1.61      1.36
           Maximum     .57       .87       .76          1.14      1.74      1.52

                       Percent XMI                      Percent Memory
Fill Size              I-stream  D-stream  All Reads    I-stream  D-stream  All Reads
Octaword   All         1.00      1.00      1.00         1.00      1.00      1.00
Hexword    Average     .93       1.40      1.18         .91       1.36      1.15
           Minimum     .90       1.34      1.16         .88       1.31      1.13
           Maximum     .96       1.45      1.27         .95       1.41      1.27

                       Relative Percent   Relative Percent
Fill Size              XMI Utilized       Memory Utilized
Octaword   All         1.00               1.00
Hexword    Average     1.08               1.04
           Minimum     1.06               1.03
           Maximum     1.09               1.07
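The relative MB/sec columns of Table 7 appear to follow directly from the relative miss rates: a hexword fill moves twice as many bytes per miss as an octaword fill, so the read bandwidth ratio is roughly twice the miss rate ratio. The Python sketch below checks this reading against the "Average" row. The relationship is an inference from the table, not a statement made in the text, and it ignores the roughly 1 percent performance difference between the two machines.

```python
OCTAWORD_BYTES = 16
HEXWORD_BYTES = 32

# Relative hexword miss rates, "Average" row of Table 7 (octaword = 1.00).
hexword_miss_rate = {"I-stream": 0.56, "D-stream": 0.84, "All reads": 0.71}
# Relative MB/sec reported in the same row of Table 7.
reported_mb_per_sec = {"I-stream": 1.12, "D-stream": 1.68, "All reads": 1.42}


def relative_read_bandwidth(relative_miss_rate):
    """Bytes fetched per unit of work, relative to the octaword machine."""
    return relative_miss_rate * HEXWORD_BYTES / OCTAWORD_BYTES


for stream, rate in hexword_miss_rate.items():
    estimate = relative_read_bandwidth(rate)
    print(f"{stream}: estimated {estimate:.2f}, "
          f"reported {reported_mb_per_sec[stream]:.2f}")
```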


A summary of the results in Table 7 follows:

• The fill size has a negligible effect on performance (less than 1 percent difference). The hexword alternative delivered an average of 1 percent better performance.

  It is important to keep in mind that the simulation was performed assuming the minimum delay from main memory. In a multiprocessor system, the alternative with the lower miss rate increases in performance relative to the other alternatives as the main memory access time increases.

• Hexword fetches dropped the overall miss rate by almost 30 percent. (As expected, the I-stream miss rate improvement was much higher, almost 50 percent.)

• The megabytes per second required to maintain a given performance level increased by about 40 percent overall for the hexword fetch.

• As mentioned earlier, we were not as concerned about megabytes per second as much as the percentage of the bus and memory controller cycles per second. In this light the hexword alternative required about 18 percent more bus cycles and 16 percent more memory cycles to support read traffic to main memory. Eighteen percent and 16 percent may seem like a big increase, but it is important to look at overall bus bandwidth. On a write-through interconnect, the writes generally dominate the traffic.

• The overall bus traffic (taking into account writes) increased by only about 9 percent. Overall memory controller cycles increased by even less, only about 4 percent. The low increase resulted because the ratio of write cycles to read cycles is higher in the memory controller than on the XMI bus.

Based on this data, we chose the hexword fill alternative. We felt the potential for significantly more consistent performance in large multiprocessor configurations (due to decreased cache miss rate) was worth the estimated 9 percent increase in bus utilization.

Write Buffer Effectiveness and Overall Bus Utilization

We were pleased to find that the write buffer was about as effective as we had predicted. The data in Table 8 compares the XMI write traffic generated with and without a write buffer. The data is quite consistent. On average, the write buffer reduced the number of write cycles on the bus by slightly less than half (45 percent) and reduced the memory controller cycles by slightly more than half (51 percent).

Table 8 Write Buffer Effectiveness

                           Ratio With Write Buffer/Without Write Buffer*
           Write Buffer    XMI            Memory
           Miss Rate       Utilization    Utilization
Average    47.1%           .55            .49
Minimum    40.4%           .50            .42
Maximum    54.9%           .64            .58

* The utilization numbers are expressed as ratios between the utilization with a write buffer and the utilization without the write buffer.

Table 9 shows the bus utilization by the VAX 6200 CPU running the test benchmarks. Using the average bus utilization number of 6.27 percent still yields only 50 percent for a full eight-processor system; the 7.25 percent maximum value yields 58 percent utilization. These figures are well within our 75 percent utilization design goal, and we decided to implement the write-buffer instead of the write-back design.

Table 9 XMI Bus Utilization per CPU

           I-stream Reads   D-stream Reads   Writes   Total*
Average    .89%             1.39%            4.41%    6.27%
Minimum    .24%             1.26%            3.57%    5.27%
Maximum    1.65%            2.10%            5.97%    7.25%

* The numbers in this column are averages of the total XMI bus utilization across the seven workloads. These numbers are not sums of the individual utilization percentages in each column.

Another more conservative way to look at the data is to assume that we may not have the worst-case environment covered in any single benchmark. Therefore we should look at the "sum of maximums" to determine whether the design goal is met. Using the sum of maximums approach, we require 9.72 percent of the XMI per processor, or about 78 percent for eight processors. This figure is sufficiently close to our design goal of 75 percent maximum utilization to be acceptable.
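The eight-processor projections can be reproduced from the Table 9 figures with simple arithmetic. The Python sketch below assumes, as the text implicitly does, that per-CPU bus utilization scales linearly with the number of processors; the function and variable names are illustrative only.

```python
PROCESSORS = 8
DESIGN_GOAL_PERCENT = 75.0

# Per-CPU XMI utilization figures from Table 9, in percent.
average_total = 6.27
maximum_total = 7.25
column_maximums = {"I-stream reads": 1.65, "D-stream reads": 2.10, "Writes": 5.97}


def system_utilization(per_cpu_percent, processors=PROCESSORS):
    """Project bus utilization, assuming it scales linearly with CPU count."""
    return per_cpu_percent * processors


sum_of_maximums = sum(column_maximums.values())

for label, value in (("average", average_total),
                     ("maximum", maximum_total),
                     ("sum of maximums", sum_of_maximums)):
    projected = system_utilization(value)
    print(f"{label}: {value:.2f}% per CPU, {projected:.0f}% for "
          f"{PROCESSORS} CPUs (goal {DESIGN_GOAL_PERCENT:.0f}%)")
```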


Effect of Associativity

We next explored the benefits of associativities greater than one. Implementation of a cache other than a direct-mapped cache was probably not practical. However, we wanted to examine the performance results.

The results given in Table 10 indicate that a two-way, set-associative cache could reduce the overall miss rate by 13 percent, whereas the performance gain was negligible (1 percent). This improvement in miss rate is fairly significant; but we determined it was not practical from a module real estate and electrical timing perspective to implement other than a direct-mapped scheme. To implement a fast two-way cache, two separate RAM arrays must be supported. This implementation requires roughly twice the module area of a direct-mapped approach. A two-way cache can be implemented with a single RAM array (cannot start the RAM look-up until the proper set has been identified), but this would force the access time to increase by a cycle. Increasing the access time to the second-level cache would be particularly undesirable to the VAX 6200 designers since we had already decided to configure the CVAX cache in I-stream-only mode. (With an additional cycle, all D-stream references would then require a minimum of three cycles.) Board area constraints and increased cache access time are the two most common reasons for rejecting the miss reductions of the multiway cache in favor of the simplicity and the practical, fast access time of the direct-mapped cache.

Table 10 Direct-mapped versus Two-way Cache Performance (All Reads)

           Relative Miss Rates        Relative Performance
           Direct-mapped   Two-way    Direct-mapped   Two-way
Average    1.00            .87        1.00            1.01
Minimum    1.00            .74        1.00            1.00
Maximum    1.00            .95        1.00            1.02

Maintaining Cache Coherency and Handling Cache Error Conditions

As mentioned in the introduction, a major challenge to a multiprocessor designer is to implement a reliable scheme for cache coherency. Coherency is a term somewhat difficult to define. In this section, we give some insight into the meaning of coherency and the methods employed by the VAX 6200 project engineers to ensure coherency. We also describe our techniques for supporting recovery from all single-bit transient cache errors.

For this discussion, we divide the cache subsystem of the VAX 6200 into three sections. Figure 5 shows the three major subsystems in the VAX 6200 cache:

• The CVAX internal I-stream-only cache

• The 256KB I- and D-stream cache

• The 16-byte write buffer (a form of write-back cache)

CVAX I-stream-only Cache

The first cache, contained within the CVAX chip itself, is configured for I-stream-only operation. In that mode, the CVAX flushes the entire contents of the cache whenever a VAX REI instruction is executed. Motivated originally by the potential problems with instruction prefetch buffers, the VAX architecture defines rules for software to assure that writes to I-stream data produce predictable results.⁸ In all cases, if the rules are not followed, stale data may be read from the cache and cause unpredictable results.

[Figure: block diagram. CPU, 1KB I-stream cache, 256KB I- and D-stream cache, 16B write buffer, to main memory]

Figure 5 VAX 6200 Cache Subsystems


Second-level I- and D-stream 256KB Cache

The second-level cache is architecturally similar to caches used on most VAX systems. With a write-through design, the cache stores both I- and D-stream data. Coherency is maintained by monitoring all writes from other devices to main memory and invalidating cached locations that correspond to any of the monitored writes. The processor does not generate invalidates for its own writes to main memory since the cache is write-through; a write by the processor itself that hits in the cache immediately updates the appropriate location.

The VAX 6200 second-level cache coherency logic is shown in Figure 6. A duplicate tag store is located on the multiplexed XCI bus. This store contains a duplicate copy of the 4,096 cache tag entries, which are in the second-level cache located on the CDAL. The duplicate tag store tracks the primary tag store on allocates by monitoring XMI read transactions. Whenever an XMI memory space read is initiated, the CPU allocates the cache block that corresponds to the read address.

The duplicate tag store also monitors all XMI write transactions and performs a duplicate tag store look-up. If a hit occurs and the write was not from this CPU, then the duplicate tag location is invalidated. The address is then loaded into an eight-entry invalidate queue implemented in the XMI interface gate array. Cache invalidates are not performed in response to an individual CPU's own writes since the write-through second-level cache always contains the most recent data.

When an entry has been loaded into the invalidate queue, the CDAL interface logic arbitrates for the CDAL and invalidates the full 64-byte block in which the write address was located. The use of a duplicate tag store reduces CDAL traffic to only necessary invalidate transactions. After performing an invalidate, the XMI interface gate array checks for any additional invalidates that may have accumulated while the previous invalidate was being serviced. If another invalidate request exists, then it is serviced prior to release of the CDAL. This procedure ensures that invalidates are serviced as quickly as possible. The CVAX bus interface ensures that the invalidate logic is given an opportunity to use the CDAL between every CVAX bus operation.

Though occurring very infrequently, the XMI bus could issue writes quickly enough to overflow the CPU's invalidate queue. Instead of adding significant complexity to the invalidate controller to suppress the generation of XMI write commands when the invalidate queue is at risk of overflowing, the overflow condition is handled as an exception condition. (This subject is discussed in the section Handling Second-level Cache Error Conditions.) For this alternative to be practical, we had to ensure that invalidate queue overflows would be very rare; we felt this was ensured by the depth of the invalidate queue (eight entries) and the optimized design of the invalidate controller.

[Figure: block diagram showing the tag store and data store on the CDAL bus, the XMI interface gate array with its eight-entry invalidate queue, the duplicate tag store and its "hit" signal on the XCI bus, and the XMI corner connecting to the XMI bus]

Figure 6 Second-level Cache Coherency Logic

The Write Buffer

A write buffer design offers the designer opportunities to break cache coherency rules. The VAX 6200 CPU follows several rules to maintain coherency. The VAX 6200 hardware automatically flushes the write buffer under the following conditions:

• In response to a write that misses the currently active write buffer. The current write buffer is


flushed while the new write is accepted by the alternate buffer, thus write ordering is maintained.

• Before an XMI I/O space read or write reference is performed. I/O references could result in the initiation of an I/O operation that may require the data from the write buffer.

• Before an interlock read or unlock write reference is performed. Interlock sequences are the primary means for synchronization between processors and must always force all outstanding writes to main memory.

• Before an interprocessor interrupt is performed. As with interlocks, interprocessor interrupts are used for synchronization between processors and must always force all outstanding writes to main memory.

• Before issuing an XMI read to a location that includes the data contained in the write buffer. The write buffer contents are flushed to main memory and then the XMI read is issued. Reads that miss the write buffer do not force a write buffer flush ("write buffer bypass").

• Following the assertion of the CVAX clear-write-buffer pin, the CPU flushes the write buffer to main memory. This form of write buffer flushing is primarily used to associate failed writes with a given process. If no association could be made, then the operating system would always have to crash the entire system on every failed write transaction.

Handling Second-level Cache Error Conditions

One of the major goals of the VAX 6200 design was to provide improved system reliability. One method we used was hardware-enforced soft failover in response to many error conditions, combined with efficient software recovery procedures. This method was used extensively when dealing with all types of second-level cache errors.

In general, the individual processors have the responsibility to recover from potential cache coherency failures. When errors occur that may leave the second-level cache incoherent, the VAX 6200 processor hardware automatically disables the cache. Disabling the cache ensures that the system can continue to run "safely," albeit at reduced performance. The processor then posts a "soft" error interrupt. The interrupt service routine responds by logging the error and then flushing and reenabling the cache.

The following error conditions cause the XCP hardware to disable the second-level cache. The errors are of two forms. The first two are error conditions that potentially result in a missed cache update on a write-through; the last three deal with conditions under which an invalidate is potentially missed:

• Subblock valid bit parity errors: The VAX 6200 CPU supports a doubly-redundant set of subblock valid bits. On a cache look-up, if the two corresponding valid bits do not match, then the hardware reports a parity error and forces a cache miss. If this error occurs on a write-through that should have hit in the cache, then the cache state is no longer consistent.

• Cache tag parity errors: The tag chips used on the VAX 6200 CPU support parity on the full tag address. As with valid bit errors, a tag parity error can result in a missed write-through.

• XMI inconsistent parity error: If the CPU detects an XMI cycle that has bad parity and that cycle is acknowledged by another processor, then the worst-case assumption is that the duplicate tag logic just missed a write transaction that should have resulted in an invalidate.

• Duplicate tag store parity error: As with the previous error, the processor has to assume the parity error resulted in a missed invalidate.

• Invalidate queue overflow: Again, this condition is similar to the one above except that this condition does not require a transient error in the system. Instead, an invalidate queue overflow is the result of a very rare combination of XMI writes that result in a queue backup and the potential loss of invalidates. The system responds to this condition just as it would for all other cache errors.

Actual System Performance Results

We were very interested in determining how well our simulation results matched real-world operation. We decided to focus on several key aspects of the system to bound the task of correlating simulation with the real world. Specifically, we planned to

• Confirm that the VAX 6200 CPU performs as expected relative to the MicroVAX 3600 systems. If the cache subsystem behaves as


expected, then the VAX 6200 performance should exceed that of the MicroVAX 3600 systems by the clock rate improvement minus the penalty for running the CVAX cache in I-stream-only mode.

• Confirm that the simulation traces adequately "stress" the memory interface such that extrapolation to real workload performance is valid. The percentage of the XMI bus consumed would be the basis for this comparison. This characteristic includes all the effects of references per instruction and miss rates and ultimately determines the performance of a multiprocessor machine.

• Confirm that the cache subsystem supports very effective utilization of multiple processors. VAX 6200 multistream throughput measurements form the basis of this verification.

• Compare the results from the simulation tests with similar workloads run on real machines.

Comparing VAX 6200 and MicroVAX 3600 Systems

Due to the similarities between the two systems, our first approach was to compare the performance of the VAX 6200 to the MicroVAX 3600 systems by running a set of 100 compute-intensive benchmarks. The VAX 6200 has a 12 percent cycle time advantage (90 ns to 80 ns), but it is somewhat handicapped by the I-stream-only limitation placed on the internal cache. Recall that our performance simulation indicated this penalty would average about 5 percent. (See Table 6.) On average then, we expected the VAX 6200 CPU to be about 7 percent faster than the MicroVAX 3600 CPU. The compute-intensive benchmarks basically confirmed this number; VAX 6200 performance averaged 6 percent faster than the MicroVAX 3600 CPU.

Multiprocessor Bus Bandwidth Utilization: Real and Simulated Workloads

We have run several forms of multiuser timesharing workloads on the VAX 6200 system. These workloads include Digital's standard ALL-IN-1 workload, an order processing benchmark (Compu-Share), an electrical CAD workload, and a software development workload.⁹ In all cases, the average percentage of the XMI used per processor ranged from 3.75 to 5.0. Recall that our simulation indicated that the percentage of the XMI consumed would be 6.27 percent. (See Table 9.)

Multistream Performance on Compute-intensive Benchmarks

It is beyond the scope of this paper to present the multiprocessor simulation data that was generated prior to design. That data indicated that the VAX 6200 system performance on compute-intensive benchmarks would be nearly linear when running from one to eight processors.

Tests to date have confirmed our high expectations. On compute-intensive workloads, a four-processor system consistently provides better than 3.95 times the throughput of the single-processor system (less than 2 percent degradation). Limited configuration testing on systems with up to eight processors indicates that compute-intensive workloads continue to perform very well. An eight-processor system performed at 7.75 times the single-processor system (less than 5 percent degradation).

Fully Characterized Workloads

We also instrumented a VAX 6200 system to measure a number of processor characteristics, including bus utilization. We wanted to determine how much the real workload runs varied from the simulated runs. The test methodology was quite simple.

• Command files were created that executed a single benchmark. These individual benchmarks were designed to correspond with the simulation traces listed at the beginning of the Performance Simulation section.

• The Digital Command Language (DCL) command files were of the following form:

    $
    $ @flushcache             ! initially flush the cache
    $ @starthardwaresample    ! start the measurement hardware
    $ @getcputime             ! get the initial CPU time
    $
    $ run benchmark
    $
    $ @getcputime             ! get the final CPU time
    $ @stophardwaresample     ! stop the measurement hardware
    $


• The measurement hardware consisted of two Tektronix DAS 9200 Logic Analyzers; one monitored the processor bus, and the other was attached to the XMI. The start-measurement command file simply referenced a specific XMI I/O space address on which the DAS 9200 analyzers would trigger and start taking measurements. Similarly, the stop-measurement command file would reference another XMI I/O space address that would cause the logic analyzers to stop acquiring data.

This technique made the measurement process simple and repeatable. The overhead of the command file was measured by running the command file with the "run benchmark" line removed. This overhead was then subtracted from the results obtained from benchmark runs. Run-to-run consistency was better than ±10 percent.

The logic analyzers captured the data necessary to determine the total number of XMI read and write references that occurred during the execution of the command file. This data was used to calculate the total number of XMI cycles used by the processor. To derive the percentage of the XMI utilized, the total XMI cycles were reduced by the command file overhead, and the result was divided by the benchmark CPU time. This method ensures that the XMI percentage is not artificially low due to the inclusion of null time elapsed while the processor is waiting for I/O activities associated with the benchmark to complete.

Table 11 Simulated versus Actual XMI Bus Utilization

           Simulated   Actual      Simulated/
           I-stream    I-stream    Actual Ratio
Average    0.84%       0.32%       2.6
Minimum    0.24%       0.17%       1.4
Maximum    1.65%       0.52%       3.2

           Simulated   Actual      Simulated/
           D-stream    D-stream    Actual Ratio
Average    1.63%       0.74%       2.2
Minimum    1.26%       0.26%       4.8
Maximum    2.10%       1.10%       1.9

           Simulated   Actual      Simulated/
           Writes      Writes      Actual Ratio
Average    4.46%       4.86%       0.92
Minimum    3.57%       3.84%       0.93
Maximum    5.97%       5.75%       1.04

           Simulated   Actual      Simulated/
           Overall     Overall     Actual Ratio
Average    6.09%       4.86%       1.25
Minimum    5.27%       3.84%       1.37
Maximum    7.25%       5.75%       1.26
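The utilization calculation described in the measurement methodology can be sketched as follows. This Python fragment is an illustration only: the cycle counts and CPU time are hypothetical values, the function name is invented, and the 64 ns figure is the XMI cycle time quoted in the XMI description elsewhere in this issue.

```python
XMI_CYCLE_NS = 64  # XMI bus cycle time in nanoseconds


def xmi_utilization_percent(total_cycles, overhead_cycles, cpu_time_s):
    """Percentage of XMI cycles consumed during the benchmark's CPU time.

    total_cycles:    XMI cycles counted while the whole command file ran
    overhead_cycles: XMI cycles counted with the benchmark line removed
    cpu_time_s:      benchmark CPU time in seconds (final minus initial)
    """
    benchmark_cycles = total_cycles - overhead_cycles
    busy_time_s = benchmark_cycles * XMI_CYCLE_NS * 1e-9
    return 100.0 * busy_time_s / cpu_time_s


# Hypothetical run: 8.0 million cycles in total, 0.5 million of overhead,
# and 10 seconds of benchmark CPU time.
print(f"{xmi_utilization_percent(8_000_000, 500_000, 10.0):.2f}% of the XMI")
```

Dividing by CPU time rather than elapsed time is what keeps idle periods, such as waits for I/O, from diluting the result.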
The results are shown in Table 11. The data indicates that the simulation traces required significantly more XMI read bandwidth (on average more than double) than the similar actual benchmarks. This result is not unexpected, since the simulation runs were designed to simulate a worst-case timeshare workload. (This goal influenced the choice of 35,000 instructions for the cache flush interval.) The real workloads were run on standalone systems, and therefore the cache performance was expected to be higher. We are currently studying the effect of heavy timesharing in multiprocessor systems on cache performance. Initial results indicate that our simulation runs are still conservative.

The results for writes, which are unaffected by context switch rates, matched the actual benchmarks quite closely. The actual benchmarks required about 4 percent to 8 percent more bandwidth than the equivalent simulation trace. Combined read and write bandwidth requirements indicated that the simulated traces used 25 percent more bandwidth than the actual workloads.

Conclusions and Future Work

The VAX 6200 design experience has demonstrated that trace-driven simulation is a powerful tool in the design of a multiprocessor bus interface. Because the designers were able to make informed trade-off decisions, the design met or exceeded all performance goals; and the reduced design complexity helped bring the system to market on schedule. It is a tribute to the team's appropriate control of complexity and to the rigorous verification process¹⁰ that the first-pass VAX 6200 CPU printed circuit design and XMI interface gate array are currently shipping in VAX 6200 systems. At Digital, this level of success is unprecedented for a machine of this complexity.


The continuing trend toward multiprocessing and faster processors will force increasing dependence on complex cache subsystems to deliver the desired system performance. It follows that minimizing the complexity of the cache subsystem will help support ever decreasing time-to-market schedules. Accurate cache simulation techniques will be required to select the implementation that meets the performance goals and is minimally complex.

Acknowledgments

The author would like to acknowledge the work of the following individuals who contributed significantly to the VAX 6200 CPU architecture and design: Brian Allison, Giao Dau, Ron Desharnais, Glenn Herdeg, Dave Ives, Bimal Sareen, Simon Steely, and Doug Williams.

References

1. B. Allison, "The Architectural Definition Process of the VAX 6200 Family," Digital Technical Journal (August 1988, this issue): 19-27.

2. T. Fox, P. Gronowski, A. Jain, B. Leary, and D. Miner, "The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor," Digital Technical Journal (August 1988, this issue): 95-108.

3. E. McLellan, G. Wolrich, and R. Yodlowski, "Development of the CVAX Floating Point Chip," Digital Technical Journal (August 1988): 109-120.

4. J. Winston, "The System Support Chip, a Multifunction Chip for CVAX Systems," Digital Technical Journal (August 1988, this issue): 121-128.

5. A. Smith, "Line (block) Size Choices for CPU Cache Memories," IEEE Transactions on Computers, vol. C-36, no. 9 (September 1987): 1063-1075.

6. D. Clark and J. Emer, "A Characterization of Processor Performance in the VAX-11/780," Proceedings of the 11th Annual Symposium on Computer Architecture (June 1984).

7. J. Fu, J. Keller, and K. Haduch, "Aspects of VAX 8800 C-Box Design," Digital Technical Journal (February 1987): 41-51.

8. T. Leonard, ed., VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-3459E-DP, 1987).

9. B. Moses and K. DeGregory, "Performance Evaluation of the VAX 6200 Systems," Digital Technical Journal (August 1988, this issue): 64-78.

10. J. Basmaji, G. Garvey, M. Heydari, and A. Singer, "The Role of Computer-aided Engineering in the Design of the VAX 6200 System," Digital Technical Journal (August 1988, this issue): 47-56.

Jean H. Basmaji
Glenn P. Garvey
Masood Heydari
Arthur L. Singer

The Role of Computer-aided Engineering in the Design of the VAX 6200 System

The success of the VAX 6200 design is partly attributable to the development and implementation of a total verification plan. The goal of this plan was to shorten the total system design cycle; the approach was to perform sufficient verification to ensure that first-pass parts would boot and run VMS at speed. The team responsible for achieving the goal began implementing the verification process on availability of the first design specification. The team's efforts continued concurrent with those of the module design team. Milestones for the process reflect the verification team's top-down functional approach, proceeding from architectural-level verification through logic, timing, and system verification, and concluding with vector generation. Review and reporting methods established for the project ensured all functions were tested and verified.

This paper presents an overview of the computer-aided engineering (CAE) and CAE-based design verification test (DVT) approach to the development of the VAX 6200 system. Our intent is not to give a step-by-step description; therefore, few details of the implementation are given. The CAE/DVT Group developers believe that project-specific problems are generally best solved by project-specific solutions. Instead, we offer a broad overview of CAE which includes the engineering principles established for the VAX 6200 project and which we believe will be of use to those planning a task of similar scope.

A Brief VAX 6200 System Overview

No discussion of CAE or DVT methodologies can take place without a description of the task to which these methods are applied. For our purposes, the overall task was to engineer, prototype, debug, and release for manufacture the VAX 6200 mid-range computer system.

The VAX 6200 multiprocessor architecture is implemented with CMOS technology.¹,² The system is housed in a 156 by 79 by 76 cm cabinet, which contains a system bus backplane, two 6-slot VAXBI backplanes, a TK50 tape drive, space for future rack-mount devices, power supplies, and blowers.

The heart of the system is a new interconnect called the XMI. This interconnect was specifically designed to serve as the processor-to-memory interconnect in the VAX 6200 system and its derivatives. Optimizations of and trade-offs in the design of the XMI were made with that function foremost in mind. The key features of the interconnect are as follows.

• The pended bus design allows multiple transactions to be in progress at the same time; thus waste of bandwidth is minimized, for instance, during memory read accesses.

• The XMI implements the concept of commander nodes and responder nodes. A commander node initiates a bus transaction to which a responder node must respond.

• The XMI is a centralized arbitration interconnect. Arbitration logic, resident on the backplane, grants bus mastership according to a modified round-robin scheme. There is a higher priority responder round-robin queue and a lower priority commander round-robin queue.

• Bus width is 64 bits.


• Cycle time is 64 ns.

• The XMI supports reads of quadword, octaword, and hexword length, and writes of quadword and octaword length.

• Raw bandwidth is 125 megabytes (MB) per second.
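The raw bandwidth figure follows from the bus width and cycle time listed above: 8 bytes are transferred every 64 ns. The short Python check below makes the arithmetic explicit, assuming decimal megabytes (10^6 bytes) and one full-width transfer on every cycle; the function name is illustrative.

```python
BUS_WIDTH_BITS = 64
CYCLE_TIME_NS = 64


def raw_bandwidth_mb_per_s(width_bits, cycle_ns):
    """Peak transfer rate, assuming one full-width transfer per bus cycle."""
    bytes_per_cycle = width_bits / 8
    cycles_per_second = 1e9 / cycle_ns
    return bytes_per_cycle * cycles_per_second / 1e6


print(raw_bandwidth_mb_per_s(BUS_WIDTH_BITS, CYCLE_TIME_NS), "MB per second")
```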

The XMI supports three module types in the VAX 6200 system: the CPU module (KA62A), a 32MB memory array (MS62A), and an XMI-to-VAXBI adapter module set (DWMBA).

The CPU module is based on the CMOS VAX (CVAX) chip set, which includes a MicroVAX architecture microprocessor (CVAX), a floating point accelerator (CFPA) chip, and a system support chip (SSC). The module supports full VAX capabilities, excepting only PDP-11 compatibility mode. In addition to a two-way associative I-stream cache in the CVAX chip, the module contains a 256 kilobyte (KB) direct-mapped cache. Performance is approximately 2.8 times that of a VAX-11/780 processor.

The MS62A is a 32MB memory array module with an on-board controller. Modules may be interleaved up to eight ways to decrease latencies. Each module has an eight-deep command queue. The arrays are fully error-correction code (ECC) protected.

The DWMBA is an adapter module set which allows the 6200 system to access I/O devices on the VAXBI bus. The DWMBA/A module, which resides in a single XMI slot, is connected by cable to the DWMBA/B module, which resides in a single VAXBI slot. The DWMBA can support up to full VAXBI bandwidth of 13.3MB per second on write transactions and approximately 5.5MB per second on read transactions.

Figure 1 illustrates how these system elements interconnect in a two-processor system with two VAXBI channels.

Because the VAX 6200 system backplane has 14 slots, many system configurations are possible with differing numbers of processors, memory modules, and I/O channels.

In the sections following, we describe the engineering process employed in the design of these logic elements.

Figure 1 XMI Module Connections on a VAX 6220 System

CAE Verification Challenges and Organizational Structure

The overriding goal of any CAE effort is always the same: to shorten the development time needed to bring a product to market. The definition of CAE and the way engineers use CAE to accomplish this goal differ from project to project and even within a single project. Nevertheless, two principles are preeminent.

1. CAE should provide the tools, the methods, and perhaps most importantly, the discipline that together enhance an engineer's productivity without unduly restricting his or her creativity.

2. CAE should provide a continual check to ensure that the engineer's product meets the needs of the project in terms of both function and quality.

The role of the CAE/DVT Group on the VAX 6200 project was different from the traditional CAE role in one significant respect. The group's primary responsibility would not be the development of CAE tools and processes. Instead, its responsibility was the delivery of first-pass hardware that was functional at speed. Explicitly, our goal was to ensure that the system would boot the operating system (VMS) and run software the first time the system was powered up. The only tools and processes developed were those specifically necessary to fulfill that goal.

The project team felt that objective simulation and verification of the hardware and its performance by the CAE/DVT Group would (1) enhance the chances of first-pass functionality, and (2) reduce the overall design cycle by paralleling the CAE and the design efforts. Consequently, the CAE engineers were active contributors to the architecture and participated in


choosing alternatives, effecting compromises, and implementing details of the design. The CAE Group was responsible for the correctness and quality of the designs and not just for the delivery of tools to accomplish that correctness. To achieve this goal, tasks traditionally performed using DVT methods would be accomplished using CAE methodology.

CAE Tasks

Given the charter described above, the CAE/DVT Group outlined the following tasks:

• Select a tool suite

• Create a process for the CAE effort

• Maintain the databases

• Construct a CAE environment (models and computes)

• Generate test cases to run against the environment

• Isolate and report bugs

• Verify the hardware

• Generate test vectors for outside vendors

• Generate test vectors for manufacturing

• Fault grade the test vectors

• Define exit criteria for committal of design to hardware

• Enforce compliance with exit criteria

Though the list is long and has some interesting tasks, two items constituted the largest portion of the work: generation of test cases to run against the environment, and verification of the hardware.

The generation of test cases is the most time-consuming, least glamorous, and most often overlooked task; yet the test cases are the single most important piece of a superior CAE effort. A successful specification of the test cases (the DVT specification) to be run against a CAE environment requires a lengthy period of development. The development time for the KA62A, MS62A, and DWMBA DVT specifications was approximately 6 man-months each. Moreover, the specification is not static and must be kept current with the evolving design.

The DVT specification must begin as early as specification of the hardware functionality begins. Working the two specifications in parallel ensures functional verification of the system. Further, the DVT specification should be treated with the same formality as the hardware specification; that is, it should be reviewed, and all reviewers must agree upon its completeness. By formalizing the specification review, project members are in effect establishing its value to the project. The DVT specification defines what is to be simulated; therefore, superior design tools and modeling cannot substitute for the assurance of design accuracy that the specification affords.

As to the verification of the hardware, the responsibility of the CAE team was to ensure bug-free and operable component, module, and system designs. Team members ran the simulations, isolated the bugs, and ensured designs were corrected by the design team. Simulations were not done exclusively by the CAE team, however. The environment was available equally to all design team members. To the extent that each team felt was appropriate, designers initially debugged their designs before passing them to the CAE team for more formal debug. In this way, obvious bugs were found more quickly. Design developers did excellent work in this regard and greatly eased the burden on the CAE team.

Further discussion of the VAX 6200 hardware verification is presented in the section Verification Milestones.

Modeling Approach

Hardware verification done in software is by nature a slow process. The major factor contributing to the slowness of the verification is the size of the design. The size is not simply the number of logic elements in the design, but the collective size of the models of each of the elements in the logic network.

We used two types of models for the VAX 6200 project, behavioral and structural (or gate level). Behavioral models, in general, were more abstract and efficient in terms of increasing overall simulation performance as compared to detailed structural models.

Behavioral models of many of the components used in the system were generated early in the design cycle. As the design progressed and detailed logic schematics became available, however, the behavioral models, in most cases, gave way to detailed structural models. The exception


was the behavioral models of the CVAX chip set. These detailed models were used throughout the verification process. Given the size and complexity of these components, simulation with structural models, for all practical purposes, was impossible.

In general, our objective throughout the verification process was to ensure accuracy and not speed. The slow speed of the more accurate models was addressed by applying more compute power to the task at hand.

CAE Staffing and Resources

The CAE group was divided into small teams, each responsible for the verification of a VAX 6200 subsystem. The size of the teams varied. The KA62A gate array and module team had four CAE engineers. Four CAE engineers worked on the DWMBA gate arrays and modules. The MS62A gate array and module was assigned one CAE engineer. As it turned out, these numbers represented nearly a one-to-one ratio with the hardware designers. As senior, experienced engineers, team project leaders were responsible for the overall coherence of the DVT plan and its quality, and were responsible as well for tracking and resolving problems.

Each team included a diagnostic engineer who was also working on design verification test. This arrangement provided the diagnostic engineers early training and also facilitated testing. Moreover, diagnostic engineers were in a position to easily evolve some of the DVT tests into self-tests and ROM-based diagnostics for the VAX 6200 product.

The educational background of the CAE team was a mix of electrical engineers, computer engineers, and software engineers. Their levels of experience varied from new college hires to those with 10 or 15 years of work experience. The level of relevant hardware experience in this group is indicative of the group's tasks, as compared with other CAE groups that are more involved in tools generation.

Our computer resources consisted of a cluster of eight CPUs, including one VAX 8800 system, one VAX 8650 system, and six VAX-11/780 systems. All four modules (KA62A, DWMBA/A, DWMBA/B, and MS62A) and their associated gate arrays were verified throughout most of the project on this cluster. During final regression testing of each module, in which the full set of DVT tests was run against the design, an additional cluster of eight VAX 8800 systems was used.

Verification Milestones

Key milestones were established for the verification team throughout the VAX 6200 design verification process. In December 1985, we began with the first XMI interconnect verification; we proceeded to performance evaluation, logic verification, timing verification, system verification, and vector generation.

These milestones were derived as part of our functional top-down verification approach. We selected this approach based on our determination that if a function works correctly, then all of its component logic must be working correctly.

We therefore chose to model our different design objects in the largest reasonable forms and then functionally test these models. Every step naturally led into the next task of the system design. This approach was later extended to the system as a whole; the system simulation combined the logic and ran code to exercise the entire system.

Architectural Verification

At the architectural level, the simulations focused on the verification of the new system interconnect, the XMI. As noted earlier, this interconnect, specifically developed for the VAX 6200 system, is a memory interconnect bus with a new arbitration scheme and a defined bus interface protocol. Both the arbitration and the protocol are implemented in CMOS semicustom technology.

Once the design for the bus protocol and arbitration was established in a specification form, we immediately transformed the specification into high-level behavioral models: the arbiter chip model, and an XMI commander transactor model. The behavioral arbiter model represented a generic, round-robin arbitration scheme; the commander model represented a generic XMI commander design. The commander model contained a flexible user interface to allow the specification of any desired well- or ill-formed transaction to be generated on the bus. Further, the commander transactor model was designed to selectively self-check for any protocol violations.

The two models were the basis for all XMI design verification. This first level of verification provided feedback to the architecture team quickly and answered questions about the interface protocol and arbitration scheme. As a result, the arbitration was enhanced and the protocol was refined to satisfy the design goals. Specifically, a few new signals were added, and the arbitration was changed from true round-robin to a modified round-robin.

At the next level of architectural verification, we modeled an XMI responder node and incorporated this model into the simulation environment. The team developed a behavioral XMI memory model and completed a high-level system model. This model was still totally behavioral and represented a system with generic XMI commanders and responders.

Two pieces of test code were generated and verified on that model. The environment modeled a fully loaded XMI interconnect. The first piece of test code was structured in such a way that every node on the XMI generated its own traffic. Commanders generated all possible commander sequences, and responders generated all possible responder sequences. The goal of this first test code was to ensure that the protocol was sound. By protocol soundness, we mean that commanders and responders can coexist on the XMI and can generate traffic sequences without loss of data. The results of this verification gave the team sufficient confidence in the protocol to allow the design of the XMI interface components to proceed.

The second piece of test code was verified on the same environment. Every commander generated the same sequence of traffic on the XMI. The goal of this test was to verify arbitration fairness and to guarantee that all XMI nodes got their fair share of the XMI. The absence of phenomena such as lockouts was also verified.

This architectural verification proved to be a tremendously valuable exercise. First, feedback to the architecture team was accomplished quickly. Second, this architectural verification for the VAX 6200 project established design verification tools that can be used for all future XMI designs.

In time, the behavioral model of the arbiter was replaced with a structural model derived from the chip design database. We enhanced the accuracy of the behavioral models of the XMI commander and memory by incorporating structural models of the XMI interface components once their gate-level designs were complete. These tools are now in use throughout the corporation by numerous XMI design teams.

Clearly, an architectural verification that concentrates on a new bus leaves out many other areas of architectural interest. A severe restriction of the scope of the VAX 6200 system's architectural verification was deemed necessary because of the lack of schedule time and because of the immaturity of the art. Nevertheless, architectural verification is a key area where much work should be done for the development of the next system.

Performance Evaluation

The next verification task was performance evaluation. Again, work was concentrated into two well-defined areas, that is, the bandwidth performance of the XMI, and the processor performance in the multiprocessing environment.

A model of the CVAX processor was obtained from the Semiconductor Engineering Group design team. We enhanced this model to include an XMI interface with a memory port. The stimulus for this model proved difficult to generate because multiprocessing benchmark traces were not available. The traffic patterns had to be deduced from single-stream benchmark traces and extrapolated for VAX 6200 symmetric multiprocessing.

We ran several benchmarks. We then used the results to make decisions about the appropriate trade-offs in the area of the processor cache and write buffer algorithms. These trade-offs dealt specifically with cache and write buffer depth versus performance gained.

Other tools were created to decompose XMI traffic into histograms and to generate reports on bus bandwidth for the different types of traffic. Eventually, XMI memory design latency targets were incorporated into the XMI behavioral memory model. These system performance simulations were used to establish such design criteria as the memory controller input command queue depth and the command queue processing algorithm.

Logic Verification

The next major task was logic verification. The main objective of module verification was to ensure that the implementation conformed to all design goals documented in the system specification. In other words, the goal was not to verify what the design was, but what the design was supposed to be.

Members of the CAE team were assigned to each design object; each member would work in


a team with the designers. The verification teams required complete and coherent specifications for each design object. These specifications had to be sufficiently complete to support both design implementation and logic verification. Moreover, all functions had to be documented in a specification. This documentation served two purposes: (1) to ensure that the function received the proper attention during the verification phase, and (2) to give the responsible CAE engineer the information needed to understand the functions without referring to logic schematics or meeting with the designer.

With the functional specification as the foundation, team members generated a verification working document for every design object. This DVT specification, as mentioned earlier, guided the verification work and constituted the primary hardware-submittal exit criteria.

The logic verification was grouped into three categories:

• Basic functional verification. Basic functional tests exercised each function as a standalone piece of the design. This testing isolated obvious bugs.

• Interaction sensitivities. Interaction sensitivity exercised the design as a whole, making sure that functions could interact with each other and could occur in series without cumulative fault mechanisms. Testing of function interaction included any boundary conditions, margin testing, and back-pressure on different key points in the design.

• Error handling. Error handling verification tested that portion of the design created specifically for error detection and recovery mechanisms.

Timing Verification

Timing verification was performed separately upon all key components in the VAX 6200 system. All of this work was performed by applying functional patterns to timing models for each of the module gate arrays and the XMI arbiter logic. This work was done using AUTODLY, an internal Digital tool.

The XMI components were tested first. Testing consisted of applying all possible XMI bus cycles against this logic while allowing the timing verifier to analyze the logic for any timing paths with problems. A number of problems were found and resolved as a result of this testing.

The timing verification of the module gate arrays was performed in a similar fashion. Patterns of functions were extracted from logic verification and then applied to the standalone chip timing models. As each pattern was applied, the timing verifier would run a complete check of the gate array and generate a list of violations. These violations would then be checked by the designer. If they were valid, logic changes would be made. The reason that just the gate arrays were verified, and not their complete modules, was that each module contained some logic for which no structural model existed (for example, the CVAX chip set on the KA62A module). The lack of a complete module-level timing verification model was rectified by requiring the module design team to thoroughly analyze its module. This approach was possible only because of the highly bus structured nature of our technology.

System Verification

Once every design object met its exit criteria and satisfied the specified testing, the next milestone was the start of system simulation. Our task was to verify the actual design in a system environment. We constructed a model consisting of multiple processors, memories, and I/O modules. This model contained structural representations of the actual designs wherever possible. Where there were multiples of a design object in the system simulation environment, one instantiated copy of the model would be the detailed (and slow to simulate) structural model; the other instantiations were the faster yet less accurate behavioral models.

In addition to actual design objects in this system model, we included different types of transactor and traffic generators on both the XMI and the VAXBI buses.

The stimulus for this environment had to be specific enough to ensure that every type of traffic pattern was generated during simulation. The stimulus attempted to stimulate every node and function concurrently. In a system simulation in which the simulation rate is so slow, as much as possible must be achieved in every single simulated clock tick.

Key to making a system simulation successful is to start the simulation only after the constituent pieces of the system have been very thoroughly


verified in isolation. Given the complexity of the system model and its slow running rate, finding simple design bugs at this stage is a waste of schedule time. Instead, the model should identify the system interaction problems and assure developers that the base logic verification was thorough.

Several logic problems were found during our system simulation dealing with complex interactions, some after a few microseconds of simulation. If undetected, these problems would have seriously impeded progress toward our goal of providing functional first-pass hardware.

Vector Generation

The start of system simulation takes place, by definition, near the end of the logic verification process. At about that time, we began to prepare for submittal of the designs for fabrication. Therefore, in parallel with system simulation, test pattern generation was started.

Test vectors were needed at this time, primarily to test chips coming off the fabrication line. Therefore we generated test vectors for the very large channel-less arrays contained on each of our modules. The basic criterion for approval was attainment of 99 percent internal node toggle coverage of the gate array logic. In addition to the 99 percent internal node toggle criterion, we also included the much more stringent criterion of 95 percent stuck-at coverage as measured by a fault grading mechanism. The methods used to determine coverage are discussed in the section Problem Reporting and Resolution.

The vectors were extracted from a strategic subset of our functional DVT simulation and graded on a hardware accelerator/fault evaluator.

We set a goal that the vector count should not exceed the chip's gate count; that is, a chip with 25K gates should have no more than 25K vectors to exercise its logic. The vectoring process, including extraction, grading, and complementing, took an average of one month per gate array.

As is true of architectural verification, vector generation is an area where work remains to be done. If we had been able to include some testability features in these very dense chips, we could have saved this month of schedule time.

Follow-through

Even beyond the prototyping phase, the simulation database was maintained and updated to reflect any changes in the design as a result of hardware debug. The purpose of this on-line soft representation of the design was twofold. First, the representation would aid in the isolation of any problems discovered in the lab. Second, the database could be used to investigate any suspicious problem areas that could not easily be triggered in the hardware.

Review and Reporting Methods

Throughout the design verification process, a means to ensure coverage was established for each phase. At the project outset, DVT specification coverage of functions was assured by several levels of team review. As the simulations progressed, the project leaders were given the responsibility of ensuring bugs were consistently reported and corrected. Vector extraction and grading of our gate arrays provided a strong measure of the completeness of the verification of these chips. Additionally, the internal controllers to the gate arrays were measured for complete state and product term coverage. Lastly, before being released for manufacture, the design was checked against our own exit criteria to ensure that the verification was complete.

This section presents details of these methods and tools for ensuring all functions were tested and verified.

Functional Coverage

The VAX 6200 project team chose the functional verification approach to verify all VAX 6200 designs. One problem with this approach is that there is no method of measuring functional coverage. Since all verification is based upon the DVT specification, functional coverage will be a reflection of the completeness of this document. Therefore, the DVT specification becomes the vehicle by which the functional coverage of the verification is to be measured. This specification must be made as comprehensive as possible. Therefore, the specification underwent many levels of review by a wide audience, including the entire design team.

Problem Reporting and Resolution

Another means used to ensure coverage was the problem-resolution and bug-reporting mechanism. Every design verification team project leader was responsible for tracking bugs in


the designs and ensuring these bugs were corrected and the correction was verified. Communication for this tracking was through VAX NOTES conferences.

For each design verification team, two conferences were created. The first was for bug reporting and bug-fix resolution. Only verification team members could write notes in the bug conference. Every entry indicated the date, model revision levels, test case number, failure symptom, and any assessment of the problem. Replies to each entry were entered, either by the project leader or the CAE team member responsible for the failing test, to indicate when the bug was verified as being fixed and the model revision levels at the time of verification. If the problem remained unresolved, the reply would indicate any action taken or patches made.

The NOTES conference review ensured that all bugs were given the proper attention and visibility.

The second conference was informational. Using this conference, engineers could learn about key aspects of the design as the verification progressed. For example, they could obtain information on undocumented features on which certain verification tests were based.

Fault Grading

Another process, which was implemented to measure functional coverage of the component patterns, was the fault-grading mechanism. In this approach, all component patterns for the large compacted arrays were generated at the functional level. The simulation environment for pattern capture was the same one used for functional verification. The stimulus generated was driven by high-level functions. The test patterns were captured at the chip's boundaries while the chip was being exercised on the module.

Traditionally, component patterns are generated by simulating the chip standalone and driving hand-crafted stimulus through the chip simulation. Due to test overlap, the approach taken by the VAX 6200 team did not ensure the optimum number of patterns for the maximum stuck-at coverage. However, the approach proved to be very beneficial. Ranging from 20K to 50K patterns for each gate array, the patterns were generated in the very short time of approximately one month. In reaching our goal of 95 percent fault coverage with these test patterns, additional areas of logic were found that had not previously been tested. This additional logic yielded additional bugs.

The fault grading process also provided an additional degree of confidence in the coverage of the functional verification test cases. The 95 percent fault coverage goal was achieved with patterns derived from a subset of those test cases. It should be mentioned that the hardware fault evaluator was used extensively during this phase of the project and proved to be an irreplaceable tool.

State Machine Coverage

Tools were developed that would analyze traces generated from the internal gate-array controllers and sequencers. Traces were collected while the functional tests were being simulated and verified. All traces were later analyzed, and coverage was ensured for every state and product term. This mechanism was put in place and automated, so that after each regression, coverage could be rechecked.

After every regression run of all test cases, the results were analyzed to ensure that no product terms or states were missed as a result of test modification or bug fix. Additional test cases were generated to find specific and hard-to-activate conditions.

Exit Criteria

Before a design is sent to manufacturing, the design must meet the exit criteria. These criteria are as follows:

• All the specified test cases have been generated and have run bug-free against the latest design.

• The system simulation has run bug-free for two continuous weeks.

In other words, if bugs still exist in the design, the design is not yet ready for manufacture.

As judged by the nearly bug-free condition of the implemented hardware, these design-completion criteria and coverage metrics were appropriate for the VAX 6200 development effort.

The VAX 6200 project's tremendous success has established the process for future systems verification and for engineering quality measurement.
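
The state and product-term check described under State Machine Coverage can be illustrated with a short sketch. This is not the project's tooling, which the article does not detail. The function name, the state names, and the trace format (one list of visited states per functional test) are all assumptions made for illustration.

```python
from typing import Iterable, Set


def coverage_gaps(expected: Set[str], traces: Iterable[Iterable[str]]) -> Set[str]:
    """Return the states (or product terms) that no trace ever exercised."""
    seen: Set[str] = set()
    for trace in traces:
        seen.update(trace)
    return expected - seen


# Hypothetical controller states; the names are illustrative only.
CONTROLLER_STATES = {"IDLE", "ARB", "XFER", "ERROR"}

# Hypothetical regression output: one trace of visited states per test.
regression_traces = [
    ["IDLE", "ARB", "XFER", "IDLE"],
    ["IDLE", "ARB", "IDLE"],
]

missed = coverage_gaps(CONTROLLER_STATES, regression_traces)
print(sorted(missed))  # ['ERROR']
```

Rerunning such a check after every regression flags any state or term that a test modification or bug fix has stopped exercising. In this hypothetical run, a test case targeting the error path would still be needed.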


Results Attained

The development cycle for the VAX 6200 system was quite short, and therefore the need to produce functional first-pass hardware was very strong. The first XMI specification was released in December 1985. Eight months later, all of the XMI parts had been designed and simulated and were being manufactured. Two months later, the parts were up and running.

During this time, specifications were released for the KA62A, MS62A, and DWMBA, and logic design was begun. Concurrently, test specification and test generation also began. In the late summer of 1986, all logic design was completed, and verification began. Two to three months after design completion, verification was completed for each module. With a complete and verified design, one month was used to generate all gate-array test vectors and then submit the gate arrays for manufacture.

In February 1987 - 14 months after the first complete XMI specification - the DWMBA was manufactured, powered on, and run with first-pass hardware. One month later, the KA62A was powered on and running. Two weeks later, with functional MS62As, the first VAX 6200 system was powered on. Two weeks after that, on April 1, the first VAX 6210 system booted VMS with all first-pass functional parts.

Although a few bugs were later to be found and fixed, the goal of using simulation to generate hardware that works at speed the first time was attained. In fact, many of those original parts are being shipped with the VAX 6200 systems today.

Opportunities for Improvement

Although our verification process proved to be quite successful, we plan to make a few changes in this process for future projects.

Architectural verification, in so far as that means an effort to discover system-level inadequacies or bottlenecks, is in its infancy. We consider this a wide open area where much can be accomplished.

As module designs call for increases in speed, timing verification and signal integrity verification will make a much larger contribution to the total verification effort. Although the XMI interconnect was verified to all circuit, signal, and timing specifications, signal integrity was not emphasized to the same degree in the modules themselves. Although no significant problems arose, we became strongly aware that future generations of hardware will be much more dependent on the type of verification used for the XMI. Although timing verification was performed on all gate arrays on the VAX 6200 system, this verification. In the future, we feel it is important to perform timing verification on the design during early development. Thus we can identify and solve the timing problems before they become too entrenched in the design to be fixed easily.

Since the wire delays for gate arrays can only be estimated until gate layout has taken place, all verification must be repeated once the actual timing numbers are returned. Additionally, floor planning of the gate array can have a significant effect on the performance and specific wire delays. On the VAX 6200 project, the layout and final wire delay calculations were performed by our gate array vendor and then sent back to us for reverification. These steps can take quite a long time in the design cycle of a gate array. To reduce the wait for real wire delays, we plan to perform all floor planning and preliminary layout operations at the design site. Additionally, this will allow us much more input to the floor plan and layout.

Summary

The success of the VAX 6200 verification effort can be attributed mainly to the decision to begin verification at the same time as the design and to continue verification and design as parallel efforts. This decision was implemented by assembling verification teams at the same time design teams were being built.

Verification was performed during each stage of development - from initial concept to system integration. The architectural verification confirmed the XMI architecture and arbitration algorithms. Performance verification helped define the processor and memory architectures and ensured that these architectures could take full advantage of the new XMI. The logic of all XMI modules, their gate arrays, and the XMI arbitration logic was verified against their specifications, not against the designs themselves. Lastly, the entire VAX 6200 system was simulated in a multiprocessing environment, proving that the different component modules could function together as a system. Verification from system architecture to gate arrays, modules, and then


back to a complete system again, throughout the life of the project was the only way to assure the main verification goal: first-pass, functional hardware.

During logic verification, attempts were made to perform verification using the smallest detail, while still keeping the scope of the logic under test large enough to allow system-level testing. By performing all testing at these much higher levels, a greater number of functions and more global functions can be tested at one time. The only drawback to testing at this level is simulation speed. The trade-off of speed for accuracy is a good one, for without accuracy the costly alternative is to design and manufacture multiple passes of hardware.

In conclusion, the most important outcome of our verification effort was a management philosophy that, in the end, verification is as important as logic design. With this understanding, verification criteria now determine when and whether designs are to be released for manufacture. To make this work successfully, the necessary resources must be allocated for the verification effort. Furthermore, project teams must develop and follow through with complete verification strategies. These strategies focus on verification as a part of the total design process rather than as a process that takes place after designs are complete. The VAX 6200 project was proof that this philosophy can be made to work.

References

1. B. Allison, "An Overview of the VAX 6200 Family of Systems," Digital Technical Journal (August 1988, this issue): 10-18.

2. T. Fox, P. Gronowski, A. Jain, B. Leary, and D. Miner, "The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor," Digital Technical Journal (August 1988, this issue): 95-108.

Rodney N. Gamache
Kathleen D. Morse

VMS Symmetric Multiprocessing

The symmetric multiprocessing features of VMS version 5.0 effectively utilize the greater computing power of Digital's multiple CPU systems. Key to the SMP design is an innovative mechanism, called a spinlock, that provides a high degree of parallelism for kernel-mode code. Where formerly VMS software used interrupt priority levels (IPLs) to synchronize processes, VMS now uses spinlocks. Because each VMS resource can be protected by a spinlock, this design provides more synchronization levels than could IPLs alone. Spinlock granularity directly affects system performance.

This paper describes the major features of symmetrical multiprocessing (SMP) in the VAX/VMS operating system. These enhancements are included in VAX/VMS version 5.0. Although it is impossible to present details of every aspect of the SMP design in these few pages, this paper provides an overview of the key mechanisms developed for VMS SMP.

Technology Developments

Over the last several years advances in computer technology, especially in VLSI, have yielded greater computing power in increasingly smaller packages. VLSI CPU chips have made possible multi-CPU, single-board computers. These multiple CPU systems are having an increasing impact on the general-purpose computing environment.

The net result is that recent technology trends have redirected the challenge of building multiprocessing systems from the hardware engineers to the systems software engineers. Systems software engineers must now design effective ways to utilize systems with six, eight, or even more CPUs.

VAX Hardware Features Required by the VMS Operating System

The VMS SMP design requires that certain fundamental features be implemented in VAX multiprocessing hardware. These features are as follows:

• The ability to share common memory among all CPUs in the system. This shared memory allows all CPUs to execute a single copy of the operating system and to share state information that provides load balancing capabilities.

• An interprocessor interrupt capability that enables one CPU to interrupt all other CPUs or a single CPU

• The set of interlocked instructions (BBSSI, BBCCI, ADAWI, INSQxI, and REMQxI), which are part of the VAX architecture and thus present in every VAX system

• Cache coherency maintained by the hardware, without software assistance

• One CPU, known as the primary CPU, that must have access to all I/O, console subsystem, and timekeeping hardware

With these hardware features, VMS can provide symmetric multiprocessing support for any VAX system. All code executing in user, supervisor, or executive mode can execute on any CPU without restriction. Most (if not all) kernel-mode code can execute on any CPU without restriction. The only restricted code is that small amount of kernel-mode code that requires access to the time-of-day internal processor register or to the console terminal and the console block storage device.

The SMP design has no requirement regarding the system topology or interconnect joining the multiple processors. It supports systems implemented by means of a single bus architecture, such as the VAXBI bus, as easily as systems that use a cross-bar connection.

Therefore, the VMS SMP design is flexible enough to support current VAX systems and


future VAX systems that take advantage of advancing technologies and architectures.

[Figure 1: VAX 6200 System Block Diagram. VAX 6200 CPUs, main memory, and the VAXBI I/O buses attach to the XMI. Key: XMI = system-to-memory interconnect; VAXBI = backplane interconnect (for I/O).]

New Multiprocessing Hardware

The design of recent VAX systems, such as the VAX 8800 and the VAX 6200 series of computers, offers an elegantly simple, symmetric hardware configuration. Central to the design of these systems are two new bus architectures: the XMI bus and the VAXBI bus (Figure 1). The VAXBI architecture provides a protocol that allows (1) multiple processors to issue device requests, and (2) operating system software to specify which processors a device controller will interrupt.

The symmetry of these I/O subsystems presented a new challenge to the VMS SMP designers: to provide an I/O database design that would make possible simultaneous execution of interrupt handlers, thus taking advantage of these new hardware features.

The Development of VMS SMP

Critical to SMP was a new method, used throughout the VMS kernel, to synchronize multiple processors. One possible SMP design would have been to create a single lock for kernel-mode operations and allow any processor to acquire that lock. However, the VMS engineers believed that such a design would not have provided sufficient parallelism to achieve good system throughput for systems with more than a few processors. This single-lock method would have been a nonscalable solution; if more CPUs were added to the system, system performance would not increase due to blocking for the single lock.

A more ambitious yet costly design was to provide a high degree of parallelism for kernel-mode code. With this kind of parallelism, many processors are allowed to execute different portions of the executive at the same time. For example, a process adding a system-wide logical name should be able to execute on one CPU while another CPU handles a device interrupt for completion of a disk I/O request, etc. This design would require creation of numerous locks and careful design of the interactions between the critical regions that use those locks. This design approach was the one finally chosen by the VMS engineers, and is discussed in the following sections.

Synchronization in VMS: Raising IPL, Mutexes, and Spinlocks

The original VMS version 1.0 design used two types of synchronization: (1) raising interrupt priority level (IPL) and (2) mutual exclusion semaphores (mutexes). The VAX architecture provides 31 IPLs; 1 through 15 are dedicated for use by software, and 16 through 31 are reserved for hardware. (IPL 0 is not really an IPL but rather the level at which user, supervisor, and executive mode programs execute.) VMS blocked different types of system events by raising IPL to or above the level at which that event occurred. For example, process rescheduling was done by means of an IPL 3 software interrupt. Code threads that modified a process's context always executed at IPL 3 (or higher) to prevent a reschedule. Another example is the manipulation of device controller registers. These registers were always manipulated at the device's hardware interrupt level; thus other system activity of a lesser importance was blocked out while the time-critical code path was executed.

The second synchronization method, mutexes, was used to lock purely software constructs, such as global section descriptors. Mutexes provided a mechanism for defining many locks without assigning a unique software IPL to each lock. A mutex was acquired by the operating system on behalf of a process and was considered "owned" by that process. Rescheduling could occur while a process "owned" a mutex; however, process deletion could not occur. Lock requests made by


a process of higher priority for an already owned mutex were handled by placing the requesting process into a wait state, thus avoiding deadlocks.

In a multiprocessing system, each VAX CPU has its own interrupt priority level, independent of the others. Thus raising IPL would synchronize on a single CPU but not across the entire system. IPLs, then, could not be used to synchronize all CPUs. Neither were mutexes a viable solution, since they could only be used within process context and at low IPLs. Therefore, the SMP team created a new VMS mechanism that they termed a "spinlock." Anywhere VMS code had previously synchronized by raising IPL, the code would now acquire a spinlock; wherever VMS code had lowered IPL, it would now release a spinlock. Use of mutexes remained unchanged save that the code to acquire and release mutexes was protected by a spinlock.

The design for spinlocks included a number of critical concepts. First, a spinlock is "owned" by a CPU, not by a process (as mutexes are). Second, each spinlock is acquired and released at a particular IPL that is associated with the spinlock. Raising IPL when a spinlock is acquired prevents other activities from interrupting time-critical code. Third, CPUs "spin-wait" when blocked from obtaining a spinlock resource held by another CPU, since spinlocks are only assigned to time-critical resources that cannot be locked for long periods of time. Lastly, the design of spinlocks includes a mechanism for deadlock prevention or detection since the debugging of "hung" systems is too costly. Therefore, each spinlock is assigned a rank. Because spinlocks must be acquired in order of rank, deadlocks are thus prevented. Further, a debugging aid was built into the spinlock design. A part of each spinlock data structure is set aside to hold the last eight program counters (PCs) that acquired or released each spinlock. When enabled, these consistency checks proved invaluable in determining interactions between different components in the VMS executive, such as memory management and scheduling.

The VMS engineers implemented routines for acquiring and releasing spinlocks rather than scatter in-line code through the VMS kernel. The first step in acquiring a spinlock is to synchronize the local processor by raising to the IPL of the spinlock, just as if it were a uniprocessor system. The actual locking of a spinlock is accomplished with an interlocked test-and-set memory operation, the BBSSI (Branch on Bit Set and Set Interlocked) instruction. The spinlock interlock bit is contained in a separate byte within the spinlock structure. Unlocking a spinlock is done with the inverse BBCCI (Branch on Bit Clear and Clear Interlocked) instruction. These interlocked operations are atomic memory transactions across all processors in a VAX multiprocessor configuration. Furthermore, since memory is common to all processors, the interlocked memory test-and-set operations provide a sufficient method of extending synchronization to all processors within a multiprocessor system.

The use of multiple IPLs as a synchronization method in VMS provides the capability to schedule events in a prioritized fashion. The inclusion of IPLs in the spinlock structure allows the SMP synchronization mechanism to appear as an added dimension to IPLs. Moreover, this SMP mechanism preserves the ability to schedule events in a prioritized manner.

For uniprocessor systems, the SMP design also includes the ability to optimize the routines that acquire and release spinlocks. For example, on a single CPU system, the spinlock acquire-and-release routines are never called. Instead, only a move-to-processor register (MTPR) instruction is executed, thus raising IPL. System performance of a single CPU has been measured as only a tiny percentage less than VMS version 4 performance.

Mutex synchronization is still the second synchronization method used in VMS. In the SMP design, mutexes are used for locks that are held for long periods of time and for situations in which the IPL has to be lowered. Mutexes are still owned by processes, not by CPUs, under the SMP design.

Spinlock Granularity, Devicelocks

One aspect of the SMP design that directly affects system performance is the granularity of the spinlocks. A coarse granularity (fewer spinlocks) is easy to implement and debug; however, a coarse granularity provides fewer synchronization points, and thus processors are blocked for longer periods. A finer granularity (more spinlocks) provides more parallelism and thus shorter blocking times; however, a fine granularity is much more complicated to design and implement, and requires more synchronization points. An important concept to remember is that, while the system is in a noncontending


situation, a synchronization point only adds unnecessary overhead. That is, if there is never any possibility of processors contending for the same resource, then synchronization is not required. Therefore, the SMP team decided that a manageable number of spinlocks for the initial design was no more than 32. The SMP design provides designers the ability to create a finer granularity of locks in future releases of VMS as performance measurements identify time-critical resources.

As the SMP development evolved, it became clear that a finer granularity of spinlocks for the I/O subsystem would be easy to implement. With multiple VAXBI buses, multiple CPUs could handle different device interrupts simultaneously. This further improved the parallelism of the system and resulted in a new characteristic for spinlocks: dynamic versus static spinlocks. A static spinlock protects those resources common to all VAX/VMS systems. Therefore, static spinlocks are assembled into the VMS source code. Dynamic spinlocks synchronize device-specific code and so are created at boot time, depending upon the I/O configuration of the particular VAX system. Thus the number of dynamic spinlocks varies from system to system, whereas the number of static spinlocks is consistent across all systems. The dynamic spinlocks used to lock particular devices were named "devicelocks" to differentiate them from static spinlocks. A devicelock is used wherever device-specific code previously raised IPL to a device's IPL to block interrupts.

Identifying Resources Requiring Spinlocks

One of the first SMP development tasks was to identify each VMS resource that needed synchronization and then determine the proper locking mechanism: spinlock, mutex, interlocked queue, etc. Once this work was complete, the added dimension provided by spinlocks allowed multiple resources to be protected by a single IPL. For example, IPL 8 (SYNCH) had protected the following resources: memory management, scheduling, the I/O database, the file system, and the timer queue. By adding a new dimension, namely spinlocks, each of these resources could be protected by a different spinlock but share the same IPL. Therefore, in a multiprocessor configuration, it was now possible to run more than one processor at the same IPL. However, the processors must be executing different critical regions of code. The spinlock design, therefore, has the advantage of providing more synchronization levels than could be provided by IPLs alone. Hence, the granularity of spinlocks can be much finer than that allowed by software IPLs alone. This finer granularity in turn provides more concurrency of execution in the VMS kernel.

For example, IPL SYNCH had protected a large number of resources and thus would be a good candidate for a finer granularity of spinlocks. Where VMS code had previously raised IPL to SYNCH, the SMP team had to determine which spinlocks had to be acquired and then perform the conversion.

In summary, IPL SYNCH became the following spinlocks:

FILSYS     File system structures (such as file control blocks)
IOLOCK8    Fork IPL 8 (map registers, data paths, and System Communication Services resources)
TIMER      Timer queue
MMG        Memory management, page description database, swapper, and modified page writer
JIB        Portions of the job information block
SCHED      Process control blocks, scheduling database, acquisition/release of mutexes

Per-CPU Context Areas and Interrupt Stacks

Another development task was to identify the context that had to be maintained for each processor, independent of the general system structures. This "per-CPU" context area had to include such items as identification of the current process, a unique CPU identification field, and CPU-specific work queues. In addition, design requirements specified that a processor be able to locate its private CPU context area with minimal overhead.

The easiest solution would have been to include an internal processor register (IPR) into which software could load the virtual address of the context area. Since IPRs are part of the processor hardware, each CPU could have pointed to its own context area without confusion. However, such a processor register did not exist in the


VAX architecture. Therefore another solution was needed in order for SMP to execute on existing VAX systems.

A creative alternative to inventing a new IPR was to find a method to use an existing IPR for multiple purposes. The VAX architecture includes an interrupt stack pointer (ISP) which software loads with the virtual address of the interrupt stack. Since each processor must have its own stack for handling interrupts, this area was already CPU-specific. Under the SMP design, the interrupt stack area and the CPU context area are treated as one virtually contiguous context block. When the virtual address of this new context area is rounded to an appropriate power of two, a simple clearing of the low order bits of the virtual address of the ISP yields the base address of the private CPU context area.

This solution provided two similar ways to find the private CPU context area:

    MFPR   #PR$_ISP,Rx
    BICL   #mask,Rx

or

    BICL3  #mask,SP,Rx      (when running on the interrupt stack)

Both methods return the virtual address of the private CPU context area. However, the latter case provides the faster mechanism.

Translation Buffer Invalidation - A Form of Cache Coherency

As was already mentioned, the VMS SMP design required that cache coherency be maintained in the hardware. However, the VAX architecture includes one hardware cache that is maintained by software, the translation buffer. The translation buffer caches page table entries (PTEs) to speed up address translation from virtual to physical memory addresses.

Software monitoring of the translation buffer is appropriate for two reasons. Since page table pages are only "virtually contiguous" and not "physically contiguous" portions of VAX main memory, monitoring changes to the PTEs would be difficult for hardware. Also, since modification of page table contents is usually an infrequent event, this cache is more suitably maintained by the software.

Therefore, as part of its monitoring function, the operating system software must notify the processor whenever it changes the contents of a PTE, in case the PTE is cached in the translation buffer. This notification is called a translation buffer invalidation request and is accomplished by a write to an IPR. Since PTEs can be cached on any processor in a multiprocessor system, one possible implementation would be for all CPUs to perform a translation buffer invalidation request when any PTE is changed. Since translation buffer invalidation must be carefully coordinated among all CPUs, however, this simple approach would have significantly affected system performance if left unmodified.

Two other features of VAX/VMS memory management play significant roles in the design for translation buffer invalidation in the SMP environment. First, a user-process address space cannot be executing on multiple processors simultaneously. Second, the cached user-process PTEs are invalidated when a LDPCTX (load process context) instruction is executed as part of process rescheduling.

Using these features, engineers optimized the design to require system-wide translation buffer invalidation only for system address space and not for user address space. Since system addresses change less frequently than user space addresses, this new design allowed for a major reduction in the interprocessor communication traffic.

Process Affinity

Certain operations in a multiprocessor system must execute on particular CPUs. The VMS SMP designers termed the binding of a process to a particular CPU as "process affinity." Affinity for a process is implemented by means of a 32-bit mask (one bit per CPU) in the process control block (PCB). Once a process is assigned affinity, the process may only execute on CPUs for which it has affinity. Process affinity is enforced by the VMS scheduler during a reschedule event. (Note that only for real-time priority processes does VMS SMP guarantee to run the N-highest priority processes on an N-processor system.)

The VMS SMP design currently implements two levels of process affinity: hard affinity and capabilities. Hard affinity forces selection of a single CPU in the affinity mask. This level of affinity is used when a process must be guaranteed execution on a particular CPU, which is specified by the CPU identification field in the PCB. Specifically, hard affinity is used to implement CPU diagnostics and to halt a CPU. When hard affinity is being enforced, the process affinity


mask is reduced to a single bit, which represents the one CPU on which the process may execute. The selection of hard affinity is a very static operation. The selection of which CPU to run on is determined prior to scheduling the process, and the selection remains enforced until otherwise requested.

Capabilities provide a logical mapping of processes to services. These services may only be available on certain CPUs in the SMP environment; for example, primariness is a logical capability. A capability may be serviced by one or more CPUs in the SMP environment. For example, primariness is a capability that is only offered by at most one CPU in the SMP environment.

When a process requires capabilities, the process indicates the desired capabilities in a 32-bit mask in the PCB. When the process is scheduled, a comparison is made of the current requested capabilities and the capabilities offered by the CPU being rescheduled. If the CPU has the required capabilities, then the process is executed; otherwise, the process is ignored and another process is chosen for execution. Any active CPU offering a particular capability may service any process requiring that capability. Once the capability is no longer required by a process, the capability bit in the PCB is cleared and the process can execute on any CPU in the multiprocessing system. Thus, capabilities offer a much more dynamic load-leveling of processes across the CPUs in the system than does hard affinity.

Device Affinity

The VMS SMP design requires that the primary CPU have access to all I/O devices on the system. Due to hardware asymmetry for certain devices in some existing multiprocessing systems, the VMS SMP design also had to include provisions for device affinity. For example, usually both devices in the console subsystem (the console terminal and the console block storage device) can only be accessed by the primary CPU. This is especially evident on 8300 systems, where a physical backplane cable connection from one of the VAXBI slots (usually slot 2, which contains the primary CPU) limits access to the console subsystem to the primary CPU.

Device affinity models the hardware asymmetry by allowing only a subset of the processors to access these I/O devices. Only the portions of VMS software that access the hardware itself (such as device driver routines that alter control and status registers) must execute on one of the CPUs in the device affinity set for that device. For example, most of the initial processing of a $QIO request can execute on any CPU. The driver code actually starts the I/O transfer by controlling the device by means of the control and status registers. Only this portion of the driver code must execute on a member of that device's affinity set.

Under the SMP design, all forking and postprocessing occur on the same CPU that received the device interrupt. The device affinity implementation uses a "trickle down" method that requires no affinity checks for any of the queues. Instead, fork threads are queued to the appropriate CPU in the first place. The SMP implementation queues the fork threads by replicating the I/O postprocessing queue and the fork queues for each CPU in the per-CPU context area. Thus each CPU can process its own fork and I/O postprocessing queue without acquiring the various spinlocks that would be required for system-wide queues. Further, under this design, the set of CPUs to which a particular device is bound under device affinity is a proper subset of the CPUs that can service interrupts for that device.

The affinity field for a device is stored as a bit mask in the unit control block (in the field UCB$L_AFFINITY). This bit mask represents those CPUs that are allowed to access the specified device. The default value for UCB$L_AFFINITY is -1, allowing access from any CPU to the device. As already mentioned, the console subsystem devices are accessible only from the primary CPU; therefore, the UCB$L_AFFINITY mask for these devices is initialized to the primary CPU only.

The affinity field for a device is checked on entry to only two of the seven driver entry points:

• STARTIO

• ALT_STARTIO

If the affinity check fails, the I/O request packet (IRP) is queued as a fork block to another CPU from which access is allowed. The fork block in the CDRP portion of the IRP is used to fork the request to another CPU. The fork block is queued to a work request queue in the selected CPU's per-CPU context area. An interprocessor interrupt is then delivered to notify the CPU that work is now present in its work request queue.
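The device affinity check at the start-I/O entry points can be pictured with a small model. The following Python sketch is only an illustration under stated assumptions, not VMS code: the names `Device`, `Cpu`, and `start_io`, the use of a Boolean flag in place of an interprocessor interrupt, and the policy of forwarding to the lowest-numbered allowed CPU are all hypothetical. Only the bit-mask test, the default of all bits set, and the queue-then-interrupt sequence come from the description above.

```python
from collections import deque
from dataclasses import dataclass, field

ALL_CPUS = 0xFFFFFFFF  # the default value -1, viewed as a 32-bit mask


@dataclass
class Device:
    name: str
    affinity: int = ALL_CPUS  # stands in for the UCB$L_AFFINITY field


@dataclass
class Cpu:
    cpu_id: int
    work_queue: deque = field(default_factory=deque)  # per-CPU work request queue
    interrupted: bool = False  # stands in for an interprocessor interrupt


def start_io(cpus, current, device, irp):
    """Return the id of the CPU that will start the I/O request."""
    if device.affinity & (1 << current.cpu_id):
        return current.cpu_id  # affinity check passed: start the I/O locally
    # Affinity check failed: forward the request to an allowed CPU.
    # Picking the first allowed CPU in the list is an assumed policy.
    for target in cpus:
        if device.affinity & (1 << target.cpu_id):
            target.work_queue.append((device.name, irp))
            target.interrupted = True  # notify the target that work is queued
            return target.cpu_id
    raise RuntimeError("no CPU is allowed to access " + device.name)
```

In this model a console device would be created with `affinity=1 << primary_id`, so a request issued on any other CPU ends up in the primary CPU's work queue, while a device with the default mask is always started on the requesting CPU.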


All other entry points into device drivers are serviced by the primary CPU, which must be guaranteed access to all devices. These entry points are normally called only during device initialization and include the following entry points:

• TIMEOUT

• UNIT INIT

• CONTROLLER INIT

• CLONED UCB

• UNIT DELIVERY

Process affinity is used to provide the device affinity requirements for the $CANCEL system service. When the $CANCEL request is serviced, the UCB$L_AFFINITY field may not allow access from the CPU on which the request was initiated. If access is not allowed, then the process affinity is changed to force the process to execute on a CPU compatible with the affinity requirements of the device.

Some VMS routines are always called when I/O completes on the same processor that serviced the device and fork level interrupt dispatching. Therefore, device affinity is implicit for these routines, and no affinity checks are made prior to calling the routines REGISTER DUMP and MOUNT VERIFICATION.

Future Investigations

The initial VMS SMP design is finished, but many interesting areas invite further investigation. These include

• Performance improvements, perhaps finer granularity spinlocks

• Enhancements for parallel processing

• Provisions for higher availability

The key to the VMS SMP design is the new synchronization primitives, that is, spinlocks. The flexibility of the spinlock design will be important in future enhancements to SMP, as already proven in the evolution from static to dynamic spinlocks.

Granularity is another important attribute of spinlocks, which are synchronization points. All synchronization points must be factored into the design of any multiprocessor system. Each spinlock represents at most a single thread of execution. Therefore, each section of code protected by a spinlock can be executed by only one processor at a time. If two processors attempt to access the same section of code (termed a critical region), then only one processor will proceed while the other(s) spin-waits. To restate Amdahl's Law: You cannot get more than one CPU's worth of work out of any synchronization point.

The ability to increase the number of spinlocks should prove invaluable in future enhancements to SMP, as performance measurements indicate which spinlocks need to change their granularity.
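The spinlock rules described in this paper (ownership by a CPU, an IPL associated with each lock, acquisition in rank order, and a record of the last eight callers) can be sketched as a small single-threaded model. This Python sketch is an illustration under stated assumptions, not the VMS implementation: the class and function names are hypothetical, a plain assignment stands in for the interlocked BBSSI and BBCCI instructions, "in order of rank" is assumed to mean strictly increasing rank numbers, locks are assumed to be released in reverse order of acquisition, and a failed attempt returns instead of spin-waiting.

```python
from collections import deque


class Spinlock:
    def __init__(self, name, ipl, rank):
        self.name = name
        self.ipl = ipl                  # IPL associated with this spinlock
        self.rank = rank                # assumed: acquire in increasing rank
        self.owner = None               # owning CPU id (a CPU, not a process)
        self.history = deque(maxlen=8)  # last eight acquire/release callers


class Cpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.ipl = 0
        self.held = []                  # stack of (lock, IPL before acquisition)


def try_acquire(cpu, lock, caller):
    """Make one attempt; a real CPU would spin-wait until the lock is free."""
    if cpu.held and lock.rank <= cpu.held[-1][0].rank:
        raise RuntimeError("rank violation acquiring " + lock.name)
    saved_ipl = cpu.ipl
    cpu.ipl = max(cpu.ipl, lock.ipl)    # step 1: synchronize the local processor
    if lock.owner is not None:          # step 2: the test-and-set would fail
        cpu.ipl = saved_ipl             # model only: give up instead of spinning
        return False
    lock.owner = cpu.cpu_id             # stands in for the interlocked BBSSI
    cpu.held.append((lock, saved_ipl))
    lock.history.append(caller)
    return True


def release(cpu, lock, caller):
    """Release the most recently acquired lock and restore the previous IPL."""
    if not cpu.held or cpu.held[-1][0] is not lock:
        raise RuntimeError(lock.name + " is not the most recently acquired lock")
    _, saved_ipl = cpu.held.pop()
    lock.owner = None                   # stands in for the interlocked BBCCI
    lock.history.append(caller)
    cpu.ipl = saved_ipl
```

Because every CPU in the model must take locks in the same rank order, two CPUs can never each hold a lock the other is waiting for, which is the deadlock-prevention argument made in the text. The `history` deque mirrors the debugging aid by keeping only the eight most recent callers.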

Bhagyam Moses
Karen DeGregory

Performance Evaluation of the VAX 6200 Systems

Performance evaluation is an essential element in the development of a computer system. An effort was made to accurately evaluate the performance of the VAX 6200 system under workloads that represent real customer environments. Workloads were developed to represent three major target markets: Engineering/Scientific, Commercial, and General Timesharing. These workloads were used to drive the VAX 6200 systems and thus to evaluate system performance in these environments. Performance measurement results indicate that the VAX 6200 system is a well-balanced multiprocessor system and that the multiprocessor performance is fairly linear across these workloads.

Introduction

The VAX 6240 system is a tightly coupled multiprocessor system based on the CVAX microprocessor. The system consists of four processors sharing memory through a single, high-speed bus. This paper describes the process by which performance of the VAX 6240 system was evaluated under various workloads that represent target markets. The method used to develop and verify these workloads is discussed along with the evaluation of system performance. We use the multiprocessor efficiency measure, defined as the relative throughput obtained by the addition of each processor, to characterize multiprocessor performance. Measurement of the VAX 6240 system indicates that the multiprocessor efficiency measure is directly dependent on the contention for shared resources generated by a workload.

Workload Development

One of the major issues in evaluating the performance of a computer system has been in the workload area. In the context of this paper, workloads are software tools used to create interactive multiuser environments in which the interactive throughput and responsiveness of the system are the key performance metrics. Conversely, benchmarks are either single or multiple copies of programs run in batch mode; the amount of time to complete execution of these programs is the performance metric. The question continually debated is how well the benchmarks and workloads represent current user environments. Since there are many different kinds of computing environments and both the applications and computing styles are continually changing, it is very difficult to develop representative workloads accurately. The approach taken here was to first survey the current customer population and identify a few major target markets. Table 1 consists of three surveys obtained from different sources, with n being the sample size.

Table 1  Survey of Customer Environments

Environment               Survey 1    Survey 2    Survey 3
                          n = 110     n = 200     n = 55K
Engineering/Scientific    46%         50%         31%
Commercial                40%         23%         35%
Education                 8%          15%         8%
Software Development      6%          12%         4%
Miscellaneous                                     11%

Table 2  Distribution of Customer Environments

Engineering/Scientific    40%
Commercial                40%
General Timesharing       20%

64 Digital Technical Journal No. 7 August 1988 CVAX-based Systems

[Figure: identical batch streams (B) running simultaneously on processors 1 through 4. Throughput = number of jobs completed with multiple processors as compared to one.]

Figure 1 Execution of Multiple Programs Run in Parallel

Clearly, Engineering/Scientific and Commercial environments dominate the market, with Education, Software Development, and General Timesharing applications accounting for the rest. Further examination of the Software Development and the Education environments showed much similarity in function, except that Software Development is slightly more compute intensive. Thus we further simplified the application categories, as shown in Table 2.

We identified typical environments in each of these categories by evaluating system resource consumption in these environments rather than by evaluating what an end user does on the system. Thus we could simplify the number of parameters to CPU, memory, and I/O resource utilizations. Having identified these typical environments, we collected or developed benchmarks and workloads to represent them.

Single Stream

Acquiring single stream benchmarks was not as difficult as developing multiuser workloads. Most of Digital's customers have benchmarks that represent their environments. Therefore, we acquired a collection of benchmarks to represent Engineering/Scientific, Commercial, and General Timesharing from various customer sites. These benchmarks are used to evaluate the single-processor speed.

Multistream Batch Jobs

A stream of well-known benchmarks was selected that represented each of the above-mentioned Engineering/Scientific, Commercial, and General Timesharing markets.

• The engineering stream consists of typical programs used in electrical circuit simulation, oil reservoir simulation, flight simulation, and linear equation solvers.

• The scientific stream contains simulation programs that use Monte Carlo techniques to track particle movement, along with commonly used routines from national laboratories.

• The commercial stream contains the activities done by a personnel department to support salary planning.

• The general timesharing stream represents the activities done in a software development or education environment.

Multiple copies of this stream were run simultaneously to take advantage of multiprocessor compute resources (Figure 1). To capture the maximum throughput, we ensured that all of the processors were 100 percent busy while the multiple streams were running on the system.

Multiuser Workload Development

The overall process of workload development is shown in Figure 2. Our goal was to represent typical timesharing environments for the different target markets. The entire strategy consisted of

• Identifying typical real sites

• Collecting data on resource utilization and image usage patterns

• Deriving a packaged workload to represent the real site environment

• Validating the workloads by comparing the resource utilization of the workload against the resource utilization at various customer sites and modifying the workloads as required


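[Figure: data flow from the real system to the standalone system. Resource utilization data, terminal activity data, user characteristics, and user mix collected on the real system feed the VAXRTE scripts; resource utilization data, user characteristics, and user mix are then compared on the standalone system.]

Figure 2 Interactive Multiuser Workload Development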

In the following sections, we describe how we used this strategy to develop two multiuser workloads: the engineering workload, which represents an Electronic Computer-Aided Engineering environment (ECAE); and the Software Development Environment Workload (SDEW).

Data Collection

Two Digital sites were chosen to represent the ECAE and SDEW environments. Internal sites were chosen initially to facilitate the data collection process. Both sites had clustered environments that consisted of a variety of VAX systems along with some workstations.

We collected information on these clustered systems to capture their behavior under the load generated by the environment over a period of one week. VAX SPM software was used to collect resource utilization data (CPU, I/O, and memory utilization) on all the systems at both user level and system level. VMS Image Accounting was used to obtain resource utilization data on an image basis. Using the SET HOST/LOG Digital Command Language (DCL) command, we collected log files of user sessions to study user habits. Other user characteristics, such as think time and type rates, were obtained through interviews and observations.

Data Analysis

The performance team studied the cluster-wide resource utilization profiles in order to select the time when the interactive activities were predominant. We compared resource utilization profiles of individual systems against the cluster-wide average over a week's accumulation of data. Based on this comparison, we selected a typical day and a typical system. One hour was chosen from the typical system on a typical day during the period of peak interactive use to characterize the system at full load.

Further, based on the user profiles, we classified users according to computer usage, that is, heavy or light computing (for ECAE workload) and heavy, medium, or light computing (for SDEW workload). We then used the image accounting data and user log files to classify users according to the type of activity they performed.

Once several user classes were identified, the number of users in each class, or user mix, was determined. We defined the user mix by looking at (1) the number of users in each class at the one-hour peak, and (2) the organization structure at the real sites. Table 3 shows the user mix for ECAE and SDEW workloads. In addition to interactive users, these workloads also have batch jobs running in the background.

Table 3 ECAE and SDEW User Mix

ECAE User Mix
Type of User                   No. of Users
Engineer: Heavy                3
Engineer: Light                3

SDEW User Mix
Type of User                   No. of Users
Heavy software development     1
Light software development     3
Secretary
Technical writer

Developing the Workload

Having identified the user classes and activities, we then developed an intermediate workload using DCL command procedures. This intermediate step allowed easier translation to the final workload, which was based on VAXRTE (VAX/VMS Remote Terminal Emulator) scripts. Individual user scripts were developed and validated. We then packaged the entire workload by integrating all of the user scripts and the batch jobs. Once development was complete, the workload was validated at both system and user levels against the real internal site. Further validation was done at the user level against Digital's customer sites.

Workload Validation

This section describes the workload validation process using the ECAE workload as an example of the validation methodology.

Validation against "real" internal site - The workload was tested using the same hardware configuration as the real system. For the ECAE workload, a VAX-11/780 system with 32 megabytes (MB) of memory, RA81 disks, and six interactive users was tested. The purpose of this test was to compare the resource utilization of the workload in an hour-long experiment to the resource utilization of the real system during the typical hour. System- and process-level resource utilization data of several different resources were compared.

User-level validation - To validate the workload at the user level, we compared the average CPU and direct I/O (DIO) utilizations computed for 1 hour for the different user classes. The results are shown in Table 4.

Table 4 User Resource Utilization for Real Internal System and ECAE Workload

              CPU minutes/hour      DIO/second
User Class    Real      ECAE        Real      ECAE
Heavy         1.6       1.5         0.3       0.4
Light         0.5       0.5         0.2       0.1
Batch         42.8      48.5        0.0       0.1

CPU utilization for all three user classes validated to within approximately 10 percent, which was considered to be well within acceptable limits. Validation of the DIO rate was made somewhat difficult because (1) the DIO rate on a per-user basis was very low (0.3 DIO per second for the heavy user), and (2) measurement of the DIO rate is only accurate to 0.1 DIO per second. For all three user classes, the workload came to within 0.1 DIO per second of the values measured from the real site.

System-level validation - For system-level validation, we compared the system-level usage of CPU, disk I/O, and memory for the 1-hour ECAE test experiment to the peak hour of the real system. Figure 3 shows that the CPU was used 100 percent of the time on the real system during the 1 hour; whereas the CPU utilization in the workload tended to vary slightly more, but was always between 90 percent and 100 percent saturated. The average CPU utilizations of the real system and the ECAE workload are very close at 100 percent and 93 percent, respectively.

The DIO utilization over a 1-hour period for the two systems is compared in Figure 4. For both systems there is significant variability in the DIO rate over the 1-hour period. The ECAE workload was slightly more bursty, but the average DIO rates for the real system and the ECAE workload were very close at 3.3 and 3.0 DIO operations per second, respectively.

Memory utilization on the two systems did not vary substantially over the 1-hour period. However, total average memory usage with the workload, 23MB, was less than on the real system, 29MB, as depicted in Figure 5.

Although the workload validated very well for CPU and DIO resource utilization, the workload used 20 percent less memory than was used at the real site. This was in part due to the fact that during the development of the workload the CPU and disk I/O utilization of subprocesses was added to the resource utilization of the parent process. Although the workload represents the work done by those subprocesses and the load placed on CPU and disk I/O resources, the workload does not represent the additional memory required by those subprocesses. As will be described in subsequent sections, the lower memory utilization of the workload did not constitute a problem.

[Figure: two panels, Real System (6-Nov-1986, 10:14 to 11:14) and ECAE (2-Feb-1987, 21:26 to 22:30), each plotting CPU utilization (percent) versus time of day in 300-second (5-minute) columns, broken down by mode: interrupt, kernel, executive, supervisor, and user.]

Figure 3 CPU Utilization for Real Internal System and ECAE Workload for 1 Hour

[Figure: two panels, Real System and ECAE, each plotting direct I/Os (rate per second) versus time of day in 300-second (5-minute) columns over the same two measurement periods.]

Figure 4 DIO Utilization for Real Internal System and ECAE Workload for 1 Hour

[Figure: two panels, Real System and ECAE, each plotting memory utilization (percent) versus time of day in 300-second (5-minute) columns, broken down into non-paged, system working set, user working set, and modified.]

Figure 5 Memory Utilization for Real Internal System and ECAE Workload for 1 Hour

A summary of the comparisons of the average resource utilizations for the real system and the workload is presented in Table 5.

Table 5 System-Level Resource Utilization for Real Internal System and ECAE Workload

Resource        ECAE      Real System
CPU busy        93%       100%
DIO/second      3.0       3.3
Memory          23MB      29MB

Validation against customer sites - This validation of the workload against the internal system was followed by validation against customer systems. The goal of this additional validation was to determine if the workload was representative of the load placed on systems by Digital's customers.

Two semiconductor manufacturers in California were used as validation sites for the ECAE workload. Initially, it was determined that there were significant differences between the work performed at these customer sites and the work performed at the internal Digital site. The Digital internal VAX systems were used for logic design of gate arrays, circuit boards, and systems; whereas at the external sites, the VAX systems were used for the design of integrated circuits. Specifically, the work differed in the following ways:

• DECSIM is used extensively within Digital, whereas SPICE is the predominant simulation software used by external semiconductor developers. DECSIM simulations require very large amounts of memory as compared to the SPICE simulations done by customers.

• Design rule checking is both a time-critical and disk I/O-intensive task done by semiconductor designers. Design rule checking and the load it places on the I/O subsystem were not executed at the internal Digital site at the time resource utilization data was collected.

As a result, we modified the ECAE workload to include the load placed on the system by design rule checking and replaced the use of DECSIM with SPICE.

System resource utilization data was collected on VAX 8800 systems for one week at these customer sites. In a manner very similar to the process used for the initial development of the workload, the data from these sites was reduced to a typical peak period. Table 6 presents the comparison of resource utilization on a per-user basis in the workload and at customer sites.

Table 6 Comparison of Resource Utilization on Customer System and in ECAE Workload

Resource Utilization per Hour    Customer Sites    ECAE Workload
CPU (minutes/hour)               3.8-4.5           5.0
DIO operations/second            1.4-2.3           1.8
Memory (MB)                      0.7-0.8           0.8

The ECAE workload falls within the range of utilizations observed at these customer sites for both disk and memory utilizations. The workload is slightly (approximately 10 percent) more CPU intensive on a per-user basis than was observed at customer sites. This workload will put a 10 percent heavier load on the system, making the performance numbers slightly conservative for the computer-aided electrical engineering market.

Performance Measurement and Analysis

This section discusses the performance of the systems in three major applications: Engineering/Scientific, Commercial, and General Timesharing. In each of the environments, single stream, multistream, batch, and multiuser workloads were tested.

Single-Stream Performance

The first step in evaluating the performance of a multiprocessor system is to establish the base-level performance of the uniprocessor relative to a well-known system such as the VAX-11/780. A large number of single-user benchmarks were used to establish this base level.

Single-User Performance

Single-user performance was evaluated by using traditional synthetic benchmarks, well-known industry standards, and real application programs from engineering, scientific, commercial, and general timesharing environments. Most of the synthetic benchmarks are in FORTRAN; industry standards are Whetstones, Dhrystones, Linpack, and others. The real applications, as mentioned, represent four environments.


[Figure: histogram of number of benchmarks (vertical axis, 0 to 35) against VAX 6200 performance relative to VAX-11/780 (VAX-11/780 system = 1), in bins 1.6-1.9, 1.9-2.2, 2.2-2.5, 2.5-2.8, 2.8-3.1, 3.1-3.4, 3.4-3.7, 3.7-4.0, and 4.0-4.3.]

Figure 6 Frequency Distribution of the VAX 6210 Performance on the Single-User Benchmark Set
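A frequency distribution like the one in Figure 6 can be built by expressing each benchmark result as a ratio to the VAX-11/780 and counting how many ratios land in each 0.3-wide bin. The sketch below is an illustration only. The helper names are assumptions, the half-open bin edges are an assumption about how boundary values are assigned, and the sample input is the set of seven summary ratios reported in Table 7 rather than the individual benchmark results, which the paper does not list.

```python
from collections import Counter

BIN_START_TENTHS = 16  # first bin starts at 1.6
BIN_WIDTH_TENTHS = 3   # each bin is 0.3 wide
BIN_COUNT = 9          # last bin is 4.0-4.3


def relative_performance(reference_seconds, target_seconds):
    """Performance of the target relative to the reference system (reference = 1)."""
    return reference_seconds / target_seconds


def bin_label(ratio):
    """Return the bin label for a ratio, or None if it falls outside the plotted range.

    Ratios are rounded to one decimal place and handled as integer tenths so that
    bin boundaries are not disturbed by floating-point error. A value on a boundary
    is counted in the higher bin (assumed convention).
    """
    tenths = round(ratio * 10)
    if tenths < BIN_START_TENTHS:
        return None
    index = (tenths - BIN_START_TENTHS) // BIN_WIDTH_TENTHS
    if index >= BIN_COUNT:
        return None
    low = BIN_START_TENTHS + index * BIN_WIDTH_TENTHS
    return f"{low / 10:.1f}-{(low + BIN_WIDTH_TENTHS) / 10:.1f}"


# Summary ratios from Table 7, used here only as sample input.
sample_ratios = [2.5, 2.3, 2.7, 3.2, 2.8, 2.8, 2.6]
distribution = Counter(bin_label(ratio) for ratio in sample_ratios)

for label, count in sorted(distribution.items()):
    print(label, "#" * count)
```

Working in integer tenths is a deliberate choice: repeatedly adding 0.3 to 1.6 in floating point does not always land exactly on the printed bin edges, which would make the assignment of boundary values unpredictable.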

These benchmarks were used to evaluate uniprocessor speed compared to a VAX-11/780 system. A frequency distribution of the speedup factors on all these benchmarks was plotted, and the central tendency was examined. (See Figure 6.) A high percentage of the benchmarks fell between 2.2 and 2.8.

Table 7 summarizes the performance of the VAX 6210 in the single-user environment relative to a VAX-11/780 system. The performance average of the VAX 6210 system, across all these benchmarks, is 2.8 times the performance of a VAX-11/780 system.

Table 7 Performance of the VAX 6210 in the Single-User Environment

Synthetic Benchmark Set:
  Single-user set       2.5
Industry-standard Benchmarks:
  Whet-s & -d           2.3
  Linpack-s             2.7
  Linpack-d             3.2
  Dhrystone             2.8
Real Application Benchmark Set:
  Engineering set       2.8
  Scientific set        2.6

Decomposed Single-user Performance

VAX 6200 performance on decomposed programs was evaluated through the use of manual and directed decomposition techniques. To begin with, a program is evaluated to see if some segments can be separated into parallel threads that can be run independently. Then the program is decomposed and run, either manually or through directives. The program is initiated as a single job; then the segments of the program that lend themselves to decomposition are divided into subprocesses and executed in parallel on different processors. In the manual decomposition method, the optimal number of subprocesses for various levels of multiprocessor systems is evaluated by varying the number of subprocesses and calculating the speedup factors. In the directive decomposition method, the compiler takes care of various optimization factors. These programs were run standalone with no interference from any other programs on the system. Figure 7 illustrates the decomposition process.

[Figure: one program decomposed into parallel code, with subprocesses 1 through 4 running on processors 1 through 4. Speed = time taken to complete the job.]

Figure 7 Program Decomposition Process

The benchmark description is as follows. To evaluate the maximum speedup factors that can be achieved through decomposition, code segments were selected. Such segments as matrix multiplication and convolution are widely used in engineering/scientific applications. Different array sizes (from 100 to 1000) were used with various arithmetic data types such as integer, and single and double precision.

An image processing program and the Linpack 1000D program were used to represent real application programs, where only certain segments can be decomposed.

The performance results are as follows. The multiprocessor efficiency measure, defined as the relative speedup obtained by the addition of each processor, is the key metric used here to evaluate performance. As seen in Figure 8, the multiprocessor efficiency measure on the program kernels is fairly linear. Multiprocessor synchronization is minimal in this computing environment. The performance was very close to the theoretical maximum. A speedup of 3.9 times the uniprocessor performance was achieved on the four-processor 6240 system. The performance on the image processing program is slightly lower than what was observed on the program kernels. Thus performance gained by decomposition depends directly on the amount of code that can be run in parallel. (Note: On the Linpack 1000D program, directed decomposition was used; whereas on the other programs, manual decomposition was used.)

[Figure: efficiency measure (vertical axis, 0 to 4.50) for the VAX 6220, VAX 6230, and VAX 6240, with one curve each for image processing, matrix multiplication, convolution, and Linpack.]

Figure 8 Multiprocessor Efficiency through Parallel Processing

Multistream Batch Performance Measurement and Analysis

The multistream jobs were used to measure the system-level batch performance on the multiprocessor systems. As shown in Figure 1, these multiple streams were run in parallel to allow concurrency in the execution of these streams. Maximum concurrency is achieved since each of these streams is identical. No single stream runs any faster; however, the number of jobs completed increases almost linearly with the addition of processors. Adequate memory was allocated to the jobs to avoid unnecessary paging and swapping. In addition, sufficient I/O resources were present on the system to preclude I/O bottlenecks. The elapsed time to complete these jobs was recorded and used to evaluate the multiprocessor batch throughput performance. It is important that all the streams run simultaneously and share resources equally. Large differences in the completion times of streams would imply that maximum concurrency was not achieved because of some bottleneck in the system.

Multiprocessor performance on multistream batch jobs was very close to linear across all environments. Results for the commercial stream, representing personnel administration, were only slightly lower - probably because of the higher amount of I/O on this stream. (See Figure 9.)

[Figure: efficiency measure (vertical axis, 0 to 4.50) for the VAX 6220, VAX 6230, and VAX 6240, with one curve each for the general timeshare, commercial, scientific, and engineering streams.]

Figure 9 Multistream Efficiency Measures

Interactive Multiuser Performance Measurement and Analysis

In the interactive multiuser environment, the system must support the activities of a substantially higher number of users and their frequent interaction with the system. The number of users on the system increases the amount of context switching, and the contention for shared resources is also much higher in this environment.

The test methodology included the use of a remote terminal emulator, VAXRTE, to create the interactive multiuser environment. The VAXRTE generated the input for the system under test and received consequent output. The VAXRTE also logged and time-stamped all interactions and maintained the job mix throughout the experiment. To run a multiuser experiment, the system under test and the VAXRTE system were booted and running. Using scripts, every few seconds the VAXRTE logged a user on to the system under test. After all logins were completed, sufficient time was allowed for the system to reach a steady state. The experiment was then run long enough to execute the longest script cycle for the specific workload. While the experiment was running, VMS monitor and other monitoring tools were used to capture the resource utilization data. When the experiment was completed, data was reduced and analyzed.

Workload description - Three interactive multiuser workloads were used to evaluate the multiprocessor performance in the three major environments: Engineering, Commercial, and General Timesharing.

The Engineering environment was represented by an ECAE workload. This workload consists of the types of tasks done by design engineers developing electronic circuits: circuit simulation, design rule checking, schematic file transfers from workstations, and tasks supported by VMS utilities.

The multiuser Commercial (Compu-Share) workload is based on the Compu-Share Order Processing software package. This workload consists of three major types of transactions: order entry, order inquiry, and accounts receivable reporting.

The General Timesharing SDEW represents the types of tasks done by software engineers. The major tasks executed in this workload are compile-link-execute-debug cycle using FORTRAN, BLISS, and MACRO; utilities used include CMS, RUNOFF, and text editors.

Hardware/software setup - Table 8 summarizes the hardware and software configurations.

Table 8 Summary of Hardware and Software Configurations

Hardware Configuration
Processor          VAX 6240
Memory             128MB
Disk controller    2 HSC70
Disks              (Disk configurations differed for each workload; see below.)

Number of RA82 Disks per Workload (columns: ECAE, Compu-Share, SDEW)
Dedicated Use
System
Page/swap
Library
Interactive        2, 2, 4
Batch              3, 2
Database           6
Note: Where necessary, software was distributed over multiple disks to avoid disk bottlenecks.

Software Configuration
VMS V5.0 - FT2.1 (A single-processor system was run with multiprocessing turned off.)
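The multiprocessor efficiency measure used throughout this section is a ratio of a multiprocessor result to the uniprocessor result, and the decomposition results tie the achievable ratio to the share of code that can run in parallel. The sketch below illustrates both ideas. It uses Amdahl's law as a simple model, which the paper does not name and which ignores synchronization and I/O contention, so the parallel fraction it derives from the reported 3.9 speedup should be read as an implied figure under that assumption and not as a measured one. All function names, and the job counts in the usage lines, are hypothetical.

```python
def efficiency_measure(uniprocessor_throughput, multiprocessor_throughput):
    """Throughput on N processors relative to throughput on one processor."""
    return multiprocessor_throughput / uniprocessor_throughput


def elapsed_time_speedup(uniprocessor_seconds, multiprocessor_seconds):
    """Speedup of a single decomposed program, computed from elapsed times."""
    return uniprocessor_seconds / multiprocessor_seconds


def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law (an assumed model, not the paper's)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def parallel_fraction_for(speedup, processors):
    """Invert Amdahl's law: the parallel fraction implied by an observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


# Reported result: a speedup of 3.9 on the four-processor system.
implied_fraction = parallel_fraction_for(3.9, 4)
print(f"Implied parallel fraction: {implied_fraction:.3f}")

# Hypothetical job counts, shown only to illustrate the throughput form of the measure.
print(f"Efficiency measure: {efficiency_measure(100, 380):.1f}")
```

Under this model, a speedup of 3.9 on four processors corresponds to roughly 99 percent of the work running in parallel, which is consistent with the observation that the program kernels came very close to the theoretical maximum while the partially decomposable programs fell slightly short.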


Performance metric - The multiprocessor effi­ 4 ciency measure is defined to be the relative multi­ UJ processor interactive throughput compared to � 3 (j) the uniprocessor throughput. 4: UJ Also considered in this metric is system respon­ � >- 2 siveness. based on acceptable service criteria for u z light, medium, and heavy tasks. This metric is w (3 u: used to eva luate the number of users supported u. UJ at peak throughput while the system maintains the service time criteria. System resources required to support each application are also VAX 6210 VAX 6220 VAX 6230 VAX 6240 identified . Performance results - The multiprocessor KEY: efficiency measure is very close to linear in both 0 ECAE the ECAE and SDEW environments (Figure 0) . I 0 COMPU-SHARE This result shows that even in the multiuser inter­ t> SDEW active environments near-linear performance can be expected if the system is well balanced in Figure 10 Multiprocessor Effi ciency Measure terms of processor speed and the memory-tO-pro­ fo r All Multiuser Workloads cessor bus speed . It also indicates the efficiency of the VMS SMP software. In the Compu-Share environment, the performance was slightly !ower At the peak throughput levels, response time because of the high amount of disk and terminal criteria were maintained in each workload. l/0 generated by this workload. The perfor­ Ta ble 9 compares users supported and resources mance of the multiprocessor systems under sym­ used by each of these workloads. The maximum metric multiprocessing (SMP) depends directly number of users supported on the VAX 6240 are on the amount of 1/0. It is important to note that 38, 120, and 126 users for ECAE , Compu-Share, even with high amounts of IjO, the multiproces­ and SDEW, respectively. sor efficiency measure is well over three for the In terms of resource utilization , it should be four-processor system. noted that the multiprocessor synchronization

Table 9 Summary of Workload Resource Utilizations

                                            Multiuser workload
                                            ECAE             Compu-Share      SDEW

Number of users supported at the peak
(VAX 6210, 6220, 6230, 6240)                10, 20, 28, 38   30, 60, -, 120   36, 66, 90, 126

Resource utilization (VAX 6240)
  Number of users                           38               120              126

  CPU - 6240
    Percent utilized                        100%             100%             100%
    Interrupt                               2%               6%               4%
    Kernel                                  12%              29%              20%
    Executive                               3%               7%               7%
    MP synch                                1%               7%               2%
    User                                    82%              51%              67%

  I/O
    Disk I/O profile                        Bursty           Uniform          Bursty
    Average disk I/O per second             24               113              68
    Average buffered I/O per second         82               112              76

  Memory
    Maximum used (MB)                       32               60               57
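As a quick consistency check on Table 9, the CPU mode percentages for each workload on the VAX 6240 can be summed and the share of time spent outside user mode derived. The following is a minimal sketch. The dictionary layout and function names are illustrative assumptions, not part of the original article; only the figures are taken from the table.

```python
# CPU mode percentages on the VAX 6240 at peak throughput, copied from Table 9.
# The dictionary layout and the names below are illustrative assumptions.
CPU_MODES = {
    "ECAE":        {"interrupt": 2, "kernel": 12, "executive": 3, "mp_synch": 1, "user": 82},
    "Compu-Share": {"interrupt": 6, "kernel": 29, "executive": 7, "mp_synch": 7, "user": 51},
    "SDEW":        {"interrupt": 4, "kernel": 20, "executive": 7, "mp_synch": 2, "user": 67},
}


def total_utilization(workload):
    """Sum of all CPU modes; should match the 100% 'Percent utilized' row."""
    return sum(CPU_MODES[workload].values())


def non_user_share(workload):
    """Percentage of CPU time spent outside user mode."""
    modes = CPU_MODES[workload]
    return sum(pct for mode, pct in modes.items() if mode != "user")


if __name__ == "__main__":
    for name in CPU_MODES:
        print(name, total_utilization(name), non_user_share(name))
```

Under these figures, Compu-Share spends roughly half of its CPU time outside user mode, which is consistent with the observation that it places the heaviest system load on the multiprocessor.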

Digital Technical Journal, No. 7, August 1988 - Performance Evaluation of the VAX 6200 Systems

under SMP is handled by spin locks. A spin lock is a bit in shared memory that all processes access by means of interlocked instructions, through mutual agreement. Mutual agreement implies that a process can set the bit and gain access to the scheduler database only if no other process has access to it. If a process tries to set the bit and the bit is already set, the process continues to "spin," executing a sequence of instructions that repeatedly checks whether the bit is clear. MP synch is the amount of CPU time spent waiting to change the bit, that is, to acquire the spin lock and thus gain access to the scheduler database. MP synch is 1 percent for ECAE, 7 percent for the Compu-Share workload, and 2 percent for the SDEW workload. Since MP synch is the CPU time spent waiting to acquire spin locks and indicates the number of spin lock collisions, it shows the level of contention for shared resources experienced by SMP under each workload. For the Compu-Share workload, this level is significantly higher. The Compu-Share workload generates the most disk I/O of the three workloads, which may be the reason for the greater amount of time this workload spends in MP synch.

The following three graphs, Figures 11, 12, and 13, present the CPU mode usage profiles. The Compu-Share workload shows higher but more uniform interrupt and kernel mode activities.

Figure 11 CPU Utilization over Time - ECAE Workload
Figure 12 CPU Utilization over Time - Compu-Share Workload
Figure 13 CPU Utilization over Time - SDEW Workload
(Each figure plots CPU utilization against time of experiment in minutes; key: idle, user, kernel, executive, interrupt, MP synch.)
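The spin lock mechanism described above can be sketched in a few lines. This is a toy model under stated assumptions, not the VMS implementation. Python has no interlocked instructions, so a `threading.Lock` stands in for the hardware interlock that makes the test-and-set atomic. The class and function names are hypothetical. The count of failed attempts is only a rough analogue of the MP synch time reported in Table 9.

```python
import threading
import time


class SpinLock:
    """Toy spin lock: a single shared bit changed only by test-and-set."""

    def __init__(self):
        self._bit = False
        # Stand-in for the hardware interlock; it makes test-and-set atomic.
        self._interlock = threading.Lock()

    def _test_and_set(self):
        """Atomically set the bit and return its previous value."""
        with self._interlock:
            previous = self._bit
            self._bit = True
            return previous

    def acquire(self):
        """Spin until the bit was clear when we set it; return failed attempts."""
        spins = 0
        while self._test_and_set():
            spins += 1
            time.sleep(0)  # let the holder run (a Python artifact, not part of the technique)
        return spins

    def release(self):
        """Clear the bit so that another thread can acquire the lock."""
        with self._interlock:
            self._bit = False


def run_demo(workers=4, iterations=200):
    """Have several threads update one shared counter under the spin lock."""
    lock = SpinLock()
    state = {"counter": 0, "spins": 0}

    def worker():
        for _ in range(iterations):
            spins = lock.acquire()
            state["counter"] += 1    # the protected shared data
            state["spins"] += spins  # rough analogue of MP synch time
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"], state["spins"]


if __name__ == "__main__":
    print(run_demo())
```

The counter always ends at `workers * iterations` because only one thread at a time can find the bit clear. The spin total varies from run to run, just as MP synch varies with the level of contention in each workload.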

Compu-Share's use of databases, which generates heavy I/O and local locking, is manifested in the heavy kernel and interrupt mode activity. SDEW does a fair amount of file manipulation. ECAE has much lower I/O activity than both Compu-Share and SDEW.

The next three graphs, in Figures 14, 15, and 16, compare the I/O profiles. The disk I/O on ECAE and SDEW is very bursty, and it is interesting to note that their relative CPU mode profiles correlate well, showing a relationship between the two. The I/O on Compu-Share is high but not as bursty. Comparing the disk I/O generated by the workloads and the effect it has on CPU utilization, Compu-Share puts the heaviest load on the multiprocessor system. However, even with all the synchronization necessary on this workload, the multiprocessor efficiency measure is fairly high (well over three). The ECAE and SDEW workloads show high multiprocessor efficiency measures of 3.8 and 3.9, respectively. This level of gain in the multiuser environment on the multiprocessor systems shows that VMS SMP is working efficiently and that the VAX 6240 system is a well-balanced system in terms of the processor and bus speeds.

Figure 14 Disk I/O Utilization over Time - ECAE Workload
Figure 15 Disk I/O Utilization over Time - Compu-Share Workload
Figure 16 Disk I/O Utilization over Time - SDEW Workload
(Each figure is plotted against time of experiment in minutes.)

Application Characteristics Affecting Multiprocessor Performance

This section discusses some of the characteristics in applications that directly affect multiprocessor performance.

Memory-to-Processor Traffic

Since these multiprocessor systems share memory, memory-to-processor traffic is a factor that affects multiprocessor efficiency. Therefore applications that generate lower memory-to-processor traffic do perform better, assuming there are no other bottlenecks in the system. One way to reduce this traffic is to organize the data to improve locality of reference. Data that is accessed together should be placed together.

Disk I/O Operations

With the symmetric multiprocessing software, I/O operations can be handled by each of the processors. As a result, I/O-intensive applications perform much better on the symmetric multiprocessor systems than on the asymmetric multiprocessing systems. However, the I/O device interrupts are still handled by the primary processor, even under SMP. By reducing the rate at which device interrupts are made, any contention for the primary processor can be reduced. To reduce the number of I/O interrupts, larger block transfers may be better in I/O-intensive applications. Thus, an application that will lend itself to making larger block transfers

Overview of the MicroVAX 3500/3600 Processor Module

With minimum bus cycle time down by more than a factor of two and dynamic random-access memory (RAM) access time remaining relatively constant, the opportunity arose to increase performance by using static RAM to add a cache. Static RAMs with 35-ns access times and 64-kilobit (Kb) densities could be used for this purpose at reasonable cost.

Design Partitioning and Functionality

To facilitate implementation of the processor module using custom VLSI, the design was partitioned into seven major parts: the central processing unit with first-level cache, the floating point unit, the second-level cache, the memory controller, the Q22-bus interface, the system support functions, and the clock circuitry. Each of these partitions was implemented by a single chip, with the exception of the second-level cache. This cache was implemented by programmable array logic (PAL) and static RAMs.

Five of the parts are connected directly to a 32-bit multiplexed address/data (CDAL) bus: the central processing unit with first-level cache (CVAX), floating point unit (CFPA), second-level cache, memory controller (CMCTL), and Q22-bus interface (CQBIC). To reduce loading, the chip containing the system support functions (SSC) connects to a buffered version of this bus, the BCDAL. The clock circuitry (CCLK) was separated from the processor chip to conserve pins as well as to allow designers more flexibility in choosing a clock rate.

To maximize performance, the CVAX, CFPA, second-level cache, and CMCTL operate synchronously from a four-phase clock generated by the CCLK. The SSC and CQBIC operate asynchronously on a 40-MHz oscillator. The processor module was designed to allow the CCLK to be fed either from the 40-MHz oscillator or from a separate oscillator. The separate oscillator allowed the central processor and memory subsystems to be sped up when it was determined that the CVAX, CFPA, and CMCTL chips were capable of running ten percent faster than originally projected.

Each of the major parts of the processor module is described in the following sections.

The Central Processing Unit and First-level Cache

The CVAX chip is a microcoded 32-bit VAX CPU. To implement the entire VAX architecture using a single chip, the CVAX designers selected a subset of the full VAX instruction set and data types. The implementation includes 175 instructions and six data types (also implemented by the MicroVAX II system), plus 6 additional string instructions: CMPC3, CMPC5, LOCC, SCANC, SPANC, and SKPC. The CVAX also provides microcode support for emulation of 53 additional instructions (six fewer than the MicroVAX II) and five data types. When any of these instructions is decoded, an emulated instruction exception is generated. This exception causes a set of instruction-specific parameters to be pushed on the stack and control to be passed to operating system emulation routines by the emulated instruction vector in the system control block. As in the MicroVAX II, the remaining 70 instructions and three data types are handled by the CFPA chip.

The CVAX implements the following registers:

• Sixteen 32-bit general-purpose registers

• Twelve VAX standard internal processor registers to support memory management, process control, interrupts, and system identification (SBR, SLR, MAPEN, TBIA, TBIS, TBCHK, PCBB, SCBB, IPL, SIRR, SISR, and SID) [1]

• Five internal processor registers specific to the CVAX to support the interval clock, first-level cache, error reporting, and console emulation (ICCS, CADR, MSER, SAVPC, and SAVPSL) [2]

The CVAX also provides a means for accessing six additional VAX standard internal processor registers to support the time-of-year clock, console serial line, and I/O bus (TODR, RXCS, RXDB, TXCS, TXDB, and IORESET). [1] These registers are implemented in the SSC.

The registers in the SSC are referred to as "external" internal processor registers and are accessed by software in the same manner as other internal processor registers, that is, by means of MTPR and MFPR instructions. However, the CVAX chip generates a special cycle on the CDAL bus with the register number as an address. The SSC responds to these cycles by either supplying the CVAX with the register contents (MFPR) or performing the register update (MTPR). Accesses to other unimplemented VAX internal processor registers will also cause these cycles to be generated, but the cycles will terminate with an error condition. (The cycles are timed out after four microseconds by a CDAL bus timer in the SSC.) When a register write is made to an unimplemented internal processor register, the CVAX ignores the error signal; the result is a long no-operation. When a register read of an unimplemented internal processor register is attempted, the results are undefined.

Also like the MicroVAX II system, the CVAX processor implements a memory management unit. The unit supports full VAX demand-paged virtual memory, with single-level page tables for system space addresses and double-level page tables for process space addresses. In addition, four levels of access protection are supported by the memory management unit. A 28-entry, fully associative address translation buffer is provided for storing recent virtual-to-physical address translations (as opposed to an 8-entry translation buffer in the MicroVAX II).

Unlike the MicroVAX II system, the CVAX includes an on-chip (first-level), physical instruction and data cache. Because chip area was at a premium, a 1KB, two-way set associative organization was chosen. In contrast to the second-level cache, this organization achieves a high hit rate for the available chip area through increased control logic complexity instead of increased storage array size. The extra control logic complexity of the first-level cache is more efficiently implemented in custom VLSI, whereas the large storage arrays of the second-level cache are more efficiently implemented with off-the-shelf parts. Since the first-level cache organization yields a set size equal to the memory page size, cache look-up and virtual-to-physical address translation can be overlapped. Thus a cache cycle time equal to the processor microcycle time is achieved.

The first-level cache is look-through; that is, cache hits on read cycles result in no activity on the CDAL bus, thus preserving its bandwidth for DMA transfers. The block size is one quadword, so cache misses on cacheable read cycles cause the CVAX to generate a quadword transfer on the CDAL bus. This transfer results in two longwords of data being returned in response to a single address. The minimum transfer time is two microcycles for the first longword and one for the second, which increases the effective CDAL bus bandwidth. Further, the first-level cache is write-through. However, to improve performance, the CVAX also contains a longword write buffer, which allows the CPU to execute out of the first-level cache while the write operation is being completed.

The Floating Point Accelerator

The CFPA chip works in conjunction with the CVAX chip to process floating point instructions and to accelerate the execution of some integer instructions (MULL, DIVL, and EMUL). The CVAX decodes the instructions and sends the CFPA control and opcode information by means of a dedicated eight-line control bus. The CFPA gets its operands from the CDAL bus. Unlike the MicroVAX II, not all operands have to come from the CPU. Operands come from the CVAX only if they reside in the general-purpose registers or first-level cache. If the operands reside in the second-level cache or main memory, the CFPA takes them directly off the CDAL bus. When the CFPA has completed the operation, it returns condition codes and exception status by means of the control bus, and the unaligned result by the CDAL bus. One, two, or three longword transfers may be required to transfer the result, depending on the type of operation. The CVAX aligns and sends the result to its ultimate destination. To improve DMA latency, the CVAX will grant CDAL bus requests while waiting for the CFPA to return the result.

The Second-level Cache

The second-level cache sits directly on the CDAL bus and bridges the 4-microcycle gap in access time between the first-level cache and main memory. The project goal for the second-level cache was to maximize system performance without placing the schedule at risk. Consequently, designers chose to use large storage arrays to achieve the desired level of performance (hit rate) rather than complex control logic. By keeping the control logic simple, the cache could be implemented in PALs rather than custom VLSI. Thus the chance of design errors was reduced, as was the time needed to correct any errors found during design qualification.

The large storage arrays were easily implemented using off-the-shelf static RAMs. The resulting design was a 64KB, direct-mapped, physical instruction and data cache with write-through. The implementation called for six PALs for control logic, eight 16K-by-4 static RAMs and four 16K-by-1 static RAMs for the data store, and three 16K-by-4 static RAMs for the tag store.

In keeping with the philosophy of simple control logic, the second-level cache is look-aside; that is, address decoding occurs in parallel in the cache controller and the memory controller.
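The cache geometry quoted above can be checked with a little arithmetic. The sketch below is illustrative only and its function names are hypothetical. It relies on one fact that the passage does not quote, the 512-byte VAX page size. It confirms that one way of the 1KB two-way first-level cache spans exactly one page, which is what lets cache look-up use untranslated address bits and overlap with translation. It also confirms that eight 16K-by-4 RAMs give a 64KB data store. The purpose of the four 16K-by-1 RAMs is not stated in the passage, although one bit per byte would be consistent with the byte parity mentioned in the Reliability section.

```python
K = 1024
VAX_PAGE_BYTES = 512  # VAX page size; an architectural constant, not quoted in the passage


def way_bytes(cache_bytes, ways):
    """Bytes covered by one way of the cache (what the article calls the set size)."""
    return cache_bytes // ways


def index_fits_in_page_offset(cache_bytes, ways, page_bytes=VAX_PAGE_BYTES):
    """True if the cache can be indexed using only untranslated page-offset bits."""
    return way_bytes(cache_bytes, ways) <= page_bytes


def ram_bits_per_entry(chips, width_bits):
    """Bits available at each RAM address when identical chips sit side by side."""
    return chips * width_bits


def ram_bytes(chips, depth, width_bits):
    """Total capacity in bytes of a bank of identical RAM chips."""
    return chips * depth * width_bits // 8


# First-level cache: 1KB, two-way set associative.
FIRST_LEVEL_WAY = way_bytes(1 * K, 2)

# Second-level cache RAM complement as listed in the text.
L2_DATA_BYTES = ram_bytes(8, 16 * K, 4)    # eight 16K-by-4 RAMs
L2_DATA_WIDTH = ram_bits_per_entry(8, 4)   # one longword per address
L2_EXTRA_WIDTH = ram_bits_per_entry(4, 1)  # four 16K-by-1 RAMs (purpose assumed, see above)
L2_TAG_WIDTH = ram_bits_per_entry(3, 4)    # three 16K-by-4 RAMs for the tag store

if __name__ == "__main__":
    print(FIRST_LEVEL_WAY, L2_DATA_BYTES, L2_DATA_WIDTH, L2_EXTRA_WIDTH, L2_TAG_WIDTH)
```

By the same test, a 64KB direct-mapped cache cannot be indexed from page-offset bits alone, which fits the description of the second-level cache as a physical cache that sits on the CDAL bus after translation.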


Therefore, the cache does not have to regenerate CDAL bus cycles in the event of a cache miss. The second-level cache control logic simply

• Watches the CDAL bus cycles

• Returns data to the CVAX on cacheable read cycles that miss the first-level cache but hit the second-level cache

• Allocates a block on cacheable quadword CVAX read cycles that miss both caches

• Updates an entry on CVAX write cycles that hit

sequential longwords in less time. Both the CVAX and the CQBIC utilize this feature to improve performance. The CVAX generates quadword transfers to fill cache blocks on a cache miss, and the CQBIC generates quadword, hexaword, or octaword transfers on block-mode DMA by devices on the Q22-bus. The combination of multiword transfers and the look-through first-level cache made the added complexity of dual ports (as used in the MicroVAX II) unnecessary. To work effectively with the look-aside second-level cache, the CMCTL must m


DMA write references are buffered in two naturally aligned octaword buffers and transferred to main memory by the most efficient combination of multiword transfers. The two octaword buffers allow an entire block-mode transfer (up to 16 words) to be buffered by the CQBIC. After the first buffer has been filled by the Q22-bus device, it is emptied into main memory while the Q22-bus device fills the second buffer. Since the CDAL bus is faster than the Q22-bus, the first buffer is emptied and ready for input from the Q22-bus device before the second buffer has been filled. This arrangement allows the interface to provide sustained throughput at maximum Q22-bus transfer rates with no additional latency.

Q22-bus block-mode DMA read references are translated into quadword transfers on the CDAL bus. The four words are buffered in a single quadword buffer and supplied to the DMA device on demand. Before the buffer is emptied, the next quadword is prefetched. This prefetch eliminates additional latency on all but the first transfer. To keep the latency of the first transfer at a minimum, the CQBIC responds to the DMA device after receiving the first longword of a quadword CDAL bus cycle, rather than waiting for the entire quadword transfer to complete.

To fit the entire Q22-bus interface in a single chip, some changes had to be made to the bus interface architecture of the MicroVAX II system. On the MicroVAX II, the scatter-gather map was stored in a dedicated 32KB static RAM array within the bus interface. On the CQBIC, not enough space was available to implement this storage array internal to the chip. Moreover, not enough pins were available to provide a dedicated bus to an external static RAM array. The solution was to store the scatter-gather map in a 32KB block of main memory and to implement a 16-entry fully associative cache for map entries in the CQBIC. The cache functions in the same manner as an address translation buffer. When translating a Q22-bus address, the cache is checked for the appropriate map entry. If the entry is found, the translation takes place at maximum speed. If the entry is not found, then there is a delay while the entry is fetched from main memory. The translation is then performed. This delay is eliminated on DMA transfers that cross a page boundary, because the entry that maps the next page is prefetched when the DMA operation reaches a page boundary. On most DMA transfers, this delay is negligible because it is amortized over a large number of Q22-bus transfers. The design ensures that the operating system does not attempt to use the block of memory where the scatter-gather map resides. The on-board firmware does not include these pages in a list of good memory pages that is passed to the operating system at boot time. An interesting side effect of putting the scatter-gather map in main memory was that the relatively long latency on some Q22-bus DMA cycles uncovered latent design bugs in several Q22-bus DMA devices. The designs of these devices had been verified by empirical testing with existing processors rather than by testing to the Q22-bus specification.

To maintain software compatibility with the MicroVAX II system, the scatter-gather map is referenced through a 32KB block of I/O space addresses. The CQBIC responds to writes in this address range by buffering the data so the CVAX cycle can complete, updating the cache if there is a hit, requesting the CDAL bus, and updating the entry in main memory. If any DMA operations are pending, they are completed before the CQBIC gives up the CDAL bus. This prevents multiple successive map updates by the CPU from locking out DMA activity long enough to cause Q22-bus devices to time out (in 10 microseconds).

On reads to this address range that miss the cache, the CQBIC has to latch the address and force the CVAX to retry the cycle. In this way, the CQBIC can acquire the CDAL bus to fetch the entry from main memory. When the CQBIC relinquishes the CDAL bus, the CVAX retries the cycle, and the CQBIC provides the processor with the requested map entry. This retry mechanism is also used to implement the interlocked instructions in the VAX instruction set.

On all interlocked instructions, the CVAX generates one or more sequences of a read-lock cycle followed immediately by a write-unlock cycle. The CVAX identifies these special locked cycles by placing a unique code on the parity lines at address time. The CQBIC recognizes the read-lock code and forces the CVAX to retry until the CQBIC can become master of the Q22-bus. Once the CQBIC has mastership of the Q22-bus, memory is effectively locked and the cycle proceeds. The CQBIC releases the Q22-bus (unlocking memory) on the next CVAX bus transaction even if it is not a write-unlock cycle. This release prevents memory from staying locked if the CVAX has to abort the instruction due to an error encountered on the read-lock cycle.

Like the MicroVAX II Q22-bus interface, the CQBIC gives the CPU the highest rather than the


lowest priority when arbitrating the Q22-bus. This priority assignment reduces interrupt latency, since the processor is delayed for a maximum of one DMA transaction before being granted the bus to acknowledge the interrupt. Because the CPU accesses memory over a dedicated interconnect rather than through the Q22-bus, CPU references to the Q22-bus are very infrequent. Therefore this priority scheme does not have a negative impact on DMA performance.

To support a range of CVAX microcycle times and fixed Q22-bus timing, the CQBIC was designed to run at a fixed clock rate, asynchronously to the CPU/memory subsystem. This design made it easier for engineers to optimize performance of the slower asynchronous Q22-bus (where bandwidth is at a premium). These optimizations are made at the expense of lower performance on the faster CDAL bus (where there is extra bandwidth) due to synchronization delays. [7]

System Support Functions

The SSC contains all those functions required to support the on-board firmware, the time-of-year clock, and the console serial line. The chip provides the logic necessary to interface the two 64KB read-only memories (ROMs) containing the firmware with the BCDAL bus. Since the ROMs are organized as a 64K by 16-bit array, the SSC must generate two ROM cycles to satisfy each 32-bit CDAL bus cycle. This ROM unpacking function saves board space as well as the costs related to a 32-bit-wide ROM array.

The SSC assists in the firmware emulation of a VAX console processor by providing two address spaces through which the ROM may be accessed - the halt-mode ROM space and the run-mode ROM space. Any I-stream read from the halt-mode ROM space protects the processor from external halt conditions and extinguishes the front panel run light. Any I-stream read outside the halt-mode ROM space, including reads from the run-mode ROM space, enables external halt conditions. Under this condition, the front panel run light is illuminated. The firmware is organized so that console emulation code is executed from the halt-mode ROM space, and diagnostics and boot code are executed out of the run-mode ROM space. The SSC also provides the firmware with 1KB of battery-backed-up RAM for storage of data structures and stack space, and a register for controlling four diagnostic LEDs.

The SSC also contains a VAX standard console serial line and a VAX standard battery-backed-up time-of-year clock. (The VAX standard serial line replaces the serial line chip used as the console on the MicroVAX II. The clock replaces the off-the-shelf clock chip.) Since the console control/status registers (RXCS and TXCS), console data buffers (RXDB and TXDB), and the time-of-year clock (TODR) are VAX internal processor registers, they are accessed by means of special CDAL bus cycles as described in the section The Central Processing Unit and First-level Cache. [1]

To save board space and cost, the SSC provides two programmable address strobes for decoding additional board-level registers. These address strobes decode the second-level cache control register (CACR) and the MicroVAX II-compatible boot and diagnostic register (BDR). [2]

To prevent the processor from "hanging" on unanswered CDAL bus cycles, the SSC provides a programmable watchdog timer for the CDAL bus. The timer starts at the beginning of a CDAL bus cycle. If the timer expires before the cycle completes, the SSC asserts the error line, causing the CQBIC or CVAX to abort the cycle. This timer could not be used for all CDAL bus cycles. To do so, the timer would have to be set to a value greater than the Q22-bus timeout value (10 microseconds) so that CPU accesses to the Q22-bus would not be timed out prematurely. Moreover, the timer would have to be set to a value much less than the Q22-bus timeout value so that unanswered CDAL bus cycles would not cause Q22-bus timeouts during DMA. Since the CQBIC contains a 10-microsecond Q22-bus watchdog timer, the CDAL bus timer was set to 2 microseconds (greater than the longest CDAL bus cycle) and disabled on all Q22-bus references.

To support a range of CVAX microcycle times, the SSC was designed to run at a fixed clock rate, asynchronously to the CPU/memory subsystem. Since the performance of the functions in the SSC was not critical, the performance impact was not a concern.

Hardware Interrupts

The interrupt logic is spread among three chips: CVAX, SSC, and CQBIC. The CVAX provides four interrupt request pins that correspond to standard VAX hardware interrupt request levels 14 through 17. The CVAX does not provide an interrupt-acknowledge pin. The CVAX acknowledges interrupts when the processor's priority level is below the interrupt level by generating an interrupt-acknowledge cycle on the CDAL bus.
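Returning briefly to the CDAL bus watchdog timer described under System Support Functions, the two conflicting constraints and their resolution can be modeled in a few lines. This is a minimal sketch, not the SSC logic. The function name, its arguments, and the returned strings are hypothetical. Only the 2-microsecond and 10-microsecond values and the rule that the CDAL timer is disabled on Q22-bus references come from the text.

```python
Q22_TIMEOUT_US = 10.0  # Q22-bus watchdog timer in the CQBIC
CDAL_TIMEOUT_US = 2.0  # CDAL bus watchdog timer setting in the SSC


def cycle_outcome(duration_us, q22_reference):
    """Classify a CPU bus cycle by which watchdog, if any, would end it.

    duration_us   -- time the cycle would need to complete if never aborted
                     (use float("inf") for an unanswered cycle)
    q22_reference -- True if the cycle references the Q22-bus
    """
    if q22_reference:
        # The CDAL timer is disabled here; only the Q22-bus watchdog applies.
        return "q22-timeout" if duration_us > Q22_TIMEOUT_US else "completes"
    return "cdal-timeout" if duration_us > CDAL_TIMEOUT_US else "completes"


if __name__ == "__main__":
    print(cycle_outcome(1.0, False))           # ordinary CDAL cycle
    print(cycle_outcome(float("inf"), False))  # unanswered CDAL cycle
    print(cycle_outcome(5.0, True))            # slow but valid Q22-bus access
    print(cycle_outcome(float("inf"), True))   # unanswered Q22-bus access
```

A single timer value could not serve both cases. The 5-microsecond Q22-bus access in the example would be cut short by a 2-microsecond timer, while a timer longer than 10 microseconds would let unanswered CDAL cycles hold the bus long enough to cause Q22-bus timeouts during DMA.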


The "address" used is the level of the interrupt request being serviced. The data read is the offset of the vector within the system control block.

The SSC contains the interrupt-acknowledge pin. The SSC responds to interrupt-acknowledge cycles whenever it has an interrupt pending at the level being acknowledged. If the SSC does not have an interrupt pending at that level, it asserts the interrupt-acknowledge signal. The CQBIC passes interrupt-acknowledge cycles on to the Q22-bus only when the SSC asserts the interrupt-acknowledge signal. This interrupt-acknowledge scheme saves a CVAX pin, at the expense of requiring the devices in the SSC to have the highest interrupt priority at their level (IRQ 14).

The CQBIC uses all four CVAX interrupt request lines to support the four Q22-bus interrupt request levels. (BR4 through BR7 are connected to the pins corresponding to IRQ levels 14 through 17.) Since the Q22-bus has only one interrupt-acknowledge line, it is possible for a level 7 (17) device to steal an interrupt-acknowledge cycle intended for a level 4 (14) device. (This "steal" can occur if the level 7 device is closer to the processor and posts an interrupt after the level 4 interrupt was acknowledged but before the acknowledgment reached it.) To prevent this situation from causing a level 7 (17) device driver to run at a lower IPL, the CQBIC sets a bit that is returned along with the vector offset. This bit causes the CVAX to set the processor IPL to 17 before passing control to the driver. If the bit is not set, the processor IPL is set to the level at which the interrupt request was received. The CQBIC also adds an offset of 200 (hex) to the vector returned by the Q22-bus device so there is no conflict with existing VAX system control block entries.

Performance Relative to the MicroVAX II Processor Module

The reduction in gate delays due to the new chip technology allowed the processor microcycle time to be reduced to 90 ns (versus 200 ns for MicroVAX II) and the minimum bus cycle time to be reduced to 180 ns (versus 400 ns for MicroVAX II). The increase in the number of transistors made available by the new technology allowed the following architectural mechanisms to be used to increase performance:

• A larger prefetch buffer (12 versus 8 bytes)

• A larger translation buffer (28 versus 8 entries)

• A 1KB, 90-ns, first-level cache

• A 64KB, 180-ns, second-level cache (instead of 1MB of memory)

• Multiword transfers (longword, quadword, hexaword, and octaword versus longword)

• CPU write buffers (one longword) in the CPU, memory controller, and Q22-bus interface

• Larger DMA buffers (16 words versus 2 words for writes, 4 words versus 2 words for reads)

• A 16-entry scatter-gather map cache

The combination of reduced cycle times and architectural mechanisms produced a CPU performance 3.2 times that of the MicroVAX II (as measured by the mean of the distribution of results from over 150 CPU benchmarks). Additionally, a slight increase in maximum I/O bandwidth was achieved (as measured by simulation with an ideal Q-bus master).

Reliability

Both the MicroVAX II design and the MicroVAX 3500/3600 design were subjected to extensive thermal analysis. This analysis contributed to a board layout and chip packaging scheme that would minimize junction temperatures, thereby improving reliability. Both designs also ensure a high level of reliability by using preconditioned components that have passed a rigorous qualification program.

Because of its increased complexity, the MicroVAX 3500/3600 was designed to be more tolerant of intermittent and transient failure mechanisms. ECC rather than parity is used to protect main memory, and the data path between the CPU and main memory (including both caches) is protected by byte parity. There are also four timers (three for the Q22-bus and one for the CDAL bus) to detect unanswered bus cycles. The CVAX can detect four types of CFPA errors, four types of memory management unit errors, one type of interrupt error, and one type of microcode error. Errors that are detected synchronous to CPU execution are reported by means of a machine check on the same cycle on which the errors are detected. (Comparatively, the MicroVAX II reports the errors on the subsequent cycle.) Unique machine check frames or hardware error flags are provided so that the proper error recovery routine can be invoked. The recovery routines typically log the error, clear the error condition, retry the operation a


specified number of times, and continue if successful. If the routine is unsuccessful and the faulty hardware can be disabled, the system runs in a degraded mode until repaired. Otherwise, the system will crash. Errors detected asynchronously to CPU execution are reported by a high priority interrupt and are logged, but in most cases are nonrecoverable. Errors that are corrected by hardware are reported via a lower priority interrupt, so they can be logged.

Data from reliability qualification testing verified that the predominant failure mode was intermittent, suggesting that the error recovery capabilities built into the system would significantly increase the uptime of the system.

Testability

Most of the architectural mechanisms used to increase the speed of computer systems (such as caches and special purpose buffers) present testability problems. These mechanisms are almost always designed to be software transparent, which makes them invisible to diagnostic software. To solve this problem, special diagnostic modes are provided for both the first- and second-level caches. The first-level cache diagnostic mode provides a way for the CPU to explicitly write the tag store and clear the valid bits by using selected instructions. The second-level cache diagnostic mode provides explicit access to both the tag and data stores through two blocks of I/O addresses (the cache diagnostic space and the cache tag diagnostic space). Through the cache diagnostic space, the data store can be read or written, the tag store can be written, and the valid bits can be cleared. When not in diagnostic mode, cache appears in this space as high speed RAM. During power-up self-test, diagnostic code is transferred from ROM to this RAM to allow fast execution of the code without requiring that main memory be functional. Through the cache tag diagnostic space, the state of the cache tag bits, parity bits, valid bits, and several points within the cache control logic can be read.

The MicroVAX 3500/3600 processor module design also provides a diagnostic mode for main memory and a means of writing to main memory through the Q22-bus interface. The main memory diagnostic mode allows memory test times to be significantly reduced. Further, writing to main memory through the Q22-bus interface allows the scatter-gather map functionality to be tested without the assistance of another device on the Q22-bus.

Summary

Having met performance goals, MicroVAX 3500/3600 systems were shipping in volume within three years of the first shipments of MicroVAX II. At that time, two system packages, over twenty mass storage and communications options, three operating systems, and over 200 software products (for VMS alone) had been qualified and were available from Digital. Scores of hardware and software products were also available from third-party vendors. This offering would never have been possible without the level of compatibility that results from strict adherence to existing CPU (VAX) and I/O bus (Q22-bus) specifications.

References

1. T. Leonard, ed., VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-3459E-DP, 1987).
2. KA650-AA CPU Module Reference Manual (Maynard: Digital Equipment Corporation, Order No. EYKA650-UG, 1987).
3. T. Fox, P. Gronowski, A. Jain, B. Leary, and D. Miner, "The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor," Digital Technical Journal (August 1988, this issue): 95-108.
4. E. McLellan, G. Wolrich, and R. Yodlowski, "Development of the CVAX Floating Point Chip," Digital Technical Journal (August 1988, this issue): 109-120.
5. C. DeVane, "Design of the MicroVAX 3500/3600 Second-level Cache," Digital Technical Journal (August 1988, this issue): 87-94.
6. D. Morgan, "The CVAX CMCTL - A CMOS Memory Controller Chip," Digital Technical Journal (August 1988, this issue): 139-143.
7. B. Maskas, "Development of the CVAX Q22-bus Interface Chip," Digital Technical Journal (August 1988, this issue): 129-138.
8. J. Winston, "The System Support Chip, a Multifunction Chip for CVAX Systems," Digital Technical Journal (August 1988, this issue): 121-128.

Charles J. DeVane

Design of the MicroVAX 3500/3600 Second-level Cache

The MicroVAX 3500/3600 processor module, the KA650, is a CVAX-based uniprocessor that incorporates an unusual cache architecture: a two-level cache. The first level is a small fast cache on the CPU chip, and the second level is a large, somewhat slower cache on the processor module. Along with high quality and high performance, time-to-market was a crucial goal for this third-generation MicroVAX system product. Consequently, project engineers adhered to a philosophy of design simplicity for the second-level cache. Cache performance measurements support their design decisions.

The MicroVAX 3500/3600 Project

The primary goal of the MicroVAX 3500/3600 project was simple. The chip designers in the Semiconductor Engineering Group (SEG) were working on a new single-chip VAX, CVAX.¹ The chip would have its own on-chip cache and was projected to achieve a performance level three times that of the original MicroVAX chip used in the MicroVAX II system. The MicroVAX Development Group would work in concert with the SEG effort. Our goal was to ship a high-quality, high-performance CVAX-based uniprocessor, which would be upward compatible with MicroVAX II systems. This new product had to be available as soon as CVAX chips could be produced in volume.

Given the objectives of high quality and MicroVAX II system compatibility, the remaining design goals were carefully prioritized as listed below:

1. Time to market
2. Raw computational performance
3. Memory expansion
4. Direct-memory access (DMA)/real-time performance
5. System cost and price
6. Additional functionality

The importance of quickly delivering the MicroVAX 3500/3600 to market led to a close working relationship between the engineers in SEG and MicroVAX Development. We designed and built the MicroVAX CPU and memory modules in parallel with the CVAX project, a process that relied heavily on simulation. In turn, the MicroVAX project team provided the initial debug testbed for CVAX: CVAX first booted VMS in a MicroVAX 3500/3600 system.

Overview of the KA650 Processor Module

The system functional partition (Figure 1) shows how the KA650 processor module fits into the entire computer system. The processor module communicates to mass storage, communication, and other I/O devices over the Q22-bus. Main memory connects to the processor on a private memory bus which uses both the backplane and an "over-the-top" ribbon cable. A console panel carries bit rate and configuration switches, a single-digit hexadecimal display, a connector for the console serial line, and a NiCd battery for the processor's time-of-year (TOY) clock.

The module functional partition in Figure 2 shows the basic parts of the KA650 processor module. All memory traffic flows over the CDAL bus (CVAX data/address lines). Only I/O space registers reside on the BCDAL bus (buffered CVAX data/address lines).

The memory controller subsystem and the Q22-bus interface subsystem are each single chips: the CMCTL (CVAX memory controller)² and CQBIC (CVAX Q22-bus interface chip).³ Most of the system support functions are contained in another chip, the SSC (system support chip).⁴ Each of these was designed in parallel with CVAX as part of a complete CVAX chip set.

Figure 1 MicroVAX 3500/3600 System Functional Partition
(Block diagram. Labeled elements: KA650 processor, CPU with two-level cache; MS650 memory, 1 minimum, 4 maximum; memory data; memory address and control; console panel with baud rate and configuration switches, 1-digit hex display, and TOY clock battery; console serial line.)

The primary problem left to the KA650 module designers was to balance two key goals: to design the board-level cache for the highest performance possible and to do so without endangering the project's time-to-market goal.

Two-level Cache Architecture Description

The KA650 is Digital's first commercially available processor to incorporate a two-level cache. The first level is a small cache on the CPU chip with a cycle time of one microcycle, or 90 nanoseconds (ns). The second level is a large cache on the processor module with a cycle time of two microcycles, or 180 ns. In comparison, the cycle time of the main memory system is five microcycles, or 450 ns.

The goal of each level of cache is to reduce effective memory access time on processor read cycles. At the chip level, the CVAX processor would prefer to use just one microcycle to access memory. However, the CVAX bus interface unit (BIU) requires two microcycles to access memory off the chip. To compensate for this gap, the CVAX designers included an on-chip cache that could be accessed in one microcycle, and made the cache as large as practical. From the module perspective, CVAX can run a bus cycle as quickly as two microcycles. However, the memory system requires five microcycles to access main memory. To compensate for this second gap, the module designers included a module-level cache that could be accessed in two microcycles, and made the cache as large as practical.

First-level Cache

The first-level cache is a 1-kilobyte (KB), two-way set associative cache with a quadword block size. The cache is organized as 64 rows, each row containing two sets, and each set containing 8 bytes. Two bits in the cache disable register (CADR) select whether the first-level cache stores I-stream only, D-stream only (ordinarily used only for diagnostics), or both I-stream and D-stream references. The cache allocates a block whenever a cacheable read reference misses the cache. The CVAX BIU then generates a quadword read cycle to fill the allocated block.
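The cycle times and the first-level cache geometry quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and the way `split_address` divides an address into tag, row, and byte offset is an assumption, since the article gives the cache shape but not the bit assignment.

```python
MICROCYCLE_NS = 90  # CVAX microcycle time quoted in the text


def cycle_time_ns(microcycles):
    """Convert a cycle count in microcycles to nanoseconds."""
    return microcycles * MICROCYCLE_NS


def cache_rows(size_bytes, sets_per_row, block_bytes):
    """Number of rows in a cache of the given size and shape."""
    return size_bytes // (sets_per_row * block_bytes)


def split_address(addr, rows, block_bytes):
    """Split an address into (tag, row, byte offset).

    Assumption: the low-order bits select the byte within a block and
    the next bits select the row; the article does not give this mapping.
    """
    offset = addr % block_bytes
    row = (addr // block_bytes) % rows
    tag = addr // (block_bytes * rows)
    return tag, row, offset


latencies_ns = {
    "first-level cache": cycle_time_ns(1),
    "second-level cache": cycle_time_ns(2),
    "main memory": cycle_time_ns(5),
}

# 1KB, two sets per row, 8-byte (quadword) blocks
first_level_rows = cache_rows(1024, 2, 8)

# Example: where an arbitrary address would land under the assumed mapping
example = split_address(0x1234, first_level_rows, 8)
```

Under these figures the row count works out to the 64 rows stated in the text, and the three latencies reproduce the 90 ns, 180 ns, and 450 ns cycle times.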

Figure 2 MicroVAX 3500/3600 Module Functional Partition
(Block diagram. Labeled elements: CPU and FPA; CDAL<31:0>; BCDAL transceivers; BCDAL<31:0>; address latch, A<31:2> and A<16:2>; second-level cache; system support subsystem; console panel; Q22-bus interface; Q22-bus; main memory controller; main memory interconnect.)

The CVAX BIU waits to determine whether a read reference hits in the cache before starting the bus cycle to access memory. This wait helps free the processor bus for use by DMA devices, but requires faster RAMs in the second-level cache.

The processor writes directly through the cache to memory. Therefore, when a cache block is allocated, the block being replaced need not be written back to memory. The CVAX BIU also incorporates a write buffer to support dump-and-run writes by the processor. If the CDAL bus is busy when CVAX needs to write, the BIU will buffer one write cycle. The buffering allows the processor to continue execution, reading from the first-level cache. Thus, some write cycles require only one microcycle.

When DMA devices write to main memory, the cache must be updated to reflect the change in main memory. Cache data that is no longer consistent with the contents of main memory is called stale data. To prevent stale data from accumulating in the cache when DMA devices write to memory, the cache will check and invalidate one or two blocks as necessary. Invalidation ties up the first-level cache for three microcycles per quadword block and six microcycles for an octaword. However, these delays stall CPU execution only if the CPU requires access to the cache during those microcycles.

Second-level Cache

The second-level cache is a 64KB direct-mapped cache, which, like the first level, has a quadword block size. This cache is organized as 8K rows, each row containing one set of 8 bytes. The second-level cache allocates a quadword block whenever CVAX reads a quadword that misses the second-level cache. (Quadword reads are ordinarily the result of allocation in the first-level cache. Unusual bit settings in the CADR, however, can cause the CVAX BIU to generate quadword cycles on reads without actually enabling the first-level cache.) Thus, the second-level cache will include the same kind of data as the first-level cache: I-stream only, D-stream only, or I- and D-stream references.

Instead of waiting to determine whether a read reference hits in the cache, the memory controller begins accessing memory in parallel with the tag look-up in the second-level cache. If the reference hits in the cache, the memory controller will abort its response to CVAX (although the control cycle to the memory modules completes normally).

Like the first-level cache, the second-level cache also writes directly through to memory. The memory controller will perform a dump-and-run write if the write is an unmasked longword. Therefore many write cycles can complete in two microcycles. This completion time assumes the memory modules are not busy completing a previous dump-and-run write, aborted read cycle, or refresh cycle.

During DMA the second-level cache will also check and invalidate one or two blocks as necessary. These checks prevent stale data from accumulating during DMA write cycles to memory.

Design of the KA650 Second-level Cache

The importance of minimizing time taken to deliver the product to market made simplicity a high priority. For most major design decisions, we chose the simplest implementation.

Cache Speed

The cache speed was determined by the fastest CVAX bus cycle. CVAX can read or write a single longword in two microcycles (180 ns) and read a naturally aligned quadword in three microcycles (270 ns). Each added wait state costs another microcycle (90 ns). For example, a typical quadword read from main memory requires five microcycles for the first longword and three microcycles for the second longword, a total of 720 ns. Therefore the goal of the second-level cache was to allow CVAX to execute from memory with no wait states. Preliminary timing diagrams determined that to keep up with a 100-ns CVAX the cache would require 45-ns static RAMs. When later in the project KA650 module designers changed the clock speed from 100 ns to 90 ns, they also replaced the 45-ns cache RAMs with 35-ns RAMs.

Cache Size

Increasing a cache's size always improves its performance. Since high performance was a major priority, choosing the cache size was simply a matter of finding the largest RAM that would run fast enough, fit on the board, and not risk the schedule. At the beginning of the project, we doubted that 256-kilobit (Kb) static RAMs with 45-ns access time would be available soon enough. However, we expected 64Kb RAMs to be mature when Manufacturing would need production volumes of the parts for the MicroVAX 3500/3600 system.

The 64Kb RAMs were available in three organizations: 64K by 1, 16K by 4, and 8K by 8. We could have arranged these to form a 256KB cache (using 32 64K-by-1 RAMs), a 64KB cache (using 8 16K-by-4 RAMs), or a 32KB cache (using 4 8K-by-8 RAMs). The 256KB cache would not have even fit on the module, and so was not considered. The 64KB cache would fit (requiring only slightly more module space than the 32KB cache) and was actually cheaper than the 32KB cache. So naturally we chose the 64KB cache. We then added four 16K-by-1 RAMs for byte parity.

Cache Organization

We quickly ruled out organizing the cache with more than one set. More than one set would either require too much logic or run too slowly. To get data fast enough from the correct set on a read hit would require a multiplexer and a separate set of RAMs for each set. This additional logic would take more space than we had available. Another possibility was to use a "select set" signal generated from the tag-store match signals as an address bit into the data store RAMs. This organization, however, would run too slowly.

The cache performance simulation data available to us assumed the cache was flushed on every context switch. We felt this assumption might be overly pessimistic for caches as large as 64KB. Furthermore, we expected that more realistic data would not show a large performance advantage for a two-way set associative cache over a direct-mapped cache. We therefore chose the simpler direct-mapped organization.

Block Size

When choosing the block size for the second-level cache, we again decided in favor of simplicity. We chose to make the second-level cache use the same size block as the first-level cache, which was already set at a quadword. At quadword block size, the second-level cache can allocate a block simultaneously with the first-level cache. The second-level cache simply captures the data from the quadword read as it comes from memory over the CDAL bus.
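The RAM options weighed under Cache Size, and the quadword read timing quoted under Cache Speed, reduce to simple arithmetic. The sketch below is a rough check, not anything taken from the design documentation; the helper name `cache_bytes` and the dictionary layout are hypothetical.

```python
KB = 1024
MICROCYCLE_NS = 90  # microcycle time after the move from 100 ns to 90 ns


def cache_bytes(chips, depth_words, width_bits):
    """Total capacity in bytes of an array of identical static RAMs."""
    return chips * depth_words * width_bits // 8


# The three ways of using 64Kb RAMs listed in the text
options_kb = {
    "32 x 64K-by-1": cache_bytes(32, 64 * KB, 1) // KB,
    "8 x 16K-by-4": cache_bytes(8, 16 * KB, 4) // KB,
    "4 x 8K-by-8": cache_bytes(4, 8 * KB, 8) // KB,
}

# Byte parity needs one parity bit per data byte of the chosen 64KB cache
parity_bits_needed = cache_bytes(8, 16 * KB, 4)
parity_bits_provided = 4 * 16 * KB * 1  # four 16K-by-1 RAMs

# Quadword read: five microcycles for the first longword plus three for the
# second from main memory, versus three microcycles with no wait states
main_memory_quadword_ns = (5 + 3) * MICROCYCLE_NS
no_wait_quadword_ns = 3 * MICROCYCLE_NS
```

The three options come out at 256KB, 64KB, and 32KB as stated, the four parity RAMs supply exactly one bit per data byte, and a no-wait-state quadword read takes 270 ns against 720 ns from main memory.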


We had several additional reasons for not choosing either a longword block size or a size larger than a quadword. Use of a longword block size in the second-level cache complicates the control logic and potentially degrades performance. To respond to a CVAX quadword read, the cache would require two separate tag look-ups. If the first look-up hit but the second look-up missed, the cache would have to retry the bus cycle. The retry would invalidate the block in the first-level cache and waste bus bandwidth. On the other hand, use of a block size larger than a quadword would require extra data path and control to perform block fill operations.

Tag Store Organization

Once we knew the data store size (64KB), organization (direct-mapped), and block size (quadword), we could determine the organization of the tag store.

The tag store requires one row for each of the 8,192 quadword blocks of the data store. Of the CVAX 30-bit physical address, 13 address bits (bits 15 through 3) are used to select the quadword block and associated tag store row. Each tag row must store a parity bit, a valid bit, and enough of the memory address to specify where in main memory the quadword block of data came from. Since the KA650 would architecturally support no more than 64MB, address bits 29 through 26 would always be zero to access main memory. This left 10 address bits (bits 25 through 16) to be stored in the tag row. Therefore the tag store would require 8,192 words of RAM, each word consisting of 10 tag bits plus a valid bit and a parity bit.

To make this 8K-by-12 array, we used three of the same 16K-by-4 RAMs used in the data store. We did examine the special 2K-by-9 tag-store RAMs being developed by some memory vendors. We concluded that these RAMs were too small and their availability too risky for the KA650.

Look-aside Architecture

The design of the first-level cache keeps most of the processor memory traffic off the CDAL bus. Instead of this "look-through" design, the second-level cache uses a "look-aside" architecture which simplifies the bus data path and control and improves performance on cache misses.

In the look-aside architecture, both the second-level cache and the memory controller reside on the same bus. When CVAX starts a read cycle, the memory controller begins accessing main memory in parallel with the tag check in the second-level cache. If the cycle misses the cache, then main memory is prepared to respond as quickly as possible. If the cycle hits the cache, the memory controller senses the hit and aborts its response to the bus cycle. A drawback of this scheme is that the memory controller must still complete the control cycle to the dynamic RAMs of main memory. Consequently, the controller cannot respond as quickly as it had initially if the cache hit is immediately followed by a cache miss. We expected this penalty to be insignificant.

The alternative to a look-aside architecture would be to place the memory controller on a separate bus. The bus cycle would pass to the controller only after the cycle missed the second-level cache. This design would have improved the efficiency of main memory usage. However, this design requires additional data path and control to create the separate memory bus, and reduces processor performance by adding at least one additional microcycle to the penalty for a cache miss.

Handling of Write Cycles

We chose a simple write-through design for the second-level cache instead of a more complex write-back design. The penalty of not using write-back is reduced by the CMCTL dump-and-run write feature. When CVAX writes an unmasked longword to main memory, the CMCTL latches the address and data and terminates the bus cycle before the write to main memory is actually completed. If write cycles occur back to back (which is common for VAX processors), then the second write will be delayed while the first one completes. However, many write cycles can still complete in the minimum two microcycles.

DMA Access to the Cache

To maintain design simplicity, we decided not to allow DMA to read or write the second-level cache. This section discusses several of the possibilities we considered and rejected. These include DMA reads, DMA write-through, and a cache without valid bits.

First, we considered allowing the CQBIC (which is the only DMA device on the CDAL bus) to read from the second-level cache. However, the cache control logic is synchronous with the CVAX clocks. The control logic design would have been significantly complicated if that logic had to respond to the CQBIC, which runs asynchronous to the CVAX clocks.

Second, the cache must recognize DMA writes to memory to prevent stale data from accumulating in the cache. We considered letting DMA write cycles write through the cache, but again concluded the timing was too complex to be practical. (Bus parity was also a concern, which is discussed in the section Cache Parity.) Instead, the second-level cache latches the address and simply invalidates one or two blocks if the address hits in the cache.

Finally, while considering DMA write-through, we thought about designing the cache without any valid bits. Power-up routines in the read-only memory (ROM) code could initialize the cache to match main memory. The cache would then remain consistent with memory unless an uncorrectable ECC (error correcting code) error was encountered in main memory. When that error occurred, the cache would simply disable read hits until the operating system could restore consistency with main memory by writing the quadword block containing the error. Of course, once we decided against DMA write-through, we had to include valid bits.

Cache Parity

To improve the integrity of the second-level cache, both the data store and the tag store of the second-level cache are protected by parity.

Data Store Parity - Data store parity was simplified by taking advantage of the CDAL bus parity supported by CVAX and CMCTL. The data store simply stores and returns parity captured off the bus, and asserts CDPE (CVAX data parity enable) to have CVAX check the parity.

This parity checking scheme was another reason we rejected DMA write-through, since CQBIC neither generates nor checks CDAL bus parity.

One drawback to this simple scheme is that the processor cannot easily determine the source of a CDAL bus parity error. A CDAL bus parity error can be caused by a cache failure, a CMCTL failure, or an actual bus fault (such as open etch). This lack of isolation makes error diagnosis difficult or impossible when CVAX detects a CDAL bus parity error.

One useful feature we did not think to include was a control register bit to disable the assertion of CDPE and the subsequent parity checking by CVAX. Such a bit would allow a machine check handler to isolate a failing bit in the data store. Without this control register bit, software can at best determine in which byte the error resides; if multiple bytes have errors, only one byte can be identified.

Tag Store Parity - The tag store parity must be generated and checked by the tag store itself. Two separate parity trees are used:

• The predictive parity tree
• The error-checking tree

The predictive parity tree generates the parity of the tag field of the address. This tree predicts what the parity stored in the RAM must be for the bus cycle to hit in the cache. Predictive parity is fast because the parity is calculated while the tag RAMs are looking up the tag. This scheme does not delay the tag comparison and is sufficient to guarantee that bad parity stored in the tag RAMs will force a cache miss. However, it is not sufficient to determine whether the parity in the RAMs is actually bad. Thus, a second parity tree, the error-checking tree, is needed.

The error-checking tree identifies bad parity in the cache tag RAMs. The output of this second tree is checked after the hit/miss decision is made, to determine whether a miss was caused by bad parity. If bad parity is detected, the cache control register error bit is set, the cache-enable bit is cleared, and an interrupt is posted to the processor. Since the bad parity forced a miss, no state is corrupted, and a process or system crash is averted.

Second-level cache tag parity covers both the 10 tag bits and the valid bit to protect against erroneously set valid bits.

Cache Diagnostic Space

Early in the project we recognized the value of being able to directly access the cache as 64KB of fast RAM. Thus we created "cache diagnostic space" in the 64MB address range from 1000 0000 to 13FF FFFF. In cache diagnostic space, the cache RAM appears as 1,024 copies of the 64KB of cache. The cache responds to all CVAX read and write cycles in this address range, effectively forcing a cache hit. For simplicity, DMA access to cache diagnostic space is not permitted.

During power-up self-test, some diagnostics are relocated from the boot/diagnostic ROM to cache diagnostic space for faster execution.
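The address arithmetic described under Tag Store Organization, Tag Store Parity, and Cache Diagnostic Space can be modeled in software. The Python sketch below is a simplified illustration, not the hardware logic: the function names, the bit positions inside the 12-bit tag word, and the parity polarity are all assumptions. Only the address bit ranges, the field widths, and the diagnostic address range come from the text.

```python
INDEX_SHIFT, INDEX_BITS = 3, 13   # address bits 15..3 select one of 8,192 rows
TAG_SHIFT, TAG_BITS = 16, 10      # address bits 25..16 are stored as the tag
CACHE_BYTES = 64 * 1024
DIAG_BASE, DIAG_LIMIT = 0x1000_0000, 0x13FF_FFFF  # cache diagnostic space


def split_address(addr):
    """Return (tag, row) for a main memory physical address."""
    row = (addr >> INDEX_SHIFT) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> TAG_SHIFT) & ((1 << TAG_BITS) - 1)
    return tag, row


def parity(value):
    """1 if value has an odd number of 1 bits (polarity is an assumption)."""
    return bin(value).count("1") & 1


def tag_word(tag, valid):
    """Pack a 12-bit tag row: parity | valid | 10 tag bits (assumed layout).

    Parity covers the tag bits and the valid bit, as the text describes.
    """
    covered = (valid << TAG_BITS) | tag
    return (parity(covered) << (TAG_BITS + 1)) | covered


def read_hits(stored_word, addr):
    """Hit only if the stored row equals the word predicted from the address.

    The predicted word includes the predicted parity, so a stored row with
    bad parity can never match and is forced to miss.
    """
    tag, _ = split_address(addr)
    return stored_word == tag_word(tag, valid=1)


def diagnostic_offset(addr):
    """Byte offset into the 64KB cache for a diagnostic space address."""
    if not DIAG_BASE <= addr <= DIAG_LIMIT:
        raise ValueError("address is outside cache diagnostic space")
    return (addr - DIAG_BASE) % CACHE_BYTES


diag_copies = (DIAG_LIMIT - DIAG_BASE + 1) // CACHE_BYTES
```

Comparing the whole stored word against a predicted word is one way to capture the property the text attributes to the predictive parity tree: a corrupted parity bit, like a cleared valid bit, turns a would-be hit into a miss. Deciding whether the miss was caused by bad parity would need the separate error check, which this sketch does not model.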


Cache diagnostic space was also useful at initial debug of the CVAX chip set. We were able to down-line load diagnostic programs through the console serial line and execute them from the cache diagnostic space. With the diagnostic programs in this cache space, we could continue debug work on the module without relying on either main memory or the Q-bus interface.

Writing to cache diagnostic space could corrupt normal cache operation by creating stale data in the cache. To prevent this, write cycles to cache diagnostic space normally invalidate the tag for that address. This invalidation also provides a simple means for flushing all or part of the cache. To simplify diagnosis of cache faults, a diagnostic mode bit in the cache control register can be set to cause writes to cache diagnostic space to set the valid bit instead of clearing it. Setting the diagnostic mode bit also clears the cache enable bit. Thus normal allocation and DMA invalidation are prevented from accidentally upsetting a diagnostic pattern being written into the cache. These features simplify the task of putting the cache in a specific state for diagnostic purposes.

Performance Measurements

Measurements of second-level cache performance bear out that the fundamental architectural decisions were sound.

The measurements were performed on a small system consisting of a KA650 CPU with 16MB of main memory, an RQDX3 disk controller with an RD54 hard disk, and a DEQNA Ethernet interface. The CPU module was modified with additional circuitry to detect various kinds of cacheable bus cycles. The system ran VMS version 4.7A. To heavily load the system with reasonably realistic workloads, we used varying combinations of three basic tasks:

• Assembling and linking a large program written in VAX MACRO
• Running a CAD program that compares the topology of two large net lists
• Copying large files (greater than 8,000 blocks) across the network

Four 16-bit counters and a logic analyzer were used to log the occurrence of particular bus cycles. For each measurement, the cache performance was monitored continuously for 5 to 30 minutes (depending on the workload and type of bus cycle) to collect a total of 268 million sequential bus cycles of interest. For example, to study the read hit rate, the four counters simultaneously collected:

• The total number of cacheable quadword read cycles
• The number of cacheable quadword read cycles that hit in the second-level cache
• The number of cacheable quadword read cycles that missed the second-level cache (Counting both the cache hits and misses provides a useful error check.)
• The number of cacheable quadword read cycles that hit in the cache, or that would have hit if the valid bit had been set

Since CVAX gives no external indication when a memory read is satisfied by the internal cache, only reads that miss the first-level cache (and therefore generate a bus cycle) can be directly measured. Thus, it is very important to note that the read hit rate of the second-level cache alone is not the same as the read hit rate of both caches taken together as a single whole (which is beyond the scope of this paper). This is not a problem for write cycles because the first-level cache is write through.

Test Results

For the workloads tested, the read hit rate was typically 85 percent and ranged between 82 percent and 91 percent. This is what we intuitively expected: the large size of the cache would keep the hit rate high, even though the first-level cache tends to strip off much of the memory access locality.

We measured the read hit rate of the second-level cache with the first-level cache turned off, just to get an idea of how well a simple but large cache can perform. The memory read hit rate ranged between 96 percent and 99 percent when the first-level cache was turned off. This demonstrates that even a simple direct-mapped cache performs well if it is large enough. However, note that turning on the first-level cache tends to radically alter the bus traffic seen by the second-level cache. Therefore a direct comparison between hit rates with and without the first-level cache can be misleading.

The "would have" hit rate is a measure of what the read hit rate would have been if DMA write cycles wrote through the cache instead of invalidating the cache. The modifications to the CPU module included an extra tag comparator that ignores the valid bit. Once the cache has initially filled, valid bits are cleared only by DMA invalidates. If the tag matches but the valid bit is cleared, then the cache miss was caused by a DMA invalidate and would have been a hit if the DMA cycle had written through.

The "would have" hit rate showed the benefit of DMA write through would have been negligible. The incremental improvement in hit rate was typically 0.1 percent, though in one case it rose to about 1.3 percent (copying large files over the network, with no other computational tasks). This improvement is lost in the noise when compared to the normal task-to-task variation in hit rate. Again, this is what we intuitively expected: DMA tends not to write into memory currently in use by the processor. Clearly we made the right decision to avoid the added complexity of DMA write through.

Memory write cycles were also measured for the same tasks as memory reads. However, instead of measuring the "would have" hit rate, we counted the number of cycles that took longer than two microcycles to complete. This gives us some measure of the effectiveness of the CMCTL dump-and-run write buffer.

The memory write hit rate ranged between 77 percent and 89 percent. Of all memory write cycles, 46 percent to 63 percent took longer than two microcycles (the minimum write cycle time); and 57 percent to 44 percent took longer than two microcycles and hit in the cache.

We had hoped more cycles could take advantage of the dump-and-run write buffer in the CMCTL. However, this performance is still good for the relative simplicity of the CMCTL write buffer. Also remember that the CVAX internal write buffer helps shield CPU performance from the delays of many write cycles. The complexity and schedule risk of adding another write buffer or designing the cache for write-back operation would not have been justifiable.

To examine the relative impact of the two-level cache on processor performance, we ran benchmarks with both caches enabled, each cache alone, and both caches turned off. Table 1 shows some typical results normalized to the performance of the KA650 with both caches turned on. Performance of the MicroVAX II is shown for comparison.

Table 1 Comparison of Benchmark Results for First- and Second-level Caches

Benchmark   Neither   Second-level   First-level   Both     MicroVAX II
            Cache     Cache Only     Cache Only    Caches
HANOI       0.45      0.70           1.00          1.00     0.42
PRIME       0.68      0.81           0.97          1.00     0.24
FFT45       0.52      0.69           0.91          1.00     0.28
JACOBI      0.47      0.65           0.93          1.00     0.27
CAE2        0.51      0.69           0.95          1.00     0.31

Each cache provides a significant performance boost, but performance with the first-level cache alone is better than performance with the second-level cache alone. The faster cycle time and two-way associativity of the first-level cache outweigh the large size of the second-level cache. An extreme example of this is the Towers of Hanoi benchmark, where the performance of both caches together is no better than that of the first-level cache alone.

Conclusions

At the project close, we had met our fundamental goals. The MicroVAX 3500/3600 CPU is compatible with the MicroVAX II but delivers three times the performance, which is attributable in part to the two-level cache. And because we adhered to a simple design approach, the new system was ready to ship as soon as CVAX chip sets were available in production volumes.

References

1. T. Fox, P. Gronowski, A. Jain, B. Leary, and D. Miner, "The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor," Digital Technical Journal (August 1988, this issue): 95-108.
2. D. Morgan, "The CVAX CMCTL - A CMOS Memory Controller Chip," Digital Technical Journal (August 1988, this issue): 139-143.
3. B. Maskas, "Development of the CVAX Q22-bus Interface Chip," Digital Technical Journal (August 1988, this issue): 129-138.
4. J. Winston, "The System Support Chip, a Multifunction Chip for CVAX Systems," Digital Technical Journal (August 1988, this issue): 121-128.

Digital Technical Journal No. 7 August 1988

Thomas F. Fox
Paul E. Gronowski
Anil K. Jain
Burton M. Leary
Daniel G. Miner

The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor

The MicroVAX 78034 chip - also known as CVAX - is a second-generation single-chip VAX microprocessor. A primary project goal was to develop a chip with three times the performance of the first single-chip VAX processor, the MicroVAX 78032. Therefore, architecture and circuit design efforts were directed toward decreasing ticks per instruction (TPI) and machine cycle time. The designers reduced the TPI by 27 percent and achieved a 90-nanosecond (ns) cycle - a significant improvement over the 200-ns cycle time of the first-generation chip. Implemented in a 2-micron CMOS process, the chip comprises six major functional units. These include the instruction queue, execution unit, memory management unit, bus interface unit, microsequencer and control store, and a unique on-chip cache.

The CVAX 78034 CPU chip is a second-generation, single-chip VAX microprocessor. This chip is the CPU of the MicroVAX 3500 and 3600 computer systems, which have approximately three times the performance of the MicroVAX II computer system [1, 2]. The VAX 6200 family of systems uses slightly faster 80-ns (speed-binned) CVAX CPU chips in a multiprocessor configuration. In this paper, we describe the CVAX chip and explain how the increase in performance was achieved.

Project Goals

The primary project goal was to develop a single-chip CPU that implemented the VAX architecture and delivered three times the performance of the MicroVAX 78032 CPU chip used in the MicroVAX II computer systems. Of the several elements in this goal, performance presented the greatest design challenge.

The performance of a CPU is inversely proportional to the product of ticks per instruction (TPI) [1] and the machine cycle time. TPI depends on the performance of the system architecture. The minimum machine cycle time depends on circuit speed and on how the architecture is implemented. In the CVAX chip, both the TPI and the machine cycle time were improved to meet the performance goal.

Much effort went into reducing the TPI. By way of comparison, the MicroVAX II system, which is based upon the MicroVAX 78032 chip, performs at approximately 11.5 TPI, whereas the MicroVAX 3600 system, which uses the CVAX 78034 chip, performs at approximately 8.4 TPI. The TPI was lowered mainly by reducing the average number of cycles required to access memory. This reduction in the number of cycles was achieved by the inclusion of the following architectural features in the system:

• A 1-kilobyte (KB), on-chip instruction and data stream cache, which is capable of a longword read each cycle

• A 64KB, second-level cache on the board, which is capable of a longword read or write in two cycles and a quadword read in three cycles

• A 28-entry translation buffer (TB), which achieves a high hit rate for virtual-to-physical address translation
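The relationship between TPI, cycle time, and performance can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the TPI figures quoted above and the 200-ns and 90-ns cycle times given in the abstract, and it assumes that one tick corresponds to one machine cycle. The function and variable names are ours, not the authors'.

```python
# Illustrative arithmetic only. TPI and cycle-time values are the ones quoted
# in the paper; treating one "tick" as one machine cycle is an assumption.

def time_per_instruction_ns(tpi, cycle_ns):
    """Average time per instruction = ticks per instruction x cycle time."""
    return tpi * cycle_ns

microvax_ii = time_per_instruction_ns(11.5, 200)  # MicroVAX 78032 based system
cvax = time_per_instruction_ns(8.4, 90)           # CVAX 78034 based system

tpi_reduction = 1 - 8.4 / 11.5      # fractional drop in TPI
speedup = microvax_ii / cvax        # performance is inversely proportional to time

print(f"MicroVAX II: {microvax_ii:.0f} ns per instruction")
print(f"CVAX:        {cvax:.0f} ns per instruction")
print(f"TPI reduction: {tpi_reduction:.0%}")
print(f"Speedup:       {speedup:.2f}x")
```

Under these assumptions the TPI reduction comes out at roughly 27 percent and the overall ratio at roughly three, which is consistent with the stated project goal.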


Other factors influencing the lower TPI are as follows:

• More efficient microcode was implemented for some instructions. In general, most complex instructions, such as CALLx, RET, PUSHR, POPR, and INSV, were coded for speed rather than for space.

• Six additional instructions were implemented in microcode. These instructions are CMPC3, CMPC5, LOCC, SKPC, SCANC, and SPANC.

• The instruction decode section decodes all specifiers instead of relying on the microcode to decode some specifiers.

The machine cycle time reduction was determined in part by the technology chosen for fabrication. The first-generation chip, the MicroVAX 78032 CPU, has a 200-ns cycle time and was implemented in a 3-micron NMOS process. In comparison, the CVAX 78034 CPU chip had a goal of a 90-ns cycle time and was implemented in a 2-micron CMOS process. However, only 60 percent of the improvement in the CVAX cycle time results from the fabrication process. The remainder results from architectural and circuit innovations, which are described in the section Internal Organization.

The section following presents an overview of the CVAX architecture.

CVAX Architecture

The CVAX 78034 CPU chip implements the VAX architecture, which has 16 general-purpose registers, the processor status longword, and 18 miscellaneous privileged registers. All 304 VAX instructions [4] are supported by the system. The chip fully executes 181 instructions and provides microcode operand parsing for 21 instructions that are emulated with macrocode. The chip passes 70 F, D, and G floating point instructions to a companion floating point chip. The remaining 32 instructions are fully emulated in macrocode. Table 1 summarizes the instruction set architecture.

The chip memory management hardware and microcode provide a demand-paged virtual memory environment. The virtual memory size is 4 gigabytes, and the physical address space is 1 gigabyte.

External Interface

The CVAX bus provides a flexible interconnect protocol between all CVAX family members. The primary data bus is 32 bits wide and is time multiplexed to share addresses and data. Up to four longwords can be transferred with each address. Strobes provide timing information for synchronous and asynchronous devices. Direct memory access (DMA) request and grant signals are used to control arbitration of the data and address line (DAL) bus between the CPU and peripheral chips.

Shown in Figure 1, the CVAX 78034 CPU chip is a synchronous device on the CVAX bus. In addition to supporting the CVAX bus protocol, eight dedicated pins support a floating point coprocessor interface. These pins are time multiplexed between the CPU chip and the coprocessor chip to transfer control and status information.

Table 1  CVAX Instruction Set Architecture

Instruction Type                        Number

Implemented Fully by CPU
  Integer/logical                           89
  Address                                    8
  Bit field                                  7
  Control                                   39
  Procedure call                             3
  Miscellaneous                             10
  Queue                                      6
  System support                            11
  Character string                           8
  Subtotal                                 181

Implemented by Floating Point Chip
  F floating                                24
  D floating                                23
  G floating                                23
  Subtotal                                  70

Implemented Partially by CPU
  Character string                           3
  Decimal                                   16
  Edit                                       1
  CRC                                        1
  Subtotal                                  21

Implemented Fully in Macrocode
  H floating                                28
  Octaword                                   4
  Subtotal                                  32

Total                                      304
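As a cross-check on the instruction counts in Table 1, the categories can be tallied against the subtotals and the total of 304 quoted in the CVAX Architecture section. The following is a minimal sketch; the dictionary layout and names are ours, and the count of 1 for "Edit" is inferred from the stated subtotal of 21 rather than read directly from the table.

```python
# Instruction counts transcribed from Table 1. The data layout is illustrative.
# Assumption: the "Edit" count is taken as 1, inferred from the subtotal of 21.
TABLE_1 = {
    "Implemented fully by CPU": {
        "Integer/logical": 89, "Address": 8, "Bit field": 7, "Control": 39,
        "Procedure call": 3, "Miscellaneous": 10, "Queue": 6,
        "System support": 11, "Character string": 8,
    },
    "Implemented by floating point chip": {
        "F floating": 24, "D floating": 23, "G floating": 23,
    },
    "Implemented partially by CPU": {
        "Character string": 3, "Decimal": 16, "Edit": 1, "CRC": 1,
    },
    "Implemented fully in macrocode": {
        "H floating": 28, "Octaword": 4,
    },
}

# Sum each group, then sum the groups.
subtotals = {group: sum(counts.values()) for group, counts in TABLE_1.items()}
total = sum(subtotals.values())

for group, n in subtotals.items():
    print(f"{group:36s} {n:4d}  ({n / total:.1%} of all instructions)")
print(f"{'Total':36s} {total:4d}")
```

The tally shows how the work is divided: the largest share of instructions is executed entirely by the CPU chip, and only the H floating and octaword instructions are left wholly to macrocode.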


[Figure 1 is a block diagram. It shows the CVAX 78034 central processor unit connected to the CVAX 78134 floating-point accelerator, to the cache memory and write buffer control, and to the address latch, data transceivers, parity transceivers, byte mask latch, control status latch, and write latch. Signal names legible in the diagram include DS, AS, BM<3:0>, CS<2:0>, DP<3:0>, WR, DBE, DMR, DMG, DAL<31:00>, CCTL, CWB, CPSTA<1:0>, CPDAT<5:0>, RDY, ERR, CLKA, CLKB, and RESET, together with the interrupt control inputs HALT, PWRFL, CRD, MEMERR, INTTIM, and IRQ<3:0>.]

Figure 1 CVAX 78034 External Interface


A clock chip generates pairs of 180-degree phase-shifted clock signals that are distributed to all synchronous MOS components in the system. The clock also generates auxiliary pairs of clocks that can be used by any non-MOS components in the external interface. Separation of the clocking for MOS and non-MOS elements provides better skew control for the critical MOS clock signals.

Microarchitecture

The CVAX 78034 CPU chip has some pipelining and is microprogrammed. The chip comprises six major functional units [5, 6, 7]:

• Instruction decode and prefetch queue (I-Box)

• Execution unit (E-Box)

• Memory management unit (M-Box)

• Bus interface unit (BIU)

• Cache

• Microsequencer and control store

The photomicrograph in Figure 2 and the block diagram in Figure 3 illustrate all functional units on the chip.

Internal Organization

This section describes the six major functional units of the chip. As noted earlier, the emphasis here is on those aspects of the design that enhanced the machine's performance. In addition,

Figure 2 Photomicrograph of the CVAX CPU Chip

[Figure 3 is a block diagram of the CPU chip. Legible labels include the E-Box, M-Box, cache, control store, the internal data and address bus with its state machine, the external data and address bus with its state machine, the CBUS, and a connection to the asynchronous interrupt pins.]

Figure 3 CVAX CPU Block Diagram

at the end of this section we discuss the design approach taken to build in chip testability.

The flow of information between all functional units on the chip is synchronized by four on-chip clock phases of nominally equal duration. All circuits were designed to function with the partial phase overlap or underlap that can result from external clock skew and variations in the fabrication process.

Instruction Decode and Prefetch Queue

The instruction decode and prefetch queue, the I-Box, controls macroinstruction sequencing and instruction stream prefetching [8]. During a microcycle, the I-Box determines what the next microcode dispatch will be, based on the instruction stream data and the current processor state.

The CVAX I-Box is designed to generate the microcode dispatch address for every specifier flow. This design differs from the MicroVAX 78032 CPU chip design; there, the I-Box provides the dispatch address for just the first two specifiers of a macroinstruction and relies upon the microcode to generate the dispatch address for additional specifier flows at a performance cost of one microcycle per specifier.

Primary subsections of the CVAX 78034 I-Box include the instruction decode read-only memory (ROM), the dispatch programmable logic array (PLA), and the prefetch queue.

The instruction decode ROM (IROM) contains the information about VAX macroinstructions that is required to parse the instruction stream. The IROM determines the number of specifiers for an instruction, the sizes of its operands, and a partial microaddress for the execution microcode of the instruction.

The dispatch PLA examines I-Box state, instruction stream data, and other microprocessor states to predict the next hardware-supplied microaddress for the microsequencer. This PLA is self-timed and evaluates in slightly under one clock phase.

The I-Box instruction prefetch queue operates in parallel with the instruction execution hardware on the chip. Whenever a longword in the instruction prefetch queue is empty, the I-Box issues a request to the M-Box to read the next aligned longword in the instruction stream. If the M-Box and BIU are not doing some other read or write operation, they will fetch the requested longword and send it to the instruction prefetch queue.

When a microinstruction that loads the program counter register is detected, for example, during a branch instruction, the prefetch queue is flushed. A new instruction must then be fetched before the processor can proceed.

Up to three prefetched longwords of instruction stream data can be queued by the prefetch queue. In addition, the prefetch queue rotates the instructions to bring the opcode to the front and extracts in-line instruction stream data for use by the E-Box.

Execution Unit (E-Box)

The main functional blocks in the execution unit, the E-Box, are the register file, program counter (PC), constant generator, shifter, and arithmetic and logical unit (ALU). The data path has two precharged 32-bit read buses, called the A and B buses, and a static write bus, called the W bus. The functions performed by the E-Box during a cycle are determined by the current microinstruction and internal state. Following are descriptions of each of the main functional blocks.

The register file contains 31 single-read-port/single-write-port registers and 8 dual-read-port/single-write-port registers. The register file is used in the data path where compact layout is especially important. Therefore, to save chip area the register file cell was designed using an NMOS pass gate rather than a full transmission gate.

The 32-bit PC is located in the data path along with the program counter adder. This adder is used to increment the PC as macroinstructions are parsed.

Literals can be introduced into the data path by conditionally discharging the precharged A or B bus lines.

The shifter function is implemented as a data extractor rather than a full shifter, which would require more hardware. The extractor can extract 32 contiguous bits from a 64-bit field. When the values on the input buses are identical, the high-order bits appear to wrap around to the low-order positions, thus mimicking a full shifter.

The shifter has the two 32-bit precharged read buses (the A and B buses) as inputs and a 32-bit output. The shifter is implemented using NMOS transistors. The control diagonals are run in polysilicon strapped by metal at both ends.
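The behavior of the data extractor described above can be modeled in a few lines. The following Python sketch is illustrative only and says nothing about the circuit itself. The function names are hypothetical, and it assumes that the A bus supplies the high-order longword of the 64-bit field, which the text does not state.

```python
MASK32 = 0xFFFFFFFF

def extract32(a, b, pos):
    """Return the 32 contiguous bits starting at bit `pos` of the 64-bit
    field formed from two 32-bit inputs.

    Assumption: `a` is the high-order longword and `b` the low-order one.
    """
    if not 0 <= pos <= 32:
        raise ValueError("pos must be in the range 0..32")
    field = ((a & MASK32) << 32) | (b & MASK32)
    return (field >> pos) & MASK32

def rotate_right32(x, n):
    """Identical inputs: bits leaving one end reappear at the other."""
    return extract32(x, x, n % 32)

def shift_right32(x, n):
    """Zero on the high input gives a logical right shift (0 <= n <= 32)."""
    return extract32(0, x, n)

def shift_left32(x, n):
    """Zero on the low input gives a left shift (0 <= n <= 32)."""
    return extract32(x, 0, 32 - n)

print(hex(rotate_right32(0x80000001, 1)))
print(hex(shift_right32(0x80000000, 4)))
print(hex(shift_left32(0x00000001, 31)))
```

The point of the sketch is that a single extract operation covers rotates and both shift directions simply by choosing what is placed on the two input buses, which is why an extractor can stand in for a full shifter.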


Because the RC delay in asserting the control lines is long, the control lines are driven before the input data is valid. The inputs are then conditionally pulled low, discharging the outputs.

The ALU in the data path is capable of addition, subtraction, and a variety of logic operations. The ALU also includes a 1-bit left/right shifter and additional logic to support multiply and divide operations. The ALU is implemented using a carry-lookahead scheme with propagate and generate logic.

The ability to read the register file, do an ALU or shift operation, and write the result back into the register file all in one cycle is important to the machine's performance. This critical path was alleviated by partially overlapping the register file write with the next register file read. The partial overlap introduces a race between the write and the read, but the circuit delay in asserting the read select line is sufficient to ensure that the race is always won without extending the cycle time.

Memory Management Unit

When memory management is enabled, the M-Box uses a fully associative translation look-aside buffer (TB) to translate virtual addresses to physical memory addresses. The major design goal for the M-Box was to achieve a TB miss rate that was one third that of the MicroVAX 78032 CPU chip. Consequently, we increased the size of the TB from 8 to 28 page table entries (PTEs). Furthermore, we used a more efficient microcode routine to reduce the number of cycles required to fetch a PTE on a TB miss. A PTE is composed of the higher order bits of the physical address, the access protection field, and other memory management information. In the MicroVAX 78032 CPU chip, a least-recently-used algorithm was employed to replace the PTE on a TB miss. However, the implementation of this algorithm requires complex circuits and a large amount of chip area as the TB size is increased. For this reason, we implemented a simpler but almost equally efficient not-last-used algorithm in the CVAX 78034 CPU chip.

To realize a single-cycle cache read operation, both a virtual-to-physical address translation and a check of the access protection field of the PTE must occur in just two clock phases. However, there is not enough time to check the access protection field after the translation has occurred. Therefore, all access protection fields in the TB are simultaneously compared to the current access type while the translation is in progress. This scheme requires that the access protection field be fully decoded before it is stored in the TB.

In addition to interacting with the cache, the M-Box interfaces with the BIU and the I-Box. The M-Box contains three registers: the virtual address (VA) register, the virtual address prime (VAP) register, and the virtual instruction buffer address (VIBA) register. After a data read or write using VA or VAP, VAP is loaded with the most recently used address plus four. In this way, VAP can quickly generate sequential longword addresses. During a memory operation, the M-Box sends the address to the cache and BIU. The M-Box will forward data from the E-Box during the next cycle if the operation is a write, or capture data for the E-Box if the operation is a read.

Whenever there is space available for a longword in the I-Box prefetch queue, the I-Box requests instruction stream data. If the M-Box does not decode a memory read or write request from the current microinstruction, it services the instruction stream read request using the virtual address stored in the VIBA register. After a prefetch reference succeeds, the VIBA register is incremented by four in preparation for the next prefetch.

Bus Interface Unit

The bus interface unit, the BIU, controls external chip operations, internal cache access and refresh, and arbitration for the internal data and address bus. The BIU contains two state machines.

• The internal state machine controls the arbitration for the internal data and address bus (IDALs).

• The external state machine controls the arbitration for the external pins and DALs.

The design goal was to achieve a single-cycle read operation for hits to the internal cache and a two-cycle write operation for an ideal memory subsystem. In addition, better system reliability is achieved by providing parity protection on all the external data transfers and internal cache read/write operations.
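To make the parity protection mentioned above concrete, the sketch below computes one parity bit per byte of a 32-bit longword, matching the four parity lines DP<3:0> that Figure 1 shows alongside the 32-bit DAL. It is a minimal illustration, not the chip's scheme: the paper does not state the parity polarity or the mapping of parity bits to bytes, so the even-parity default, the bit ordering, and the function names are all assumptions.

```python
def byte_parity_bits(longword, odd=False):
    """Return four parity bits for a 32-bit longword.

    Bit i of the result covers byte i (bits 8*i .. 8*i+7) of the longword.
    Assumption: even parity by default, so each parity bit is 1 when its
    byte holds an odd number of ones. Pass odd=True for the opposite sense.
    """
    bits = 0
    for i in range(4):
        byte = (longword >> (8 * i)) & 0xFF
        p = bin(byte).count("1") & 1
        if odd:
            p ^= 1
        bits |= p << i
    return bits

def parity_ok(longword, parity, odd=False):
    """Check received data against the parity bits that accompanied it."""
    return byte_parity_bits(longword, odd) == parity

data = 0x01030007
parity = byte_parity_bits(data)
print(f"parity bits: {parity:04b}")
print("clean transfer ok:", parity_ok(data, parity))
print("one flipped bit ok:", parity_ok(data ^ 0x1, parity))
```

Per-byte parity catches any single-bit error within a byte and identifies which byte was affected, but it cannot correct the error or detect an even number of flipped bits in the same byte.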


To accomplish a single-cycle read operation, the two state machines were implemented as self-timed PLAs that require just one phase to evaluate. The separation of control operations between the two state machines allowed the PLAs to operate in different phases. Read/write-related, internal time-critical signals are generated by the internal state machine. This state machine evaluates first, stalls the CPU if necessary, controls the cache, and sets states for the external state machine. Time-critical external strobes are controlled by the external state machine. The external state machine operates next, controls the termination of external operations, clears the internal state machine flags, and grants control of external buses and strobes to external devices.

On a cache miss, the external state machine unconditionally drives the external read data to the M-Box or the I-Box, and a phase later the state machine validates the data. This scheme made it possible to service the next microinstruction while the previous one was completing.

The BIU also controls all memory transactions. A memory read operation is performed in one cycle if there is a hit in the internal cache and no cache parity error is detected. However, when a cache miss occurs during a read operation, a two-longword block in the cache is allocated to store the data, which must now be read from memory. The BIU stalls the CPU until the first longword of data is received. The BIU initiates the external read cycle, sending the address of the first longword to the external memory system. When the first longword of data is received, the BIU sends it to the cache and E-Box or I-Box, and unstalls the CPU. The fetch of the second longword is overlapped with other chip activity to minimize the effective memory access time. The second longword of data is written into the alternate longword in the allocated quadword (two longword) cache block. The cache block is validated only if both longwords in the block are fetched successfully.

The BIU contains a longword write buffer which supports a dump-and-run write mechanism. Chip activity, including cache reads, can proceed in parallel while the BIU is waiting for the completion of a write operation. The BIU may have up to three different operations in progress at once: a write to memory, a read from memory, and an internal cache entry invalidation. Descriptions of these operations in the BIU follow.

While a write to memory is awaiting completion, the internal state machine can service read requests. If the read reference misses the cache, it is queued and serviced only after the write operation completes. This overlapping of read and write operations reduces the number of memory stall cycles, resulting in a lower TPI.

To facilitate support for multiprocessor applications and DMA activity, the BIU provides a protocol for internal cache coherency. To activate this function, an external device first gains ownership of the external address and data bus by means of the DMA request and grant protocols. The device then presents an address, qualified by certain strobes, to the processor. The processor latches the address and then performs a cache look-up. If a cache hit occurs, the matching cache entry will be invalidated.

Eight pins are dedicated to the floating point interface. To optimize the operand transfer rate between the CVAX 78034 CPU and its floating point processor, both chips read the floating point operands from memory simultaneously.

Cache

The goals for the design of the internal cache were twofold: to reduce the memory access time to one microcycle for data that is resident in the cache; and to minimize the number of cache references that miss the cache.

To achieve the one-microcycle access time, the internal cache is designed to perform the cache look-up in parallel with the translation buffer look-up. This scheme uses the 9 virtual address bits that do not change during the address translation process to index into the array. Because the cache look-up and translation buffer look-up are performed in parallel, the data for the selected cache entry is ready when the translated address is being latched into the tag comparator. The cache tag is then compared to the translated address. If a match occurs, the data is driven onto the IDAL before the end of the cycle.

To achieve our second goal - minimization of the number of cache misses - we used a two-way set associative cache with a block size of 8 bytes. This two-way set associative cache was designed to meet both performance and chip size requirements. First, a random replacement algorithm was selected to reduce circuit complexity with a minimal impact on cache performance. With reference to chip size, we determined that a cache size of 1KB was the largest that could be used. In addition, the cache is designed so that it can be configured by software to act as an instruction-only cache or as an instruction and


data cache. The instruction-only option was provided to simplify hardware in multiprocessor systems where the designers do not want to deal with DMA invalidates.

The cell chosen to implement the cache array is a one-transistor (1T) dynamic RAM. The 1T cell, illustrated in Figure 4, was chosen because of its small area. A comparable array design with either a four-transistor dynamic RAM or a six-transistor static RAM cell would have required 2.4 to 3 times as much area. The storage capacitance of the 1T cell is 110 femtofarads, resulting in a bit-line to cell-capacitance ratio of 8 to 1. With a folded bit-line structure and the use of a dummy cell (which stores half the charge of the storage cell), a voltage differential of 200 millivolts was realized at the sense amplifiers. Because of the dynamic nature of the 1T cell, a refresh counter, composed of linear feedback shift registers, was designed to control which row is refreshed during idle cache cycles.

We designed byte parity into the cache to detect data corruption resulting from either soft or hard errors. A study was done to determine the soft error rate of the cell. The soft error rate for the cache array was found to be 10 FITs, where 1 FIT is equal to 1 failure in one trillion operating hours. To protect against data corruption due to minority carrier injection, the array is surrounded by a deep N-type implant ring.

The CVAX CPU chip is the first microprocessor in the industry to include an on-chip dynamic 1T cell cache.

Control Store and Microsequencer

The operations and interactions of the five functional blocks described so far are all controlled by microcode in the control store. The microsequencer supplies the microaddress to the control store. The control store contains 1,600 words of read-only memory. Each 41-bit word is divided into a 28-bit field, which controls the execution sections of the chip, and a 13-bit field, which controls the microsequencer. Control store access is achieved in less than three clock phases.

The control store is organized into 200 rows of 8 words each. H-shaped cells, 7 by 8 microns in size, are used to implement the array.

A microaddress is supplied to the control store by the microsequencer by means of the 11-bit microaddress bus (bits 10 through 0). Eight of these bits, 10 through 4 and 0, select one of the 200 rows. Selection of a row causes all eight words to be driven onto the precharged bit lines which form the inputs of an 8 to 1 multiplexer. The three remaining microaddress bits, 3 through 1, choose one of these eight microwords to be driven onto the microinstruction bus. The final value of bits 3 through 1 can be modified by values on the microtest bus. This 3-bit bus conveys state information from other sections of the chip to the microsequencer. In this way, various processor states may be polled to enable up to an eight-way microcode branch.

The primary function of the microsequencer is to supply microaddresses to the control store. The microsequencer selects a microaddress based on microcode control and external control from the testability logic. In addition to generating microaddresses, the microsequencer receives exception request lines from other sections, prioritizes these requests, and generates base addresses for microcode exception service routines. These base addresses can be modified by the section signaling the exception by means of the microtest bus.

The microsequencer contains a last-in-first-out (LIFO) queue of eight microaddress entries called the microstack. A latched copy of the microaddress bus is stored on the microstack when a microcode exception occurs. Once the exception has been serviced, this latched copy allows reexecution of the microinstruction that caused the exception. In the case of a microcode subroutine call, the current microaddress is incremented and stored on the microstack. This forms the address when returning from the subroutine.

Testability Issues

As a complex microprocessor chip, the CVAX 78034 CPU chip has some difficult testability issues. A large number of internal state bits and buses are not normally visible at the pins of the chip. Early in the design process, techniques such as level-sensitive scan design (LSSD) and built-in self-test were eliminated as possible testability strategies. Both of these strategies would have had a significant impact on chip area and performance. Instead, an ad hoc method of design for testability was developed [9].

The design for testability strategy has two main themes: (1) make maximum use of existing hardware for test observability and controllability, and (2) add special test hardware to those areas


[Figure 4 is a layout drawing of four adjacent cache cells. Labeled features include the bit lines (metal 1), the polysilicon word lines, the metal 2 word line strap, the VDD poly plate, and the access device. Each division of the scale is 1.0 µm. The key identifies N+ source/drain diffusion, metal 1, metal 2, polysilicon, and metal 1 contact.]

Figure 4 Cache 1T Dynamic RAM Cell (Four Cells Shown)


of the chip where observability or controllability would not otherwise be possible.

The chip already had some important features that could be exploited.

• The chip is controlled by the microcode contained in the control store. Thus, it is an obvious candidate for controlling the chip when in test mode.

• Many of the internal registers are readable and writable from the internal buses. By transferring this read and write data to the main bus that connects to the pins (the DALs), much of the internal state can be observed and modified.

• The interface for the floating point coprocessor chip contains a mode that broadcasts a value from the internal cache or register file to the pins. This mode is also used during test for cache and register file observability.

These features alone were not enough, however, and some specialized test hardware had to be added.

• To make use of the chip microcode in test mode, it is necessary to be able to externally choose the addresses of the microword to be executed. Thus, a test mode was added to the microsequencer. In this mode, the microsequencer ignores its normal choice for the microaddress and uses the value from a group of pins.

• The cache is difficult to test in its normal operating mode. To overcome this, a special cache diagnostics mode was developed.

• Some special test microcode was added to allow more efficient testing of some areas.

• A few major internal buses were not observable. Dual mode linear feedback shift registers (LFSRs) were added to these buses: the output of the I-Box instruction decode ROM, the microinstruction bus, and the microtest bus.

The cache refresh address counter is also implemented as an LFSR.

The dual mode LFSRs allow the data bus to be captured and scanned out serially. Alternatively, the data can be compressed every cycle using the linear feedback technique. The outputs of the LFSRs are inputs to another LFSR that combines the data to a single-bit output stream. In this manner, all of the LFSRs may be observed at once. In addition, all of the LFSR outputs are fed into a multiplexer that allows any one of the registers to be observed.

The test logic requires only one dedicated test pin to select test mode and uses less than 2 percent of the chip area. Moreover, inclusion of this logic does not affect chip performance. When in test mode, 3 to 15 other pins are redefined for test functions. A 4-bit test-mode configuration register selects which of the LFSRs is to be observed, whether the LFSRs will be in scan or compress mode, and whether or not test broadcast mode is enabled.

The Role of Simulation and Modeling

Complexity was managed and detailed circuit behavior was predicted through the use of models and simulation. During the design, the chip was modeled at five levels of abstraction. As the design progressed from concepts to implementation, the level of abstraction was refined to reflect the increasing detail of the design.

Choosing the Microarchitecture

The performance model was the earliest and the most abstract of all the models. The performance model was used to predict the machine's performance and to quantify the speed advantage of the various microarchitectural options under consideration. Written in PL/I, the performance model was driven by trace files. These files consisted of streams of opcodes and operand specifiers derived by running typical VAX applications programs. The pseudo-microcode contained in the model approximately modeled memory request patterns and microinstruction counts for each type of VAX instruction. As we had planned, the performance model did indeed help predict the machine's TPI. Moreover, the model also helped identify performance bottlenecks in the microarchitecture.

As noted in the section Project Goals, performance is inversely proportional to the product of the TPI and cycle time. Specifically, the cycle time depends on the delay through the critical speed circuits. Therefore, to identify the critical circuits and determine the propagation delays through the circuits, we carried out cycle time feasibility studies. SPICE, a circuit-level simulator, was used in these studies. With the chip die

size as a given requirement, we determined the microarchitecture of the machine by selecting those features that minimize the product of TPI and the cycle time.

Verification of the Microarchitecture

Once the microarchitecture was defined, a detailed specification was written for each section of the chip. Next, an abstract behavioral model was written to verify that the specification described a VAX CPU. Much more detailed than the performance model, this model was controlled by microcode, ran real VAX code, and closely modeled the major chip buses, global signals, and clocks. The model was written in Digital's DECSIM behavioral modeling language. Many microcode and microarchitecture bugs were identified and fixed as a result of this behavioral model testing.

Logic and Circuit Design

The detailed logic and circuit design began while the abstract behavioral model was being written. During this phase of the design, SPICE simulations were used extensively to predict circuit behavior. Because SPICE simulates transistor behavior in detail, it requires a large amount of computer resources. Consequently only critical circuits were simulated and these were often simplified to contain only the essential elements. Circuit simulations typically involve tens of transistors rather than hundreds or thousands.

Verification of Logic - Gate Level

The abstract behavioral model had been used to verify the specification. Now it was necessary to verify the implementation of the specification. To make this verification, we wrote a schematic-level behavioral model that captured the logical and timing characteristics of every schematic. Almost every node was modeled explicitly. This essentially gate-level model was also written in the DECSIM language. The model identified many logic and timing bugs, especially between schematics designed by different engineers.

The schematic-level behavioral model was subjected to intensive verification because it offered a good compromise between implementation detail and simulation efficiency. This model of the CVAX 78034 CPU chip was used by the system designers in other design teams to model the interaction of the CPU with other chips in board designs.

Verification of Logic - Transistor Level

The DECSIM simulation tool also supports MOS transistor-level modeling. We used this tool as a switch-level simulator, that is, we modeled transistors as open or closed switches. The model was automatically generated from the schematic database.

This level of modeling reflected the true behavior of the schematics with greater subtlety than the schematic-level behavior model. However, this model was not nearly as computationally efficient as the behavioral model.

DECSIM MOS modeling identified sequencing errors, charge sharing problems, sneak paths, and race conditions that the more abstract models had failed to detect.

Physical Technology

The CVAX 78034 CPU chip is implemented in a P-EPI, N-well CMOS (complementary metal-oxide-semiconductor) process developed in-house. The process has two layers of aluminum interconnect and a single layer of polysilicon. The critical process dimensions and chip characteristics are summarized in Table 2.

The chip contains 180,000 transistor sites with 134,000 actual transistors, and measures 9.7 mm by 9.4 mm on a side. (See Figure 2.) It is packaged in an 84-pin surface-mountable ceramic chip carrier with 50-mil leads, uses a single +5 volt supply, and has a worst-case power dissipation of 1.5 watts.

Table 2 CVAX Chip Process

Fabrication Process
  Fabrication process    CMOS
  Gate oxide             300 Å
  Substrate              N-well in P-EPI
  Device types           N-channel enhancement MOSFET;
                         P-channel enhancement MOSFET

Interconnect Pitches (Line/Space Drawn)
  Polysilicon            2 micron/2 micron
  Metal 1                4 micron/2 micron
  Metal 2                5 micron/2 micron
  Contacts               2 micron/2 micron
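
The selection rule stated under Choosing the Microarchitecture (performance is inversely proportional to the product of TPI and cycle time) can be illustrated with a minimal sketch. The function names and the two candidate options below are hypothetical and the numbers are invented purely for illustration; they are not figures from the CVAX project. TPI is computed as the authors' notes define it: total clock cycles divided by total instructions executed.

```python
def ticks_per_instruction(total_cycles, total_instructions):
    """TPI: clock cycles needed to run a workload divided by instructions executed."""
    return total_cycles / total_instructions


def time_per_instruction_ns(tpi, cycle_time_ns):
    """Average time per instruction; performance is inversely proportional to this."""
    return tpi * cycle_time_ns


def pick_option(options):
    """Choose the candidate that minimizes the product of TPI and cycle time."""
    return min(options, key=lambda o: time_per_instruction_ns(o["tpi"], o["cycle_time_ns"]))


# Hypothetical candidates (illustrative numbers only, not from the paper).
candidates = [
    {"name": "option_a", "tpi": ticks_per_instruction(1000, 100), "cycle_time_ns": 100},
    {"name": "option_b", "tpi": ticks_per_instruction(1100, 100), "cycle_time_ns": 80},
]

best = pick_option(candidates)
print(best["name"])
```

In this invented case the option with the higher TPI still wins because its shorter cycle time more than compensates, which is the kind of trade-off the performance model and the SPICE feasibility studies were used to evaluate together.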


Figure 5 MicroVAX 3500/3600 and MicroVAX II System Benchmark Comparison (chart not reproduced; horizontal axis: performance ratio, MicroVAX 3500/3600 to MicroVAX II; vertical axis marked from 5 to 25)

Summary

The CVAX 78034 CPU chip met the project design goals. Depending on the benchmark or application program being run, the performance of the MicroVAX 3500/3600 systems is 2.6 to 4.1 times that of the MicroVAX II computer. (Refer to Figure 5.) This performance increase was achieved by reducing both the TPI and the machine cycle time.

The main factors influencing TPI are the 1KB on-chip cache; the 64KB on-board cache; and the 28-entry virtual-to-physical address translation buffer. The cycle time was reduced as a result of the advanced process technology chosen and the architectural and circuit innovations made by the design team.

Acknowledgments

The authors wish to acknowledge the technical contributions of D. Archer, D. Bhavsar, W. Bidermann, S. Carroll, D. Deverell, J. Keefe, S. Martin, A. Olesin, S. Persels, J. Reinschmidt, L. Rozek, P. Rubinfeld, M. Schenstrom, D. Schumacher, B. Supnik, J. St. Laurent, T. Thrush, and B. Worster.

Notes and References

1. D. Dobberpuhl, et al., "The MicroVAX 78032 Chip, A 32-Bit Microprocessor," Digital Technical Journal (March 1986): 12-23.


2. J. Beck, et al., "A 32-Bit Microprocessor with On-Chip Virtual Memory Management," IEEE International Solid-State Circuits Conference Digest of Technical Papers (1984): 178-179.

3. To compute clock ticks per instruction (TPI), typical application programs and benchmarks are first run. Then the number of clock cycles required to execute these programs is divided by the total number of instructions executed. The result of the computation is the TPI.

4. VAX Architecture Handbook (Maynard: Digital Equipment Corporation, Order No. EB-19580, 1981).

5. D. Archer, et al., "A 32-Bit Microprocessor with On-Chip Instruction and Data Caching and Memory Management," IEEE International Solid-State Circuits Conference Digest of Technical Papers (February 1987): 32-33, 329.

6. P. Rubinfeld, et al., "The CVAX CPU, A CMOS VAX Microprocessor Chip," Proceedings of the 1987 IEEE International Conference on Computer Design: VLSI in Computers and Processors (October 1987): 148-152.

7. D. Archer, et al., "A CMOS VAX Microprocessor with On-Chip Cache and Memory Management," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5 (October 1987): 849-852.

8. D. Archer, "The Instruction Parsing Microarchitecture of the CVAX Microprocessor," Proceedings of the 20th Annual Workshop on Microprogramming (December 1987): 147-153.

9. D. Bhavsar and D. Miner, "Testability Strategy for a Second Generation VAX Microprocessor Chip," International Test Conference (September 1987): 818-825.

Edward J. McLellan
Gilbert M. Wolrich
Robert A. J. Yodlowski

Development of the CVAX Floating Point Chip

The CVAX floating point accelerator (CFPA) chip is a CMOS floating point coprocessor for the CVAX system. The purpose of the CFPA project was to provide gains in floating point performance equal to those of the CVAX CPU for integer performance. Combined with an aggressive schedule, the primary goals required the CFPA chip to perform at three times the level of the previous generation MicroVAX floating point unit (FPU) and to be complete two years after delivery of the MicroVAX II system. Designers obtained a performance gain of only 25 percent through base technology improvements. Consequently, most gains are achieved through the use of a multiplier array, improved arithmetic algorithms, and a fast and efficient interface with the CPU.

Functional Overview

The CFPA VLSI chip is the companion floating point processor for the CVAX CPU. The chip's hardware structures and algorithms provide high overall system performance. In all, the chip executes 76 instructions.

The CFPA supports

• Three VAX floating point data types: F_floating, D_floating, and G_floating

• Floating point calculations, which include a polynomial evaluation instruction

• Integer multiply and divide instructions

• Conversion between integer and floating point data types

• Complete detection of all exception conditions

The CFPA operates synchronously with the CPU at speeds of 80 and 90 nanoseconds (ns) per cycle. Opcode, control, and status information is communicated between the coprocessor and the CVAX by means of a dedicated 8-bit bidirectional coprocessor bus.

Table 1 lists the CFPA physical characteristics.

Table 1 CFPA Physical Characteristics

  Number of transistors   65,000
  Package                 68-pin surface-mountable chip carrier with
                          50-mil lead spacing and heat sink
  Die size                7.3 mm x 9.1 mm
  Power dissipation       1 W
  Fabrication process     2 micron drawn, N-well, dual aluminum CMOS

CFPA Project Goals

The two main goals of the CFPA chip design project were (1) to provide the CVAX system with an improvement in floating point performance to equal the central processor chip's expected performance level for integer operations, and (2) to adhere to the same development schedule set for the CVAX CPU chip. Specifically, these goals required instruction execution times to be three times faster than the MicroVAX FPU on average. Further, the schedule allowed little time to achieve these significant performance gains; the design would have to be completed only two years after the MicroVAX II system design.

In order to improve computer performance, the clock frequency and/or the amount of work completed in a cycle must be increased. The CVAX CPU uses the improved speed characteristics and greater density of the CMOS process to reduce the clock cycle time from 200 ns in the MicroVAX II design to 80 or 90 ns. A pipelined architectural approach was necessary to achieve


this reduction. In particular, while the arithmetic and logic unit (ALU) operates on one microinstruction, the register file is free to access data for the next microinstruction. This improvement allows more work to be completed in each microcycle and offers a reduction in the cycle time as well.

The previous generation floating point design, used in the J-11 FPA as well as the MicroVAX II and VAX 8200/8300 systems, already pipelined register file access with ALU operations. This pipelining was necessary to allow a 100-ns cycle time (twice the frequency of the companion CPUs) in the ZMOS process technology. Since the pipelined register/ALU operation was already achieved, the improvement in cycle time for the CFPA is limited by the speed of the ALU and does not benefit from additional pipelining. The improved technology allowed for an ALU implementation that provides a 20 percent decrease in cycle time, matching the CVAX microcycle. Therefore, the necessary performance increases for the CFPA would not be created by scaling the cycle time. Instead, CFPA designers would make improvements in the amount of work done per microcycle and in the interface between the processor and the floating point chip. This interface is described in the following section.

An overview of the chip's overall performance is presented in the section CFPA Performance at the end of this paper.

Processor-to-bus Interface

In addition to the CVAX system bus used to transfer floating point data, a dedicated 8-bit bidirectional coprocessor bus is used to communicate between the CVAX and the CFPA. An example of the CFPA system configuration is shown in Figure 1.

Figure 1 CFPA Example System Configuration (block diagram not reproduced; labeled elements: CPU, FPU, CVAX bus, 64KB second-level cache RAM)

The CFPA normally monitors the coprocessor bus for opcode and operand information until it is ready to drive a result back to the CVAX. After decoding an opcode, the CFPA monitors control signals on the bus that indicate the presence of an operand. Operands may come from a CPU general register, internal cache location, or from the memory system. When operands are transferred from CPU general registers or internal cache locations, the data is transmitted directly between the CVAX and the CFPA. Operands from external memory or cache locations are indicated on the coprocessor bus at the start of the external memory access. The CFPA then monitors the CVAX system bus and latches the returning data without CVAX intervention.

After supplying operands to the CFPA, the CVAX relinquishes control of the coprocessor bus to receive the result status of the floating point operation. Control of the coprocessor bus, however, does not imply control of the CVAX system bus. The CFPA ensures availability of the CVAX system bus by monitoring the direct memory access (DMA) grant signal from the CVAX. If a DMA has been granted, the floating point result status will be retransmitted until the DMA operation is complete. Receipt of the floating point status while the DMA grant signal is deasserted guarantees availability of the CVAX system bus for the next cycle. Control of the coprocessor bus is returned to the CVAX after successfully driving floating point status. The CFPA drives the result data on the CVAX system bus one cycle later, completing the operation.

Floating point instruction latency comprises overhead devoted to opcode, operand and result transfer, and actual computation, or execution time. Due solely to improvement in CVAX cycle time (from 200 ns in MicroVAX systems to 80 or 90 ns in CVAX systems), overhead times are improved by factors of 2.5 or 2.2, respectively. Designers achieved additional improvements in the interface by reducing the actual number of cycles required for these overhead transfers. As compared to the MicroVAX II system, the CVAX system requires fewer cycles to access and transmit register and internal cache operands located on the chip. Moreover, external cache and memory operands are input directly from the CVAX system bus as opposed to being fetched by the CPU and later retransmitted to the FPU as in the MicroVAX II system. The resulting interface improves performance by a factor of approximately 2.5 (90-ns cycle) to 2.8 (80-ns cycle) over the MicroVAX II system.


Despite these improvements, more than half the cycles required to execute a floating point instruction in the CVAX system can still be attributed to overhead costs. The possibility of pipelining macroinstructions (overlapping the operand fetches of the next instruction with execution of the current instruction) as well as operand forwarding was studied. In such a system the effective instruction time is determined by the longer of the operand transfers or the actual floating point execution time. Instruction time is not determined by the additive effect of the interface and execution. The one-instruction macro pipeline interface was rejected due to the risk and complexity of the design. Moreover, performance goals had already been met and development time was at a premium.

Algorithms

Although the interface figures prominently in the achievement of overall performance targets, most of our design efforts were focused on the actual execution unit. To maintain and even increase the benefits gained by the interface design improvements, we needed an equal or greater improvement in execution times. Since the most important instructions for a floating point unit are addition/subtraction, multiplication, and to a lesser extent division, designers set about optimizing these instructions. The remainder of instructions implemented by the CFPA also benefit from the shift, multiply, and divide optimizations and demonstrate performance gains relative to the MicroVAX II FPU as well. Finally, all instructions gain from microcode improvements in atypical case handling and from faster code entry and exit techniques.

Multiplication

Floating point multiplication consists of multiplication of the fractional, or mantissa, portions of the operands and the summation of the corresponding exponents. Many multiplication techniques have been developed and implemented to increase the speed of this frequently executed instruction. Perhaps the best technique for VLSI implementation at this time is the multiplier array. The array is particularly well suited for VLSI implementation due to the array's regularity of circuit connections which allow for a very compact and repeatable cell design.

The process of multiplication involves a series of additions. It is possible to delay the carry propagation necessary to complete these additions until the final sum is formed through the use of carry save adders. Multiplier arrays consist of rows of carry save adders which add in a new multiple of the multiplicand at each row. The carry save adders produce a result, or partial product, consisting of two outputs, the carry and the sum; if added, the two outputs represent a single number equivalent to the partial product at that step obtained using full propagation addition. By deferring the final summation of the sum and carry words, the comparatively time-consuming carry propagation addition need be performed only once to produce the result.

The only drawback to the multiplier array is the large percentage of chip area devoted to this one operation. Nevertheless, the magnitude of performance gain warrants the use of an array in any high-performance computation unit.

Another common method used to improve the processing of multiplications involves multiple-bit Booth encoding. This method, which requires significantly less hardware, is aimed at reducing the number of partial products needed to be formed. The multiplier operand is encoded, or recoded, as a control pattern used to determine a sequence of shift and add or subtract operations on the multiplicand. Multiple bits of the multiplier can then be retired in a single operation. This method of reducing the number of multiplication steps can be employed either with or without an array structure.

The previous generation MicroVAX FPU executes multiplication using a fixed, 3-bit-per-cycle Booth algorithm without the use of a multiplier array. Single-precision multiplication requires 8 cycles to compute 25 product bits; D_floating and G_floating double-precision formats require 19 and 18 cycles to produce the necessary 57 or 54 product bits. Additional cycles are needed to set up the multiply loop, calculate the initial partial product based upon the multiplier least-significant bit (LSB), and round and normalize the final product.

The CFPA multiply algorithm takes advantage of the greater density and transistor count afforded by the CMOS process. The CFPA implements a multiplier array, which consists of four rows of 65 carry save adders. The multiplicand select logic associated with each row of the array as well as the interconnect between the rows is configured to implement a 2-bit Booth encoding. As a result of this configuration, 8 product bits are completed per pass through the array. Single-precision multiplication requires three passes


through the array, and double-precision requires seven passes to complete.

The array can be evaluated twice per cycle. Therefore, single-precision multiplication requires one and one-half cycles, and double-precision D_floating and G_floating formats require three and one-half cycles of processing in the array. Before running the array, one-half cycle is needed for set up and initial product calculation. After the multiplier array completes, a cycle is used to complete the full carry propagate add, which combines the final carry and sum outputs of the array. This cycle is followed by a normalization cycle during which valid status is returned to the CVAX.

When we compare the MicroVAX II system to the CFPA, the number of cycles required to complete a MULF instruction has been reduced from 14 to 4 (a ratio of 3.9 to 1 at 90 ns, 4.4 to 1 at 80 ns); to complete MULD or MULG instructions, the reduction is from 26 to 6 (4.8 to 1 at 90 ns, 5.4 to 1 at 80 ns). If we include operand transfers and count each interface cycle of the MicroVAX II system as equivalent to two CVAX cycles, however, the reduction in the total number of cycles for MULF is from 27 to 9 (3.3 to 1 at 90 ns, 3.8 to 1 at 80 ns); and for MULD, from 43 to 14 (3.4 to 1 at 90 ns, 3.8 to 1 at 80 ns) for register-mode instructions. When operands are read from or written to memory, the overhead support percentage becomes an even greater factor; and the impact of the actual CFPA multiplication speed is reduced.

To further increase performance, we considered an array of sufficient size to complete single-precision multiplication in a single pass and double-precision multiplication in two passes. However, such an array would require three times the chip area for a 2-bit algorithm. A 3-bit-per-row multiply would require 8 rows to complete single-precision multiplication in one pass and 9 or 10 rows to complete double-precision multiplication in two passes, as well as an adder to calculate the multiplicand factor of 3. Either of these alternatives, if feasible, would save only one cycle in single-precision (a reduction from 9 to 8, or 11 percent) and two cycles in double-precision multiplication (14 to 12, or 14 percent). In addition to the area requirements, the circuit design difficulty and risk involved to implement a larger array were deemed much too great for the limited gains. We therefore chose to trade off these smaller gains in favor of a partial array of 4 rows of 2-bit-per-row retirement requiring only 1.3 mm of chip height. The result is a three and one-half to four times gain in the overall performance of multiplication.

Addition/Subtraction

Floating point addition involves a series of steps.

1. The exponents are subtracted to determine the shift amount necessary to align the fractions.

2. The fraction operand with the smaller exponent is shifted into alignment and added or subtracted.

3. The result is shifted back to the normalized form (1/2 ≤ result < 1.0). Normalization shifting is accompanied by exponent adjustment.

4. The result is rounded and checked for overflow or underflow conditions.

Typically, the shifting operations and their control consume large amounts of chip area and potentially a large portion of the total calculation time. An analysis of these operations was used to guide trade-offs in the design of the CFPA. It was noted that although large shifts are sometimes necessary to compute the final result, their frequency of occurrence is very small. Furthermore, a small shifter, capable of covering the vast majority of cases in a single operation, provides the benefit of a small control circuit that can be more easily optimized for speed. It was decided that the speed and area advantages gained by designing for the most frequently occurring cases provided the best solution under project constraints.

Specifically, a small shifter that is capable of left-four to right-seven bit shifts proved to have adequate range for most alignment and normalization shifts. In up to 80 percent of the cases, additional cycles are not needed for alignment shifting. Larger alignment shifting utilizes the multiplier array for a shift capability of 16 bits per cycle. The array minimizes the worst-case shift time without requiring a large shifter. Although it rarely requires additional cycles, normalization shifting may cause a longer latency. Additional cycles, however, are not necessary for normalization in 93 percent of the cases.

To reduce the shifter control complexity, a modified ALU calculates the absolute value of the


exponent difference. The modified ALU does not require additional calculation time to accomplish this calculation. The absolute value result simplifies control logic to enable the alignment shifter to complete in the next clock phase. Only one additional generate term is needed to enable two carry chains executing simultaneously; one calculates A minus B, the other B minus A. The most significant bit (MSB) of the first carry chain determines the sign of the operation. To produce the absolute value or positive result, the MSB of the first carry chain is used to select the final output from the two carry chains. In addition, the MSB is used to select the fraction requiring alignment.

The CFPA completes addition or subtraction operations in three cycles for most cases. This minimum execution time is exceeded for only 25 percent of all addition or subtraction operations, almost all of which require only one additional cycle.

The major improvement over the MicroVAX II FPU in the addition/subtraction algorithm is the elimination of no-operation cycles necessary for control evaluation preceding the alignment and normalization steps. The resultant reduction as compared to the MicroVAX II FPU is from eight cycles to three for both single- and double-precision additions/subtractions in the actual floating point unit calculations (3 to 1 at 90 ns, 3.3 to 1 at 80 ns).

The overall performance gain in equivalent cycles is 20 to 8 for single-precision (2.8 to 1 at 90 ns, 3.1 to 1 at 80 ns) and 26 to 11 for double-precision addition/subtraction (2.6 to 1 at 90 ns, 3.0 to 1 at 80 ns).

Division

Floating point division consists of division of the fraction or mantissa and subtraction of the exponents. Division presents a more intractable problem than multiplication when designing for high-speed performance. The difficulty arises due to the fact that the partial remainder at each step must be examined before the next operation can be determined. Various algorithms have been proposed to reduce the number of arithmetic steps, but no single solution seems to optimize both performance and size constraints.

The CFPA uses a method of division that offers an improvement over single-bit division algorithms, which perform an arithmetic operation to produce a single quotient bit per step. The method calls for shifting over, or normalizing, multiple leading bits when the partial remainder is small. A partial remainder with multiple leading ones indicates a small negative remainder, whereas leading zeros indicate a small positive remainder. Multiple quotient bits can be determined for cycles in which the magnitude of the partial remainder is small. Shift operations replace arithmetic operations on unnormalized remainders, reducing the number of ALU cycles needed to develop the final quotient. This method of division is called normalizing, nonrestoring division and is also used in the MicroVAX FPU. The difference between the two implementations is in the normalization shift range provided for partial remainder and quotient development.

Of course, this algorithm is quite data sensitive. A division that results in a partial remainder of all ones or all zeros can be completed in a minimum amount of time; whereas, if a string of alternating ones and zeros is produced at each ALU operation, the process degenerates to a one-bit-per-cycle pace. The observed average rate for an algorithm that allows unlimited shift range is 2.66 bits per cycle. Unfortunately, the shift range chosen implies a control structure directly between the shift and ALU operations. The time between these operations is critically important to the overall cycle of the chip. We chose 4 bits as the left shift range for the CFPA to reap the maximum benefit from the technique without introducing inordinately difficult control paths between the shift and ALU operations. This amounts to an increase of 2 bits of shift range over the MicroVAX FPU. Correspondingly, the average number of quotient bits developed each cycle increased from 1.5 to 2.4. Expanding the shifter beyond a range of 4 for this method provides a diminishing improvement, as shown in Table 2.

Table 2 Average Quotient Bits per Cycle

  Shifter Range    Average Speed
  2                1.5
  4                2.39
  6                2.54
  8                2.64
  Unlimited        2.66
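
The data sensitivity of this division method and the effect of a limited shift range can be explored with a toy model. The sketch below is a textbook radix-2 recurrence with quotient digits -1, 0, and +1 (in the style of SRT division), in which a zero digit stands for a pure shift and a nonzero digit for an ALU add or subtract. Treating it as equivalent to the CFPA's algorithm is an assumption on our part, as are the function names, the cycle-counting rule (a cycle ends at each ALU operation or when `max_shift` shifts have been used), and the random operand distribution. It is not expected to reproduce the figures in Table 2 exactly.

```python
from fractions import Fraction
import random

HALF = Fraction(1, 2)


def normalizing_divide(dividend, divisor, nbits, max_shift):
    """Develop nbits quotient digits; return (quotient, remainder, cycles).

    Requires 1/2 <= divisor < 1 and |dividend| <= divisor. The results
    satisfy dividend == quotient * divisor + remainder exactly.
    """
    if not (HALF <= divisor < 1) or abs(dividend) > divisor:
        raise ValueError("operands must be pre-normalized")
    partial = dividend            # running partial remainder
    quotient = Fraction(0)
    weight = Fraction(1)          # weight of the most recent quotient digit
    cycles = 0
    shifts_this_cycle = 0
    for _ in range(nbits):
        doubled = 2 * partial     # one-bit left shift
        weight /= 2
        if doubled >= HALF:
            digit = 1             # subtract the divisor
        elif doubled < -HALF:
            digit = -1            # add the divisor
        else:
            digit = 0             # small remainder: shift only
        partial = doubled - digit * divisor
        quotient += digit * weight
        shifts_this_cycle += 1
        # A cycle ends with an ALU operation or when the shifter is exhausted.
        if digit != 0 or shifts_this_cycle == max_shift:
            cycles += 1
            shifts_this_cycle = 0
    if shifts_this_cycle:
        cycles += 1
    return quotient, partial * weight, cycles


def average_bits_per_cycle(max_shift, trials=200, nbits=24, seed=1):
    """Average quotient digits per cycle over pseudo-random operands."""
    rng = random.Random(seed)
    total_cycles = 0
    for _ in range(trials):
        divisor = Fraction(rng.randrange(2**23, 2**24), 2**24)   # [1/2, 1)
        dividend = Fraction(rng.randrange(2**22, 2**23), 2**24)  # [1/4, 1/2)
        total_cycles += normalizing_divide(dividend, divisor, nbits, max_shift)[2]
    return trials * nbits / total_cycles


for shift_range in (1, 2, 4, 8):
    print(shift_range, round(average_bits_per_cycle(shift_range), 2))
```

Running the loop for increasing shift ranges shows the qualitative behavior described above: a range of 1 gives exactly one bit per cycle, and widening the range can only help, with the benefit flattening once most runs of shift-only digits already fit within the range.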


Increasing the number of quotient bits developed per cycle from 1.5 to 2.4 results in increased speeds in the CFPA divide loop relative to the MicroVAX FPU: 1.8 times greater for 90-ns cycles, and 2.0 times greater for 80-ns cycles. The overhead cycles involved in setting up the divide sequence and normalizing the quotient are reduced from 7 to 2. As a result, the CFPA realizes a performance greater than the MicroVAX II FPU in terms of number of cycles reduced for division. Including the processor-to-FPU interface cycles, the number of cycles for single-precision division is reduced from 37 to 18 cycles (2.3 at 90 ns, 2.6 at 80 ns); for D_floating double-precision division, 61 to 35 (1.94 at 90 ns, 2.2 at 80 ns).

Comparatively, this method of division is very efficient, especially when we consider the small amount of control circuitry and data path area required. Designers can increase performance additionally by using algorithms that employ multiples of the divisor, or by implementing a divider array structure. The use of multiples of the divisor requires both additional registers to hold the multiples (3/4, 1, 5/2) and further expansion of the left shift capability to take advantage of the longer normalizations created by this approach (3.6 bits per cycle with left shift range expanded to 6). In addition, the control logic required to support the selection of the proper multiple is more complex and would be much more difficult to implement in the constrained cycle time. The other alternative of executing the divide step in an array structure for performance capable of 3 to 4 quotient bits per cycle involves an even greater cost in hardware and is not consistent with the project goals.

Integer division does not automatically benefit from hardware devoted to floating point division. Since floating point division relies on the normalization of the operands, integer division must either convert operands to the normalized form or accept a slower one-bit-per-cycle algorithm. The CFPA design for integer division normalizes both the divisor and dividend in order to use the 2.4-bit-per-cycle divide algorithm. Normalization of the divisor and dividend proceeds at 5 bits per cycle. The number of quotient bits needed to complete the integer division operation is determined by the difference between the normalization shift amounts of the divisor and dividend. Consequently, integer divides are typically executed at 2.5 bits per cycle as compared to 1 bit per cycle on the MicroVAX FPU.

Microcode Control Structure

The control structure for the CFPA is influenced by two opposing constraints. The complicated requirements of instructions such as extended multiply and integerize (EMOD) and polynomial evaluation (POLY) require the flexibility offered by a microcoded approach. Performance goals, however, require the speed of hardwired control structures to avoid costly delays incurred during microcode branch handling. The final implementation combines a small control PLA (programmable logic array) to provide the flexibility of microcode control with hardware control structures for speed critical paths. These control structures are enabled through the microcode to emulate complete hardwired control for important instructions. The structures provide support for alignment, normalization, multiplication and division steps. Standard microcode control supports the less critical instructions.

Functions are performed under more straightforward microcode control when the code does not penalize the instruction performance. This trade-off simplifies critical circuitry in some instances. The only exception to this rule is in the handling of exception conditions. If an exception condition can be isolated from the normal instruction flow, it is also processed in microcode rather than through the more expensive hardware control.

The use of hardware structures reduces the total number of microcode terms needed to implement the instruction set. This reduction is important to ensure that the microcode PLA can be implemented with an access time of one half cycle. Instructions generally use one code flow for all data types. In addition, similar instructions merge sections of flows to further minimize terms. For example, the add-compare-and-branch (ACB) instruction, which is one of the more complicated instructions implemented by the chip, required only three additional terms beyond the addition and compare instruction flows. Despite this effort, almost one third of the code was devoted exclusively to two instruction types, EMOD and POLY. By splitting, or "folding," the PLA into two half-height interleaved arrays, the target speed was met with a penalty of only a few duplicated terms. In total, 76 VAX floating point


as well as integer multiply and divide instructions are implemented in the CFPA. In comparison to the MicroVAX FPU, the total number of microcode states was reduced by 20 percent, to only 59.

Microprogramming

As mentioned earlier, the use of hardware support contributes to improved performance for most instructions. However, since the CFPA cycle time during execution is very similar to that of the MicroVAX FPU (80 or 90 ns versus 100 ns), we needed further improvement to meet the project goals. Algorithmic improvement in the convert-floating-to-integer (CVTFI) and EMOD instructions provides between three and four times the performance of the MicroVAX FPU for the same instructions. But these gains would hardly translate to improved overall performance when considering the frequency of use for these instructions. Therefore, to reduce cycles for all instructions, we examined transitions during code entry and exit with internal processing. Since the CFPA always receives the opcode in advance of the operands, it is possible to reduce the execution time for all instructions by performing the first step of each operation repeatedly in anticipation of receiving the last operand. In this way, as soon as the interface recognizes that the operand is valid and the control sequencer is able to act on that information, the first step of the instruction is already complete.

In the CVAX system, as in the MicroVAX II system, floating point status must be returned before data can be received. One reason for this return of status is that it prepares the write path back to the general-purpose register file located on the CPU chip. Status conditions must be checked before the result register is written; the register update can thus be inhibited in the case of an error or exception condition. Latency was reduced on almost all instructions by transmitting the result status back to the CVAX CPU in the same cycle as the last step of execution. This is accomplished by checking the result prior to the last normalization or round operation in order to determine if the possibility of an exception condition exists. Since F_floating and D_floating formats use an exponent with a range of 256 values, and G_floating format increases that range to 2,048 possible values, the exponent is in range for most results, and a no-exception status can be returned prior to determination of the final result.

CFPA Implementation

After deciding on a set of basic algorithms that appeared to meet the project goals, the development effort proceeded to actual implementation. Individual algorithms can sometimes result in a proposed hardware solution that requires modifications to either the hardware or the algorithm in order to be implemented within design constraints. Merging the requirements of several algorithms can create implementation conflicts throughout the physical design. Care must be taken to consider the opposing requirements while incorporating the necessary features in a single design. The algorithms for the CFPA were chosen with a single hardware microarchitecture in mind. That architecture evolved as the design progressed, but the architecture maintained the basic structure that was used as a framework for early circuit design and feasibility study. The following section outlines the overall hardware microarchitecture for the CFPA. This section is followed by explanations of the more interesting circuit design issues.

Microarchitecture

The CFPA contains two main functional units:

• The execution unit, which performs all arithmetic calculations

• The bus interface unit (BIU), which controls all I/O operations

A block diagram of these units is shown in Figure 2.

The execution unit consists of two main data paths and their associated control logic. The 65-bit fraction data path contains an integral multiplier array and also processes integer data. Also included in the fraction data path are a small 4-bit left to 7-bit right shifter, a general-purpose ALU, scratch register, ROM constants, and quotient register and shifter. The second data path, the exponent data path, is 13 bits wide and contains a modified ALU design used to calculate absolute values needed in floating point addition. The exponent data path operates in parallel with the fraction data path and may be controlled independently or conditionally based upon results from the fraction data path. A 160-term PLA,


[Block diagram not reproduced. Labeled blocks: BIU sequencer (80 x 22), BIU control and decode, multiplier registers, input MUX, exponent data path, exponent control, multiply array, sign, fraction processor controls, status logic, fraction ALU, and execution unit sequencer (160 x 44), connected by a 32-bit input bus and a 32-bit I/O bus.]

KEY

BIU - BUS INTERFACE UNIT
DAL - DATA/ADDRESS LINES

Figure 2 CFPA Block Diagram

which accesses a single 44-bit microword each cycle, controls the execution unit.

The BIU controls the interface between the CPU and memory system. A 70-term PLA in the unit controls all I/O transactions between the CVAX and CFPA. The BIU also controls the test-mode logic to allow visibility to the data paths and execution unit PLA during operation.

Figure 3 illustrates the physical layout of these structures on the CFPA die.

Figure 3 CFPA Physical Layout

Circuit Design

Clocking

The CFPA chip employs a four-phase overlapping clocking scheme which provides timing resolution. Much of the control circuitry design calls for combinational circuits that operate between latches clocked on nonconsecutive phases, which are nonoverlapping.

Multiplier

As noted in the section Multiplication, it was recognized early in the chip design that the multiplier array would be key to meeting the desired performance. The CFPA implements multiplication by using an array of carry save adders with partial product wraparound. The wraparound enables the array to be cycled as many times as necessary. The final carry and sum addition is executed in the fraction ALU. A static implementation of the carry save adders is necessary since data propagates through multiple rows of the array.

To build the carry save adders, we used a four-transistor XOR. This approach allowed for minimum delay and required the least amount of chip area. As a result of SPICE simulation, we found that doubling the minimum size of the transistors in the multiplier array could provide a 20 percent speed increase. Since the cell area was constrained by the necessary interconnect in the metal layers, the device sizes were increased without affecting the cell size. Further device size increases, however, would have forced us to increase the cell size and would not have improved speed appreciably due to increased self-loading. With the approach we chose, SPICE simulation showed a worst-case delay of 6.5 ns per row and a typical delay of 4.5 ns.

To obtain the desired multiplication performance and minimize the area necessary for the multiplier array, we used a technique in which the array is cycled twice per microcycle. For worst-case devices, a half cycle takes 45 ns. An array size of four rows takes 26 ns to propagate through the array, allowing 19 ns for latching, return of partial products, and control switching. For typical devices four rows complete in 18 ns, allowing 22 ns in an 80-ns cycle for the wraparound path.

Control PLA

We also recognized the fraction shift control PLA as a possible speed limitation. The shift control PLA was the largest PLA in the control section and had to evaluate in a single clock phase. Because no clock signals were available to control evaluation of the PLA, we used a "dummy" AND array term to start evaluation of the OR array. A "dummy" OR line controls output clocking, making the PLA self-timed. Because this PLA could be


evaluated in a single clock phase, both alignment and normalization operations were able to eliminate an unnecessary wait cycle present on the MicroVAX FPU. We were also able to expand the divide algorithm to 4-bit shifts per cycle.

As we had suspected, the limiting factor in the final chip cycle time was the multiplier array. The ALUs and the large control PLAs in both the microcode control section and the BIU easily met speed requirements in the CMOS I process.

Design Methodology

As VLSI technology improves, both chip area and density increase, allowing much larger and more complicated designs to be attempted. Critical to any large project, the ability to predict and adjust the design according to the most current information plays an important role in achieving a successful project outcome in a minimum of time. This section describes the various phases and feedback paths of the design process for the CFPA and some of the unique aspects of VLSI design.

In the first phase of design, we defined the major sections and the necessary global signals communicating between them. The major outputs of this phase were hand-drawn sets of notes on the necessary functions of each section and preliminary sketches of possible implementations. Early in the design, we recognized that certain subsections would be critical to meeting the desired performance goals. These particularly critical sections were

• The multiplier array

• The exponent input path

• The fraction shifter controls

We therefore generated more detailed preliminary designs for all of these sections. Moreover, we tested their feasibility with SPICE circuit simulations. The MSB and LSB logic in the multiplier was also verified with an APL language simulation of the multiplier array.

One of the hazards in the early stages of a project is the tendency to spend too much effort perfecting one small piece of the design. If the original requirements are modified at a later date, much time is wasted. The design team, therefore, made a conscious effort to keep all parts of the design at similar levels of detail at all times throughout the project.

For purposes of design checking and chip implementation, we divided the CFPA into seven major sections: fraction data path, fraction data path controls, exponent data path, exponent data path controls, microsequencer, bus interface unit, and clock generator. Consistent divisions and global signals between these major sections were maintained in both the behavioral and transistor modeling levels as well as in the final mask artwork. This approach allows maximum possible checking to be carried out on each section, independent of the state of other sections of the chip.

Upon completion of the initial design conception, a behavioral model was written in the DECSIM simulation language. This model helped us to refine the algorithms and further define the data path and control structures. We rewrote the model several times to improve detail and incorporate design changes. From early in the development, the behavioral model was merged with the CVAX CPU chip model and a small system environment to provide a platform for more extensive testing. Existing diagnostic programs were therefore able to be run on the model to provide early checks on the design integrity. Additional tests were written to verify specific features of the CFPA implementation before we began the detailed circuit design for critical sections. Throughout the development phase, we used the VAX Architectural Exerciser (AXE) extensively to test instruction compatibility with existing VAX implementations. Despite a degradation of approximately 1M:1 while using the simulator to run test code, well over 500,000 test cases were run on the behavioral model before the design was considered ready for fabrication.

Using the DECSIM MOS device simulation system, we created a transistor-level model from final schematics as they were completed. By collecting test patterns from the appropriate signals in the behavioral model, the team could begin to debug the schematic in complete sections as other sections were still being designed. To do this efficiently, the DECSIM group modified their simulator to allow designers to write a binary state file and reload the file for examination. This facility gave logic designers a very efficient means to debug the transistor-level logic. Designers could run their simulations in batch mode overnight, examine the resulting patterns for mismatches with the behavioral model results, and then "back up" to the area before the failure test point to find the underlying cause. They could perform all these steps without rerunning the entire simulation each time they wanted to go back in time to look at another signal.


As each section of the transistor-level schematic was developed to a satisfactory level of accuracy, the third phase of the design, creation of the physical layout artwork, began on that section. To create the artwork, a Calma GDS interactive editing system was used. Over the course of the project, three layout designers were employed full time. Toward the end of the layout phase, up to four additional designers were working on various parts of the chip. Each section was checked with the interconnect verification (IV) wirelist extraction tool and a design rule checker (DRC) program.

As all the sections were drawn and global interconnect wiring was added to the chip layout, the fourth phase of the design, the back-end checks, began. The IV program was used to extract actual capacitance values for all nodes on the chip. We used these capacitance values in two ways to check the design. First, they were compiled into the DECSIM MOS transistor-level simulator. The timing feature of this tool was used to quickly check for gross timing problems over the entire chip operating as a whole. Once we identified an area as having a possible timing problem, and for those areas where we believed the DECSIM MOS simulation was inaccurate, we created and ran SPICE circuit simulations. In a second use of the extracted capacitance values, a program called PATH was written in the SCAN compiler generator language. PATH allowed the circuit designers to easily and accurately create wirelists representing critical paths for submission to SPICE. The program extracts a circuit path description from the much larger wirelists generated from either the IV tool or the chip-wide schematics. Wirelists created by the IV program include interconnect and capacitance information directly from layout artwork.

Although the chip design process appears in this discussion to be a neat progression, the various aspects of the actual project quickly overlapped one another. Almost all phases were taking place simultaneously on the various sections of the chip. To keep track of all these activities and continually update the project completion date, we used a spreadsheet program as a tracking tool.

The design team of 11 people completed the project in 21 months, including 6 months for product conception and 15 months for implementation. Due to the extensive modeling and simulation prior to device fabrication, initial parts were functional at speed.

Test Features

To aid the debugging process and provide more complete test coverage, the BIU contains test logic. This logic allows visibility to both data paths or to the main PLA. A simple test load sequence allows one of 16 possible test modes to be selected. Various groups of internal data path and control bits and two test-drive timing options are allowed. The test mode can be enabled or disabled at any time by asserting a single test pin. Certain test modes are available while operating at full speed in a system configuration.

CFPA Performance

Although there is no absolute measure of performance in computer system design, the floating point performance of the CVAX system is compared at approximately three times the performance of the MicroVAX II system. Using some of the more widely publicized benchmarks of floating point system performance, the CVAX system with CFPA running at 25 MHz shows better than three times the speed of the MicroVAX II with FPU. The system calculates 3,105K single-precision Whetstone instructions per second and 1,996K double-precision Whetstone instructions per second. Linpack performance of 0.68 Mflops single-precision and 0.45 Mflops double-precision demonstrates over four times the performance of the previous generation MicroVAX implementation.

Table 3 lists the typical cycle counts for register-to-register execution of floating point addition, subtraction, multiplication, and division.

Table 3 CFPA Cycle Counts for Optimized Instructions

Instruction   CFPA Cycles   Opcode/Operand Transfers   Total Cycles
ADDF/SUBF     3             5                          8
MULF          4             5                          9
DIVF          13            5                          18
ADDD/SUBD     4             7                          11
MULD          6             7                          13
DIVD          27            7                          34
ADDG/SUBG     4             8                          12
MULG          6             8                          14
DIVG          26            8                          34
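The division speedups quoted earlier in this paper can be reproduced with simple arithmetic. The sketch below assumes that each quoted ratio is a ratio of execution times, that is, cycle count multiplied by cycle time, with the MicroVAX FPU running at the 100-ns cycle mentioned in the Microprogramming section. The function names are illustrative only.

```python
MICROVAX_FPU_NS = 100  # MicroVAX FPU cycle time stated in the text

def time_speedup(old_cycles, new_cycles, new_ns, old_ns=MICROVAX_FPU_NS):
    """Ratio of old execution time to new execution time."""
    return (old_cycles * old_ns) / (new_cycles * new_ns)

def loop_speedup(old_bits, new_bits, new_ns, old_ns=MICROVAX_FPU_NS):
    """Ratio of quotient bits produced per nanosecond, new over old."""
    return (new_bits / new_ns) / (old_bits / old_ns)

for ns in (90, 80):
    print(f"{ns}-ns cycle:")
    print(f"  divide loop, 1.5 -> 2.4 bits/cycle: {loop_speedup(1.5, 2.4, ns):.2f}")
    print(f"  single precision, 37 -> 18 cycles:  {time_speedup(37, 18, ns):.2f}")
    print(f"  D_floating, 61 -> 35 cycles:        {time_speedup(61, 35, ns):.2f}")
```

Under these assumptions the results round to the figures given in the text: 1.8 and 2.0 for the divide loop, 2.3 and 2.6 for single-precision division, and 1.94 and 2.2 for D_floating division.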


Acknowledgments

In addition to the authors, members of the CFPA team were Roy Badeau, John Kowaleski, Omar Malik, and John Ellis, who contributed to the logic and circuit design, as well as Katie Alexandrowicz, Mike Benoit, Larry Commodore, Sharon Crafts, Bob Hicks, Dennis Hodges, Ellen Kagan, Marc Schenstrom, Tim Thrush, and Peter Wilcox, who created the layout for the chip. Test, product engineering, and additional technical contributions were made by Dilip Bhavsar, Shlomo Daniely, Larry Harada, Charlie Hilman, Vinod Rai, and Nigel Scott.

Reference

1. D. Sweeney, "An Analysis of Floating Point Addition," IBM Systems Journal, vol. 4 (1965): 31-42.

General References

T. Fox, P. Gronowski, A. Jain, B. Leary, and D. Miner, "The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor," Digital Technical Journal (August 1988, this issue): 95-108.

E. McLellan, G. Wolrich, et al., "The CVAX FPU, A CMOS VAX Floating Point Unit," ICCD Proceedings (October 1987): 153-156.

P. Rubinfeld, et al., "The CVAX CPU, A CMOS VAX Microprocessor Chip," ICCD Proceedings (October 1987): 148-152.

W. Bidermann, et al., "The MicroVAX 78132 Floating Point Chip," Digital Technical Journal (March 1986): 24-36.

G. Wolrich, et al., "A High Performance Floating Point Coprocessor," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 5 (October 1984): 690-696.

T. Leonard, ed., VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY3459E-DP, 1987).

O. L. MacSorley, "High-speed Arithmetic in Binary Computers," Proceedings of the IRE, vol. 49 (1961): 67-91.

Jeff Winston

The System Support Chip, a Multifunction Chip for CVAX Systems

Developed as a general-purpose companion to the new CMOS VAX VLSI chips, the System Support Chip (SSC) contains a common core of peripheral system functions which are required to support a MicroVAX system environment. These functions include timers, VAX console support, and standby RAM. In addition, the SSC provides system designers with "hooks" to other system functions. With these peripheral functions integrated on a single chip, system designers can substantially reduce the number of components on a module and add features previously not considered cost effective. Primarily used with the CVAX CPU chip, the SSC is also compatible with the NMOS MicroVAX CPU chip.

Background and Goals

In 1984, as the VAX 8200 and MicroVAX II chip sets entered production, Digital's Semiconductor Engineering Group (SEG) directed its attention toward defining the next generation of MicroVAX systems.1 This paper describes the project history and functionality of one of this new generation's peripheral chips, the MicroVAX System Support Chip (SSC). Developed over a period of 18 months beginning in late 1984, the SSC was designed as a general-purpose companion to the CVAX CPU. As such, the chip is used in the VAX 6200 family and in the MicroVAX 3000 family.

As part of the definition of the new CMOS VAX family of VLSI chips, SEG looked at the peripheral functions that surrounded the existing MicroVAX II CPU. We observed that, to build a marketable product, each system group had added a collection of timers, decoders, and other low- and mid-complexity functions to their respective modules. A high level of similarity from module to module was apparent in the makeup of these functions.

In addition to examining these existing modules, we talked with the system designers to learn what additional functions should be included on the next generation of systems. Again, we found that the various systems under development would have a significant number of overlapping functional requirements.

We decided a chip that provided the common core of these peripheral functions would be a strategic component for Digital products. This single chip would integrate many of the peripheral functions usually required on MicroVAX CPU modules. Consequently, a system designer could substantially reduce the number of components on a CPU module and add features that previously would not have been cost effective. Moreover, the chip would allow the designer to add features without lengthening the project schedule or requiring extra resources. As a result, the system designer could produce a more competitive Digital product at little additional cost.

From the system designer's viewpoint, the chip would

• Fully implement many functions used identically across different MicroVAX systems, such as timers, ROM support, and standby RAM

• Provide the "hooks" to support other functions that would be implemented differently in the different system environments

Thus each system group would no longer need to design, implement, and debug these important peripheral functions from scratch. Instead, they could use a readily available part that had been debugged and qualified. Further, since the SSC would use custom CMOS VLSI, this chip would contain some additional useful functions, such as


general-purpose timers, that are expensive to implement in off-the-shelf or gate array technology.

With these goals outlined, we began development of the SSC. The following section presents an overview of the chip. In the balance of the paper, we describe the chip functions in detail and discuss the trade-offs made and problems encountered during development.

SSC Overview

The SSC incorporates onto a single chip a common core of functions required to support the VAX system environment. Table 1 lists the essential physical characteristics of the chip. Figure 1, a photograph of the chip, shows the major sections. Grouped into three main categories, these sections are

• Support for power-up booting and the VAX console

• Clock and timing functions

• Features required by the VMS operating system and those commonly required on a VAX CPU module.

We begin our detailed discussions of the chip functions with the SSC console and boot code support.

Table 1 SSC Physical Characteristics

Total device count    84,000 (approx.)
Die size              8.0 mm x 7.5 mm
Power dissipation     Less than 1.0 W, worst case
Packaging             84-pin surface mount
Clock                 40 MHz external; 20 MHz internal; 25.6 kHz for time-of-year clock

Console and Boot Code Support

The peripheral support described in this section includes ROM packing, halt-protection, the UARTs, and standby RAM.

ROM Packing

When a MicroVAX CPU is powered up, it begins executing code from read-only memory (ROM). To properly communicate with an off-the-shelf ROM, the microprocessor requires additional interfacing logic. The SSC provides this logic by generating the signals needed for the ROM-to-microprocessor interface. The SSC also provides the packing support for data-width compatibility between the ROM and the microprocessor.

At project outset, SSC designers assumed the module designers would use four ROMs in parallel to provide a 32-bit-wide ROM word to the CPU. However, with ROMs becoming denser every year, it is now possible to put all boot, console, and diagnostic code in one or two 8-bit-wide ROMs. System designers therefore chose to use fewer ROMs, decreasing the number of components on the module and thus the product cost. The MicroVAX 3000 uses two 64 kilobit (Kb) ROMs in parallel, forming a 16-bit ROM word. The VAX 6200 system uses two 64Kb ROMs in series.

To provide data-width compatibility between the 32-bit-wide CVAX bus and the narrower ROMs, the SSC includes packing support for 16-bit word-wide or 8-bit byte-wide external ROM. With packing support, the SSC performs multiple reads of the narrow ROM word, assembles a 32-bit longword, and sends the longword back to the microprocessor. The SSC performs this function by directly driving the output enable and address lines 1 and 0 of the ROM. (See Figure 2.) The ROM's other address pins are driven by an external address latch, and the data lines of the ROM drive the CVAX bus directly.

To pack a ROM, the SSC asserts output enable, drives the appropriate combinations of ROM address pins 1 and 0, and receives the narrow data across the CVAX bus in consecutive ROM access cycles (unbeknownst to the microprocessor). The SSC then deasserts output enable, puts the packed longword on the CVAX bus, and completes the read transaction.

CPU Halt-request Protection

System designers requested that the SSC help prevent an undesired condition in the halt logic. When the halt pin is asserted on the microprocessor, it executes a special trap to console code stored in the ROM. A second assertion of the CPU's halt pin (typically generated when someone repeatedly presses the halt button on the system front panel) causes a second such trap, overwriting the pointer needed to return to program code upon leaving console mode. Without this pointer, normal operation of the machine cannot be resumed without booting. Obviously system designers wanted to prevent this condition.
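As a brief aside on the ROM packing sequence described earlier in this section, the assembly of a longword from narrow ROM reads can be modeled in software. This is a sketch under stated assumptions, not a description of the SSC logic: `pack_longword` and `read_rom` are hypothetical names, the low two address bits stand in for ROM address pins 1 and 0, and the narrow reads are assumed to be placed in little-endian order, which matches VAX byte ordering.

```python
def pack_longword(read_rom, longword_addr, rom_width_bits):
    """Assemble one 32-bit longword from consecutive narrow ROM reads.

    read_rom(byte_addr) is a hypothetical callback returning the 8-bit or
    16-bit value stored at that byte address.
    """
    if rom_width_bits not in (8, 16):
        raise ValueError("packing is only needed for 8-bit or 16-bit ROMs")
    if longword_addr % 4:
        raise ValueError("address must be longword aligned")
    step = rom_width_bits // 8           # bytes delivered per ROM access
    value = 0
    for low_addr in range(0, 4, step):   # packer drives address bits 1 and 0
        narrow = read_rom(longword_addr | low_addr)
        value |= narrow << (8 * low_addr)
    return value

# Example ROM image holding the longword 0x12345678 in little-endian order.
rom_bytes = bytes([0x78, 0x56, 0x34, 0x12])

def read_byte(addr):
    return rom_bytes[addr]

def read_word(addr):
    return rom_bytes[addr] | (rom_bytes[addr + 1] << 8)

print(hex(pack_longword(read_byte, 0, 8)))    # four byte-wide reads
print(hex(pack_longword(read_word, 0, 16)))   # two word-wide reads
```

A byte-wide ROM needs four accesses per longword and a word-wide ROM needs two, which is the trade the system designers accepted in exchange for fewer ROM parts on the module.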


Figure 1 SSC Photograph Showing Major Sections

Figure 2 SSC ROM Packing Connection Diagram

The SSC prevents the second call by monitoring the addresses of all instruction reads and by intercepting all external halt requests made to the CPU. During normal CPU operation, the SSC passes an initial halt request to the microprocessor. The microprocessor immediately begins to execute from halt-protected space, which is a special address space programmed into the SSC by the user at boot time.

When the CPU reads the first instruction from console code, the SSC detects this console code address and masks further halt requests. These


4 requests are masked as long as the microprocessor is executing ROM console code. The console code can then run uninterrupted by halts.

During console code execution, the SSC continues to monitor all instruction addresses. When an address outside halt-protected space is detected, the SSC re-enables halt requests to the CPU.

Before deciding on the design described above, we considered implementing a software-controlled bit that would enable and disable halts. This scheme would require the software to set the bit upon entering halt-protected space and to clear the bit upon re-entering normal operation. Although apparently simpler, this scheme proved to be flawed because two conditions might occur that would prevent the user from halting the system: (1) the bit could be accidentally set by non-boot code, or (2) a software error in the boot code could cause the microprocessor to start executing nonsystem code.

With the plan we chose, control is automatically returned to the user as soon as the software completes execution of the assumably bug-free halt-protected boot code. The system designer can, however, provide software control of the halt-enable function by aliasing the boot ROM into two adjacent spaces, where only one copy is halt protected. The software can then control halts by jumping between copies of the code. (This method is used on the MicroVAX II and MicroVAX 3500/3600 systems.)

UARTs

Although it was clear from the beginning that the SSC should provide UARTs, the best choice for number and design was not immediately clear. We had two choices at the time the chip was defined:

• Double-buffered DEC DLARTs (DC-319), which were in wide use, although a few problems with this design had recently surfaced

• Silo designs, which were becoming popular, though large in size

To conserve chip area, the SSC team settled on a design very similar to the DEC DLART design, making a few improvements in response to user requests. To keep from unduly complicating the design, we also decided to limit the number of UARTs to two (the number supported as console ports within the VAX architecture). As a further simplification, we limited the number of baud rates to eight power-of-two choices (300 to 38,400 baud).

Our most significant improvement to the DLART design was the addition of hardware control-P break-detection. Control-P entered on a VAX console is interpreted as a halt request. Thus, the UART must pick out this special keystroke from the normal character stream and then signal the CPU to take appropriate action. Formerly, this function was performed by cumbersome firmware. However, the SSC hardware continuously watches for this character and, when it senses control-P, automatically signals the microprocessor.

The console code may configure the SSC such that a break is defined as a control-P or as 20 spaces; the latter is a definition still used in some console applications. At one point, we had planned to use the chip timebase to define a break as a space lasting a fixed number of milliseconds instead of 20 spaces. However, users advised us that this new idea, although more elegant, would make the UART more confusing to use.

Other improvements include better notification of overrun and framing errors, and secure console support. Console security is effected by a pin. When grounded, the pin prevents a break from halting the CPU. This pin is typically connected to a key switch on the computer's front panel. Using the switch, the user can lock out console-induced halts.

Further, the SSC allows the CPU to directly access the UARTs, time-of-year clock, and bus reset register by means of the VAX external processor register protocol. Using this protocol, the microprocessor can address system registers located outside the microprocessor by register number rather than by complete address. The SSC understands this protocol and is capable of decoding the register number and generating the desired response. Previously, VAX module designers using off-the-shelf UARTs had to implement a substantial amount of external logic to decode the register addresses and enable the UARTs to respond to this protocol.

Finally, the UARTs support break transmit and loopback, and properly respond to VAX interrupts. In products containing the SSC, one UART is used as the system console; the other is used for auxiliary functions, such as remote diagnostics, or is disabled.
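The break-detection behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the SSC logic itself: the class and method names are hypothetical, and it assumes that "20 spaces" means 20 consecutive bit times with the receive line at the space (logic 0) level. The `secure` flag stands in for the grounded console-security pin.

```python
CTRL_P = 0x10  # ASCII code generated by control-P


class BreakDetector:
    """Toy model of console break detection (names are hypothetical)."""

    def __init__(self, mode="ctrl_p", secure=False, space_run=20):
        if mode not in ("ctrl_p", "spaces"):
            raise ValueError("mode must be 'ctrl_p' or 'spaces'")
        self.mode = mode
        self.secure = secure        # True models the grounded security pin
        self.space_run = space_run  # assumed meaning of "20 spaces"
        self._run = 0               # consecutive space-level bit times seen

    def _halt(self):
        # A detected break halts the CPU only when the console is not locked.
        return not self.secure

    def on_char(self, byte):
        """Feed one received character; return True if it requests a halt."""
        if self.mode == "ctrl_p" and byte == CTRL_P:
            return self._halt()
        return False

    def on_bit(self, level):
        """Feed one sampled line level (0 = space, 1 = mark)."""
        if self.mode != "spaces":
            return False
        self._run = self._run + 1 if level == 0 else 0
        if self._run == self.space_run:  # fire once when the run is reached
            return self._halt()
        return False
```

In this model an ordinary character never produces a halt, control-P does so only when the security flag is clear, and a run of spaces is reset by any mark bit.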


Standby RAM

When a VAX system is powered off, the operating system must store some information in nonvolatile memory until the system is powered up again. This stored information describes the system configuration and contains pointers to restart data stored on the disk. On the MicroVAX II CPU module, a watch chip provided 50 bytes of storage for this purpose. System designers indicated this amount was inadequate; 500 to 1000 bytes was desirable.

To meet this standby storage need, the SSC provides 1 kilobyte (KB) of battery backed-up random-access memory (RAM), organized as 256 by 32 bits. This RAM is also used as a system "scratch pad" during power-up test.

Additional standby support features are described in the section Standby Features.

Timers

The SSC timers serve to improve system reliability, meet architecture requirements, and save module space. These timers include the programmable bus timeout, the interval timer, general-purpose timers, and the time-of-year clock discussed in this section.

Bus Timeout

Since the CVAX bus is a handshake bus, incomplete bus transactions can hang the system. Some older VAX systems permit this condition; when those systems were designed, the high cost of implementing a timeout in external logic could not be justified in relation to the rarity of this event. However, the SSC improves system reliability by providing a programmable bus timeout at no additional system cost.

If a transaction lasts longer than a user-programmed interval, the chip

• Signals the microprocessor that a bus error has occurred

• Terminates the transaction

• Sets certain internal status flags based on the type of transaction that timed out

The status flags differentiate the two types of timeouts: (1) unexpected timeouts of read or write transactions, and (2) permissible timeouts caused by some unimplemented external processor registers or by certain interrupt-acknowledge transactions. After the timed-out transaction is terminated, error-handling code reads the SSC internal status flags and takes the appropriate action.

The timeout interval may be programmed in 1-microsecond increments up to 16 seconds. The larger values are used to time out system self-test.

Interval Timer

The VAX architecture specifies a complete interval clock which the operating system uses to schedule time-critical system functions at regular intervals. On MicroVAX CPUs, logic for the clock is simplified to reduce the amount of circuitry on the microprocessor chip. On these microprocessors, only an interrupt-enable bit is implemented. The timer source is generated externally and is driven onto an input pin of the microprocessor chip. When the interrupt-enable bit is set, an interrupt request is generated on the falling edge of the timer source, which is a 100-Hz signal on MicroVAX systems.

The SSC eliminates the need for the module designer to place another oscillator on the CPU module by providing a 100-Hz output suitable for driving the interval timer input to the microprocessor chip.

General-purpose Timers

Early in the SSC development, many potential users voiced a need for general-purpose timers on future MicroVAX modules. However, no one had specific recommendations on how such functionality should be implemented. Some users requested four timers; whereas others reasoned that one timer supported with software could do the work of four or eight timers.

After some design attempts, we decided to copy, bit for bit, the VAX standard interval clock. We reasoned that it was prudent to select a design that was already well thought out and in general use. We did add one control bit to provide a one-shot capability. Our decision to include two timers was based on the amount of available chip area and a desire for some redundancy.

Each timer provides scheduled interrupts with 1-microsecond resolution. The maximum interval between interrupts is 1.2 hours. In one-shot mode, the timer stops upon generating its first interrupt. In single-step mode, a count can be caused only by writing to a specific control bit. The interrupt vector is user-programmable.
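The 1.2-hour maximum interval quoted above is consistent with a 32-bit counter that advances once per microsecond. The counter width is an assumption here (the text says only that the timers copy the VAX standard interval clock), and the function name is illustrative. A quick check in Python:

```python
def max_interval_hours(counter_bits=32, ticks_per_second=1_000_000):
    """Longest interval before a counter of the given width wraps.

    counter_bits=32 is an assumption; the 1-microsecond tick comes
    from the stated timer resolution.
    """
    seconds = (2 ** counter_bits) / ticks_per_second
    return seconds / 3600


hours = max_interval_hours()
print(f"{hours:.3f} hours")  # roughly 1.19 hours, which rounds to 1.2
```

Under the same assumption, halving the counter width to 16 bits would limit the interval to about 65 milliseconds, which shows why a wide counter matters for long scheduling intervals.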


These timers are not used by the CPU module, but are available to the end user. We expect them to be very helpful to users designing embedded, time-sensitive applications.

Time-of-year Clock

The VAX architecture requires a battery-backed-up time-of-year clock with a resolution of 10 milliseconds (ms). When the MicroVAX II CPU module was designed, the best method for providing this feature involved the use of a BCD watch chip, approximately one-half gate array of logic to interface the chip to the MicroVAX bus, and some specially written operating system code. Even then the clock provided a resolution of only 1 second in standby mode.

The SSC provides a much more desirable solution. Its 32-bit VAX-standard time-of-year clock, driven by an external 25.6 kilohertz (kHz) oscillator, increments every 10 ms. As with all SSC-internal registers, the microprocessor can access the time-of-year clock without using any external logic.

To further minimize cost and module space usage in systems where battery-backed-up clock operation is not required, the user may simply ground the 25.6 kHz input pin on the SSC. During normal operation, the time-of-year clock will automatically derive its timebase from the chip's UART timebase, removing the need for the 25.6 kHz oscillator on the module.

Other Support Features

Programmable Address Strobes

As noted in the section Background and Goals, the SSC is designed to provide system designers with "hooks" to other system functions. One of these hooks is the SSC programmable address decode strobe function, which adds user flexibility and also saves module space.

Virtually every CPU module needs logic that watches the bus for particular addresses and asserts signals when these addresses are sensed. This function is typically embedded in gate array logic or in dedicated programmable array logic (PAL) chips.

The SSC has two programmable address decode strobes. The user may program each strobe for a particular address of 1s, 0s, or "don't cares." The user can also program selectively for read or write transactions. When a strobe channel is enabled, the corresponding output pin will assert during any bus transaction for which the programmed address and transaction type are matched.

The strobes can be programmed either to provide a hook for external logic or to complete a transaction after a delay. When the SSC is programmed to provide a hook, the strobe might be used to drive an external address decoder or to enable another chip. After asserting the output strobe, the SSC takes no further action, permitting another device to complete the bus transaction.

Alternatively, a strobe can be programmed to complete the transaction after a delay that permits an external device several hundred nanoseconds to respond. When configured in this way, the strobe is usually programmed to respond to reads of a single longword address. The strobe is then wired to enable three-state drivers which drive module data onto the CVAX bus. This data is often made up of external registers, or of dual in-line package switches that indicate baud rate selection and other module-specific information.

Output Port

Four pins on the chip function as an output port. The port is written as a register and is capable of driving simple output devices. This output port is another general-purpose feature that system designers need to implement various module-specific functions. Some designers use the port pins to drive LEDs, which are then flashed in a particular sequence to indicate progress of self-test. In other applications, system designers have used these signals to control external multiplexers and to provide simple modem control.

Bus Reset

The VAX architecture requires a reset of the I/O system when the CPU issues a write to a particular external processor register. This specification requires support from both decoding logic and I/O system reset logic. In the past, each module designer had to implement both logic blocks in external hardware. SSC designers saw another opportunity to simplify the CPU module by placing some of the consistently required logic on the SSC.

Although the I/O system reset logic varies among systems, the decoding logic is the same in each MicroVAX system. The SSC provides this core logic, taking three actions. First, the chip decodes the external processor register number. Then it asserts an output pin in response to the external processor register write. Finally, it


delays the completion of the write transaction for several hundred nanoseconds, so that module-specific logic, triggered by the pin assertion, may proceed to take the proper action to complete the I/O system reset.

Standby Mode: Power-sensing Features

When powered down, VAX systems are required to maintain a running real-time clock for at least 100 hours. Retention of some memory is also desirable. As noted in the section Standby RAM, the SSC satisfies these requirements by providing a standby operating mode. In this mode, the power supply to the module and to the chip pad drivers is turned off and most internal logic is disabled. However, the SSC RAM and time-of-year clock are powered by three NiCad batteries supplying between +3.1 V and +4.5 V at approximately 150 microamperes. The batteries also power the 25.6-kHz external low-power CMOS oscillator, which provides the time-of-year clock timebase. Within the SSC, special logic guarantees smooth transitions from normal operation to standby mode.

As part of providing standby operation, the SSC must reliably report at boot time whether standby power was continuously maintained during the standby period. The task of determining whether battery power had remained stable during the standby period was a difficult challenge for the SSC designers. There are two ways power can be lost during standby: The batteries may run down, or someone may replace the batteries. In either case, the SSC detects loss of power and reports such loss to the CPU during the next boot. Except for external logic used for voltage measurement, this entire function is implemented within the SSC as follows.

When the batteries run down, the unacceptably low voltage can be detected during boot. However, our CMOS process is not optimized for the design of logic that can accurately measure intermediate voltages. Thus, external circuits are used to detect whether battery voltage is currently below a minimum level. If voltage is below minimum, these circuits assert an SSC input pin dedicated to this function. However, these external circuits cannot detect temporary power losses that occur during standby mode, for example, when the batteries are replaced. To provide for these cases, a special latch on the chip, which powers up in a preferred state, detects the interruption of battery power during standby or initial power-up. This power-up detector latch will operate for arbitrarily slow supply transitions. In addition, the latch's reset input includes internal filtering for protection against fast supply transitions or power-up noise.

If either the external circuits assert the SSC input pin or the special power-up latch indicates a loss of power, the SSC sets an internal flag bit at boot time. The bit, which indicates that the clock and RAM are not valid, is read by the microprocessor during boot.

System reliability is improved by the SSC's ability to determine the integrity of its standby logic and to notify the CPU in a software-accessible fashion. Moreover, this feature saves design time, since designers need not individually create this tricky but necessary logic.

Flexible Addressing

The designers of the SSC determined that the chip should fit into any VAX system environment with a minimum of external address decoding or system incompatibility. As a result, the SSC control and status registers and internal RAM are all situated within a relocatable 2KB address space. This arrangement eliminates the need for an external chip-enable pin and the external decoding logic that would be needed to properly assert such a pin. The power-up boot code programs the base address of the registers by writing a 2KB-aligned value to the SSC base address register.

The SSC base address register is located at a single fixed address, chosen in cooperation with our major users. The SSC RAM and registers can then be addressed by adding their specified offsets to the value in the base address register. A system designer can therefore situate the SSC registers and RAM (together) anywhere in a system's I/O space map.

Initialization

To make the SSC especially easy to use, most of the SSC configuration bits are grouped in a single register. These bits include setup for the UARTs, programmable address strobes, ROM packing, and halt-protection features. Thus, during system initialization, most SSC features can be configured with a single write.

MicroVAX and Multi-speed Compatibility

Although targeted primarily as a companion to the CMOS VAX CPU, the SSC is also compatible with the older NMOS MicroVAX CPU used in the


MicroVAX II. Thus, new low cost or low performance designs using the older microprocessor chip can also take advantage of the high integration and extra functionality provided by the SSC.

The SSC is also compatible with modules that have either high or low cycle times. Originally designed for a 100-ns microcycle, the CVAX microprocessor runs at 90 ns in the MicroVAX 3000 system and at 80 ns in the VAX 6200 system. Early in the development of the CVAX chip set, we decided that chips that were not performance-critical, like the SSC, would run at just one speed (100 ns), but would be capable of interfacing to a faster-running microprocessor. Speed conformability would simplify development, manufacturing, and field support because one SSC could be used across all MicroVAX systems.

Accordingly, the SSC bus interface, running at a 100-ns microcycle, accommodates microprocessors running at microcycles from 100 ns to 60 ns.

Summary

The SSC project yielded a CVAX microprocessor companion chip that provides a high degree of functionality, flexibility, and integration. Comprising console support, timers, decoders, and other programmable features on a single chip, the SSC permits system designers to develop smaller, more integrated modules at lower cost. Moreover, improvements made to the generalized features, such as halt protection and break detection, contribute to increased system reliability without reducing system design flexibility. The utility of the SSC is evidenced by plans to include the chip in over a dozen different Digital products, such as the MicroVAX 3000 systems, the VAX 6200 systems, many XMI adapter boards, and various controller products.

Acknowledgments

The author wishes to acknowledge Brian R. Allison, Barry Maskas, Robert McNamara, Jay T. Nichols, Michael H. Phipps, and Robert M. Supnik, who contributed to the development of the SSC functional specification. Further, the author acknowledges the considerable efforts of the SSC design and manufacturing team: Robert A. Anselmo, Nannette M. Fitzgerald, Robert J. Flanagan, James D. Gorr, Dennis E. Hodges, Keith D. Johnston, Balakrishna Joshi, Joseph R. Manros, John W. May, Michael J. Saldana, and Nicholas D. Wade.

References

1. T. Fox, P. Gronowski, A. Jain, B. Leary, and D. Miner, "The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor," Digital Technical Journal (August 1988, this issue): 95-108.

2. B. Allison, "An Overview of the VAX 6200 Family of Systems," Digital Technical Journal (August 1988, this issue): 10-18.

3. G. Lidington, "Overview of the MicroVAX 3500/3600 Processor Module," Digital Technical Journal (August 1988, this issue): 79-86.

4. T. Leonard, ed., VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-3459E-DP, 1987).

Barry A. Maskas

Development of the CVAX Q22-bus Interface Chip

The CVAX Q22-bus interface chip (CQBIC) is a highly integrated, single chip that serves as the interface between the CVAX microprocessor and the Q22-bus I/O subsystem. The CQBIC VLSI design is the first produced by Digital's Japan Research and Development Center in coordination with teams in the U.S. Before implementing the interface design, team members built a test chip to ensure the feasibility of a CMOS Q22-bus transceiver and to test various design alternatives. Also as part of their research effort, they examined alternative designs for several functions, including the scatter-gather map cache and the data buffering functions. Project designers then implemented the CQBIC using a mix of full custom and semicustom design databases. A description of the five major functional sections is presented in this paper.

The CVAX Q22-bus Interface Chip (CQBIC) is an evolutionary step in functionality and integration from the MicroVAX II CPU module's Q22-bus interface design. The MicroVAX II CPU module's Q22-bus interface comprises 18 discrete chips and a gate array; the module design employs[1] linked sequential controllers. The advanced CQBIC design integrates these controllers and all other interface functionality in a single chip and retains the linked controller design.

Specifically, the CQBIC provides the electrical and functional interface between the 32-bit CVAX microprocessor and the 16-bit Q22-bus I/O subsystem. Integrated on the chip are the complete Q22-bus interface, data buffering, the CVAX[2] bus, direct memory access (DMA) interface, a scatter-gather (S/G) map cache, and complex control logic. Table 1 lists the chip's physical characteristics.

Table 1  CQBIC Physical Characteristics

Process                  2-micron drawn, N-well, dual aluminum CMOS
Number of transistors    40,900 (approx.)
Die size                 9.2 mm x 9.4 mm
Power consumption        1.5 W
Packaging                132-pin surface-mountable chip carrier with 25-mil lead spacing and heat sink
Power supply             +5 V

Begun in February 1985, the two-year CQBIC project was a joint venture for three of Digital's groups: Japan Research and Development Center, Large Scale Integration (JRDC/LSI); Semiconductor Engineering Advanced Peripherals Development (SEG/APD); and Micro Systems Development (MSD).[3]

Project Goals and Organization

A highly integrated, single-chip, CVAX bus to Q22-bus adapter was a desirable product for several reasons. Primarily, such a chip would reduce component costs and system module size, and increase system reliability as compared with the MicroVAX II CPU module's Q22-bus interface.

Therefore, the primary goal of the CQBIC project was to develop a highly integrated chip as an interface between the CVAX microprocessor and the Q22-bus. This chip would ease the task of Digital's system designers by standardizing the interfacing to the Q22-bus and by providing the same or improved I/O bandwidth performance as the MicroVAX II CPU module Q22-bus interface.

Achievement of this performance goal was complicated by the single-port memory architecture of the first planned CPU module and its two-level instruction and data, direct-mapped cache scheme. In comparison, the MicroVAX II CPU


module has a dual-ported memory architecture with no caching. However, the DMA single-port architecture was required for the new two-level cache architecture; with a single-port organization, DMA addresses can be viewed by the caches so that the caches can invalidate valid entries during I/O-to-memory write transactions. Consequently, to both accommodate this architecture and meet its performance goals, CQBIC had to be designed to consume little CVAX bus bandwidth while performing DMA transactions. Such a design would not greatly degrade CVAX microprocessor performance.

A second important project goal was to preserve I/O performance and operating system software compatibility.[4] Therefore, CQBIC would provide the same Q22-bus virtual to CPU physical memory address translation as contained on the MicroVAX II CPU.

In addition to meeting these goals, the CQBIC project would also serve to demonstrate the feasibility of a remote VLSI design center for the SEG organization. Moreover, through this project the JRDC/LSI Group would have an opportunity to demonstrate its VLSI design capabilities.

Further complicating the challenges presented by the design goals, the distance between the working groups, the cultural and work style differences, and the language barrier was the newness of the JRDC team. Many of the JRDC team members could read and write English, but had some difficulty speaking and listening to English. Also, the Japanese language was completely foreign to MSD and SEG. Written English served as the primary form of communication throughout the project. Further, the JRDC team members had to learn not only about Digital's products and architectures, but also the Q22-bus, the other five chip specifications under development, the SEG semicustom and custom chip design tool suites, and Digital's CMOS technology. To help with this steep learning curve, experts from each of these areas facilitated the training and information flow. These experts provided answers to specific questions and helped to solve specific problems as follow-up to formal training sessions.

Based on the MicroVAX II CPU design experience in SEG and MSD, SEG provided leadership for both the chip specification development and the project. This role involved conveying to the JRDC team the chip functional definition and detailed behavior specifications. This information had to be presented in the context of the five other VLSI chips being designed by the SEG groups with a focus on the CPU module product. The U.S.-based project leadership had to provide budget, schedule, and task coordination for JRDC, MSD, and for other organizations within SEG.

As the initial customer, MSD performed three major specification reviews. This group continually provided direction concerning design trade-offs, and requested specific functionality revisions to tailor CQBIC more to their CPU application.

Digital's Engineering Network was the primary means of transferring written communications between groups. We also exchanged information by sending facsimile copy and by mailing magnetic tapes and documents. At times telephone discussions and personal visits were necessary.

Specification development began with a two-week visit to the JRDC facility in Tokyo. At that time, we wrote the first draft with key members of the JRDC team. This draft specification laid the foundation for subsequent architecture and functionality research, and served as a communication medium. The draft specification was then maintained by the JRDC team and SEG and was frequently revised and reviewed.

The following section presents the project research conducted to ensure the feasibility of project goals and to resolve major questions raised by the draft specification.

Project Research

Project research focused on two areas. First, we wanted to evaluate the risks involved in the implementation of a CMOS Q22-bus transceiver. For this purpose, SEG team members implemented a test chip. Second, we wanted to determine the best means to achieve our stated performance goals. The tests and studies which we conducted and their results are described below.

Q22-bus Transceiver Test Chip

To determine whether or not a CMOS Q22-bus transceiver could be implemented, several studies were performed by SEG circuit designers responsible for the cell library. These studies showed feasibility, with two major implementation risks:

• The proposed differential comparator to be used as the receiver required a stable voltage reference.


• The ≈100-milliampere (mA) peak, 70-mA steady-state sink current Q22-bus transceivers were to be on the same substrate as complex control circuitry. Three problems could result:

  - CMOS latch-up due to charge injection from input signal overshoot

  - Excessive noise due to substrate current transients

  - Excessive localized power dissipation

With several design alternatives available to us, we needed more experimental data to determine the better alternatives. To obtain this data, a Q22-bus octal transceiver test chip was designed, fabricated, and packaged by SEG circuit designers. Available after seven months, this packaged octal transceiver test chip was tested in a MicroVAX II CPU module and performed well under system conditions.

The test chip experiments showed that CMOS latch-up due to worst-case overshoots below ground did not occur. These results matched our expectations. We were not concerned with overshoots above the +5 volts (V) bias because of the Q22-bus termination voltage of 3.4 V. Tests also showed that special care would be required in the allocation of dedicated ground pins for the Q22-bus transceivers to avoid noise coupling from substrate bounce and package power-lead inductance. Also, in the chip layout, we would have to use many parallel traces of metal interconnect to prevent metal migration when sinking 100 mA of peak current. Finally, due to low channel resistance of the Q22-bus driver output pull-down device, the power dissipation of the test chip was shown to be within reliable operation limits. Therefore, CQBIC power dissipation was not a concern in terms of thermal characteristics of the planned packaging.

The test chip results did lead to a compromise concerning the stable voltage reference. Because of large variations in CMOS process materials, a precision off-chip or external resistor would better serve to establish the required voltage than would some risky process-desensitized structure in CMOS.

Prior to these tests, we designed CQBIC to facilitate the use of either integral transceivers or off-chip transceivers. Fortunately, the test data demonstrated the feasibility of a single chip with integral Q22-bus transceivers, and the project proceeded under a plan that included integral transceivers.

Architecture and Performance Studies

As the octal transceiver test chip was being developed, MSD, JRDC and SEG conducted architecture and performance studies. These studies would answer questions about the organization of the S/G mapping function, the data buffering required to meet the performance goals, and the sequential controllers partitioning and clocking to manage the two asynchronous buses and the internal functions.

S/G Mapping

A RAM structure was first proposed to implement the S/G mapping functionality. The MicroVAX II CPU design had used such a structure, with two 8K-by-8 static RAMs. This proposal, however, was rejected since not all of the RAM would fit on a single chip with all the other required circuitry. Increasing the chip size was not an option. The chip size was limited for cost reasons as well as packaging cavity size reasons. The chip's cost is directly proportionate to its size, and the design of a new package was outside the scope of the project. Moreover, implementation of a portion of the RAM would have introduced a system software incompatibility with MicroVAX II and would have reduced the planned performance.

As the problem of S/G mapping functionality was studied, it became clear that system memory was adequate. Further, CQBIC could not implement the full 8192-entry RAM on a chip size that could be fabricated with reasonable yield. Also, a capability to prefetch S/G map entries based on expectation was considered necessary to sustain peak, as opposed to average, performance. We looked to the Q22-bus DMA devices which perform transactions with incrementing addresses. In particular, Q22-bus devices are designed to utilize the Q22-bus block-mode data transfer protocol. This protocol transfers data packets of eight-word blocks. With this protocol available, we could design the CQBIC to cache the S/G map entries from system memory on demand and on expectation.

The next two problems were how to implement the cache and how many entries to include in the cache. A 16-entry cache provided the balance we sought between several factors: appropriate chip area, implementation complexity, design risk, and DMA I/O performance impact.
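The idea of caching S/G map entries "on demand and on expectation" can be illustrated with a small software model. The sketch below is not the CQBIC implementation. It assumes 512-byte pages (a 22-bit Q22-bus address space divided among 8192 map entries), a fully associative 16-entry cache with oldest-first replacement, and a prefetch of the next map entry on every translation. The class name, the replacement policy, and the prefetch trigger are all illustrative assumptions.

```python
from collections import OrderedDict

PAGE_BITS = 9          # assumed: 2**22 bytes / 8192 entries = 512-byte pages
MAP_ENTRIES = 8192     # size of the full S/G map held in system memory
CACHE_ENTRIES = 16     # size of the on-chip map cache


class SGMapCache:
    """Toy scatter-gather map cache with demand fetch and prefetch."""

    def __init__(self, map_in_memory):
        self.map = map_in_memory     # list: Q22-bus page -> physical page
        self.cache = OrderedDict()   # cached entries, oldest first
        self.memory_reads = 0        # map reads that went to system memory

    def _fill(self, q_page):
        # Load one map entry unless it is cached or out of range.
        if q_page in self.cache or not 0 <= q_page < MAP_ENTRIES:
            return
        if len(self.cache) >= CACHE_ENTRIES:
            self.cache.popitem(last=False)   # evict oldest (assumed policy)
        self.cache[q_page] = self.map[q_page]
        self.memory_reads += 1

    def translate(self, q22_addr):
        """Return (physical address, cache hit?) for a Q22-bus address."""
        q_page = q22_addr >> PAGE_BITS
        offset = q22_addr & ((1 << PAGE_BITS) - 1)
        hit = q_page in self.cache
        self._fill(q_page)                   # demand fetch on a miss
        phys = (self.cache[q_page] << PAGE_BITS) | offset
        self._fill(q_page + 1)               # prefetch on expectation
        return phys, hit
```

With a DMA device that walks through memory with incrementing addresses, only the first translation misses in this model; each later page is already present because it was prefetched while the previous page was in use.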


Data Buffering
CVAX bus cycle times were targeted to be four or more times faster than typical Q22-bus cycle times. Also, the CVAX bus was being designed to support DMA multidata transfers. This design was consistent with the Q22-bus block-mode data transfer protocol. To bridge the bandwidth gap between the two buses and to minimize the use of CVAX bus bandwidth, data buffering techniques were investigated to optimize Q22-bus block-mode throughput for read and write transactions. These investigations resulted not only in the determination of buffer sizes but also in a decision on how to control the buffers to optimize sustained throughput and minimize initial latency.

The MicroVAX II CPU is capable of supplying read data to the Q22-bus with a very consistent access time because memory arbitration is not required. To achieve MicroVAX II average read performance, read data prefetching was considered necessary to compensate for the memory arbitration time. For CQBIC, the first read of a Q22-bus transaction would be delayed by the DMA request and grant time needed to obtain mastership of the CVAX bus, and by the subsequent system memory access time. The delay would always be longer than MicroVAX II read latency, which had only memory access time to consider. We determined that two quadword read buffers would be sufficient to sustain the required throughput because read data is prefetched based on expectations of the Q22-bus block-mode protocol. Low latency was achieved by providing a response to the Q22-bus as soon as the first longword of the quadword read data was obtained from system memory.

Pipelining the buffered write data could be achieved with two buffers, each eight words deep. An octaword block is the packet size of the Q22-bus block-mode protocol and is the maximum multitransfer block size of the CVAX bus. The control logic would be designed to allow one buffer to be unloaded to system memory while the other was being filled. The latency would be better than that of the MicroVAX II CPU module, since the CQBIC data was packed into fast octaword buffers. The average throughput would be sustained by the four times or greater bandwidth of the CVAX bus, as compared to the Q22-bus, by the use of pipelined data buffers.

The CQBIC buffering and transaction optimizations in conjunction with the CVAX CPU internal cache hit rate result in an insignificant DMA I/O impact on CVAX CPU performance. Given the buffering and control organization and optimizations described above, the performance difference between the single-port and the dual-port memory designs cannot be detected by a Q22-bus device. The result is improvement in Q22-bus read and write throughput over the MicroVAX II CPU. The CQBIC maximizes Q22-bus performance and minimizes CVAX bus usage. Moreover, CQBIC can sustain Q22-bus block-mode transfer write data rates of 3.1 megabytes (MB) per second and read data rates of 2.5 MB per second.

Finally, to optimize the CVAX I/O write performance, a dump-and-run buffer was to be implemented in CQBIC. This buffer is used to avoid tying up the CVAX bus while the slower Q22-bus transaction completes and while deadlock situations are resolved.

Controller Partition
Given these buffering functions, the control of the data path and of the two major bus interfaces was naturally partitioned into five linked controllers and a prioritization function. Each bus interface was partitioned into a master and a slave controller. The S/G map cache also required a controller. Then, to assist in coordination of control flow decisions, a priority resolver function was needed.

This partition allows the Q22-bus and the CVAX bus to operate in parallel while all deadlock conditions are resolved. Fortunately, the CVAX chip team implemented a bus transaction retry capability. This retry capability proved essential to our partition and implementation of CQBIC control functionality.

Clocking
Two primary factors led us to select a 50-nanosecond (ns) two-phase nonoverlapped internal clock scheme. First, the MicroVAX II CPU module's 50-ns single-phase clocking scheme was a proven approach and mapped well to the fixed Q22-bus minimum asynchronous timing specifications. Second, we expected synchronous CVAX bus cycle timing to vary with CMOS technology improvements. The variable CVAX cycle time and four-phase overlapped clocking scheme could not be used to generate the fixed Q22-bus timing. Also, having two clocking schemes in one chip was determined to be a design too complex to manage.
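The double-buffered write path described under Data Buffering can be pictured with a small software model. This is only an illustrative sketch, not the chip's control logic: the class and method names are hypothetical, partial-block flushing is left out, and the one behavior it demonstrates is that one eight-word buffer can be handed to the CVAX bus side while the other keeps accepting Q22-bus words.

```python
WORDS_PER_OCTAWORD = 8  # eight 16-bit Q22-bus words make one 16-byte octaword


class PingPongWriteBuffer:
    """Hypothetical model of two octaword write buffers used in alternation."""

    def __init__(self):
        self.filling = []    # buffer currently accepting Q22-bus words
        self.pending = None  # full buffer waiting for a CVAX bus burst

    def _hand_over(self):
        # Move a full buffer to the CVAX side as soon as that side is free.
        if len(self.filling) == WORDS_PER_OCTAWORD and self.pending is None:
            self.pending, self.filling = self.filling, []

    def q22_write(self, word):
        """Accept one word; return False if both buffers are occupied."""
        if len(self.filling) == WORDS_PER_OCTAWORD:
            return False  # the Q22-bus side has to wait for an unload
        self.filling.append(word & 0xFFFF)
        self._hand_over()
        return True

    def cvax_unload(self):
        """Return the pending octaword as one burst, or None if none is ready."""
        burst, self.pending = self.pending, None
        self._hand_over()
        return burst


buf = PingPongWriteBuffer()
accepted = [buf.q22_write(w) for w in range(16)]  # fills both buffers
stalled = buf.q22_write(16)                       # False: nothing unloaded yet
first_burst = buf.cvax_unload()                   # words 0..7 leave as one burst
resumed = buf.q22_write(16)                       # True: a buffer is free again
```

In this model the Q22-bus side stalls only when both buffers are full, which is the situation the faster CVAX bus is meant to make rare.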


The implication of the selected CQBIC clocking scheme was that, with reference to all internal controllers, the CVAX bus and the Q22-bus were asynchronous.

Research Results Summary
The result of the research was a single chip design that would achieve the stated project goals by providing

• Integral Q22-bus transceivers
• A 16-entry map cache, with prefetching
• Two octaword Q22-bus write buffers
• Two quadword Q22-bus read buffers, with prefetching
• A longword CVAX write buffer
• Transaction partitioned sequential controllers, which are optimized for look-ahead data buffering control and for utilization of multiple-transfer transactions to minimize CVAX bus and Q22-bus usage

The research results were documented in the form of a revised chip specification and a behavioral model. The chip was implemented from the revised specification with a process which was unique and unproven.

Implementation Process
CQBIC was implemented using a mix of standard library cells, custom library cells, and full custom layout sections. At the time, SEG could not offer a formal design tool suite to deal with such a mix of full custom and semicustom design ...

... compensate for the long latency that could occur, for example, when the look-up misses the cache and requires an S/G map memory read access.

The standard cell library was rejected because it did not offer a content addressable memory (CAM), which is the structure required to facilitate fast address look-ups. In addition, the use of standard library cell latches and exclusive OR gates was estimated to almost double the desired look-up time on the 16 cached entries.

Again to contain chip size and also to meet control performance, custom programmable logic array (PLA) sections were required. The PLA structures offered by the standard cell library were too slow and required a clocking scheme different from the CQBIC two-phase clocking scheme. This decision to implement custom PLA structures is credited as the reason performance goals were achieved. In fact, performance goals could not have been achieved without custom PLA structures.

At the time logic and circuit design began, the standard library cells available for this design were found to be inadequate. Many necessary functions were missing or were not tailored for the specific application. Also, in many cases the performance of library cells did not match the performance required by the two-phase clocking scheme. Hence the JRDC team developed its own extensions to the standard cell library. The common logic structures such as NAND, NOR, flip-flop, and latch were used from the standard cell library as much as possible, since these structures reduced the risk of circuit problems. Cus-


... then used to test the CQBIC further in the context of the application system. System simulation proved that all CVAX bus specifications which were communicated were understood and implemented correctly. The system simulation served as an independent test of the CQBIC design. Although no CQBIC problems were found by MSD during system simulation, the testing did prove that the system would operate. We later learned that several bugs could have been found had more time-varied events been scheduled with the system simulation test cases.

When completed, the CQBIC MOS model was correlated to the behavioral chip model. The MOS chip model was then placed in the MSD system model for regression testing.

When we were confident that the CQBIC design was complete, that is, when no new bugs were found after thorough testing, the chip was released to SEG for a final design review and submittal for fabrication. The database was copied over the Engineering Network from the JRDC facility in Tokyo to the Hudson, Massachusetts, plant. After completing a final design review and subsequent problem fixes, the chip was submitted for fabrication. Eight weeks later first-pass parts were probed and found to be functional. Packaged parts were run in the MSD CPU module. This testing revealed several timing bugs related to events from both buses occurring at the same time. After extensive testing, the bugs were fixed, and a second revision was released for fabrication. When the second pass part was tested in the CPU module, another timing problem related to coincident transactions from both interfaces surfaced. This particular bug was obfuscated by a pass 2 bug. A third revision was prepared and fabricated. This third pass was available in time for the first customer shipments. The final chip functionality is briefly described below.

The CQBIC Functional Organization
CQBIC is an asynchronous CVAX bus device and requires a fixed 40-megahertz oscillator input to derive Q22-bus timing. The oscillator input is used to generate a two-phase, nonoverlapped clock which is distributed to all chip sections. The CVAX bus interface was designed to accommodate transaction cycle times from 100 ns to 60 ns. This design anticipated a CVAX CPU technology change and subsequent performance improvement.

CQBIC provides the power-up, initialization, power-fail, and power-down protocols to the system and performs Q22-bus and CVAX bus address decoding. Further, the chip performs the page address S/G mapping function for DMA devices by using its 16-entry S/G address map cache.

This cache contains a copy of the most recently used S/G pointers, which are located in system memory. The cached pointers are used to map 22-bit Q22-bus virtual addresses to 29-bit CVAX bus physical addresses. CVAX bus and Q22-bus transactions are optimized by using a CPU dump-and-run write buffer, two pipelined Q22-bus octaword write buffers, and two pipelined Q22-bus quadword read buffers. The chip performs transparent address and data alignments, and packing and unpacking of internal buffers.

CQBIC is composed of five global control sections. A block diagram of the chip control sections is shown in Figure 1. Each section contains an independent sequential controller:

• The Q22-bus arbiter
• The S/G map
• The Q22-bus master
• The Q22-bus slave and CVAX bus master
• The Q22-bus electrical interface

A photomicrograph showing the floor plan of the control sections is shown in Figure 2.

Each section shown in the Figure 1 block diagram is described next.

Q22-bus Arbiter Section
As a Q22-bus arbiter, the CQBIC is the default Q22-bus master and the highest priority requester. The arbiter accepts requests from Q22-bus DMA devices and from the master section, and grants mastership with first priority to the master section. In response to a master request, the arbiter exercises a demand mastership protocol to Q22-bus devices to ensure low-latency interrupt vector or data reads. In response to interrupt requests from the Q22-bus, the arbiter receives the requests and passes them to the CPU. When the CPU acknowledges the request, CQBIC reads a vector from the Q22-bus device and supplies an acknowledge signal.
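The grant rule stated for the arbiter, first priority to the CQBIC master section and then the requesting DMA devices, can be written as a short decision function. This is a sketch under assumptions: the function name and the string labels are hypothetical, and the ordering among DMA devices (first entry in the list wins) is assumed for illustration rather than taken from the text.

```python
def grant_q22_bus(master_request, dma_requests):
    """Pick the next Q22-bus master.

    master_request: True when the CQBIC master section wants the bus.
    dma_requests:   names of requesting DMA devices, in an assumed
                    priority order (the first entry wins).
    Returns "cqbic-master", a device name, or None if nobody is asking.
    """
    if master_request:
        return "cqbic-master"   # the master section always goes first
    if dma_requests:
        return dma_requests[0]  # otherwise the best-placed DMA requester
    return None


winner_with_cpu = grant_q22_bus(True, ["disk", "tape"])
winner_dma_only = grant_q22_bus(False, ["disk", "tape"])
```

The demand mastership protocol and interrupt handling described above are deliberately left out; the sketch covers only the priority ordering.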


[Figure 1 block diagram. Blocks shown between the Q22-bus and the CVAX bus: Q22-bus master control; arbiter and initialization control; scatter and gather map control; slave control; prioritize. Key: address and data paths; control paths.]

Figure 1 Control Section Block Diagram
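The 22-bit to 29-bit mapping performed by the scatter and gather map control block can be illustrated with the field widths given in the S/G Map Section: a 13-bit map index, a 9-bit offset within the page, and a 20-bit page pointer with a valid bit. The sketch below is a software illustration only. The function name is hypothetical, the map is modeled as a Python dictionary, and the bit positions inside a map entry (valid bit in bit 31, page pointer in bits 19:0) are assumptions, since the text gives only the field sizes.

```python
PAGE_OFFSET_BITS = 9    # low 9 bits of a Q22-bus address pass through
MAP_INDEX_BITS = 13     # upper 13 bits select one of 8,192 map entries
PAGE_POINTER_BITS = 20  # page pointer held in each map entry
VALID_BIT = 1 << 31     # assumed position of the valid bit


def translate(q22_addr, sg_map):
    """Map a 22-bit Q22-bus address to a 29-bit CVAX bus physical address."""
    if not 0 <= q22_addr < (1 << (MAP_INDEX_BITS + PAGE_OFFSET_BITS)):
        raise ValueError("Q22-bus addresses are 22 bits wide")
    index = q22_addr >> PAGE_OFFSET_BITS
    offset = q22_addr & ((1 << PAGE_OFFSET_BITS) - 1)
    entry = sg_map.get(index, 0)
    if not entry & VALID_BIT:
        # The text says an exception is taken when the valid bit is clear.
        raise LookupError("S/G map entry %d is not valid" % index)
    page = entry & ((1 << PAGE_POINTER_BITS) - 1)
    return (page << PAGE_OFFSET_BITS) | offset


# Map entry 3 points at page 0x12345; address 0x7A5 is entry 3, offset 0x1A5.
example_map = {3: VALID_BIT | 0x12345}
physical = translate(0x7A5, example_map)
```

The 20-bit page pointer shifted left by 9 bits, combined with the 9-bit offset, gives the 29-bit physical address width quoted in the text.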

When multiple CQBIC chips are connected to the Q22-bus, they take on different functions. The first chip operates as Q22-bus arbiter; the others operate in auxiliary mode. As an auxiliary mode device, a CQBIC chip does not perform Q22-bus arbitration. Instead, the chip behaves as a typical Q22-bus DMA device that is a default Q22-bus slave. Therefore, when the CPU initiates a Q22-bus transaction, its CQBIC requests Q22-bus mastership. The arbiter CQBIC serves as Q22-bus arbiter and grants the bus accordingly to auxiliary mode CQBICs and other DMA devices.

Either as arbiter or as an auxiliary device, the arbiter function performs the system power-up, initialization, power-fail, and power-down sequences.

S/G Map Section
The S/G map consists of 8,192 longwords allocated from system memory. Each map entry consists of a 20-bit page pointer, a 3-bit descriptor which CQBIC ignores, and a valid bit. The low 9 bits of a Q22-bus address pass through as an intrapage offset; the upper 13 bits select the contents of one of the 8,192 S/G map locations. The CPU informs the CQBIC of the S/G map location by writing a base address into the CQBIC map base register. This write flushes the valid bits of the cached map entries.

Figure 2 Photomicrograph of CQBIC

To avoid map cache coherency problems, the CPU accesses the S/G map through a VAX I/O address range decoded by the CQBIC master section. The slave section then performs the S/G map memory transaction. This indirect approach prevents the CPU from directly modifying the S/G map memory independent of the 16 cached pointers. A CPU to S/G map write invalidates the cached map entry as the slave section performs the memory write. CPU to S/G map reads return the cached copy if the entry was cached, or otherwise return the S/G pointer from system memory.

As noted in the section Project Research, we selected a map cache size of 16 entries. The research of Q22-bus DMA device transfer sizes and the number of devices active in a dynamic system showed that 16 entries were sufficient to avoid thrashing on entries. The effects of the Q22-bus fair arbitration scheme were used to show that the simple first-in-first-out (FIFO) replacement algorithm selected did not waste performance and was consistent with incrementing DMA device addresses. As a DMA device transfer address incremented to a page boundary, the next map entry would be prefetched, and the previous map entry was not used unless the current I/O request completed and another was requested. We found that the operating systems allocated map entries for I/O requests to Q22-bus DMA


devices from a free pool list maintained in a last-deallocated, first-allocated manner. The overhead of one extra read for a map entry per page was found to be insignificant.

Q22-bus Master Section
The master section contains two configuration registers and three status-and-error reporting registers in addition to all the control circuitry.

The master section's function is to decode all the CVAX bus addresses and cycle status codes. This decoding determines which of two types of actions is required:

• A transaction to an internal register, the S/G map, or the Q22-bus
• Q22-bus mastership prior to completion of the transaction

Each of these actions is described in the text below.

If a decoded address requires no CQBIC response, a signal pin is asserted to external logic for control of buffers and timeout counters.

Transaction to an Internal Register
When the master section detects a CVAX bus address for one of the two control or three address registers in CQBIC, it returns or writes the data. The master section also facilitates a memory lock for the CPU to perform a read-lock and write-unlock operation. First, the master detects a CVAX bus interlocked transaction and then performs a retry until Q22-bus mastership is obtained. Q22-bus mastership is held until an unlock transaction or an exception occurs. As long as other Q22-bus devices follow this protocol, memory that is mapped to the Q22-bus can be shared.

Transaction to S/G Map
As noted in the S/G Map section, S/G map transactions are controlled by the master section. The master requests the slave and map cache sections to complete the memory and cache transactions. To construct the memory address for the slave and map cache, the master uses the significant low 13 bits of the S/G map I/O address as an offset from the map base register.

Transaction to Q22-bus
To avoid deadlocks, the master utilizes the CVAX CPU retry transaction. (The CVAX CPU relinquishes CVAX bus control to the CQBIC slave section. The CPU then retries the same transaction when bus control is returned.) S/G map transactions have a higher priority than Q22-bus slave transactions. The slave section therefore performs S/G map transactions in parallel with Q22-bus slave transactions. When the master tries to access the Q22-bus and it is busy, the arbiter attempts to gain mastership. Until mastership is obtained, the slave can perform a retry to satisfy the Q22-bus transactions.

Q22-bus Mastership
When the master acquires Q22-bus mastership, it sequences the transaction. A special case of the sequence occurs when the I/O memory segment address maps back to system memory through the slave and map cache. In this case a retry is used, and the slave gives the data to the master.

The CPU writes to the Q22-bus are accepted by the master in a dump-and-run manner to improve performance.

Q22-bus Slave Section
The slave section design implemented the two quadword read buffers and the two octaword write buffers. This section was the key to realizing the performance goals established for the chip. The slave has to respond to all Q22-bus transactions by checking the address in the S/G map and then sequencing the CVAX bus to put or get data. The slave must coordinate its intentions with all other chip sections to avoid deadlock conditions. This coordination is realized in a prioritization circuit which receives state inputs from all sections of the chip and outputs status codes to the slave and master sections to trigger actions.

The slave watches for master or Q22-bus transaction requests. When the slave receives Q22-bus addresses, it passes these to the map cache for validation. If the S/G entry is not cached, the map cache signals the slave to acquire a new S/G map pointer from system memory. The map cache will cache this new entry if the valid bit is set. If the valid bit is cleared, then an exception is taken. When the address is validated, the slave proceeds to sequence the transaction to or from a buffer and system memory. During slave writes to the system memory, the CVAX is signaled to invalidate its internal cache.

The slave maintains two octaword write buffers to optimize Q22-bus octaword block-mode transactions. By using a CVAX bus multitransfer burst,


the slave can unload one buffer to memory while filling the other octaword buffer.

For each new Q22-bus read request, the slave prefetches four words from memory. This prefetch is done in anticipation of block-mode transactions. These four words are buffered and sent to the Q22-bus master. As the third word is unloaded, the slave prefetches four more words.

As either a Q22-bus block-mode read or write transaction nears a page address boundary, the slave performs an S/G map entry prefetch of the next entry. The slave then passes the prefetched entry to the map cache.

An additional function of the slave section is a Q22-bus addressable interprocessor doorbell register. This register accommodates arbiter and auxiliary mode operation by supplying to the CPU a memory access semaphore, an interrupt request, and a vector address.

Q22-bus Electrical Interface Section
The Q22-bus is a 120-ohm transmission line with near- and far-end parallel termination. The length of the Q22-bus can vary from 25 to 60 centimeters and is subject to reflection and crosstalk noise. CQBIC contains 33 transceivers and 9 receivers which connect directly to the Q22-bus. The open-drain outputs and filtered inputs were designed to operate reliably in the Q22-bus environment.

The input filter rejects crosstalk and reflection noise by staging a low pass RC filter. The filter is constructed with an n-diffusion resistor and p-type field effect transistor (PFET) capacitor with a differential amplifier receiver which maintains a narrow noise immunity region.

The open-drain output driver controls the edge rates. This control minimizes transmission-line reflections and crosstalk for ac load variation from 30 to 330 picofarads, and dc termination variation of 240 to 60 ohms at 3.6 volts. To satisfy the 100 mA sink current possible on each of 33 outputs without excessive heating, low internal power dissipation was achieved by low steady-state "on" resistance. A disable control allows the output to power down without affecting the Q22-bus.

Conclusion
A single chip Q22-bus interface was realized and is being shipped in Digital's systems as the result of the successful venture for JRDC, SEG, and MSD. We learned how to manage efforts from a distance and to coordinate and communicate complex technical information around the globe.

Acknowledgments
The author wishes to acknowledge the technical contributions of S. Kyu, S. Akanuma, A. Abe, H. Hayashi, M. Hasegawa, M. Taiji, S. Iida, M. Kikutani, K. Koga, L. Walker, D. Grondalski, J. Lipcon, and R. McNamara.

References
1. B. Maskas, "Developing the MicroVAX II CPU Board," Digital Technical Journal (March 1986): 37-47.
2. P. Rubinfeld, et al., "The CVAX CPU, a CMOS VAX Microprocessor Chip," ICCD Proceedings (October 1987): 148-152.
3. G. Lidington, "Overview of the MicroVAX 3500/3600 Processor Module," Digital Technical Journal (August 1988, this issue): 79-86.
4. T. Leonard, ed., VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-3459E-DP, 1987).

David K. Morgan

The CVAX CMCTL - A CMOS Memory Controller Chip

The CMCTL - part of the CVAX family of chips - is a high-performance ECC memory controller for single-processor systems. Implemented in Digital's CMOS technology, the CMCTL is optimized to satisfy Q-bus-based system requirements. The CMCTL operates as either a synchronous or an asynchronous interface between the CVAX bus at cycles from 60 to 100 nanoseconds and the private memory interconnect. For memory read or write operations, the CMCTL supports the CVAX multiple-transfer protocol. Data parity and memory error checking is implemented for all data transfers. The chip's high performance is achieved in part by a high-speed, page-mode access protocol.

A shorter version of this paper first appeared in the Proceedings of the 1987 ICCD: VLSI Computers and Processors, October 1987, entitled "The CVAX CMCTL, A CMOS Memory Controller Chip" by D. Morgan, K. Chui, J. Clouser, S. Nadkarni, and R. Strouble. Copyright 1987, The Institute of Electrical and Electronics Engineers, Inc.

The decision to design a CVAX memory controller (CMCTL) was made in July 1984. The primary goal of the CVAX CMCTL project was to design a high-performance, single-chip, error-correcting code (ECC) memory controller for a single-processor system. This chip would be part of a CVAX family of core peripheral functions.

Several systems being developed at that time utilized the MicroVAX II CPU chip, the predecessor to the CVAX CPU chip. Because company revenue for Q-bus-based systems such as the MicroVAX II is significant and a performance benefit could be gained from a custom chip design, the memory controller design goals were focused to satisfy the requirements of a Q-bus-based system. The initial system requirements for the CMCTL were determined by studying the memory controller specifications and by discussing requirements with key members of the project team for the existing MicroVAX II system. In addition, the Electronic Storage Development (ESD) Group was consulted on the requirements of a memory controller.

Let us now examine the key aspects of the CVAX CPU chip that influenced the system requirements for the CMCTL. First, the CMCTL had to interface directly to the CVAX bus and handle the memory transactions originating from the CVAX CPU chip. Located in the CVAX CPU chip is an integral primary write-through 1-kilobyte (KB) cache. The size of this cache can be optionally expanded with a second-level cache function on the CVAX bus. Consequently, the CMCTL-to-CVAX bus interface had to work with or without the optional second-level cache. Furthermore, the primary cache and the optional second-level cache use byte parity for memory error detection. Therefore, the CMCTL bus interface was required both to generate and to check byte parity. For CVAX-based systems operating at 100-nanosecond (ns) and 60-ns CVAX bus cycles and implementing a second-level cache, the performance goals were respectively 2.5 and 4.0 times the performance of the MicroVAX II system. These goals governed the CMCTL bus memory performance, or memory cycle time, requirements described later in this paper. Since memory size requirements are proportional to CPU chip performance, the CMCTL had to support a memory size larger than that of the MicroVAX II. The MicroVAX II CPU memory systems have a byte-parity, memory error-detection scheme. To meet the reliability requirements for larger memory systems, the CMCTL was designed primarily as an ECC memory controller.

Since a direct memory access (DMA) function can also become the bus master on the CVAX bus, the system requirements for the CMCTL were


influenced by these functions also. Because the CVAX CPU chip performs only synchronous transactions and a DMA function could be either synchronous or asynchronous, the CMCTL is designed to run as a synchronous or asynchronous slave on the CVAX bus. Further, the CVAX CPU chip can handle only two of the possible four types of data transfer lengths on the CVAX bus. However, a Q-bus DMA function (CQBIC) needed to generate all four possible data transfer lengths in order to efficiently handle data transfers between the Q-bus and the CVAX bus, which have data widths of 16 bits and 32 bits, respectively. The requirement to work with the Q-bus DMA function meant that the CMCTL needed to handle all four data transfer lengths. In addition, since a DMA function could optionally generate and check parity, the CMCTL had to be flexible in this regard as well. Finally, the CVAX CPU chip executes interlocked instructions which must have the effect of "locking" or "unlocking" the memory from DMA read-modify-write transactions. Interlocked memory transactions are not defined in the Q-bus protocol. Therefore, interlocked memory transactions are handled with a bus interlock scheme. In this scheme, the CQBIC stalls, i.e., RETRY, the CVAX CPU chip memory read lock bus transaction on the CVAX bus until it becomes the Q-bus master first - locking out I/O to memory - before the CVAX can perform interlocked instructions. RETRY is a slave response to a bus master on the CVAX bus that tells it to retry the bus cycle because it cannot complete the requested operation. The CQBIC releases the Q-bus after it sees a CVAX CPU chip memory write transaction on the CVAX bus that signals the termination of the interlock instruction.

Certain base technology constraints influenced the CMCTL specification. First, the high performance requirements for memory in a system that does not implement a second-level cache determined that the CMCTL be implemented in a single custom chip. At the time, it was not possible to implement a memory controller with the required speed in a commercially available gate array that would run synchronous with the CVAX CPU chip. Furthermore, in a Q-bus-based system, memory expansion occurs in the Q-bus backplane. Therefore, a single memory controller that resides on the CPU module and controls the memory by means of signals on the backplane is the simplest and most quickly implemented system solution. Another factor that influenced the single-chip alternative solution was the limited space available on the CPU module that implements a second-level cache. Taken together, these factors ruled out the possibility of designing a slower memory controller using commercially available memory controller components for systems that implement a second-level cache. The availability of CMOS-I technology in Digital's Hudson, Massachusetts, facility in 1984 drove the design technology choice.

System Overview
The CVAX CMCTL is the core control function of a single CVAX CPU memory system. This chip serves as the interface between devices on the CVAX bus and a CMOS private memory interconnect (PMI). Figure 1 shows the major interfaces

[Figure 1 block diagram; labeled interfaces: PMI, CVAX bus]

Figure 1 Major Interface Connections of the CMCTL Chip


Figure 2 Photomicrograph of CMCTL Showing Major Sections

of the CMCTL in a CVAX system, and Figure 2 shows the major sections of the chip. Table 1 lists the physical characteristics of the chip.

This section presents a brief overview of the CMCTL chip's two major interfaces, data transfer support, and error-checking and notification features.

CMCTL Major Interfaces
As interface to the CVAX bus, the CMCTL responds as either a synchronous or an asynchronous slave device. When the CVAX CPU chip is bus master, the CMCTL responses are synchronous. When a DMA device is bus master, a bus-mode signal determines whether the chip responds as a synchronous or asynchronous device.

Table 1 CMCTL Summary Characteristics

Process                 2-micron drawn, N-well, dual aluminum CMOS process
Number of transistors   20,000
Die size                7.6 mm x 8.0 mm
Power dissipation       1.5 W worst case
Packaging               132-pin surface-mountable chip carrier with 25-mil lead spacing
Power supply            +5 V

The CMCTL connects directly to its other major interface, the PMI. The PMI consists of control, address, and data signals which interconnect the CMCTL and the memory array modules. Through


these interconnections, the chip controls up to four memory modules, each containing one, two, or four banks of dynamic random-access memory (DRAM). Each memory module is required to buffer all the PMI signals.

Data Transfer
The CMCTL fully supports the CVAX bus multiple-transfer protocol and can perform one to four data transfers on a memory read or write operation. Each data transfer can have up to four bytes of data. Since ECC is generated across four bytes, write data with less than four valid bytes will cause the CMCTL to do the actual memory write on the PMI as a read-modify-write cycle. Otherwise, the write data goes directly to memory.

Error Checks and Notification
The CMCTL performs two error-checking functions:

• CVAX bus data parity error checks
• Memory error checks

To assist with the error checking of data transfers on the CVAX bus, the CMCTL checks data parity on memory writes. The chip generates parity with the data on memory reads.

For data transfers on the PMI, the CMCTL has two memory error-checking modes: 7-bit ECC, and single-bit parity. In ECC memory error mode, the CMCTL detects double-bit uncorrectable memory errors and detects and corrects single-bit memory errors. In parity memory error mode, the CMCTL can detect single-bit memory errors.

The CMCTL uses four outputs to notify the CVAX bus master of four error conditions. These error-condition notices are as follows:

• The bus transaction was successful and completed with no errors.
• The memory data transfer resulted in an uncorrectable ECC or parity error.
• The memory data transfer resulted in a correctable memory error.
• The CVAX CPU chip-initiated memory write had a parity error.

In addition to these four outputs, the CMCTL provides an output that indicates when the CMCTL is not going to respond to either a memory or an I/O operation. This output reduces the number of external components required to detect addresses not implemented in a system.

CMCTL Performance
The CMCTL achieves its performance in part by using a high-speed, page-mode RAM access protocol on the PMI. DRAMs that run in page mode can perform data transfers in approximately one-half the cycle time of those run in nonpage mode. The CMCTL responds to CVAX single-transfer memory write or read operations within two or four CVAX bus cycles, respectively. During a memory read operation, the CMCTL starts a memory read access in parallel with an optional cache to increase memory read performance. If the memory read address hits in the external cache, the CMCTL aborts the read operation. The CMCTL performs memory write transactions as dump-and-run.

Table 2 lists the memory operations and the corresponding performance for synchronous data transfers with 4 bytes of data. Two numbers are shown for multiple-transfer memory operations. The first is the time in CVAX CPU bus cycles to complete the first transfer; the second, the time to complete subsequent transfers. In order to tune the memory performance across different CVAX bus speeds, the CMCTL provides a programmable mechanism for varying PMI transaction timing. For CVAX bus cycle times less than 100 ns, the CMCTL can be programmed to add slip cycles to memory read operations in increments of the CVAX bus cycle time. The asynchronous performance of the CMCTL can be estimated by adding one bus cycle to the synchronous data transfer numbers in Table 2.

Table 2 CVAX CMCTL Read and Write Performance (in Numbers of Bus Cycles)

Memory Operation        CVAX Bus Cycles
(4 Bytes of Data)       100 ns    90 ns    60 ns
Single read             4         5        6
Multiple read           4/2       5/3      6/3
CPU single write        2         2        2
DMA single write        3         3        3
Multiple write          3/2       3/2      3/2

The CMCTL memory read access time is very important for systems that do not have a second-level cache. For example, a 90-ns CVAX bus cycle with a 5/3 CMCTL memory read access with a second-level cache results in CPU performance 3.0 times that of the MicroVAX II. Without the second-level cache, the CPU performance is


reduced by 15 percent, or to 2.5 times the MicroVAX II. If the CMCTL memory read access was fixed at 6/3 without the second-level cache, the CPU performance would be reduced another 10 percent, or to 2.0 times the MicroVAX II, at a 90-ns CVAX bus cycle. Therefore, the ability to program the CMCTL memory read access time as an integral multiple of the CPU bus cycle is a very important feature that helps maximize the CPU performance.

CMCTL Functions
The CMCTL was designed to integrate both the control and data path functions required to control the data flow to and from memory.

Registers
The CMCTL contains two registers:

• A status register
• A control register

How each functions within the CMCTL and the system is described below.

The status register is loaded with important information when the CMCTL detects an error. The system error-handling software uses this information to log the error. The CMCTL has a memory error status register that captures the failed memory address along with the type of memory error (bus parity error or memory error) and error syndrome. In ECC mode, the error syndrome is a 7-bit encoded number. For correctable errors, this number indicates which data bit was corrected. In parity mode, the error syndrome has no useful meaning.

The chip's control register serves several functions. First, the control register regulates a diagnostic test mode. Second, this register controls the PMI cycle tuning. Third, memory error detection and correction can be turned on or off to facilitate the testing of the CMCTL error-checking functions and memory module RAMs by memory diagnostic software. Finally, a refresh operation can be forced for high-speed refresh testing.

Data Path
In ECC error detection mode, the data path uses a modified Hamming code to detect double-bit errors and to detect and correct single-bit errors. The PMI interface has 39 signals; 32 are used for the memory data, and 7 are the memory check bits. In parity error detection mode, the data path uses single-bit parity to detect memory errors. The data path transport delay for a memory read or write is one-half the cycle time of the CVAX bus. This performance measure includes module-level interconnect delay.

Memory Control
The PMI interface provides 20 signals. These signals comprise all the control strobes and memory address signals needed to control DRAMs. A fast memory access time is achieved by detecting a valid memory address and starting a memory access within 25 percent of a CVAX bus cycle time.

The CMCTL has an integral refresh counter for refreshing memory.

Summary
The CVAX CMCTL is the core control function of a complete memory subsystem. The chip provides the control for a flexible memory subsystem that functions at CVAX bus cycles from 60 to 100 ns.

Acknowledgments
The author wishes to acknowledge the technical contributions of F. Aires, M. Benoit, K. Chui, J. Clouser, N. Fitzgerald, J. Gerde, B. Griswold, N. Murthy, S. Nadkarni, J. Siegel, R. Strouble, K. Steward, and K. TenHuisen.
