Digital Technical Journal

Number 4

February 1987

Editorial Staff
Editor - Richard W. Beane

Production Staff
Production Editor - Jane Blake
Designer - Charlotte Bell
Interactive Page Makeup - Leslie K. Schoemaker

Advisory Board
Samuel H. Fuller, Chairman
Robert M. Glorioso
John W. McCredie
Mahendra R. Patel
F. Grant Saviers
William D. Strecker

The Digital Technical Journal is published by Digital Equipment Corporation, 77 Reed Road, Hudson, Massachusetts 01749.

Changes of address should be sent to Digital Equipment Corporation, attention: Media Response Manager, 200 Baker Ave., CFO1-1/M94, Concord, MA 01742.

Comments on the content of any paper are welcomed. Write to the editor at Mail Stop HL02-3/K11 at the published-by address. Comments can also be sent on the ENET to RDVAX::BEANE or on the ARPANET to BEANE%RDVAX.DEC@DECWRL.

Copyright © 1987 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. Requests for other copies for a fee may be made to the Digital Press of Digital Equipment Corporation. All rights reserved.

The information in this journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

ISBN 1-55558-001-7
Documentation Number EY-6711E-DP

The following are trademarks of Digital Equipment Corporation: DEC, DECnet, the Digital logo, LN03 Plus, MicroVAX I, MicroVAX II, NMI, PDP-11, PDP-11/24, PDP-11/44, RSX, RSX-11M, RSX-11M-PLUS, SBI, UNIBUS, VAX, VAX-11/750, VAX-11/780, VAX-11/782, VAX 8200, VAX 8300, VAX 8500, VAX 8550, VAX 8600, VAX 8650, VAX 8700, VAX 8800, VAXBI, VAXBI 78732, VAXcluster, VAXstation, VAXstation II, VMS

ADA is a registered trademark of the U.S. Government

Data General is a registered trademark of Data General Corporation

Harris is a trademark of Harris Corporation

IBM is a registered trademark of International Business Machines Corporation

Lightspeed is a trademark of Lightspeed, Inc.

[…] is a registered trademark of Motorola, Inc.

SCALDsystem and ValidGED are trademarks of Valid […], Inc.

TK!Solver is a trademark of Software Arts, Inc.

UNIX is a trademark of American Telephone & Telegraph Company Bell Laboratories

Cover Design
This issue features the VAX 8800 family. Our cover depicts the growth of a chambered nautilus as a metaphor for the growth of the VAX family. As those chambers spiral from the center, so the power of the VAX family grows from the MicroVAX systems, through the VAX 8200 and 8300 CPUs, to the new VAX 8800 multiprocessor. The image was created using the Lightspeed system.

The cover was designed by Deborah Falck, Eddie Lee, and Tsuneo Taniuchi of the Graphic Design Department.

Book production was done by Educational Services Media Communications Group in Bedford, MA.

Contents

Foreword
Donald J. McInnis

New Products

An Overview of the Four Systems in the VAX 8800 Family
Robert M. Burley

The VAX 8800 […]
Sudhindra N. Mishra

The CPU Clock System in the VAX 8800 Family
William A. Samaras

Aspects of the VAX 8800 C Box Design
John Fu, James B. Keller, and Kenneth J. Haduch

The Memory System in the VAX 8800 Family
Paul J. Natusch, David C. Senerchia, and Eugene L. Yu

Floating Point in the VAX 8800 Family
John H.P. Zurawski, Kathleen L. Pratt, and Tracey L. Jones

The VAX 8800 Input/Output System
James P. Janetos

The VAXBI Bus - A Randomly Configurable Design
Paul C. Wade

A Logical Grounding Scheme for the VAX 8800
Gerald J. Brand and Michael W. Kement

The Simulation of Processor Performance for the VAX 8800 Family
Cheryl A. Wiecek

VMS on the VAX 8800 System
Stuart J. Farnham, Michael S. Harvey, and Kathleen D. Morse

A Parallel Implementation of the Circuit Simulator SPICE on the VAX 8800 System
Gabriel P. Bischoff and Steven S. Greenberg

The Impact of VAX 8800 Design Methodology on CAD Development
Dennis T. Bak

On-line Manufacturing Data Access on the VAX 8800 Project
Andrew J. Matthews

Editor's Introduction

Richard W. Beane
Editor

This issue features papers about the design of the VAX 8800 family of CPUs, written by members of the design team. The technology used in Digital's latest high-end machine, the VAX 8800 multiprocessor, also forms the basis for the other three family members: the 8700, 8550, and 8500 CPUs.

Bob Burley's overview relates the processes used in the 8800 design and the functions of the memory interconnect (NMI), the VAXBI I/O bus, and the four logic boxes forming the five-stage pipeline. The early discovery of design flaws and the use of automated tools helped to achieve an aggressive completion schedule.

The micromachine implements the microarchitecture and contains four of the five pipeline stages. Sudhin Mishra describes how microinstructions are handled, emphasizing the use of microbranches and microtraps to ensure coherency.

The VAX 8800 clock system, discussed by Bill Samaras, was designed using an automated timing verifier. He describes the trade-off between using the verifier and maximizing the accuracy of timing signals by minimizing their skew.

The C Box and the M Box are two parts of the pipeline. John Fu, Jim Keller, and Ken Haduch describe the C Box's no-write allocate cache and the delayed-write algorithm that ensures correct write-through. The C Box must also handle […] conditions and maintain data coherency between processors. The M Box handles read and write requests for the memory arrays. Paul Natusch, Dave Senerchia, and Gene Yu explain how the designs of the NMI and the […] affected their design, and why they used TTL in the […].

The VAX 8800 family does not have a separate floating point accelerator. As John Zurawski, Kathy Pratt, and Tracey Jones point out, however, a custom ECL unit achieves high performance through the normal […]. Thus less hardware is needed, and operands are fetched faster.

I/O devices are linked to the CPU by the VAXBI bus. In his paper, Jim Janetos discusses the NBI adapter, which contains logic to handle CPU references and DMA requests. Then Paul Wade describes how the VAXBI design team had to abandon the traditional approach and use a variety of techniques to specify the bus. Some chip problems were resolved only after a thorough analysis of the physical configuration.

Jerry Brand and Mike Kement discuss the importance of using ground correctly as a signal conductor to achieve high performance. They describe the sources of ground-related noise in the CPU, and what they did to isolate and control those sources.

Many VMS features support multiprocessing. Stu Farnham, Mike Harvey, and Kathy Morse first describe the hardware that supports multiprocessing, then the interlocked instructions, exception handlers, and traps that implement VMS multiprocessing. To show how multiprocessing decreases execution time, Gabriel Bischoff and Steve Greenberg converted the SPICE circuit simulator into CAYENNE, a parallel program. They created master and slave processes that ran CAYENNE 1.7 times faster than SPICE.

The final two papers relate some of the automated tools and techniques used on the 8800 project. Dennis Bak first describes building the CAD suite from existing tools, newly developed ones, and modifications. The methodology was truly innovative, serving as a framework for future projects. Then Andy Matthews discusses the on-line system that transformed CAD data into specifications used by Manufacturing. This system minimized the product start-up time by eliminating paperwork.

Biographies

Dennis T. Bak
Dennis Bak is a principal software engineer in the Advanced VAX Development Group. As a project leader, he is currently developing new CAD tools to improve designer productivity on future design projects. In other positions, Dennis performed configuration testing for PDP-11 and VAX systems. Prior to joining Digital in 1980, he worked as a research engineer at Ford Motor Company, doing advanced development on electronic engine-control systems. Dennis earned a B.S. degree in electrical engineering from the University of Michigan in 1974.

Gabriel P. Bischoff
In 1985, Gabriel Bischoff joined Digital after receiving a Diploma of Engineer and a Diploma of Advanced Studies in device physics from the Ecole Centrale de Lyon (1980) and a Ph.D. degree in E.E. from Cornell University (1985). As a senior software engineer in the Semiconductor Engineering Group, he is investigating the application of parallel computing architectures for VLSI CAD tools, particularly circuit simulators. Gabriel developed a parallel version of the circuit simulator SPICE for shared-memory multiprocessors. A member of IEEE, he has published papers on device modeling and circuit simulation.

Gerald J. Brand
Jerry Brand is a principal engineer currently developing high-density, high-availability power systems. Prior to working on the power and packaging team for the VAX 8800 family, he designed two MPS power modules that are widely used in Digital's products. Before joining Digital in 1980, Jerry worked for over 14 years in disciplines ranging from oceanography to gas-turbine instrumentation. He holds a B.S.E.E. degree from the University of Illinois and participated in the M.S.E.E. program at the University of New Hampshire. Jerry teaches circuit analysis and electronics in the continuing education program at the University of Lowell.

Robert M. Burley
As a senior product management manager, Bob Burley was the engineering product manager for the four systems in the VAX 8800 family. As a program manager in the LSI Acquisition and Test Group, he was responsible for relations with external vendors and acquiring technologies for the advanced gate arrays used in new CPU designs. Prior to joining Digital in 1980, Bob was a product and business development manager at Colt Industries, Inc., and a product and manufacturing manager at Scott Paper Company. He earned his B.S. degree in mathematics and economics from Hobart College in 1965.

Stuart J. Farnham
As a principal software engineer in the VMS Development Group, Stu Farnham is currently working on future directions in multiprocessing. Earlier, he provided VMS support at the corporate level for Software Services. Stu was a developer and instructor for the VAX/VMS Systems Seminar. He joined Digital in 1982 after working as a software engineer at Pitney Bowes, Inc.

John Fu
Currently earning his M.S. degree in science at the University of Illinois, John Fu was a principal engineer on the VAX 8800 project. He worked on the design of the C Box and configurations for the VAX 8800 family. Formerly, he worked on large-systems designs at International Computers Limited and on control systems for Siemens Limited. John was also a project manager at Systems and Software, Inc. He received a B.Sc. (Hons) in computer science (1977) from the University of Manchester in England. John is a member of the British Computer Society and the IEE in England.

Steven S. Greenberg
As a team leader in the CAD Department of the Semiconductor Engineering Group, Steve Greenberg codeveloped the CAYENNE program. An early provider of circuit and […] simulators at Digital, he did research in timing verification and circuit simulators. As a Digital industrial fellow at the University of California at Berkeley, Steve performed research on iterated timing analysis. Before joining Digital in 1976, he was a member of the technical staff at RCA and a CAD engineer at Texas Instruments. Steve received a B.S.E.E. degree (1966) from M.I.T. and an M.S.E.E. degree (1979) from Northeastern University. He is a member of IEEE and Tau Beta Pi.

Kenneth J. Haduch
In 1974, Ken Haduch joined Digital after earning his Associate in Electronic and Computer Technology degree from the Electronic Institutes, Pittsburgh. He worked as a technician in Manufacturing on the PDP-11/70 and VAX-11/780 CPUs and in Engineering on the DR750 and FP750 designs. Ken helped to develop the C Box as a hardware designer on the VAX 8800 project. He is currently a hardware engineer in the Advanced VAX Development Group, working on the hardware design for a new VAX processor. Ken is also pursuing a B.S. degree from Northeastern University.

Michael S. Harvey
Mike Harvey joined Digital in 1978 after receiving his B.S. degree in computer science from the University of Vermont. He worked on developing the RSX-11M and RSX-11M-PLUS operating systems and then led the team that developed the VAX-11 RSX layered product for the VMS system. Since joining the VMS Development Group, Mike has participated in new processor support for the VAX 8300 and 8800 systems, specializing in multiprocessing. As a principal software engineer, he is currently working on future directions for VMS multiprocessing and support for high-end VAX CPUs.

James P. Janetos
Jim Janetos is currently studying computer architecture as a graduate student at Purdue University. He joined Digital in 1980 after receiving his B.S.E.E. degree (Summa Cum Laude) from the University of Michigan, where he was elected to Tau Beta Pi. As a design engineer, Jim worked on memory upgrades for the PDP-11/24 and 11/44 systems, on memory system designs, and on dynamic RAM evaluations. On the VAX 8800 project, he initially worked on the diagnostic software for the I/O adapter, the NBI. Later, he designed the NBIB module, one of the two modules in the NBI.

Tracey L. Jones
Earning her B.S. degree in computer engineering from Boston University, Tracey Jones joined Digital after graduation in 1982. As a firmware engineer in the Advanced VAX Engineering Group, she wrote a major portion of the […] that performs floating point operations in the VAX 8800 family of processors. After promotion to senior engineer, Tracey enrolled in Digital's Graduate Engineering Education Program and is now pursuing an M.S. degree in electrical engineering at Brown University.

James B. Keller
Jim Keller is the project leader for the instruction-fetch and execution units, the I and E Boxes, and the console for a new VAX processor. On the VAX 8800 project, he worked on the design of the C Box. Prior to joining Digital in 1982, Jim worked on fiber optics and the designs of several microprocessor boards at Harris Corporation. He earned a B.S. degree in electrical engineering in 1980 from Pennsylvania State University, where he was elected to Eta Kappa Nu. Jim has applied for three patents on the technology in the VAX 8800 design.

Michael W. Kement
Mike Kement is a senior design engineer in the Power System Technology Group, currently working on EMI and EMC. He was the design engineer for the power system on the VAX 8800 project. Mike has worked on the power systems of many products since joining Digital in 1974, including the LA36 and LA180 terminals, the PDP-11/44, VAX-11/780 and 11/750 systems, and the VAX 8600 CPU.

Andrew J. Matthews
As a senior software manager in the Advanced VAX Systems CAD Group, Andy Matthews is currently automating the CAD to CAM transition. He has managed the development of surface-mount CAD processes and a pilot program of advanced CAD to CAM data methods. Andy designed the prototype and first release of VLS, the VAX layout software Digital uses for module design. He worked for Adage, Inc., as the manager of applications programming before coming to Digital in 1977. Andy holds a B.S. degree in C.S. and M.E. (1968) from Boston University. He has presented two papers at the Design Automation Conference.

Sudhindra N. Mishra
Sudhin Mishra is a project leader in the Advanced VAX Development Group, currently developing a design verification CAD tool. As a principal engineer on the VAX 8800 project, he designed and implemented most of the I Box and originated the system-level simulation of the CPU. Before joining Digital in 1982, he was a senior research engineer at Prime Computers, Inc. Sudhin has worked on projects ranging from radar and heat-seeking missiles to computers. He earned a B.Sc. degree in engineering from Ranchi University and an S.M. in E.E. and C.S. from M.I.T. Sudhin has applied for a patent on the technology in the VAX 8800 design.

Kathleen D. Morse
As a consulting software engineer, Kathy Morse is responsible for all low-end CPUs and peripherals. She is also one of the designers for future directions in VMS multiprocessing. Kathy provided VMS support for the VAX-11/782 and MicroVAX I and II systems, and the MA780 memory. She joined Digital after receiving her B.S.C.S. degree (1976) from Worcester Polytechnic Institute, where she also earned her M.S.C.S. degree (1985). Kathy is a member of IEEE, the Professional Council, ACM, Tau Beta Pi, and Upsilon Phi Epsilon. She has published in the Computer Measurement Group's Conference Proceedings, the Digital Technical Journal, and DATAMATION.

Paul J. Natusch
As a principal hardware engineer, Paul Natusch is currently managing the hardware development for a new VAX processor in the Advanced VAX Development Group. On the VAX 8800 project, he was a member of the memory system team and later took over as its leader. Earlier, he worked on an upgrade to the VAX-11/750 memory controller, which expanded it from 2MB to 8MB. Paul joined Digital in 1980 from Storage Technology Corporation, where he was a diagnostic engineer. He received his B.S.E.E. degree from Cornell University in 1979 and an M.B.A. degree from Northeastern University in 1985.

Kathleen L. Pratt
Educated at Rensselaer Polytechnic Institute, Kathy Pratt came to Digital after receiving her B.S. degree in computer and systems engineering in 1980. She worked on hardware designs for networks in the Local Area Networks Group, then on the design of the floating point hardware for the VAX 8800 central processor in the Advanced VAX Development Group. Kathy is currently a senior engineer working on the floating point design for a new VAX processor.

William A. Samaras
Bill Samaras is a principal engineer working to design a new VAX processor. He joined Digital in 1982 to design the clock system on the VAX 8800 project. Formerly, at Accutest Corporation, Bill designed VLSI testers and timing systems. He holds an Associates degree (1973) from Northern Essex Community College, and B.S. degrees in engineering technology (1975) and electrical engineering (1976), both from Southeastern Massachusetts University. Bill teaches for continuing education at the University of Lowell. He has applied jointly for a patent on the technology in the 8800 clock system.

David C. Senerchia
Dave Senerchia is currently a senior engineer in the Electronic Storage Development Group. He is a member of the design team working on the main memory for a new mid-range VAX system. On the VAX 8800 team, Dave designed the initial array module for main memory and participated in the architecture and design of the memory system, the M Box. He joined Digital in 1982 after earning a B.S. degree in electrical engineering from Washington University.

Paul C. Wade
As a principal engineer, Paul Wade is working on advanced development for future VAX CPUs. He was responsible for the electrical design, verification, and testing for the VAXBI bus. Paul also designed parts of the VAX 8200 system. Before joining Digital in 1980, he worked as a project engineer at Microwave Semiconductor Corporation, RCA, and Lockheed Electronics. Paul earned a B.S.E.E. degree (1973) from Newark College of Engineering. He holds a patent on gallium arsenide technology and has written nine papers on that topic. One paper won the Beatrice Winner Award at the 1980 ISSCC.

Cheryl A. Wiecek
Cheryl Wiecek is the engineering manager of the Systems Architecture Group and is responsible for the VAX architecture and a number of Digital's interconnect architectures. She worked on VAX instruction-set characterization and performance simulation for the VAX 8800 CPU. Cheryl also worked on PDP-11 performance simulation after coming to Digital in 1978. She was a programmer/analyst at the Connecticut Education Association and taught mathematics in Connecticut. Cheryl holds a B.A. degree in mathematics (1974) and an M.S. degree in computer science (1979) from the University of Connecticut. She has published five papers on computer performance in ACM and IEEE journals.

Eugene L. Yu
Gene Yu is a senior design engineer in the Engineering Group at Palo Alto. On the VAX 8800 project, he designed the memory system interface to the memory interconnect, the NMI. Before joining Digital in 1982, Gene worked at […] as a hardware designer on their 400 and 9900 systems, and at Data General Corporation on Nova products. He earned a B.S. degree in electrical engineering from the University of Massachusetts. Gene has applied for a patent as coinventor of the NMI and memory design for the VAX 8800 CPU.

John H.P. Zurawski
John Zurawski is a consulting engineer working as the project leader for computer arithmetic in the Advanced VAX Development Group. He led the team that designed the floating point strategy and hardware for the VAX 8800 family. John joined Digital in 1982 from the University of Manchester, where he was a post-doctoral research associate. He holds a B.Sc. degree in physics (1976), and M.Sc. (1977) and Ph.D. (1980) degrees in computer science, all from the University of Manchester. A member of IEEE, John has published four papers on computer technology.

Foreword

Donald J. McInnis
Group Manager, Advanced VAX Engineering

Since the announcement of the VAX-11/780 system in November 1977, Digital Equipment Corporation has steadily expanded the VAX family with new VAX products: the VAX-11/750, VAX-11/730, MicroVAX I, VAX-11/725, VAX-11/785, VAX 8600, MicroVAX II, VAX 8650, VAX 8200, and VAX 8300 systems. The market acceptance of the VAX family has been excellent across almost all computing applications. This remarkable and steady increase in the use of VAX systems creates a continuous demand by the VAX customer base for enhanced products across all segments of the computing industry. In the fall of 1982, the development team for the 8800 project (known internally as "Nautilus") was assigned the responsibility of designing new systems to enhance the mid-to-high end of the VAX family.

This issue of the Digital Technical Journal represents a sampling of the types of design engineering that went into the VAX 8800 family. It takes an amazingly large number of different engineering disciplines to design and manufacture a product of this complexity. As time moves on, each successive development project seems to require a bigger investment in a larger number of disciplines to produce a product attractive to the marketplace. It is unfortunate that neither time nor space permits us to give proper visibility to all the design, manufacturing, and customer-service engineering efforts that led to the shipment of the VAX 8800 family.

The VAX 8800 family consists of four new processors: the VAX 8800, VAX 8700, VAX 8550, and VAX 8500 CPUs. The VAX 8800 family and the VAX 8200 system introduced a major new I/O bus, the VAXBI. We also introduced a completely new set of I/O adapters for the VAXBI bus, which will be the new foundation I/O channel for many future mid- to high-end VAX systems. The VAXBI bus will replace the UNIBUS on this class of system. The VAXBI offers a six-fold increase in performance and substantially better reliability and maintainability features in comparison to the UNIBUS.

The 8800 represents a significant advance into new areas of high-performance computing for the VAX family. A customer can replace a VAX-11/780 CPU with a VAX 8800 CPU in the same footprint and effect an order of magnitude increase in the amount of work done. The VAX 8500 CPU is really a replacement product for the VAX-11/785 CPU kernel. However, the 8500 has the same price, twice the performance, and one-third the footprint.

To produce a product that has a good price/performance ratio in the marketplace, you have to push hard on some dimensions of technology. A number of new pieces of technology were introduced on the VAX 8800 project, such as the 22-layer backplane and a 480-pin, zero insertion force connector. In the VLSI technology area, one 8800 includes a total of 186 emitter-coupled logic (ECL) gate arrays and a total of 28 custom-designed ECL parts.

The cycle time of a VAX CPU is a large determinant in its performance. The challenge of meeting a 45-nanosecond cycle time (versus 200 nanoseconds for the 11/780) required significant advancements in technology implementation and in CAD tools for analysis.

Enhancements were made to the base operating system software for the VAX 8800 processor. These software enhancements represent a technological change that is available to our customers. The VMS operating system was improved significantly to provide much better throughput for customers using the VAX 8800 dual processor as a general-purpose system. The ULTRIX-32 operating system was enhanced to support tightly coupled multiprocessing. Software library structures were also developed for customers who might want to improve the throughput of a single job by decomposing it to run in parallel on the tightly coupled dual processors of an 8800.

To meet the performance goals, the overall design of the VAX 8800 system is necessarily quite complex and was potentially difficult to implement quickly and correctly. We understood this from the beginning of the project, based on our understanding of the experiences of previous projects (e.g., the VAX-11/750, VAX 8600, and J11 VLSI CPU chip projects). To manage that complexity in a timely manner, we selected some key strategies and stuck with them through the completion of the project. They proved to be very successful since the hardware prototypes were relatively error free, and the manufacturing start-up was very smooth and rapid. Some of these strategies are as follows:

• The project followed a structured design methodology that ensured the completion of comprehensive specifications before any detailed design was done.

• We made a large investment in our CAD team and in CAD tools to automate the design process.

• The basic design was managed by a chief architect.

• The system was simulated extensively before we built any hardware. (We finished the project with 14 VAX-11/780 and 11/785 systems in our cluster. During our peak simulation effort, however, over 30 dedicated VAX systems were used for a period of several months.)

• Since many different engineering and manufacturing locations were involved, we made extensive use of Digital's worldwide network for electronic mail and data exchange.

A more important factor than any of the above examples, however, was the people who worked on the project. We attempted to build an excellent team that worked well together. The attribute of teamwork and the willingness of people to have a broad engineering focus proved to be invaluable, especially in the simulation and prototyping phases. The core management team started with very experienced people, most of whom had VAX-11/780 or VAX-11/750 development experience: Sas Durvasula, VAX 8500 project manager; John Hittell, manufacturing manager; Steve Jenkins, engineering manager; Nancy Kronenberg, VMS engineering; Bob Kusik, CAD manager; Steve Omand, customer service engineering; and Bob Stewart, chief architect. Many contributors at the next level also had similar backgrounds, and all remained in place for the duration of the project. This continuity was a major factor in completing a very successful project and a very successful family of products.

Robert M. Burley

An Overview of the Four Systems in the VAX 8800 Family

The VAX 8800 multiprocessor and the VAX 8700, 8550, and 8500 systems all derive from the same fundamental design. Their sustained applications throughput ranges from 3.0 to 12 times that of the VAX-11/780 system. In the design process, automated tools helped to correct design bugs early. ECL technology and a two-phase clock system achieve a 45-nanosecond cycle time. Microinstructions are processed simultaneously through four logic boxes that implement a five-stage pipeline. A high-speed memory interconnect, the NMI bus, links CPUs to memory and the I/O subsystem, which connects to VAXBI buses. Many reliability features, including extensive diagnostics, are implemented.
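The timing figures quoted above lend themselves to a quick back-of-the-envelope check. The following Python sketch is illustrative only. It takes the 45-nanosecond cycle time and the five-stage pipeline from this overview and the 200-nanosecond VAX-11/780 cycle time from the Foreword. It then assumes an idealized pipeline that accepts one microinstruction per cycle and never stalls; that assumption and the function names are introduced here for illustration and do not come from the papers.

```python
"""Idealized timing arithmetic for a 45 ns, five-stage pipeline."""

CYCLE_NS = 45.0            # cycle time quoted for the VAX 8800 family
VAX_780_CYCLE_NS = 200.0   # cycle time quoted in the Foreword for the VAX-11/780
STAGES = 5                 # pipeline depth quoted in this overview


def clock_mhz(cycle_ns: float) -> float:
    """Convert a cycle time in nanoseconds to a clock rate in MHz."""
    return 1000.0 / cycle_ns


def ideal_pipeline_time_ns(n: int, stages: int = STAGES,
                           cycle_ns: float = CYCLE_NS) -> float:
    """Time for n microinstructions to drain from a pipeline with no stalls.

    Assumption (not from the source): one microinstruction enters per
    cycle, so the first one completes after `stages` cycles and each
    later one completes one cycle after its predecessor.
    """
    if n <= 0:
        return 0.0
    return (stages + n - 1) * cycle_ns


if __name__ == "__main__":
    print(f"clock rate: {clock_mhz(CYCLE_NS):.1f} MHz")
    print(f"cycle-time ratio vs. VAX-11/780: {VAX_780_CYCLE_NS / CYCLE_NS:.2f}")
    print(f"latency of one microinstruction: {ideal_pipeline_time_ns(1):.0f} ns")
    print(f"time for 100 microinstructions: {ideal_pipeline_time_ns(100):.0f} ns")
```

Under these assumptions a single microinstruction takes five cycles to pass through the pipeline, while a long run of them approaches one completion per cycle. This arithmetic says nothing about the sustained applications throughput figures, which depend on the workload and on stalls that the sketch ignores.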

Design work on the VA,'\ 8800 system began in nating design errors in the physical hardware. September 1982 and concentrated on develop­ The complexity of the VAX 8800 design cou­ ing a balanced, high-performance system based pled with the new technologies involved would upon the use of ECL components and multipro­ have created costly delays in the development cessing. Although performance was the primary schedule had traditional approaches been used. product goal, many technology, packaging, and Early in the project. goals were defined to iden­ implementation decisions reflected the equally tify logic design problems and to solve all tim­ pressing business requirements for reliability ing problems through the use of extensive and ease of manufacwring. design verification tools. The flexibility of the design ultimately A hierarchical design and simulation environ­ spawned four CPU systems: the VAX 8800. VAX ment allowed the engineers to move freely 8700, VAX 8550, and VAX 8500 models. These throughout the design at any level from gates, systems share many common fu nctional and layouts, and behavioral models through com­ design attributes yet maintain noticeable imple­ plete system simulation and timing verification. mentation differences in the areas of perfor­ ConsiderJble computing resources were required mance, multiprocessing, expansion capability to allow that freedom. This environment, with

(memory and ljO). and packaging. As a result of its carefu lly managed libraries and databases, these implementation variations. the sustained allowed this work to be done before any hard­ applications throughput (SAT) rates for these ware was actually assembled.1 A.; a result, the systems range from approximately 3.0 to 12 design matured within our VAXcluster systems, times the rate for a VAX-11/780 system. Sus­ evolving ro hardware protOtypes only after it tained applications throughput is more indica­ was essentially complete and stable. In addition tive of usable performance for a given system to the expected savings in prototype costs and a than the more frequently reported peak num­ reduction in overall development rime, the per­ bers that can be derived from ideal or biased vasive use of software tools significantly shifted b conditions. Table I compares the physical and the traditional de ug effort to an earlier point in performance anributes of these four VAX pro­ the design process. Cumulative bug-detection cessor systems. plots were used extensively to provide insight

into the srabi I ity of the design. Design Environment The effect of this shift was ro provide stable, Traditional design environments have placed early prorotypes for extensive system characteri­ the greatest emphasis on discovering and elimi- zation and resting, leading to earlier design

10 Digital Technical journal No. 4 Fe bruary 1987 Products New

Table 1  CPU and Memory Attributes of the VAX 8800 Family

                                    VAX 8500            VAX 8550            VAX 8700            VAX 8800

CPU Attributes

SAT (compared to VAX-11/780)        3.5                 6.0                 6.0                 10.0 to 12.0
Cycle Time                          45 ns               45 ns               45 ns               45 ns
Number of Processors                1                   1                   1                   2
Upgrade Potential                   To 8550             None                To 8800             None
Writable Control Store (Words)      15K                 15K                 15K                 15K in each CPU
User Control Store (Words)          1K                  1K                  1K                  1K in each CPU
Microword Size                      143 Bits            143 Bits            143 Bits            143 Bits
Cache Size                          64KB                64KB                64KB                64KB (in each CPU)
Internal                            32 Bits             32 Bits             32 Bits             32 Bits
Instruction Buffer Type             16 Byte Look Ahead  16 Byte Look Ahead  16 Byte Look Ahead  16 Byte Look Ahead in each CPU
Maximum Total I/O Data Rate         16MB/s              16MB/s              Over 30MB/s         Over 30MB/s
Maximum I/O Channels                2                   2                   4                   4

Memory Attributes

Maximum Physical Memory Size        80MB                80MB                128MB               128MB
Cycle Times:
  Hexword Read (256 bits), min.     495 ns              495 ns              495 ns              495 ns
  Hexword Read (256 bits), max.     1260 ns             1260 ns             1260 ns             1260 ns
  Octaword Write (128 bits), min.   270 ns              270 ns              270 ns              270 ns
  Octaword Write (128 bits), max.   540 ns              540 ns              540 ns              540 ns
  Longword Write (32 bits), min.    135 ns              135 ns              135 ns              135 ns
  Longword Write (32 bits), max.    495 ns              495 ns              495 ns              495 ns

acceptance. This strictly controlled design environment allowed us to complete physical debug along with the required system evaluation and testing in only eight months.

In a software-intensive design environment, the production of actual hardware is deferred somewhat in favor of design stability, resulting in a slightly longer soft-design period. The delay in hardware availability, however, is more than balanced by the stability of the hardware prototypes, which can then be accelerated through the evaluation and qualification-testing phases. The design schedule recovers during these later phases, and substantial cost savings are realized because fewer engineering changes are made and stable manufacturing can begin quickly.

CPU Design Overview

The VAX 8800 family of designs was structured around the functional elements, or "boxes," of the system. The CPU, memory, I/O, and bus subsystems were all matched to provide the necessary system balance. One simple model is to treat performance as a function of two variables:


No. 4 February 1987 An Overview of the Four Systems in the VAX 8800 Family

the instruction execution rate, and the amount of "work" each instruction can perform. The design of the VAX 8800 family focused on what we call the "short tick" approach to achieve the necessary, sustained performance.

In this approach, the instruction and data streams are kept simple and are executed quickly. Any design trade-offs were resolved in favor of speed and simplicity, thus reducing design complexity. The use of high-speed custom and semicustom VLSI components combined with several new internal bus architectures resulted in a family of processors with a 45-nanosecond (ns) cycle time. All models employ a five-stage instruction execution pipeline, integral floating point acceleration (F, D, G, H formats), and the VAXBI bus as the primary I/O subsystem. The extensive use of microcode controls with minimal hardware assist augments current performance while providing flexibility for future enhancements.

The block diagram in Figure 1 (using the VAX 8700 and VAX 8800 systems) illustrates the key functional elements common to the VAX 8800 family design.

Technology

The raw speed, off-chip drive capabilities, and availability of bipolar emitter-coupled logic (ECL) components provided the most straightforward means of achieving the desired performance of the VAX 8800 family. Most logic is implemented in 1200-gate ECL arrays. Custom logic chips designed by Digital provide further performance gains for floating point operations and general-purpose registers. The cache is implemented in 10-ns and 15-ns ECL RAMs. Nine-layer, controlled-impedance CPU logic modules and a 22-layer, controlled-impedance CPU backplane were developed to meet the signal-integrity and signal-propagation requirements crucial to an ECL design. Other multilayer backplanes were designed for the private memory array bus and I/O subsystems.
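The two-variable performance model mentioned in the CPU Design Overview can be made concrete with a small sketch. The 45 ns cycle time comes from the text; the function names and the unitless "work per instruction" factor are illustrative assumptions, not quantities defined in the article.

```python
# Minimal sketch of the two-variable performance model.
# Only CYCLE_TIME_NS comes from the article; everything else is hypothetical.

CYCLE_TIME_NS = 45  # cycle time quoted for all four models


def peak_rate_per_second(cycle_time_ns: float) -> float:
    """Peak execution rate if one microinstruction completes every cycle."""
    return 1e9 / cycle_time_ns


def relative_performance(rate: float, work_per_instruction: float) -> float:
    """Performance modeled as execution rate times work done per instruction."""
    return rate * work_per_instruction


rate = peak_rate_per_second(CYCLE_TIME_NS)
# A "short tick" design raises the first factor: halving the cycle time
# doubles the modeled performance when the work per instruction is unchanged.
faster = peak_rate_per_second(CYCLE_TIME_NS / 2)
```

Under this model a 45 ns cycle gives a peak of roughly 22 million microinstructions per second; sustained throughput will be lower whenever the pipeline stalls or is refilled.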

[Block diagram; labels recovered from the figure: console; ECC memory; VAX processor (standard, VAX 8700); processor (upgrade, VAX 8800); high-speed memory interconnect bus (NMI); bus interface; bus interface (optional); VAXBI I/O buses (standard and optional).]

Figure 1 VAX 8700/8800 Block Diagram


An innovative scheme of bus bars and ribbon straps routes the appropriate power to each of the backplanes, minimizing cable management problems for system power. The eight CPU logic modules, all memory arrays, and all I/O controllers attach to their respective backplanes by means of zero insertion force (ZIF) connectors, which improve our ability to manufacture and service the system. Figure 2 shows the two different module types (CPU and VAXBI) used in the VAX 8800 family.

Figure 2 Typical CPU and I/O Modules

An extensive environmental monitoring subsystem, called the EMM, has been implemented throughout the system. The EMM constantly monitors current fluctuations, air flows, and temperature variations, providing warnings at the system console. The EMM can automatically power down the system in the event that safe operating limits are violated.

CPU Subsystems

The designs of the CPUs in the VAX 8800 family are partitioned along the logical functions performed within each processor. There are four logical boxes: the instruction box (I Box), the cache (C Box), the execution box (E Box), and the memory subsystem (M Box). Each processor contains these functional units and their related buses. Five buses are implemented within each CPU: the cache/ALU bypass bus, the cache data bus, the instruction-buffer data bus, the virtual-address bus, and the write data bus. Figure 3 is a block diagram of the processor configuration.

[Block diagram; labels recovered from the figure: console subsystem interface; visibility bus; I Box; E Box; C Box; IBD bus; cache data bus; high-speed memory interconnect bus (NMI); NBIA adapter (to NBIB adapters); memory controller. Legend: C/A bus = cache/ALU bypass bus; IBD bus = instruction buffer data bus; VA bus = virtual address bus; WD bus = write data bus.]

Figure 3 Processor Block Diagram

A short overview of each functional box follows. Other papers in this issue of the Digital Technical Journal and the VAX Hardware Handbook contain substantially more detail.2
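The EMM behavior described above (continuous monitoring, warnings at the console, automatic power-down when safe limits are violated) amounts to a two-threshold check. The sketch below is only an illustration of that idea: the `Limit` type, the threshold values, and the returned action names are hypothetical, and it covers only quantities for which a higher reading is worse, such as temperature.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    warn: float      # hypothetical threshold for a console warning
    shutdown: float  # hypothetical safe-operating limit


def emm_action(reading: float, limit: Limit) -> str:
    """Return the action for one monitored reading (higher is worse)."""
    if reading >= limit.shutdown:
        return "power_down"  # safe operating limit violated
    if reading >= limit.warn:
        return "warn"        # report at the system console
    return "ok"


# Illustrative temperature limits in degrees Celsius (not from the article).
temperature_limit = Limit(warn=40.0, shutdown=50.0)
```

For example, `emm_action(45.0, temperature_limit)` yields a warning, while a reading of 55.0 yields a power-down request.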


Pipelining the VAX 8800 Family

Pipelining, which functionally involves the E Box, the C Box, and the M Box, is primarily controlled by the I Box. Pipelining is a proven method to improve performance. The incorporation of pipelining, in conjunction with faster microcode instruction execution rates, or cycle times, increases aggregate throughput more than can be achieved by improvements of the cycle time alone. The concept of pipelining is based upon partitioning instruction execution to allow simultaneous operations upon multiple microinstructions. The VAX 8800 family employs a five-stage pipeline. In this design a new microinstruction executes every 45 ns, with five microinstructions executing simultaneously. A simplified schematic of the VAX 8800 family pipeline is represented in Figure 4.

    DNA  CS   R    A    W,C
         DNA  CS   R    A    W,C
              DNA  CS   R    A    W,C
                   DNA  CS   R    A    W,C
                        DNA  CS   R    A    W,C

    DNA - DECODE/NEXT ADDRESS
    CS  - CONTROL STORE LOOK-UP (MICROCODE INSTRUCTION)
    R   - REGISTER READ
    A   - ALU OPERATION
    W,C - REGISTER WRITE, CACHE OPERATION

Figure 4 The Pipeline in the VAX 8800 Family

The I Box

The I Box contains the microcode store and control center and performs five primary functions.

• Buffering the prefetched VAX instruction-stream data received from the cache
• Decoding and controlling the execution of microinstructions
• Monitoring and servicing microtraps, interrupts, and exceptions
• Supplying instruction-stream embedded data
• Interfacing between the console interface module and the processor

For each processor, a writable control store of 16K words by 143 bits is loaded directly from the intelligent console subsystem upon system start. A segment with 1K words by 143 bits, the user-writable control store, is provided for the system user to optimize applications. The logical function of the I Box includes the following:

• The instruction buffer
• The instruction decoder
• The [...]
• The condition code and microbranch logic
• The interrupt and processor-register logic
• The file-address generator

Figure 5 depicts the implementation of the I Box.

The C Box

The C Box for each processor is built around a 64-kilobyte (KB) write-through data cache memory that is physically indexed and direct mapped. Functionally, the C Box provides very high-speed physical memory, high-speed address translations, and a communication path for the processor to the NMI bus. The comparatively large cache size was specifically selected to allow large applications to remain fully resident in the cache, substantially reducing memory traffic and processor wait states. The complete C Box implementation includes a translation buffer, a 64KB cache data store, and an NMI interface. The translation buffer consists of a 1K-entry cache of virtual-to-physical address translations. This translation buffer contains a tag store and a data store organized into 512 process-translation slots and 512 system region-translation slots. Using a portion of the virtual address to compare the tag-store and data-store addresses, the translation buffer concatenates the page frame number with the low-order virtual-address bits to form the physical address for the data store cache.

Data read from the cache data store (a cache "hit") requires no memory request. If the required data is not in the cache data store (a cache "miss"), logic embedded in the NMI interface uses the cache-miss address to spawn a command/address transaction that is sent to the memory subsystem. Upon return, the requested data from memory is passed to the requesting CPU and then placed in the cache data store for subsequent use. This design allows the translation


[Block diagram; labels recovered from the figure: cache data bus; console interface; instruction buffer with write, read, and align logic; instruction buffer manager; instruction buffer data bus; opcode and specifier; bus control watcher; console data/control; instruction decoder; file address; decoder control; gateway; microword mux; writable control store; condition code and microbranch control; microsequencer; interrupt logic (interrupt pending); control outputs to the C Box and E Box.]

Figure 5 I Box Block Diagram

buffer and the cache data store to be free to process other processor requests until the requested data arrives from memory. A block diagram of the C Box is shown in Figure 6.

[Block diagram; labels recovered from the figure: virtual address; microword; translation buffer (tag store, data store); physical address; cache data store; cache refill data; memory interconnect interface; cache data bus; NMI; inputs from the execution box and from the instruction box.]

Figure 6 C Box Block Diagram

The E Box

The E Box receives data from the I Box and the C Box, processes that data, and returns it to the C Box. The E Box performs five primary functions required by the processor.

• Handles all arithmetic, logical and bit-shift operations
• Maintains the program and general registers
• Maintains the processor registers
• Controls data transfers between the C Box, the I Box, and the clock-module registers
• Provides condition-code information to the I Box microsequencer
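The C Box lookup path described earlier (a translation buffer lookup followed by an access to the physically indexed, direct-mapped cache) can be sketched as follows. This is an illustration under stated assumptions, not the hardware design: the 512-byte VAX page size is an architectural fact, but the cache block size, the slot-selection rule for the translation buffer, and all names are hypothetical.

```python
PAGE_BITS = 9             # VAX pages are 512 bytes
TB_HALF = 512             # 512 process slots + 512 system slots (from the text)
CACHE_BYTES = 64 * 1024   # 64KB cache (from the text)
BLOCK_BYTES = 64          # assumption: block size is not given in the article
CACHE_LINES = CACHE_BYTES // BLOCK_BYTES


class TranslationBuffer:
    """Direct-mapped tag store and data store for address translations."""

    def __init__(self):
        self.tags = [None] * (2 * TB_HALF)
        self.pfns = [0] * (2 * TB_HALF)

    def _slot(self, va):
        vpn = va >> PAGE_BITS
        system = (va >> 31) & 1  # assumption: bit 31 selects the system half
        return system * TB_HALF + (vpn % TB_HALF), vpn

    def fill(self, va, pfn):
        slot, vpn = self._slot(va)
        self.tags[slot] = vpn
        self.pfns[slot] = pfn

    def translate(self, va):
        """Return the physical address, or None on a translation buffer miss."""
        slot, vpn = self._slot(va)
        if self.tags[slot] != vpn:
            return None
        offset = va & ((1 << PAGE_BITS) - 1)
        # Concatenate the page frame number with the low-order address bits.
        return (self.pfns[slot] << PAGE_BITS) | offset


def cache_index_and_tag(pa):
    """Split a physical address into a direct-mapped line index and a tag."""
    block = pa // BLOCK_BYTES
    return block % CACHE_LINES, block // CACHE_LINES
```

Because the cache is direct mapped, two physical addresses that differ by a multiple of 64KB map to the same line and are distinguished only by their tags.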


[Block diagram; labels recovered from the figure: write data bus (to C Box); instruction buffer data bus (from I Box); cache data bus (from C Box); virtual address bus; latch; slow data file; register file; program counter; arithmetic and logic unit; parity check; floating point; multiplier; shifter; cache/ALU bypass bus.]

Figure 7 E Box Block Diagram

The major elements of the E Box, located physically on the data-slice modules and the shifter module, consist of a register file, a data file, program-counter logic, the main ALU, and a shifter. The logic of the E Box includes integral floating point operations that are optimized and a 64-bit multiplier (implemented in custom-designed VLSI chips) that augments the speed of both integer and floating point multiplication. Figure 7 is a block diagram of the E Box.

The M Box

The M Box, the memory subsystem, consists of the memory control logic, memory arrays, and a dedicated memory array bus that provides a usable data rate of over 50MB per second to the memory subsystem. The control logic optimizes multiple memory read and write operations, implements three-way interleaving, and buffers memory transactions for optimum data movement. The dedicated memory array bus, coupled

with the memory control logic, effectively offloads the NMI bus, providing balanced bus access and loads. The interleaving algorithms are based upon array boundaries, making the memory control logic technology independent. The result is that as increasingly dense memory arrays become available, few if any controller modifications will be required.

The error checking and correction (ECC) is built around 7 check bits for every 32 bits of data. This protocol provides automatic single-bit correction and double-bit detection.

In the VAX 8800 multiprocessor, all memory is fully sharable. Current systems in the VAX 8800 family are offered with 16MB per memory array, giving the VAX 8700 and VAX 8800 systems a maximum memory capacity of 128MB, and the VAX 8500 and VAX 8550 systems a maximum of 80MB. Figure 8 is a block diagram of the M Box.

[Block diagram; labels recovered from the figure: instruction box; high-speed memory interconnect bus (NMI); power subsystem; memory control.]

Figure 8 M Box Block Diagram

The Clock Subsystem

The clock subsystem generates, controls, and distributes timing signals to all the components of the processor system. The clock subsystem contains the console interface, an oscillator, a phase generator, clock-control logic circuits, and the logic circuits for distribution.

The VAX 8800 family implements a two-phase, nonoverlapped clock subsystem operating at a cycle time of 45 ns. A stable, high-frequency oscillator (120 MHz nominal with variable output), coupled with a phase generator, provides the signal. The implementation of a two-phase design with matched signal-length distribution throughout the CPU is most efficient for the pipelined, latch-based design of the VAX 8800 family. This design avoids the inefficiencies associated with the compressed signal-assertion times resulting from approaches that specify minimum delays for given logic elements. A-clock and B-clock signals are distributed to alternate latches in a given logic stream. All data transfers occur between latches clocked by different phases to assure a race-free design.

The essence of fast-[...] is managing and controlling skew. In this regard, signal propagation and distribution presented significant challenges in the areas of controlled etch lengths, controlled impedance, routing, and placement. To assure a stable, reliable design, all design activity was predicated on worst-case design rules rather than using the typical-case limits.

The NMI Bus

Integral to the design of this family of processors was the development of a high-speed memory interconnect bus called the NMI bus. This bus, analogous to the synchronous backplane interconnect (SBI bus) in the VAX-11/780 CPU, links the subsystems for CPU logic, central memory, and I/O. The NMI bus is a 32-bit synchronous bus, physically implemented within the 22-layer backplane. This bus provides the control and datapath functions as well as the distribution of clock signals for the VAX 8800 family.

One fundamental problem in the design of high-performance systems revolves around balancing the bus access needed at any given instant with the raw bandwidth available. To provide the correct balance, the NMI bus was implemented as a pended (vs. interlocked) bus, resulting in very high bus-access availability.
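The figure of 7 check bits for every 32 data bits quoted for the memory ECC above matches the minimum needed for single-bit correction with double-bit detection. The sketch below derives that count from the Hamming bound; the article does not say which code the hardware uses, so treating it as a Hamming code extended with an overall parity bit is an assumption, and the function names are illustrative.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Add one overall parity bit to also detect double-bit errors."""
    return sec_check_bits(data_bits) + 1


check_bits_for_longword = secded_check_bits(32)
```

For 32 data bits, six check bits suffice for correction (2^6 = 64 >= 32 + 6 + 1), and the extra parity bit brings the total to seven, so each stored longword occupies 39 bits under this assumption.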


Since memory is the critical resource in sustained operations, the NMI bus uses a modified round-robin arbitration that gives the memory a higher priority when there is contention for the bus. This arbitration priority eliminates any lock-step conditions and also provides for recovery of states and data in the event of preemption. This high bus-access capability, coupled with usable data rates of up to 60MB per second, provides the necessary balance to support CPU, memory, and I/O transactions. The inclusion of write buffers within each CPU, coupled with the large cache size, effectively reduces the number of transactions presented to the bus. Measurements on a VAX 8800 system in our Engineering VAXcluster environment have indicated that the NMI bus is rarely busy more than 50 percent of the time; the CPUs use approximately 25 percent of the available access time and bandwidth. Other applications may see somewhat different ratios.

VAXBI Bus

The VAX 8800 family uses the VAX bus interconnect, called the VAXBI bus, for the I/O subsystem in order to provide adequate balance for the CPU performance. The VAXBI bus, a 32-bit clocked bus with distributed arbitration, is capable of usable data rates in the VAX 8800 family up to 8MB per second, depending upon word size and application. Custom logic on each interface module provides all bus protocols, as well as integral data-integrity features, including master transmit and command acknowledge.

The VAX 8800 and VAX 8700 systems can be configured with up to four VAXBI channels, whereas the VAX 8550 and VAX 8500 systems accept up to two. Therefore, fully configured VAX 8800 and VAX 8700 systems can support aggregate I/O bandwidths up to 30MB per second. Similarly, fully configured VAX 8550 and VAX 8500 systems can support aggregate bandwidths up to 16MB per second. Each VAXBI bus can support up to 16 nodes, or logical addresses, which connect to any combination of networks, intelligent and nonintelligent devices, DMA devices, and VAXcluster systems, as well as providing for connection to existing UNIBUS-based devices.

All of Digital's network protocols interface directly to the VAXBI on the VAX 8800 family. Thus, VAXcluster, DECnet, and DSA (Digital Storage Architecture) devices are all ported directly to this high-performance I/O subsystem.

Reliability

Reliability was one of the primary goals of the VAX 8800 design. Numerous features were implemented that more than doubled the basic computing kernel availability compared to the VAX-11/780 system. Some of the key functions include

• Environmental and power monitors that query the system and maintain safe system operating levels
• Automatic verification of hardware, firmware, and software revision compatibility
• Electrically keyed modules and module slots that prevent improper installation and damage to the modules or the system
• Automatic electrostatic discharge (ESD) protection of modules during installation and removal
• ECC on main memory
• Parity checking on internal RAMs
• Bus protocol checking for the memory interconnect
• Timing and voltage margining
• Remote diagnostics capability
• Dual-to-single processor reconfiguration (VAX 8800 system only)

Diagnostic Development

Similar to the hardware development, the design methodology for the diagnostics depended very heavily on simulation. Almost all the diagnostic tests were debugged on behavioral and structural models of the design before the initial prototype was powered up. There were three major benefits of this methodology.

1. Microdiagnostic and macrodiagnostic tests were useful for design verification testing.
2. Test vectors for automatic test equipment (module test) were extracted from the simulation database.
3. A comprehensive diagnostic package was available shortly after the prototype was powered up.
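Returning to the NMI arbitration described at the start of this section, the idea of a round-robin scheme modified to favor memory can be sketched in a few lines. The article gives no details of the actual arbiter, so the class below, its requester names, and the rule that memory always wins contention are hypothetical illustrations only.

```python
class ModifiedRoundRobinArbiter:
    """Round-robin among requesters, except that one requester has priority."""

    def __init__(self, requesters, priority="memory"):
        self.priority = priority
        self.order = [r for r in requesters if r != priority]
        self.next_index = 0  # where the round-robin scan starts next time

    def grant(self, requests):
        """Return the requester granted the bus this cycle, or None."""
        if self.priority in requests:
            return self.priority  # memory wins any contention
        n = len(self.order)
        for step in range(n):
            i = (self.next_index + step) % n
            if self.order[i] in requests:
                self.next_index = (i + 1) % n  # rotate past the winner
                return self.order[i]
        return None


arbiter = ModifiedRoundRobinArbiter(["cpu0", "cpu1", "io", "memory"])
grants = [
    arbiter.grant({"cpu0", "cpu1"}),
    arbiter.grant({"cpu0", "cpu1"}),
    arbiter.grant({"cpu0", "memory"}),
    arbiter.grant({"cpu0", "cpu1"}),
]
```

In this sketch a grant to memory does not advance the round-robin pointer, so the other requesters keep their place in the rotation; that is a design choice of the example rather than something the article states.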


The diagnostic package for the VAX 8800 family consists of tests specific to this processor and generic to the VAX architecture. The processor is tested primarily with microdiagnostics. These tests execute from the processor's writable control store and are governed by the console.

VAX generic diagnostics are included to test the UNIBUS and VAXBI adapters and options. All the diagnostic code fits on the console's Winchester disk. When the system is powered up, a subset of the microdiagnostic tests is executed.

Balanced Systems

The VAX 8800 design effort delivered four different systems, the 8800, the 8700, the 8550, and the 8500, all reflecting the overriding concept of balanced system design. While the CPUs themselves demonstrate excellent internal balance between their logical and functional subsystems, they are also balanced members of the extended system that can span much larger physical distances. Monolithic or isolated computing resources are no longer capable of accessing, manipulating, and distributing the volumes of information needed for complex or extended solutions. In this light, the VAX 8800 family should be viewed in the context of a balanced network. The movement of data is governed by speed and distance. An inverse relationship exists as shown in Figure 9. The VAX 8800 family fits on the top bound of the bandwidth range throughout the distance function.

[Plot of bandwidth against distance in meters (log scale, 10 to 1000), with technology ranging from complex to simple.]

Figure 9 Bandwidth versus Distance

Summary

The VAX 8800 family of products merges fast instruction-execution rates, large physical memories, large high-speed data caches, VAXBI I/O channels, pipelining, and balanced internal-bus architectures to provide high system-applications throughput. Spanning an applications throughput range that is from 3 to 12 times that of the VAX-11/780 system, the VAX 8500, VAX 8550, VAX 8700, and VAX 8800 systems are matched to the network and applications strategies offered by Digital Equipment Corporation.

References

1. D. Bak, "The Impact of VAX 8800 Design Methodology on CAD Development," Digital Technical Journal (February 1987, this issue): 129-135.

2. VAX Hardware Handbook (Maynard: Digital Equipment Corporation, Order No. EB-21710-20, 1982).

Sudhindra N. Mishra

The VAX 8800 Microarchitecture

The VAX 8800 processor has a simple but efficient microarchitecture. Its pipelined micromachine has a one-cycle next-address loop and four-cycle latencies for both microbranches and microtraps. Instruction prefetch and decode are done in parallel with microcode execution. The instruction buffer is a bit-sliced, four-longword circular queue. The decoder is primarily a RAM-based table. For special events, hardwired logic is used for decoding. A bit-sliced microsequencer provides up to 32-way conditional microbranching, using a collection of about 80 branch conditions. A hardware microstack provides up to 15 levels of nested subroutine calls and returns. Microtrap conditions are prioritized over 16 levels, and microtraps are chained, not nested.

The term "microarchitecture" means the specification or description of the interrelationships between the parts of the micromachine that implements the instruction set processor. In terms of this definition, the microarchitecture of the VAX 8800 processor will be described by elucidating the organization of its micromachine and the interaction between its components.

Figure 1 shows a simple three-stage state-machine model of an abstract micromachine appropriate for implementing the instruction set of a typical von Neumann processor. Figure 2 shows a block diagram depicting the essential elements of such a micromachine. This state machine is capable of executing microcode routines to implement an instruction set processor. In such a system, every macroinstruction is decoded by the hardware to produce the starting addresses of a small set of microprograms, which execute sequentially to produce the desired effect. Barring some exceptions, a microprogram or microcode routine can execute rather independently in the sense that each microinstruction produces the address of the next microinstruction. The last microinstruction causes the selection of an external address, such as one produced by the decoder, to start the execution of another routine.

In Digital's vernacular, the I Box is the logical partition containing the instruction-processing hardware. Figure 3 shows a block diagram of the VAX 8800 I Box with the basic elements of its micromachine.

[State diagram; states recovered from the figure: fetch microinstruction; interpret microinstruction.]

Figure 1 State-machine Model of an Abstract Micromachine

From the early IBM and CDC computers to the modern CRAY machines, computer designers have used a technique called "pipelining" to obtain higher performance. Pipelining overlaps the execution of instructions in time; that is, several instructions can be executing at the same time. This technique provides a higher throughput when the pipeline is fully loaded, but there is a cost involved. If the pipeline is broken, extra processing is required to refill it. Moreover, if any active instructions have partially executed, information about their states may have to be saved to continue processing after an abrupt interruption.

The degree of pipelining varies from one machine to another depending upon the design choices and trade-offs made by the system architects. A metaphor often used to indicate the degree of pipelining is the length of the pipeline


[Block diagram; labels recovered from the figure: external addresses and controls; microaddress generation logic; microaddress latch or register; control store; microdata latch or register; microdata interpretation logic; control signals.]

Figure 2 Block Diagram of an Abstract Micromachine

stated as the number of stages, for example, a three-stage pipeline or a four-stage pipeline. The number of stages conveys the extent of time overlap for typical operations in a computer. In a machine with a pipelined microarchitecture, these operations are executions of microinstructions. A higher degree of pipelining makes short cycle times possible, thus leading to a higher throughput when the pipeline is fully loaded. But longer pipelines entail increased overhead in terms of their ability to resume operations after a break in the pipeline caused by any abnormal event. Therefore, an architect's goal is to design the system so that the pipeline remains loaded most of the time and recovery from a broken pipeline is not too inefficient. The VAX 8800 CPU is a prime example of a processor with a pipelined microarchitecture.

[Block diagram; labels recovered from the figure: cache; IB data; instruction buffer; branch conditions, traps, interrupts; specifier control; specifier number; decoder; PC increment (to E Box); microsequencer; decoder control; microsequencer control; control store; microword; control to E Box; control to C Box.]

Figure 3 VAX 8800 I Box

System Considerations

The design philosophy of the VAX 8800 processor was to optimize the hardware so that it would execute the microcode efficiently. A large control store (144 bits by 16,000 entries) holds the entire microcode. Using fairly generalized datapaths, the microcode executes the logic of the instructions. However, special hardware is used to speed up performance in critical areas. The processor logic is primarily designed with latches, which are clocked with a globally distributed, two-phase, nonoverlapping clocking scheme. The two clock phases are called the A-clock and the B-clock. A typical example of logic design, based on the above approach, is shown in Figure 4.

[Logic diagram of latches separated by combinatorial logic, ending at an output. Legend: CL = combinatorial logic.]

Figure 4 A Typical Section of the VAX 8800
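The pipelining trade-off discussed above (high throughput while the pipeline stays loaded, extra cycles whenever it must be refilled) can be illustrated with a simple timing model. The five stage names are those shown in the pipeline figure of the overview article in this issue, and the 45 ns cycle is the published cycle time; the cost model, in which every break forces a complete refill, and all function names are assumptions made for this sketch.

```python
STAGES = ["DNA", "CS", "R", "A", "W,C"]  # five pipeline stages
CYCLE_NS = 45


def stage_at(instr: int, cycle: int):
    """Stage occupied by microinstruction `instr` at `cycle` (None if idle).

    Microinstruction i is assumed to enter the pipeline at cycle i.
    """
    k = cycle - instr
    return STAGES[k] if 0 <= k < len(STAGES) else None


def total_cycles(n_micro: int, stages: int = len(STAGES), refills: int = 0) -> int:
    """Cycles to finish n_micro microinstructions.

    The first result appears after `stages` cycles, then one per cycle.
    Assumption: each break in the pipeline costs a full refill of
    (stages - 1) extra cycles.
    """
    if n_micro == 0:
        return 0
    return n_micro + (stages - 1) * (1 + refills)


def elapsed_ns(n_micro: int, refills: int = 0) -> int:
    return total_cycles(n_micro, refills=refills) * CYCLE_NS
```

With these assumptions, 100 microinstructions take 104 cycles when the pipeline is never broken and 112 cycles with two refills, which shows why keeping the pipeline loaded matters more as the number of stages grows.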


It is apparent from Figure 4 that the data flow in such a logic system occurs through the perpetual data transfers between the latches connected to the A-clock and those connected to the B-clock. Each data transfer may be considered atomic in the sense of hardware operation. A microoperation may be envisioned as a logical operation that is atomic in terms of the execution of a microinstruction, such as a register read, a register write or an ALU function. Hence a microoperation constitutes one or more data transfers, and the microinstruction execution simply constitutes a time sequence of microoperations, as shown in Figure 5.

[Timing diagram: clock phases A, B, A, B along the time axis; microoperations in sequence: read registers; ALU function (ADD); store result in register.]

Figure 5 Example of a Microinstruction

In high-performance machines, like those in the VAX family, there is usually a mismatch between CPU cycle times and memory-access times. For example, consider an ADD instruction. If the operands are in registers, the ADD can be done rather quickly. But if one of the operands has to be read out of memory, the ADD cannot be performed until the desired data arrives from memory. Most VAX processors have a fast cache memory, tightly bound to the processor's arithmetic units, to alleviate the memory-latency problem. In the case of a cache miss on a required datum, however, the only alternative for a von Neumann processor is to wait. A processor in such a state is said to be "stalled." Under such conditions, the state of the processor must be "frozen" until the cause of the stall no longer persists and the stall is broken. The two-phase clocking scheme provides a convenient way to implement stalls, in which one of the clock phases (the A-clock in the 8800) may be blocked. Stalls are controlled by the cache through a special hardware signal distributed globally to block the A-clock. Thus, the processor logic contains two flavors of A-latches:

• Stalled A-latches, which are affected by a stall
• Unstalled A-latches, which are not affected by a stall

The micromachine is implemented only with stalled A-latches. Hence the effect of stalls on the execution of the micromachine is largely transparent.

A mechanism is also required to deal with hardware exceptions when the results of the execution of a microinstruction have to be undone. In a pipelined microarchitecture, several microinstructions may have partially executed when an exception condition is detected. In that case it is necessary to undo the effects of all those microinstructions. The most common technique used to deal with such situations is called a microtrap. Since microtraps relate closely to the micromachine execution, every processor has its own scheme to implement them. In every case, however, microtraps must permit the "roll back" of some number of microinstructions because the detection of a trap condition usually occurs quite late with respect to microinstruction execution.

In the VAX 8800 processor, microtraps are implemented so that the offending microinstruction is allowed to complete, but subsequent microinstructions in the pipeline are blocked. Since the offending microinstruction may have caused some undesirable results, the trap-handler microcode must fix the problem. Depending on the particular situation, either the microinstruction execution flow is resumed from the blocked state or a new flow is originated.

System Buses and Datapath

Figure 6 is a block diagram of the VAX 8800 CPU datapath, showing all the major buses. The hardware organization of the CPU provides a two-cycle operation between the cache and the ALU, as shown. The processor has several functional units in addition to the main ALU. These additional units perform high-speed multiply and divide, shifting, and floating-point arithmetic operations.

There are several possibilities for selecting inputs to these functional units. For operations involving two inputs, both can be presented simultaneously onto the two legs of the main ALU as well as most other functional units. The results from these functional units are sent on the W bus for writing to either the multiport


[Figure 6: block diagram of the CPU datapath. Labeled elements include the bypass bus, virtual address bus, backup PC, PC mux, PC increment, multiplier and divider, TB, cache, C bus, A-port and B-port muxes, cache data bypass bus, W bus, shifter, ALU, exponent, shift count, IB, MPR file, write buffer, microdata, and MD bus. A and B denote the A and B phases of the two-phase clock.]

Figure 6 VAX 8800 Datapath

register file (MPR) or the cache. However, since the write actually occurs in the following cycle, the bypass bus provides a shortcut (saving a cycle) in case the write datum is read by the very next microinstruction.

The virtual address bus carries the virtual address of any data-stream (d-stream) references, whereas the program-counter bus has the current program counter (PC). The instruction-buffer data bus provides the instruction-stream (i-stream) data. The instructions and data from the cache are returned on the cache data bus. However, a cache data bypass bus provides a direct path to the functional units for the data returned by the cache, in case the processor is or will be stalled for that data.

The Microinstruction Pipeline

The top part of Figure 7 shows the execution of microinstructions as a function of time in a nonpipelined microarchitecture; the bottom depicts that in a pipelined microarchitecture.

The basic data flow in a processor occurs in the following sequence:

1. Read the register operands into a functional unit, such as the ALU.

2. Perform some ALU function.

3. Write the results into the destination register.

4. If there is a cache, start a cache operation at approximately the same time as a register write since memory references are buffered through special-purpose memory data registers (MDRs or MDs) in most high-performance processors.

[Figure 7: timing of microinstructions 1 through 3 in a nonpipelined micromachine (top) and of microinstructions 1 through 4 in a pipelined micromachine (bottom), plotted against A and B clock phases.]

Figure 7 Microinstruction Execution

Figure 5 shows that the sequence above occurs in a natural order in time as a consequence of the microinstruction execution. With pipelined microarchitectures, a time reference is needed to correlate the microoperations performed by various microinstructions with respect to each other. The notion of canonical times is very convenient for this purpose. The clock ticks of the reference microinstruction may be labeled with a monotonically increasing set of T numbers starting at T0, as shown in Figure 8. These T numbers are called the canonical times of a particular microinstruction. The microoperation labeled T0 marks the start of a microinstruction execution cycle. Figure 8 shows the basic microoperations of a VAX 8800 microinstruction with their canonical times.

[Figure 8: canonical times marked against A and B clock phases for the decoder operation, the register read, the ALU operation, the register write, the cache operation, and the cache miss action.]

Figure 8 Canonical Times of a VAX 8800 Microinstruction

We shall use the simple model of a micromachine in Figure 1 to describe the VAX 8800 microinstruction format as a sequence of basic microoperations like those in Figure 8. The first stage in the microinstruction execution cycle is the microaddress fetch. The microinstruction execution cycle begins with a decoder operation. The decoder produces the starting microaddress for every new microinstruction sequence and presents it to the microsequencer. The decoder determines that address on the basis of the contents and current state of the instruction buffer (IB). Each microinstruction specifies to the microsequencer whether or not to accept the decoder's microaddress. If not, the microinstruction must either specify the address of the next microinstruction directly, as a part of the microword, or indicate an alternate source for the address within the microsequencer. Since the decoder's operation is concurrent with the microsequencer's, the decoder always has a starting microaddress for the microsequencer. It is convenient to think of this IB-decoder concurrency as a "hidden decoder cycle."

The next stage in the microinstruction execution sequence is the fetch of the microinstruction, performed by a look-up in the control store. In the VAX 8800 system, the microaddress is pipelined, not the microdata. Consequently, the microdata from a segmented control store appears at the appropriate time for the three basic operations to occur in the indicated order.

The microdata looked up causes a sequence in which the register read occurs between the times T5 and T6, the ALU function between T6 and T8, and the register write between T8 and T10. The cache operations also occur between the times T8 and T10. The section beyond T10 denotes cache activity with respect to the memory if there is a cache miss. (The cache/memory interface is controlled by an independent micromachine.) During every cycle, a microinstruction produces the address of the next microinstruction, which is then executed. Figure 9 depicts the generic microinstruction pipeline of the VAX 8800 processor.
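The idea of canonical times can be made concrete with a small sketch. It is only an illustration, not the 8800's implementation: the function and dictionary names are invented here, the two windows are the ones quoted in the text, and the sketch assumes that each microinstruction starts exactly one cycle (one A phase plus one B phase, that is, two T ticks) after its predecessor.

```python
# Illustrative model only; all names here are hypothetical.
# Canonical times are relative to one microinstruction. To compare several
# microinstructions, shift each one by the tick at which it started.
TICKS_PER_CYCLE = 2  # assumption: one A phase plus one B phase per cycle

# Windows quoted in the text, expressed in canonical T numbers.
CANONICAL_WINDOWS = {
    "register read": (5, 6),
    "register write / cache operation": (8, 10),
}


def absolute_window(operation, issue_cycle):
    """Return (start, end) in ticks of the reference microinstruction for an
    operation of the microinstruction issued `issue_cycle` cycles later."""
    start, end = CANONICAL_WINDOWS[operation]
    shift = issue_cycle * TICKS_PER_CYCLE
    return (start + shift, end + shift)


def write_windows(count):
    """Register-write windows of `count` back-to-back microinstructions."""
    return [absolute_window("register write / cache operation", i)
            for i in range(count)]
```

Under these assumptions, `write_windows(3)` places the register writes of three consecutive microinstructions in adjacent, non-overlapping windows, which is the property the pipeline relies on.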

[Figure 9: microinstructions A through E shown staggered against clock cycles 0 to 15, each passing through the stages listed below.]

DECODER - Decoder operation
LUK - Control store look-up (control store segment 0)
XOS - Board crossing segment (overlaps control store segment 1 look-up)
RD - Register read (overlaps control store segment 2 look-up)
ALU - ALU function
WR - Register write
CACH - Cache operation

Figure 9 Microinstruction Pipeline of the VAX 8800 CPU


Microbranch Latency

One consequence of pipelining is that intervening microinstructions must be spaced between the instruction that produces a branch condition and the instruction that can branch on it, because of the latency in the development of the branch condition. Obviously, the execution of the intervening microinstructions must be independent of the branch. Usually, microcoders are able to code some useful operations during the inevitable wait. Otherwise, the intervening instructions must be NOPs (no operation). Figure 10 shows the microbranch latency in the VAX 8800 CPU.

Microtrap Latency

A hardware exception causes a microtrap. However, the trap conditions, like the branch conditions, may develop after some execution cycles have been completed. Once again there must be some intervening microinstructions between the trap-causing microinstruction and the trap-handling routine. Moreover, the state of the micromachine must be saved so that the current execution can be resumed in such a way that the intervening execution of the trap routine appears to be transparent. This state consists primarily of microbranch conditions that result from the execution of microinstructions in the pipeline since those could influence subsequent microaddresses and hence the execution sequence. Therefore, on interruption of the current sequence by the trap routine, the branch conditions from the earlier execution are essential to reproduce the same sequence.

To simplify the hardware design, all early traps are delayed to a fixed canonical time (T10). Some trap conditions, however, develop later than the canonical time, with the consequence that those traps cannot be returned from. In such cases the microcode must roll back the state to the beginning, which causes a reexecution of the entire macroinstruction.

Figure 11 shows a sequence in which a microinstruction at address T provokes a microtrap. At the earliest, the trap-handling routine can begin at microinstruction X. Meanwhile, microinstructions U, V, and W follow T, quite unaware of the impending trap. In fact, they are in partial execution when the trap condition is detected. These microinstructions are said to be in the trap shadow, and they must be blocked from writing any registers, thus making it appear as if they had never executed. When control is returned from the trap-handling routine, these trap shadow microinstructions are reexecuted, continuing the sequence that would have arisen had the trap not occurred.

Instruction Buffer and Decoder

The IB buffers the prefetched VAX i-stream delivered by the cache and in turn delivers the opcode and specifier to the decoder. The IB also delivers the i-stream data to the execution unit, the E Box. The decoder expects to receive the current opcode and the current specifier byte.

[Figure 10: microinstruction C generates the branch condition; D and E are potential NOPs; F is the branch microinstruction; G is the target of the conditional microbranch. Stages are plotted against clock cycles 0 to 15.]

Figure 10 Microbranch Latency
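The microbranch latency rule can be sketched as a padding step a microcoder (or a microassembler) might apply. This is a hypothetical helper, not a tool described in the paper; the latency of two intervening slots is an assumption read from Figure 10, where D and E sit between the producer C and the branch microinstruction F.

```python
# Illustrative sketch; the helper and constant names are hypothetical.
NOP = "NOP"
BRANCH_LATENCY = 2  # assumption from Figure 10: slots D and E


def pad_for_branch(producer, independent_ops, branch):
    """Order microinstructions so that at least BRANCH_LATENCY of them sit
    between the producer of a branch condition and the microinstruction that
    branches on it. Useful branch-independent work is placed first; any slot
    left over is filled with a NOP."""
    filler = list(independent_ops)
    filler += [NOP] * max(0, BRANCH_LATENCY - len(filler))
    return [producer, *filler, branch]
```

For example, with one useful independent operation available, `pad_for_branch("C", ["D"], "F")` yields `["C", "D", "NOP", "F"]`, so only one slot is wasted.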


[Figure 11: microinstruction T causes a microtrap; the microinstructions that follow it in the pipeline (U, V, and W) lie in the trap shadow. Stages are plotted against clock cycles 0 to 15.]

Figure 11 Microtrap Latency

Hence the IB saves the opcode for the duration of the instruction execution and shifts the buffered i-stream along to send each specifier in turn to the decoder. The goal of the VAX 8800 decoder is to produce a starting microaddress corresponding to the opcode and the specifiers. The sequence of microcode execution caused by the decoder is first to process all the specifiers, making all the operands available, and then to execute the operation specified by the opcode. If an instruction has no specifiers, the execution microcode is initiated directly. In any case the decoder always has a microaddress ahead of time for the microsequencer. This microaddress is the starting address of either a specifier routine or the execution routine, based on the contents and the state of the IB.

If at any time the IB does not contain enough i-stream data for a successful decode, the decoder will produce a special microaddress. The microinstruction at that address is simply a NOP that again requests the selection of the decoder's address. The micromachine thus waits in a loop for sufficient i-stream data to arrive in the IB so that the decoder can again dispatch a useful microaddress. This wait-loop state of the micromachine is commonly referred to as the IB stall, which is different from the stall described earlier. Note that clocks to stalled A-latches are not blocked for an IB stall. On the contrary, the micromachine runs normally, as does the rest of the processor hardware. IB stalls may occur when the instruction prefetch pipeline is broken due to macroinstruction branches. This condition requires the current contents of the IB to be discarded and new i-stream data to be prefetched into the IB.

The VAX 8800 IB is a four-longword circular queue, which is usually long enough to hold an entire instruction. The data is consumed out of the IB from the position pointed to by the read pointer. However, new data could be written concurrently by the cache at the position pointed to by the write pointer. Whenever it has room, the IB is loaded by the cache if the cache has no other higher priority job to do. Occasionally, the IB becomes full (the write pointer catches up with the read pointer), and then it does not accept the datum from the cache. If a datum is not accepted by the IB, the cache keeps repeating the transfer until the datum is accepted. Occasionally, the IB becomes empty if the cache is busy doing other things and the decoder has consumed all the data from the IB (the read pointer and the write pointer point to the same location).

The IB in the VAX 8800 family is implemented with four identical gate arrays with 8-bit slices designed to use a rather clever bit-scattering/gathering scheme. The IB also contains logic to extract and format i-stream data, making it available to the E Box. A common silo holds the opcode history for the duration of a macroinstruction's execution, as well as for recovery from microtraps. The VAX 8800 decoder is a RAM-based look-up table for generating microaddresses. In the case of special events, however, hardware logic is provided for generating special microaddresses, as shown in Figure 12, thus bypassing the RAM look-up. The decoder also provides controls for the IB state-machine as well as some other hardware assists.

[Figure 12: block diagram. Labeled elements include the opcode, the specifier bits and state, the opcode RAM, the special address encoder (fed by the "things that make special addresses"), the NOP special microaddress, microaddress enable, the output microaddress, state control, specifier-related assists, data format, and state flags.]

Figure 12 VAX 8800 Decoder

Microsequencer

The state machine responsible for generating the next microaddress for a microinstruction sequence is commonly called the microsequencer. As shown in Figure 13, this state-machine is realized collectively by the control store, the next microaddress generation logic, and the microaddress and microdata latches (or registers).

[Figure 13: block diagram. The next microaddress generation logic contains the microtrap logic (fed by microtrap conditions, producing a trap address) and the microbranching and address selection logic (fed by external addresses and microbranch conditions). It drives the microaddress latch or register, the control store, and the microdata latch or register, which returns the next address and the address selection controls. External controls also enter the logic.]

Figure 13 An Abstract Microsequencer

The goal of the VAX 8800 microsequencer is to produce the address of the next microinstruction during every cycle. Figure 14 depicts how the microsequencer achieves this goal.

Each microinstruction may modify its next-microaddress field through a microbranch command to produce the address of the target microinstruction. Microbranch conditions are delivered by other sections of the machine, such as the ALU. These conditions are grouped together in ways convenient for microprogramming so that multiway branches can be taken. Microsubroutines can be called and returned from by means of a hardware microPC stack. Stalls cause the microsequencer state to be frozen on a cycle boundary (i.e., the clocks on microaddress and microdata latches are effectively blocked). Microtraps allow the microcode to deal with unusual events that would be too slow or inconvenient to check normally with microbranches, such as TB misses and address misalignments. The VAX 8800 processor does not permit traps to be nested. Instead, traps are "chained," meaning that trap routines and hardware trap priorities are carefully arranged so that a second trap is taken only when the first trap routine finishes. (Machine check traps cannot be controlled in this way.)

Sources of Microaddresses

There are five sources for microaddresses:

• The decoder

• The next-address field in the microword

• The microstack upon returning from a subroutine

• The microPC silo for a saved microtrap

• The micromatch register for an address from the console

An address from the console is selected in response to an explicit console request and takes precedence over everything else. Addresses from the silo are requeued in response to a trap-return command. Addresses from the microstack are selected in response to a subroutine-return command. A decoder-generated address is selected whenever the current sequence ends and a new specifier or execution routine should begin. Normally, this selection is caused by the assertion of a microword bit in the very last microinstruction of the current sequence. The next-address field is selected as the default for normal sequencing. This field is also used to provide an offset in case of subroutine returns.

Microbranching

In normal cases, that is, whenever the next-address field is selected, part of the selected microaddress can be modified according to the branch conditions. A combination of two microword fields, branch type and branch mask, selects the branch conditions, which are then ORed into part of the target microaddress. In the VAX 8800 system, the microbranch logic is implemented with five identical gate arrays, each of which generates a 3-bit slice of the microaddress. One microaddress bit is branch sensitive in each slice. This organization permits up to 32-way branching. Branchings of 2, 4, 8, and 16 ways are also made possible by a separate mask bit, called the branch mask, to every slice. This bit is used to turn off the sensitivity to branch conditions in a particular slice.

There are 16 basic recipes for conditional branching in each slice. This arrangement of slicing, masking, and branch-condition selection in every slice requires that all the microbranch conditions be organized into 5 groups of 16 conditions each. The branch conditions are classified as either static or dynamic. Static conditions, once captured, are available for branching in any later cycle as long as those conditions remain unchanged. Dynamic conditions are asserted for just one cycle and must be branched on in that cycle.

Some special trap-related branch conditions are saved at the time of the trap so that the trap routine may use them. For speed reasons, the basic hardware mechanism for multiway branching is that the selected condition is ORed rather than added to the branch-sensitive microaddress bit. The OR implies that the branch-sensitive bits of a microaddress must be "zeros" by convention. If branching is masked in any slice, however, only unmasked branch-sensitive bits need to be zeros. Thus the branch-masking scheme leads to a substantial increase in the number of conditional branch-target addresses, constrained by the requirement for zeros.
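The OR-based multiway branch can be illustrated with a short sketch. It is a model under stated assumptions, not the gate-array logic: the five slices of three bits come from the text, but the position of the branch-sensitive bit within a slice is not given there, so the sketch assumes it is the low-order bit of each slice. All function names are invented for illustration.

```python
from itertools import product

SLICES = 5          # five identical gate arrays, per the text
BITS_PER_SLICE = 3  # each generates a 3-bit slice of the microaddress


def sensitive_bit(slice_index):
    # Assumption: the branch-sensitive bit is the low-order bit of its slice.
    return 1 << (slice_index * BITS_PER_SLICE)


def branch_target(base, conditions, enabled):
    """OR the selected condition of every unmasked slice into `base`.

    conditions[i] is the 0/1 condition selected for slice i.
    enabled[i] is False when the branch mask turns off slice i's sensitivity.
    """
    target = base
    for i in range(SLICES):
        if not enabled[i]:
            continue  # masked slice: its address bit passes through unchanged
        if base & sensitive_bit(i):
            # ORing into a 1 bit could not distinguish the two outcomes.
            raise ValueError("unmasked branch-sensitive bit must be zero")
        if conditions[i]:
            target |= sensitive_bit(i)
    return target


def reachable_targets(base, enabled):
    """All distinct targets the branch can reach from `base`."""
    return sorted({branch_target(base, conds, enabled)
                   for conds in product((0, 1), repeat=SLICES)})
```

With all five slices enabled, `reachable_targets(0, [True] * 5)` has 32 entries, matching the 32-way limit; enabling only three slices gives 8. A masked slice may carry a 1 in its branch-sensitive position, which is why masking enlarges the set of usable branch-target base addresses.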


[Figure 14: block diagram. The microaddress source selection logic chooses among the microword next address, the top of the microstack, the silo addresses, the console address (micromatch register), and the decoder's microaddress. Other labeled elements include the branch condition logic (fed by microbranch conditions), the microstack and microstack pointer, the microtrap logic (fed by the microtrap condition, producing a trap vector), and control stores 0, 1, and 2 with their microdata outputs.]

Figure 14 VAX 8800 Microsequencer


Table 1 shows an example of several microbranch conditions.

Table 1 Microbranch Conditions

Slice Number   Microbranch Conditions
 1             State flags
 2             WBUS low-order bits
 3             WBUS high-order bits
 4             SALU condition codes
 5             PSL condition codes
 6             XALU condition codes
 7             Priority encoder condition codes
 8             ALU condition codes
 9             TB status
10             Cache command
11             MD number
12             AC low
13             Digit valid
14             NMI ID
15             Interrupt pending
16             Interval timer
17             Halt pending
18             Console mode
19             Interrupt ID
20             Non_Retry flag

Microsubroutine Call and Return

As in the normal case just discussed, the default microaddress, the next-address field, is selected as the starting address of a microsubroutine. However, a subroutine-calling microinstruction pushes its own address onto the microstack. During the subroutine return, the microstack is selected as the source and then popped. Thus the address of the calling instruction is used as a base for the return. The returning instruction may OR an offset from the next-address field to that base, thus yielding the target return address. The fact that bits are ORed rather than added constrains the calling addresses to have zeros in the low-order bit positions.

The write path to the microstack (PUSH) is pipelined by a cycle for timing reasons. However, a bypass path saves what would be the top entry of the microstack in the read latch (POP) so that PUSHes and POPs occur in a fairly unrestricted manner. There are, however, some minor coding restrictions with respect to traps and decoder-made addresses.

Subroutine calls and returns are unaffected by stalls. In the VAX 8800 CPU, the microstack is 16 entries deep and is used exclusively for subroutine calls and returns (i.e., microtraps do not use the stack). Subroutine calls may be nested up to 15 entries deep, beyond which the microstack wraps around and overwrites previous call addresses. Since the next-address field is conditionally ORed into the calling address to make the return address, a conditional multiway return becomes feasible.

Microtrap and Return

A microtrap is caused when the hardware detects a condition that would not allow the current microinstruction to complete its execution successfully. The hardware forces the next microaddress to a fixed location that depends on the particular condition, thus overriding the address that would otherwise be selected. This special location is the starting address of the trap-handling microcode routine specific to that trap condition. Microtraps are used extensively by the memory management system to implement the architecture. Microtraps are also caused by serious system faults (i.e., machine checks), such as control-store or bus parity errors. Table 2 lists the microtrap conditions and their priorities. The priorities are arranged so that if more than one microtrap occurs during a cycle, the one with the highest priority will be serviced and the others ignored.

Table 2 Microtrap Conditions and Priorities

Microtrap Condition        Priority
Microbreak                 Highest
Machine check
VA parity error
TB tag parity error
Reserved for ECO
Reserved float operand
Add rounding
Multiply rounding
TB miss
Access violation
Modify bit
Page cross
Unaligned page cross
Unaligned trap
Conditional VAX branch     Lowest
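The priority rule for simultaneous microtraps amounts to a priority encoder. The sketch below is a software illustration of that rule only; the list order is copied from Table 2, while the function name and the use of strings to identify conditions are hypothetical.

```python
# Order copied from Table 2, highest priority first.
TRAP_PRIORITY = [
    "Microbreak",
    "Machine check",
    "VA parity error",
    "TB tag parity error",
    "Reserved for ECO",
    "Reserved float operand",
    "Add rounding",
    "Multiply rounding",
    "TB miss",
    "Access violation",
    "Modify bit",
    "Page cross",
    "Unaligned page cross",
    "Unaligned trap",
    "Conditional VAX branch",
]


def select_trap(asserted):
    """Return the highest-priority condition among those asserted in one
    cycle, or None if no trap condition is asserted. Lower-priority
    conditions asserted in the same cycle are ignored, as the text states."""
    for condition in TRAP_PRIORITY:
        if condition in asserted:
            return condition
    return None
```

For instance, if a TB miss and a page cross are detected in the same cycle, `select_trap({"TB miss", "Page cross"})` selects the TB miss.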


Figure 11 shows the microtrap latency and its consequences on pipelining. As described earlier, a trap-causing microinstruction, even if it writes the wrong results, is allowed to complete because it is too late to block it anyway. (The canonical time of register write is T9, whereas the microtrap signal occurs at canonical time T10.) The only recourse is to let the trap-handling microcode correct any problems caused by the trapping microinstruction. The microtrap signal occurs in time to block all three microinstructions in the trap shadow. Therefore, the microtrap logic generates two global signals, the global microtrap (one cycle long) and the block writes (three cycles long), at time T10. The purpose of the global-microtrap signal is to trigger any necessary trap-contingent actions in various parts of the processor. The purpose of the block-writes signal is to block register writes at canonical times T11, T13, and T15, thus rendering ineffectual microinstructions U, V, and W in Figure 11. In other words, the blocking of writes by hardware is in effect until the trap-handling microcode takes control of the micromachine.

A silo is generally used to save the state of the machine across a microtrap. In most cases the length of the silo is equal to the depth of pipelining. Since there are many more branch-condition bits than microaddress bits, it is more economical to save microaddresses in the trap silo than to save the conditions causing those addresses. Microaddresses U, V, and W must be saved in the silo since they may be branch targets of some previous microinstructions. For the same reason, however, the address X (overridden by X', the starting address of the trap routine) must be saved as well. During the execution of the trap routine, the trap silos are "frozen" (blocked from loading), thus saving the state of the micromachine at the time of the trap.

After the trap routine has completed, two conditions are possible:

1. The recovery from the trap is impossible, and hence the microinstruction sequence cannot be continued. Then the only recourse is to roll back and reexecute the macroinstruction. That is, the macroPC is backed up from its silo, the IB is flushed, and if necessary, any register changes are undone. In this case the last microinstruction of the trap routine performs a trap release, which unblocks the silos so they can resume loading the new states.

2. Microcode can remedy the cause of the trap so that the microinstruction sequence can be continued. In this case the last microinstruction of the trap routine performs a trap return, causing the hardware to recycle microaddresses U, V, W, and X through the microaddress pipe. This action results in the reexecution of aborted microinstructions from the trap shadow.

In the case of a trap return, the hardware selects the microPC silo as the microaddress for the next four cycles. As shown in Figure 14, however, the microPC silo does not contain the microaddresses made by the decoder. Therefore, it is necessary to resynchronize the microinstruction execution sequence with the decoder, while requeuing the trapped microaddresses from the silo. This is made possible by keeping a tag bit in the silo to identify the positions of the microaddresses made by the decoder in the sequence. If a microaddress from the silo is found to be tagged, the requeuing is terminated immediately and the microaddress generated by the decoder is selected. A complete recovery thus occurs since the state of the IB has by this time been backed up, and therefore the decoder-generated microaddress can be used for the continuation.

Chaining of Microtraps

By convention, microtraps are not allowed to nest; instead, they are chained. In other words, the trap-handling microcode must ensure that it will not cause any microtraps itself. The sole exception is its last microinstruction, which may cause a second microtrap to follow immediately, even as the saved microaddresses from the silo are being requeued to resume the original flow. Note that this second microtrap does not take effect until four cycles later, whereas intervening microinstructions are blocked by the hardware as a result of this second microtrap. Consequently, the same microaddresses end up in the microPC silo once again during the execution of the second trap routine. The original sequence may finally resume after the last of such chained traps has been serviced.


Acknowledgments

The specification and design of the VAX 8800 I Box was a team effort. Dave Laurdlo contributed to the IB design, the i-stream data formatter, and the interrupt logic. Bei Pong Wang was responsible for the decoder, the PC increment logic, and the IB-state manager. Jack Ward looked after the physical construction of the sequencer and the control store. The entire development was carried out under the excellent leadership of Doug Clark. Many thanks also go to both Doug Clark and Bob Stewart for their suggestions and guidance during the course of this development.

William A. Samaras

The CPU Clock System in the VAX 8800 Family

The clock system in the VAX 8800 CPU sends timing signals to every state device every 45 nanoseconds. The lack of accuracy of these timing signals is called skew, which must be minimized. Two skews exist: global, between modules; and local, within a module (the lower of the two). The design complexity of the overall system dictated the use of an automated timing verifier. Although advantages accrue from designing for local skew, the verifier could not distinguish between skew types. To gain the benefit of the verifier, a unique hardware trade-off was made to minimize total skew: local was made equal to global. The result was that 83 percent of the cycle time is used productively.

All synchronous computers must provide some means of generating and distributing accurate timing signals. The goal of the timing system in the VAX 8800 family is to provide low-skew (therefore, accurate) timing signals to all parts of the processor without any manufacturing adjustments. Furthermore, the design team wanted to automate the verification of the timing during the design phase. Therefore, design trade-offs in the clocking system were necessary to accomplish that automation. This paper discusses how the hardware designs of the clocking system were influenced to provide a good environment for the automatic timing verification.

Clocking System Requirements

The design of the clocking system required us to address many interrelated problems that had to culminate in a common solution. This design depended on certain fundamental specifications that were established for the VAX 8800 CPU by the system architects. The two primary requirements are described below.

Cycle Time

The cycle time of the VAX 8800 family of processors is 45 nanoseconds (ns), which means that a CPU can accomplish some amount of work during that period. Looking at it another way, these processors can do about 22.2 million actions every second. Usually, a number of these 45-ns cycles are required by a processor to produce just one VAX instruction. The clocking system must keep the thousands of circuits in the processor "ticking" in perfect step together every 45 ns.

The 8800 was designed to contain two complete CPUs in the same cabinet. Since both CPUs share a common memory, it is beneficial to make the memory system and both CPUs synchronous with each other. The clock system must keep all three items running together, precisely locked in time.

Modules

All the circuitry for both processors and the memory controller is contained on 20 16-inch by 12-inch modules, or printed circuit boards. These modules occupy slots in a 21-inch-wide backplane. Each module contains up to 20 ECL gate arrays and miscellaneous ECL logic. The state devices, called latches, reside both in the gate arrays and the miscellaneous logic of each module.

The Clocking Problem

The basic difficulty for this (and any) clocking system is to get the timing signals to every state device in the machine at precisely the same time. Every synchronous machine faces this problem. However, in faster computers, like the VAX 8800 system, the tolerances placed on the timing signals are more severe. In a physical sense, it is simply not possible to send all the timing signals to every part of each module at the same instant. There is some precision, however, that should and can be achieved. We now discuss how important this tolerance is to the VAX 8800 systems, and what we did to minimize it.

The tolerance, or time difference, that we encounter in attempting to provide timing signals to every state device at the same time is called the clock skew. Clock skew is the uncertainty in the time of a particular event. As an analogy, consider an airline flight that is scheduled to arrive at an airport at precisely 5:02 P.M. Now, we know this flight will not arrive at 5:02 P.M. on the dot; it will probably arrive within a minute or two of that published arrival time. This uncertainty in the time of arrival is the skew of that time. If the uncertainty of arrival is 30 seconds, this skew would probably be a very acceptable value and we would say the flight is right on time: it arrived with low skew.

On the other hand, if the uncertainty of arrival is large, say 30 minutes, we would probably try another airline. Why? Not simply because we are impatient but for a more fundamental reason. When the uncertainty is large, we have less time to do other things that are valuable to us. Usually, we are committed to the entire time of the uncertainty. Put another way, this uncertainty, or skew, is wasted time. Enough of this analogy. How does this skew affect the operation of a digital computer?

As mentioned earlier, since the cycle time of each CPU is 45 ns, all state devices are "scheduled" to clock at the start of that period. Any uncertainty in this time from one latch to another is called clock skew. As in our airline example, clock skew is wasted time. There are many factors that increase the clock skew; let us consider one of the most important ones.

Since the backplane width is 21 inches, all the CPU hardware modules are separated by no more than that distance. Since all the wiring in the system is composed of controlled-impedance transmission lines, the logic signals can travel at close to the speed of light. At that speed a logic signal could circle the earth about 4.5 times in 1 second, or it takes about 4 nanoseconds to travel the 21 inches across the processor backplane. Now we can begin to understand the skew problem. The minimum uncertainty of any signal traveling through the entire processor would be at least 4 ns, which is almost 10 percent of the 45-ns cycle. And that is only one source of skew.

Since skew can be wasted time, our goal was to make it as small as possible. In the 8800 system, there are three major contributors to clock skew: variations in the semiconductor components, variations in the wiring lengths (described above), and different manufacturing tolerances of the modules. One common way to remove skew from a system is to make some type of adjustment during the assembly of the hardware. Theoretically, at least, all the skew could be removed through this method of adjustment. To keep the cost of manufacturing low, however, another of our goals was to require no adjustments of any kind. That goal placed an extra burden on the clock system to deliver accurate signals without excessive skew. By carefully designing the circuits of the clocking system and controlling the skew sources mentioned above, we held the overall clock skew in the VAX 8800 family to 7.5 ns. Thus, on average, 83 percent of our 45-ns cycle is utilized. The remainder of the paper explains some of the trade-offs we made to achieve this figure.

Clock Hardware Overview

Figure 1 depicts the hardware in the clock system of the VAX 8800 family.

Figure 1 Clock System in VAX 8800 Family (diagram: a clock module containing a programmable oscillator, 133.5 MHz nominal, a digital phase generator, control logic, and clock distribution; 20 A,B clock pairs cross the CPU backplane interconnect, one to each CPU module, one to the memory controller, and one to each I/O controller; CPU 1 and CPU 2 have 8 modules each, and each typical module, the memory controller module, and up to 2 I/O controllers carry their own clock distribution to the gate arrays)

The oscillator section is the time base of the whole machine. The implementation is a custom phase-locked-loop design that allows the clock period to be varied for test purposes during the manufacturing process. Using a phase-locked loop makes it possible to have a very accurate timing source at many specific clock periods.

The output of the oscillator section connects to a phase generator that provides two clock phases with the proper timing relationship between them. The outputs (called the A-Clock and the B-Clock) of the phase generator are the actual clock signals distributed to all state devices in the machine. The phase generator is implemented digitally by high-speed, 100K ECL shift registers. This technology creates very accurate timing without requiring any manufacturing adjustments.

Since there is only one phase generator and thousands of state devices requiring the clocks, or timing signals, a method is needed to get the output of the phase generator to every state device without adding very much skew. That is the purpose of the distribution stage of the clock system. The actual circuitry used for the distribution consists of 100K ECL differential devices and 10KH ECL devices. The distribution was heavily influenced by our desire to use an automatic timing verifier. The following discussion of the timing verification environment gives a clearer view of the reasoning behind the clock distribution scheme.

Clock System and the Timing Verification Environment

Traditionally, timing verification was accomplished by hand calculations using component specifications. A designer would simply add all the component propagation delays in a particular path and determine if all timing criteria were met. In the past, this method worked fairly well for several reasons. First, the designer usually knew which paths in a circuit were critical and could give special attention to them. Second, components generally behaved better than their worst-case vendor specifications. Marginal timing problems, or ones that were simply overlooked, would often be less serious than the difference between the worst-case specifications and how the components actually worked. Finally, timing errors were expected to appear during the hardware debug phase of a project. Therefore, timing errors that were blatantly missed during the design could be corrected (with a lot of hard work) during that phase. That was possible because the overall complexity of the design could be comprehended by the designers.

From the beginning of the VAX 8800 design effort, we knew that the timing of the design would be difficult to analyze manually. First, the sheer complexity of the machine created over four million different timing paths. It was impossible to analyze every path manually or to discover every "critical" one with either manual or intuitive analysis methods.

Second, hardware circuit loops are widely used in the design; these are circuits that feed signals back to themselves during a later machine cycle. These circuits are very difficult to analyze, especially when loops cross physical boundaries or are nested within other loops. Just thinking about the timing ramifications of nested loops taxes the mind. Manually analyzing thousands of these cases would be impossible.

Finally, the hardware design made heavy use of gate arrays, which contain most of the logic. Our ambitious development schedule and the large number of gate array designs simply could not tolerate unanticipated timing errors. A timing error in a gate array meant that a new gate array must be produced to fix the problem. The fabrication overhead for another semiconductor device, usually taking months, was not consistent with our development schedule. Moreover, while that new gate array was being fabricated, the debugging of the entire system could be jeopardized since it was just not possible to "fix" an LSI chip.

Therefore, the hardware design group wanted to design the processor with the aid of an automatic CAD tool for timing verification. Such an automatic method for verifying the timing was essential to the success of the project. Since the entire design was to be "soft" (the schematics were contained in computer databases), it seemed logical that some type of software tool for automatic timing verification could be applied.

We decided that the most appropriate timing verifier for this project was produced by Valid Logic, Inc. Although this automatic tool solved the problems caused by manual timing verification, it also created some very special new restrictions.

It was apparent from the beginning of the design effort that some restrictions had to be placed on the design styles of individual engineers to reduce the timing-analysis problem to a manageable level. CPU hardware designers, like any other creative persons, often assume large degrees of freedom in their work. Usually, no two designers will arrive at the same solution to a problem, although all solutions may be acceptable. When ten or more designers work independently, as happened on this project, it is likely that ten unique design styles will emerge.

Therefore, we placed restrictions on the timing environment for the following two reasons:

• Some standardization of timing had to take place for electrical signals to communicate properly between designs generated by different people.

• Since the automatic timing verification software was new, several important features were lacking.

The usefulness of an automatic timing verifier depends largely on how well timing-rule violations are reported. Knowing that a design contains timing errors is useful only if it is easy to find them. One way to aid the reporting of timing errors is to create an environment that clocks all state devices in the processor the same way. This means that all logic designs in the processor must follow consistent and strict rules for the clocking of state devices. That was the method we decided to pursue in this design project.

The Timing Environment

The clock system needed strict constraints on its circuit design and physical layout to guarantee accuracy. Therefore, the generation and use of clocking signals were tightly controlled to minimize the different ways in which the circuits could communicate. The timing control of state devices had to be consistent throughout the design. Moreover, any arbitrary timing control of the state devices would have been an impossible task for the timing verification software.

The timing signals in the VAX 8800 processor were carefully distributed to every state device. This distribution was accomplished by carefully expanding the clock signals at strategic physical positions in the processor. A simple example of this expansion, or fan-out, is shown in Figure 2.

Figure 2 Clock Expansion Groups (diagram: a clock source fanned out through Levels 1 to 5 to the latches)

Each time the clock signals are expanded, more timing uncertainty is introduced into the resulting signals. The 8800 design required up to five levels of expansion to produce enough clock signals for every state device. As shown in Figure 2, some signals are in common distribution groups. Signals existing in the same group will have low timing uncertainty between them, a characteristic called skew correlation. The timing uncertainty between signals in different distribution groups has no correlation; therefore, these signals have the highest skew. Signals from the same group have a skew, called local skew, lower than the overall group-to-group skew, called global skew.

It is very tempting for designers to take advantage of the lower local skew, which is often only half that of the global skew. Each clock distribution group is usually contained entirely on one logic module due to the natural physical partitioning of the hardware. Therefore, communication between circuits on any particular module can take advantage of the lower local skew. If all signal communication occurs within the local-skew environment, the timing analysis can be consistent and easily managed. However, complications arise when trying to analyze signals that cross from the local-skew environment to the global-skew environment. Signal communication between logic modules will have to pay the penalty of using the higher global skew because the timing signals at each end of the communication are derived from different distribution groups. Managing the timing interface across this partition between local and global skews was beyond the capabilities of the timing verification software.

As discussed earlier, a timing analysis of the entire processor was beyond human capacity; therefore, it had to be performed with timing verification software. The timing verification tool chosen for the 8800 development had no facility for distinguishing between local and global skews. Moreover, we wanted to use the timing verifier to analyze the timing of the entire CPU as one entity. This decision forced us to disallow the use of any local-skew computations in our timing analysis. Now, from a design point of view this decision made the environment very easy to work with. All timing transactions anywhere in the CPU could be analyzed the same way with the same set of specifications. Everything comes at a price, however, and the obvious negative side of this decision was the loss of the ability to apply the lower local skew. At that point, some performance of the processor seemed to be compromised just to simplify the timing analysis. The following discussion explains how this problem was solved.

The Clock Distribution Solution

Since we wanted to time the CPU as one entity, we had to make the global skew as small as possible to maximize CPU performance. In the actual implementation, the global skew was lowered by removing one gating level from the clock distribution. The gating level removed was necessary for producing low local skew. Figure 3 illustrates the five levels of fan-out that were required to produce enough signals when the global-skew distribution was minimized. Figure 4 shows the same fan-out to produce enough signals in the case in which the local-skew distribution would be minimized. Table 1 illustrates the impact of this optimization for global skew.

Figure 3 Minimized Global Skew Distribution (diagram: the A and B clocks passing from the clock module across the backplane to a typical CPU module and its gate arrays through fan-out Levels 1 to 5)

Figure 4 Minimized Local Skew Distribution (diagram: clock fan-out within a typical CPU module and on to its gate arrays)

Although using the lower local skew would have been valuable, it was sacrificed by making it equal to the global skew.

In short, the hardware of the clock system was designed to allow the maximum exploitation of the timing verification software. Of course, hardware and software trade-offs are a common occurrence in any design project. In this case, however, the value of the hardware involved with operating the machine was balanced against the software analysis needed during the design phase of the machine.

Summary

Producing the clocking system for a high-speed computer is best described as an exercise in minimizing and managing skew. In the VAX 8800 project, we avoided exotic hardware techniques so that we could gain the benefit of using an automatic timing verifier. The resulting skew of 17 percent of the cycle time was a figure that could be tolerated. This balance was a fair trade-off since the simplicity of the timing environment allowed us to decrease the time to design and build the VAX 8800 family of systems.
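The single-skew timing environment described above can be pictured as one budget check applied to every latch-to-latch path. The following is only a minimal sketch under simplifying assumptions: it treats each path as needing its worst-case delay plus the clock skew to fit in one 45-ns cycle, it ignores latch setup times and the two-phase (A-Clock, B-Clock) scheme, and the path names and delays are hypothetical. It is not a description of how the Valid Logic verifier works internally.

```python
CYCLE_NS = 45.0  # cycle time stated in the paper


def violations(paths, skew_ns, cycle_ns=CYCLE_NS):
    """Return the names of paths whose delay exceeds (cycle - skew).

    paths is a list of (name, worst_case_delay_ns) pairs.
    One skew value is applied to every path, mirroring a verifier
    that cannot tell local skew from global skew.
    """
    budget = cycle_ns - skew_ns
    return [name for name, delay in paths if delay > budget]


# Hypothetical worst-case path delays, for illustration only.
paths = [
    ("alu_to_md", 36.0),
    ("tb_to_cache", 37.5),
    ("decode_to_alu", 38.2),
]

if __name__ == "__main__":
    for skew in (7.5, 9.0):
        print(f"skew {skew} ns -> violations: {violations(paths, skew)}")
```

With one skew figure for the whole CPU, every reported violation is judged against the same rule, which is the reporting simplicity the authors were after. The cost is that an intra-module path cannot claim a smaller skew even when one physically exists.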

Table 1 Distribution Changes

                          Global Skew    Local Skew
Optimized Local Skew      9 ns           2 ns
Optimized Global Skew     7.5 ns         7.5 ns
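The figures in Table 1 can be turned into cycle utilization with a short calculation. This sketch assumes, as the paper's own 83 percent figure suggests, that usable time is simply the cycle time minus the skew; the dictionary layout and function name are illustrative choices, not part of the original design.

```python
CYCLE_NS = 45.0

# Skew values in nanoseconds, copied from Table 1.
TABLE_1 = {
    "optimized local skew": {"global": 9.0, "local": 2.0},
    "optimized global skew": {"global": 7.5, "local": 7.5},
}


def usable_fraction(skew_ns, cycle_ns=CYCLE_NS):
    """Fraction of the cycle left for useful work after skew is deducted."""
    return (cycle_ns - skew_ns) / cycle_ns


if __name__ == "__main__":
    for name, skews in TABLE_1.items():
        # A verifier with a single skew figure must use the global value.
        whole_cpu = usable_fraction(skews["global"])
        on_module = usable_fraction(skews["local"])
        print(f"{name}: whole-CPU {whole_cpu:.1%}, on-module {on_module:.1%}")
```

Timed as one entity, the distribution optimized for local skew yields 80 percent of the cycle, while the distribution optimized for global skew yields about 83 percent, which matches the utilization quoted in the paper.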

John Fu
James B. Keller
Kenneth J. Haduch

Aspects of the VAX 8800 C Box Design

In each processor in the VAX 8800 family, instructions and data are supplied to the execution units by the C Box. Employing a simple structure with a translation buffer, cache, and address and data buffers, this logic unit is an integral part of the processor's five-stage pipeline. The no-write allocate cache uses a write-through scheme featuring a unique delayed-write algorithm. The C Box has control logic to accommodate pipeline stall conditions caused by memory accesses. The C Box also maintains data coherency within a processor and between processors. A dynamic priority-arbitration scheme solves the lock-out problem between I/O and processor requests.

The performance of a high-speed computer depends to a large extent on how fast data can be passed from its memory to its execution units. If the computer is pipelined, the unit responsible for memory accesses may have to handle pipeline stall conditions. And if the computer is a multiprocessor, that unit in each processor may also have to handle data coherency problems. In processors with the VAX architecture, data accesses are further complicated by the fact that virtual addresses are normally specified. These addresses require translation to physical addresses before a data access can even be attempted.

In the VAX 8800 system, which is a multiprocessor with pipelined CPUs, the unit that performs address translations and data accesses is the C Box.

C Box Description

The C Box consists of three subunits: the translation buffer (TB), the cache, and the NMI interface. Figure 1 is a schematic diagram of this unit.

The translation of a VAX virtual address to a physical address is a complicated process.¹ Accesses to system and process page tables are required, and shifting and adding must be done to obtain the final physical address. Performing this address translation process for every data reference significantly increases the data access time and reduces the read bandwidth. One way to avoid that is to store the result of this address calculation in a small, fast memory called a translation buffer. Since each translation can access a page of data (512 bytes in the VAX architecture), it is likely that the translation will be used again in the program being executed. Rather than recalculating the physical address (PA) on those subsequent accesses, it can be retrieved from the TB.

The translation buffer in the VAX 8800 processor holds 512 system and 512 process address translations. The following summarizes the characteristics of the TB.

Characteristics of the Translation Buffer

• Direct Mapped
• 1024 Lines
  - 512 System Lines
  - 512 Process Lines
• Allocation on Translation Buffer Miss

A common approach to the problem of data access latency for high-speed processors, and the one used in the VAX 8800 CPU, is to use a cache.² A cache is a small, fast memory located between the processor and the main memory system. If the data requested by the CPU is not contained in the cache, that data is accessed from main memory and loaded into the cache.

Figure 1 Block Diagram of C Box (diagram: the translation buffer with its TB tag and TB data sections, the cache with its cache tag and cache data sections, and the NMI interface with read-stream address buffering, write-stream data buffering, and the write buffer. TB - translation buffer; VA - virtual address; PA - physical address; A, B - A and B phases of the two-phase clock)
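The direct-mapped translation buffer described above can be sketched in a few lines. The paper gives the line count (512 system plus 512 process lines) and the 512-byte page size; the exact bit fields used here are assumptions made for illustration: VA bit 31 selects the system or process half, VA bits 17 to 9 select a line within that half, and VA bits 30 to 18 form the tag. The class and method names are hypothetical.

```python
PAGE_BITS = 9    # 512-byte VAX page: the low nine VA bits pass through untranslated
INDEX_BITS = 9   # 512 lines per half (assumed index field VA<17:9>)


class TranslationBuffer:
    """Toy direct-mapped TB: each VA maps to exactly one line."""

    def __init__(self):
        # Lines 0..511 hold process translations, 512..1023 system ones
        # (the ordering of the two halves is an assumption).
        self.tags = [None] * 1024
        self.pfns = [0] * 1024

    @staticmethod
    def _split(va):
        offset = va & ((1 << PAGE_BITS) - 1)              # VA<8:0>
        line = (va >> PAGE_BITS) & ((1 << INDEX_BITS) - 1)  # VA<17:9>
        system = (va >> 31) & 1                            # VA<31>
        index = (system << INDEX_BITS) | line
        tag = (va >> 18) & 0x1FFF                          # VA<30:18>
        return index, tag, offset

    def lookup(self, va):
        """Return the PA on a TB hit, or None on a TB miss."""
        index, tag, offset = self._split(va)
        if self.tags[index] == tag:
            return (self.pfns[index] << PAGE_BITS) | offset
        return None  # on a miss, microcode performs the translation

    def fill(self, va, pfn):
        """Allocate the line for va (done only after a TB miss)."""
        index, tag, _ = self._split(va)
        self.tags[index] = tag
        self.pfns[index] = pfn


if __name__ == "__main__":
    tb = TranslationBuffer()
    va = 0x80001234
    print(tb.lookup(va))        # miss before the line is allocated
    tb.fill(va, 0x155)          # hypothetical page frame number
    print(hex(tb.lookup(va)))   # hit: frame number joined with the low nine bits
```

Because the buffer is direct mapped, a second virtual address with the same index but a different tag displaces the first translation when it is filled; the tag comparison is what detects that the line no longer belongs to the original address.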

Thus, in the majority of cases, the cache will contain recently referenced data items, and future references to those data items will be fetched from the cache. The intent is to minimize the number of longer latency accesses to the main memory subsystem. The success of a cache memory relies on the locality of references in both time and space.

The data cache in each VAX 8800 CPU holds 64 kilobytes (KB) of both data and instructions. The following list summarizes the characteristics of the cache.

Characteristics of the Cache

• Direct Mapped with Physical Address
• Read Allocate Only
• Delayed-Write Cache Update
• Write-through Memory Update with Write Buffering
• 1024 Blocks
• 64-byte Block Size
• 4-byte (one longword) Line Size
• 32-byte (one hexword) Cache Refill Size

The TB and the cache are very similar in concept and structure, except that the TB is used to accelerate address translations and the cache to accelerate data accesses. Each consists of a tag section and a data section. The tag section holds the unique identifier, or tag, for the data item held in the corresponding data section. The TB and the cache are direct mapped, meaning that each address can point to only one location; however, each location can potentially be allocated to one of many addresses. A tag permits the identification of a data item in either the TB or a cache location. The tag in the VAX 8800 processor is an unmodified selection of bits from the address of the data item being accessed. This concept is depicted in Figure 2.

Figure 2 Translation Buffer and Cache Address Mapping (diagram labels: VA(31-0), VA(30-18), PA(29-0), PA(28-16), PA(15-6), TB tag, TB data, cache tag, cache data, TB hit, cache hit. VA - virtual address; PA - physical address; TB - translation buffer)

As mentioned earlier, a memory access is required if the cache does not contain a requested data item. In the 8800, both processors are connected to the memory and the I/O subsystems through the NMI bus. All read and write references that go to these subsystems are processed by the NMI interface. This interface maintains a set of buffers for both read and write reference streams. For the read stream there are actually two sets of address buffers: one for data reads, the other for instruction reads.

C Box Operations

A C Box reference consists of a function code, an address, and in the case of writes, 32 bits of data. In general, that address is a 32-bit virtual address (VA). The VA translation process begins with a check to see if the PA is available in the TB. If the PA is available, called a TB hit, the data is read out and concatenated with the lower nine bits of the VA to form the PA. As part of the translation process, the TB also performs page access checking. If the PA that pertains to the VA is not in the TB, called a TB miss, then microcode must perform the translation. The microcode then writes the data into the TB for subsequent use. (If the address supplied is already a PA, then the TB is not used.)

Only physical addresses access the cache. If the data referenced is contained in the cache, called a cache hit, then the data can be accessed from there. If the cache does not contain the data, called a cache miss, then the data must be accessed from memory.

Read Operations

Cache-miss addresses for reads are passed to the NMI interface, where they are held in the read address buffers. A hexword read request (32 bytes), with the address of the missed location, is then made to memory. The memory data is passed to the requesting unit, and the address held in the read address buffer is used to update the missed cache location. A read miss is the only occasion upon which a cache location is allocated.

There are two read streams in the C Box for requests to memory: the data stream, called the d-stream, and the instruction stream, called the i-stream. The i-stream requests the memory to send data destined for the instruction unit (I Box), which interprets that data as macroinstructions. I-stream fetches are initiated by microcode, which loads a C Box register called the physical instruction buffer address (PIBA). The PIBA holds the address of the next longword of the i-stream to be fetched. If the execution of macroinstructions is sequential (i.e., there are no branches, page crosses, etc.), the C Box can increment the PIBA contents automatically after each fetch. However, should the program branch or a page cross occur, microcode must be used to reload the PIBA. D-stream fetches are made only by the microcode, which must specify one of eight memory data (MD) registers as its destination. D-stream data is always returned to the execution unit.

Write Operations

In general, the performance of a cache is measured by its hit rate when reading data. The selection of the update mechanisms for both cache and memory, however, can have a major influence on the design of the cache. There are two well-known strategies for updating a cache: write allocate, and no-write allocate. A write-allocate scheme updates a cache location whether or not the write is a hit or a miss. This scheme is generally implemented with a write-back memory arrangement (discussed later). In a no-write allocate scheme, the cache is updated only if the write was a hit. The VAX 8800 processor uses a no-write allocate scheme.

The no-write allocate scheme does, however, present a problem. Since only writes that hit will update the cache, cache updates take two pipeline cycles in the C Box: the first to check for hit or miss, the second to update the cache for a hit. The C Box was designed to enable one read reference to complete in each cycle. If two consecutive cycles are needed to update the cache, the second cycle could block a read reference, thus causing a pipeline stall.

To solve this problem, the C Box implements a delayed-write algorithm. This mechanism delays writes that must update the cache from doing so until the first cycle of the next write reference. The second cycle of the delayed write does not need to be the next consecutive cycle.

The delayed-write algorithm in the C Box takes advantage of the fact that the first cycle of a write utilizes only the tag section of the cache to determine whether a hit or a miss has occurred. The second cycle uses only the data section. A write that must update the cache has its address and data placed into the delayed-write address and data buffers respectively. On the next write access, during the cache-tag look-up cycle, the data section of the cache will be updated from the address and data contained in those buffers, but only if the previous write access was a hit. Since reading a data item after one has been written is common, this design significantly reduces the potential for stalls.

Write Buffer

All write references, whether or not they hit in the cache, must eventually go to memory. There are two general strategies in cache design with respect to memory updating: write-through, and write-back. In the write-through approach, write references are sent to the memory system immediately. Conversely, in the write-back approach, writes are held until the cache block is deallocated (made ready to receive different data).

There are several major problems with a write-back strategy. First, it requires either microcode or hardware to accomplish all the write-back functions. Adding that code or hardware to the C Box would have considerably increased its complexity.

Second, if there is a write miss with this scheme, a cache block that might be full of valid data could be displaced by a block whose only valid data was that just written to the cache. For a cache having a large block size, as the 8800 has, this action is undesirable. Moreover, in most cases microcode reads data before it is written; therefore, writes will generally hit in the cache.

Finally, the write-back strategy requires a complex algorithm to maintain coherency between caches within a multiprocessor system. Therefore, for all those reasons, we chose to use the write-through approach in the cache.

One disadvantage of write-through is that it tends to generate a lot of write traffic to the memory. In a shared-bus system like the 8800, this traffic can limit performance. To reduce memory-write traffic, writes in the VAX 8800 processor are buffered in a write buffer contained in the NMI interface. This write buffer is really a one-line, octaword, write-allocate cache. A write going out to the NMI bus is held in the write buffer. Subsequent writes to the same octaword update only the write buffer so that no memory requests are sent on the NMI bus. A write that is outside the octaword currently in the write buffer deallocates it; that is, the contents of the write buffer are sent to memory, and the next write replaces those contents in the buffer.

Like the cache, the success of the write buffer in reducing bus traffic relies on the locality of programs in space and time. For example, sequential writes, such as pushes to the stack, will get collected in the write buffer even if the writes occurred in different macroinstructions. This collected "package" of writes can then be sent to the memory more efficiently than can individual writes.

Another advantage of the write buffer is that it decouples the processor from memory activity. When the memory is busy processing transactions from the other processor or from the I/O subsystem, a processor will not stall due to writes. The write buffer is actually implemented as a two-deep buffer, which further reduces the potential for stalls.

44 Digital Te chnicaljournal

No. 4 fe bruarv I Y87 New Products

Pipeline Stalls

In a pipelined implementation, how well the pipeline performs is determined both by how often it is flushed clear and how often it is stalled. Stall conditions are generally related to the lack of some physical resource or data.

In some implementations, some pipeline stages can take more cycles to complete than others for certain functions. If a shorter stage precedes a longer one, the longer one will be unable either to accept fresh data or to pass its result to the next stage until finished with its cycle. In turn, other portions of the pipeline cannot proceed with their operations; therefore, the pipeline will stall. In this stalled condition, all stages preceding the "bottleneck" maintain their input and output conditions until the stage responsible for the stall completes its function. Some implementations have a combination of stages that may exhibit these characteristics, leading to complex pipeline stall conditions.

In the VAX 8800 CPU, the design simplicity of the pipeline ensures that each pipeline stage, except the C Box, always completes its function in one cycle.³ Since the C Box also controls data accesses, all stalls in the 8800 are related to the operation of this unit. The pipeline will experience two types of stalls: the MD stall, and the VA stall.

MD Stalls

When making a read reference, a microinstruction must specify one of eight MD registers to be used as its destination. When data is made available, either from the cache or from memory, it is written into the specified MD register. Subsequent microinstructions then use the data from this register. If a microinstruction attempts to use an MD register that is not "valid" (i.e., the data has not yet been fetched by the C Box), the pipeline will experience an MD stall.

The MD stall condition is a data-dependency type of stall that is generally seen in pipelined machines. On the VAX 8800 processor, certain steps are taken to either avoid such stalls or reduce their effects. For example, consider two consecutive microinstructions, R and S, as illustrated in Figure 3. R is a microinstruction that performs a read and puts data into an MD register. S then accesses and uses the data fetched by R. If R and S are adjacent, the pipeline will stall in the 8800. The reason for the stall is that the pipeline stage accessing the MD data and the stage fetching that data (the C Box) are separated by one other stage, the arithmetic and logic unit (ALU). When S tries to use the MD data, R is just starting to make the read reference in the C Box. S must therefore stall the pipeline, waiting for data to be supplied by R.

[Timing diagram: instructions R and S pass through the ALU, TB, cache, and MD access stages. R starts its read reference as S requires the data read by R, so S must stall at least one cycle for the data. MD: memory data register. TB: translation buffer.]

Figure 3 Instructions R and S Are Adjacent

Aspects of the VAX 8800 C Box Design

[Timing diagram: instruction R, one intervening instruction, and instruction S pass through the ALU, TB, cache, and MD access stages. R has completed its read reference and the data is just available when S requires it. The data is sent directly into the ALU, bypassing the MD update, so there is no stall.]

Figure 4 Instructions R and S Separated by Another Instruction

On the other hand, if R and S are separated by one other instruction, then when S attempts to use the data read by R, that data is just being made available by the C Box (assuming, of course, a read hit in the cache). If S were to wait for the MD registers to be updated before using the data, the pipeline would stall. To eliminate that type of stall, a path has been designed from the C Box directly into the input of the ALU, bypassing the MD registers. Therefore, the data coming from the cache is sent both to the MD registers for updating and directly to the ALU, where S can use the data. The net effect is that this bypass path removes the one-cycle latency that S would have experienced had it waited for the data to come out of the MD registers. Figure 4 illustrates these concepts.

Had R caused a read miss, S would still cause an MD stall since the C Box must make a memory fetch for the data. Notice that an MD stall happens only when S attempts to use an MD register. Therefore, a general rule for making microcode accesses to the C Box is to make read references early and to use the MD registers late. Should the read reference miss, some part of the memory-fetch latency will be hidden by the microinstructions between the read and the MD register access. When data returns from a read miss and the pipeline is either undergoing or about to undergo an MD stall, the bypass path can be used to reduce the effects of the stall or even prevent it.

VA Stalls

A VA stall condition occurs when the C Box cannot process a requested reference. This can be due to either an invalidation cycle in the C Box (discussed in the final section of this paper) or the capabilities of the address and data buffers in the NMI interface being exceeded.

As mentioned earlier, for reads there is a set of buffers for d-stream and i-stream references. The d-stream buffering is one deep, meaning there can only be one read miss outstanding in the C Box. However, the implementation will not allow the pipeline to stall should subsequent reads hit in the cache. I-stream reads never stall the pipeline as do VA and MD stalls, which stop the clock. The instruction buffer can "stall" if it does not have enough data for the decoder to complete the decode of the current VAX instruction operand. This condition causes the CPU to perform a no-operation microword. That does not stop the clock, however, and thus is not a pipeline stall.
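The MD stall cases shown in Figures 3 and 4 can be summarized in a small model. The following is only a sketch under simplifying assumptions, not the actual 8800 control logic: the function name, the idea of counting the microinstructions between R and S as a "gap", and the `miss_latency` parameter are all hypothetical. It assumes that, on a cache hit, the data reaches the ALU one instruction slot after R through the bypass path, and one slot later if S has to wait for the MD register update.

```python
def md_stall_cycles(gap, bypass=True, miss_latency=0):
    """Estimate how many cycles S stalls waiting for data read by R.

    gap          -- microinstructions placed between R and S (0 = adjacent)
    bypass       -- model the direct C Box-to-ALU path (hypothetical flag)
    miss_latency -- extra cycles if R misses the cache (assumed value)
    """
    # Slots that must separate R and S before the data is usable:
    # one with the bypass path, two if S waits for the MD register update.
    slots_needed = 1 if bypass else 2
    # Every intervening microinstruction hides one cycle of the wait.
    return max(0, slots_needed + miss_latency - gap)


adjacent = md_stall_cycles(0)                    # Figure 3 case
separated = md_stall_cycles(1)                   # Figure 4 case
no_bypass = md_stall_cycles(1, bypass=False)     # latency the bypass removes
miss = md_stall_cycles(3, miss_latency=10)       # read early, use MD late
```

In this model the adjacent case stalls, the separated case does not, and each microinstruction placed between the read and the MD register access hides one cycle of an assumed miss latency, which is the reasoning behind the "read early, use late" rule.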


The C Box can still receive commands even if it contains one read miss. Of course, there is the potential that the command being received will miss in the cache. That will require the NMI interface to request the data from memory, thus resulting in a VA stall. That stall lasts from the time the command is received until the time the previous read-miss data returns from memory. If the second command is a read that hits in the cache, a VA stall will be generated for the one cycle that it takes to determine whether or not there is a cache hit. The read data will then be taken from the cache and returned to the MD, after which the stall will be released.

Since writes go to memory more than reads, the buffering for writes is more extensive. The delay-write buffer and the double buffering in the write buffer are used to reduce the possibility of write stalls. These buffers enable the C Box to hold a maximum of nine longwords of data before the pipeline will experience a VA stall on a write.

Stalled and Unstalled Logic in the C Box

If an instruction is stalled, the C Box has either not returned the data or cannot take another reference. Therefore, all stages prior to the C Box (the I Box and the E Box) must be stalled. The TB is part of the last stage of the pipeline; therefore, it must be capable of being stalled. When the pipeline stalls, the TB holds the address of the stalled reference. Only the NMI interface can resolve a stall, either by supplying the read-miss data or by freeing up its buffers. Thus this interface can never be stalled. However, the cache, being part of the last stage of the pipeline, is also the path for supplying data to the stalled instruction. This situation leads to an interesting control characteristic of the C Box. One of its sections, the TB, can be stalled; another, the NMI interface, must never stall; and the third section, the cache, must remain unstalled but maintain stalled input and output conditions in its logic. Figure 5 depicts the logic for stalled and unstalled conditions in the C Box.

Coherency Problems in the C Box

In general, data coherency means that a read should always get correctly modified data when a series of reads and writes is made in any sequence. One way to maintain coherency is to perform all reads and writes to completion in a purely sequential manner, thus strictly maintaining their sequence of reference. However, in a pipelined machine, not only can there be several sources of read and write references, but there can also be more than one copy of the data item. This duplication often leads to very complex solutions to achieve coherency.

This complexity has been simplified somewhat in the VAX 8800 pipeline by having the C Box both control and sequence all data accesses. The C Box itself, however, is pipelined, having a d-stream and an i-stream for reads, and a stream for writes. This fact also presents some coherency problems. Coherency for the C Box means that two conditions must be met.

1. After a sequence of reads and writes has completed, any valid blocks in the cache must match the data in the memory.

2. Whenever the processor writes to a location in memory and then reads that location, the data has to be what was written.

[Block diagram of the E Box, translation buffer, cache, and NMI interface with their physical address and data paths. Status labels: stalled, stalled/unstalled, unstalled.]

Figure 5 Stalled and Unstalled Logic in C Box
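The VA stall rules for d-stream reads given above, with one read miss already outstanding, can be written as a small decision function. This is a simplified sketch rather than the C Box logic itself: the function and parameter names are hypothetical, invalidation cycles and write buffering are not modeled, and the zero-stall result when no miss is outstanding is an assumption.

```python
def va_stall_cycles(miss_outstanding, hits_cache, cycles_until_miss_data):
    """Cycles of VA stall caused by a newly received d-stream read.

    miss_outstanding       -- True if the one-deep read-miss buffer is in use
    hits_cache             -- True if the new read hits in the cache
    cycles_until_miss_data -- cycles until the earlier miss data returns
    """
    if not miss_outstanding:
        # Assumption: with the miss buffer free, the reference is accepted.
        return 0
    if hits_cache:
        # One cycle is spent determining whether the read hits.
        return 1
    # A second miss waits until the previous read-miss data returns.
    return cycles_until_miss_data
```

For example, `va_stall_cycles(True, False, 12)` models a second miss that arrives 12 cycles before the first miss data returns.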


Two types of coherency problems exist in the VAX 8800 system: coherency within a processor, and coherency between processors.

The first type of problem in the C Box arises from the implementation of the delay-write algorithm discussed earlier. A problem occurs when a read is attempted to the cache location waiting to be updated by the write held in the delay-write buffers. The read will hit, but the cache data will be stale. One solution to this problem is to stall the pipeline while the cache is updated, performing the read for the correct data. The trouble here is that the sequence of writing to and reading from the same location is a common occurrence. Thus to stall would significantly reduce the read bandwidth.

The C Box solves this problem by comparing selected bits of the read and write addresses in the delay-write buffer. If the bits match, then the data content of that buffer is used as the read data. This solution works because, to the read, the delay-write buffer appears to be an extension of the cache. Since the read address matched the address in this buffer, the data can be taken directly from it. Coherency is thus assured, and no stall penalty is incurred.

The second type of coherency problem occurs when the read is a miss and thus goes to the NMI interface. To assure high performance, the NMI interface maintains two streams of data requests, the read and write streams. The buffering and the control of these two streams operate independently. If made to different data items, read and write requests can be processed to memory as quickly as possible, even out of sequence. The coherency problem is to make sure that subsequent reads and writes to the same data item result in its correct state.

If a read request occurs that was a miss, the cache will send it to the NMI interface upon discovering that fact. Once in the NMI interface, the read address is compared to the address of the octaword in the write buffer. If those addresses are different, the cache will send the read directly to memory. Thus the data in the write buffer will be unaffected. If the addresses match, however, the write data will be sent to memory, followed by the read request. Since the memory subsystem processes references in a sequential manner, the read will always access the correct data. (Of course, this case is fairly simple. A more complicated one is that in which a read is sent to memory, and the processor performs a write while waiting for that read.) If the addresses of the read and write match, the cache can give the processor the requested data but cannot mark the returned data valid in the cache. This situation occurs because the read-miss data being fetched from memory has been made stale for subsequent reads.

The microcode is designed so that it will never read a data item and then write to it without first accessing the MD registers. However, a cache block is 64 bytes long. The microcode could write to any other data item in the block before coming to the missed data item. There can be as many as three writes and two reads (one each for the d- and i-streams) buffered simultaneously in the C Box, all referencing the same cache block. Even worse, the C Box can send an arbitrary number of writes to memory while waiting for the data returned by the read to memory. To maintain coherency, the C Box performs a set of address matches between the read and write streams. Then it "remembers" whether or not any write addresses matched the outstanding reads and marks them invalid as appropriate.

C Box Design for a Multiprocessor System

The VAX 8800 system consists of two identical VAX 8800 processors on the NMI bus connected to the memory and I/O subsystems. Within a processor, only the design of the C Box has been affected by the requirements of a multiprocessor arrangement. That is because the C Box is the CPU's interface to the NMI bus and contains the central arbitration logic for that bus.

There are three key issues in designing a memory interconnect for a multiprocessor system: bus arbitration, bus bandwidth, and data coherency between processors.

Bus Arbitration on the NMI Bus

Two major problems were encountered in the design of an arbitration scheme for the NMI bus. The first was the fact that between the CPUs and the I/O subsystems, called the NBIs, there was a possibility that a high-priority device could lock out a low-priority device from the bus. This is certainly possible with a fixed priority-arbitration scheme. To address this problem, the C Box implements a dynamic priority-allocation


scheme that causes priority to be assigned between two groups: the I/O devices, and the CPUs. Within these groups, the priority shifts between the two CPUs and the two I/O devices. For example, if all four devices wanted to use the bus all the time, the order in which the bus would be granted to the devices would be

    first CPU, first I/O, second CPU, second I/O, first CPU, first I/O, second CPU, second I/O, etc.

This scheme guarantees that all devices on the bus will have nearly equal access to the bus, thus solving the lock-out problem.

The second problem involves the "memory busy" situation. Whenever the memory subsystem cannot process more requests, it sends a "memory busy" signal. It could happen, for instance, that a CPU accesses the bus and attempts to write to memory. Upon receiving a memory-busy signal, the CPU will abort the write. When memory is released, some other device will access the bus and perform a write, thus filling the write queue in memory. Once again, the first CPU re-arbitrates, accesses the bus, and tries to write. Once again, that CPU receives a memory-busy signal. And so on.

The NMI arbitration scheme mentioned above solves this problem in which a device might get locked out of memory. As implemented, the arbitration scheme saves the priority state at the time before the memory-busy signal was asserted. The arbitration logic then restores that state so that the device that received the signal will get the bus when the memory-busy signal is deasserted.

Bus Bandwidth

For the processors on the interconnect, bus bandwidth involves two components: read bandwidth, and write bandwidth. The problem of inadequate read bandwidth is addressed by having a high hit-rate cache. The higher the hit rate, the fewer the requests to memory. The problem of inadequate write bandwidth can be treated in two ways. The first way is to have a write-back cache like the one on the VAX 8650 processor.⁴ Such a cache writes a block to memory only when the cache block is deallocated. This technique can significantly reduce the write bandwidth requirements.

In multiprocessor systems like the 8800, however, in which each processor has an internal cache, this technique becomes complicated. In these systems, a data item can exist not only in memory but also in all the caches. To maintain coherency, each write-back cache would have to notify the other cache when the first cache writes. This technique usually leads to a complex protocol and design implementation.

Another approach in a multiprocessor system, the one used in the 8800, is to implement write-through caches. In such an approach, all write references go directly to memory so that each cache on the bus can "see" all write activity. The caches can then be invalidated. Such an approach greatly simplifies the protocol for cache coherency but, as discussed earlier, generates a high degree of write traffic. The unique design of the write buffer helps to reduce this traffic, although not as much as a write-back cache would. In the 8800 processor, however, the write buffer reduces traffic enough so that the two VAX 8800 processors can write at their maximum bandwidths on the NMI bus.

Coherency in a Multiprocessor System

A multiprocessor system, with internal caches, presents a number of interesting coherency issues when sharing data. Ideally, if one processor writes to a location and the other processor reads that location, the read will always get the data that was written. In practice, achieving this condition is difficult. Several major questions arise: Did the read happen before the write or after it? What happens if both processors write to the same location at the same time? Unless controlled, these situations can produce unpredictable results.

If programs on the processors want to share data, they must use the interlock instructions in the VAX architecture.⁵ Only after an interlock instruction is processed will the memory location be guaranteed to have the correct data. The general method is as follows. Processes must decide to share a block of memory. One memory location is called the software lock, and only one process at a time is allowed to write to (or lock) that location. This is accessed with an interlock instruction, for example, the branch on bit set and set interlocked (BBSSI) or the add aligned word interlocked (ADAWI) instructions.
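The grant order produced by the dynamic priority scheme described under Bus Arbitration can be reproduced with a short generator. This is a minimal sketch that assumes all four devices request the bus continuously; the function name and device labels are illustrative, and the memory-busy save and restore of priority state is not modeled.

```python
from itertools import islice


def nmi_grant_order(cpus=("first CPU", "second CPU"),
                    ios=("first I/O", "second I/O")):
    """Yield bus grants when every device wants the bus all the time."""
    turn = 0
    while True:
        # Priority alternates between the CPU group and the I/O group,
        # and shifts between the two devices inside each group.
        yield cpus[turn % len(cpus)]
        yield ios[turn % len(ios)]
        turn += 1


order = list(islice(nmi_grant_order(), 8))
```

Here `order` holds the first eight grants, which follow the sequence given in the text: first CPU, first I/O, second CPU, second I/O, and then the same four again.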


Upon gaining the software lock, a given process can proceed to write any location in the shared block. Read-write coherency will be assured only if the other processes sharing that data observe the protocol of obtaining the software lock before modifying the data structure.

The VAX interlock instructions are implemented using interlock microinstructions. These enable a processor to lock and unlock the memory subsystem. Once locked, this subsystem excludes further attempts to lock it until an unlock has occurred. Thus only one processor or I/O system can lock the memory subsystem at any one time.

When each processor has an internal cache, there is one more mechanism that keeps the two processors coherent. While one processor is performing a write to memory and while the write command is on the NMI bus, the other processor will examine its cache store to see if it contains a copy of that data. If the data is there, it is marked invalid. The next request for this data will then result in a cache miss and a subsequent fetch to memory. This simple approach is possible because the VAX 8800 caches are write-through. Although all writes are seen on the bus, the write buffer packs together consecutive writes within an octaword. Therefore, the number of invalidation cycles performed by a processor will be reduced. When an interlock write is performed, the contents of the write buffer are sent to memory. Thus the interlock mechanism ensures that data coherency will work under all conditions. Figure 6 illustrates the events that achieve coherency in the 8800.

Summary

The general concepts used in the design of the C Box are well known to computer designers. Our goal was to achieve a simple yet high-performance design that avoided unnecessarily complex solutions that did not give comparable increases in performance. The choices made

[Diagram: left and right processors, each with a cache and a write buffer, connected by the NMI to memory, which holds the software lock. A write interlock forces the write buffer contents to memory. The other processor sees the write on the NMI and looks in its cache for invalidation.]

Figure 6 Multiprocessor Coherency


have yielded a design that fully supports the multiprocessor concept. The VAX 8800 system can translate addresses and access data faster than any previous VAX processor.
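As a recap of the mechanism described under Coherency in a Multiprocessor System, the following sketch models two write-through caches that invalidate their copies when they see the other processor's write on the bus. It is an illustration under stated assumptions, not the C Box design: the class and method names are hypothetical, cache blocks are reduced to single addresses, the write buffer and interlocks are omitted, and updating the writer's own cache only on a write hit is an assumption.

```python
class WriteThroughCache:
    """Toy write-through cache with bus-write invalidation."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for main memory
        self.lines = {}        # address -> value, valid entries only

    def read(self, addr):
        if addr not in self.lines:
            # Cache miss: fetch the data from memory.
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, other_caches):
        if addr in self.lines:
            # Assumption: only an existing copy is updated (write hit).
            self.lines[addr] = value
        self.memory[addr] = value          # write-through to memory
        for cache in other_caches:
            # The other processor sees the write and invalidates its copy.
            cache.lines.pop(addr, None)


memory = {0x100: 1}
left = WriteThroughCache(memory)
right = WriteThroughCache(memory)

right.read(0x100)                  # right processor caches the old value
left.write(0x100, 2, [right])      # left writes; right's copy is invalidated
value_seen_by_right = right.read(0x100)   # miss, then fetch from memory
```

After the write, the right processor's next read misses and fetches the new value from memory, which is the behavior the write-through design relies on.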

Acknowledgments Al l those who worked on the VAX 8800 system contributed the thinking that went into the C Box designto. Special thanks go to Dave Sager for keeping things going.

References

1. VAX Architecture Handbook (Maynard: Digital Equipment Corporation, Order No. EB-26115-46, 1986): 7-11 to 7-19.

2. A. Smith, "Cache Memories," Computing Surveys, vol. 14, no. 3 (September 1982): 473-530.

3. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical Journal (February 1987, this issue): 20-33.

4. T. Fossum, J. McElroy, and W. English, "An Overview of the VAX 8600 System," Digital Technical Journal (August 1985): 8-23.

5. S. Farnham, M. Harvey, and K. Morse, "VMS Multiprocessing on the VAX 8800 System," Digital Technical Journal (February 1987, this issue): 111-119.


Paul J. Natusch
David C. Senerchia
Eugene L. Yu

The Memory System in the VAX 8800 Family

The memory system in the VAX 8800 family can send data at 71MB per second and receive it at 59MB per second. The 8800 and 8700 CPUs can contain up to 128MB of memory, the 8550 and 8500 up to 80MB. Commands, addresses, and data flow between the memory interconnect (NMI bus) and the memory controller, array bus, and array modules. Read, write, and masked-write commands are executed. The designs of the NMI bus and write-through cache affected the memory system design. Although ECL is used in the controller, TTL is used in the array bus. The array modules of 4MB and 16MB contain 256K MOS dynamic RAM chips.

All members of the VAX 8800 family of processors (the 8800, 8700, 8550, and 8500) use the same type of memory system. Since the VAX 8800 system is a multiprocessor, that memory system must connect to both CPUs and both I/O adapters, called the NBIAs. The bus connecting these devices is called the NMI bus, and each connection on the NMI bus is called a nexus. These connections are illustrated in Figure 1, which shows five nexuses: one for each CPU, one for each NBIA, and one for the memory system.

Figure 1 Memory Interconnect Structure

The memory system itself consists of three major parts, as depicted in Figure 2:

• A memory controller based on ECL technology

• A high-speed TTL bus connecting that memory controller to a maximum of eight array modules

• The array modules themselves

The memory system can deliver 71 megabytes (MB) per second of read bandwidth and 59MB per second of write bandwidth.

Since the VAX architecture has a 32-bit format, all datapaths in the memory system must also handle 32 bits. These datapaths are combined by pipelined and parallel operations to produce the read and write bandwidths. The most significant occurrence of parallel operations is two-dimensional interleaving. The first dimension interleaves between longwords (32 bits) of data on a single array module; the second interleaves between octawords (4 longwords) on different array modules. As many as three array modules can be active simultaneously with either a read or a write. There are three cases:

• Each module can do one read.

• One module can do a read while the other two can do as many as four writes.

• Two modules can each do a read while the third can do as many as four writes.

The selection of the array modules can be programmed from the console when the system is powered up. Thus the memory system can support a variety of array module sizes and speeds without the need to modify the hardware in the memory controller. Moreover, the memory controller can address 512MB of physical memory, the limit of the VAX architecture. The 8800 is the first VAX system to be able to address this much physical memory.
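The two-dimensional interleaving described above can be illustrated with a simple address decode. This is a sketch only: the text does not give the actual bit assignment, and module selection is programmable from the console, so the mapping below (consecutive octawords rotating across modules, consecutive longwords rotating across four positions within a module) and the names `decode` and `num_modules` are assumptions made for illustration.

```python
LONGWORD_BYTES = 4     # 32 bits
OCTAWORD_BYTES = 16    # 4 longwords


def decode(addr, num_modules=8):
    """Map a byte address to (array module, longword position).

    Assumed mapping: octawords interleave across modules (second
    dimension) and longwords interleave within a module (first dimension).
    """
    module = (addr // OCTAWORD_BYTES) % num_modules
    longword = (addr // LONGWORD_BYTES) % 4
    return module, longword


# Consecutive longwords stay on one module; consecutive octawords do not.
first_octaword = [decode(a) for a in range(0, 16, 4)]
next_octaword = decode(16)
```

Under this assumed mapping, the four longwords of one octaword land on the same module in different positions, while the following octaword lands on the next module, so accesses to neighboring octawords can proceed in parallel.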


[Block diagram: NMI, memory controller, and array modules (through module 8), with command bus and clock signals.]

Figure 2 Plan of Memory System

Owing to the limits of the existing technology, however, the initial machine was introduced with 32MB for the 8800 and 8700 systems, and 20MB for the 8500 and 8550 systems. The 32MB configuration consists of eight 4MB modules with 256K MOS dynamic RAMs packaged in DIPs. To increase the density of the machine without using a different semiconductor technology, a 2MB daughter module was developed after the initial announcement. This module uses double-sided surface-mount technology and plastic leadless chip carriers. Eight of these daughter modules are mounted on a mother module to produce a 16MB array module. This new module has increased the machine's memory to 128MB for the 8800 and 8700 systems, and to 80MB for the 8550 and 8500 systems.

Memory System Architecture

As shown in Figures 1 and 2, the memory controller communicates with the CPUs and the NBIAs over the memory interconnect, called the NMI bus. Commands, addresses, and data requests are all first received by the NMI interface and then passed to other sections of the memory controller. Addresses and data are stored in custom multiport RAMs, where eight locations are reserved for addresses and eight for data. The NMI interface encodes command information, passing it to the command-control portion of the memory controller.

Since the memory controller communicates with the NMI bus and the array bus, the NMI protocol has to be changed to that of the array bus. Reads and writes of data fields with various sizes are received by the NMI interface. The NMI bus supports a very robust set of commands. Reads and interlocked reads are supported for longwords (4 bytes), octawords (4 longwords), and hexwords (2 octawords). Masked writes and masked-write unlocks are supported for longwords, quadwords (8 bytes), and octawords. Writes are supported for longwords and octawords.

The read-interlocked and masked-write unlock commands are used to implement VAX instructions in which mutual exclusion is required. For example, the VAX instructions ADAWI, BBCCI, BBSSI, INSQHI, INSQTI, INSQUE, REMQHI, and REMQTI all need these commands. Since an interlocked instruction locks the entire memory system, the interlock bit must reside in the memory controller. This bit restricts the execution of subsequent interlock commands until the lock has been released by a masked-write unlock instruction.

After receiving a memory request from a nexus, the memory controller must transfer that request to the appropriate array module. This transfer is accomplished using the array bus. This bus consists of

• A unidirectional set of command and address lines from the memory controller to the array modules

• Another unidirectional set of data lines from the memory controller to the array modules


• A set of data lines (capable of assuming three states) that can be driven by any one of the array modules and received by the memory controller

• Various status and control lines that communicate in both directions

The array bus has a minimal repertoire of commands, consisting of longword reads, octaword reads, and longword writes, but not hexword reads. Since the NMI supports hexword reads, the memory controller must convert them into two octaword reads and then send them to the array modules. Thus the two octawords of a hexword read can reside on different array modules. That fact increases the memory bandwidth because parallel accesses can be executed. The array bus supports only longword writes; therefore, octaword writes must also be converted. As mentioned earlier, the array bus has one line for commands and addresses and another for data. Therefore, an octaword write, which takes five cycles to transfer on the NMI (one for the command, four for the data), can be transmitted in five cycles on the array bus to an array module. Figure 3 shows the corresponding actions during each cycle on the NMI and on the array bus.

In addition to commands, the memory system must also execute maintenance tasks, including memory refresh, error reporting, and battery backup.

Since physical memory is implemented with MOS dynamic RAMs, every array row must be refreshed once every 4 milliseconds. This function can be done by refreshing one row every 14 microseconds. To facilitate this activity, the memory controller sends signals to each array module from a 14-microsecond oscillator. Upon receiving a refresh signal, an array module will handle the refresh arbitration and execute the operation.

Occasionally, a bit will be lost due to either alpha particles or a device failure. In that case the memory controller must handle those errors and other types in a graceful manner. To do that, the memory system uses a 7-bit modified Hamming code to generate the ECC, which allows all single-bit errors to be corrected and all double-bit errors to be detected. After correcting each error the memory system logs the error's physical page address and the bit. The memory system then interrupts the CPU to call an error service routine, which logs in a VMS file the necessary information to isolate the failure. The memory system can also interrupt the CPU to handle internal parity errors and interlocked time-outs. An interlocked time-out happens when a nexus executes a read interlock but never issues a masked-write unlock. The system software can enable or disable these interrupts.

Battery backup, standard equipment on both the 8800 and 8700 systems, can power the refresh operation when the system is down. That power allows the memory system to continue to refresh the RAMs so that data will not be lost. Note that the entire system is not backed up;

[Cycle diagram, with cycles numbered across the top. NMI: command or address, followed by four data cycles. Array bus command/address line: four command-or-address cycles. Array bus data line: four data cycles.]

Figure 3 Cycles on NMI Bus and Array Bus


[Block diagram: NMI, multiport RAM, ECC generation logic, error correction logic, and bus enable in the memory controller and array module datapaths.]

Figure 4 Datapaths in Memory Controller and Array Modules

therefore, all components must be in quiescent states before the memory system enters battery mode. Upon sensing that power is eroding, the 8800 will write all its data to the memory system. The memory controller will then complete all commands and send signals to the array modules informing them to enter battery mode. In this mode only five MSI chips on the memory controller and approximately half the control logic on the array module will be active.

Command Execution

The execution of any command received by the memory system is a joint effort between the memory controller and the array modules. Figure 4 depicts the datapath in each memory component. After a nexus places a command on the NMI bus, the interface in the memory controller ascertains if the command is a valid memory reference and, if so, decodes it. The interface then places the command in a queue of commands waiting to be executed.

Since one array module can execute multiple write commands simultaneously, and since multiple array modules can also execute commands, the memory controller must maintain the status of the array modules. The status control logic to monitor activity must "remember" which portions of which arrays are "busy." This status control logic can best be described by showing how the three basic operations, writes, reads, and masked writes, are executed.

Write Commands

For a write command, the control portion of the memory controller performs only three actions: it determines the capability of the array module to accept the command, it sends the command, and it waits for the array module to signal its readiness to receive another command.

The write datapath is that portion of the logic responsible for the flow of data from the NMI bus to the array modules. This path comprises both electrical interconnects (buses and cables) and a considerable amount of logic. The major storage element for the datapath is a 9-bit by 32-location custom multiport RAM (MPR) with two ports for reads and two for writes. Data received from the NMI bus is placed in the next available location of the MPR. Upon determining that the required array module is available, the control logic sends the data from the MPR to that array module over the array bus. Each array module holds the data until it is strobed into the dynamic RAMs


(DRAMs). The array module can load four longwords of data with their associated ECC bits on four consecutive cycles.

Some writes are called masked because there is a 4-bit byte mask associated with each data word. The byte mask informs the memory system as to which bytes are to be written. The memory system executes this command by first doing a read and correcting any single-bit errors that may exist. It then merges the memory data with the data received from the NMI bus, and finally does a write command. This sequence easily allows the implementation of longword and octaword masked writes. Masked writes for quadwords (8 bytes) are executed by performing an octaword masked write in which the data of two of the longwords remains unchanged.

Read Commands

For read commands, the memory controller performs four actions: it determines if the selected array module is ready to accept the read, it sends the command, it waits for a data-ready response, and it transfers the data from the array module. Embedded in the command field of the read are address bits that select the longword of the octaword that is required first. This action allows wrapped reads to be implemented. (Wrapped reads are described later in the section "Impact of the Cache.")

The read datapath originates at the DRAM, which sends the requested data. As in the case of write commands, each array module stores an octaword of read data. Once the data has been loaded into the latches, the array module signals to the memory controller that the data is ready. As mentioned earlier, the read datapath between the array module and the memory controller is tristatable. Therefore, the memory controller must ensure that only one array module at a time drives this datapath. Once the data has been requested by the memory controller, the array module must send the longwords sequentially, beginning with the starting address that was sent with the command. This action allows the memory controller to request any one of the four longwords as the first to be read. The array-module portion of the read datapath can transfer one longword of data during every cycle.

The error-correction logic in the memory controller receives each longword of data plus the seven ECC bits. This logic detects single- and double-bit errors, but only single-bit errors can be corrected. A significant feature of this process is that error detection and correction are performed as the read data is pipelined through the memory controller. Thus no additional cycles are needed to correct read data.

Masked-write Commands

The execution of a masked write involves both a read and a write sequence. The memory controller executes a masked-write command by first issuing a read to the selected array module. Assuming that there were no memory errors, the data returned is sent to the MPR, where the bytes are merged with those sent to the memory controller over the NMI bus. The memory controller must ensure that no commands to the same array come between the read and write portions of a masked write. After all the bytes have been merged, the memory controller will write the data to the array module. The array module then generates new ECC data, adds it to the other data, and strobes the composite data into the DRAMs.

If a single-bit error is detected, the process is quite similar to the one with no errors, except that the data must be corrected. Since corrected data and NMI traffic both share the same datapath on the memory controller, the NMI interface must be free to correct errors found during masked writes. This freedom is ensured by asserting a signal that stops all activity on the NMI bus. Once activity has stopped, the data can be routed through the NMI interface, corrected, and then merged with the NMI data in the data buffer. The process then continues as it would have if there were no errors.

If a double-bit error is detected, the process is similar to the case in which no error occurred, except that the write is prevented from happening. When the array location is read the second time, the double-bit error will still be present, thus alerting the system that the data is unusable.

Memory Address Path

The memory controller continuously latches all addresses from the NMI bus. Once an address is latched, the memory controller must verify it as a valid memory address. That verification is done by comparing the address to valid addresses of both the control status registers (CSRs) and physical memory.

The CSR addresses are hardwired into the NMI interface logic; therefore, only a simple compare of the addresses is required. The compare for a valid memory address requires a reference to a "decode" RAM. This RAM is loaded by console software when the system is powered up and is used to configure memory. Loading the RAM from software allows the memory controller to support several different sizes of array modules without modifying any hardware.

Once the address has been verified as being valid, it is placed in one of eight storage locations allocated to address buffering in the MPR. The address remains in that buffer until its command is sent to an array module.

Even though eight locations are allocated to address buffering, only seven of them can be used for temporary storage. One location is reserved for the error page address, a pointer to a physical page of memory containing an error. Since the location of the error page-address buffer is not fixed, the address-buffer control logic must look ahead and not allow a new address to overwrite that error page address.

The control of the address buffer is further complicated by masked writes and error logging. Since a masked write is implemented as a read followed by a write, the address in the buffer cannot be overwritten until the write has completed. A similar situation exists for error logging on read transactions. Since an error is not detected until the read has completed, the address cannot be overwritten until the data has been checked.

Design Requirements of the VAX 8800 System

Impact of the NMI Bus

As stated earlier, the VAX 8800 memory system interfaces with the CPUs and I/O systems through a synchronous bus called the NMI bus. This bus is highly efficient and operates in a pended fashion similar to the synchronous backplane interconnect (SBI bus) in the VAX-11/780 processor. The NMI bus allows several transfers to be in progress simultaneously.

There are four nexuses in the 8800 system that can require memory: the two CPUs and the two NBIAs. Each nexus is allowed to have two commands outstanding at any time. The protocol supports this arrangement by allocating two codes in a 4-bit field to each nexus.

The CPUs use one of their references for program data, called the d-stream, and the other for instructions, called the i-stream. The CPUs always request a hexword of data; the NBIAs may request either longwords or octawords. Thus there can be as many as eight simultaneous requesters of memory data. These simultaneous events require that the memory system buffer several commands while executing. In the 8800 implementation, the memory system can access three array modules in parallel and store two commands.

Moreover, since the memory system can accept multiple read commands, it must store the identification of the requester and the length of the transaction. The NMI interface does the actual storing and returns the identification with the correct data. This action is possible because all commands are processed in sequence; therefore, the read returned first is the one stored the longest. However, hexword reads are returned to the NMI interface as two separate octaword reads; therefore, that interface must ensure that both octawords have been returned before discarding the identification.

To prevent a deadlock condition, the memory system is given the highest priority during arbitration. This priority guarantees that the memory system will be able to return data to a requester. When full, the memory system notifies any potential requesters that it cannot process any more commands and that they should try again later, thus preventing the memory system from overfilling.

Impact of the Cache

The design of the cache affected the design of the memory system. The write-through design of the cache guarantees there will be a large number of longword writes directed at memory.¹ A write buffer was installed to bundle a series of longword writes into octaword writes; however, the write buffer is only effective if multiple longwords are written in the same octaword.

Extra logic is always required to increase performance. The extra write bandwidth for this memory system, however, required more logic than would have been required to implement extra read bandwidth. The added complexity was needed to facilitate interleaving on longword boundaries for write operations.

When the 8800 project was first initiated, the goal of the memory system was to maximize read bandwidth, thus producing a relatively simple array-module design. In that design, any operation, regardless of its size, kept an entire array module busy until the operation completed. The control logic on the array module was simple and required a reasonable amount of board space and power. When the design changed to the write-through concept, however, higher write bandwidth was required. Therefore, the control logic in each array module had to be replicated for each bank (longword) of memory to allow independent write operations. This replication permitted four longwords to be written on four consecutive cycles to the same array module.

This increase in design complexity was not limited to the array module. In the initial design, when maximum read bandwidth was critical, the memory control logic was relatively simple. It had only to track the state of an array module as being busy or not. However, with the interleaving capability required for the increased write bandwidth, the memory control logic now has to track simultaneously the status of as many as eight write operations in progress on two array modules.

Although maximizing the longword write bandwidth was important, minimizing the read latency to the first longword required was critical. Wrapped reads were implemented to reduce this latency. A wrapped read is a hexword or octaword command that requests that a specific longword be returned first, with the other longwords in that block to follow in "wrapped" fashion.

Other Design Trade-offs and Options

As in all design processes, we considered many trade-offs and options before committing to a particular design architecture. One area with several alternatives was the interconnect between the memory controller and the array modules. The array modules and the controller reside in physically separate backplanes interconnected by a cable. We had to decide whether to make this interconnect with ECL or TTL.

The overall project goal was to make the 8800 an all-ECL machine. Therefore, our first choice for this interconnect was ECL, which provides enhanced signal integrity, reduced skews, and overall speed advantages over TTL. As the system and memory design progressed, however, some real problems arose that altered our opinion. The first problem became apparent as the array-module design coalesced enough to allow some accurate power estimates to be made. We found that, with an ECL bus, the array module would require -5.2 V power in excess of its allocation. The next problem surfaced in response to an architectural requirement that the memory system function with fewer than eight array modules and, preferably, without load cards. This requirement made it difficult to implement a termination scheme for an ECL interconnect.

With these problems in mind, we investigated a TTL interconnect, which clearly offered some design challenges, not the least of which were speed and skew. Using the SPICE simulator, we constructed an accurate model to verify that a TTL electrical interconnect could indeed meet our signal integrity, speed, and skew requirements.² While the simulation results showed that a TTL interconnect could work, the associated skews certainly increased the complexity of the memory design. While alleviating the problems of limited -5.2 V power on the array module and the termination of varied loading, this TTL scheme required ECL-to-TTL translators in the memory controller to drive the array bus.

We finally decided to accept the added complexity and use TTL for the interconnect. The sole exception was the clocks, which were differential ECL, received and translated on the array module.

There were logical trade-offs as well as electrical ones. The original specification for the NMI did not support quadword masked writes. They were added after the implementation of the memory system had progressed considerably. Since the array bus supported only longword and octaword reads, there were three options to support this change:

• The first was to change the array bus protocol, the command generator on the memory controller, and the array module.

• The second was to execute the command by performing two longword masked writes. This option would take almost twice as long as a quadword masked write implemented like the first option, yet would still require changes to the command generator in the memory controller.

• The third was to execute an octaword masked write in which the data of two of the longwords remains unchanged.
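The third option above, together with the read-merge-write sequence described under "Masked-write Commands," can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware's actual logic: memory is modeled as a Python list of 32-bit longwords, bit i of each 4-bit mask is assumed to select byte i (least significant byte first), the `half` parameter and all function names are hypothetical, and ECC generation and error handling are left out.

```python
LONGWORD_BYTES = 4        # a longword is 4 bytes
OCTAWORD_LONGWORDS = 4    # an octaword is 4 longwords (16 bytes)


def merge_longword(old, new, mask):
    """Merge one longword: bit i of the 4-bit mask selects byte i of `new`.

    The bit-to-byte assignment is an assumption made for this sketch.
    """
    result = old
    for byte in range(LONGWORD_BYTES):
        if mask & (1 << byte):
            shift = 8 * byte
            result = (result & ~(0xFF << shift)) | (new & (0xFF << shift))
    return result & 0xFFFFFFFF


def octaword_masked_write(memory, base, data, masks):
    """Read four longwords, merge the masked bytes, and write them back."""
    end = base + OCTAWORD_LONGWORDS
    old = memory[base:end]                                   # read portion
    merged = [merge_longword(o, n, m) for o, n, m in zip(old, data, masks)]
    memory[base:end] = merged                                # write portion


def quadword_masked_write(memory, base, half, data, masks):
    """Express a quadword masked write as an octaword masked write.

    `half` (0 or 1, hypothetical) picks which two longwords of the octaword
    are addressed. The other two get an all-zero mask, so the merge leaves
    them unchanged.
    """
    full_data = [0] * OCTAWORD_LONGWORDS
    full_masks = [0] * OCTAWORD_LONGWORDS
    for i in range(2):
        full_data[2 * half + i] = data[i]
        full_masks[2 * half + i] = masks[i]
    octaword_masked_write(memory, base, full_data, full_masks)
```

For example, starting from the four longwords 0x11111111, 0x22222222, 0x33333333, and 0x44444444, a quadword masked write to the upper half with data 0xAABBCCDD and 0x55667788 and masks 0b0011 and 0b1111 leaves the first two longwords untouched and produces 0x3333CCDD and 0x55667788 in the last two.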


Since the design was well advanced, we chose the last method to ease the problems of implementation; this decision actually has little impact on system performance. The logic to accomplish this addition already existed on the array module. Only small changes were required to the command generator of the memory controller and the datapath control. In practice, the frequency of quadword masked writes is extremely low since they are executed only by the NBIAs.

Technology Description

A number of different module and component technologies were used for the memory controller, backplane, and two array modules.

Memory Controller

The memory controller is a 9-layer, controlled-impedance, extended hex module (15 inches by 11 inches). The lay-up consists of 6 routing layers, 2 power layers (-5.2 V and -2 V), and a ground plane. Since there is a minimal amount of TTL, both the +5 V power and the +5 V battery are run on the surface with 50-mil etch. With the mixed technology on the module, we took special care to keep the TTL signals properly spaced from the ECL signals to avoid signal integrity problems.

The logic on this module is implemented using nine unique macrocell-array designs from Motorola, Inc., and one custom ECL multiported RAM. There are 16 custom and semicustom devices on the module. It also contains some 10KH MSI logic, some ECL-to-TTL converters, and some CMOS logic used for operating with battery backup.

Array Module Backplane

The array module backplane in the VAX 8800 and 8700 CPUs is a 12-layer, 8-slot pressed-pin backplane. The one in the VAX 8550 and 8500 CPUs is a 5-slot backplane. Since a TTL bus was chosen to communicate between the memory controller and the array modules, a good termination strategy had to be developed. Using the SPICE simulator, we evolved the termination strategies shown in Figure 5.

[Figure 5 schematic. Recoverable labels: memory controller with 8480 ECL-to-TTL and 8481 TTL-to-ECL translators; array modules with F374 latches; NAB command/address-write data bus (to 8 modules); NAB read data bus; +5 volt termination. Legend: DI, data in; DO, data out; CS, chip select; HLD, hold (clock); EN, enable; CLK, clock.]

Figure 5 Termination Strategies in Memory Controller and Array Modules


Figure 6 Sixteen Megabyte Array Module

Four Megabyte Array Module

The 4MB array module was designed using an 8-layer, controlled-impedance, printed circuit board. The lay-up consists of 4 routing layers, 2 power layers, and 2 ground layers. To support battery backup, the module has separate power planes for +5 V power and the +5 V battery. Since only a limited amount of -5.2 V and -2 V power is needed, these voltages share space on the other power planes. To eliminate discontinuities that could cause unwanted reflections, we ensured that signals did not cross the power-plane splits by surrounding the power planes with solid ground planes.

Approximately half of the logic technology on the array module consists of MOS dynamic RAMs; the other half is FAST MSI logic. The clock system is implemented in ECL to minimize the skew.

Sixteen Megabyte Array Module

A 16MB array module was developed to increase the available memory to 128MB for the 8800 and 8700 systems and 80MB for the 8550 and 8500 systems. This array module consists of an 8-layer mother board (similar to the 4MB module) and eight 2MB surface-mounted daughter boards. The 16MB array module is pictured in Figure 6.

Summary

The VAX 8800 memory system was designed to provide 71 MB per second of read bandwidth and 59 MB per second of write bandwidth to the multiprocessor system. The system architecture, processor performance needs, and high I/O activity combined to make a high-performance memory a requirement.

Since the 8800 contains ECL components, the memory system has to provide a high-speed path between the ECL logic in the CPUs and the high-density dynamic RAMs used for main storage. Although the memory system does not play a direct role in the execution of a VAX instruction, its performance has to match closely that of the multiprocessor system. If the memory system were underdesigned, the processors would stall frequently, thus reducing their usable performance. If the memory system were overdesigned, it would contain extra complexity, with the attendant extra cost, that could not be used by the system. Thus the memory strategy played an important role in the price/performance trade-offs that had to be made.

Acknowledgments

Although done by a small group of engineers, the design of the memory system was greatly influenced by the efforts of many people from the Electronic Storage Development Group and the Advanced VAX Engineering Group. We would especially like to acknowledge the creativity, leadership, and energy level of the late John Henry, Jr.

References

1. J. Fu, J. Keller, and K. Haduch, "Aspects of the VAX 8800 C Box Design," Digital Technical Journal (February 1987, this issue): 41-51.

2. SPICE was developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Science, University of California, Berkeley.

John H.P. Zurawski
Kathleen L. Pratt
Tracey L. Jones

Floating Point in the VAX 8800 Family

The processors in the VAX 8800 family were designed with particular emphasis on cost-effectiveness. These CPUs do not contain separate floating point accelerators. Their performance is not compromised, however, especially for the double-precision instructions. High performance is achieved, in part, by a custom ECL multiplier and divider unit and by specific hardware for exponent manipulation and normalization. The main advantages of this integrated approach are less hardware to replicate and a tightly coupled interface to each CPU, so that less time is wasted fetching the operands. Microcode branch problems are minimized by using a prediction strategy and extensive hardware assistance.

Unlike other VAX families, the processors in the VAX 8800 family do not contain separate floating point accelerators (FPAs). Instead, their FPA is integrated into each processor's main datapath. Therefore, no distinction is made between instructions that are executed in the FPA and those that are not: the hardware is available to be used for all functions. For example, the extended ALU (XALU) is also used as a counter for the move character instruction (MOVC). This usage differs from that in the VAX 8600 and VAX-11/780 systems, where the XALU is used only for floating point instructions. Furthermore, all the floating point instructions, from the most complicated (POLY and EMOD) to the simplest (MOVF), have access to the FPA hardware.

There are a number of advantages to this approach. First, logic is not duplicated; only one arithmetic logic unit (ALU) and one shifter unit are shared between the floating point and the normal arithmetic. Second, the design is tightly integrated with the rest of the computer; there is no overhead involved in starting the floating point computation.

Clearly, since all other VAX families use FPAs, there are also disadvantages with our approach. Shared logic is more complex than specialized logic. Performance may also suffer since the design cannot be optimized toward one class of problem. Those disadvantages can be overcome, however, as we shall relate in this paper. The problem of optimization was ameliorated by providing dedicated hardware for the main operations of multiplication and addition. A custom multiplier and divider chip is provided together with exponent manipulation logic and a shifter unit optimized for floating point. These logic elements handle those floating point operations that take the longest times to execute.

The floating point logic resides in the execution unit, the E Box, of the VAX 8800 CPU. That logic is controlled by microcode in the instruction unit, the I Box.¹

VAX Formats and Instructions

The VAX architecture supports four floating point formats: F, D, G, and H. These formats are discussed at length in references 2 and 3. The F format is 32 bits wide, the D and G formats are both 64 bits wide, and the H format is 128 bits wide. Although the D and G formats have the same width, the exponent field is larger in the G format, and its fractional field is commensurately smaller. This format allows a larger range but with slightly lower precision. The fractions are always normalized, and the leading bit (the hidden bit) is not stored.

E Box Operation

Physically, floating point operations are performed on three modules: two slice modules and a shifter module. The slice modules contain the cache, the main ALU, and a register file. The shifter module contains the custom multiplier, the shifter unit, the exponent manipulation logic (the two ALUs), and the priority encoder. Figure 1 shows this partitioning. To a large extent, the shifter module strongly resembles an FPA but without the ALU and register file.

The source operands are fetched from either the 64 kilobyte (KB) cache or a general-purpose register (GPR). The operands are sent on the A and B ports to the ALU on the slice modules and to the shifter module. All the components on the shifter module are driven in parallel by the A and B ports.

From Figure 1 it is clear that the datapath is highly parallel; the shifter, XALU, multiplier, and ALU can all operate simultaneously. This parallelism is used extensively to gain performance and to save cost. For example, in multiplication operations, the XALU determines the exponent of the result, the multiplier multiplies, and the shifter absorbs the low-order bytes of the product that are discarded each cycle by the multiplier.

The main problem with designing an integrated FPA is that the VAX formats for integer and floating point numbers must all be handled by the same shared units. Figure 2 shows the different bit orderings for two VAX formats, the F floating point and the integer. In the integer format, the bit ordering is from right to left. In the F format, the mantissa begins at bit 16 and increases in significance to bit 31, then continues from bits 0 through 6. The remaining bit positions are used to hold the exponent and the sign.

This requirement for shared handling complicates the carry path of the ALU. The carries out of the 16-bit word boundaries have to be switched into the appropriate places, as shown in Figure 3. The problem with shifting is similar to the carry problem, except that now the carry path of Figure 3 represents the flow of the shifted bits.

[Figure 1 block diagram. Recoverable labels: shifter module; slice modules; bypass bus <31:0>; shift count bus <5:0>; A port; B port; cache data; register file.]

Figure 1 Block Diagram of the E Box


F format, by bit position:

  bits 31-16   mantissa (least significant part; least significant bit at bit 16)
  bit  15      S
  bits 14-7    exponent
  bits 6-0     mantissa (most significant part; most significant bit at bit 6)

Integer format, by bit position:

  bits 31-0    most significant bit at bit 31, least significant bit at bit 0

S = sign bit

Figure 2 Two VAX Formats
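To make the F format bit ordering of Figure 2 concrete, the following sketch decodes a 32-bit F format longword in software. The field positions follow the figure; the excess-128 exponent bias, the 0.1f normalization, and the treatment of a zero exponent come from the VAX architecture definition rather than from this paper, and the function name is hypothetical. This is a model for illustration only, not a description of how the E Box hardware operates.

```python
def decode_f_floating(longword):
    """Decode a 32-bit VAX F format longword into a Python float.

    Field positions follow Figure 2. The exponent bias (excess 128) and the
    fraction range [0.5, 1) are taken from the VAX architecture definition.
    """
    sign = (longword >> 15) & 0x1
    exponent = (longword >> 7) & 0xFF        # bits 14-7
    frac_high = longword & 0x7F              # bits 6-0, most significant part
    frac_low = (longword >> 16) & 0xFFFF     # bits 31-16, least significant part

    if exponent == 0:
        if sign:
            # Sign set with a zero exponent is a reserved operand.
            raise ValueError("reserved operand")
        return 0.0

    # Reassemble the 24-bit fraction in significance order and restore the
    # hidden bit, which is always 1 and is not stored.
    mantissa = (1 << 23) | (frac_high << 16) | frac_low
    value = mantissa / (1 << 24) * 2.0 ** (exponent - 128)
    return -value if sign else value
```

For instance, the longword 0x00004080 decodes to 1.0: the exponent field holds 129, the stored fraction is zero, and the hidden bit supplies 0.5, giving 0.5 times 2 to the power 1. Setting bit 31 (0x80004080) changes only the low-order part of the fraction, which shows that bits 31-16 carry less weight than bits 6-0.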

The ALU and the shifter unit are both designed to handle all integer and floating point formats. The multiplier expects operands to come only in a floating point format. Therefore, for integer multiplications, the data must first be converted into a pseudo-floating point format by swapping the places of 16-bit words within the integer format. This operation is performed by the shifter unit.

Table 1 gives the execution times for the most common floating point instructions. These times include the overhead for fetching the operands.

The VAX 8800 processor is designed so that there is little, if any, difference in performance between register and memory operands. The execution times vary from 2.25 to over 5 times the performance of the VAX-11/780 CPU with an FPA for the F and D formats. For multiplies, one 8800 CPU is 2.5 times faster in F format and 4.8 times faster in D format; divides are 3.0 times faster. The gain is even more substantial for the G and H formats since they are not accelerated on the 11/780.

[Figure 3 diagram. Recoverable labels: D format (most significant part), bit positions 31, 16, 15, 7, 6, 0, with fields mantissa, S, exponent, mantissa, and the most significant bit marked; D format (least significant part), with fields mantissa, mantissa, the least significant bit marked, and carry in. S = sign bit.]

Figure 3 Floating Point Carry for D Format
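The carry routing that Figure 3 depicts can be modeled in software as follows. This is a rough sketch rather than the ALU's actual implementation: it assumes the standard VAX D format layout, in which bits 6-0 hold the most significant fraction bits and the 16-bit words at bits 31-16, 47-32, and 63-48 follow in decreasing significance. The function name is hypothetical, and the sign, exponent, and hidden bit are ignored so that only the carry path is shown.

```python
def add_d_fractions(a, b):
    """Add the stored fraction bits of two 64-bit D format quadwords in place.

    Returns (sum, carry_out), where `sum` keeps the D format layout and
    `carry_out` is the carry out of the most significant stored fraction bit.
    """
    carry = 0
    result = 0
    # Carries ripple from the least significant word (bits 63-48) toward the
    # more significant words. This is the reverse of the integer carry
    # direction, which is why the carries at the 16-bit boundaries must be
    # switched.
    for word in (3, 2, 1):
        shift = 16 * word
        total = ((a >> shift) & 0xFFFF) + ((b >> shift) & 0xFFFF) + carry
        result |= (total & 0xFFFF) << shift
        carry = total >> 16
    # The last step adds the 7 most significant fraction bits (bits 6-0).
    total = (a & 0x7F) + (b & 0x7F) + carry
    result |= total & 0x7F
    return result, total >> 7
```

Adding 0xFFFF in bits 63-48 to a 1 at bit 48 overflows that word and sets bit 32, whereas an ordinary 64-bit integer add of the same values would carry out of bit 63 instead.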


Table 1 Execution Times

Execution Time (Nanoseconds), Register to Register

Instruction      F       D       G       H
ADD            315     495     540    3314
MUL            450     675     842    6306
DIV           1607    3197    3107   21649

In the 8800 the D format is slightly faster than the G format with its longer opcode, which requires an extra cycle in the decoder. The single-precision F format executes the fastest, and the larger 128-bit H format executes the slowest. However, the H format is intended as a backup for intermediate calculations in the D and G formats. Used thus, the H format ensures that the final calculation result has sufficient precision and avoids overflow or underflow problems. Little hardware assistance is provided for the H format; it is driven mostly by microcode.

Technology

Component technology used in the VAX 8800 processor is an enhanced version of the macrocell array (MCA) used in the VAX 8600 CPU.² This technology provides about 1,200 gate equivalents with a typical gate speed of 1 nanosecond (ns). MCAs utilize emitter-coupled logic (ECL) in a 72-pin package that is 1 square inch with a maximum power dissipation of 5.5 watts. The GPR and the multiplier are made with custom technology, which uses the same package as the MCA but contains a more advanced process. Around 1,800 gate equivalents are provided, and the gate speed is 50 percent faster than the MCA. This higher performance is achieved by using the following features:

• Smaller transistors and metal-oxide-walled resistors

• Current mode logic instead of the slower ECL

• Four-level logic instead of the two-level logic of the MCA

At 300 by 260 mils, the size of the custom chip is larger than the dimensions of 221 by 252 mils for the MCA.

The shifter module contains 21 MCAs and 8 custom multiplier parts. Some 10KH parts are used for clock distribution and for driving the bidirectional bypass bus.

Arithmetic Algorithm Processing

Addition and Subtraction

For an addition operation, the 32-bit words containing the exponents are sent to the main ALU. There they are passed to the A and B ports, which feed the shifter module. These ports drive all the gate arrays in parallel.

The exponents are then loaded into the XALU and the shift-amount ALU (SALU), which computes the alignment shift amount sent to the shifter. The SALU also generates some 20 branch conditions for the microcode. These conditions indicate the size of the alignment shift and whether any source operand is zero or a reserved operand. They also help to optimize the microcode flow.

The XALU, which selects the larger exponent and saves it for later use, has a 12-bit datapath and a register to hold the exponent. The size of this datapath is sufficient for the F, D, and G formats plus a guard bit for overflow or underflow detection. An ALU is provided to perform arithmetic operations on the exponent. The SALU, with an 11-bit datapath, subtracts the exponents to determine the alignment shift amount, which is always positive. The sign manipulation logic also resides in the SALU.

Next, the fractional part of the smaller operand is aligned by the shifter. This operation involves either one CPU cycle for F format operands or two CPU cycles for the D and G formats. The shifter unit shifts in the floating point format and can do a full 64-bit shift. The logic that determines the round bits is related to the alignment shift operation but is physically located in the priority encoder gate array. This gate array also contains some of the shifter functionality.

Nine gate arrays are used for the shifter unit. Of those, eight make up the datapath; the ninth is the control device. The shifter can accept either a 64-bit operand on the A and B ports or a 32-bit operand on either port. The shifter generates a 32-bit result that can be either the high-order or the low-order part of the answer. The shifter datapath gate arrays are identical: each effectively constitutes a byte slice of the design and performs a bit shift of up to seven places. Byte shifting is then performed by sending the shifter output to the correct byte position. This operation is facilitated by having all the outputs wired to the OR gates at all possible byte positions and by enabling the correct output.

The shifter performs floating point, integer, and logical shifts as well as a number of miscellaneous functions. These include converts from decimal-format data into integer format and vice versa. The masking of the exponent field and the insertion of the hidden bit are also done by the shifter.

After the alignment shift, the output of the shifter is directed to the main ALU on the bypass bus. There, the output is added to or subtracted from the fraction of the larger operand. The output of the ALU operation is now ready to be normalized in the shifter. In most cases a small normalize shift of at most one bit position left or right will be sufficient. The specialized hardware in the shifter handles this case and then rounds the result. Should a larger shift be required, then microcode will first direct the ALU result to the priority encoder gate array. There, the position of the leading 1 is found and used to determine the normalize amount for the subsequent cycle.

The rounding operation in the VAX 8800 CPU is unusual in that it is limited to the low-order eight bits. Therefore, a small 8-bit adder can be used for this operation. This adder is both faster and cheaper than the usual method of using a full 64-bit adder. The 8-bit adder is also sufficient to calculate the correct answer in over 99.5 percent of the addition operations. Should a carry-out be generated by this 8-bit rounding add, then clearly the result created is incorrect. In that case the computer is trapped and microcode is invoked to correct the result.

Multiplication

As mentioned earlier, the 8800 contains a high-performance, custom-designed multiplier and divider unit. A number of factors impelled us to use such a unit. First, multiplication is a very frequent operation that is used extensively in matrix manipulation. For example, in the LINPACK benchmark, the time-critical routine contains an even mix of addition and multiplication operations.

Second, it was not possible to succumb to the temptation of using the main ALU to provide the division operation. This desire was natural since division is an infrequent operation, and the use of an ALU in a repeated subtract and shift mode was appealing. For example, the VAX 8600 uses the ALU for just that purpose. In the 8800 the main ALU also computes the virtual address. Since this datapath is very time-critical (in the 8800 as well as in most other computer designs), it cannot be allowed to go any slower. Including an extra path to accommodate division would have slowed down this critical path by around 5 ns, resulting in a 10 percent performance degradation for all operations.

Moreover, the available space for the multiplier and divider unit was limited since floating point operations are integrated with the rest of the machine. Approximately one-third of a module (12 inches by 16 inches) was available. In contrast, the VAX 8650 CPU dedicates a full module to multiplication.

The custom design of the multiplier and divider unit is basically a byte slice of a large word-sized multiplier and divider unit. The multiplier handles 8 bits per cycle; the divider handles 1 bit. Figure 4 shows the complete 56-bit by 8-bit multiplier with its eight byte-slice custom chips. Eight chips are used to form the required word size of 64 bits (56 data bits plus 8 guard bits). This arrangement is sufficient to handle F, D, and G format operations. H format operations are performed by partitioning the problem into many smaller 56-bit multiplications under microcode control.

The multiplicand is loaded into the MD latch after passing through the masking logic, which clears the sign and the exponent field and inserts the hidden bit. The PR latch and the PRGB are cleared at the start of the multiply. The PRGB contains the guard bits for the PR latch. At the end of a multiply, this latch will hold the bits required for a possible normalization shift and also for a rounding operation. The least significant eight bits of the multiplier are loaded into the multiplier latch. The first multiply cycle is now ready to be performed.

A 56-bit by 8-bit multiplication is performed between the contents of the MD and multiplier latches. The result is then added to the contents of the PR latch (which is initially zero) and then written back into it with a right shift of 8 bits. The PR latch is thus an accumulating latch and


contains the 64-bit partial product of each multiplication operation. The next 8 bits of the multiplier are loaded into the multiplier latch, ready for the next cycle. This cycling continues until the multiplicand has been multiplied by all the multiplier bytes. This algorithm is similar to the one used in the VAX 8650 scheme, except that that processor has a narrower datapath of 32 bits.

[Figure 4 Multiplier and Divider Unit: block diagram showing the multiplicand input, the multiplier input, the Booth recode, the 8-bit shift, the PRGB, the 64-bit datapath, and the multiplier output]

Notice that the least significant byte of the partial product is discarded after each cycle and absorbed by the shifter unit. These bytes are required only for the H format multiply.

Once completed, the result is sent out through the result latch, then normalized and rounded. The rounding carry is only propagated into the least significant byte of the result. This procedure uses less logic since only an 8-bit instead of a 64-bit incrementer is required. The 8-bit incrementer will be sufficient for most multiplies. Should a greater increment be required, then the multiplier will trap the rest of the machine, and the correction will be performed by the main ALU. This scheme is similar to the one used for addition.

The provision of a 64-bit adder inside the main multiply path is unusual in a high-performance machine. High-speed multiplier designs typically use carry-save adders, which do not propagate the carry signals but save them so they can be absorbed by the subsequent cycle. This form of adder is indeed used in the custom multiplier to perform the 56-bit by 8-bit multiply function illustrated in Figure 4. However, the 8800 also uses a full 64-bit adder for the following reasons:

• A 64-bit adder has to be provided somewhere to propagate the carries from the carry-save adders.


• With the 45-ns cycle time, the 64-bit adder fits in the main datapath. A faster clock for the multiplier would have complicated the clock distribution and been difficult to generate with low skew.

• A full adder in the datapath allows the use of a simple nonrestoring division algorithm.

The multiplier and divider chip contains a 12-bit by 8-bit multiplier, two 8-bit adders, six latches with a total size of 72 bits, as well as the rounding, normalizing, and control logic. A comparable MCA design would require between three and four of these elements.

Alternative Designs for the Multiplier

An MCA design was certainly possible and could have been made to fit in the specified space. The performance of such a design, however, would not be as good as the custom design for multiplication but comparable for division. An MCA design would be 1.7 times better than an 11/780 with an FPA for a multiply in F format, whereas the custom logic chosen is 2.5 times better. The performance would be 2.5 times better for the D format, whereas the custom design is 4.8 times better.

Another alternative was to use a commercially available multiplier. That was tempting because such a product has the advantage of being readily available and tested. Using it would have circumvented the high risk of a custom design. However, there are a number of disadvantages to using general-purpose multipliers:

• Extra logic is required to mask out the sign and exponent of the data and to insert the hidden bit. The output of the multiplier would have to be masked.

• Most available products cannot handle division. Thus a separate divider would have been required, which was expensive. Even division algorithms using multiplication require a large amount of ROM to contain the approximation constants.

• Many of the available designs are intended for integer applications, such as FFT butterflies and processors. Hence, the designs are optimized for those applications. Extending these 8- or 16-bit multipliers to a larger word length, as required for the VAX architecture, was neither straightforward nor cost effective. Moreover, the normalization and rounding of results entails either extra logic or additional cycles if the floating point hardware in the E Box is used.

• Most designs have a clock system not consistent with the rest of the machine. This fact introduces the complication of a special clock distribution and difficulties in verifying the design.

• Very few designs are based on ECL technology. Other technologies, such as TTL, would require a different power rail and thus an extra power supply.

The closest available multiplier to the 8800 requirements is the 10901 made by Motorola, Inc. This MCA implementation contains an 8-bit by 8-bit multiplier together with a 16-bit adder. However, no latches are included; they must therefore be provided externally, thus increasing the cost substantially. On the other hand, division could be provided by repeatedly using the 16-bit adder of the 10901.

Division

The multiplier performs a nonrestoring division algorithm, 1 bit per cycle, for the F, D, and G formats. The divider can accept a new dividend bit during every cycle, thus permitting a 128-bit by 56-bit divide. A divide of this size is used in the H format algorithm to form the starting approximation.

The Booth recode of the multiplier is modified slightly to accommodate the division decode. In the case of multiplication, the multiplier recode selects the correct multiples of the multiplicand to add to the partial product during each multiplication operation. In the case of division, the divisor is loaded into the MD latch, and the Booth recode selects either +1 or -1 times the divisor for each division step.

In the nonrestoring division algorithm, the sign bit of the previous result selects the correct divisor multiple for the next cycle. This selection is facilitated by feeding the sign signal into the modified Booth recode so that it will select the multiples of either +1 or -1 times the divisor.

The quotient bit generated every cycle is sent to the shifter unit to be absorbed. The first quotient bit generated corresponds to the most significant bit of the answer. That bit is then normalized and rounded by the shifter.
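The byte-per-cycle multiply and the bit-per-cycle nonrestoring divide described in the Multiplication and Division sections can be sketched in software. The following Python fragment is only an illustrative model under stated assumptions, not the chip's logic: the function names are hypothetical, operands are plain unsigned integers (sign, exponent, and hidden-bit handling are left out), the discarded low bytes of the partial product are collected in a variable rather than absorbed by a shifter, and ordinary integer arithmetic stands in for the Booth recode and the carry-save adders.

```python
def byte_serial_multiply(multiplicand: int, multiplier: int, width: int = 56) -> int:
    """Multiply unsigned operands, consuming 8 multiplier bits per cycle.

    Each cycle adds (multiplicand * multiplier byte) to the accumulating
    partial product, then shifts the accumulator right by 8 bits.
    """
    assert width % 8 == 0 and 0 <= multiplier < (1 << width)
    pr = 0    # accumulating partial product (plays the role of the PR latch)
    low = 0   # bytes shifted out of the accumulator, kept so the model is exact
    cycles = width // 8
    for cycle in range(cycles):
        byte = (multiplier >> (8 * cycle)) & 0xFF   # least significant byte first
        total = pr + multiplicand * byte
        low |= (total & 0xFF) << (8 * cycle)        # byte discarded by the shift
        pr = total >> 8                             # right shift of 8 bits
    return (pr << (8 * cycles)) | low


def nonrestoring_divide(dividend: int, divisor: int, bits: int) -> tuple[int, int]:
    """Nonrestoring division, one dividend bit in and one quotient bit out per cycle.

    The sign of the previous partial remainder selects -1 or +1 times the
    divisor for the next step; the most significant quotient bit comes first.
    """
    assert divisor > 0 and 0 <= dividend < (1 << bits)
    r = 0  # partial remainder, allowed to go negative
    q = 0
    for i in range(bits - 1, -1, -1):
        bit = (dividend >> i) & 1
        if r >= 0:
            r = 2 * r + bit - divisor   # previous result non-negative: subtract
        else:
            r = 2 * r + bit + divisor   # previous result negative: add back
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:
        r += divisor                    # single final correction of the remainder
    return q, r


print(nonrestoring_divide(13, 3, 4))    # (4, 1)
```

The divide never restores the remainder inside the loop; a negative partial remainder is simply compensated by adding the divisor in the following step, which is why only the sign of the previous result is needed to choose the divisor multiple.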


Microcode Design

Being integrated into the logic in the main machine, the floating point logic is also controlled by the main microcode. The VAX 8800 CPU is an extensively pipelined design.⁵ Although pipelining is a well known technique for improving performance (for example, the VAX 8600 CPU), it comes at a price: the microcode branch latency will increase. By that we mean that the microcode cannot branch on a condition or flag in the very next instruction; instead, it must wait a number of cycles. This delay is a consequence of the overlapping of the microinstructions; each successive microinstruction starts before its predecessor has completed.

Figure 5 shows a typical pipeline similar to that used in the VAX 8800 system. The microinstruction is subdivided into five components:

• In NEXT ADDRESS, the address for the next microinstruction is computed, as well as those for any selected branch conditions.

• In LOOK-UP, the microcode RAM is accessed to fetch the microinstruction specified by the current NEXT ADDRESS.

• In READ, the register file is read to fetch the specified operands (e.g., fetch R0 and R1).

• In ALU, the operation in the arithmetic logic unit is performed (e.g., R0 + R1).

• In WRITE, the result of the ALU operation is written back to the register file.

Thus when the next-address cycle has completed for the first microinstruction, A, the next-address cycle for the microinstruction, B, in the subsequent cycle is started. This cycle now overlaps with the look-up cycle for A. As many as five operations can proceed simultaneously in this manner.

The branch latency of this pipeline is governed by the first microinstruction that can "see" a branch condition set in an earlier cycle. For example, if the ALU cycle of A sets a carry condition, then the first instruction that can possibly use this signal in its next-address cycle is E. Thus the branch latency is three microinstructions, as shown in Figure 5.

Naturally, this branch latency influenced the way in which we designed the logic to perform floating point operations. Clearly, we had to avoid branching whenever possible as this would result in an excessively slow algorithm. Instead, we had to adopt a strategy based on prediction and provide extensive hardware assistance.

Prediction is based on the fact that the speed of algorithms for floating point adds is usually data dependent. For example, for certain data values, the result of a floating point add will require considerable normalization. That requirement is always present when two values

[Figure 5 Five-stage Pipeline: microinstructions A through E each pass through the NA (next address), LU (microcode instruction lookup), READ, ALU, and WRITE stages, each starting one cycle after its predecessor. The condition code (e.g., carry out) is set in the ALU stage of instruction A; instructions B, C, and D make up the branch latency, and E is the earliest instruction that can branch on the condition code of instruction A.]


of similar magnitude and large cancellation are subtracted. In other cases little or no normalization is required. It is clearly preferable not to pay the penalty of unnecessary normalizations.

The approach we took in the 8800 is to proceed down the most likely path, assuming that a small normalization will be required while waiting for the result of the branch signals. The add and subtract algorithms in particular are structured that way. The SALU examines the exponents of the operands and other signals; then it sets approximately 20 branch conditions in the first two cycles of the add/subtract datapath.

In certain situations all paths may be equally probable. In these cases the microcode enables hardware signals to control the datapath. A good example of this processing is the selection of operands. For a floating point add, it is natural to think in terms of the larger and the smaller operands. For example, the smaller operand is the one that is always aligned. However, the microcode does not know which register location holds the smaller value, and it does not want to wait for the whole branch-latency period to find out.

Therefore, the microcode will assume that the larger operand is in a particular register. Should this assumption be incorrect, then the SALU will swap the register file read addresses (thus sorting the operands). Not all locations have their addresses modified in this manner since the microcode still needs to be able to read and write to specific locations.

Similarly, the SALU determines if the main ALU is to do an add or subtract operation. At this point in the computation the microcode is unaware of which operation will be required. The pipeline is still within the long branch latency of the 8800 and cannot branch until this latency delay has elapsed. Note that one of the most frequently performed instructions is ADDF. That instruction will have just completed by the time the microcode can finally branch. Therefore, the ADDF cannot execute any faster since it is limited by the branch-latency delay. Consequently, those instructions that are the most probable cases are completely hardware driven.

To allow fast paths in the add algorithms, it is necessary to know that the result cannot possibly overflow since overflowed results must never be written. To prevent overflow the SALU examines the exponents of the operands. It then determines if the exponent of the result could possibly overflow or underflow, taking into account any possible normalization shift. There is also the added complexity of a rounding operation provoking an extra normalization step. That would happen when the rounding increment caused a carry to propagate throughout the whole fraction.

Consequently, the use of a small 8-bit incrementer for the round operation is possible only if it is known that an overflow cannot happen. The reason for this is that halting (trapping) the machine is not instantaneous (for the same reason that branch latency exists); therefore, the result will always be written. Thus, although the microcode can eventually correct the result, it cannot prevent that result from being written.

Performance Issues

When a program with many floating point instructions, such as LINPACK, is run, its performance is not totally dictated by the raw floating point speed of the CPU. Having a more profound effect are other factors, such as

• The size and organization of the cache: This factor is particularly important for programs with large amounts of data because the operands will reside in memory. Having superior register-to-register performance will not help in this type of program. Clearly, the larger the cache, the greater the chance that the required data will be quickly available, thus avoiding a lengthy transaction with memory.

• The performance of the integer and control instructions: Even programs performing extensive floating point operations still have significant amounts of integer and control instructions. Doing these quickly can contribute substantially to the program's performance.

To illustrate the effect of these factors, compare the performance of the VAX 8800 system with that of the VAX 8650 when both run LINPACK, as shown in Table 2.⁴ The 8650 has faster raw floating point speed, especially for the F format (over twice as fast). Yet the two systems run this benchmark with almost the same performance. Clearly, in programs with these characteristics, factors other than raw


speed will have a greater influence on performance. Of course, in applications without them, the raw speed advantage of the 8650 will be more pronounced.

Table 2 LINPACK Performance

                Performance (MFLOPS)
Computer        F Format    D Format
VAX 8800        1.35        0.99
VAX 8650        1.30        0.70

Summary

The architecture of a processor like the VAX 8800 CPU is all a matter of trade-offs. Where does the performance make a difference? For example, we could have supplied the 8800 with a separate floating point unit to achieve faster performance. Doing that, however, would have required at least one extra module. To keep the cost of the system constant, this extra module would have entailed removing a module of logic from some other part of the computer. Perhaps removing that module would have resulted in a smaller cache or a simpler decoder with no optimizations for the frequent instructions. In any case the net effect would have been to sacrifice the performance of the computer in some other area. All things considered, we feel that the design is well balanced for the multitude of different computing tasks that customers will perform with the VAX 8800 system.

Acknowledgments

The authors would like to thank Ron Melanson and his team for the circuit design of the custom multiplier. In addition, we would like to thank Dave Sager for his help and guidance.

References

1. R. Burley, "An Overview of the Four Systems in the VAX 8800 Family," Digital Technical Journal (February 1987, this issue): 10-19.

2. T. Fossum, W. Grundmann, and V. Blaha, "The F Box, Floating Point in the VAX 8600 System," Digital Technical Journal (August 1985): 43-53.

3. VAX Architecture Manual (Maynard: Digital Equipment Corporation, Order No. EB-19580, 1981).

4. J. Dongarra, "Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment," Argonne National Laboratory (May 1986).

5. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical Journal (February 1987, this issue): 20-33.

James P. Janetos

The VAX 8800 Input/Output System

The VAXBI bus links the processors in the VAX 8800 family to I/O devices, including clusters and networks. The VAX 8800 multiprocessor can support four of these 32-bit synchronous buses, each of which connects up to 16 I/O devices. Each VAXBI bus connects to the memory interconnect, the NMI bus, by an NBI adapter, which contains an interface chip to implement the VAXBI protocol. The NBI adapter logic handles CPU references and direct memory accesses to and from the I/O devices. The adapter has its own 200-nanosecond clock, which is completely asynchronous with the 45-ns CPU clock.

The VAX 8800 family of systems is another major step for Digital Equipment Corporation into the realm of high-performance computing. While increasing the computing capability of the VAX line for scientific and technical applications, these systems will undoubtedly play an important role in commercial and office markets. In these markets, the ability to connect to a computing cluster, service many users, and function in a network are as important as a fast CPU. Indeed, in a multiuser, multiprogramming system, the efficiency of "housekeeping" operations affects the perceived system performance as much as raw processor computing speed. These operations include sharing memory between many programs, swapping processes into and out of memory, paging, and responding to interactive user requests.

All members of the VAX 8800 family use Digital's new VAXBI bus as their communication link to clusters, networks, and interactive users. With its ability to connect to four separate VAXBI channels, the VAX 8800 system in particular offers great flexibility in configuring peripheral devices and interfaces. This paper first discusses the characteristics of the system communication buses in the VAX 8800 system. Following that is a discussion of the interface, called the NBI adapter, linking the primary system bus to the VAXBI input/output (I/O) bus. Figure 1 illustrates the various components of a VAX 8800 system.

The Processor-to-Memory Bus

The two CPUs, the I/O subsystem, and memory all share the primary system bus, called the NMI bus. This bus is a limited-length, high-speed synchronous communications path that provides the data link between these four devices. The NMI bus is completely contained in the main system cabinet; its cycle time is 45 nanoseconds (ns), the same as the CPU's. The bus protocol handles several outstanding transactions at one time, thus effectively increasing the bus's utilization. That is, once a device has issued a transaction (e.g., a read), that device relinquishes the use of the bus until the responding device is ready with the data. Other devices are then free to start other transactions.

In this fashion, the bus usage is greatly increased. The two CPUs communicate directly with memory over the NMI bus; the I/O devices connected to the VAXBI buses access memory via the NBI adapters. A device on the NMI bus is called a "nexus." Arbitration among nexuses occurs in parallel with data transfers and is handled by one CPU in a nearly round-robin fashion. This guarantees that each nexus gains its fair share of the bus resource. Data transfers on the NMI bus occur in longword, octaword, and hexaword lengths (4, 16, and 32 bytes respectively). Four levels of device interrupts are supported.

The VAXBI Backplane Interconnect

The VAXBI bus is used as the I/O bus for the system. As shown in Figure 1, from one to four VAXBI buses can be interfaced to the NMI bus, depending on a customer's needs and his desired mix of peripheral devices. Each VAXBI bus is a 32-bit-wide synchronous bus that


device, called a "node," uses a chip called the VAXBI Interface Chip as its bus interface. This chip provides a consistent logical and electrical interface to the bus. The VAXBI Interface Chip implements most of the bus protocol for its node, including bus arbitration and error checking. The VAXBI cycle time is 200 ns, controlled by an oscillator on the NBIB.

The NBI adapter acts as both a processor and a memory on the VAXBI bus. The adapter provides the following three important functions:

1. A means for the master CPU to read and write device registers

2. A window into memory for VAXBI devices

3. The facility for VAXBI devices to interrupt the processor

Control of Peripheral Devices

To gain an appreciation of the NBI adapter architecture, it is worthwhile to discuss the control of peripheral devices.¹ To move data from a disk into memory or to send program output to a peripheral device, a programmer must specify the operation to be carried out (read or write), a memory address to receive the data or that contains data to be output to a device, and the amount of data to be moved. In early machines, the processor was required to control the entire operation: executing instructions to move the data, waiting for the slower device to complete the operation, and then continuing in this fashion until all the data had been moved. This process wasted a great deal of processor time since many instructions could have been executed while waiting for an I/O operation to complete.

Figure 1 VAX 8800 Configuration


Modern machines have I/O controllers, which are special hardware interfaces that handle device operations. A programmer must specify to the controller the attributes of the operation to be carried out. Once the operation is accepted by the controller, the processor is freed from the details of actually moving the data. In this way processing and I/O operations can be overlapped, increasing processor utilization.

For slow devices, such as terminals, the controller usually has a small buffer to hold the data to be transferred to or received from the processor. This buffer is loaded by the processor when it has data to be transmitted to the device. The device accepts the data, then signals when ready for more. When it has data to be transmitted to the processor, the device loads that data into the buffer and then signals the processor to remove the data. This process is called programmed I/O.

For high-speed devices, such as disks, the I/O controller normally performs direct memory access (DMA) operations. The processor loads special registers in the controller with information about the transfer: the amount of data to be moved and its location and destination. The processor is then freed while the controller performs the transfer. In this way large amounts of data can be moved with minimal processor intervention.

Addressing in the VAX 8800 CPU

The master CPU manipulates the I/O controllers with reads and writes of single longwords to their control and status registers. These registers have addresses in physical address space and can be manipulated by standard VAX instructions. This technique contrasts with that used in many computers in which special instructions control I/O. The address range of the VAX architecture is shown in Figure 2, in which addresses are given in hexadecimal notation.

Physical memory occupies the first 512 megabytes of the defined address range. The I/O adapter and the I/O controller registers are located in the range from 2000 0000 to 3FFF FFFF. In the I/O space, the address range allocated for each VAXBI bus is further subdivided into space for each device on the bus.

Byte Address                Contents
0000 0000 to 1FFF FFFF      512-megabyte physical memory space
2000 0000 to 3FFF FFFF      512-megabyte I/O space

Figure 2 VAX Address Space

The NBI Adapter

An adapter provides an interface between two existing buses, each with its own addressing protocol and data-transfer protocol. The adapter is responsible for all communications between the two buses. It is a datapath for the processor to access device registers and for devices to access memory. This datapath is also used to interrupt the processor and for initialization functions.

The NBI adapter, consisting of an NBIA module and either one or two NBIB modules, interfaces the VAX 8800 system to the VAXBI buses, which are I/O buses in this application. That is, the NBI adapter issues reads and writes on the VAXBI buses in response to reads and writes that are in the NBI address range initiated by the processor on the NMI bus. Likewise, the NBI adapter issues reads and writes to memory on the NMI bus in response to reads and writes initiated by VAXBI devices on the VAXBI buses. The NBI adapter in the VAX 8800 system supports a new generation of high-performance, native VAXBI devices.

Figure 3 contains a block diagram of the NBIA/NBIB adapter system. Basically, the datapath of the NBIA module contains an NMI interface, which provides buffering for addresses and data transmitted and received during NMI transactions. The NMI interface is connected to a transaction buffer, which is a 16-location, dual-ported ECL/TTL RAM. The transaction buffer provides five locations to buffer commands and addresses and up to four longwords of read/write data for direct memory access (DMA)


transfers by devices on the VAXBI-0 bus. A second group of five locations is provided for DMA transfers by devices on the VAXBI-1 bus. Two locations are used for the command/address packet and the single longword of read/write data transferred when the processor accesses the VAXBI device registers. The NBIA/NBIB TTL datapath indicating the layout of the transaction buffer is shown in Figure 4. The TTL port of the transaction buffer connects to a set of two bidirectional latches used to buffer commands, addresses, and data for transmission across the data-bus cable to and from an NBIB module.

The datapath of the NBIB module consists of a set of four bidirectional latches used to buffer both DMA commands and addresses and CPU commands and addresses, as well as data. These latches connect to another set of latches known as the BCI data buffer (one longword deep), which connects to the VAXBI Interface Chip. (The module side of the interface chip is known as the BCI.) The interface chip controls the enabling of data onto the BCI for data transmission onto the VAXBI bus.

Data flows between the NMI bus and the VAXBI bus by moving between these two sets of latches. Control logic moves data from stage to stage, passing control successively to the next stage as each part of the transfer completes. The VAXBI bus runs approximately four times slower than the VAX 8800 processor and is asynchronous with it. Therefore, the additional problem exists of synchronizing control between the NBIA and NBIB modules. Facilities are provided for delaying data transfer until a buffer is free, thus preventing data corruption. Another synchronization problem occurs when the master processor wants to read from or write to a VAXBI device when that device wants to make a memory access. The control logic in the NBIA and NBIB modules is carefully designed to referee such contention problems.

DMA Transfers

From VAXBI Devices to Memory

A DMA transfer to memory by a VAXBI device is shown in Figure 5.

After winning the VAXBI bus, the device wanting to make a transfer initiates a command and address cycle. In Figure 5, that device is a disk controller. The VAXBI Interface Chip in an NBIB is programmed to recognize memory addresses on the VAXBI bus. The chip "awakens" the NBIB control logic, decodes the command, and stores the command/address packet, as shown in Figure 4. Control logic on the NBIB then sends a "DMA request" signal to the NBIA. After a synchronization delay on the NBIA, the NBIA TTL controller begins to transfer the command and address from the NBIB to the NBIA.

Meanwhile, the NBIB takes the longwords of data as they appear on the VAXBI bus and stores them in the NBIB's data buffers. The NBIA stays approximately one cycle behind the NBIB, removing data from the NBIB buffers and storing

Figure 3 Block Diagram of NBI Adapter


it in the DMA locations in the transaction buffer. After successfully transferring all data into the transaction buffer, the NBIA alerts the NBIB, which, after a synchronization delay, ends the transaction on the VAXBI bus. At this time the NBIA TTL controller passes the DMA request to the NMI interface in the NBIA, which then performs the write to memory on the NMI bus.

It should be noted that a DMA write transaction is considered to be complete on the VAXBI bus before the data is actually written to memory. A VAXBI device is thus free to start another transaction immediately. This performance enhancement is known as a "disconnected write," in which the write operation is considered to be completed on one bus before that

[Figure 4 NBIA/NBIB TTL Datapath: the VAXBI Interface Chip and the bus 0 and bus 1 data buffers on the NBIB, the 40-bit data bus, and the NBI transaction buffers on the NBIA, with the transaction buffer organization showing the DMA 0 and DMA 1 locations]

operation has actually taken place on the target bus. The NBI adapter is designed in such a way that a write transaction could be waiting in the transaction buffer (e.g., while the NMI interface controller services the other VAXBI bus) while a second transaction waits in the data bus transceivers. Using two levels of buffering and the disconnected write technique allows the NBI adapter to support a write bandwidth of 8 megabytes per second.

[Figure 5 DMA Transfer to Memory: timing diagram of the VAXBI cycles and the NMI cycles for a DMA write. C/A = command/address; ARB = arbitration; EMB ARB = embedded arbitration.]

It is interesting to note that during the data transfer from the NBIB to the NBIA, the NBIB notifies the NBIA TTL controller of the DMA request immediately after storing the command/address packet. However, the NBIA TTL controller does not pass the DMA request to the NBIA NMI interface controller until the command/address packet and all the write data have been loaded into the transaction buffer. The reason for this delay is that the NMI interface controller runs at the same speed as the NMI bus, or 45 ns per cycle.

The NBIA TTL controller runs four times slower, or 180 ns per cycle, to closely match the VAXBI cycle time of 200 ns per cycle. Thus if the NBIA TTL controller were to signal the DMA request after loading only the command/address packet into the transaction buffer, the NBIA NMI interface would attempt to read data from the transaction buffer before that data had been loaded, and so would read invalid data. Indeed, the NMI interface of the NBIA can empty the transaction buffer in approximately the time it takes for the NBIA TTL controller to load one longword.

From Memory to a VAXBI Device

A read request from a VAXBI device is similar to the DMA operation just described. After winning the VAXBI bus, the device wanting to read data from memory on the NMI bus transmits a command and address on the VAXBI bus. Figure 6 depicts this transfer.

The interface chip awakens the NBIB control logic, which then decodes the command and stores the command and address in a data-bus buffer location. The NBIB then passes the DMA request to the NBIA immediately after the command/address packet is loaded. Again similar to the write operation, the command or address is


transferred to the appropriate location in the transaction buffer by the NBIA TTL controller. However, a DMA read is unlike a write operation, in which the data is ready for transmission, in that the data must be fetched from memory. The DMA request is first passed to the NBIA NMI interface controller, which arbitrates for the NMI bus. Upon winning the bus, the interface controller initiates a read request to memory. When the data is ready, the memory returns it on the NMI bus to the NBIA. Thence the data is transferred into the DMA locations in the transaction buffer, and the NBIA TTL controller is notified by the NBIA NMI interface that the data is ready. The controller then begins to transfer data to the NBIB, loading it into successive locations in the NBIB buffers. This process is illustrated in Figure 4. A "DMA Done" notification is sent to the NBIB after the first longword of data, rather than all the data, has been transferred. That maximizes the read bandwidth on the VAXBI bus. The NBI adapter has a maximum DMA read bandwidth of four megabytes per second.

The DMA read transfer illustrates one fundamental difference between the NMI bus and the VAXBI bus. Referring to Figure 6, one can see that the VAXBI bus is unusable while the NBIA and memory complete the read operation. (The NBIB issues stall signals to the requesting device during this time.) The NMI is a pended bus, but the VAXBI bus is nonpended, or interlocked. That is, the NMI bus is immediately available for use once a command has been transmitted and acknowledged, whereas the VAXBI bus must wait. Thus "pending" transactions are allowed on the NMI bus. Indeed, the NBIA NMI interface can respond to requests from the other VAXBI bus while also having an outstanding read to memory on behalf of the first VAXBI bus.
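The contrast between a pended and an interlocked bus can be made concrete with a toy occupancy model. The sketch below is illustrative only: the function names and the cycle counts are assumptions chosen for the example, not figures from the NMI or VAXBI specifications. It counts how many bus cycles a series of reads ties up when the bus is held through the memory latency (interlocked) and when it is released once the command has been acknowledged (pended).

```python
def interlocked_bus_cycles(num_reads, cmd_cycles, latency_cycles, data_cycles):
    """Bus cycles consumed when the bus is held for the whole read.

    On a nonpended (interlocked) bus the requester is stalled and the bus
    stays occupied during the memory latency, so every read costs
    command + latency + data cycles.
    """
    return num_reads * (cmd_cycles + latency_cycles + data_cycles)


def pended_bus_cycles(num_reads, cmd_cycles, data_cycles):
    """Bus cycles consumed when the bus is released during the latency.

    On a pended bus only the command and the returned data occupy the
    bus; the latency cycles remain free for other transactions.
    """
    return num_reads * (cmd_cycles + data_cycles)


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only to illustrate the difference.
    reads, cmd, latency, data = 2, 1, 5, 2
    held = interlocked_bus_cycles(reads, cmd, latency, data)
    used = pended_bus_cycles(reads, cmd, data)
    print(f"interlocked: {held} cycles, pended: {used} cycles")
    print(f"cycles freed for other traffic: {held - used}")
```

In this simplified model the freed cycles are exactly the latency cycles, which is the time the NBIA NMI interface could spend serving the other VAXBI bus.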

[Timing diagram: memory and VAXBI disk controller; VAXBI cycles and NMI cycles (not to scale), showing memory read latency. C/A = command/address; EMB ARB = embedded arbitration.]

Figure 6 DMA Transfer from Memory


[Timing diagram: memory and VAXBI disk controller; NMI cycles and VAXBI cycles (not to scale). C/A = command/address; ARB = arbitration; EMB ARB = embedded arbitration.]

Figure 7 CPU Transfer from VAXBI Device

CPU Transfers to and from VAXBI Devices

CPU transfers to and from VAXBI devices are similar to VAXBI transfers to and from memory, the obvious difference being that the transaction is initiated on the NMI bus. CPU transfers are shown in Figure 7.

Another difference is that CPU transactions are limited to longword length when accessing VAXBI devices. Since there is only one location for a command/address packet for CPU transfers and one location for read/write data in the transaction buffer, the NBI adapter can handle only one CPU transaction at one time. These limitations lower the CPU-to-VAXBI bandwidth as compared to the DMA bandwidth. An analysis of bus traffic, however, has shown that CPU-initiated transactions account for under 10 percent of the VAXBI traffic in a VAX 8800 system. This finding could be anticipated since the CPU must make only a small number of accesses to a VAXBI device controller to cause it to transfer large amounts of data.

Synchronization

In the earlier discussions of data transfers, the term "synchronization delay" was introduced. In general, some type of synchronization is required whenever more than one independent clock exists in a system. This is the case in the VAX 8800 system. Timing for the processors, memory controller, and NBIAs is derived from a sophisticated clock module that provides two-phase, nonoverlapping clocks with a basic period of 45 ns and tightly controlled skew [2]. The VAXBI timing, on the other hand, is derived from an oscillator and a clock-driver circuit on the NBIB. This timing has a basic period of 200 ns, completely asynchronous to the VAX 8800 kernel. The synchronization of control signals is thus necessary for data transfer between the NBIA and NBIB modules. A DMA read transfer


involves the synchronization of a "DMA request" and a "DMA complete" signal. Therefore, the synchronization overhead can account for approximately 5 to 15 percent of the time it takes to complete the operation.

Summary

The performance of the I/O subsystem is critical to the operation of high-performance systems like those in the VAX 8800 family. The I/O adapter provides a communication link between each processor, the memory, and the I/O devices. The NBI adapter is this link for these systems, providing access to a new generation of VAXBI devices and high-performance I/O operation for these important new machines.

References

1. H. Levy and R. Eckhouse, Computer Programming and Architecture: The VAX-11 (Bedford: Digital Press, 1980).

2. W. Samaras, "The CPU Clock System in the VAX 8800 Family," Digital Technical Journal (February 1987, this issue): 34-40.

Paul C. Wade

The VAXBI Bus - A Randomly Configurable Design

The VAXBI bus provides a high-performance alternative to the UNIBUS system as Digital's general-purpose bus. The VAXBI design was completely specified before any hardware was built and is independent from any physical configuration. The designers had to discard the traditional small-perturbation approach and instead used many techniques to specify the bus characteristics. Two custom chips, a differential driver and receiver, are used to clock the bus. The bus designs were tested extensively with SPICE, but tests on the physical chips led to some unanticipated problems. Further analysis of waveforms, crosstalk, and switching noise led to changes that met all the original goals.

The VAXBI bus is a new, high-performance, general-purpose bus that provides a common interface to all of Digital's new VAX products, from the VAX 8200 CPU to the VAX 8800 system. This bus can also be used for future VAX systems. The VAXBI bus is a higher-performance replacement for the UNIBUS system and should have a similarly long and productive lifetime.

The UNIBUS system was enhanced many times during its long history. Since there was no formal specification for this bus until 1986, these many de facto enhancements led to numerous compatibility and configuration problems. Having learned from those problems, the VAXBI design team decided to make a complete design specification of the VAXBI bus before any hardware was built. Thus compatibility problems should not occur if all future designs comply with that specification.

One of the most important aspects of that specification - and the most difficult to implement - is that the VAXBI bus operates independently from any particular physical configuration. That is, the bus must be randomly configurable. The achievement of that specification was the most difficult part of the electrical design. The techniques and solutions involved in solving this problem should be instructive to future bus designers.

VAXBI Bus Description

There are several excellent references that describe in detail the operation of the VAXBI bus and the VLSI chip that implements the bus logic and arbitration [1, 2, 3]. Therefore, only a short description of the bus will be given here. The VAXBI bus is a general-purpose bus with data transfer rates high enough (up to 13.3 megabytes per second) to serve as a memory bus in mid-range VAX systems, such as the VAX 8200 CPU. Those high rates also allow it to serve as an I/O bus in all sizes of VAX systems by using multiple VAXBI channels in the largest systems, such as the VAX 8800 multiprocessor, shown in Figure 1.

All the machines in the new generation of VAX systems use the VAXBI bus for all I/O, connecting adapters for mass storage, communications, and networks. A VAXBI subsystem, consisting of two six-slot card cages and the backplanes, is shown in Figure 2. The backplanes are connected with flexible interbackplane jumpers with terminators at each end.

The key to general-purpose operation is the distributed nature of the VAXBI bus. All nodes on it contain identical interface hardware, and a


[Block diagram. Labels: CPU 1, CPU 2, memory, NMI, NBI adapter, VAXBI bus, disk adapter, UNIBUS adapter, Ethernet adapter, communications interface, disks, star coupler, UNIBUS, Ethernet.]

Figure 1 VAX 8800 System with Four VAXBI Buses

distributed arbitration scheme precludes the need for a processor to act as a dedicated bus master. The VAXBI bus can support both multiple and networked processors, thus implementing Digital's strategy of . The synchronous operation of the bus achieves high performance by providing predictable communication delays. The distributed arbitration is embedded within each bus transaction so that further data transactions may follow without delay.

The VAXBI bus architecture is rigorously specified, and all designs that are verified to its specification will be fully compatible with the bus. The task of system designers has been greatly eased by the incorporation of all data-handling and arbitration logic in one VLSI element, the 78732 chip, called the VAXBI Interface Chip.

Figure 2 VAXBI Subsystem


That chip also performs self-test functions and bus error detection and handling to improve system reliability and robustness. The physical interfaces are also rigorously specified, and the bus clocking is controlled by custom clock-driver and receiver chips. Figure 3 shows the VAXBI corner of a module, with all the components required for the bus interface contained in a standardized layout. These features free a designer to concentrate on his unique design rather than on the bus details.

Figure 3 VAXBI Corner of a Module

VAXBI Electrical Design

A randomly configurable bus has many advantages as a data bus in general-purpose computers since their physical configurations are not known a priori and are subject to change during repair or upgrading. The previous state of the art within Digital was to use an artificial intelligence program, called XCON, to calculate a configuration for each unique set of UNIBUS options. XCON is based on an extensive set of bus configuration rules. Although it is a triumph of applied artificial intelligence, the necessity to use it for bus configurations was a bottleneck we hoped to avoid by better bus design with the VAXBI bus.

The design of a randomly configurable bus involves essentially the design of a group of aperiodically loaded transmission lines. The characteristics of regularly loaded transmission lines are well defined, but those of randomly and unpredictably loaded lines are less well understood. The design team evolved a design procedure from their work on the VAXBI bus. Although this procedure was derived from the development rather than being planned in advance, it may help bus designers with their projects in the future. Therefore, the remainder of this paper describes that procedure, especially the activities and results that proved most significant to the project.

The first step in designing this bus was the realization that the problem was not completely random but may be bounded. A bus is physically implemented as a group of transmission lines in a backplane. These lines are perturbed by the loading of connectors for modules and by the modules themselves. Each connector, or slot, in which a module may be inserted causes a small perturbation if empty and a larger one if populated. A transmission line can also continue through cabling and connectors onto another backplane. In either case the transmission line is terminated in some manner.

The classic method of dealing with transmission line loading is to make the characteristic impedance so low that perturbations will be trivial. In that case any reflections from these perturbations will be small, and the line can be end terminated in its characteristic impedance so that there is no reflection. The loading is then considered to be predominantly capacitive. Thus the loaded impedance can be calculated as

$$Z_0' = \frac{Z_0}{\sqrt{1 + C_d / C_0}}$$

Our first approach was to determine if the classic method could be used to deal with transmission-line loading for the modules on the VAXBI bus. $Z_0$, the characteristic impedance, ranges from 35 to 100 ohms for the standard dimensions of organic printed circuit boards made by Digital. Corresponding values of $C_0$, the intrinsic line capacitance, range from 1.8 to 0.6 picofarads per centimeter (pF/cm). However, $C_d$, the distributed loading capacitance, can be as much as 5 pF/cm for modules in this implementation. That capacitance means that $Z_0'$, the loaded impedance, would be in the range of 18 to 33 ohms, clearly a major perturbation. Therefore, for modules with these characteristics, the small-perturbation approach could not be used.

In the case of the VAXBI bus, even if it were possible to produce lines whose characteristic impedances were low enough ($Z_0 < 15$ ohms), massive drivers would be required to supply the


necessary current. Therefore, bus power would become a significant portion of the system power dissipation, an undesirable situation. Consequently, we had to consider a design approach different from the classic one.

Our alternative design approach was more pragmatic. Significant development investments had already been made in several key components, particularly the module connector and the 78732 chip. Therefore, the rest of the design had to be as compatible as possible with the characteristics of those key components. Particular attention was paid to three areas: the physical layout, to keep capacitance within the drive capability of the 78732 chip; the clock, since it is the critical element in bus timing; and grounding, which is critical for signal integrity.

The VAXBI data lines are driven directly by the 78732 chip, which is fabricated using advanced MOS technology. MOS devices, however, are limited in their ability to drive current. Within the constraints of chip area and power dissipation, open-drain drivers of about 21 milliamperes (mA) are the only ones available. The data cycle of the VAXBI is 200 nanoseconds. Therefore, the maximum bus length of 1.5 meters (VAXBI specification) is short compared to a wavelength, and a lumped-constant approximation could be used for calculating the delays. An RC time-constant model was used for this approximation, and the voltage swing was limited to 3 V to accommodate a smaller terminating resistor for faster switching. The resulting resistance was 238 ohms (5 V/21 mA).

After calculating the tolerances and worst-case allowances, we chose a standard value for this resistance of 270 ohms. By choosing an RC time constant equal to the maximum available propagation delay (and after subtracting device delays and allowing for component tolerances and a 10 percent timing margin), we calculated the capacitance as 410 pF. This figure became the maximum capacitance for each data line, including backplanes, interbackplane jumpers, connectors, modules, and bus transceivers on the chips. Obviously, the RC time constant is applicable only on the low-to-high transition, when the open-drain device is turning off. Device turn-on, which is normally much faster, is internally compensated for by controlling the rise time to minimize the transmission-line reflections.

For the clock lines, the timing requirements are critical enough to justify the use of very large drivers since only two signals are involved. We selected a differential configuration for clock signals in order to minimize the skew, which could degrade timing accuracy. This configuration also provides noise immunity by common-mode rejection. Since the clock frequency is much higher than the data frequency, ECL was chosen for the logic technology. The maximum drive capability of standard devices is 25-ohm impedance, however, so a custom driver is required. We also chose to use a custom differential receiver, for the following reasons:

• Both parts can operate from the available +5 V supply rather than the -5.2 V supply normally required for ECL.
• The receiver sensitivity and common-mode range can be optimized for the driver.
• The receiver input can be designed for minimal bus loading capacitance.
• The receiver output levels can be standard TTL levels, thus eliminating the need for a separate integrated circuit (IC) for level translation.

Altogether, these two custom clock chips do the work of five standard ICs, thus saving power and module real estate while improving performance.

Since the characteristics of ECL drivers are well understood, we require the clock driver to use an output driver made from three standard 50-ohm ECL drivers in parallel. Thus the effective drive capability is 17 ohms (50 ohms/3). The design termination is intended to match the estimated impedance of a maximally loaded system, approximately 25 ohms differential impedance. This impedance is composed of a resistor to ground from each line and a resistor between lines, chosen to sink the appropriate high- and low-state currents. The design was extensively modeled using the SPICE circuit simulator, which indicated that the driver had adequate current capability for this load [4]. The characteristic impedance of the clock lines was made as low as possible by maximizing the line width within the space constraints of a 0.1-inch via-hole (plated-through hole in a printed circuit board) grid. To improve the common-mode


rejection, the two lines of each differential pair are located one above the other on adjacent layers with ground planes above and below the pairs.

Finally, careful attention was given to the ground return path for all VAXBI signals. Ground planes, to minimize inductance, are provided on the modules, backplanes, and interbackplane jumpers for data lines as well as the clock lines described above. The data-line capacitance was constrained within the 410-pF limit described above by controlling the line width and the ground plane spacing. A particularly difficult problem is the ground inductance of the 78732 chip. The 78732 chip can switch as many as 48 data lines simultaneously, with a total switching current of over one ampere. The induced voltage, $V$, from simultaneous switching is calculated as

$$V = L \times \frac{di}{dt}$$

in which $L$ is the inductance and $di/dt$ is the rate of current change. For example, if the ground inductance were 10 nanohenries and the chip switched in 10 nanoseconds, 1 volt of switching noise would result. Based on these noise calculations, we designed the package with an internal ground plane and 15 ground pins to minimize inductance and switching noise.

Test Results

When the custom clock devices became available, measurements showed that the driver could not power a 25-ohm differential load and still maintain the desired 700-mV amplitude over all conditions. Therefore, we carefully measured the output characteristics in both the high and low states to calculate an optimum termination. The TK!Solver software was used to solve iteratively the driver equations for the piecewise-linear approximations of the driver characteristics, which did not fit any simple curve. We then calculated the optimum resistances and chose the nearest standard resistor values. We also recalculated the output voltages for normal tolerances of resistance, voltage, and temperature, and a ±50 percent variation in the internal resistance of the driver. The minimum calculated amplitude was 695 mV, giving us a very high confidence of having at least 700 mV for any actual hardware.

The optimized termination has a differential impedance of 37.6 ohms, which turns out to be a better match for the measured impedances of the rest of the hardware. An empty backplane has a differential impedance of approximately 60 ohms, dropping to as low as 28 ohms when fully populated; a jumper cable between backplanes typically has a 45-ohm differential impedance. The various possible VAXBI configurations yield a maximum reflection coefficient at any point of 0.28; probable configurations will have even smaller reflections.

Reflections of this magnitude could cause significant timing variations in single-ended systems due to a fixed receiver threshold voltage. However, they have no effect on a differential line since the reflection is the same on both lines of the differential pair. The only variation we found was caused by the differences in impedances on different printed circuit layers. Subsequent experiments indicated that improving the matching of impedances by putting the differential pair on the same layer reduces the skew more than the common-mode noise reduction due to the mutual coupling of adjacent layers. Further experiments showed that the clock system operates at frequencies at least 25 percent higher than the design goal over all combinations of bus configuration, voltage, and temperature.

The data lines exhibited more subtle problems. Our initial testing yielded results very similar to our design predictions. As sufficient hardware was assembled for a maximum configuration with heavy bus traffic, however, unexpected waveforms were discovered. The waveforms no longer exhibited the exponential shape of an RC time constant; instead, they resembled step functions with exponential risers. After due deliberation, we realized that, although the full time constant was fairly slow, the initial slope, $dV/dt$, was much faster. Therefore, its higher frequency components traveled down the line and were reflected several times during the duration of an RC time constant, resulting in the staircase effect. SPICE simulations yielded an identical waveform when a transmission line, originally considered unnecessary, was included in the model. The overall timing was not affected by the reflections. Figure 4 shows this waveform with its staircase effect caused by incomplete termination of the transmission line.
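The impedance figures quoted in the electrical design and test results can be cross-checked with a few lines of arithmetic. The following sketch assumes the classic loaded-line relation, loaded impedance = Zo / sqrt(1 + Cd/Co), and the standard reflection coefficient at a junction between two impedances, (Z2 - Z1) / (Z2 + Z1). The function names are illustrative, and the impedance pairs tried are simply values mentioned in the text; the sketch does not attempt to reconstruct how the 0.28 maximum was obtained.

```python
import math


def loaded_impedance(z0_ohms, c0_pf_per_cm, cd_pf_per_cm):
    """Loaded impedance of a capacitively loaded line: Zo / sqrt(1 + Cd/Co)."""
    return z0_ohms / math.sqrt(1.0 + cd_pf_per_cm / c0_pf_per_cm)


def reflection_coefficient(z_from_ohms, z_to_ohms):
    """Voltage reflection coefficient for a wave passing from z_from to z_to."""
    return (z_to_ohms - z_from_ohms) / (z_to_ohms + z_from_ohms)


if __name__ == "__main__":
    # Extremes quoted in the text: Zo = 35..100 ohms, Co = 1.8..0.6 pF/cm,
    # with up to 5 pF/cm of distributed loading.
    for z0, c0 in ((35.0, 1.8), (100.0, 0.6)):
        loaded = loaded_impedance(z0, c0, 5.0)
        print(f"Zo = {z0:5.1f} ohms -> loaded impedance = {loaded:4.1f} ohms")

    # Differential impedances mentioned in the test results (ohms).
    termination = 37.6
    for name, z in (("empty backplane", 60.0),
                    ("fully populated backplane", 28.0),
                    ("jumper cable", 45.0)):
        gamma = reflection_coefficient(z, termination)
        print(f"{name:26s} into termination: {gamma:+.3f}")
```

The two loaded impedances come out near 18 and 33 ohms, which agrees with the range stated for the small-perturbation analysis.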


[Plot: voltage (volts) versus time (nanoseconds).]

Figure 4 Simulated Waveform from SPICE

A second, more significant, effect was due to crosstalk, or coupling between the lines. To meet the capacitance budget, the original physical design aimed to minimize the capacitance to ground. An undesired result was that the mutual capacitance from line to line, while still small, became proportionally larger, thus increasing the coupling from line to line. The voltage on one line was affected by voltages on nearby lines: transitions were aided by like transitions and slowed by opposing transitions. In the worst case, the magnitude of this variation was as much as 24 nanoseconds.

This worst case occurred on a group of lines in close proximity to a "spare" line, not connected or terminated, which contributed additional mutual capacitance, thus enhancing the coupling. This spare line, included to reduce the need for engineering change orders (ECOs) to the backplane, nearly needed an ECO for its removal, which could have delayed several new products. However, a timing analysis showed that its removal was unnecessary. It should be emphasized that this effect was not visible until actual bus traffic, consisting of random data patterns, was being transferred on a large bus configuration. Test patterns were too small and too regular to show these significant effects.

Simultaneous switching noise, described above, was also investigated because its effect was similar to the effect of crosstalk. All VAXBI data signals except one were switched simultaneously, and the induced voltage was monitored on the remaining line, which was fixed in the high (inactive driver) state. Ground pins were then broken off one at a time, the voltage being measured after the removal of each pin. As a result the induced voltage increased from an insignificant level with all ground pins present to more than one volt with only a few ground pins remaining. With one more pin removed, the chip no longer passed self-test. These results showed that only a few ground pins are necessary for the chip to operate, but 15 are needed to prevent the addition of noise to the bus.

The timing analysis involved fabricating special lots of 78732 interface chips with the fastest and slowest possible process variations. From these lots chips were selected at the absolute specification limits. These chips were carefully measured in a range of configurations, including one beyond the specified limits. Then the timing margins were calculated over the specified range of operating conditions. When all possible worst-case conditions and the effects described above had been included, the calculated timing margin was reduced to 0.5 nanoseconds. Design verification testing on this worst-case system showed that it could still operate at a frequency 10 percent higher than that specified over the full operating range of temperature and voltage.

Summary

The VAXBI bus was designed to a rigorous bus-architecture specification. After minor adjustments during design verification testing, the bus met all the requirements of that specification. In particular, this testing proved that the VAXBI bus can operate independently of system configuration.

Several other points should be noted by bus designers for future products:

1. Designing a product to a rigorous specification, called a top-down design, can really work.

2. Differential signals are recommended for critical timing. They are best located on the same printed-circuit layer on a module.

3. Testing should be performed on real hardware with real data, as closely as it can be approximated during the design process. Too often, the test patterns run on test structures yield nothing but the expected results. Testing should also reveal unexpected problems, not simply corroborate the design.


4. Ground return paths require careful consideration, particularly under conditions of simultaneous switching.
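The simultaneous-switching concern in the last point can be illustrated with the induced-voltage relation used in the electrical design, V = L × (di/dt). The sketch below is a first-order estimate only. The function names are illustrative, and the per-pin inductance and the idea that identical pins in parallel divide the inductance by their number (ignoring mutual inductance) are assumptions made for the example, not measured properties of the 78732 package.

```python
def induced_voltage(inductance_h, delta_current_a, switch_time_s):
    """Ground-noise estimate V = L * (di/dt) for a linear current ramp."""
    return inductance_h * (delta_current_a / switch_time_s)


def parallel_pin_inductance(pin_inductance_h, num_pins):
    """Effective inductance of identical ground pins in parallel.

    Assumption: mutual inductance between pins is ignored, so the
    result is optimistic for closely spaced pins.
    """
    if num_pins < 1:
        raise ValueError("at least one ground pin is required")
    return pin_inductance_h / num_pins


if __name__ == "__main__":
    # Example from the text: 10 nH of ground inductance, about 1 A
    # of total current switched in 10 ns.
    print(f"{induced_voltage(10e-9, 1.0, 10e-9):.2f} V")

    # Hypothetical per-pin inductance, used only to show the trend
    # as ground pins are removed.
    pin_l = 10e-9
    for pins in (15, 3, 1):
        l_eff = parallel_pin_inductance(pin_l, pins)
        print(f"{pins:2d} pins: {induced_voltage(l_eff, 1.0, 10e-9):.3f} V")
```

Under these assumptions the noise grows in inverse proportion to the number of ground pins, which is the qualitative trend seen when the pins were broken off one at a time.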

Acknowledgments

The following people were invaluable in the successful and timely conclusion of the VAXBI project: Dana Blanchard, Frank Bomba, Bob Chen, Norm Commo, Ron Desharnais, Rick Gillett, Glenn Herdeg, Bill Lin, Bill Schmidt, Jim Staples, Berry Ann Tyson, Bob Willard. Of course, the VAXBI bus would not have been possible without the contributions of the VLSI team responsible for the 78732 VAXBI Interface Chip.

References

1. F. Bomba, R. Chen, and R. Gillett, "General Purpose Bus Eases Interaction of Distributed Resources," Computer Technology Review, vol. VI, no. 2 (Spring 1986): 47-53.

2. VAXBI Options Handbook (Maynard: Digital Equipment Corporation, Order No. EB-27271-46, 1986).

3. R. Schumann and W. Parker, "A 32-bit Bus Interface Chip," ISSCC Digest of Technical Papers, vol. XXVII (February 1984): 147-148.

4. SPICE was developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Science, University of California, Berkeley.

Michael W. Kement
Gerald J. Brand

A Logical Grounding Scheme for the VAX 8800 Processor

The treatment of ground as a signal conductor is crucial in achieving high-performance computer systems. The impact of system grounding on signal integrity becomes even more important as systems are connected into networks. For the VAX 8800 CPU design, the authors first identified the sources of ground-conducted noise from the four ground systems: the power and logic systems, and the safety and RF grounds. They then isolated and defined the ground elements in order to specify an interconnection strategy to guarantee the CPU's performance. Then the I/O subsystem grounding was established and finally a system-to-system grounding scheme was completed.

The design of the ground interconnection is often given little attention in system design, at least until it becomes crucial to system performance and program development schedules. The treatment of this interconnection as a signal conductor greatly affects the electrical noise levels. Ultimately, these noise levels are a critical factor in limiting the maximum clock speeds and thus machine performance.

Field service personnel have long recognized that many installation problems result from the subtleties of grounding when cabling together CPUs, mass storage devices, and peripherals. Particularly difficult problems occur when equipment comes from different vendors. The traditional approach to solving these problems has been to dispatch a seasoned field service representative to the site with an assortment of ground straps and other parts. Given the injunction to "make it work," he could, with enough ingenuity and customer patience, bring about satisfactory results.

As a consequence, early in the development cycle the VAX 8800 project team set a high priority on the logical design of the ground system. We knew that the 8800 would be used in large networks, thus intensifying any problems with ground-conducted noise. In fact, the inclusion of the backplane interconnect, called the VAXBI bus, ensured that many I/O ports with high bandwidths would exist in close electrical proximity to the logic backplane. Moreover, many of the applications targeted for the product would preclude its installation in the controlled environment of a computer room, with its traditional massive copper grounding grid beneath a raised floor. The system components would be connected for the first time at a customer's site. Our goal was to require minimum site preparation efforts; system components were designed to be cabled together in a "plug-and-play" manner.

These product goals, coupled with the EMI/RFI and system safety requirements of the international regulatory agencies, required an integrated system philosophy for grounding and shielding. The approach that we followed on the VAX 8800 project involved three separate but interrelated steps:

First, we identified the sources of ground-conducted noise within the VAX 8800 and devised ways to reduce that noise to the lowest practical level. Next, we identified the interconnections within the ground networks and connected them in ways that controlled the ground noise. There are four ground networks:

1. Power return
2. Logic return
3. Safety, or ac power-fault ground
4. Radio frequency shield and chassis ground

Finally, we extended the concept of system ground in the VAX 8800 to large-system applications and computer networks in an effort to ensure optimal overall system performance. In the majority of cases, these networks involve mature products for which it is difficult to make any internal configuration changes.

Ground Conducted Noise

Power System

The VAX 8800 power system consists of modular units of switching power regulators operating at 50 kilohertz (kHz). The total three-phase ac power required for a typical application configuration is about 5 kilowatts (kW). The hardware implementation uses units from a family of products called the Modular Power System, or MPS, designed by Digital. These units yield low and tightly controlled differential (normal mode) noise levels for the dc power that supplies voltages to run logic.

Through their high electrical efficiency of power conversion, such switching power systems have made possible the small sizes and low

During normal switching-converter operation, voltage pulses with rise times of approximately 1000 V per µs are applied to the primary. These pulses cause capacitively coupled noise currents with peak amplitudes of approximately 200 milliamperes to be sent into the system chassis, or safety ground. Figure 1 shows a schematic representation of this process. The parasitic leakage inductance associated with the primary winding comprises a series-resonant circuit with the shield capacitance. This noise current has a decaying exponential waveform with a frequency in the range of 5 to 10 megahertz (MHz) and a repetition rate of twice the switching frequency. Since many power converters are used in the VAX 8800 system and they are all synchronized to a common clock, the noise currents tend to add. Current amplitudes as high as 2 amperes were observed.

The most practical way to reduce this noise source was to insert a damping resistance, Rd, that would reduce the Q of this resonant circuit at the specific frequency range.
Q is tradition­ weights of present computers. This power cir­ ally defined as the ratio of reactive impedance cuitry, however, has current spikes (dijdt) as to resistance, and represents a measure of reso­ high as 1000 amperes per microsecond (!is) and nant efficiency. The international safety regula­ voltage slew rates (dVjdt) as high as 2000 volts (V) per 11s. These high slew rates, a conse­ tions, however, strictly limit the fault-current quence of the pursuit of high efficiencies, can impedance in this path. To meet both require­ produce significant noise problems. The rest of ments, we inserted a ferrite bead on the shield this section discusses five of the most important ground lead. This bead is made of ceramic ferro­ noise sources that we identified and resolved in magnetic material that is electrically lossy. It the power system. acts as a small inductance at low frequencies and as a nearly pure resistance at high frequen­ Noise Currents cies. The bead does not block the fault currents When high-voltage slew rates are present across from a short circuit but does reduce the noise parasitic capacitances (i.e., unintentional capac­ current to the desired level. The noise ampli­ itance that is present as a consequence of a tude is reduced by two to four times and the physical metallic structure), a noise current ring frequency reduced to about 1 MHz. Thus a Ill will be generated: potentially serious cause of common-mode noise current in the system is reduced at the C" dVjdt Ill = source to acceptable levels. in which Cp is the parasitic capacitance . In new designs, more effective schemes One significant source of common -mode involving different shield configurations and noise in the MPS regulators is the parasitic interconnections could be employed. 
capacitance between the primary windings in the high-frequency power transformer and the Power Line Filter solid-foil safety shield between the primary and One of the more subtle (and ironic) sources of secondary windings. The use of this shield, con­ common-mode noise current originates in the nected to a sheet-metal "safety ground," is one power fi Iter designed to reduce the electrical way of complying with the international safety noise emanating from the power line. Figure 2 regulations. 1 depicts a schematic of a typical line filter,

A Logical Grounding Scheme for the VAX 8800 Processor

[Figure 1: schematic of power transformer T1 showing the primary current (Ip), primary voltage (Vp), and noise current (In), with the primary and secondary windings and their parasitic capacitances to the shield.]

L1p ≈ 1.2 × 10^-6 H, primary leakage inductance
Cp ≈ Cs ≈ 200 × 10^-12 F, primary and secondary parasitic capacitance to shield
Rd is the damping resistance provided by a lossy ferrite bead
Resonant frequency of In: $F_o = 1/(2\pi\sqrt{L_{1p} \times C_p}) = 10.3$ MHz
Resonant impedance: $R_o = \sqrt{L_{1p}/C_p} = 775$ ohms
With Rd = 0, In (peak) = Vp (peak)/Ro ≈ 200 milliamperes
With Rd = 500 ohms at 10 MHz, In (peak) = 118 milliamperes

Figure 1 Parasitic Capacitance of the Power Transformer
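As a rough numerical check of the relations used in this section, the following sketch evaluates the capacitive noise current In = Cp dV/dt and the series-resonant frequency of the leakage inductance with the shield capacitance. It uses the values listed with Figure 1 (1.2 μH, 200 pF) and the 1000 V per μs primary slew rate quoted in the text. The function and constant names are illustrative only and do not come from the original article.

```python
import math

# Values quoted in the text and with Figure 1.
C_P = 200e-12     # parasitic capacitance to shield, farads
L_LEAK = 1.2e-6   # primary leakage inductance, henries
DV_DT = 1000e6    # primary slew rate, volts per second (1000 V per microsecond)


def noise_current(c_parasitic, dv_dt):
    """Capacitively coupled noise current, In = Cp * dV/dt, in amperes."""
    return c_parasitic * dv_dt


def resonant_frequency(inductance, capacitance):
    """Series-resonant frequency, Fo = 1 / (2*pi*sqrt(L*C)), in hertz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))


if __name__ == "__main__":
    i_n = noise_current(C_P, DV_DT)
    f_o = resonant_frequency(L_LEAK, C_P)
    print(f"Peak noise current: {i_n * 1e3:.0f} mA")
    print(f"Resonant frequency: {f_o / 1e6:.1f} MHz")
```

With these inputs the sketch gives a peak current of about 200 milliamperes per converter and a resonant frequency slightly above 10 MHz, which is consistent with the figures quoted in the text.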

including the parasitic, or leakage, inductance of the common-mode choke, L1. The "Y" capacitors, C1, are connected from either side of the power line to the chassis, forming a high-Q resonant circuit with this leakage inductance. The load current for this power filter is dominated by the discontinuous current pulses of the switching power converters, which provide excitation for this resonant circuit. The result is a resonant current pulse into the chassis with each half-cycle of current in the power line.

Other considerations of signal integrity demand that an inductor be placed in series with the power ground wire in the filter before that wire is connected to the chassis. The resulting ground impedance forces the resonant common-mode current to flow through the chassis of the system, probably through the logic returns. If the filter design has taken this parasitic resonance into account, a series resistor or ferrite bead, Rd, may be added to lower the circuit Q. That reduces the common-mode current at the expense of filter attenuation.

[Figure 2: schematic of the power line filter. Legible labels: line, phase, and ground conductors; 0.05 μF capacitors; ground inductor; ac/dc switching power converter; low-voltage dc load (logic).]

Figure 2 Power Line Filter

In the case of the 8800, many of the system components had been designed and released before this problem was fully appreciated. Therefore, our only viable strategy was to segregate this noisy ground by separating the logic returns and chassis grounds to the greatest degree possible.

Noise Voltages

The electrical dual of the noise source just described is the generation of noise voltages across both real and parasitic circuit inductances when rapidly changing currents flow through them. This noise voltage is expressed as

$V_n = L_p \, dI/dt$

in which $L_p$ is the value of inductance.

The most common source of noise voltage in switching power converters is parasitic inductances excited by the rapid rise and fall of current in the transistor power switch and by the reverse charge recovery in the rectifier diodes. These abrupt transitions between the conducting and nonconducting states generate a very high dI/dt. For example, the primary reset diodes (D1 and D2 in Figure 3) in the MPS converters have very fast switching times of 30 to 50 nanoseconds (ns). As the diode current rapidly goes to zero when the switch is turned


off, the circuit parasitic inductance will ring with the capacitor C in the switch-protective snubber. The frequency range will be from 10 to 50 MHz for typical circuit values. The result is a differential noise voltage at the converter output.

Our solution to this noise voltage source was to install an appropriate ferrite bead on the diode lead to damp the oscillations in this frequency range.

[Figure 3: high-frequency equivalent model of the power switching stage. Legible labels: +300 V source; stray inductance Ls = 300 × 10^-9 H; collector-base capacitance of Q1 and Q2 = 100 × 10^-12 F; transformer T1; diodes D1 and D2; Vnoise. NOTE: The screened components are not active: Q1 and Q2 are off. The magnetizing current from T1 is resetting to zero through D1 and D2 to the 300 V source.]

Figure 3 Parasitic Inductance of the Power Switching Stage

Radiated Magnetic Flux

A substantially more difficult problem is caused by rapidly changing magnetic fields that radiate from the high-current secondary circuits in the power converters. The output rectifiers can be conducting as much as 200 amperes when they switch off; the resulting dI/dt can easily approach 1000 amperes per microsecond. As the current dies, the magnetic field surrounding the secondary windings and these high-current conductors will collapse. That induces a voltage in other conductors enclosed by this magnetic flux. According to Faraday's Law, this noise voltage is

$V_n = -N \, d\Phi/dt$

in which N is the number of turns in the other conductors, and $d\Phi/dt$, which is proportional to dI/dt, is the rate of change of magnetic flux. It is quite possible to develop volts of noise across 2 inches of circuit board etch or a sheet-metal panel through this effect.

The original designs of the MPS converter tried to minimize this noise problem by making the high-current loop areas as small as possible, thus minimizing the radiated magnetic flux. In addition, copper Faraday shields and ground-plane circuit boards were used. In spite of this care, we encountered problems with circulating currents induced in the mechanical support structure in the VAX 8800 system design. As with the power-line filter, we could not reduce the noise at its source. Therefore, the only viable solution was to take great care with the chassis ground connection of these structures so that the noise currents are directed away from sensitive circuits.

The Logic System

A significant source of noise within the logic system is the energy radiated from the interconnect cables from the I/O bus to the disk controller. This noise radiates at a fundamental frequency of about 47 MHz. The bus itself is a high-speed, mass-storage parallel interface. The interconnect cable is composed of individual coaxial signal pairs that are transformer coupled and driven differentially. However, the impedance from the coaxial center conductor to the outer overall shield is slightly different from the impedance from the coaxial shield to the outer shield. That is, the two signal conductors do not have equal impedances to the outer shield, which is grounded to the chassis at each end. The result is a net noise current that flows on the outer shield. Within the VAX 8800 processor, this current can couple into adjacent cables.
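The noise-voltage relations above (Vn = Lp dI/dt for a parasitic inductance, and the ringing of that inductance with a capacitance) can be illustrated with a short sketch. The 1000 A per μs slew rate and the 300 nH stray inductance appear in the text and in Figure 3. The 10 nH wiring inductance and the 100 pF ringing capacitance are assumed values chosen only for illustration, and the function names are hypothetical.

```python
import math

DI_DT = 1000e6      # current slew rate, amperes per second (1000 A per microsecond)
L_WIRING = 10e-9    # ASSUMED parasitic inductance of a short conductor, henries
L_STRAY = 300e-9    # stray inductance shown in Figure 3, henries
C_RING = 100e-12    # ASSUMED capacitance that rings with L_STRAY, farads


def inductive_noise_voltage(inductance, di_dt):
    """Noise voltage across a parasitic inductance, Vn = Lp * dI/dt, in volts."""
    return inductance * di_dt


def ring_frequency(inductance, capacitance):
    """Ringing frequency of an LC pair, 1 / (2*pi*sqrt(L*C)), in hertz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))


if __name__ == "__main__":
    v_n = inductive_noise_voltage(L_WIRING, DI_DT)
    f_ring = ring_frequency(L_STRAY, C_RING)
    print(f"Noise voltage across {L_WIRING * 1e9:.0f} nH: {v_n:.0f} V")
    print(f"Ring frequency: {f_ring / 1e6:.1f} MHz")
```

Under these assumptions even 10 nH develops about 10 V at the quoted slew rate, and the ring frequency lands near 29 MHz, inside the 10 to 50 MHz range the text gives for typical circuit values.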


The only practical method to minimize this noise coupling was careful routing and dressing of the interconnect cables relative to other communication and power cables.

VAX 8800 System Grounding

This section describes the types of ground structures present in a large system like the VAX 8800 multiprocessor. As such a computer system expands in size and complexity, its ground connections also expand and their interrelationships grow in complexity. To appreciate the grounding scheme as a total system, the various components must be isolated by function and location. In that way the ground system can be broken into its constituent elements. The individual components can then be viewed as functional blocks that require interconnection. Although a designer can choose how to interconnect the ground elements, he is always constrained by the existing international regulations in the implementation of the grounds.

Types of Ground Topologies

There are three choices of ground interconnection topology: single point, multipoint, and hybrid. The single-point ground looks like a wagon wheel with the ground in the center and the other devices connected radially around the hub. That center becomes the absolute ground point, called the zero-voltage potential reference, for all devices. Multipoint grounding has each device individually connected to a single ground plane, all of which is at the same zero-voltage potential. The hybrid is some mixture of the single-point and multipoint topologies in which interconnections are made based on the characteristic needs of the subsystem functional elements.

The single-point topology is not practical to implement on a large system like the VAX 8800. The physical distances and associated impedances of the interconnects begin to dominate so much that an absolute ground point does not really exist. The multipoint ground requires a ground plane, or grid, to be effective. Again, in a large system, it is not practical to implement a ground plane into the physical layout. The hybrid scheme has advantages over the other two, but it requires a detailed evaluation of the characteristics of each subsystem element before an interconnection can be designed. That was the approach we followed in designing the interconnection for the different ground types in the VAX 8800 system.

DC Power Return

The dc-to-dc converters in the system required a dc current return that presented a low impedance through the frequency range of dc to 200 kHz. Our primary consideration was to specify a conductor with a sufficiently large cross-sectional area to keep the IR losses and heating effects to a minimum. A secondary consideration, often overlooked, was to minimize the physical distance between the current feed and the return. In a large system the currents involved can exceed 400 amperes. The resulting flux can produce a large magnetic field. This field is determined by the relationship

$\text{Magnetic Flux} = I \times \mu \times A / l$

in which I is the current, μ is the permeability of air, and A the area and l the length of the conductor. These leakage fields can couple into adjacent devices, sheet metal, and cables. If the flux has an ac component, a current may be induced in adjacent conductors, as described earlier.

A power supply in the MPS series used in the 8800 has a silver-plated bus as its main output. That bus is mated to a large connector that is mechanically mounted on the power backplane. This connector is soldered to multiple epoxy-coated copper strips that are 0.050 inch thick by 2 inches wide. These strips are fusion welded to a horizontal bar that is bolted to the inner layers of the CPU backplanes. The supply and return straps are overlapped to minimize parasitic inductance and its consequent radiated magnetic flux. The flat, wide geometry of the connection is essential to minimize that flux. (See Figure 4.) Minimizing this stray inductance is also essential to obtaining rapid power-system response to load transients with adequate stability (phase margins).

Logic Return

The logic return provides a common signal reference for the logic within the system. To minimize noise this reference must be designed with a low impedance at the frequency corresponding to the logic switching speed. With logic operating at rise times of 1 V per ns, or 300 MHz, this reference is considered to be a radio

frequency (RF) ground and thus can be modeled as a frequency-dependent impedance. The ground impedance at these frequencies is dominated by the depth of penetration of current into the conductor. The magnetic field surrounding the current forces the density of current to decrease from the surface value as the depth into the conductor increases. In the limiting case, as frequency becomes very high, the current will flow as a sheet of charge at the surface. The result is a steadily increasing real component of impedance (resistance) with increasing frequency. The point at which the current density decreases to 1/e of the surface magnitude (approximately 37 percent) is one "skin depth."

Therefore, the first step in calculating the ground impedance is to derive the skin depth, in meters, as follows:

$\text{Skin Depth} = 1/\sqrt{\pi \times F \times \mu \times \sigma}$

in which F is the frequency in Hz, μ is the permeability in henries per meter, and σ is the conductivity in siemens per meter. For example, for copper, the skin depth is $0.0666/\sqrt{F}$ in meters. After the skin depth has been determined, the impedance at the frequency of concern can be found using the sheet resistance of the material. The resistance, R, is equal to $\rho \times L / A$, in which ρ is the specific resistance of the conductor, L is the length, and A the area. For copper, ρ equals 1.673 microohm-centimeters.

Another major factor in designing a ground plane is the voltage drop across the ground layer at low frequencies (dc to 1 kHz) as the total load current is sent from the logic modules. This voltage drop produces an offset in the logic threshold from module to module that affects the noise margins, or tolerance. The voltage drop is a function of the sheet resistance of the ground layer (inversely proportional to the thickness) and the method of termination of the ground layers to the return buses. The connection geometry must be chosen to ensure a safe

[Figure 4: logic power distribution system. Legible labels: MPS power system; VAXBI power; power buses for -5.2 V at 200 A, -2.0 V at 100 A, and +5.2 V at 100 A; flex-circuit power bus backplane; horizontal laminated CPU power distribution bus; VAXBI I/O backplanes; CPU backplane; memory backplane.]

NOTES:

1. The return, or logic ground rail, is connected along its entire length to the system chassis and represents the system single-point connection of RF (chassis), power, and logic ground.

2. The MPS regulator rack is electrically isolated from chassis ground and connected through lossy RF chokes.

Figure 4 Logic Power Distribution System
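The skin-depth calculation described for the logic return can be sketched as follows. The permeability and conductivity of copper are the values worked out in the Appendix. The surface-resistance relation Rs = 1/(σδ) is the standard textbook expression and is assumed here rather than taken from the article, and the function names are illustrative.

```python
import math

# Copper constants as derived in the Appendix.
MU_COPPER = 1.25662e-6     # permeability, henries per meter
SIGMA_COPPER = 5.8001e7    # conductivity, siemens per meter


def skin_depth(freq_hz, mu=MU_COPPER, sigma=SIGMA_COPPER):
    """Skin depth in meters: 1 / sqrt(pi * F * mu * sigma)."""
    return 1.0 / math.sqrt(math.pi * freq_hz * mu * sigma)


def surface_resistance(freq_hz, mu=MU_COPPER, sigma=SIGMA_COPPER):
    """Sheet resistance of one skin depth, ohms per square (ASSUMED standard form)."""
    return 1.0 / (sigma * skin_depth(freq_hz, mu, sigma))


if __name__ == "__main__":
    for f in (1e3, 200e3, 300e6):
        depth_um = skin_depth(f) * 1e6
        rs_mohm = surface_resistance(f) * 1e3
        print(f"{f:>12.0f} Hz: skin depth {depth_um:9.2f} um, "
              f"Rs {rs_mohm:7.4f} mohm/square")
```

With these constants the expression reduces to roughly 0.066/√F meters, in line with the copper figure quoted in the text. At 300 MHz the current is confined to a few micrometers of the surface, which is why the 0.003-inch ground layers behave as a frequency-dependent resistance rather than a simple dc conductor.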


maximum current density through the ground layers. Current crowding, particularly at the connection points and plated through-holes, can turn the backplane into a toaster oven.

We used the inner layers of the CPU backplane as the logic reference for the VAX 8800 CPU. There are four ground layers, each 0.003 inch thick. Figure 5 illustrates the dc voltage-potential drop as a function of geometry across the CPU backplane. The return current is approximately 500 amperes; therefore, this CPU backplane was the most challenging part of the design.

[Figure 5: plot of the dc voltage distribution across the CPU backplane for the -5.2 V power plane. NOTE: Measurements were made from corresponding local points on the ground plane. The plot demonstrates the excellent control over voltage drops provided by the internal ground and power planes of the multilayer CPU backplane. Maximum current available to these -5.2 V inner layers is 400 amps.]

Figure 5 Distribution of the Backplane Voltage for the -5.2 V Power Plane

AC Safety Ground

The primary function of a safety ground is to provide a low impedance at 60/50 Hz, thus allowing fault currents to follow a path with a low IR drop. The design and implementation of this path is strictly controlled by the international regulations, with which all other uses of this ground must comply. The safety ground also acts as a signal ground in that it connects products to the ground grid of the building housing the system. This connection can be detrimental to the system's I/O signals. Thus it is advantageous to add an impedance whose magnitude is frequency and current dependent in series with the safety ground. A saturating inductor meets those requirements.

For a fault condition, Digital's internal design standards require that a current of twice the product's receptacle rating flowing through the safety ground system must not result in a voltage rise of more than 4.0 V, and this level must be sustained for 10 minutes. With these requirements in mind, we used a 1.2-millihenry choke to isolate the VAX 8800 CPU from the building ground at high frequency. This choke was designed to saturate as described above if a fault occurs.

Chassis Ground

The RF shield comprises the chassis ground and the outer panels of the cabinet. The federal regulatory agencies (FCC and VDE) set and enforce the allowable limits of radiated emissions from computer equipment. Since the integrated circuits within the system are switching at high frequencies, they can be modeled as RF sources. The interconnecting etches between integrated circuits that are not tightly coupled to a ground layer can be modeled as antennas. The faster the clock and edge speeds, the shorter the antenna needed to act as an effective radiator. The length, in meters, of a full wavelength is defined as $3 \times 10^8 / F$.

Once this wavelength has been found, the outer panels of the cabinet can be modeled as an attenuator, which decreases the amount of radiated energy that can be transmitted from within the cabinet. To maintain this level of attenuation, all openings, such as doors, must be bridged with conductive gasketing or finger stock. The openings for air flow must be treated as a wave guide. The attenuation, in decibels, of the opening is related to its size by the following formula:

$0.0046 \times l \times F \times \sqrt{(5900/(F \times gap))^2 - 1}$

in which F is the frequency in MHz, and l is the length and gap the width of the opening, both in centimeters.

Ground Interconnections within the System

Once the separate ground elements had been defined, we began to formulate an orderly interconnection strategy for the main computer that would not compromise the system's performance. We used the same return path for both the logic and the dc power because there was no dichotomy in the requirements for both returns. In the VAX 8800, the junction of these returns comes at the point where the horizontal


bus bar (return) is bolted to the inner layers of the logic backplane. (See Figure 5.)

Digital's internal standards, which meet all the applicable international regulations, mandate that the dc power return be connected to the safety ground. This connection must be able to withstand the short-circuit current of the dc regulator output. (In certain cases it may be desirable to insert a frequency-dependent impedance in series with this connection to "isolate at frequency" an element of the system. That could be done when creating a single-point ground system, directly referenced to the I/O and chassis, or a controlled hybrid-ground system.)

In the VAX 8800 CPU, the dc output could, under fault conditions, produce approximately 400 amperes. Thus the interconnection must handle this high fault current. This interconnection was accomplished by bolting the junction node of the combined dc-power and logic return to the chassis for the entire length of the horizontal bus bar. This portion of the chassis was chosen as the connection point because it was not used as a conductor for any other high-frequency currents.

In summary, the grounding approach we used for the 8800 featured the following design points:

• The logic and dc return and the chassis ground are connected together at the horizontal power-return bus.

• The power-system outputs and the chassis ground are isolated from ground at high frequencies by high impedances using lossy ferrite inductors. DC currents and line-frequency (50/60 Hz) fault currents may thus flow unimpeded.

• Particular care was taken to minimize the flow of logic-return currents through the system chassis, thus isolating the peripheral boxes (CI750, BA11AW, etc.) from the system chassis ground. Insulated chassis slides, shunted by lossy ferrite inductors, accomplished that isolation. Although there are still common-mode currents with the ferrite inductors, they reduce unwanted common-mode noise voltages that can couple into circuits through parasitic inductances. That is a far worse problem, as we demonstrated to our own chagrin.

• The I/O panel bulkhead and the logic and power returns for the VAXBI bus and memory backplanes are tightly bonded to the single-point ground at the CPU power-return bus.

• The elimination of circulating noise and logic currents through the chassis will maximize the effectiveness of the shielded cabinet as an attenuator of radiated energy.

The implementation of this approach is shown in Figure 6.

Expansion of Grounding

Once the main processor's grounding had been defined, we had to deal with grounds between the external elements, such as the I/O subsystem. The VAX 8800 system can accommodate a large array of I/O devices by utilizing the VAXBI architecture. The H9652 EC-ED cab has provisions for two expansion boxes, the CI750 and the BA11AW. These boxes are self-contained and have integral power supplies, logic backplanes, and interconnects. In keeping with our grounding architecture, we isolated these boxes from the chassis ground by using low-Q inductances. The signal/logic ground was then established by means of cables to the VAXBI-to-CPU backplane. This scheme ensures that the chassis is not used as a signal/logic return.

System to System Grounding

Grouping systems together or networking them has a large impact on system noise and the subsequent grounding techniques to eliminate it. In terms of the signal-to-noise ratio and from the RF aspect of grounding, a networked system can be divided into two cases: the dense network and the dispersed network.

Dense Network

A dense network is a group of computers or systems with associated support hardware that is located within one area, either an office or a computer room. This area is likely to contain systems from different vendors as well as phone-switching networks, experimental equipment, or industrial controllers and monitors. All these devices share a common ground that could be a grid or simply a branch ground as part of their safety ground. This connection also provides a signal reference between interconnecting devices in the area through the chassis and


[Figure 6: system ground schematic showing the ac front-end cabinet and the main cabinet. Legible labels: CI750; ac cord grounds; VAXBI; battery backup; memory; MPS ac/dc converters; 876 power control; MPS chassis (isolated); cabinet frame; console (PRO 380); isolated box frame; I/O peripheral frame; H7170 frame; LA50; CSP frame.]

NOTES:

1. The return of the power bus for the CPU backplane is the common connection (single-point ground) for the logic, power, and chassis (RF) grounds.

2. The 1200-pF capacitance is the typical parasitic capacitance between the isolated MPS power-supply rack and the main chassis.

3. The indicated common-mode inductors, or baluns, are composed of lossy ferrite cores surrounding the bus supply and return conductors.

4. The inductors shown between the "e" ground or chassis and the subassemblies are lossy inductors designed to increase the RF impedance and to prevent the circulation of noise currents in the chassis.

5. The indicated inductance is the stray, or leakage, inductance in the power cord isolating the cabinets from ac, or utility, ground.

Figure 6 System Ground Schematic
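The idea of isolating grounds "at frequency" can be made concrete with a small calculation of reactance versus frequency. The sketch below uses the 1.2-millihenry safety-ground choke mentioned in the text and the 1200-pF parasitic capacitance from the notes to Figure 6. It treats both as ideal, unsaturated, lossless elements, which is a simplifying assumption, and 10 MHz is chosen as a representative noise frequency. The function names are illustrative.

```python
import math

L_CHOKE = 1.2e-3        # safety-ground choke, henries (from the text)
C_PARASITIC = 1200e-12  # MPS rack to chassis capacitance, farads (Figure 6, note 2)


def inductive_reactance(freq_hz, inductance):
    """Magnitude of an ideal inductor's reactance, 2*pi*F*L, in ohms."""
    return 2.0 * math.pi * freq_hz * inductance


def capacitive_reactance(freq_hz, capacitance):
    """Magnitude of an ideal capacitor's reactance, 1/(2*pi*F*C), in ohms."""
    return 1.0 / (2.0 * math.pi * freq_hz * capacitance)


if __name__ == "__main__":
    for f in (60.0, 10e6):  # line frequency and an ASSUMED representative noise frequency
        x_l = inductive_reactance(f, L_CHOKE)
        x_c = capacitive_reactance(f, C_PARASITIC)
        print(f"{f:>10.0f} Hz: choke {x_l:12.2f} ohms, "
              f"parasitic capacitance {x_c:12.2f} ohms")
```

Under these idealizations the choke presents well under 1 ohm at 60 Hz, so fault currents are not impeded, but tens of kilohms at 10 MHz. The 1200-pF parasitic capacitance, by contrast, falls to roughly 13 ohms at 10 MHz, which shows how stray capacitance can bypass an intended high-frequency isolation.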

power line ground in a complex way. All these devices can generate high-frequency currents that flow into the ground. These currents must flow through the complex impedance of the grid where, consequently, RF voltages can develop. Under those conditions the ground would act as a noise injection point rather than a stable reference.

Dispersed Network

The dispersed network is an interconnection of computers or systems spread over a wide area, perhaps residing on different floors of a building or in different buildings altogether. Communication on this scale cannot depend on a mutual RF ground because it cannot be reasonably established. In this case, communication must be accomplished by means of either transformer-coupled circuits, optical links, or differential driver/receiver logic.

Both types of networks illustrate the fact that system networking cannot, and in some cases should not, be accomplished by attempting to create an absolute ground reference to the network.

System to Peripheral Grounding

As a system expands with the addition of peripheral devices, such as disk drives, printers, and LANs, the ground system must be viewed as a large hybrid arrangement. Interconnecting these devices must be predicated on the ground-current characteristics (signature) and the I/O connections of these devices to the system.

This signature is particularly important when connecting devices that were designed to be used as small, standalone applications. Their designs may have involved decreased line-filtering capabilities and minimally sized chokes for ground isolation or perhaps none at all. It is imperative that such factors be considered when connecting peripheral devices to a large system.

Summary

We now offer some conclusions based on our recent experiences with the VAX 8800 and other new systems. These conclusions take the form of recommendations for minimizing noise-related problems in any computer system.

Ground Noise Current Signature

It is important to identify the spectrum of ground-conducted noise for each subsystem element. This noise depends on parasitic elements in the circuits and electromechanical structure. Therefore, this information is best obtained empirically by measurements on the actual hardware. The noise current amplitudes and fundamental frequencies should be measured on cable shields, chassis grounds, I/O logic returns, and power inputs.

Segregation of System Ground Networks

A ground system schematic should be developed for each particular subsystem. The interconnection of ground types will be based on the intended system application. As a general rule, the ground types should be segregated to account for the finite amplitudes and often unpredictable paths of the noise currents. This segregation of grounds (e.g., power, chassis, and safety grounds) can be accomplished by carefully choosing the frequency-dependent impedances. These impedances are lossy ferrite inductors placed in series with the appropriate ground connection.

Appropriate Signal and Power Interconnect

The optimal signal interconnections are designed as controlled-impedance transmission lines with each signal and its return path closely coupled and having equal impedance to the chassis ground. Depending on the noise sensitivity, data rate, and interconnect length, the implementation can range from coaxial cables with overall shields to ground-plane ribbon cables to ribbon cables with alternate ground/signal pairs. Even the crudest, slowest signal line that relies on chassis ground for a signal return is doomed to failure if it is sensitive to noise.

High-performance data lines should certainly be designed with low-impedance differential line drivers and receivers, either directly coupled or transformer coupled. Single-ended line drivers and receivers may be acceptable within a subsystem in which the noise between grounds is low and controlled. Communication through unbuffered TTL outputs and inputs is never acceptable when leaving a subsystem backplane.

The initial cost of and board space needed for proper line drivers and receivers are more than justified in today's distributed computing environment. Their use increases reliability and decreases start-up problems. The power interconnects should be designed with minimum inductance and the lowest high-frequency characteristic impedance that is reasonable. The circulating path of supply and return power currents should be kept as small as possible. This design allows better power-system transient performance and ensures the existence of minimal radiated magnetic fields.

Notes

1. A short circuit between the high-voltage primary and the low-voltage secondary could produce lethal voltages referenced to the chassis ground at accessible points within the computer. With this shield, however, the short will produce a high fault current to the chassis. That current will open various protective devices, such as fuses and circuit breakers, that render the system safe in the event of a fault.

Appendix

Determining Skin Depth

To calculate the impedance of a given conduc-

weight, which for copper would be $-5.46 \times 10^{-6} \times 4 \times \pi \times 8.89/63.54$, which equals $-0.960 \times 10^{-5}$.

3. The resulting figure must now be converted to relative permeability by adding 1.0 to the susceptibility factor. For copper, this value would be $1.0 - 0.960 \times 10^{-5}$, which equals 0.9999904.

4. The relative permeability must be converted to permeability by multiplying the value from step 3 above by the permeability of air ($4 \times \pi \times 10^{-7}$). For copper, this value would be $0.9999904 \times 1.25663 \times 10^{-6}$, which equals $1.25662 \times 10^{-6}$.

5. The next piece of information needed is the conductivity of the material used. This value must be in the form of siemens per meter, although most listings will be in ohm-centimeters. To convert, multiply the table entry by $1 \times 10^{-2}$ and then take the reciprocal. For annealed copper, this value is $1/(1.7241 \times 10^{-6} \times 1 \times 10^{-2})$, which equals $5.8001 \times 10^{7}$.

6.
The skin depth can then be determined tor, the depth of current penetration - or skin by the relationship lj(1r X frequency of depth - in a conductor must be calculated concern X conductivity X permeabil­ first . To do that, a designer must perform the ityJ'1'. The result can be manipulated to the form of 1 j conductivity per­ fo llowing steps: ( 1r X X meability) '1'/(frequency of concern)'t'­ 1. Determine the type of metal of which the For copper, this value is lj(1r X 5.8001 conductor is made (i.e., copper, zinc, 6 X 107 X 1.25662 X 10- )'". which etc.). equals 0.06608/(frequency of concern) '1'. 2. Look up in a reference table the magnetic For example, if the frequency of concern susceptibility of the material. (The CRC were 1 KHz, then the skin depth would Handbook of Chemistry and Physics be 2.089 X 1 o-3 meters, or 2.089 mil· contains tables of this nature.) Two types limeters, deep. of listings of susceptibility are commonly If the frequency of concern were 50 KHz, used. The first type gives values of then the skin depth would be 295 micro­ specific susceptibility that must be con­ meters. verted by multiplying the value by 4 X 1r X density of material, called P. For cop­ per, this value would be -0.086 X IQ-(, X 4 X 1r X 8.89, which equals -0.960 X 10-s. The second type uses susceptibility in one gram formula weight. This value must be converted by multiplying it by 4

X 1r X density of material or molecular
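The six steps above can be checked numerically. The following is a minimal Python sketch, not part of the original appendix: the function and variable names are illustrative assumptions, and the copper constants are the ones quoted in the steps.

```python
import math

# Permeability of air (free space) used in step 4, in henries per meter.
MU_0 = 4 * math.pi * 1e-7


def relative_permeability(specific_susceptibility, density_g_per_cm3):
    """Steps 2 and 3: convert a specific susceptibility and add 1.0."""
    susceptibility = specific_susceptibility * 4 * math.pi * density_g_per_cm3
    return 1.0 + susceptibility


def conductivity_from_resistivity(resistivity_ohm_cm):
    """Step 5: ohm-centimeters -> siemens per meter."""
    return 1.0 / (resistivity_ohm_cm * 1e-2)


def skin_depth(frequency_hz, conductivity_s_per_m, permeability_h_per_m):
    """Step 6: skin depth in meters."""
    return 1.0 / math.sqrt(
        math.pi * frequency_hz * conductivity_s_per_m * permeability_h_per_m
    )


# Copper values quoted in the appendix (hypothetical variable names).
mu_r = relative_permeability(-0.086e-6, 8.89)
mu = mu_r * MU_0
sigma = conductivity_from_resistivity(1.7241e-6)

depth_1khz = skin_depth(1e3, sigma, mu)
depth_50khz = skin_depth(50e3, sigma, mu)

if __name__ == "__main__":
    print(f"relative permeability: {mu_r:.7f}")
    print(f"conductivity (S/m):    {sigma:.4e}")
    print(f"skin depth at 1 kHz:   {depth_1khz * 1e3:.3f} mm")
    print(f"skin depth at 50 kHz:  {depth_50khz * 1e6:.1f} micrometers")
```

For other metals, only the susceptibility, density, and resistivity arguments need to change; the frequency dependence remains an inverse square root.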

Cheryl A. Wiecek

The Simulation of Processor Performance for the VAX 8800 Family

An effort was initiated in the fall of 1981 to simulate the performance of the processor design for the VAX 8800 family of computer systems. That simulation stayed current with the changing design and continues to be used today for studies associated with developing VAX processors. This paper discusses why this simulation was done, how it was structured, and what was simulated. Since the results generated are quite extensive and detailed, only the conclusions from these studies are presented here. What was learned from the model and how it affected the processor design are particularly emphasized.

Many levels of simulation are done within processor development projects well before any actual hardware is built. Structural models at the circuit and gate levels are used in tasks such as verifying timing and developing diagnostic tests. Behavioral models at the function level are useful for verifying processor instruction microcode. Another useful class of models simulates performance at the microcycle level. Such models look at a processor's design as a collection of hardware resources that must be managed. These models are most useful for gathering design trade-off information and verifying the design performance estimates. By emphasizing the key hardware resources and how they interact, performance simulators can

• Focus on how those resources are being used

• Indicate how well they support the required activities

• Provide a high-level view of the interactions in the processor system

This paper describes the performance simulator used on the project that developed the VAX 8800 family of computer systems. This modeling project began in the fall of 1981, and the simulator continues to be used today to study alternatives for new VAX processor designs. The following two sections discuss how the simulator was designed and what was simulated. The third section highlights the results and discusses what was learned from them.

Methodology

The overall structure of the performance model mirrors the structure used previously for the performance simulation of a PDP-11 processor design.[1] The model contains three parts, all developed as separate entities:

• The instruction stream that is acted on by the processor resources

• The microcode that directs instruction execution

• The simulation of the processor resources and timing

These three parts are then combined to generate simulation results. The tasks performed to develop each part are discussed in the following section.

Workload Model

The most appropriate model for the workload fed to the simulator is the stream of VAX instructions from typical programs being executed. Information about each executed instruction is required to obtain performance data at the microcycle level about the processor and its resources. The software used to extract these execution streams had already been developed for a previous project. That software is essentially a debugger that uses the VAX T-bit to generate a software trap after the execution of each instruction in the traced program.[2] That tracing permits the collection of the next instruction's operation code, the addressing modes and registers of the operand specifiers, the read and write references, and the operand values.

The task of choosing which programs to trace was bounded by a number of requirements and constraints. One requirement was to provide some initial performance estimates for the VAX 8800 family processor. Those estimates emphasized integer, logical, and floating-point operations in CPU-intensive programs. Another requirement was to select programs that exercised the processor resources that we wanted to model, especially the cache subsystem, where capturing best-case, typical, and worst-case scenarios was important.

All the constraints involved the programs from which instructions were traced. A reasonable length for these programs was about one-half million VAX macroinstructions, thus permitting the simulator to process them in a reasonable time. We avoided programs that required extensive microcode characterization for instructions that were either less frequently executed or too complex, such as those in the packed decimal group. Moreover, the trace software was limited to processing executing programs that ran in nonprivileged user mode. Thus we had to avoid programs, such as editors, having extensive operating-system service calls, which could only be partially traced.

We chose six programs to drive the model. These included four benchmarks and two popular utilities for creating executable images on VAX systems. The number of iterations in the four benchmarks was shortened proportionally, keeping the mix of instructions constant to retain their representativeness. Three benchmarks were written in FORTRAN: Towers of Hanoi, a prime-number generator, and single-precision Whetstone; one, called Puzzle, was written in PASCAL. The other two programs were a FORTRAN compile and a VAX/VMS link, both written in BLISS. For all their constraints, these programs exercised the model well. The accuracy of the performance estimates was confirmed later by measurements on a prototype machine.

Microcode Model

How microcoded instruction control is characterized has a significant impact on both the speed and results of a processor performance simulator. For example, creating a model at a very detailed level permits a finer analysis of the results, but takes a long time to develop and run. Therefore, we had to decide what the trade-off should be between time and detail. We also wanted to stay current with the latest developments in the processor microcode, which we knew would change significantly during the project. With all that in mind, we decided to use the latest version of the actual microcode sources as the input to a unique process, partially automated, that extracted the information needed by the simulator. This strategy allowed us to ignore details that were not required by the simulator, as well as to keep up with microcode revisions as they were released. A useful by-product of this approach was the ability to produce microPC histograms with the simulator. This information helped to explain how the microcode was being used.

One step in modeling the microcode is to determine the control fields that are key to the processor's performance. Only a small number of the defined fields are actually needed. Many microwords are effectively no-operation instructions for the simulated processor pipeline. Table 1 contains the microword key for the performance simulator. Each microword has three fields: SRC, ALU, and DST. In any microword, each field has a command subfield and up to three operand subfields. (The address operands generated by the trace software are actually extracted as both the traced program and the simulator are being run. The other operands and commands are extracted from the microcode prior to simulation execution.)

Table 1  Microword Key to the Performance Simulator

Field   Command Description                                       Operands
Any     No operation performed.                                   None
SRC     Stall if the memory data registers (MDRs) specified by    ASRC, BSRC
        ASRC and BSRC are not yet valid for input to the
        arithmetic logic unit (ALU).
ALU     Send a cache arbitration signal and stall the pipeline    None
        if it is not the winner.
DST     Send the cache a read request for x Bytes starting at     MDR number,
        Address, and set MDR number to valid when the data is     Bytes, Address
        available.
DST     Send the cache a write request with x Bytes of data       Signal, Bytes,
        starting at Address. The value of Signal determines       Address
        whether hardware or microcode control sends the write
        buffer data to memory.
DST     Conditionally flush the IB and provide the cache with     Address
        a new Address for prefetching IB data.
DST     Send the cache notification of a new address for          None
        prefetching IB data once the decoder handles the
        IB-address page cross.
DST     Send the cache a read/write probe request.                None

Before any actual microcode had been developed, simulated microwords were written manually from microcode flows provided by the group developing the firmware. Once the actual microcode was available, a significant portion of the performance simulation microcode was generated automatically by mapping real fields to the small number of fields that the simulator required. This automatic mapping of processor microcode to that used in the simulator was complicated by several issues.

One problem was that the microbranching logic required additional information at simulation runtime to decide which branch path to take. To solve that problem, the firmware group flagged microbranches by inserting comments in their microcode. Those comments were then caught by the microcode translation software, which marked them for processing at runtime. Another problem was that some VAX macroinstructions had not been coded yet, and others were more complicated than required for simulation. (Many of the VAX floating-point instructions were in this category.) In those cases, sequences of handwritten microcode were used.

Processor Simulation Model

The structure of the processor simulation model was driven by the need to provide timely answers to questions asked by the designers. The results had to be generated, verified, and distributed as quickly as possible to be most useful in design trade-off decisions. The requirements we considered most important were the following:

• The simulator must have a modular structure that facilitates replacing, reconfiguring, and reusing routines while minimizing the runtime overhead.

• A general-purpose control mechanism is needed to manage communication and synchronization between a number of independent tasks running in parallel.

• Extensive and flexible I/O features are needed to generate cycle-by-cycle traces and reports with simulated performance statistics.

• The ratio of simulated time to real time must not be a bottleneck to obtaining results.

We chose a structure that favored changing and reusing parts of the simulator, but which ran slower, over one that ran faster, but was hard to change. We did this knowing that the simulator would be used to try many design ideas that would eventually be discarded. The simulator also had many parameters built in so that different configurations and timings could be tried. The structure we chose could be used to evaluate many design alternatives. Since this was the first VAX processor to be modeled this way, we had to design and build all the software for the simulator; none of it could be borrowed from other projects. Therefore, we knew that producing results quickly would be difficult.

The structure chosen required that the simulated processor be partitioned into a number of independent components, each modeled by a deterministic state-machine. That machine defined the actions to be done when each state was entered, and the conditions to be evaluated for deciding the next state transition. This approach had several advantages. The hardware designers could relate easily to state-machine models of their particular designs, even though the states in the simulator sometimes marked performance-related events, not real hardware states. This structure also made it possible to replicate components and reconfigure the original single-processor version of the simulator into a dual-processor version.

A monitor is needed to control the communication, synchronization, execution, and status of these independent state-machine components. For communication between components, only certain types of send and receive operations are used. This restriction allows the component interfaces to be simple and well defined. There are three types of send operations:

1. A targeted send directs source information to a single destination within the current cycle.

2. A broadcasted send directs source information to zero or more destinations within the current cycle.

3. An arbitrated send directs source information to a single destination, stalling execution of the sending component until the information is delivered.

There are two types of receive operations:

1. A targeted receive results in the delivery of source information from a send operation.

2. A collection receive is limited to probing source information from a send operation; this information is used by the model to make decisions.

The monitor keeps two queues for the components: one for component send requests, the other for component receive requests. The monitor also synchronizes send and receive requests on behalf of the components and reports errors when undelivered send or receive entries remain in the queues.

Synchronization between components is achieved using the send, receive, and timing services built into the monitor. The send and receive operations allow the specification of a phase number so that components can send and receive information only at certain intervals within the basic microcycle clock recognized by the monitor. The monitor blocks components from executing while they wait for send or receive requests to be serviced. States within a component can be designated as time sensitive. When the next state to be executed within a component is so designated, that component is blocked from executing until the monitor increments the clock.

Execution proceeds on the basis of one machine cycle. State-machine components are chosen to execute, one at a time, starting at the state at which each was last left. Component execution continues until the required send, receive, or timing service returns control to the monitor. When all components have reached states in which no more activity is possible for the cycle, the monitor will increment the master clock and the execution of components can resume. End-of-simulation and detected-error conditions cause the monitor to generate a report of results by calling each component to execute its report code.

The complete model for the VAX 8800 family processor ran on a VAX-11/780 system and executed about six VAX macroinstructions per CPU second. That translates to a ratio of simulated time to real time of about 90,000 to 1. The control monitor was written in PL/I; the processor state-machine components were written using VAX assembler macros. Once the ADA language had been added to the list of VAX-supported languages, we translated the entire processor performance simulation model into that language. This new simulator is being used for follow-on processor performance studies. The ADA language was chosen because its multitasking features provide excellent support for the control monitor functions that we defined.

Verification of the Simulation Model

An important and often overlooked aspect of developing a performance simulation model is the effort required to verify that the model reflects the actual design. In the early stages of a project, the details of the proposed design are usually communicated by word of mouth. Continuous changes to that original design greatly enlarge the margin for error within a performance simulator. Since wrong performance data is counterproductive, a great deal of our effort went into verifying that the simulation operation and results accurately reflected the current state of the design.

Once the performance simulator produced results, the designers reviewed cycle-by-cycle traces of simulator activity to confirm that the simulator's operation matched the processor design. In addition, we developed a set of short tests that exercised certain key functions. These tests were rerun for each new version of the simulator, and the test results were exhaustively compared to those from the previous version. This procedure was effective in revealing unanticipated interactions and errors due to changes made in both the simulator and the design. As the design progressed, we were able to compare our simulation results with those from a behavioral model used for debugging microcode.
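The version-to-version comparison of test results could be automated along the following lines. This is only an illustrative Python sketch: the paper does not describe the comparison tooling, and the result layout (a dictionary of per-test statistics), the test and statistic names, and the function name are all assumptions.

```python
def compare_runs(previous, current, tolerance=0.0):
    """Return (test, statistic, old, new) for every value that differs.

    `previous` and `current` map a test name to a dictionary of named
    statistics. A statistic present in only one run is reported with
    None for the missing side.
    """
    differences = []
    for test in sorted(set(previous) | set(current)):
        old_stats = previous.get(test, {})
        new_stats = current.get(test, {})
        for stat in sorted(set(old_stats) | set(new_stats)):
            old = old_stats.get(stat)
            new = new_stats.get(stat)
            if old is None or new is None or abs(new - old) > tolerance:
                differences.append((test, stat, old, new))
    return differences


# Hypothetical results from two simulator versions.
previous_results = {"ib_flush": {"cycles": 120, "stalls": 7}}
current_results = {
    "ib_flush": {"cycles": 120, "stalls": 9},
    "write_hit": {"cycles": 40},
}

differences = compare_runs(previous_results, current_results)

if __name__ == "__main__":
    for test, stat, old, new in differences:
        print(f"{test}.{stat}: {old} -> {new}")
```

Reporting every differing statistic, rather than stopping at the first mismatch, matches the exhaustive comparison described above and makes unanticipated interactions easier to spot.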


Eventually, we could compare our results with those from a working prototype system. Because the model tracked the design's evolution closely, these comparisons showed the performance model to be an accurate representation of the design.

Performance Model for the VAX 8800 Family Processor

This section describes the processor hardware resources that were modeled. For each modeled component, there is a short summary describing its function, the information communicated with other components, and the parameters that can be specified at runtime to control simulation configuration and timing. Although some information about the VAX 8800 family processor design is included, reference 3 should be consulted for more detail.

Figure 1 is an overview of the processor performance simulator used for the VAX 8800 family. The various components are represented by circles, the communication paths by arrows. As described earlier, each component is an independent state-machine that communicates with other components using defined send and receive operations.

[Figure 1  Performance Model for the VAX 8800 Family]

Decoder

The decoder state-machine sends the pipeline a microinstruction during every unstalled cycle and detects the end-of-simulation condition. To do those actions, the decoder requests bytes from the instruction buffer (IB), using information provided in the instruction trace. When the IB indicates that the requested bytes are available, the appropriate microcode flow is chosen to start execution. If the IB cannot deliver the requested bytes, then no-operation microinstructions are fed to the decoder. The decoder must also communicate with the cache control. For example, the decoder must resolve any IB-address page crosses detected by the IB prefetch hardware in the cache. Also kept by the decoder is a parameter that controls the number of VAX instructions executed between cache flushes due to context switching.

Pipeline

The pipeline state-machine simulates how microinstructions provided by the decoder are to be executed. During any one cycle, parts of three consecutively queued microinstructions are processed:

• The DST field of the oldest microinstruction

• The ALU field of the next microinstruction

• The SRC field of the microinstruction most recently queued

For every cycle that the pipeline is not stalled, the oldest microinstruction is retired after the command in its DST field has completed. The actions performed by the pipeline are described in Table 1. The pipeline can send flush requests to the IB, and processor read and write requests to the cache (after arbitrating for and winning it). The pipeline also manages the validation of the memory data registers (MDRs). Pipeline stalls that result from those actions are made known to the decoder. The only pipeline parameter the user must enter is the cycle time in nanoseconds, used for calculating performance data at the end of simulation.

Instruction Buffer

The IB state-machine simulates a first-in, first-out (FIFO) cache for VAX instruction stream data. The IB accepts requests for bytes from the decoder and notifies it whether or not the bytes are available. The IB model does not actually store any stream data; however, it does manage the count of valid bytes within IB longwords as that data is shifted in and out. The cache-control component prefetches data for the IB and also notifies the IB of prefetched data whenever no other activity is scheduled for the cache during a cycle. When full, the IB notifies the cache control of that condition. In turn, the IB is notified by the pipeline model when it needs to be flushed due to a change in the instruction stream sequence.

The configuration of the IB is controlled by two parameters: the number of blocks, and the number of bytes per block. For the VAX 8800 family processor, the IB has four blocks, each four bytes long.

Cache Arbiter, Control, and Queues

From the viewpoint of performance, the cache subsystem in the VAX 8800 family processor contains an important set of resources. This cache design was modeled in the simulator by three state-machine components: the cache arbiter, the cache control, and the cache memory-request queues. From the viewpoint of performance simulation, these functions were the most independent ones that could be segregated.

The cache arbiter state-machine collects requests from the three components that require cache service. The first, the pipeline model, sends read/write arbitration signals for the processor. The second, the cache-control model, sends read arbitration signals for a stalled-processor condition. The third, the memory interconnect model, sends memory arbitration signals. During every cycle, the arbiter sends to the cache control the arbitration winner that will have the cache during the next cycle. There is a fixed priority for choosing an arbitration winner. Memory has the highest priority, followed by processor reads and writes of various types; cache IB prefetching (the default) has the lowest priority. The cache-control and memory-request queues models also provide status information used in deciding an arbitration winner. Certain types of stalls result in no winner. The arbiter model requires no parameters to be specified by a user at runtime.

The cache-control state-machine is the center of the performance simulation model in the sense that it communicates with all but one of the other state-machine components. The hardware resources managed include the combined instruction-stream-and-data cache, and a longword delayed-write buffer used to hold write-hit data until it can be written into the cache. Like the IB, the cache control model keeps control and status information only for the cache and the write buffer. During every cycle, the cache control acts on the request chosen during the last cycle by the arbiter. That request can be a refill from memory, a read lookup and the appropriate cache hit or miss activity, or a write to the delayed-write buffer and memory. For a cache-write request, the data in the delayed-write buffer is written to the cache when the next write request is processed, and then only if the address of the buffered write actually hit in the cache. If there are no memory or processor requests, data is prefetched for the IB automatically, by default.

A number of parameters can be specified at runtime within the cache control, most of them specifying the configuration of the cache. Such configuration parameters include

• Switching the cache on or off

• The cache size in bytes

• The set size

• The block size in bytes

• The block fill size in bytes

• The block replacement algorithm (random, least recently used, or FIFO)

• The memory updating algorithm (write back or write through)

• Allocation for write misses

Control does not exist for all possible cache options in the processor model for the VAX 8800 family, but the cache routines do support them. The implemented cache configuration is 64KB, direct mapped with 64-byte blocks and a 32-byte block fill (done as two separate 16-byte refill sequences). It features write-through memory updating and no allocation for write misses. For study purposes, another parameter was included that allows either one- or two-cycle read hits to the cache. The VAX 8800 family processor design implements one-cycle cache read hits.

The cache memory-request queues state-machine manages the IB read-miss queue, the processor read-miss queue, and the write-buffer queue. The IB read-miss queue has two elements, thus allowing two outstanding misses for IB data. A third outstanding miss will replace the second one, thus avoiding a pipeline stall. The processor read-miss queue has one element; therefore, two outstanding read misses will stall the pipeline. However, processor read hits are allowed to continue with one outstanding read miss. The write-buffer queue consists of two octaword (16-byte) elements. Consecutive writes within the same octaword are buffered until an event forces data in the write buffer to be sent to memory. That event can be encountering either a write that is not in the same octaword or a microcode control command. The cache control sends read-miss and write requests to the appropriate queue. If a queue is full, a signal tells the cache control that no more requests can be accepted.

From the cache queues, requests to memory are generated and sent to the memory interconnect after the arbitration for that interconnect has been won. These requests are prioritized to facilitate choosing which of three possible requests will be sent to the memory interconnect at any point in time. To maintain the ranking, a two-bit counter will increment only on the appearance of a write following a read. The request chosen is the one with the lowest rank count. If two requests have the same ranking, priority will be given first to the write, then to the processor read, and finally to the IB read. The cache queues component has one parameter that can be specified at runtime: the number of cycles that a request ready to be sent to the memory interconnect must remain queued. The final processor implementation required only one cycle, although this timing was not known when the model was built.

Memory Interconnect

The memory interconnect state-machine handles requests between the cache queues and memory. Transactions requiring one or more cycles on the bus include cache-refill data, in octaword packets, from memory; processor-write requests of up to an octaword in size; and processor data- or instruction-read requests for 32 bytes (returned from memory as two octaword packets). Until transmitted, each transaction "owns" the bus. A one-cycle settle time is required between transactions as well. Arbitration for the bus occurs during every cycle to choose a winner for the next cycle. Priority is given first to the current transaction holding the bus, then to the one-cycle settle time, then to memory, and finally to any pending write or read from the cache. A cache request to memory is queued during the cycle after the request was transmitted on the bus. The timing of subsequent cache requests for memory is controlled by the sum of two parameters specified at runtime. These parameters are

• The number of cycles between the time a cache request transmits on the interconnect and the time the cache receives an acknowledgment from the bus

• The number of cycles between the time the cache receives the bus acknowledgment and the time the next cache request can transmit on the bus

The VAX 8800 family processor implementation has a value of two for each parameter, although this timing had not been determined when the model was created. Several other parameters were included in the memory interconnect state-machine for study purposes. The one-cycle settle time can be enabled or disabled, and the interconnect can acknowledge configurations with either one or two processors. We also included the capability to slow the memory subsystem, relative to the processor/cache request timing, by either two or three times.

Memory

We had considered modeling in detail the designs for both the memory controller and the array module. The effort required was so substantial, however, that we first modeled only the best- and worst-case scenarios. The ensuing results indicated that extra detail in the model would not yield correspondingly enlightening information; therefore, the memory state-machine models only best- and worst-case memory performance. The choice of best- or worst-case is a parameter specified by the user at runtime.
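The fixed arbitration order and the request spacing described for the memory interconnect can be summarized in a short sketch. The Python below is an illustration only, not the simulator's code; the names `Winner`, `arbitrate`, and `cache_request_spacing`, and the explicit idle outcome when nothing is requesting the bus, are assumptions.

```python
from enum import Enum


class Winner(Enum):
    CURRENT_TRANSACTION = "current transaction"
    SETTLE = "settle cycle"
    MEMORY = "memory"
    CACHE = "cache request"
    IDLE = "idle"  # assumed outcome when no one wants the bus


def arbitrate(transaction_in_progress, settle_pending,
              memory_request, cache_request, settle_enabled=True):
    """Choose the bus owner for the next cycle in the stated priority order."""
    if transaction_in_progress:
        return Winner.CURRENT_TRANSACTION
    if settle_enabled and settle_pending:
        return Winner.SETTLE
    if memory_request:
        return Winner.MEMORY
    if cache_request:
        return Winner.CACHE
    return Winner.IDLE


def cache_request_spacing(cycles_to_ack=2, cycles_after_ack=2):
    """Cycles between successive cache requests: the sum of the two parameters."""
    return cycles_to_ack + cycles_after_ack


if __name__ == "__main__":
    # Memory and the cache both want the bus right after a transaction ends.
    print(arbitrate(False, True, True, True))
    print(arbitrate(False, True, True, True, settle_enabled=False))
    print(cache_request_spacing())
```

The `settle_enabled` flag mirrors the study parameter that enables or disables the one-cycle settle time; the default arguments of `cache_request_spacing` are the value of two quoted for each parameter.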

106 Digital Technical Journal No. 4 February 1987 New Products

The best-case memory model assumes memory is never busy and can take requests from the memory interconnect whenever they are generated. Thus instead of the eight memory-array modules the processor is limited to, this model effectively simulates an infinite number of modules with no contention for specific ones. The only parameter the user must specify is the number of cycles between the time the read request reaches memory and the time memory arbitrates for the memory interconnect to return requested read data to the cache. The implementation has a value of approximately 14 cycles, which reflects the memory read latency. Write requests for memory are simply delivered; no further action has to be taken.

The worst-case memory model assumes only one array module is available to handle read and write requests. Requests for memory are queued in a buffer for processing by the array module. When all queue elements have requests, a memory-busy signal will inhibit the memory interconnect from sending additional requests until a queue element is available. A number of parameters can be specified by the user at runtime to control the timing of requests within the memory controller and the array module. One parameter is the length of the memory-request queue, a value from one to eight. The processor design used a value of three for this queue length. The other parameters are the numbers of cycles required for various operations, as described below. The actual value specified for the processor design is contained between the parentheses following each parameter's description. These parameters are

• The time a request must be queued before processing in the array module (2 cycles)

• The time required by the array module to process a read (12 cycles)

• The time required by the array module to process a write (9 cycles)

• The time required by the array module to process read data for a masked write (2 cycles)

• The time required for a refresh of the array module (12 cycles)

• The time between array refresh signals (300 cycles)

Processor Resources Not Modeled

In addition to some of the microcode and parts of the memory subsystem, several other parts of the design are not simulated. The translation buffer that contains virtual-to-physical address mappings is not modeled. (The design has a 1024-entry, direct-mapped translation buffer, half of it for system-space addresses, the other half for process-space addresses.)³ The logic and microcode that handle alignment traps are not modeled. Any unaligned addresses associated with processor read and write requests for the cache are automatically aligned by the simulator. Finally, no I/O traffic is generated on the memory interconnect to compete with processor and memory traffic. These omissions could impact the simulated performance of some processor designs for some workloads. However, their exclusion from this model did not impact the performance estimates generated for the processor with the set of workload programs used.

Evolution of the Model

Before presenting studies done with the processor performance simulator, we should examine how the model evolved. Our most significant achievement was to continue developing the model even as project goals changed and as the design materialized over time. This continual adjustment resulted in a model that reflected the latest design and could be used in new design studies.

The first version of the simulator was not very detailed. It included the pipeline, the instruction buffer, the cache arbiter, a cache shell, and some hand-coded microcode for evaluating operand specifiers and for a limited number of VAX instructions. No lookup was done in the cache shell. A parameter specified the hit and miss percentages desired, and random number generation was used to decide the lookup results. Runs were made with both two and four IB longwords, and 90 and 100 percent hit rates in the cache; the workload was the Towers of Hanoi benchmark. Two important results were indicated: first, the performance was in line with the stated goals; second, it was desirable to have more than two IB longwords.
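The cache shell of the first simulator version, in which no lookup is performed and a random draw against a configured hit percentage decides each result, can be sketched in a few lines of Python. This is only an illustration of the idea as described in the text; the class name `CacheShell`, its fields, and the use of a seeded generator are assumptions, not details of the original tool.

```python
import random


class CacheShell:
    """Stand-in cache: no tags or data, only a configured hit probability."""

    def __init__(self, hit_rate, seed=0):
        if not 0.0 <= hit_rate <= 1.0:
            raise ValueError("hit_rate must be between 0.0 and 1.0")
        self.hit_rate = hit_rate
        self._rng = random.Random(seed)  # seeded so that runs are repeatable
        self.lookups = 0
        self.hits = 0

    def lookup(self):
        """Return True for a simulated hit, False for a simulated miss."""
        self.lookups += 1
        # random() is in [0.0, 1.0), so a hit_rate of 1.0 always hits.
        hit = self._rng.random() < self.hit_rate
        if hit:
            self.hits += 1
        return hit


if __name__ == "__main__":
    shell = CacheShell(hit_rate=0.90)
    for _ in range(10_000):
        shell.lookup()
    print(shell.hits / shell.lookups)
```

Running the same workload once with `hit_rate=0.90` and once with `hit_rate=1.0` mirrors the two hit rates mentioned in the text.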

Digital Technical Journal No. 4 February 1987 107 The Simulation of Processor Performance for the VAX 8800 Family

At that point, a more aggressive set of design goals was set by engineering management. Therefore, the next version of the simulator modeled more of the detailed implementation that was evolving. This detail included the decoder, the cache-control and memory-request queues, and the memory interconnect. We developed microcode translation software and used the first base-level microcode released to control the model. Some custom coding was done to accommodate single-precision floating point instructions that were needed. Both hardware and microcode bugs were uncovered during the design and verification of this simulator version, thus increasing its value to the designers.

Performance Simulation Results and Studies

Using the simulator just described, we carried out a number of studies to verify the processor's performance and to examine design alternatives. Since the detailed results are very extensive, this concluding section outlines the kinds of performance information gathered and highlights a number of studies that were done.

Performance Information Gathered

Information provided by a performance simulator falls into four areas:

1. Measuring the performance of a program on an existing processor and then tracing that same program to drive a processor simulator are used to produce a relative performance estimate for the proposed processor. (Of course, this comparison is reasonable only if both processors are implementations of the same architecture.) The information needed to make the comparison includes the following: the total number of instructions executed, the execution time required, and the cycle time on the measured system, as well as the total number of instructions simulated, the total cycles required, and the proposed cycle time on the simulated system. The VAX-11/780 processor was used as the comparison machine for generating performance estimates relative to the VAX 8800 family processor design.

2. Simulating the use of resources within processor system components produces information about how efficient each component is in processing requests and how well the components interact. Knowing what requests are received and what percent of the time component resources are stalled or busy (and why) provides insight into the overall system performance. We found that presenting this detailed information in terms of averages-per-instruction was an effective way of summarizing the activities. This information helped the designers in making hardware design decisions at a low level.

3. Varying the parameter values in a simulator and comparing the results produces useful information to evaluate high-level design and configuration decisions. Since the VAX 8800 family processor design was modeled, a number of studies have been done to evaluate schemes that could be used in new processor designs.

4. Analyzing the instruction stream data from the trace that drives the simulator produces information about how the architecture's instruction set is used. This type of information helps designers decide which optimizations are most beneficial, especially in the microcode flows. Gathering this information generally does not require processor-specific functions in the simulator. Therefore, the simulator does not produce that information. For our purpose, the information was gathered from another package of analysis software. Only individual VAX instruction times that were specific to the VAX 8800 family processor came from the simulator.

Highlights from Simulation Studies

Initially we used the Towers of Hanoi, the prime-number generator, and the single-precision Whetstone benchmark to drive the model. From it we derived results indicating that the performance of the VAX 8800 family processor was between 5 and 5.6 times that of a VAX-11/780 processor.⁴ The designers made one change based on the resource utilization statistics the simulator generated. Cache read hits had required two cycles, rather than the usual one cycle, when the read address also matched a valid delayed-write buffer address. This number was changed to one cycle when the simulator showed the frequency of this event was higher than anticipated.


Once the basic processor design had been successfully modeled, work focused on broadening the microcode coverage and simulating various alternatives. Better microcode coverage allowed more programs to be traced and run through the simulator. We wanted to use more diverse programs, like the FORTRAN compile and the VAX/VMS link, to exercise the design using the simulator. Alternatives such as cache flushing to simulate context switching, the worst-case memory model, and the dual-processor version were also added. To study the model's behavior, we ran many simulations, varying the basic processor configuration and comparing results to detect the effects. Even today, this work continues as new design ideas surface.

The following list shows the VAX 8800 family processor simulation parameters and configurations that were most sensitive from a performance point of view:

• Context switching, simulated by invalidating all cache entries every n VAX instructions, showed a performance degradation ranging from 8 percent when done every 10,000 instructions to 23 percent when done every 2,000 instructions. We chose an interval of 5,000 instructions for the simulator, which is a conservative estimate. (The degradation was 13 percent for 5,000 instructions.)

• A timing requirement of two cycles for read hits in the cache, rather than one cycle as implemented in the VAX 8800 family processor design, degraded the simulated performance by 9 percent.

• The latency time for memory reads decreased performance by about 0.75 percent for each additional cycle of latency.

• The worst-case model for memory, using only one array module, required 14 percent more cycles than the best-case model. (This result contributed to our decision to use only the best and worst cases.)

• A slow memory interconnect and controller relative to the processor degrades the performance gains when a faster processor is used. Doubling the processor speed by cutting the cycle time in half increased performance by only 1.5 times over that of the slower processor with the same memory. Tripling the speed increased performance by only 1.7 times.

• Enhancements made in the FORTRAN compiler for generating code had a great impact on the instruction stream traced, as well as on the performance estimates derived using the FORTRAN benchmarks. This improvement was particularly noticeable for the FORTRAN compiler released with VMS Version 4.

Summary

The development of the VAX 8800 processor performance simulator continued throughout the entire project. The simulator helped to verify the attainment of performance goals and provided performance trade-off information to the designers. The model's results fostered discussions about interfaces, helped the designers to find problems, and uncovered unanticipated interactions. The simulator continues to contribute to current processor design efforts through its use in studying the performance impact of alternatives.

In addition, we learned a number of important lessons that will be useful in designing future simulators. First, it is important to develop the basic processor simulation functions as early as possible in a design project. Having a general-purpose cache model that can be called and controlled from different processor implementation models is one of the most important functions.

Second, defining and developing a monitor to control the various parts of a simulator, apart from implementing the particular design, has significant implications for designers of performance simulators. Having separate control functions allows the implementor to concentrate on understanding the design to be modeled, as well as to take advantage of features provided by the control monitor to debug the model. Separating control from the simulated design, however, does not result in a simulator with the most optimized runtime performance.

Acknowledgments

I had the support of many people in developing the performance simulator for the VAX 8800 family processor. The processor hardware and firmware teams explained the design, reviewed the results, and encouraged the effort. Simon Steely and Mark Firstenberg helped to design and implement the original simulation tools. Peter Craig developed the microcode characterization process and software. Eric Rasmussen created the dual-processor version of the simulator and the ADA performance simulation model.

References

1. C. Wiecek and S. Steely, "Performance Simulation as a Tool in Central Processing Unit Design," Performance Evaluation Review, vol. 11, no. 1 (August 1979): 41-47.

2. T. Leonard, ed., VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-3459E-DP, 1986).

3. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical Journal (February 1987, this issue): 20-33.

4. C. Wiecek, "A Case Study of VAX-11 Instruction Set Usage for Compiler Execution," ACM Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems (March 1982): 177-184.

Stuart J. Farnham
Michael S. Harvey
Kathleen D. Morse

VMS Multiprocessing on the VAX 8800 System

Some features of the VAX 8800 architecture are particularly relevant to multiprocessor operation. Special hardware, not included in the VAX architecture, allows the VMS operating system to use both CPUs in an asymmetric, tightly controlled fashion. The processors operate in a master-slave relationship with one CPU handling all I/O. The hardware handles interprocessor interrupts, cache coherency, and shared memory. VMS uses the interprocessor interrupt in managing operations between the master and slave CPUs. The VMS system also uses interlocked instructions, exception handlers, and traps to handle multiprocessing. These instructions allow events to be scheduled and executed efficiently on both processors.

Every computer system is a combination of hardware and software architectures, the operating system being a direct result of their merger. The same operating system can be implemented on different hardware systems with the same architecture, but a user can access only those features that each set of hardware can support. The most effective merger is the one allowing users of the resulting operating system to make maximum use of all the features designed into both the hardware and software architectures. The VAX 8800 multiprocessor is an example of the result of such an effective merger.

The VAX Architecture and Multiprocessing

Many of the VAX 8800 hardware features important to VMS multiprocessing are defined by the VAX architecture for single-processor and multiprocessor systems alike.² These features include the processor modes, I/O and software interrupts, exception handling, asynchronous system traps (ASTs), and interlocked instructions. This section briefly describes these features, which are discussed in more detail later.

Processor Modes

The VAX architecture defines four modes in which a processor may execute. In order of decreasing levels of privilege, these modes are kernel, executive, supervisor, and user. Most of the critical resource management code in the VMS system is executed in kernel mode; in fact, some instructions can be executed only while in that mode. Two examples of such instructions are LDPCTX and MTPR (move to processor register). LDPCTX loads the context (stacks, page tables, and so on) of a process into a CPU so that the process can execute. MTPR is used, among other things, to enable, disable, or trigger certain interrupts during resource management.

Interrupt and Exception Handling

The VAX architecture supports the immediate servicing of important events by means of a mechanism that can transfer control away from the currently executing process. Events that are primarily relevant to and normally invoke software in the context of the currently executing process are called exceptions. Events that are relevant to other processes, or to the system as a whole, are called interrupts, which are serviced in a system-wide context.² The VMS operating system provides a handler routine for each exception and interrupt defined by the VAX architecture.

Upon system startup, the VMS operating system initializes a system control block (SCB), which defines the locations of the various event handlers, as shown in Figure 1. The SCB contains
an assigned longword that holds the address of the handler for each interrupt and exception serviced by the operating system.

[Figure 1: System Control Block. Entries shown include: translation not valid (page fault) exception; change mode to kernel, executive, and supervisor exceptions; interprocessor interrupt; software interrupt level 1 (unused); software interrupt level 2 (asynchronous system trap delivery); software interrupt level 3 (rescheduling); software interrupt level 15 (XDELTA); 10-millisecond interval timer interrupt.]

Interrupts and exceptions have varying degrees of urgency. Each event has a specific interrupt priority level (IPL) that designates the relative priority of that event. The VAX architecture includes 31 IPLs, divided into 15 software levels (numbered, in hexadecimal, 01 to 0F), and 16 hardware levels (10 to 1F). User applications and system services run at the process level, which may be thought of as IPL 0. Interrupt levels with higher numbers have higher priorities. That is to say, a request at an IPL higher than the processor's current IPL will interrupt immediately; requests at the same or lower levels will be deferred.² The interprocessor interrupt and the 10-millisecond (ms) interval-timer interrupt are examples of hardware interrupts. The rescheduling interrupt and the AST-delivery interrupt are examples of software interrupts.

Software executing in kernel mode posts a software interrupt by setting the appropriate bit in the software interrupt request register (SIRR). A bit exists in the SIRR for each software interrupt level. An interrupt can take place only when the IPL level of the CPU has been lowered below that of the pending interrupt. For example, the handler for the interprocessor interrupt (executing at IPL 20) can post a reschedule event (a software interrupt at IPL 3) by setting the appropriate bit in the SIRR. When the CPU's IPL drops below IPL 3, the IPL 3 interrupt handler is invoked, which is the VMS code that initiates process rescheduling.

This technique allows high IPL code threads to schedule lower IPL functions in a way that allows all potentially interrupted code threads at intermediate IPLs to complete first. Should a higher IPL code merely lower the IPL by force to execute the lower IPL function, any intermediate IPL code threads that had been interrupted would complete out of order, thus breaking the software synchronization.

AST Delivery Mechanism

In any mode, the VAX/VMS system can interrupt a code thread executing at IPL 0, begin a new code thread (also at IPL 0), and then continue the previously interrupted code thread. This mechanism is called "delivering" an AST. The hardware notifies the operating system that an AST is deliverable to the currently executing process by means of an interrupt at IPL 2. (Note that this is the only instance of the VAX hardware posting a software interrupt.) Any process-context code thread that must execute without interruption by an AST has to be executed at IPL 2 or higher. If a deliverable AST is queued to the current process and the IPL of the CPU drops below 2, then an IPL 2 interrupt will be generated. To execute that interrupt, the IPL 2 interrupt handler first verifies that the AST can be delivered and then delivers it to the process, after which the new code thread associated with the particular AST is executed.

An AST code thread is associated by a process with events that are expected to complete asynchronously to the main thread of the process. An example of such an event is an I/O request that, once issued, is handled by the system in parallel with the main thread of the process. Upon I/O completion, the associated AST is delivered, which causes the main thread of the process to be interrupted in favor of the AST's code thread. When an AST is specified for an asynchronous event, it is assigned a particular processor mode.
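The deferral behavior of software interrupts described above (a bit is set in the request register, and the handler runs only once the CPU's IPL has dropped below the pending level) can be illustrated with a small Python model. This is a teaching sketch under simplifying assumptions, not VMS code: the class `Cpu` and its method names are hypothetical, hardware interrupts are not modeled, and the handler's return is modeled simply by restoring the saved IPL.

```python
class Cpu:
    """Toy model of IPL-gated delivery of software interrupts (levels 1 to 15)."""

    def __init__(self):
        self.ipl = 0          # current interrupt priority level
        self.sirr = 0         # bit n set means level n is pending
        self.handlers = {}    # level -> callable taking the Cpu
        self.log = []

    def post_software_interrupt(self, level):
        """Set the request bit; delivery waits until the IPL is low enough."""
        if not 1 <= level <= 15:
            raise ValueError("software interrupt levels are 1 through 15")
        self.sirr |= 1 << level
        self._deliver_pending()

    def set_ipl(self, new_ipl):
        """Change the IPL, then deliver anything that is now eligible."""
        self.ipl = new_ipl
        self._deliver_pending()

    def _deliver_pending(self):
        while True:
            level = self.sirr.bit_length() - 1   # highest pending level, -1 if none
            if level <= self.ipl:
                return                           # nothing above the current IPL
            self.sirr &= ~(1 << level)           # clear the request bit
            saved_ipl = self.ipl
            self.ipl = level                     # handler runs at its own level
            self.handlers[level](self)
            self.ipl = saved_ipl                 # models return from the handler


if __name__ == "__main__":
    cpu = Cpu()
    cpu.handlers[3] = lambda c: c.log.append(("reschedule", c.ipl))
    cpu.handlers[2] = lambda c: c.log.append(("ast_delivery", c.ipl))
    cpu.ipl = 0x14                    # e.g. inside an interprocessor handler
    cpu.post_software_interrupt(3)    # deferred: IPL is too high
    cpu.post_software_interrupt(2)    # deferred as well
    cpu.set_ipl(0)                    # both now run, level 3 before level 2
    print(cpu.log)
```

Because pending requests are always taken highest level first, intermediate-level work completes before lower-level work starts, which is the ordering property the text attributes to this technique.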


When the AST is queued to a process, its delivery is deferred while that process is executing in a more privileged mode than that of the queued AST. For example, when an AST in supervisor mode is queued to a process executing in kernel mode, the AST will not be delivered until the context changes from kernel mode to at least supervisor mode.

Interlocked Instructions

The VAX architecture includes a few instructions that allow synchronous access to locations in memory. Only those instructions will guarantee consistent results if multiple processors want simultaneous access to the same memory location.

For bit manipulations, these interlocked instructions are

• BBCCI - Branch on bit clear and clear interlocked

• BBSSI - Branch on bit set and set interlocked

For arithmetic manipulations, there is ADAWI - Add aligned word interlocked.

For queue manipulation, the instructions are

• INSQHI - Insert at head of queue interlocked

• INSQTI - Insert at tail of queue interlocked

• REMQHI - Remove from head of queue interlocked

• REMQTI - Remove from tail of queue interlocked

These instructions are used extensively in the operating system to provide multiprocessor synchronization. They are also available to user processes to synchronize access to shared application data.

The VAX 8800 System

The specific implementation features of the VAX 8800 multiprocessing system are described in this section. Remember that the 8800 is only one of many implementations of the VAX architecture. Several important hardware features provided by the 8800 are not specified in the VAX architecture but are required for VMS multiprocessing. These hardware features are

• Primary processor access to all peripherals

• Interprocessor interrupts

• Shared main memory

• Cache coherency

VAX 8800 Implementation

The VAX 8800 system consists of two VAX 8800 processors that share main memory by means of a fast memory-system interconnect called the NMI bus. The processor hardware is completely symmetric; that is, either processor can fulfill the role of primary processor for any booted instance of the operating system. Figure 2 is a block diagram of the VAX 8800 system.

[Figure 2: Block Diagram of VAX 8800 System. Blocks shown: console, clock, left CPU, right CPU, NMI, memory, NBI adapters, VAXBI buses, and I/O controllers.]
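Returning briefly to the interlocked bit instructions listed earlier in this section, the following Python sketch shows how a test-and-set primitive of that kind can serve as a simple lock around shared data. It is an emulation for illustration only: a `threading.Lock` stands in for the hardware interlock, and the class `InterlockedBits` and the helpers `acquire` and `release` are hypothetical names rather than VMS routines.

```python
import threading
import time


class InterlockedBits:
    """Emulates interlocked bit test-and-modify on a shared bit field."""

    def __init__(self):
        self._bits = 0
        self._guard = threading.Lock()   # stands in for the hardware interlock

    def bbssi(self, bit):
        """Set the bit; return True if it was already set (branch taken)."""
        with self._guard:
            was_set = bool((self._bits >> bit) & 1)
            self._bits |= 1 << bit
            return was_set

    def bbcci(self, bit):
        """Clear the bit; return True if it was already clear (branch taken)."""
        with self._guard:
            was_clear = not ((self._bits >> bit) & 1)
            self._bits &= ~(1 << bit)
            return was_clear


def acquire(bits, bit):
    """Spin until this caller is the one that changes the bit from 0 to 1."""
    while bits.bbssi(bit):
        time.sleep(0)   # yield so that the current holder can make progress


def release(bits, bit):
    """Clear the lock bit so that another caller can acquire it."""
    bits.bbcci(bit)


if __name__ == "__main__":
    lock_bits = InterlockedBits()
    total = {"n": 0}

    def worker():
        for _ in range(200):
            acquire(lock_bits, 0)
            total["n"] += 1          # protected update of shared data
            release(lock_bits, 0)

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(total["n"])
```

The key property is that the test and the modification happen as one indivisible step, so only one contender can observe the bit as clear and claim the lock.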


There is one console subsystem in the 8800, which is shared by the two CPUs. The console command language, implemented in software in the console subsystem, is a superset of the console functionality specified by the VAX architecture.² Both CPUs can be controlled from the single console terminal. After the system is booted, the console terminal can be used like any other terminal connected to the system.

All I/O devices are connected to the system through VAXBI buses. The 8800 can accommodate up to four VAXBI buses, each of which can accommodate up to 16 nodes, generally I/O controllers. The buses are connected to the NMI by means of the NMI-to-VAXBI adapters, called the NBIs. Each NBI consists of either two or three parts: an NBIA, which is the interface to the NMI; and one or two NBIBs, which are interfaces to the VAXBI buses. An NBIB is one of the 16 nodes on its respective VAXBI bus.

Under VMS multiprocessing, all peripherals are controlled by the first processor to be booted, designated the primary processor. The other processor, the secondary, is prevented from accessing any peripheral devices (disks, terminals, and so on) because the code communicating with those devices runs in kernel mode, an access mode that VMS utilizes only on the primary. Thus, all I/O peripherals will be accessed only by the primary processor. Typically, the left CPU in the VAX 8800 system is chosen as the primary processor. However, console commands are available to designate either CPU as the primary one. A change in that designation takes effect after the next INIT command is received by the console.

The VAX 8800 hardware provides the capability for one processor to interrupt the other. This interruption is accomplished by writing a value of 1 to an internal processor register on the interrupting CPU by means of the privileged MTPR instruction (from kernel mode only). The VMS system uses this mechanism to synchronize the CPUs as different system events occur.

The main memory contains one copy of the VMS software, which depends upon the memory subsystem and interlocked instructions for cache coherency and the consistency of memory contents. The VAX 8800 memory subsystem automatically handles all cache updates: no software logic is needed to maintain consistency between the cache contents in each processor. The 8800 does implement a write buffer to optimize transfers across the NMI to the memory subsystem. Therefore, the interlocked instructions must be issued to flush the necessary write data all the way out to memory. If one processor modifies shared data, the other needs to see the change in a synchronized and timely fashion.

Multiprocessor Implementation Improvements

The VAX 8800 system includes features that are improvements over previous multiprocessing VAX hardware implementations, such as the VAX-11/782 system. Larger amounts of physical memory can be used, all of which is available to the VMS system or the system diagnostics. Moreover, the 8800 cache provides better performance, and the system has a smaller footprint and a better price/performance ratio. Perhaps the most significant fact from a system-management viewpoint is that only one console subsystem with one terminal is needed to control the entire multiprocessor. This single control point has ramifications for setting up the system and running it as a multiprocessor.

The console subsystem has access to the memory configuration of the 8800. With previous multiprocessors, the system manager had to configure memory by manually determining the appropriate data, then entering it into customized command procedures on specially built floppy disks in the console.

The console subsystem of the 8800 also eliminates the need for operator intervention to boot or restart the secondary processor. The VMS system is initially booted on the primary processor and subsequently directs the console subsystem to boot the secondary. Similarly, the console subsystem restarts the VMS system on the primary processor after a power failure. The operating system then directs the console to restart the secondary at the appropriate point in the power-recovery sequence. At no time must the operator be involved in bringing the secondary on line.

The VMS Operating System

The multiprocessing aspects of the VAX architecture and the VAX 8800 implementation provide the underlying hardware support for a totally integrated multiprocessing computer system. This section discusses aspects of the VMS software that are specifically related to multiprocessing as implemented for the 8800. (See
reference 5 for additional multiprocessing information and recommended programming techniques.)

Classification

In multiprocessing terminology, VMS multiprocessing is classified as "asymmetric" and "tightly coupled." An asymmetric system is one in which one CPU, called the primary, has critical system-wide responsibilities, including the management of all the CPU resources. The other CPU, called the secondary, has more restricted responsibilities that exclude the management of critical system resources (including itself). This type of multiprocessing system is also referred to as a "master-slave" arrangement. The other classification, tightly coupled, means that both processors operate in a closely synchronized fashion; if they fail, they fail together.

On a VMS multiprocessing system, both processors share the same copy of the operating system, although some code is executed only by one or the other CPU. Most of the kernel logic in the VMS operating system is executed only by the primary processor. That eliminates the need for the complex synchronization and locking mechanisms that would otherwise be required to protect the system's data structures from access by multiple CPUs.

History of VMS Multiprocessing

VMS multiprocessing was introduced during the development of VMS Version 3.0. At that time, the power of a single VAX-11/780 processor was insufficient to build the VMS executive in a reasonable amount of time. Several constraints were placed on the multiprocessing development effort. It had to involve minimal changes to VMS kernel mode routines, use existing hardware, and have minimal performance impact on single-processor VMS systems.⁶

The first constraint above had the greatest impact on the chosen design of VMS Version 3.0. To achieve fully symmetric multiprocessing, changes would be required throughout the whole operating system to extend IPL synchronization as already implemented by VMS for single-processor operation. Since those changes were too extensive to make, we chose an asymmetric design in which the synchronization of critical code was achieved by limiting that activity to the primary CPU. In this context, existing IPL-based techniques were sufficient to synchronize the code threads in kernel mode.

The second constraint led us to configure a system with two VAX-11/780 CPUs coupled by an MA780 shared memory. In this configuration, each CPU has a separate, independent console subsystem; neither has access to the other's console. This multiprocessor requires special console command files and operator intervention for both CPUs. Similarly, the I/O devices configured on one CPU are inaccessible on the other. Since most of the I/O subsystem code executes in kernel mode, this constraint has the effect of limiting the I/O devices usable by the multiprocessor to those connected to the primary CPU.

The final constraint led to a design that allows multiprocessing code to be inserted dynamically into the running executive. No multiprocessing code is present in a single-processor configuration of VAX/VMS.

The multiprocessing capabilities in VMS Version 3.0 were extended to support the new VAX 8800 system. These extensions take advantage of new functions allowed by the new VAX design. For example, as mentioned earlier, the shared console subsystem allows the secondary processor to be booted from the primary under program control; no operator intervention is required.

Division of Work between Processors

As mentioned earlier, the VMS multiprocessing code is a master-slave implementation. The secondary CPU is required to do whatever work is assigned to it by the primary. The secondary CPU can execute application code only, while the primary CPU handles the I/O, paging, and all resource management, as well as the execution of application code. Since all system services that manage system resources are executed in kernel mode, only the primary CPU is allowed to execute those services. The secondary CPU can execute code that is in any other mode: user, supervisor, or executive. Thus, to be technically accurate in multiprocessing terminology, the VMS multiprocessing system is symmetric for code in the user, supervisor, and executive modes, but asymmetric for code in kernel mode.

The VMS boot code creates a SCB for each processor. As described earlier, the SCB contains
Digital Te cbnical]ournal 115 I No. 4 February <)87 VM S Multiprocessing on the 8800 ,S),stem VA X

vectors to routines that handle various interru pt lnterprocessor Interrupts and exception events. Many VMS interrupt and The VA.-'<. 8800 hardware provides a key feature exception handlers are identical for both the for optimizing the VMS multiprocessing soft­ primary and secondary processors. However. ware: the ability of one processor ro interrupt there arc cases in which exceptions or inter­ the other. This interprocessor interrupt mecha­ rupts must be handled differently, depending nism is used extensively on each CPU by rhe upon which processor receives the event. The VMS operating system. interprocessor interrupt and the software inter­ The prim a ry processor interrupts be sec­ rupt used for rescheduling arc both examples of ondary for several reasons. First, the primart y can system-wide events. Both arc vectored through request an invalidation of a translation buffe r the SCB but require different handlers for each entry correspond ing to a system-space address processor. (Figure shows the various interrupt that is about to be inva lidated on the primary. 1 levels in the SCB.) The AST-delivery software This event forces coherency between the trans­ interrupt and the quantum end , a scheduling lation buffers of both processors with respect to event (described later) , are examples of pro­ mapping changes in the shared system virtual cess-related events that also require different address space . Second, the primary can interrupt exception handlers in the SCB of each CPU. By because it has queued an AST, typically for I/0 separating the hand lers into processor-specific completion, for the process curre ntly executing SCBs, the more costly and difficult task of run­ on the secondary. This event ultimately results time separation within an otherwise commonly in the process being rescheduled onto rhe pri­ executed handler is avoided. mary. where the actual delivery of the AST to Typically, when an exception occurs on the the process can be accomplished. 
Finally, the secondary , that CPU's exception handler primary can ini tiatc and synchronize a system­ "reflects" that exception back ro the primary. wide shutdown or a crash. To do that, the exception handler stores both The secondary processor will interrupt if it the address of the primary's exception handler wants the primary to take back the current pro­ and an appropriate processor status longword cess and find another process for the secondary (PSL) on rhc stack of the current process. The to execute. The secondary will also interrupt if secondary's exception handler rhen saves the it detects a hardware error or if it wa nts to ini­ context of the current process and passes the tiate a system-wide crash. process back to the primary by requesting a rescheduling event. The process eventually exe­ Secondary State Tra nsitions cutes on rhc primary, whose exception handler A st:ne variable is mai ntained to record the cur­ will immediately get control as if the exception re nt state of the secondary processor. The pri­ had occurred there originally. Exception pro­ mary processor uses this state to determ ine cessing is therefore synchronized on a system­ whether or not to schedule work for the sec­ wide basis by virtue of ru nning on the primary ond;uy \X'h en the secondary is booted, the state processor only. variable is already set to !NIT. After booting, the The SCB for the primary CPU consists of mul­ secondary changes the state variable to IDLE. tipie pages of interrupt and exception vectors. During its next reschedule operation, the pri­ The fo rmat of the fi rst page is dcfi ned by the mary wi II notice the I OLE state and attempt to VAX architecture . This page contains vectors for schedule a process for the secondaryto execute. all implcmenrarion-independenr exceptions and After fi nding a process for the secondary, the interrupts, and for a few implementation-depen­ primary sets the state variable to BUSY. The sec­ dent ones. 
Additional pages of vectors are pro­ ondary, which has been conri n ua I l y checking vided for !jO interrupt hand.lcrs. Under VMS the state variable for this transition, then loads multiprocessing, the length of the SCB for the the process's context from memory and sets the secondary CPU is one page. The pages that make state to EXECUTE. up the l/0 subsystem portion of the SCB are nor The secondary will execute irs current pro­ needed on the secondary, which wi II not initiate cess until the process either receives irs quan­ ljO requests nor receive I/0 inrerrupts. rum of CPU time or is blocked by some request

116 DiJ!,ilaf Te chnical journal

No. 4 Fe bmnrJ' 1')87 New Products that must be synchronized in a system-wide con­ ity (which are not the same as interrupt priority text. (That request must be executed in kernel levels). Thirty-one is the highest priority, one mode on the primary.) At this point, the sec­ the lowest; process priorities are subdivided ondary saves the process's context in memory into real-time (priorities 16 to 31) and "nor­ and sets the state to DROP. Using the VAX 8800 mal" (priorities 0 to 15) ranges. The real-time interprocessor interrupt mechanism, the sec­ priorities are used by time-critical applications, ondary then interrupts the primary and requests such as high-speed data acquisition. When a another process to execute. The primary takes process is created, it is assigned a base priority. the saved process back from the secondary, set­ Its priority during execution is guaranteed never ting that CPU's state to IDLE. Thus, the state to drop below that base priority unless either transition has made an entire circuit. that process or another, privileged process Figure 3 shows the state transition diagram for requests it to. the secondary CPU. The primary's paths are Each process is allowed a quantum of CPU marked P and the secondary's paths are marked time (usually 200 ms, equivalent to 20 inter­ S to indicate which processor controls each tran­ rupts of the 1 O-ms interval timer; however, a sition from one state to another. The only state system manager can change the default). Each not explained above is the STOP state, used only time the interval timer interrupts, the interrupt when the secondary is shut down. handler checks to see if the current process has used up its quantum. If so, quantum-end pro­ cessing is initiated. For a process with a priority in the real-time p range , quantu m-end processing consists of awarding a new quantum to the process and allowing it to continue execution. 
A reschedule event will occur when a normal-priority process s p s has used up its quantum. In the latter case, the current process is placed at the end of the scheduling queue maintained for that process's priority (there is one such queue for each pro­ cess priority) , and the process at the head of the

s queue is chosen to execute. The priority of a normal-ra nge process is raised after certain bloc king events have p cleared . For example, to provide good response time to interactive users, a process's priority will be temporarily boosted after the comple­ Figure 3 Secondary CPU State Tra nsitions tion of terminal input. This arrangement results in a tendency for comput e-bound processes to Process Scheduling under the VMS remain at their initial priorities (called the base Op erating Sy stem priority) . However, I/O-bound and interactive Some aspects of process scheduling were dis­ processes, which are blocked more frequently, cussed in the previous section. This section usually attain priorities somewhat higher than describes in greater detail how process schedul­ their base ones. A process's priority is lowered one ing is implemented in the VMS system and point when the process is scheduled to execute, which of its aspects are different in a multipro­ unless it is already running at its base priority. cessing environment 6 Multiprocessor Scheduling Single-Processor Scheduling The primary processor schedules all work on The VMS scheduling algorithm implemented on the system, for both itself and the secondary a single processor is round-robin and preemp­ processor. The scheduling algorithm used for tive, with the highest priority process being exe­ the primary processor is basically the same one cuted first. There are 31 levels of process prior- used in a single-processor system (an important

Digital Te chnical journal 117 1987 No. 4 February VMS Multiprocessing on the VA X 8800 .�pstem

goal in this implementation). For the multipro­ in kernel mode. The primary processor han­ cessor scheduling algorithm, however. certain dles this interrupt by performing a reschedul­ modifications were made to extend the effec­ ing operation. a result, the primaqrproces­ tiveness of process scheduling to utilize the sor sends the processAs it was just executing, additional CPU resources that arc available. The which is no longer in kernel mode, ro the execution environment of the secondary proces­ secondaqr processor in a timely fashion . The sor is more constrained than that of the primary. primary is then free to execute another pro­ Most notably, the kernel-mode code is restricted cess itself. to the primary CPU. The multiprocessor When then: is only one computable process, g algorithm attempts to keep that sec­ schedulin one of the CPUs will remain idle. In this case ondary CPU as fully utilized as possible with the pri mary processor executes the process minimal scheduling overhead in the fo llowing itself even it may be perfectly eligible to exe­ ways: cute on the secondary. Thus the overhead

• The primary processor always schedules a processing associated with the post-kernel process to run on the secondary before mode AST and the subsequent rescheduling scheduling a process for itself to execute. of the secondary can be avoided. This case also has the effe ct of preventing fu ture • The primary processor will schedule a pro­ cess to run on the secondary only if that pro­ thrashing if the process needs access to ker­ cess does not require immediate execution in nel-mode resources, at least until enough kernel mode and does not have an AST computable processes become available to (which requires kernel-mode execution) keep both processors busy. ready to be delivered . This scheduling helps • The system services7 that request event-flag prevent situations in which a process can waits ( SW AJTFR, SWFLAND, and SWFLOR) fl ip-flop between processors , sometimes arc among the most commonly executed ker­ called scheduler thrashing. nel-mode services. 1 If a process running on

• Scheduling is preemptive on the primary pro­ the secondary processor requests an event­ cessor, bur nor on the secondary. Thus, if the flag wait, the VMS operating system will secondary processor is executing one job arrempr avoid rescheduling the process when another job with higher priority onto the roprimar y CPU. The system-service becomes computable, the primary processor dispatcher on the secondary CPU first checks wiLl not interrupt the secondary to give it the to see if the requested flags are already set. If higher priority job. Therefore, processes exe­ so, the process is allowed to continue execut­ cuting on the secondary processor are more ing on the secondaq' without rescheduling. likely to run for their entire quantum than are I f the flags are not set, an interprocessor processes executing on the primary. interrupt req uesting that the process be This approach guarantees only that the placed inro an event-flag wait state (either highest priority process will be executing, LEF or CEF) will be sent to the primary CPU. not the two highest priority processes. To When that processor services the interrupt, it guarantee the latter would require signifi­ again checks to see if the wait request has cantly more interprocessor interrupt traffic been satisfied (the flags have been set) . If so, and is likely to increase thrashing on the the process is allowed to continue executing entire system, and wi.ll especially affect the on the secondaq'. If the flags are still not set, primary's ability to devote processing time to the process is taken our of execution and irs own selected process. placed inro the appropriate wait state. The secondaqr processor then becomes available If all computable processes require execu­ • for scheduling. tion in kernel mode, then the primary proces­ sor cannot schedule a process for the sec­ Al though a process may currently be eligible ondary and wi ll execute a process itself. 
for scheduling onto the secondary, the VMS Shou ld that happen, an AST-delivery interrupt operating system cannot predict whether or nor will be generated automaticaLly after the pri­ that process wi ll re quire kernel-mode services mary processor stops executing the process in the ncar future. If those services are needed,

118 Digital Te chnical jo urnal

No. 4 Februmy J Y87 New Products

the process would have to be rescheduled onto ware, the software design could be modified the primary. For example, utilities that perform to rake advantage of additional capabilities interactive tasks (such as editors or the mail sys­ offered by the advanced hardware design in the tem) require numerous I/0 requests . Other VAX 8800 CPU. types of programs incur many page faults. These processes are therefore poor candidates for exe­ Acknowledgments cution on the secondary. Sometimes a system The authors thank Jill Angel of Digital's Corpo­ manager can predict that certain processes will rate User Publications Group for excellent assis­ have those characteristics, and he or she can tance in organizing this material and for serving take preventive measures to avoid processing on as writing coach, and Lawrence Kenah of the the secondary. VAX/VMS Development Group for technical The following VMS multiprocessing schedul­ assistance in planning this paper. Thanks also to ing features give the system manager manual all others who reviewed this paper and made control over the scheduling of processes onto technical and editOrial suggestions and improve­ the secondary CPU: ments. A SYSGEN parameter exists to limit the maxi­ • References mum priority of processes allowed to execute 1. K. Morse and R. Kinicki, "A Performance on the secondary. s Recall that priority boosts are granted to processes after certain events, Study of Multiprocessor Scheduling such as I/0 completion . These I/O-intensive Algorithms on a VAX - 11/ 782, ' ' Confer­ processes tend to stay at priorities above ence Proceedings of the International those of compute-intensive ones . Therefore, Conference on the Management and setting the SYSGEN parameter a point or two Performance of Computer Sy stems above the default base-process priority may (1985): 280-289. effectively screen out many "unsuitable" 2. VA X 1 Architecture Reference Manual processes from the secondary processor. 
The (BedfI ord : Digital Press, Order No. system manager can set the SYSGEN parame­ EY-3459E-DP, 1987). ter to 0 (indicating no priority screening is to 3. Fu, ). Keller, and K. Haduch, "Aspects occur) or to any value from 1 tO 31, which J of the VAX 8800 C Box Design," Digital sets the priority limit to the specified value. Te chnical journal (February 987, this I • A process can be made ineligible from exe­ issue) : 41-5 1. cuting on the secondary processor by means 4. VA X 11/782 Us er 's Guide (Maynard : of the SET PROCESSjCPU =NOATTACHED Digital Equipment Corporation, Order command. This command prevents user pro­ No. AA-M543A-TE, 1982). cesses that execute only interactive or I/O­ bound utilities from running on the sec­ 5. Guide to Multiprocessing on VA X/ VMS ondary. This fixed-process attribute re mains (Maynard : Digital Equipment Corpora­ in force until it has been changed with a SET tion, Order No. AA-HP69A-TE, 1986). PROCESS/CPU=ATTACHED command.' 6. L. Kenah and S. Bate, VA X/ VMS Internals and Data Structures (Bedford : Digital Summary Press, 1984). The VAX 8800 system running the asymmetric 7. VA X/ VMS Sy stem Services Reference VJVIS operating system provides the most com­ Manual (Maynard : Digital Equipment puting power currently available in the VA X Corporation, Order No. AA-Z50 B-TE, fa mily execute compute-intensive applica­ 1986) . I tions. Theto 8800 represents a merger of a new hardware implementation of the VAX architec­ ture with preexisting multiprocessing capabili­ ties in the VMS operating system. This software uses fe atures of the VAX architecture and the hardware for which it was originally intended . With the advent of new multiprocessing hard-

Gabriel P. Bischoff

Steven S. Greenberg

A Parallel Implementation of the Circuit Simulator SPICE on the VAX 8800 System

Multiprocessors are efficient only if the added computing power can be used to solve specific applications. To demonstrate the VAX 8800 multiprocessor's advantages, the authors converted the circuit simulator SPICE into the parallel program CAYENNE. Their methodology involved using VAX instructions and VMS system services to create and control a series of master and slave processes. Other VMS instructions were used to synchronize these processes and to manage the critical sections. Modifications for parallel processing were made in SPICE's load, LU factorization, and local truncation error phases. The result was that CAYENNE, with two slave processes, ran 1.7 times faster than SPICE.

The realization that two processors might be better than one is not new. Indeed, parallel computing can be traced back to the nineteenth century.1 The advent of very large scale integration opened a variety of new opportunities in the field of parallel processing for specific applications such as image processing and signal processing. Designing and efficiently using a multiprocessor for general-purpose, high-speed computing, however, is more complex.

The majority of today's application programs are written for single-processor machines. To convert these programs to run on multiprocessor machines and achieve close to the ideal speedup, linear with the number of processors, is not an easy task. Two approaches can be adopted to accomplish this conversion task. The first is to design specific compilers that automatically convert programs written for single processors into programs that run efficiently on multiprocessors. The second is to leave to the application programmer the task of writing code that makes efficient use of the multiple processors.

The first approach is the best from a user's point of view; however, good multiprocessor compilers have yet to be designed. The second approach leaves more flexibility to the programmer, who can modify some of the algorithms in the program to have more concurrency. Indeed, the two approaches should not be mutually exclusive: the compiler can detect parallelism at the instruction level whereas the programmer can define parallelism at the algorithmic level. Parallelism on the VAX 8800 system is achieved through the second approach.

We will describe in this paper the features of the VAX architecture and the VMS operating system that we used to implement our methodology for parallel processing. We will present a set of FORTRAN routines we wrote to relieve the application programmer from having to know the inner workings of the VAX architecture and the VMS operating system. We will then describe the modifications made to the circuit simulator SPICE2 to develop a parallel processing implementation, called CAYENNE. Finally, we will give comparative timing results on two simulation examples.

VAX/VMS Primitives for Parallel Processing

The VAX 8800 system is a shared-memory multiprocessor; all communications between processors are performed through sections of shared memory rather than through message passing. When writing parallel code on a shared-memory multiprocessor, a programmer must be aware of two concepts: critical section and processor synchronization. A critical section is a section of shared memory that could be accessed by several processors at the same time if no precautions were taken to prevent that. Allowing simultaneous access to shared memory could result in incorrect data. Processor synchronization is the means by which processors proceed in an orderly fashion. It consists of mechanisms allowing processors to broadcast the beginning or the completion of a task or to wait until a signal is received.

Some VAX instructions and some VMS system routines support the management of critical sections and processor synchronization.4,5 We use three VAX instructions to control access to critical sections:

• BBSSI - Branch on bit set and set interlocked

• BBCCI - Branch on bit clear and clear interlocked

• ADAWI - Add aligned word interlocked

The instructions BBSSI and BBCCI are the VAX implementation of the atomic test-and-set instructions that allow the control of access to critical sections to one process at a time. The instruction ADAWI performs an interlocked integer addition and returns a condition status depending on whether the result is zero or nonzero.

We use three system routines of the VMS operating system to support processor synchronization:

• SETEF - Set event flag

• CLREF - Clear event flag

• WAITFR - Wait for event flag

These routines are services provided by the VMS operating system to synchronize processes. Indeed, the significant entity in the VMS multiprocessor environment is not the processor but the process. A processor is a physical processing unit, whereas a process is a software entity created by the VMS operating system. Multiprocessing is achieved by creating several processes that VMS will assign to available processors. Only the operating system, not the user, can assign a given process to a given processor.

Event flags are bits maintained by VMS. Several different processes can have access to the same event flag, and signaling between processes can be achieved by setting or clearing an event flag. For example, the system service WAITFR places a process in a wait state pending the setting of an event flag.

Additional VMS system routines allow the creation of processes, the creation and mapping of sections of shared memory, and the initialization of event flags. These system routines are:

• CREPRC - Create process

• CRMPSC - Create and map section of shared memory

• MGLBSC - Map global section of shared memory

• ASCEFC - Associate common event flag cluster

More information on these routines can be found in the VAX/VMS System Services Manual.5 We used the VAX instructions and the VMS system routines listed above to write a set of routines that embeds our methodology for parallel processing.

Parallel Processing Methodology

In the next section we outline the methodology we use to achieve parallelism and in the process define some important terminology. A program we wish to convert for parallel processing is divided into serial phases. Each phase is divided into tasks that are executed either serially or concurrently. A phase whose tasks are executed serially is called a single-stream phase, whereas a phase whose tasks are executed concurrently is called a multiple-stream phase. The single-stream phases are executed by a master process, whereas the multiple-stream phases are executed by slave processes. The slave processes are idle when the master process is active and vice versa. Figure 1 shows this relationship.

Master and slave processes run the same executable file, thus leading to easier program maintenance. As mentioned earlier, processes are dynamically assigned to processors by the VMS operating system.

We designed a general set of FORTRAN routines for this environment. This set now has seven entries and implements the critical-section and process-synchronization concepts defined earlier. It also performs the necessary initialization and provides facilities for debugging a multiprocess execution. The remainder of this section describes the functions available in this set.
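The way an atomic test-and-set instruction guards a critical section can be illustrated with a minimal sketch. This is not the authors' code: it is written in Python rather than VAX FORTRAN, uses threads in place of VMS processes, and lets a non-blocking `threading.Lock.acquire` stand in for the interlocked bit operation. The names `SpinLock` and `run_workers` are hypothetical.

```python
import threading
import time


class SpinLock:
    """Busy-wait lock built on an atomic test-and-set (an analogue, not BBSSI itself)."""

    def __init__(self):
        # The lock object plays the role of the interlocked bit.
        self._bit = threading.Lock()

    def lock(self):
        # acquire(blocking=False) atomically tests the bit and sets it if clear;
        # it returns False when another caller already holds it, so we retry.
        while not self._bit.acquire(blocking=False):
            time.sleep(0)  # yield to other threads; a Python-specific courtesy

    def unlock(self):
        # Clear the bit so that one waiting caller can enter.
        self._bit.release()


def run_workers(n_workers=4, increments=1000):
    """Have each worker update a shared counter inside the critical section."""
    section_entry = SpinLock()
    shared = {"count": 0}

    def worker():
        for _ in range(increments):
            section_entry.lock()
            shared["count"] += 1  # critical section: one worker at a time
            section_entry.unlock()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


if __name__ == "__main__":
    print(run_workers())
```

Because every update happens between `lock` and `unlock`, no increment is lost regardless of how the workers interleave. A waiting worker keeps polling rather than sleeping until signaled, which is the trade-off between busy waiting and event-flag waits.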

Figure 1 Synchronization of Processes (timeline of the master process and two slave processes against real time, showing active and idle periods and the FORK and JOIN signals to proceed)

Initialization

Initialization is performed by a logical function called MASTER_PROCESS, which is set to TRUE if a master process runs the executable file and FALSE if a slave process runs it. The slave processes have special names that differentiate them from the master process. An argument list permits the specification of the number of slave processes to create and the input and output files to use for those slave processes. Through this argument list a unique process number is returned to each calling process.

A user can also specify the number of slave processes to create by using a command-line option when the program is run. For example, the program CAYENNE would be run with N slave processes if invoked with the command CAYENNE/SLAVES=N at the $ prompt. If the calling process is a master, MASTER_PROCESS will create the sections of shared memory, initialize the event flags used for synchronization, and create the required number of slave processes. If the calling process is a slave, the function will map the shared virtual-address space to the existing sections of shared memory. The sections of shared memory are FORTRAN common blocks defined as shared when the program is linked with an appropriate linker command. During this initialization phase, CREPRC creates slave processes, CRMPSC and MGLBSC create and map sections of shared memory respectively, and ASCEFC initializes the event flags.

Synchronization

Synchronization is performed by four of our seven subroutines: FORK, JOIN, JOIN_EXIT, and JOIN_FORK. These subroutines use the VMS system routines SETEF, CLREF, and WAITFR to perform the necessary interprocess signaling. Each subroutine accomplishes the following functions:

• FORK - This subroutine is called by the master process to signal the slave processes to proceed. The master process then waits in this subroutine for the slaves to signal back.

• JOIN - This subroutine is called by the slave processes to signal the master process to proceed. The slave processes then wait in this subroutine for the master to signal back. Only the last calling slave process signals the master process. The VAX instruction ADAWI is used to identify this last calling slave process.

• JOIN_EXIT - This subroutine is called by the slave processes to signal the master process to proceed. However, the slave processes then exit instead of waiting for a signal. That is the way the slave processes are stopped when they are no longer needed.

• JOIN_FORK - This subroutine is called by the slave processes to synchronize two multiple-stream phases with no intervening single-stream phase. The use of this subroutine allows slave processes to be synchronized without having to signal the master process.

These synchronization routines put a process that needs to wait for a signal into a wait state. Processes in a wait state do not use any CPU time. Each call to one of these synchronization routines, however, requires many machine instructions to be executed. If the application programmer anticipates a very short waiting time, an alternative to the previous method of synchronization is synchronization through busy wait. In this scheme a process will loop, executing an instruction of the form DO WHILE (FLAG_IS_NOT_SET) ENDDO. The process will execute the previous instruction until the logical FLAG_IS_NOT_SET is set to FALSE.

The busy-wait form of synchronization needs to be used with care. It can lead to loss of overall system performance. Indeed, the process executing a busy-wait instruction will use CPU time that might be more productively used by another process. In addition, the logical FLAG_IS_NOT_SET, which is constantly checked for, is shared by all processes. Therefore, access to this logical must be carefully controlled. If several processes change this logical at the same time, its final value will be unknown. If no process updates FLAG_IS_NOT_SET, a process may execute the busy-wait instruction forever, thus leading to deadlock. Deadlock occurs when processes are waiting to receive a signal that will never be sent.

Critical Section

Critical sections in a parallel implementation should be minimized. They are the bottlenecks of the multiple-stream phases because they can be accessed by only one process at a time. If a critical section cannot be avoided, the time spent to access this section should be minimized. Exclusive access to critical sections can be achieved by using either the VAX interlocked instructions or the VMS system services. The former method implements a busy-wait form of access synchronization; the latter uses event flags.

The two subroutines LOCK and UNLOCK are the routines implementing a busy-wait form of access synchronization. We chose this method because it is faster in elapsed time, and the time spent by a process waiting is expected to be small when the access to critical sections has been minimized. These subroutines are used in the following manner to access a critical section:

      CALL LOCK(SECTION_ENTRY)
      CALL ACCESS_CRITICAL_SECTION
      CALL UNLOCK(SECTION_ENTRY)

SECTION_ENTRY is an integer associated with a given critical section. This integer is set to 1 when a process is using the critical section and to 0 when no process is using the critical section. The two calls LOCK and UNLOCK ensure that only one process at a time executes the code ACCESS_CRITICAL_SECTION. We use these routines only once in CAYENNE for dynamic task allocation.

Parallel Debugging

Debugging parallel code is somewhat more complex than debugging sequential code. We debug our parallel code using the following methodology. The functionality of our parallel code does not depend on the number of slave processes or on which specific process performs a particular task. Therefore, the whole code can be executed by the same process. For example, CAYENNE runs with only one process if the number of slave processes is specified to be zero. This allows most algorithmic modifications made in the code to be debugged with the VMS debugging facilities provided for sequential code.

After the first debugging phase, a code section could still have errors when run with multiple processes. Our routines allow two forms of debugging, requested either through a flag in the argument list of the logical function MASTER_PROCESS or through a command-line option. The first form of debugging permits the assignment of a different terminal to each process and the setting of a debugging session for each process on its assigned terminal. The second form of debugging is intended to be used with a workstation. A different workstation window is assigned to each process, and a debugging session is set up for each process in its assigned window. The number of processes that can be debugged concurrently is limited to either the number of terminals available or the number of workstation windows that can be opened.

Example

The following example, shown in Figure 2, illustrates some of the functionality of our set of routines. We want to compute the sum SUM of all integers from 1 to N*S. We assume that a master process with the help of N slave processes does the task. Each slave process is assigned a unique number PROCESS_NUMBER between 1 and N by the logical function MASTER_PROCESS. The section of shared memory consists of an array PARTIAL_SUM of size N. The slave processes work in parallel. Each slave process adds S consecutive integers and stores its result in the shared memory location PARTIAL_SUM(PROCESS_NUMBER). After the slave processes have completed their task, the master process adds their partial sums, stored in the shared array PARTIAL_SUM, to produce the final result SUM. The code corresponding to this procedure follows. (Remember that master and slave processes run the exact same executable file.)

Digital Technical Journal, No. 4, February 1987. A Parallel Implementation of the Circuit Simulator SPICE on the VAX 8800 System

      PROGRAM parallel
      LOGICAL master_process
      INTEGER process_number
      INTEGER number_of_slaves, default_number_of_slaves
      INTEGER debug_flag
      PARAMETER (default_number_of_slaves=5, debug_flag=0)
      COMMON /shared/ number_of_slaves
      COMMON /local/ process_number
! The same executable runs as master or slave, depending on the
! value returned by master_process.
      IF (master_process(process_number, number_of_slaves,
     1    default_number_of_slaves, 'input', 'output', debug_flag)) THEN
        CALL master_code
      ELSE
        CALL slave_code
      END IF
      END

      SUBROUTINE master_code
      INTEGER number_of_slaves, maximum_number_of_slaves, i
      PARAMETER (maximum_number_of_slaves=10)
      INTEGER partial_sum(maximum_number_of_slaves), sum
! partial_sum lives in the shared common block.
      COMMON /shared/ number_of_slaves, partial_sum
      CALL fork
      sum = 0
      DO i = 1, number_of_slaves
        sum = sum + partial_sum(i)
      ENDDO
      END

      SUBROUTINE slave_code
      INTEGER process_number, number_of_slaves, start, s, i
      INTEGER partial_sum(1)
      PARAMETER (s=200)
      COMMON /local/ process_number
      COMMON /shared/ number_of_slaves, partial_sum
      partial_sum(process_number) = 0
! Each slave adds its own block of s consecutive integers.
      start = (process_number - 1) * s
      DO i = start + 1, start + s
        partial_sum(process_number) = partial_sum(process_number) + i
      ENDDO
      CALL join_exit
      END

Figure 2 PROGRAM Parallel
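
For readers without access to the VMS routines, the master/slave pattern of Figure 2 can be sketched in Python. This is only an illustration under stated assumptions: threads stand in for the slave processes, an ordinary list stands in for the shared array PARTIAL_SUM, and starting and joining the threads stands in for the synchronization routines. The name parallel_sum is hypothetical and does not appear in the paper.

```python
import threading


def parallel_sum(n_slaves, s):
    """Sum the integers 1..n_slaves*s using one worker thread per slave."""
    # Stand-in for the shared array PARTIAL_SUM of size N.
    partial_sum = [0] * n_slaves

    def slave_code(process_number):
        # process_number runs from 1 to N, as in the paper's example.
        start = (process_number - 1) * s
        total = 0
        for i in range(start + 1, start + s + 1):
            total += i
        # Each slave writes only its own slot, so no lock is needed here.
        partial_sum[process_number - 1] = total

    slaves = [threading.Thread(target=slave_code, args=(p,))
              for p in range(1, n_slaves + 1)]
    for t in slaves:
        t.start()   # begin the multiple-stream phase
    for t in slaves:
        t.join()    # wait until every slave has finished

    # Master code: combine the partial results.
    return sum(partial_sum)


result = parallel_sum(5, 200)
```

Because every slave owns exactly one slot of the shared array, the only synchronization required is the wait at the end of the multiple-stream phase.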


In the next section we describe how we created parallel processing in several phases of the circuit simulator SPICE to produce the program CAYENNE.

Modifications Made in SPICE

Before addressing each parallel phase of CAYENNE, we give a brief overview of the circuit simulator SPICE.

Overview of SPICE

SPICE performs several types of circuit analysis: steady-state analysis, transient analysis, and small-signal analysis. The most commonly used analysis for digital circuits is the transient analysis, which becomes increasingly time consuming as the size of the simulated circuit increases. Figure 3 gives a global description of the algorithms used by SPICE for a transient analysis.

      time = 0
      DO WHILE (time < finish time)
        discretize differential equations
        DO WHILE (not converged)
          linearize algebraic equations
          solve linear equations
          check convergence
        ENDDO
        IF (local truncation error too big) THEN
          reduce time
        ELSE
          save results at this time
          advance time
        ENDIF
      ENDDO

Figure 3 Transient Analysis Algorithm for SPICE

The circuit equations form a system of ordinary differential equations. This system is solved numerically at successive time points $t_i$, $i = 1, \ldots, N$. It is reduced at a given time point $t_i$ into a system of nonlinear equations by using a discretization method. A discretization method approximates the time derivative of a variable at a given time point as a function of the value of the variable at that time point and at previous time points. This method introduces a discretization error that must be controlled and maintained below a specified threshold. This error is called the local truncation error. The resulting system of nonlinear equations is reduced to a system of linear equations by performing a first-order Taylor expansion of the nonlinear elements of the circuit. This linearization introduces another error called the linearization error. The resulting system of linear equations is then solved exactly, using an LU factorization of the system matrix.

After the solution of the system has been obtained, the linearization error can be estimated. If this error is too big, a new linearization is performed around the previously computed solution, and the new linear system is solved again. Successive linearizations are performed until convergence is obtained, that is, until the linearization error is below a specified threshold. When convergence is reached the solution of the nonlinear system is obtained, and the local truncation error is then checked. If this error is too big, the solution at time point $t_i$ is rejected and the system of differential equations is solved at a new time point $t_i'$ so that $t_{i-1} < t_i' < t_i$. If the error is below a specified threshold, the solution is accepted, and the system is solved at a new time point $t_{i+1}$ so that $t_i < t_{i+1}$. This procedure is repeated until the entire transient analysis is computed.

During a transient simulation the circuit simulator SPICE spends up to 90 percent of its CPU time in three phases of the previous algorithm. These phases are as follows:

• Load Phase - This phase consists of loading the matrix and the right-hand side of the system of linear equations obtained as described above. Device-model equations and linearization errors are also computed in this phase.

• LU Factorization Phase - This phase consists of factoring the matrix of the system of linear equations into the product of a lower triangular matrix and an upper triangular matrix. This factorization is used to solve the system of linear equations.

• Local Truncation Error Phase - This phase consists of computing the local truncation error committed at each time step.

The modifications for parallel processing made in these three phases are described next.
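
The control flow of the transient analysis algorithm in Figure 3 can be made concrete with a minimal Python sketch for a single nonlinear equation. Everything specific here is an assumption chosen for illustration: the test equation dv/dt = -(v + v^3), the backward Euler discretization, the simple error estimate, and the step-halving and step-doubling rules. SPICE's own integration formulas and error estimates differ; only the nesting of the discretize, linearize, solve, and check steps follows the figure.

```python
def f(v):
    """Right-hand side of the illustrative equation dv/dt = f(v)."""
    return -(v + v ** 3)


def dfdv(v):
    """Derivative of f, used for the first-order (Newton) linearization."""
    return -(1.0 + 3.0 * v ** 2)


def transient(v0, t_end, h0=0.1, lte_tol=1e-5, newton_tol=1e-12):
    """Return (accepted time points, number of rejected time points)."""
    t, v, h = 0.0, v0, h0
    accepted = [(t, v)]
    rejected = 0
    while t < t_end - 1e-12:
        step = min(h, t_end - t)
        # Discretize: backward Euler gives x - v - step*f(x) = 0.
        x = v
        for _ in range(50):
            # Linearize around the current iterate and solve the linear equation.
            residual = x - v - step * f(x)
            slope = 1.0 - step * dfdv(x)
            dx = -residual / slope
            x += dx
            # Check convergence of the successive linearizations.
            if abs(dx) < newton_tol:
                break
        # Rough local truncation error estimate for backward Euler.
        lte = 0.5 * step * abs(f(x) - f(v))
        if lte > lte_tol:
            # Error too big: reject this time point and retry with a smaller step.
            h = step / 2.0
            rejected += 1
            continue
        # Accept: save the result and advance time.
        t += step
        v = x
        accepted.append((t, v))
        h = step * 2.0 if lte < lte_tol / 4.0 else step
    return accepted, rejected


points, n_rejected = transient(v0=1.0, t_end=1.0)
```

In a circuit simulator the scalar Newton update becomes the load and LU factorization phases, and the error estimate becomes the local truncation error phase, which is why those three phases dominate the run time.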


Load Phase

In the load phase each circuit element computes and loads all its contributions to the matrix and the right-hand side of the linear system obtained from the circuit equations. Several distinct elements may contribute to the same matrix or right-hand side entry. This means that the matrix and right-hand side are critical sections in the load phase, and access to them needs to be controlled. One approach to synchronize accesses to the matrix is to use a single lock on the whole matrix.6 In this case only one processor can write into the matrix at a given time, leading to contention for shared resources and decreased efficiency.

In our approach locking the entire matrix is avoided by creating an additional data structure to store each individual element contribution. This structure can be viewed as a three-dimensional matrix whose third dimension is used to store each individual element contribution to a given circuit-matrix entry. Figure 4 depicts such a matrix. There is no unused memory in this structure because it has a variable depth in its third dimension. Nevertheless, using this structure will increase the memory requirements of the simulator. In the design of CAYENNE it was necessary on many occasions to trade memory for speed. Our test examples show that CAYENNE requires an average of 20 percent more data memory than SPICE version 2G5 requires. The contributions for each matrix entry are subsequently summed and loaded in parallel into the circuit matrix. The matrix load is therefore performed in two successive multiple-stream phases.

Figure 4 Three-Dimensional Matrix (G1 - conductance of first resistor; G2 - conductance of second resistor)

It is crucial that tasks are evenly distributed among slave processes so that no slave process stays idle while others are computing. A dynamic task allocation was chosen for the first multiple-stream phase of the matrix load. That allocation was preferred to a static task allocation because the time needed to load each element cannot be estimated accurately. Indeed, computation of device models may be bypassed during simulation. The model equations of a device are not computed at a given iteration of the analysis if the voltages applied to this device did not change significantly compared to their values at the previous iteration. This strategy saves CPU time.

Dynamic task allocation is achieved through an array of tasks whose number exceeds the number of slave processes. A task consists of a list of circuit elements to be loaded. Tasks are defined so that each requires approximately the same amount of work. The amount of work needed to load a circuit element is estimated roughly by neglecting bypass and evaluating the CPU time needed to load the element. Dynamic task allocation is expected to minimize any imbalance that may occur during simulation through device model computation bypass.

The task allocation for the second multiple-stream phase of the matrix load is done statically since the work needed to perform this phase can be divided into tasks requiring the same amount of CPU time. The only interlocked access to shared memory during the matrix load is the one on the array index, which defines the next task when dynamic task allocation is used. This index is successively read and incremented by all slave processes.


LU Factorization Phase

The time spent by a direct-method circuit simulator in the load phase is linear in the number of elements, whereas the time spent solving the linear system of equations is superlinear in the size of the matrix.7 For large circuits the matrix solution part will therefore become more important and will dominate over the load phase.

In SPICE the matrix-solution phase is done using sparse matrix LU factorization. Although full matrices can be factorized efficiently in parallel,8 the parallel factorization of sparse matrices is more difficult. The LU factorization algorithm has a sequential dependency, and the amount of concurrent work that can be done at each step in a sparse matrix is small.

It is possible to design algorithms that detect the maximum parallelism at each step of the LU factorization. Such algorithms have been used for vectorized circuit simulation.9 In our environment synchronization is done through software and the fine-grain parallelism used for vectorization may not be efficient. Based on these considerations, we have proposed and implemented an algorithm in which particular care has been taken to minimize the overhead incurred with parallel processing. The details of our algorithm can be found in reference 10.

Local Truncation Error Phase

The parallel computation of the time step does not present major difficulties since the computation of the local truncation error for each energy storage element is independent. Each slave process is assigned a set of energy storage elements and computes the time step required by this set. The master process then computes the minimum time step among the time steps returned by the slave processes. The energy storage elements are statically assigned among slave processes so that the work among them is balanced.

Results

The parallel algorithms described in this paper have been implemented to produce the program CAYENNE. We now present two examples to compare the timing performances of SPICE and CAYENNE.

The first example is the simulation of a MOS arithmetic logic unit (ALU) on a VAX 8800 system. The circuit has 200 nodes and 1350 elements. Twelve hundred Newton-Raphson iterations are required for the transient simulation. The efficiency of our parallel implementation is measured in this example. If a multiple-stream phase runs sequentially in an elapsed time $T_s$ and in parallel with $N$ slave processes in an elapsed time $T_p$, we define the efficiency, $E$, of the parallel execution by

$$E = \frac{T_s - T_p}{T_s - T_s/N}$$

$E$ represents the ratio of the actual savings in elapsed time to the potential savings in elapsed time. Table 1 gives timings and efficiencies for the ALU example. As a comparison, SPICE simulates the same circuit in an elapsed time of 834 seconds.

Table 1 Timing Performances and Efficiencies

                     CAYENNE      CAYENNE
                     0 Slaves     2 Slaves     Efficiency
Phase                (Seconds)    (Seconds)    (Percent)
Load                 694          397          86
LU                   22           14           70
LTE                  67           35           96
Total Simulation     867          529

The second example is the simulation of a MOS control store. The circuit has 160 nodes and 530 elements, and the transient simulation requires 1404 Newton-Raphson iterations. SPICE spends 91 percent of the simulation time in the three phases we modified for parallel processing. CAYENNE executing with two slave processes achieves 90-percent efficiency in these phases and simulates the circuit 1.7 times faster than SPICE. For this simulation, CAYENNE on a VAX 8800 runs 9 times faster than SPICE on a VAX-11/780 CPU. Table 2 shows these comparisons.

Table 2 Comparison of SPICE and CAYENNE Elapsed Run Times

                          Elapsed
Case                      Seconds     Ratio
SPICE on VAX-11/780       3990        9.1
SPICE on VAX 8800         750         1.7
CAYENNE on VAX 8800       440         1.0

The efficiencies of a parallel execution of CAYENNE depend on the size of the circuit. Indeed, there is a fixed overhead incurred by


calling the synchronization routines JOIN, FORK or JOIN_FORK. The bigger the task performed by the slave processes before a call to a synchronization routine, the smaller the relative cost of synchronization. The simulations of our examples were also run on a lightly loaded system. Loss of efficiency occurs when processors have to be shared with nonrelated processes, and busy-wait synchronizations may waste significant resources. A workload consisting of several independent simulations of equal importance is already decomposed, and CAYENNE should be run in single-process mode. If the turnaround of a single, large simulation needs to be minimized, however, CAYENNE should be run with two slave processes on a dedicated or lightly loaded VAX 8800.

Summary

We have described a general methodology for parallel processing on the VAX 8800 system and a user-friendly set of routines that embed our methodology. We have also presented the successful conversion of the circuit simulator SPICE into the parallel program CAYENNE. New schemes to minimize the overhead of parallel processing and to balance the load among processes contribute to the overall efficiency of our implementation.

Acknowledgments

We would like to acknowledge Bob Kusik for initiating this project, Craig Yankes for introducing us to parallel processing within the VAX/VMS system and for providing us with an initial library of routines from which our methodology evolved, and John Faricelli, Nadim Khalil, Karem Sakallah, and John Sopka for many fruitful discussions.

References

1. R. Hockney and C. Jesshope, Parallel Computers (Bristol: Adam Hilger, Ltd., 1981).

2. L. Nagel, "SPICE2, A Computer Program to Simulate Semiconductor Circuits," Memo no. ERL-M520, University of California, Berkeley (May 1975).

3. Guide to Multiprocessing on VAX/VMS (Maynard: Digital Equipment Corporation, Order No. AA-HP69A-TE, 1986).

4. S. Farnham, M. Harvey, and K. Morse, "VMS Multiprocessing on the VAX 8800 System," Digital Technical Journal (February 1987, this issue): 111-119.

5. VAX/VMS System Services Reference Manual (Maynard: Digital Equipment Corporation, Order No. AA-Z501B-TE, 1986).

6. G. Jacob, A. Newton, and D. Pederson, "Direct Method Circuit Simulation Using Multiprocessors," Proceedings of the International Symposium on Circuits and Systems (May 1986): 170-173.

7. A. Newton, "The Simulation of Large Scale Integrated Circuits," IEEE Transactions on Circuits and Systems, vol. CAS-26 (September 1979): 741-749.

8. R. Thomas, "Using the Butterfly to Solve Simultaneous Linear Equations," Laboratory Memorandum, Bolt, Beranek, and Newman, Inc. (March 1985).

9. F. Yamamoto and S. Takahashi, "Vectorized LU Decomposition Algorithms for Large Scale Circuit Simulation," IEEE Transactions on Computer Aided Design, vol. CAD-4, no. 3 (July 1985): 232-239.

10. G. Bischoff and S. Greenberg, "CAYENNE: A Parallel Implementation of the Circuit Simulator SPICE," Proceedings of the IEEE International Conference on Computer Aided Design (November 1986): 182-185.

Dennis T. Bak

The Impact of VAX 8800 Design Methodology on CAD Development

Contributing to the success of the VAX 8800 project was an integrated CAD environment supporting the hardware design effort. A CAD group dedicated to this single project was chartered to supply a smoothly operating CAD process from initial design conception to final production. The CAD environment evolved through a blending of existing tools available in Digital with new tools developed outside the company. Gaps in the environment were filled through extensive modification of existing tools and new development efforts. The driving force behind the CAD process was a design methodology, radical for its time but second nature now.

Past CAD Development Efforts

Prior to the mid-1970s, logic development efforts within Digital Equipment Corporation were largely done without the extensive use of CAD tools. Hand-drawn schematic diagrams were the primary means of expressing logic designs.

A major advance in design automation took place in the mid-1970s when the Stanford University Design System, or SUDS, began to be used within Digital. SUDS allowed the entry of schematics into and the extraction of net lists from a graphics database. Although it was a major step forward in the automation of design processes, SUDS required significant user training and experience to become an effective tool.

Building a SUDS database capable of being used by a computer opened a new avenue for the evolving CAD groups to automate their design processes. These groups soon developed a large body of programs to support net-list extraction, design analysis, placement and routing, and eventually manufacturing parts-lists generation. Simulation tools were developed to help verify the operations of a design before any actual hardware was available. The increased complexity of design drove CAD developers to provide more powerful CAD tools. In turn, logic designers soon grew increasingly dependent on CAD tools as their capabilities increased.

The design methodologies and the CAD tool suite evolved to support large-CPU designs, such as the VAX 8600 family. SUDS eased the burden of entering and coping with design changes; however, the actual contents of its schematics differed little from those of the earlier hand-drawn ones. In large part the schematics entered by designers into SUDS correlated directly with the physical entity being built, showing all components and their pins.

At the inception of the VAX 8800 project in the early 1980s, a vast collection of CAD tools, written by many internal groups, had sprung up. Most of these tools required large ASCII data files and significant manual intervention by CAD experts. Although many aids were provided to develop design processes, they lacked the cohesiveness and simplicity needed to put a process directly into the hands of the designers.

At about this time, a number of significant advances were made in CAD technology. Engineering workstations were announced at prices that made it practical to put them directly into the hands of designers. Moreover, new design methodologies, such as structured computer-aided logic design, or SCALD, were also developed.1

These methodologies could significantly improve the quality of design while decreasing the time to develop complex systems. Therefore, Digital made a commitment to use those methodologies on the VAX 8800 project to produce not only the product but a more productive way of developing it.


Design Methodology

The development of CAD tools for the VAX 8800 project was a considerable challenge to the CAD designers. The complexity of the VAX 8800 design, with its particular gate array implementation, demanded that the design quality be high before anything was committed to hardware. In fact, the project managers made a radical (for its time) commitment to simulate the entire design and verify its timing before any hardware was built. Therefore, the CAD process had to be designed to meet not only that goal but also to facilitate the rapid production of hardware once the design had proven acceptable. This section of the paper describes the methodology we followed to make the best use of our CAD tools. The next section describes those tools and how they were used.

The tool suite that evolved, pictured in Figure 1, supported both logical and physical design processes with checks and balances to ensure that the design topologies remained the same. Schematic diagrams, captured at an engineering workstation, were processed into a logical net list that was used by the simulation and verification tools. Once a logical design reached a certain level of maturity, it was mapped into a physical design. At that point a physical analysis, to determine delays and signal integrity, was performed. Placement and routing tools were then run to further refine the design. The part of the physical design database that represented the logical topology was then passed back to the logical side of the design process. There, a comparison was made to ensure that the physical and logical designs were congruent. The results of simulations based on the physical design were also passed to the logical process for comparison with the simulations based on the logical design. These mechanisms provided the primary checks to ensure that the logical design matched the physical one.

Figure 1 CAD Tool Suite (diagram; recoverable labels: designer, manufacturing, logical to physical mapping, delay reports, placement and routing, interactive cleanup, manufacturing rules check, wire rule check, interface files, signal integrity, UNIX, VAX/VMS)

We decided that the best way to assure success was to develop a complete paper specification of the machine to be built. Once the overall goals for the machine had been established,


the designers developed the specifications for each major logic section. This high-level logical design was then partitioned into functions required within modules and gate arrays. These primary interfaces were specified before any detailed logic was developed. As it turned out, that partitioning remained relatively intact throughout the project.

The next step was to develop probe designs and abstract models for the most complex parts of the machine. These designs and models tested whether or not particular logic functions could be developed and timing constraints met. In some cases the probe designs were carried through to the actual fabrications of gate arrays or modules. This continuity allowed us to test the limitations of the selected ECL technology as well as the logic design.

The probe designs proved useful in many ways to both the designers and the CAD developers. The designers were able to verify that their logic implementations would work. The CAD developers were able to use the designs as test cases to develop and debug processes. These test cases proved to be critical to the project's success, especially when the finished design was given to the manufacturing organization. The process was so smooth, in fact, that designs flowed through it with few problems.

The Influence of SCALD

At the onset of the VAX 8800 project, we investigated the tools available within Digital for building a process to support the evolving design methodology. This study led the CAD team to explore several systems being developed by other companies. One system being developed by Valid Logic, Inc., the SCALDSystem CAD system, was procured by Digital. This system put the power of dedicated engineering workstations directly into the hands of logic designers. Of equal importance was the fact that the SCALDSystem CAD tools were being developed by the same people who conceived the SCALD approach to hardware design.

Logical schematics, requiring almost no information about the physical design, were entered into the SCALDSystem database. These schematics were entered in a hierarchical manner through an easy-to-learn graphical system. Such an arrangement encouraged the designers to avoid the creation of paper schematics by transferring their concepts directly to the workstation screens.

The decomposition of the design was from the top down, but the actual entry of design data occurred simultaneously at many levels.

A "design tree" evolved in which cells forming gate arrays were merged onto modules that plugged into the backplane to form a system. The logical design was entered via the SCALDSystem tools onto schematics. The physical implementation of that logical design was left to the physical design tools.

Simulation and Timing Verification

Simulation on the VAX 8800 project was approached from two different viewpoints. The first aimed to determine whether or not the performance goals of the proposed microarchitecture were within the necessary range, as specified by the project's needs.2 This simulation started early in the project before any detailed logic design had been completed. Once those performance goals had been verified, the second level of simulation focused on the logic design as it evolved.

The designers could verify that each piece of the design functioned as specified while that piece was being developed. As the design tree evolved, the number of logic levels given to the simulation tools increased until the entire logic design had been entered. At this point the designers actually had the equivalent of a software breadboard of the entire VAX 8800 processor. Microcoded instructions were "running" on this software breadboard long before any hardware was available.

The ability to run instruction streams on the breadboard gave the project several advantages. Logic designers could debug their logic concurrent with the microcode developers verifying their microcode. Moreover, the diagnostics engineers could write as well as debug significant numbers of microdiagnostics much earlier than was usual in a design project. The early completion of those diagnostics allowed the first available hardware to be checked thoroughly.

Making the design logically correct through simulation did not ensure that the machine would work at the desired cycle time. In the


ECL technology used in the VAX 8800, signal timing was critical. Therefore, a timing verifier, part of the SCALDSystem tools, was used to ascertain whether or not the timing goals were being met. It was within the timing verifier that the influence of the physical implementation on the logical design was first felt. The logic designers had to ensure that the placement of gates and routing of signals was optimal for all critical elements. Delay information was then extracted from the physical design and fed back to the timing verifier.

Physical Design

As the logical design evolved, we developed a CAD process to convert it rapidly into a physical design. A set of automatic placement and routing tools, together with delay-estimation and signal-integrity tools, was used to give feedback to the designers. The important question here was whether or not they could build physical representations of their logic designs. These tools also passed data to the timing verifier, which analyzed the effect of the physical design on circuit timings.

Since all the logic had to be verified before any hardware was fabricated, all processes had to be designed to handle a large number of designs in parallel. The relevant Digital manufacturing facilities and outside vendors were acquainted with the physical design through the test cases rather than through an actual prototype. Thus the facilities and vendors could configure and debug their own manufacturing processes before any completed physical designs were sent to them.

To ensure a smooth transition into the fabrication phase, manufacturing engineers were assigned to work directly with the designers early in the design process. Thus these engineers became familiar with the VAX 8800 technology and the machine as it evolved. This was an important step because our manufacturing organization was to build all the hardware, including the prototypes. This early acquaintance with the design allowed them to develop manufacturing processes to support the rapid change to full volume shipments soon after the VAX 8800 system was announced.1

Computational Resources

One of the largest VAXcluster systems ever built was assembled to support the VAX 8800 project. This cluster consisted of 14 VAX-11/780 and VAX-11/785 systems with over 20 gigabytes of mass storage. Even this large amount of storage was inadequate at times to support the demands of the databases. Forecasting the computational requirements of this project proved difficult. The VAXcluster system provided the computational power and flexibility to grow as the demands increased.

The availability of sufficient computational resources was critical to the success of our project. The design methodology of extensive simulation was effective only with reasonable program run times. Once the design was verified, large numbers of physical designs were released for fabrication within a short period, which consumed significant computational and storage resources.

The Tool Suite

Design Data Management

A design data management (DDM) system was developed to organize the many files that contained the actual design data. At the heart of that system was the concept of a "design object." This object was some functional piece of the design, usually conforming to the physical partitioning. For example, each gate array and module in the system was defined as a design object. For each object we developed a hierarchy of subdirectories within the VMS file system. This separation of data files into subdirectories allowed various tools within the CAD process to know where to find input files and to write output files.

The design database was continually churning with new information. To give a stable picture as the overall design evolved, a "snapshot" of a design object could be taken at any time, thus generating a revision of the design object. New subdirectory file trees were then created for each revision. Using this scheme a designer could create a "frozen" revision of a design. He could then use that revision for simulations or other activities while changes were being made to another revision of the design.

The relationships between design objects were defined within a revision-matrix file kept with each file tree. This file defined the system-level hierarchy of the machine: which design objects were subordinate to a given object. Using this file a designer working on a module


design could select frozen revisions of the gate array designs on that module and be assured of not having them changed as he worked on it.

Another facility provided by the DDM system was a user interface to the design environment. This interface consisted of a simple command language for traversing the design trees and for running specific tools. Since these tools required a large number of input variables, we established a system of default parameters to minimize user input. For cases in which those defaults proved inadequate, users or CAD developers could change parameters to meet the design's needs.

Schematic Capture

Using the ValidGED editor, logic schematics were entered directly into the workstations by the designers. The extracted wire lists were then transferred from the SCALDSystem-based workstation through a communications port to the VAXcluster system. The workstations were also interconnected in a networking environment, thus providing communication between them. To ease the burden on designers to learn multiple operating systems, only graphical data entry was permitted on the workstations. All the other CAD tools were run in the more native VAXcluster environment.

Since the majority of a designer's time was spent interacting with CAD tools on the VAXcluster system, there was no need for each designer to have a dedicated workstation for schematic capture. The ratio of designers to workstations of about two to one proved adequate. The easily learned GED editor supported a rapid increase in the number of nondesigners (managers, secretaries, and documentation writers) in the user community. All were drawn to the system by the ease of graphical data creation. Eventually, this documentation activity accounted for the majority of workstation usage.

Simulation and Timing Verification

Another proprietary tool, called the DECSIM system, was the primary simulator used on the project. This system supported mixed-level simulations, both structural and behavioral. The logical design was transferred hierarchically to the DECSIM system. This system allowed the designers to deal with complex designs by viewing the simulation in the same hierarchical form as the schematics. For complex devices, such as multiplier chips and RAM devices, behavioral models were developed. These more efficient models increased the overall performance of the simulations. In the case of RAM devices, abstracting to a behavioral model also allowed the microcoded instructions to be loaded efficiently.

Complementing the functional simulation facilities of the DECSIM system was the timing verifier (TV) in the SCALDSystem tools. TV analyzed circuit timings to ensure that the design would work under worst-case conditions at the desired cycle time.

Wire delays are a major factor to be taken into account by timing verification. The placement of the physical gates was critical to minimize the wire lengths and hence the delays. Since the placement was not available in the initial design phases, statistical delays based on loading were used. As placement information became plentiful, the latest refined delays were sent to the timing verifier. When the physical design had been completed, delays based on routed lengths were used. If the required timing was not met at any point in the process, the offending circuits were redesigned or the layout was changed to correct the problem.

Wirelisting and State Maintenance

The logic gates entered on schematics by the designers were, in general, assigned to physical components by the CAD process. This mapping occurred initially within the SCALDSystem postprocessor software using a random gate-to-component assignment. This random packaging was then fed into a system called YAWL (for Yet Another WireLister). YAWL acted as a general-purpose wirelister, generating interfaces to many tools and accepting feedback from the physical design tools.

As the physical design process refined the gate assignment, YAWL ensured that the logical design topology did not change. By accepting feedback data from the placement and routing tools and the physical design system, YAWL caught any illegal changes that would have altered the logic functions.

Eventually, the complexity of maintaining the state became so large that YAWL alone could not cope with it. Therefore, several other programs were placed in the feedback loop from the physical design tools to detect changes made in the process of manually cleaning up the physical design. These programs were needed since,


even at that late stage, a designer could still add logic to the design. The CAD process therefore had to handle these additions as well as to detect illegal transformations to the logic. The resolution of these changes took a lot of resources, both in terms of time and computer power.

In addition to being the state maintainer, YAWL acted as a primary source of the design data needed for the remainder of the CAD process. YAWL created many reports to inform designers of problems between their logical and physical designs. Most of the interface files in the CAD process were read or written, or both, by YAWL, which played a key role in the overall process.

Placement and Routing

Two processes were developed for the placement and routing of gate-array and module designs. The gate array process was highly automated, requiring a minimum of interaction by the designers. The process was organized to make several runs from which a designer could select the one that best optimized his logic design.

The bounded problem of placement and routing within a gate array was easy to solve in comparison to the module designs. Here the constraints placed by designers, the limitations of tools, and the complexities of design required extensive human intervention.

Analysis tools were used extensively to assist in determining the quality of design at the two design levels: gate arrays and modules. These tools analyzed such factors as thermal dissipation, signal integrity, and crosstalk. The constraints defined in these tools and in the extensive design-rule checkers were met, thus ensuring a high-quality design.

Most of the tools used for the physical design were developed within Digital. Those developed outside the VAX 8800 CAD group were modified, sometimes extensively, to meet the needs of the project.

Physical Design and Manufacturing Interface

A proprietary physical design system, called the VAX layout system (VLS), was used for the final physical design tasks. VLS took the physical design, as given by the placement and routing tools, and added the data required to manufacture the design. A layout designer, through the VLS interactive graphics system, could manually complete the routing that could not be handled by the automatic tools. Some additional parts that were necessary for fabrication, such as handles for modules, were also added at this time. The net result was a complete design, specified so that it could be used to manufacture the product.

The design data was then collected to form a release package. To keep track of the formal release of design data, a system called POST was developed by the CAD group. POST provided an on-line database, which any member of the project team could query to determine the release status of a design.

Problems Imposed by the Design Methodology

Up to this point, we have described the [...] of the design methodology used to develop the VAX 8800 system and some highlights of the CAD tools supporting that methodology. As mentioned earlier, the CAD process was placed directly into the hands of the designers. Thus a tight coupling was established between the process of design and the design process. This coupling posed several major problems, as now described, for the CAD group.

Training

With direct control of a process or tool given to the designers, they all now needed extensive training. On previous projects, one highly knowledgeable individual could run a tool; now, there were 30 or so novice users all learning to use that same tool. Extensive support for those users, in terms of both trainers and documentation, had to be provided.

In most cases the designers quickly learned how to utilize the tools. In a few cases, the placement of modules in particular, placement experts were needed owing to the specialized nature of the task. In summary, the extent of the support required by users was greater than anticipated.

State Maintenance

The task of state maintenance proved to be extremely complex owing to the freedom given to designers to make changes at almost any point

in the design process. To ensure that the logical and physical designs matched, it was necessary to do a complete isomorphic comparison of the physical topology against the logical topology of the design.

Logical Prints

The schematics generated by the designers at their workstations represented the logical design, not the physical one. Certain features available in the SCALDSystem tools, such as vectorized signals and gates, allowed it to produce a concise representation of the logic. This came, however, at the expense of not putting physical data back onto the print set. For reasons of state maintenance, we were also unable to restructure a print set once mapped to a physical implementation. Both these factors contributed to a print set that appeared quite different from those generated by previous projects.

Logical print sets, while initially envisioned as being beneficial, later caused problems in documenting the designs. This was particularly true for module-level designs for which training was needed so that groups outside the project team could interpret the new symbology.

Cross References

Using logical print sets alone, a technician could not probe a pin of the physical boards. Since an abstract mapping took place in the CAD process, it was necessary to develop an extensive set of cross references showing the mapping of the logical to the physical design. These cross references proved to be cumbersome and, when printed, consumed vast amounts of paper.

Libraries

CAD tools run on libraries, and each major tool has its own format for library data. These libraries must be consistent across the entire process. Despite all the safeguards built into the process, we found that inconsistencies still crept back into the database. Discovering and eliminating those inconsistencies, many of which were found late in the project, consumed a lot of time.

Summary

Both the design methodology and the CAD process supporting the VAX 8800 project were quite successful. The first prototype hardware delivered to us worked as expected. We found only a small number of hardware problems during the prototype debug phase of the project. Most of those problems were in areas that had not had extensive simulation or timing verification.

Some general conclusions reached from the VAX 8800 project can help future CAD designers to improve their tools.

• A close coupling from the start, both physically and organizationally, between all groups associated with the project leads to the development of a smooth process flow.

• The design methodology has a direct and far-reaching impact on the CAD process. The capabilities of CAD tools directly affect the design methodology.

• Extensive simulation and timing verification before fabrication can help to achieve a high-quality product.

• The impact of radical changes (e.g., in the data content of schematics) must be appreciated and then taken into account by all project members.

In future projects we will focus on reducing the process-loop times and enhancing the capabilities of the simulation and timing verification tools. It will be easier to function in future design environments, and more tools will be placed directly into the hands of the designers. The design methodology will be modified to make the resolution of the design state easier and therefore faster.

References

1. Structured Computer Aided Logic Design was developed at Lawrence Livermore Laboratories and applied there to the design of the S1 computer.

2. C. Wiecek, "The Simulation of Processor Performance for the VAX 8800 Family," Digital Technical Journal (February 1987, this issue): 100-110.

3. A. Matthews, "On-line Manufacturing Data Access on the VAX 8800 Project," Digital Technical Journal (February 1987, this issue): 136-141.

Andrew J. Matthews

On-line Manufacturing Data Access on the VAX 8800 Project

Previously, the transition from design to manufacture involved transferring significant amounts of data on paper. To minimize product start-up time, the VAX 8800 project used an on-line system that eliminated much of the paper. The key task was transforming the data from existing CAD tools with different formats into manufacturing data. Two generic types of VMS files, DATA and DRAWING, contained data for each Part Number and Revision Number. VMS's subdirectory and access-control capabilities provided total revision control. Manufacturing engineers pulled files at will using DATA files to drive their processes and viewing DRAWING files from VAXstation workstations.

A key objective for the VAX 8800 project was to [...]

Therefore, to achieve our goal, we had to eliminate or minimize those delays.

We knew of a number of ways to speed up this transition phase. Since there is normally a tremendous flow of data on paper between Engineering and Manufacturing, one way was to eliminate the paper itself. A second way was to accelerate the controlled revision process when changes were required. And a third way was to accelerate the query-and-response process that was necessary to solve specification problems. [...]ished to determine how best to implement the three ways to minimize delays.

The project team determined that although the data flowing between Engineering and Manufacturing was vital, the paper itself was not. Thus the team's goal was to find out how to establish a paperless, but not drawingless, scheme to pass that information between the two organizations. The team also set some constraints on this scheme. First, existing data techniques should be used whenever possible rather [...]

[...] rather than [...] Manufacturing needed it, that is, by Part Number and Revision, among others. Therefore, some translation process had to take place between the data sources in Engineering and the data repositories used by Manufacturing.

The data sources in Design Engineering are many and varied. Digital uses a large set of CAD tools in its design processes [1]. These tools use a variety of methods to gather, store, and manipulate data. The databases associated with these tools are the sources for all the specific [...]

The primary CAD and CAM process tools did not communicate since they were all based on different data formats and revision procedures. The primary goal of the project was to take the design data created by the CAD tools and, with as little paper as possible, turn it into manufacturing data that could be used by the various manufacturing groups. The direct way that goal could be accomplished was to create an integrated source of data as VMS files that would be avail[...]


As typically happens in a rapidly evolving technological environment, the standard data-transfer processes already in place had rapidly become outdated. The result was that the standard process was handling only part of the data, and informal systems evolved to deliver the remainder. MDA had to identify all these data processes, regardless of their sources. Then, it had to provide all the data needed to build and test the product through a consistent on-line process. That task was accomplished by "reverse engineering" the existing processes. All the process managers responsible for the product in Manufacturing were interviewed to find out what data they were receiving by both formal and informal means. They were asked, in particular, what additional data they needed. The result was a lengthy list of data files, most of which existed or could be easily generated.

One key limitation to this type of data-generation process was the availability of an appropriate engineering database. For example, a visual-inspection process might need the color of a component, but this data may not be in any engineering database. Therefore, some manufacturing data processes would have to continue using other sources, typically libraries of additional information, as well as the engineering database.

The objective of MDA was to provide on line all the data needed for new product start-up. The problem, as noted earlier, was that this data was derived from many different files used by the CAD tools. These separate software tools, having come from many sources at different times, generally operate on independent VMS files and do not yet utilize complex, integrated database capabilities. Therefore, another primary goal of the MDA project was to bring appropriate data management to these existing processes, but at the same time not to require significant changes within them.

Given this VMS file environment, the team made an early decision that the VMS system could provide the framework for comprehensive data management and organization capabilities if full advantage were taken of the possibilities inherent in the system. That is, files and directories, subdirectory schemes, and access control lists had to be used effectively. The advantages of using VMS features for these existing files rather than implementing a specialized data-management scheme were numerous. This procedure meant that these capabilities would be immediately accessible to all of Digital's VAX users, could be readily linked to existing read and write processes for CAD/CAM files, and would require no unique training, software, or hardware.

The remainder of this paper describes the approach that MDA takes to achieve an integrated source of manufacturing data. As a first-generation paperless process, MDA was used on the VAX 8800 project with great success. We anticipate that MDA could evolve at a later date into a second-generation paperless process. In this process, users in Manufacturing would be able to selectively compose and generate any desired drawing from the databases. For the first design of MDA, however, that was too sophisticated a solution to be applied to a broad manufacturing community still in transition from paper processes.

MDA Capabilities

We designated the files containing the data that drives the computer-aided processes in Manufacturing as DATA files. Every drawing sheet in the full drawing package is electronically released as a plot file. These on-line files, called DRAWING files, are effectively the master drawings, and any locally generated paper prints are temporary working copies. DRAWING files are intended only for human interpretation (viewing or plotting); they do not have to be interpreted as structured data by other functional-process software. DATA files are used for that purpose.

Both DATA and DRAWING files are made available through a single unified process available anywhere on Digital's world-wide internal DECnet network. Data security is provided in the software by an access control list of specifically authorized users in Manufacturing. A list method rather than password control was chosen since the VMS system has all the capabilities to implement list control (identifying remote users). Control over access to the on-line product database remains with the data managers.

The files are organized around the Part Number and Revision Number of the physical object. A complete DATA and DRAWING file set is provided for each revision, thus leading to a degree of redundancy between files. We originally considered solving this redundant-data problem in


the traditional CAD/CAM way by defining separate universal interface files and designing integrated databases from which any needed file could be extracted. To achieve the primary goal of minimizing all delays in product data transfers, however, we concluded that providing the process-specific, but redundant, files needed directly in Manufacturing was worth the price. This technique eliminated all hand-off delays and allowed the already proven processes to operate efficiently. Of course, the risk was that data in the redundant files could in some way diverge. Therefore, Engineering assumed the responsibility of verifying that the data was consistent between them. Engineering uses special software to verify that all files in a set, some of which come from different CAD tools, represent the identical design object and revision state.

The DATA files utilized are those the start-up team identified as being directly needed for each manufacturing process. Our ideal target for DATA files was the specific data set needed by a "work cell" of the manufacturing plant: this typically includes both a computer resource and specific people that together receive and adapt the generic data to the immediate needs of their particular plant and process. To minimize the process start-up time, eliminate queues, and assign responsibilities clearly, MDA avoided using intermediate data formats. These formats historically required preprocessing by some third party before they could be used in the plant. We expected the plants to adapt the DATA files to the specific needs of their own processes. For sophisticated data consumers with complex manufacturing needs, the source-data design files are also included with the on-line data.

The practical realities of the many CAD/CAM processes in use first required a smoothly operating file-management process. A large number of files are required to support the build-and-test processes for one designed object. A typical Digital part (e.g., a complex CPU logic module) is today completely specified by 50 to 70 DATA files and 30 to 50 DRAWING files. With that many files involved, a key to success for this type of file management is total data acquisition. Thus the process was made mandatory (not voluntary); that is, it could not depend on someone's remembering to do something. The only way to accomplish complete data acquisition was to integrate the data-management process with the CAD tools that generated the source files.

The principal MDA implementation concept was to use the extensive VMS subdirectories that "belonged" to each object and revision and then collect all the appropriate files into the appropriate directories. This technique makes possible a user data-access process based directly on the VMS system in which a user can answer several questions about the object or revision for which data is needed. MDA then provides him with a directory containing the files relevant to the requested object or revision. This directory represents the bounded set of data. Within that set each DATA and DRAWING file is "named" so that it is completely identified even if moved later to other manufacturing locations. The file-naming scheme is also not cryptic so that manufacturing users can specify and recognize the particular files they need.

An underlying objective of the MDA program was to provide an environment in which a released data file was perceived as being as stable as an approved and released paper drawing. Whenever a set of DATA and DRAWING files for a given revision of an object are released, that set of data becomes "read-only" and is placed under strict control. The engineering group will not modify any file within the set belonging to that revision, and subsequent revisions of that object do not overwrite prior revisions.

MDA allows users to pull data selectively as it is needed rather than pushing it automatically to predetermined receivers. The strategy here is to deliver not data, but automatically generated notification messages on Digital's electronic VAX mail system. The generation of mail is tied to the design-management functions of the hardware designers and the coordinators for engineering change orders (ECOs). The mail messages are sent to designated representatives in any of the manufacturing plants around the world to inform them to pull whatever data they require from the on-line system. Data users in Manufacturing are notified by automatic messages whenever new data is issued or when the status of existing data changes. This method takes advantage of the existing VMS Mail facilities for identifying remote users. A user access-control list has been implemented, and all user transactions are logged. These techniques con-

firm that new data has been received by users and provide an audit trail of who accessed particular data in case an error is discovered later.

Much of the data provided for the product is intended for the specific assembly and test processes implemented by the start-up team. Provision of this data is made possible by the close coupling of the Engineering Design and start-up team efforts and the sophistication of the data-driven fabrication and test processes. In other words, the designs of high-technology products are now aimed at specific manufacturing processes for assembly and test. Except for simple dimensional data, much of this product data can no longer be "post processed" (by software means only) onto a different manufacturing process. A major process alteration might require reconvening the start-up team and adapting the design and data for the new process.

Revision Management

Each revision of a part means that that physical design object has changed in some way. In the MDA process a complete set of DATA and DRAWING files is provided for every revision; there is no implied or referenced data. All active revisions still being built remain on line, and subsequent revisions do not overwrite earlier revisions. If the same DRAWING file applies to different revisions, it will be provided with each of those revisions. We were concerned initially that this simplified approach would generate a large number of redundant files, particularly DRAWING files. However, an analysis of the completed sets showed that, with the CAD design processes in use, only 10 to 20 percent of the files were unchanged from one physical revision to the next. Our conclusion now is that having some redundant files is a cheap price for the benefit and simplicity of having full data sets. Thus no data set has to reference data from another set, and old revisions can be readily archived.

The MDA process currently has one significant limitation. Unlike the existing procedures for paper drawings, there is no standard control process for putting a formal revision on a DATA file. On the other hand, it is not clear that a control process is sufficiently valuable in a product environment that is totally data driven. Traditionally, when necessary, a paper drawing can be changed separate from the physical revision of the object itself. That cannot currently be done for DATA files since there are no standard procedures that are equivalently recognized for naming them or for controlling revisions. If the DATA files really define the physical product, then an erroneous data file defines the wrong physical product. In that case, it can be argued, the right way to signify the change is to update the revision of the object itself. At the present time, if an incorrect DATA file is included in the released data set, the only unequivocal way to correct that problem is to advance the physical revision and generate a new set of data.

Within the MDA process, the status of any file is specifically marked. (The mere existence of the file within the process does not imply any particular status.) Typical categories of status are verified, issued, released, and obsolete. A status is implemented by using the file-ownership capabilities within the VMS system. As its name implies, MDA provides on-line access to all needed data and drawings for any and all revisions. However, the formal status (preliminary, released, etc.) of each part and revision available on line is controlled and specified by other existing standard procedures. That status is confirmed by MDA but cannot be determined solely from the status information that MDA provides on line with the data.

The MDA process is not directly coupled to the control procedures in Manufacturing, but is linked directly with status-setting activities in Engineering. For example, the issued status is set by a procedure run by the product's ECO coordinator when he issues an ECO package to his counterpart in the manufacturing plant. Therefore, the data users in Manufacturing are advised to use the displayed status only as confirmation of a change; they will continue to be notified first through the existing ECO control procedures.

Thus, MDA has on-line data available for a manufacturing activity when Manufacturing is notified, by means external to the MDA process, that they should be building a particular revision. Also, MDA provides no on-line information about such things as the interactions and relationships between revisions, which revisions of the modules go together, and which revisions go with which backplane revisions. Therefore, although MDA is a comprehensive data-management and access process, it is not also a true configuration-control and revision-management process.
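The revision rules described in this section (a complete file set for every revision, released sets that become read-only, and corrections made only by advancing the revision) can be illustrated with a minimal Python sketch. It is not MDA code: the class name `RevisionStore` and its methods are hypothetical, and an in-memory dictionary is assumed here in place of the VMS subdirectories.

```python
from types import MappingProxyType


class RevisionStore:
    """Hypothetical stand-in for per-revision file sets (not the MDA software)."""

    def __init__(self):
        # (part_number, revision) -> read-only mapping of file name -> content
        self._sets = {}

    def release(self, part_number, revision, files):
        """Store a complete file set; an existing revision is never overwritten."""
        key = (part_number, revision)
        if key in self._sets:
            raise ValueError(f"{part_number} rev {revision} is already released")
        # Copy the caller's dict, then freeze it so the released set is read-only.
        self._sets[key] = MappingProxyType(dict(files))

    def get(self, part_number, revision):
        """Return the read-only file set for one part and revision."""
        return self._sets[(part_number, revision)]

    def advance(self, part_number, old_revision, new_revision, changes):
        """Correct or change files by releasing a whole new revision.

        The new set is a full copy of the old one with the changes applied,
        so no set ever references data held in another set.
        """
        files = dict(self.get(part_number, old_revision))
        files.update(changes)
        self.release(part_number, new_revision, files)

    def unchanged_fraction(self, part_number, rev_a, rev_b):
        """Fraction of files in rev_b that are identical in rev_a."""
        a = self.get(part_number, rev_a)
        b = self.get(part_number, rev_b)
        same = sum(1 for name, content in b.items() if a.get(name) == content)
        return same / len(b)
```

In this sketch, fixing an erroneous DATA file means calling `advance` with the corrected content under a new revision; the earlier revision stays on line untouched, and `unchanged_fraction` gives the kind of redundancy measure the analysis above refers to.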


Directories and File Names

Within the MDA process, the DATA and DRAWING files are managed by grouping them in VMS subdirectories for the object that these files specify. The subdirectories are tied to a common-root directory to facilitate the management of the overall physical data on the host (e.g., moving various directory structures between disk drives). The directory files themselves are owned by the data-management process. They may not be read directly over the network; the access process provided must be used. In pictorial form, the directory structure is described in Figure 1.

[Figure 1 VMS Directory Structure. Levels, from top to bottom: COMMON ROOT; PART NUMBER; VARIATION; REVISION; then DATA FILES (50 - 70) and DRAWING FILES (30 - 50) under each revision.]

The name of each DRAWING file is tied directly to the Digital drawing number plotted by that file. For multisheet drawings, a plot file is made for every sheet in the complete drawing package, so there is a one-to-one correspondence between DRAWING files and drawing sheets. The files are named to match exactly the title block of the drawing sheet. A typical DRAWING file name is depicted in Figure 2.

[Figure 2 Typical DRAWING File Name. Labeled fields: DRAWING CODE, DRAWING NUMBER, SHEET, SHEET REVISION.]

For DATA files, a different strategy for file names was necessary since, unlike the DRAWING files, a one-to-one linkage does not exist. A DATA file relates to the physical object it defines; therefore, the file name defines the exact part to which that file applies as well as the file's specific content and format. File names must also continue to completely identify the files after they have been extracted from the MDA management process and moved to Manufacturing. Therefore, part of the file name is actually redundant with the MDA directory name. These file names can become extremely long, and although reading them is not a problem, typing them is. Thus the file names are automatically generated, and users can select them from menus. The name of a typical DATA file is structured as in Figure 3.

Since there were many DATA and DRAWING files, the file-naming scheme also permits the creation of a typical VMS "wild card" directory listing for specific types of DATA or DRAWING files. For DATA files, the specific type of process activity supported by that file is included as a unique field in the file name. For DRAWING files, the drawing code is included in the file name, which also implies the likely uses. These fields within file names are then used in Manufacturing to obtain file listings specific to an activity; wild-card directory listing is by far the most common style of use.

[Figure 3 Typical DATA File Name. Labeled fields: PARTNUM; VARIATION; REVISION; CATEGORY OF DATA (IN_CIRCUIT TEST); DETAILED TYPE OF DATA (MCA MODEL, for OXYZ MCA, logical revision 011); DATA FORMAT.]
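As a rough illustration of how self-identifying names support wild-card listings, the following Python sketch builds DATA file names from the fields shown in Figure 3 and filters them by process activity. The underscore separators, the dot before the format, the function names, and all part numbers and category codes are assumptions made for this example; the actual MDA name layout is not reproduced here, and `fnmatchcase` merely stands in for a VMS wild-card directory command.

```python
from fnmatch import fnmatchcase


def data_file_name(part, variation, revision, category, detail, fmt):
    """Build a self-identifying DATA file name from its fields.

    The separators are assumed for this sketch; only the list of fields
    follows the figure. Field values must not contain underscores.
    """
    return f"{part}_{variation}_{revision}_{category}_{detail}.{fmt}"


def list_by_category(names, category):
    """Emulate a wild-card directory listing for one process activity."""
    pattern = f"*_{category}_*.*"
    return sorted(name for name in names if fnmatchcase(name, pattern))


# Hypothetical file set for one part, variation, and revision.
names = [
    data_file_name("P1234", "00", "A1", "ICT", "MCAMODEL", "NET"),
    data_file_name("P1234", "00", "A1", "ICT", "PINLIST", "DAT"),
    data_file_name("P1234", "00", "A1", "ASSY", "PLACEMENT", "DAT"),
]

# Listing of the files that support the (hypothetical) in-circuit test activity.
ict_files = list_by_category(names, "ICT")
```

Because the part number, variation, and revision are repeated in every name, a file selected this way remains fully identified after it has been copied out of its directory, which is the property the naming scheme is meant to guarantee.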


On-line Data Access

Since all DATA and DRAWING files for each revision of a Part Number are accessible on line, it is a simple process for authorized users to access them. A user first logs on to a captive (limited function) account on a specific host CPU from any system on Digital's DECnet network. Since this process is controlled by a list of authorized users, no password is necessary. The user never sees the VMS prompt level but is immediately presented with a menu of MDA functions. He is then asked a short series of questions about either the Part Number or Revision Number and is provided with a directory of the applicable files.

All user transactions with the data-access process are automatically logged. This logging provides several important capabilities:

• An accurate summary of the actual on-line data usage (which has shown that our initial assumptions were quite incorrect as to who would use what data, and how much access traffic there would be)
• A degree of additional security by tracking all data accesses
• A means to notify all users who have utilized any file in which an error has been found

Electronic Drawing Access, Plotting, and Management

At the present time, most DRAWING files are in the VMS data format of FILE_NAME.PLO since .PLO is the data format that can be released electronically to Digital's on-line drawing-microfilm service. A variety of software packages using this data format are available in each manufacturing plant. We expect to make a transition to a new industry standard when it comes into general use.

Providing each separate drawing sheet as a separate file was the first step toward a paperless process. The second step was to give Manufacturing the ability to view a drawing on a workstation. The workstation used is the VAXstation II system. The software provides the following capabilities:

• Access drawings directly from the on-line data process
• Create windows for the drawing, and zoom around it
• Annotate a copy of the drawing for use with specific processes
• Return a copy with questions for the responsible engineer
• Submit plot requests automatically for the whole drawing or any selected window to either a large electrostatic or an LN03 Plus printer, both accessible on a local Ethernet link

The process of making snap-shot window plots of specific areas of interest on the LN03 Plus printer has proven to be a very effective capability, and shows some of the possibilities of replacing large sheet paper plots within the Manufacturing functions.

Summary

The MDA process has been operational since the first prototypes of the VAX 8800 system were built. MDA presently maintains approximately three gigabytes of VAX 8800 product data on line, including both prototype and production revisions. More than one hundred users from ten different locations in both Manufacturing and Field Service have logged an average of two hundred transactions per week. Although MDA contains significant amounts of control and verification software, there has been little formal user training. The simplicity of the MDA process allows the on-line Help information to be an effective source of primary documentation.
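The logging capabilities listed under On-line Data Access (a usage summary and notification of users after an error is found in a file) can be illustrated with a minimal Python sketch. This is not MDA's implementation: the `Transaction` record, its field names, the two helper functions, and the sample user and file names are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One logged access. The field names are hypothetical, not MDA's layout."""
    user: str
    file_name: str


def usage_summary(log):
    """Count logged accesses per user, showing who actually uses the data."""
    return Counter(t.user for t in log)


def users_to_notify(log, bad_file):
    """Return every user who accessed a file later found to contain an error."""
    return sorted({t.user for t in log if t.file_name == bad_file})


# Invented sample log for illustration.
LOG = [
    Transaction("SMITH", "P1234_01_A-TEST-MODEL_X.DAT"),
    Transaction("JONES", "P1234_01_A-TEST-MODEL_X.DAT"),
    Transaction("SMITH", "P5678_01_A-ASSY-PARTS_ALL.DAT"),
    Transaction("SMITH", "P1234_01_A-TEST-MODEL_X.DAT"),
]

summary = usage_summary(LOG)
affected = users_to_notify(LOG, "P1234_01_A-TEST-MODEL_X.DAT")
```

Keeping one record per transaction means the same log serves all three purposes named in the text: summarizing usage, tracking accesses, and finding the users affected by a faulty file.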

ISBN 1-55558-001-7

Printed in USA EY-6711E-DP Copyright © February 1987 Digital Equipment Corporation