<<

VAX 9000 Series

Digital TechnicalDigital Equipment Journal Corporation

Volume 2 Number 4

Fall 1990 Editorial jane . Blake, Editor Barbara Lindmark, Associate EditOr Circulation Catherine M. Phillips, AdministratOr Suzanne J. Babineau, Secretary Production Helen L. Patterson, Production Editor Nancy jones, Typographer Peter Woodbury, IllustratOr and Designer Advisory Board Samuel H. Fuller, Chairman Richard W. Beane Robert M. Glorioso john W. McCredie Mahendra R. Patel F. Grant Saviers Robert K. Spitz Victor A. Vyssotsky

The Digital Technicaljoumal is published quarterly by Digital Equipment Corporation, 146 Main Street MLO I-31B68, Maynard, Massachusetts 01754-2571. Subscriptions tO the journalare S40.00 for four issues and must be prepaid in u.s. funds. University and college professors and Ph. D. students in the electrical engineering and science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent 10 The Digital Tecbnicaljournal at the published-by address. Inquiries can also be sent electronically 10 D'I:[email protected] Single copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 12 Crosby Drive, Bedford, MA 01730-1493.

Digital employees may send subscription orders on the ENET to RDVAX::JOURNALor by interoffice mail to mailstop MLO I-3/B68. Orders should include badge number, cost center, site location code and address. U.S. engineers in Engineering and Manufacturing receive complimentary subscriptions; engineers in these organiza­ tions in countries outside the u.s. should contact the journal office to receive their complimentary subscriptions. All employees must advise of changes of address.

Comments on the content of any paper are welcomed and may be sent to the editOr at the published-by or network address.

Copyright ll:J 1990 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation ·s authorship is permitted. AU rights reserved.

The information in this Journal is subject 10 change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this journal.

ISSN 0898-901 X

Documentation Number EY-E762 E-DP Cover Design The following are trademarks of Digital Equipment Corporation: Digital s VAX 9000 mainframe system is the theme of this issue. Cl, DECsystem-10, DECSYSTEM-20, Digital, the Digital logo, HDSC, Our cover depicts several simple instructions flowing through MC!J, Micro VAX, Nl, PDP-I, Ul;fRIX, VAX, VAX-11/780, VAX 6000, the VAX 9000 instruction execution . High performance VAX 8000, VAX 8600, VAX 8650, VAX 9000, VAXBI, VMS, XMI. was achieved by breaking the VAX instructions into small simple IBM is a registered trademark of International Business Machines tasksthat could be pipelined efficiently. Concurrentoperation Corporation. on up to six instructions simultaneously resulted in a execution Kapton is a trademark of E. I. duPont de Nemours & Company. rate of one simple VAX instntction per clock period. MOSAIC 111 is a trademark of Corporation.

Gloria Monroy of the High Performance Systems Group designed Micromaster Plus is a registered trademark of t.:rx Company. the cover graphic, which was implemented in cooperation Book production was done by Digital's Educational Services with David Comberg of the Corporate Design Group. Media Communications Group in Bedford, MA. I Contents

11 Foreword Carl S. Gibson

VAX9000 Series

13 Design Strategy for the VAX9000 System David B. Fite Jr. , Tryggve Fossum, and Dwight Manley

25 VAXInstructions ThatIllu strate the Architectural Features of the VAX9000 CPU John E. Murray, Ricky C. Hetherington, and Ronald M. Salett

43 Semiconductor Technology in a High-peiformance VAX System Matthew J Adiletta, Richard L. Doucette, John H. Hackenberg, Dale H. Leuthold, and Dennis M. Litwinetz

61 Vector Processing on the VAX 9000 System Richard A. Brunner, Oileep P. Bhandarkar, Francis X. McKeen, Bimal Patel, William). Rogers Jr., and Gregory L. Yoder

80 HDSC and Multichip Unit Design and Manufacture Peter B. Dunbeck, Richard). Dischler, James B. McElroy, and Frank J. Swiatowiec

90 The VAX 9000 Service Unit Matthew S. Goldman, Paul H. Dormitzer, and Paul A. Leveille

102 The Unique Features of the VAX 9000 Power System Design Derrick). Chin, Barry G. Brown, Charles F. Butala, Luke L. Chang, Steven). Chenetz, Gerald E. Cotter, BrianT. Lynch, Thiagarajan Natarajan, and Leonard J. Salafia

118 Synthesis in the CAD System Used to Design the VAX 9000 System Donald F. Hooper and John C. Eck

130 Hierarchical Fault Detection and Isolation Strategy for the VAX9000 System Karen E. Barnard and Robert P. Harokopus Editor'sIntroduction I implement the 77 different chips, the five custom chips, and the self-timed RAM architecture. An additional performance improvement for numeric computations is the VAX vector architec­ ture and is treated in the paper by Rich Brunner, Dileep Bhandarkar, Frank McKeen, Bimal Patel, Rill Rogers, and Greg Yoder. They discuss the architec­ tural model and particulars of the VAX 9000 imple­ mentation, which affords numerically intensive applications performance four to five times greater than can be achieved by the . To ensure that the system performance gains at the semiconductor level were not diminished jane C. Blake but were instead enhanced by packaging and inter­ Editor connects, engineers developed several technologies The VAX 9000, Digital's first , unique in the industry. The technology behind the is the topic of papers in this issue of the D(f.{ital high-density signal carrier and the multichip unit Te chnical jo urnal. As engineers writing for this are explained in the paper by Pete Dunbeck, Rich issue relate, the primary goal of the project from the Dischler, Jim i'vlcEiroy, and Frank Swiatowiec. initial product strategy through manufacture was to Equally important to performance in the new design and build a very high-performance, highly 9000 is system reliability as e\'idenced by the intro­ reliable VAX system. duction of the service processor unit. In their paper Design engineers applied both crsc and R!SC about the service processor, Matt Goldman, Paul techniques to achieve high levels of performance Dormitzer, and Paul Leveille relate how the for this rightly coupled multiprocessor system. MicroVAX-based system embedded within the 9000 In the opening paper, Dave Fire, Tryggve Fossum, detects, isolates, and corrects problems without and Dwight Manley explain the strategy behind the interrupting the system . design. They begin with an overview of the system, High system availability \Vas also one impetus in the technology, and CAD tools, and then describe the design of the power system . Some of the unique the redesign of VAX instructions into small tasks features of the power system, such as redundant which can be efficiently pipe lined. The authors regulators, improved load sharing and simula­ also touch upon three additional aspects of the tion, are discussed by Derrick Chin, Barry Brown, VAX 9000 system: the integration of vector ­ Charles Butala, Luke Chang, Steve Chenetz, Jerry ing into the VAX architecture, new error handling Cotter, Brian Lynch, Raj Natarajan, and Len Salafia. techniques, and performance modeling. The two papers that close this issue address the One measure of performance is the number of topics of CAD methodology and system diagnosis. instructions processed per cycle. The average num­ Don Hooper and John Eck describe a CA D method­ ber of is less than five, which ology that combines advanced rule-based A! tech­ is nearly half the instruction execution rate of pre­ niques with an object-oriented . The new vious VAX systems. To illustrate the architectural methodology saves logic designers significant time features that enable this level of performance, John and reduces errors. A complex system such as the Murray, Rick Hetherington, and Ron Salett have VAX 9000 requires improved system diagnosis capa­ selected a small sample of VAX instructions. They bilities to achieve the desired high system availabil­ describe the instruction flow through the pipeline, ity. Karen Barnard and Rob Harokopus demonstrate how instruction features combine to work on a sin­ how a new scan system, in combination with scan gle macro, and how stages of the pipeline interact. pattern testing, and symptom-directed diagnosis ln addition to the architectural improvements, achieve this necessary diagnosis capability. machine performance is enhanced at the semi­ The editors thank Rick Hetherington of the High conductor level by a new generation of semicustom Performance Systems Group for not only writing a and custom integrated circuits that support a low paper but for his help in coordinating this issue. cycle time. Matt Acliletta, Dick Doucette, John Hackenberg, Dale Leuthold, and Dennis Litwinetz give an overview of the bipolar technology used in the system. They then describe the methods used to

2 Biographies I

Matthew J. Adiletta Matthew Adiletta is currently contributing to the implementation of a new processor architecture and performing a technology evaluation to determine the technology for the implementation. He joined Digital in 1985 to work on a high-performance RISC architecture. Matt was not only the architect for the VAX 9000 system, but he also implemented the integer and floating point multiply and divide units and developed an ECL custom chip process. He holds one patent and has several patents pending. Man received a B.S. E. E. (honors, 1985) from the University of Connecticut.

Karen E. Barnard A senior software engineer with the High Power Business Unit CPU Development Group, Karen Barnard wrote the read-only memory­ based diagnostic for the VAX 9000 service processor unit's scan control module and developed the scan pattern diagnostic for the VAX 9000 CPU and SCU. Karen also worked on the debugging structural test process for the VAX 9000 kernel environment. Prior to joining Digital in 1986, Karen was with Corporation. She received a B.S. ( 1983) in computer science from the Worcester Polytechnicallnstitute.

Dileep P. Bhandarkar As technical director for RlSC systems, Dileep 13handarkar is responsible for leading the architectural direction of RlSC prod­ ucts. He joined Digital in 1978 and was responsible for managing the evolution of the VAX architecture. Dileep was the chief architect for VAX vector processing and coarchitect of Digital's RISC architecture. He holds one patent for his work at Digital and has several patents pending. His degrees in electrical engineering include a 13achelor of Technology from the Indian Institute of Technology and an M.S. and a Ph. D. from Carnegie-Mellon University.

Barry G. Brown The concept of designing DC-to-DC converters as system elements rather than individual "power supplies" was introduced into the high­ power systems products by Barry Brown. He created and developed a highly tlexible, high-reliability DC-to-DC conversion system for the VAX 9000 series. Barry designed, implemented, and verified the power system for the VAX 9000 Model 200 systems. He was a principal engineer for the Codex Corporation before coming to Digital in 1984. Barry is a graduate of Woolwich Polytechnic and Harlow Technical College.

3 Biographies

Richard A. Brunner As 3 principal engineer, Richard Brunner is the architect currently responsible for the engineering refinement and control of both the VAX and VAX vector architectures. He is the editor of the VAX Architecture Reference Ma nual and coauthor of the VAX Vector Ha ndbook and several papers on the VAX vector 3rchitecture. He received a B.S. (high honors, 19R4) in elec­ trical engineering from Case Western Reserve University and an M.S. (1987) in computer engineering from Rensselaer Polytechnic Institute. He is a member of JEEF. and Ta u Beta Pi.

Charles F. Butala Presently responsible for the power system design and architecrure of rhe VAX 9000 Model 400 systems, Charles Butala is a consulting engineer in the Information Systems Business Unit Power Systems Group. Since he joined Digital in 1976, he has been responsible for several power system design projects, including the VAX H600 system. He is a member of IEEE and Ta u Beta Pi, and holds honorary society membership in Eta Kappa Nu. Charles received a R.S.E.E. (1968) from Illinois Institute of Technology and an M.S.E.E. from Norrhe3stern University.

Luke L. Chang After receiving his M.S. in electrical engineering from Virginia

Polytechnic lnstirute and Stare U niversity in 1988, Luke Chang joined the Power Sysrems Te chnology and Regulations Group. He is currently a hardware engineer and is responsible for developing simulation tools to perform high-quality software design verification tests for the next generation DC-to-DC power con­ verters. Luke's previous responsibilities include transient analysis :md testing of the VAX 9000 memory power distribution sysrem, 3nd power system cost reduc­ tion studies.

Steven ). Chenetz As a principal engineer in the Information Systems Busi­ ness Unit Power Systems Group, Steven Chenetz is currently working on the H7390 for a high-power VAX system. He previously was a member of the design and development te3ms for the H7380 of the VAX 9000 system, the H71HH envi­ ronmental monitoring module fo r the VAX 8600 power system, the VAX 8600 clock distribution system, and signal integrity for the VAX 8600 system. Steve joined Digital upon gr3cluation from Rensselaer Polytechnic Institute in 19RI. He has 3n M.S.E. E. from Northeastern University (19H7).

Derrick ). Chin Derrick Chin is the engineering manager for sever3l Infor­ mation Systems Business Unit power groups and is design engineer of the VAX 9000 processor's DC power distribution system. His 3ssociation with Digital began in 1961, and he has participated in many projecrs, from the POP-I ami the DECsystem-10 to the VAX HMO systems. His responsibilities have ranged from development of precision displays, circuit design, and core and semiconductor memories to environmental monitoring modules and power systems. He holds a B.S. E. E. (1959) from MIT.

4 I

Gerald E. Cotter Principal engineer Gerald Corter is a member of the Infor­ mation Systems Business Unit Power Systems Group. He was the project engineer and coarchitect of the VA X 9000 power control system (PCS). Jerry was the PCS interface to Customer Service and Support Engineering, Manufacturing, and Service Processor Unit Groups. He participated in development of the PCS and power system test strategies and the initial design of the T01060 power and envi­ ronmental monitor module. His previous work includes the VAX 8600 system's power and control subsystem.

Richardj. Dischler In his position of systems engineer for the High Perfor­ nunce Systems Group, Richard Dischler worked on the VAX 9000 signal integrity project. He also was a member of the project team for the electrical design of

HDSC and micropackaging for multichip units, planar boards, and connectors for the VA X 9000 system. Rich held similar responsibilities in the development of the VA X 8600 system. He joined Digital in 1982, and his previous experience was at Applied Research Laboratories. He holds a B.S.E.E . (1982) from Pennsylvania State University.

Paul H. Dormitzer As an undergraduate at Harvard University, Paul Dormitzer gained experience with the by working as a programmer and operator. Upon receiving his B.A. in computer science in 1987, he joined Digital's High Performance Systems Group. He is currently an engineer in the High Performance Business Unit CPU Engineering Group. Paul's primary responsibilities are in the development of error recovery processes for high­ power systems, such as the VAX 9000 system.

Richard L. Doucette Since joining Digital in 1979, Richard Doucette has been a member of several high-performance systems project teams. As a senior engi· neer on the VAX 8600 team, he helped introduce the Motorola I (MCA I) technology into Digital and was responsible for its design analysis and characterization in the system. As engineering manager on the VAX 9000 team, he was responsible for the incorporation ofMCA 3 technology, custom chips, and self-timed RAM components in the system. He holds a B.S.E.E. (1973) from the University of Maine.

Peter B. Dunbeck Peter Dunbeck is an engineering manager in the High Performance Business Unit Te chnology Research and Engineering Group. He held various positions on the VAX 9000 program between 1985 and 1990, includ­ ing technology program manager and design engineering manager for the multi­ chip unit. Before joining Digital in 1984 as a manufacturing engineer, Peter developed energy conservation programs for Thermo Electron. He holds a B.S.

(1977) in mechanical engineering from Virginia Te ch and an s. M. ( 1979) in aero­ nautics and astronautics from MIT.

5 Biographies

John C. Eck The dcvdopment of rhe majority of the physical design CAD tools used in rhe VAX 9000 system was managed by John Eck. He is a software engi­ neer manager in the High Performance Systems CAD and Diagnostics Group. John was employed as the manager of the Automated Design Department of Badger Company before coming ro Digital in 1984. He holds a BS (1964) in physics and an JYI.S. ( 1966) in aeronautics and astronautics from MIT, and an M.B.A. (highest honors, 198--i)from Babson CoJiege.

David B. Fite Jr. Consultant engineer David Fire was a member of rhe initial architecture team for the VAX 9000 system. He developed the architecture for the branch prediction, instruction fetch, and instruction decode for the VA X 9000. His previous work includes responsibility for prototype debugging on the VAX 8600 system. D:IVe joined Digital in 1982. He has one patent and several patent applications pending. He is a graduate of Worcester Polytechnic Institute with a B.S. (honors) in electrical engineering.

Tr yggve Fossum Tryggve Fossum is rhe system architect of rhe VAX 9000 sys­ tem. He received a B.S. ( 1968) from the University of Oslo and earned his Ph. D. ( 1972) from the University of Illinois. Tr yggve joined Digital in 1973 and worked on the design of high-end , notably the VA X-11/780 system. As a pro­ ject leader on the VAX 8600 team, he guided the design of the t1oating point accel­ erator. He has also worked on several research projects, including an early raster scan graphics , and a workstation with an integrated disk system.

Matthew S. Goldman As a senior engineer on the VA X 9000 project team, Matthew Goldman designed the scan control chip, which contains the control logic for the VAX 9000 scan system. He was also the responsible engineer for all VAX 9000 service processor h:trdware. Prior to joining Digital's High Perfor­ mance Systems CPU Group in 1986, Matt was a design engineer for Raytheon Company. He is a member of Ta u Beta Pi and Eta Kappa Nu. M:ut holds a B.S. (highest honors, 1983) and an M.S. ( 1988) in electrical engineering from Worcester Polytechnic Institute.

John H. Hackenberg In 1968, John Hackenberg came to Digital as a tech­ nician on the Kl- 10 project, leaving after two years to serve in the armed forces. He returned to Digital in 1971 and worked on the designs for various high-end systems, including the KL- 10. As a consulting engineer on the VAX 8600 project, he worked in the area of signal integrity. John was the project leader for the MCA3 gate array used in the VA X 9000 system and is currently developing a bipolar gate array. He holds a B.S.E.T. {1979) from the University of Lowell. I

Robert P. Harokopus A cum laude graduate of the University of Michigan, Robert Harokopus received a B.S. (1986) in computer engineering and is now studying for an M.S. in computer engineering from Boston University. Bob is a senior software engineer and joined Digital in 1986. He developed the symptom­ directed diagnosis software used in the VA X 9000 service processor unit. Bob also developed software for the HIDE CAD tool and SCEPTER automatic test pattern generator, both of which were used in the VA X 9000 design project. He is a member of Ta u Beta Pi and Eta Kappa Nu.

rucky C. Hetherington As a principal engineer with the High Performance Systems Group, Ricky Hetherington is currently the project leader of the transla­ tion buffer and design of the VAX 9000 system. He holds one patent and has several patents pending on the various design features of the VA X 9000 M-box. Rick joined Digital in 1982 as a senior engineer in Digital's Large Computer Group. He has a B.S. from Pennsylvania State University.

Donald F. Hooper Don Hooper is a consulting engineer in both logic design and CAD disciplines. He initi:ued and led the development of the Synthesis of Integral Design program, Digital's first synthesis tool. Before coming to Digital in 1979, he was architect for the Itel 7031 mainframe and cache designer for the !tel Advanced System 4. He is a graduate of Don Bosco Te chnical Institute. Don holds patents in speech recognition circuits, the tag and queuing system for Digital's first pipelined CPU, and the control storage pipe for the VAX 8600 system. In addition, he has several patents pending in .

Dale H. Leuthold A member of the technical staff of the Integral Circuit Design Group, Dale Leuthold led the design team for the VAX 9000 vector regis­ ter chip. He is currently working on random-access memory development for high-speed mainframes. Dale was responsible for bipolar design at Signetics Corporation and Trilogy Systems Corporation before coming to Digital in l9H6. He holds one patent and has one patent pending. Dale received a B.S. from Oregon State University.

Paul A. Leveille In his nearly ten-ye:.Jr relationship with Digital, Paul Leveille has specialized in the development of high-power systems, particularly the VA X 8600 and VA X 9000 systems. As a principal engineer in the High Perfor­ mance Business Unit, he helped define the VA X 9000 service processor sub­ system and was responsible for developing the scan control fi rmware and portions of the service processor application software. Paul's previous responsi­ bilities include console diagnostics, fi rmware. and ::�pplicationsoftwa re.

7 Bio�raphies

Derutis M. Litwinetz The projecr leader for the design of four standard and custom chips for the VAX 9000, Dennis Lirwinerz is a consuhing engineer in the High Performance Business Unir. He has previously participated in the design of rwo standard eel.! chip designs for the VA X 8600 system. He joined Digital in 1967 as a technician for the DECsysrem- 10 Engineering (;roup. Denni:-; has a patent pending for the VAX 9000 self-rimed register fi le design. He received a R.S.E.E.T. from Lowell Technological Institute and an ,'.S.C.E. from the University of Lowell.

Brian T. Lynch Brian Lynch is a principal hardware engineer in the Informa­ tion Systems Business Unit Power Systems Group. In this position. he designed and developed the H7382 bias power supply used in rhe VAX 9000 system. He is presently working on power solutions for future high-performance systems. Prior ro joining Digital in 1972, Brian was responsible for power convener and analog module design ar lntronics. He has a B.S. E.E. (1978) from Wo rcester Polytechnic lnstirure.

Dwight Manley As a principal engineer on the VAX 9000 project, Dwight Manley was responsible for all of the perform:mce modeling of the VAX 9000 CPU design. His present responsibilities include writing code for a Digital Extended i'vlarh Library product. Dwight joined Digital in 1979 as a member of the Systems Performance Analysis Group. Prior to that time, he worked as a systems programmer for the Bell Telephone System. Dwight has a H.S. (1971) in mathematics from the University of Massachuseus and an M.S. ( 1976) from Northeastern University.

James B. McElroy Jim McElroy is the multichip unit operations manager. His work on the VAX 9000 system began with interconnect and packaging, fol lowed by the management of the physical technology efforts. He then became the manufacturing systems program manager for the introduction of the VAX 9000 system into manubcturing. Before joining Digital in 1976, Jim worked at RCA on packaging and interconnect design for military computer systems. He received a B.S.M.E. and an M.S.M.E. from Northeastern University.

Francis X. McKeen The project leader for the -box unit of till' VA X 9000 system was Francis McKeen. Prior to working on the VAX 9000 system, he wrote for the VAX 8600 and VAX 8650 systems. Frank is a principal engineer and has been with Digital for seven years. He holds one patent and has several rarenr applications pending. Frank received a B. S. E.E. from Northeastern University and is a member of IEEE. I

john E. Murray The coauthor of of the VAX <)000, john Murray is a consulting engineer in the High Performance Business Unit. He served as project leader of the design team for the 1-box unit of the VAX 9000. He joined Digital in 1982. John's previous employer was ICL in the , where he was a design engineer. He received a B. Sc. ( 1969) from Warwick University. He holds one patent and has several patents pending.

Thiagarajan Natarajan Thiagarajan Natarajan is manager of a DC-to-DC converter group in the Information Systems Business Unit. His group develops

a high-density and highly reliable DC-to-DC converter, associated hybrids, semi­ conductor components, and the distribution system for the next generation, high-performance VAX systems. Raj's prior experience includes positions at , Bell Laboratories, and Perkin Elmer Corporation. He has a Ph.D. in dectrical engineering, has been awarded one patent, and has authored approximately seventeen technical papers.

Bimal Patel Principal engineer Bimal Patel joined Digital in 1986 as a senior engineer. His primary responsibility since that time was the design of the V-box unit of the VAX 9000 system. Bimal was previously employed as a senior engineer in the CPU Design Group of , Inc. He has an M. S. in computer engineering from Boston University.

William J. Rogers Jr. William Rogers is an engineer in the VA X 9000 CPU Group, where he developed the design of the control logic of the V-box unit for the VAX 9000. Prior to working on this high-performance system, Bill was a member of the SASE Support Engineering Group. He joined Digital in 1986 and is a member of IEEE and Tau Beta Pi. He received a B. S. ( 1986) in electrical engineer­ ing from Michigan Technological University.

Leonard j. Salafia The development of the AC front end for the VAX 9000 system was the responsibility of Leonard Salafia. who is the manager of the AC Power Interface Developmem Group. His previous work at Digital includes supervising the development of storage system power products for the Central Power Supply Engineering Group and for the Storage Systems Power Group. Len

worked for General Electric prior to coming to Digital in 1980. He holds a B.S.E. E. (magna cum laude, 1969) from the University of Hartford and an M. S.E. E. (1976) from Renssel::ler Polytechnic Institute.

9 Biographies

Ronald M. Salett As a consulting engineer in the High Performance Systems Group, Ron Saletr is currently leading the development of a new high-perfor­ mance CPU. As a project leader for the VAX 9000 system, he was responsible for the architecture, design, and microcode of the . Since joining Digital in 1977, Ron has also worked as an architect and project leader on low-end integrated PDP- 11 systems. He holds two patents. Ron holds a B.S.E.E. (1975) from Carnegie-Mellon University and an M.S. E. E. (1979) from Worcester Polytechnic Institute.

Frank J. Swiatowiec In 1988, Frank Swiatowiec became HDSC operations manager, with the primary responsibility to transition Digital's new HDSC tech­ nology to volume production. He was one of the engineering managers responsi­ ble fo r the definition and development of the HDSC . Frank had over 15 years of experience in the semiconductor industry when he joined Digital in 1986. While with Motorola Corporation, he was awarded four patents on ECL circuit designs. Frank holds a B.S.E.E. from the University of Illinois and an M.S. E. E. from Arizona State University.

Gregory L. Yo der Gregory Yo der is a senior hardware engineer with the High Performance Systems CPU Engineering Group. His primary responsibilities on the VAX 9000 system included the design and testing of the V-box unit, and pro­ totype system debug, for which he received an excellence award . He also assisted Manufacturing in producing and installing external fi eld test VAX 9000 machines. Greg joined Digital in 1988, after participating in a one-year co-op session at IBM. He holds a B.S. E. E. from Pennsylvania State University.

10 I Fo reword a significant impact in that this structure enabled impingement air cooling and elimination of the bulky liquid system that was part of the initial design. The final design of the VAX 9000 system reflects, in myriad forms, this continual process of successive refinement toward shared goals. Design changes notwithstanding, our primary strategy remained constant. The reader will note that, while we innovated aggressively in CPU struc­ ture, implementation technologies, and design methodologies, we preserved fu ll compatibility with the VA X, Digital storage, and Digital network­ ing and cluster architectures. We wanted Digital Carl S. Gibson and our customers to be able to enjoy very high per­ VAX 9000 Program Ma nager formance levels in a product that was compatible with prior investments. Therefore, we drew as This issue of the Digital Te chnical jo urnal is a much as possible from existing products and collection of papers describing the technologies, designs from many Digital development groups. designs, and design methods employed in Digital's As a result, the VA X 9000 system incorporates VAX 9000 mainframe/, which was Digital's standard XMI and popular Bl, Cl, and introduced in the fa ll of 1989. Nl system-level interconnects. The system runs VMS The VA X 9000 system embodies hundreds of and operating systems, VA X layered prod­ innovations in most areas of design, manufacture, ucts, and all of our customers' and independent and service. In selecting papers for this journal, we software vendors' tools and applications. This have attempted to reflect the immense scope and capability proved especially rewarding when in the variety of this program, which ranks among the final months of the project, our own VAX 9000 largest and most complex in the history of our prototypes, running our unmodified CAD tools, industry. accelerated the processing of the inevitable last­ In the summer of 1983, a small group of us set minute changes. about to determine what it would take for Digital to High-performance computation fundamentally develop a true mainframe. We fe lt that a mainframe requires two key ingredients: short machine cycle VAX would be a powerful addition to Digital's times and maximum computational work per­ product family. The products that we have created formed in each cycle. The semiconductor and took form, changed, and evolved over the months multichip unit papers describe how we minimized and years as technical challenges yielded to inno­ the VAX 9000 cycle time by use of fa st circuits, high­ vations, rigor, and discipline. An undertakjng on density packaging, and high-speed interconnects. this scale necessarily undergoes numerous transi· These papers are complemented by architecture tions as new data emerges, assumptions are tested, descriptions through which the authors present the and alternatives are eliminated. Te chnical break­ innovative features that minimize the number of throughs built upon one another incrementally cycles required to execute the VAX instruction set. as we pressed the design closer to our goals. The These papers present the sophisticated pipelining primary objectives of very high system-level perfor­ techniques and vector processing capabilities incor­ mance and world-class reliability drove the design porated in the VAX 9000 system. process and the changes that emerged. Equal in importance to the computational capa­ The planar logic packaging is illustrative of how bilities of the product are the service and control changes and improvements built upon one another. features of the system. Papers covering the The reliability benefits of minimal connections VA X 9000 service processor and the system's fa ult precipitated a .logic packaging design change from management capabilities provide the reader with stacked modules in dual backplanes to the planar insights into these important aspects of the array. This change- an optimization for reliabil­ product. ity - in the end actually helped performance and The development strategy for the VA X 9000 maintainability. Utimately, though not envisioned system was explicitly formulated to deal with enor­ at the time, the adoption of the planar array had mous technical and project complexity. Complex-

II I

ity itself was the single most formidable challenge adapted to evolving user needs. Concurrent design facing the team. Apparent from the outset, was the permeated every aspect of the project and domi­ fact that such an ambitious product required the nated the way people worked together, with many integration of a very large number of discrete aspects of the technology and product design design objects; each had to be conceived, created, converging and adapting as we learned from our documented, tested, and ultimately integrated and own processes. When the manufacturing process verified as part of the whole. The reader will see needed some help, designs could be reprocessed the diversity of these efforts and recognize the with the new rules and rereleased to keep things challenge of unifying a design from this breadth of moving ahead. technical advancement. And, move ahead they did' To day, the VA X 9000 Central to our strategy was the creation of a system is installed at many customer sites where the unified design tool suite operating in a seamless, systems are exceeding our original goals in both homogeneous VMS computing environment. The performance and dependability. It has been first few years of the project were devoted to con­ accepted by experienced, high-end computer users struction of this environment in parallel with top­ as a bona fide mainframe - a mainframe with the level design formulation. The recognition that unique advantage of full integration with Digital's rigorous design methods were crucial to our success rich distributed processing architecture. was possibly one of the team's most powerful fun­ The VA X 9000 system was created by engineers damental notions. Papers included in this journal working in many disciplines and collaborating illustrate some of the legacy of powerful CAD tools worldwide to invent hardware, software, and pro­ and structured design approaches created by the cesses that have significantly advanced the state VAX 9000 team. of the art of computer design, manufacture, and As we have seen for the product, the methodol­ service. The papers in this journal describe but a ogies were not immune to change as the project few representative examples of the creativity and progressed. Working with rapidly evolving determination of this large and dedicated team of technologies, design process experts continually professionals.

12 David B. Fite]r. Tryg gve Fo ssum Dwight Ma nley

Design Strategyfo r the VAX 9000 System

Th e VAX 9000 system is Digital 's newest high-end processor in the VAX fa mi�y. Th is paper describes the design strategy used to achieve high performanceand shows how RISC concepts were applied to a CISC architecture. Neu.• opportunities fo rparallelism

in VAX program execution were fo und by breaking the VAX instructions into simple tasks which could be pipelined efficiently. By using independent, dedicatedpi peline stages, execution rates approach one instruction per cycle.

The task confronting the VAX 9000 design team connected through the system (SCU). was to develop a VA X system that outperformed Through a cross-bar , the SCU provides high­ any previous VA X system and that was competi­ speed, simultaneous transfers among the central tive with similarly sized processors from other processors, I/O devices, and memory banks. System vendors. Although the VAX system is based on one cache consistency is maintained with duplicate tag of the world's most popular computer architec­ directories located in the SCU. As references are tures, the VA X architecture's instruction complexi­ made to memory, the addresses are checked against ties preclude efficient macroinstruction pipelining, the tag directories. If a cache hit occurs, the cache in such as that found in reduced instruction set com­ question is requested to invalidate or write back to puters (RISC). RISC processors can be built with low main memory. The scu supplies a bandwidth that gate counts to handle simple, fi..xed-Jength instruc­ allows near linear performance improvement as tions sets, load/store architectures, and delayed new processors are added to the system. The mem­ branching. ory is interleaved on cache block boundaries to To compete with machines basedon such archi­ provide bandwidth for multiple CPUs and vector tectures and still remain compatible with the VA X processors. architecture, the design team chose to implement Four XMI backplane buses provide high band­ the VA X architecture on the VA X 9000 system by width paths to I/O devices. Although the XMI is used applying techniques that were similar to those used as the in VA X 6000 systems, the XMI is in RISC processors. We redesigned the VAX instruc­ used exclusively for I/O in the VA X 9000 system . tions into small, simple tasks, and designed dedi­ Several new adapters were designed to increase cated hardware that was optimized for each task. throughput and reduce latency for I/0 transactions. The result is a network of specialized processors, These adapters include connections to the CI, the each of which has its own data paths and state NI, the BI, and local disk comrollers. Although high­ machines, that operate in parallel and execute performance IIO features, such as disk striping, VA X instructions quickly. The most common, sim­ solid-state disk, and load balancing have been added ple instructions are executed at the rate of one to all VA X systems, the VA X 9000 system benefits the per cycle. most from these features because it has the I/O back­ plane bandwidth ro rake advantage of them. A block Sy stem Overview diagram of a single VA X 9000 CPU connected to the SCU and the major data paths between the two units The VAX 9000 system is a tightly coupled multipro­ 2.1 cessor, which runs the symmetric is shown in Figure (SMP) version of the VMS operating system and can have up to four processors sharing a central main Te chnology Contributions to memory. Figure l shows a simplified block diagram Improved Performance of the system. The major system components The central processor cycle rime has been reduced include fo ur CPUs, two memory controllers, two to 16 nanoseconds (ns) mainly by the use of fas t I/o controllers, and a service processor, which is emitter-coupled logic (ECL) semiconductors and

No. 4 Fall /<)')() Digital Te chnicaljournal Vo l. .! 13 VAX 9000 Series

XMI XMI

scu DODD DO DOD DODD DO DODD DODD DO DOD 256 MB DOD DODD CPU VAX VAX 9000 CPUNECTOR 9000 �m� mm

Figure I VAX 9000 Sys tem Diagram fast self-timed random-access memories (RAMs) for (HDSC), tape automated bonding, and a single registers and caches, and by decreasing the inter­ planar module all reduce the interconnect delay connect wire length between components. between active components in the VA X 9000 Motorola's Macrocell Array III (MCA)) technology system. Strict impedance control is maintained provided both macrocell array and throughout the system. Clock skew is minimized by capabilities. The emire system is composed of 77 employing fi xed-length, diffe rential transmission unique MCA3 options and 5 custom chip types. A and dedicated routing layers. single MCA3 contains 838 cells (4 14 major, 224 input, and 200 output), which 10,000 equiva­ lent gates, and 256 I/O pins. Maximum power CA D Contributions to Improved dissip:nion is 30.0 watts, with unloaded gate prop­ Performance agation delays of 120 picoseconds (ps). Perfor­ Hundreds of computer-aided design (CAD) tools mance-critical operations, such as multiplication. were used during the design and construction of division, integer and vector register accesses, and the VAX 9000 system. However, none of these tools system clocking, were h!rther aided by employing was more important in improving performance custom chips 2 than the physical layout and timing analysis tools. Caches for instruction stream and memory Once the design team had placed large functional data, scratch pad registers, ami control stores all sections, placement tools refined individual macro­ require high-speed local storage. Two versions of cell selection and pin placements. Over 33,000 pins a proprietary self-timed RAM were designed for were selected to minimize overall wire length and these specific applications. A 4 kilobit (Kb) self­ maximize critical interconnections. timed RAM, at 5. 5 ns, and a l6Kb self-timed RAM, Routing presented several challenges. All levels of at II. 5 ns, provide internal input and output interconnect included critical signals, differential latches and write pulse generation circuitry. Multi­ pairs, and fixed-length requirements. The HDSC ple access modes allow highly pipelined operations contains large cutouts that enable die attachment to take advantage of shorter access times. and allow cooling through the back panel. These Each new semiconductor generation reduces large routing restrictions and special routing cycle time. which increases the re!Jtive importance characteristics could not be handled by existing of interconnect delay. High density signal carriers CAD tools. Therefore. we developed Chameleon,

14 Vo l. .2 No. -i Faii i'J')(I Digital Te ch nica/journal Design Strategyfo r the VAX 9000 Sys tem

a general-purpose router. With Chameleon, cross­ used. This data fo rmed the basis for design deci­ talk is minimized, and crossing counts are main­ sions and trade-offs we made in the development tained and used to increase signal integrity, which of the VAX 9000 system. improves performance. To model the timing relationships within the Simple Instructions system, we used sophisticated CAD tools to gener­ In many VA X programs, only a few opcodes are ate an accurate representation of the VA X 9000 responsible for a large percentage of the instruc­ system. Detailed timing models of each macrocell tions issued. Most of these opcodes are simple and device were created using the SPICE simulator limited tO a single arithmetic or logical operation. 5 program Chameleon and signal integrity rools Often, one of the operands is in memory. A typical provided delay values for each signal within the example is MCA3, H DSC , and planar modules. CPLJDLY, using ADDL3 the AUTODLY timing tool, tied the various pieces

E-BOX · V-BOX ····· ------· - - .: INSTRUCTION, • INSTRUCTION ------INTEGER------VECTOR r-" CACHE � BUFFER UNIT ADD UNIT (BKB VIC) I • I : � .-----'-'- I/O AND BRANCH INSTRUCTION . VECTOR : 11----, DECODE H INSTRUCTION FLOATING : VECTOR � MEMORY PREDICTION V<-- ISSUE � POINT UNIT REGISTERS '---Y MULTIPLy INTERFACE (1K ENTRY) jv- (XBAR) I:""Y 1 • ¢= UNIT ....___, .. .---'-'-____,: . _._· --.- - ..----,..,.--J · . ...· · · · · · · · .. · OPERAND . REGISTER .. . . . DATA PROCESSING� FILE � MULTIPLYn RETIRE_.__._._1 ·.-.-.--- SWITCH • UNIT UNIT :I-BOX (OPU/SUFPL) h (SLIST/GPRs) 1 : �;I:N ..:: :�:::::: �;N:;:ATI;::=I O= N=I : '�'==::::::=��::::: �UNIT ::::::;::=.£:..!� DIVIDE J ;1K TB) UNIT : �����) ] : . ------scu . . lcr ====���l ;::: ·�- ----�-----�-- ��������-�. ..Jj.. . -->WRITE .--QUEUE'l�.c..__---, ...... (WRTQ) M-BOX

Figure 2 VAX 9000 CPUNector Block Diagram

2 Fa/1 /1)')0 Digital Te cbnicaljournal Vo l. No. 4 15 VAX 9000 Series

diffe rence between the way a VA X processor and a Context , translation buffer changes, and !USC processor process simple instructions is how instruction stream modifications all require that the the variable length instructions and memory speci­ virtual instruction cache be invalidated. Two com­

fiers are handled. VAX operands may reside in plete sets of block valid bits reduce cache sweeps to general-purpose registers (similar to RISC a single cycle, if consecutive sweeps do nor occur operands), in memory, or may be embedded in the within 256 cycles of each other. Block size and fre­ instruction stream. The VAX architecture provides quent sweeping reduce the virtual instruction a rich selection of memory operand specifiers, cache's hit rate to approximately 96 percent, but by which often require computations to create the fi lling through the main cache, the miss penalty is address. In a RISC processor, only load and store minimized. instructions access main memory. The instruction preprocessing stage (1-box) InstructionDecode decodes instructions and fe tches operands in the Because the majority of instructions executed VA X 9000 system. In the execution stage (E-box), require only a single cycle to execute, the instruc­ simple VA X instructions n.:s<:mble RISC instructions. tion decode's task of keeping ahead of the E-box is A simple opcode describes the operation, a single not simple. Most instructions must be decoded in a provides source operands, and a desti­ single cycle to keep the VAX 9000 system's ticks­ nation queue supplies a result descriptOr. The !-box per-instruction (tpi) low. operates in parallel as with the E-box, which func­ For example, VA X instructions may contain up to tions as a RISC processor by executing one instruc­ si..,x operand specifiers. With 59 diffe rent specifier tion each cycle. Execution occurs without the need addressing modes, instruction lengths can vary to identify the operand's source or addr<:ssingcom­ from a single byte to more than 50 byres. However, plexity. Figure 3 illustrates how simple instructions the overall average VAX instruction length is 3.8 t1ow through the VA X 9000 pipeline. Although all bytes, and 98 percent of instructions require only VA X implementations perform these tasks, the VA X 8 or less bytes.'i Furthermore, 96 percent of VA X 9000 implementation uses separate, independent instructions executed use only 3 or less specifiers. hardware units to overlap the work because con­ In each machine cycle, a 9-byte instruction buffer current operation is a prerequisite for single-cycle is presented to the decode stage (XBAR). The instruction execution. instruction buffer contains instruction stream data prefetched from the virtual instruction cache. Instruction Cache Instruction decoding consists of generating an ini­ We used an instruction cache in the 1-box to tial microaddress, determining the number of decrease instruction stream fetch latency and specifiers for the instruction, including each speci­ reduce the bandwidth requirements on the main fier access mode and data type, and forwarding the cache. Choosing a virtually addressed cache fu rther appropriate specifier data to the operand process­ reduced latency and simplified the design by ing stages. The XBAR can handle up to three specifi­ removing the need for duplicate translation buffers. ers. Instructions that contain more than three The virtual instruction cache is an 8 kilobyte (KB) specifiers require additional decode cycles. Since cache with a quadword line size, 32-byre blocks, general-purpose register specifiers occur approxi­ and a single-cycle access time. Line valid bits are mately 41 percent of the time, three register specifi­ maintained to allow variable size fills from the main ers can be processed concurrently.1' Short literals data cache. Because the average VAX code block size comprise nearly 16 percent of the specifiers. How­ is 16 to 20 bytes, the block size of the virtual instruc­ ever, the XBAR can only decode a single short literal tion cache provides a good balance between the per cycle. The remaining specifiers must all be instruction decode stage and the main cache. processed by the operand processing unit, which

Table 1 Decode Cycles Required

Instruction VA X-11/780 VA X 8650 VAX 9000

MULF3 R3,R5,R7 3 2 II ADDL3 S #48,R4,@(R2) + [R3) 5 4 AOBLEQ SII#63,R10,10$ 3 3

16 Vo l. 2 No. 4 Fall 19')0 Digital Te cbnicaljournal Design Strategyfo r the VAX 9000 Sy stem

OPERATION CYCLE 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

PC GENERATI ON DJ VIC ACCESS 0 INSTRUCTION DECODE

SPECIFIER PROCESSING

TRANSLATE � AOORESS . ;:::::::��:�:::� DATA CACHE I I I I ACCESS i i i : • . •·=·:·:· :·:·:·:.� : : ' . . . . . ' ' ' ' 1• . 1:• ::'::::::::::10 • ' I. 1:::::::::::::,1' ' MULTIPLY UNIT EXECUTION 04:::' . :::�:,�::::1. FLOATING UNIT i ��·:::::;::::::!! EXECUTION LOOP: MULF3 (R0),#2.5,R1 D I � MULF3 4(RO),II3.5,R2 : • INTEGER UNIT D EXECUTION MULF3 8(R0),#4.5,R3 [])

RETIRE ADDF3 R1 ,R2,R4 BJ

' ' ' AOOL2 #11xC,RO m REGISTER II WRITE ADDF3 R3,R4,(R5)+ IIIII . DATA CACHE D SOBGEO R6,LOOP ACCESS • DJ

Figure 3 The VAX 9000 Instruction Pipeline

decodes a single complex specifier per cycle. Unlike Load/S tore Architecture preceding processors, the XBAR handles multiple Load/store architectures separate memory accesses specifiers in any order. Ta ble I shows the number of from computation. Loads can be scheduled to place decode cycles required for several VA X processors. arriving memory data at a functional unit just as an operation begins. To achieve this effect with VAX Op erand Prefetching instructions, memory specifiers are treated as load/ Because most simple instrucrions are decoded and store instructions. VAX memory specifiers describe executed in a single cycle by v::triouspipeline stages, the effective addresses of memory operands. VAX instruction operands also must be handled in a memory specifiers do not contain the source and single cycle. Multiple, specialized operand units destination registers that are specified in RISC load/ increase operand processing throughput. From one store instructions. Rather, the VA X 9000 system to three register operands may be fo rwarded to rhe assigns temporary register file locations to buffer E-box by one register unit per cycle. A dedicated memory data. By processing specifiers early in the short literal unit expands all VAX data fo rmats. The pipeline, data can be scheduled to arrive at the operand processing unit performs complex address appropriate time. calculations and requests memory operand data Memory specifiers act as independent instruc­ from the cache unit (M-box). Both the operand pro­ tions executed in the operand processing unit. This cessing and short literal units can perform multiple unit creates the operand's effective address and for­ cycle operations. wards it to the M-box. For loads, the actual memory

2 IYYO Digital Te ch nicaljo urnal Vo l. No. 4 Fall 17 VAX9000 Series data is returned to the E-box register file. The trans­ ber or a flag which indicates that the result should lated physical address is saved in a queue of write be written to memory. addresses for store/destination specifiers. When The instruction issue unit removes source execution results arrive from the E-box, the previ­ pointers from the source queue. These pointers are ously saved address is used to write the data into used to address either the general-purpose registers the cache. or source list for the actual source data. Destination pointers from the destination queue determine Conflict Detection and Resolution where resulls should be wrirren. Register conflicts Macropipelining in the VA X 9000 system relies on can be detected by comparing the source pointers autonomous units operating in parallel . Each inde­ needed to issue an instruction with all issued desti­ pendem unit is optimized for an individual task. nation pointers in the destination queue. For exam­ However, macropipelining does require that mech­ ple, in Figure 4, the MULL3's RO source queue entry anisms be added to resolve data dependencies would match the ADDL3 's RO destination queue among instruction processing units. Data cont1icts entry. A write to the general-purpose registers by occur when an instruction's results are required by the E-box removes the destination queue entry, and an earlier pipeline stage. An addressing data conilict the instruction issue can resume. appears in the following example: SRCQ SLIST DSTQ MOVL RO , R1

MOVB TABLE(R1 ) ,R2 Rl __. DATA RO Any dedicated aclclress calculating hardware must R2 MEM wait for the MOVL instruction results before per­ #0 I- forming the l'viOVB instruction's effective address computation. A memory conflict is another form RO of data dependency. ADDL3 R 1 ,R2,RO In the following example, MULL3 (R3),RO,(R4)

MOVB R 0 , ( R 1 ) Figure 4 Register Conflict Detection MOVB R2 ) ( , R3 a prefetch unit could read the second instruction's Address ing Conflicts To resolve addressing data source operand while the E-box writes the first contlicts, the I-box maintains a read/write register instruction's results, if the values of registers R I and scoreboard. Two register masks are created for R2 are different. However, when the registers con­ each instruction decoded. The first register mask tain identical values, the read must be delayed until denotes the general-purpose registers that the E-box the write occurs. The VA X 9000 system uses several will read for the instruction, and the second register diffe rem mechanisms detect and resolve data w mask specifies the general-purpose register writes. dependencies. Passing pointers, scoreboard masks Each bit in these register masks refers to a single within the 1-box, the write queue in the M-box, and VA X general-purpose register. Specifiers that are architectural restrictions are all used to handle vari­ being processed in the operand processing unit are ous confl icts. checked against up to six previous Conflicts The simplest hardware mecha­ masks. From the fi rst example above, the specifier [TABLE(R I)] requires that the operand processing nism employed in the VAX 9000 system is the use of pointers to reference data. The operand processing unit read R 1. If the Rl bit is asserted in any preced­ unit oversees a 16-entry source queue, an H-entry ing instruction's scoreboard write masks, this effec­ destination queue, and a 16-entry source list. A sin­ tive address calculation must be deferred. gle pointer is inserted into the source queue for The VAX architecture presents a unique address­ each source specifier. The pointer represents either ing conflict problem because some specifiers, a register number, in the case of general-purpose such as -(Rn) and (Rn)+, modify general-purpose register operands, or a tag that indicates an entry in registers. the source list where the operand data is located . A In the fo llowing example, pointer is added to the destination queue for e:.�ch SUBL2 RO ,R1 RO ). , R destination. This pointer represents a register num- ADDL2 C 2

ht/1 1'.)!)0 18 Vo l .! No . -1 Digital Te chnicaljo urnal Design Strategy fo r the VAX 9000 Sys tem

the (RO)+ specifier modifies the contents of RO. Unlike its predecessors, the VAX 9000 system com­ Therefore, the operand processing unit cannot mits all its resources to a single branch path. The update the general-purpose register without affect­ prediction hardware selects the path of execution ing the prior instruction. The read masks are used to resolve memory conflicts for those branch ro detect this type of conflict. All specifiers that instructions that are decoded before results are modify general-purpose registers must check the available. This path selection is based on prior his­ scoreboard read masks before proceeding with tory, if the branch hits in the branch cache. If the the instruction. Thus, when a confl ict occurs, the branch does not hit in the branch cache, the path general-purpose register modification stalls. is predicted staticly, based on the instruction's When an instruction completes execution, the opcode. When the branch executes, the prediction instruction's read/write mask is removed from the is compared to the actual results. The pipeline is scoreboard . In all addressing conflicts, specifier flushed back to the correct code path if the branch processing continues once the blocking mask is prediction was incorrect. removed. The entries in the branch cache store the branch results of the previous execution of the branch and Memory Conflicts The write queue is used to the target address, if the branch was taken. Because resolve memory conflicts. Physical addresses, the branch cache is a one-way associative cache that received from the translation buffe r, are inserted can store only 1024 entries, the results have an aver­ into an eight-entry FIFO. These addresses are later age hit rate of approximately 80 percent. However, paired with the proper write data from the E-box correct predictions occur 85 percent of the time and written into the M-box. To avoid prefetching from the cache, as opposed to an average hit rate of stale dat:J.,:�.II memory addresses for source memory 56 percent, when the predictions are based solely oper:�.nds are translated and compared with the on opcode. Loop branches are always predicted addresses in the write queue. When no address con­ as taken, which increases the overall correct pre­ flict occurs, the data from memory is fo rwarded diction rate to close to 89 percent. By caching to the source .list. Operand requests that conflict branch targets, the calculation may be avoided and with a pending write address are stalled until the a latency factor of one-cycle branch taken is contlict is resolved . The conflict is resolved when achieved. The branch cache can store a sufficient the appropriate write data is received. The conflict­ amount of branch context to eliminate the need ing address is then removed from the write queue. to sweep the cache. The 1-box can process instructions with up to Miscellaneous Conflicts The VA X architecture two conditional branches outstanding. Uncondi­ includes instructions with operands that either are tional branches (e.g., BSBW, BRB) are processed as not known when the instruction is decoded (e.g., ordinary instructions by simply changing the INSQlJE, MTPR), or modify large portions of mem­ instruction flow To reduce the penalty for a bad ory (e.g., MOVC 5). To avoid conflicts from these prediction, which results in a four-cycle penalty, instructions, the 1-box suspends processing mem­ operand specifiers that mod ify general-purpose ory specifiers until the instruction execution is registers are not processed under a branch predic­ completed. Self-modifying code presents another tion and cause the operand processing unit to stalL fo rm of conflict, which is solved by an REI instruc­ Also, branch instruction execution is overlapped tion that not ifies the hardware of this condition. with the previous instruction to provide the actual branch results earlier. Branch Instructions Branch instructions have a substantial influence on Compute-intensive Instructions the overall performance of a VAX processor. On Compute-intensive instructions require multiple average, a VA X processor executes 3.9 instructions, execution stage cycles. Common examples of these including the branch. before a branch starts a new instructions are multiplication, division, and float­ instruction sequence. Instructions that modify the ing point operations. All VA X implementations program represent nearly 40 percent of the employ dedicated logic for compute-intensive total instructions execmed. The VA X 9000 system instructions that occur frequently. Less frequently uses a 1024-entry branch cache and a two-tiered used instructions depend on microcode-controlled prediction schemc to increase the average code arithmetic and logical data paths. The VAX 9000 block size and reduce thc branch-takcn Latcncy. system contains four independent execution pro-

Dif!.ilai Tecbnicaljournal hill I'J'JI! Vo l. .! Nu .j 19 VAX 9000 Series

cessors. The integer, floating point. multiply, and implementations. Because memory bandwidth is divide units cxecute the VAX instruction set. The critical, the VA X 9000 system provides fe atures ro 1-box preprocesses instructions, which allows benefit thcsc instructions. instruction execution to overlap in thcst: units. In For example, the virtual instruction cache ser­ each cycle, a new instruction can be initiated in vices most instruction stream references, which the appropriate unit prior to the completion of frees the main cache to service prefetched operand previous instructions. The t1oating point and multi­ rcf<:rences. Both the virtual instruction cache and ply units are pipelined and can accept one instruc­ the main cache have 64-bir data paths, important tion each cycle. The integer unit is pipdined for for charactcr string operations and extended pre­ simple instructions. However, complex instructions cision arithmetic. The caches are fu lly pipelined must use microcode control to perform multicycle and allow one read per cycle. The main cache block operations. size is 64 bytes. exploiting spatial locality. When Pipelined instructions are issued in order and cache references do miss. data is wrapped and the proceed through the data path withou t fu rther most critical data is rewrned first. A write back, microcode control. upon completion, instruction write allocation fu rther reduces main results are retired in thc same instruction order. The memory and cache bandwidth requirements and instructions must be proccsscd in order because the reduces latency. result of one operation is often needed in a sub­ The VA X system is a architecture. sequent operation. Therefore, the pipelines must be Virtual addresses need to be translated to physical short and contain data bypasscs to make results addresses through page tables in memory. A trans­ available quickly. The multiply, float, and divide lation buffe r caches the most recently used page units' internal data paths are 64 -bits wide. To under­ tables entries. VA X systems, such as the VA X- 11/780 stand how the pipelined and overlapped operations system, process translation buffer misses in micro­ apply to the following opcration. code, which can be rime-consuming. However, the VAX 9000 system uses a memory management pro­ y(i) = y(i) + C(i) cessor to process translation buffer misses as part consider the program: of instruction preprocessing. This operation is per­ formed early in the pipeline and is faster than LOOP . MULG3 R6 ,

1 • The CALL and RETl 'RN instructions push and pop ADDG2 R� , ( R ) ADDG2 R2 ,. registers on the stack, and these instructions can be memory-bound. The VAX 9000 system contains The two MULG3/ADDG2 instruction pairs prevent both the conrrol logic and the bandwidth to process a that could occur because of data these registers at a rate of one per cycle. dependencies. The instructions fu rther reduce the loop overhead, which is already fairly small because the loop control instruction was predicted Un conventional Instntctions correctly. Instructions and source operands are Special, dedicated hardware was added to the prefetched . The multiply and add units accept the VAX 9000 system to process those VAX inst ructions instructions as they become available. The memory that did not fit into the categories listed above. The references are made as the operand processing unit additional hardware operates within the pipeline processes memory specifiers. The majority of speci­ architecture and cycle time, and the cost of adding fier processing is performed independently of the the hardware was minimal. instruction execution. In the fo llowing example,

- <------> P HL Me mory-intensiveInstructions MOVL RO , US RO Some VAX instruction classes are primarily memory the MOVL and PUSHL instructions perform identical operations that require only minor computation. operations, but the PliSHL instruction does not Typical examples of these instructions an.: char­ explicitly specify a destination address. On pn:­ acter string, decimal, and privileged operating sys­ vious VAX systems, the instruction prefetching tem. Pipelined execution offers link advantage to would stall until the current instruction execution memory-incensive instructions because the number was completed. However, the VAX 9000 modi­ of memory references is not reduced as the number fies such instructions during the decode stage by of cycles required for execution is reduced by new adding the implied specifiers. The benefits of this

20 Vo l. 2 No. 4 Fa ll I'J')O Digital Te cbnicaljournal Design Strategyfo r the VAX 9000 Sy stem

enhancement are more evident in the fo llowing Logical Integra tion instructions. The VAX 9000 connects to the scalar CPU as an additional fu nctional execution BSBW 10$ <------> MOVAL Relurn_PC ,- JMP @(SPl+ unit. Vector instructions are processed, and operands are stored, in queues, the same as are Similarly, instructions such as LOCC and CMPC3 scalar instructions. As instructions are issued, a con­ implicitly reference the ge neral-purpose registers. trol word is sent with instruction operands to the The instruction decode stage creates a read/w rite vector processor. The processor contains vector mask with these references, which allows instruc­ registers and arithmetic units. Addresses for load , tion prefetching to cont inue. store, gather, and scatter operations are also gener­ To aid handling instructions like PUSHR and ated by the vector processor. Ve ctor data is stored in CALL, the inreger execution unit conrains special the main cache, and both the scalar and vector pro­ bit mask manipulation h:trdware, which opti­ cessors have fa st, shared access to that dat::t. mizes general-purpose register saves and restores. The VA X instruction set contains variable-length, Phys ical In tegration bit-field instructions that handle non-byte data. The VAX 9000 scalar and vector processors reside These instructions can reference memory within a on a single planar board. Three mu l tichip unit slots (MB) '512 megabyte range. The field referenced is are reserved for the optional vector processor, 8 within the first hytes of the base add ress more which is field-installable. The integration of the vec­ than 9'5 percent of the time. Therefore, to allow tor processor directly with the scalar processor instruction prefctching to continue, the operand keeps critical interconnects short and reduces vec­ processing unit assumes that the field is within the tor instruction overhead. initial quadword and requests that data. If, during execution, the field destination actually resides out­ Error Ha ndling side the prefetched quadword, the correct data is Reliabi lity, availability, and integrity are critical fac­ fe tched and the pipeline is fl ushed to avoid poten­ tors in a high-performance computer system. These tial memory con tlicrs. factors are affected by the quality of the physical design (i.e. , worst-case design), effective cooling, In tegrating Ve ctor Processing redundant power supplies, and quality controls The VA X 9000 project team was instrumental in during manufacture. Still, fa ilures are possible, and inregrating vector operations and data types inro the VA X 9000 design had to dea l effectively with the VA X architecture. For many scientific applica­ errors. tions, the use of vectors improves performance in Error handling in the VA X 9000 system has two three ways: main goals:

• Ve ctor instructions specify many operations in • Minimize system service disruption from indi­ a single opcode, which eliminates instruction vidual fa ilures stream decode as a processing hottleneck. • Maximize the failure information col lected fo r • Ve cwr registers increase available local storage. use in preventive and corrective maintenance • Ve ctor registers support high peak perfor­ A large percentage of hardware fa ilures are inter­ mance through high bandwidth and short access mittent, and many solid hardware failures start as latency. intermittent. The VAX 9000 system was designed to The VA X vector architecture implements a load/ recover from these fa ilures and to use the fa ilure store architecture, which permits the hardware to data to predict (and prevent) fu ture problems. deal with large pieces of memory in a uniform To gather information effectively, VA X 9000 stor­ manner and increases the use of parallelism. age elements (i.e., latches, tlip tl ops. and RAM cells) We added the vector instructions and data types are visible to the service rrocessor unit through a to the VA X architecture in an integrated fashion . serial diagnostic bus. Most state information that Scalar and vector instructions are mixed throughout is relevant to isolate the fa iling component is avail­ the pipdines. Systems that do not incl ude vector able fo r error analysis programs that can be run at processors emulate vector instructions with soft­ a convenient time. The result of this processing is ware. a technique especially useful for program then used to isolate the failing components for - >< development. . quick repair.

. Fa /1 /'J'.IO Di�ital Te cbnicaljournal llul. .! Nu .j 21 VAX9000 Series

To access the storage elements through the visi­ By getting to an instruction boundary, the clocks bility chain, the system clocks must be disabled, can be stopped in an orderly fashion, and the state which disrupts the system operation for a period can be read out, including temporary data to be of time. The error may also have affected the exe­ used for fa ilure analysis. The machine can be reset cution of the instructions in the pipeline. Error to start processing at the instruction boundary once handling minimizes these disruptions by making the clocks are started again. them invisible ro the users almost all the time. While the clock is stopped , the CPUcannot inter­ The macroinstruction is the unit of execution in act with other subsystems or I/0 processors. To a program that is visible to the user. Between keep these fu nctions from being blocked and possi­ instructions, the program state is clearly defined bly timing out, we only stop the clock to the CPU in in terms of memory contents and register values. error, not all the clocks in the system. We also and exceptions are handled between sweep the cache of written data before the clock is instructions to save this state in an orderly fa shion. stopped, and IIO interrupts are directed to other It is important to handle errors the same way. CPUs in a symmetric multiprocessing system. Two problems arose in trying to provide the same method of error handling. First, instructions Performance Modeling go through many stages in a pipelined computer, When multiple features are added to a CPU design and several instructions will be in progress at the to individually enhance performance, some of same time. It is difficult to identify a beginning those features can interact negatively with each and end for each inMruction. Second, even when other to decrease performance. Therefore, we boundaries are established, errors can occur at any designed a performance model to help us evaluate time and the errors do nor automatically line up the performance of the design and make trade-offs with instruction boundaries. where necessary. Although instructions were not To solve this, we made the E-box the point of syn­ executed on the model , it is an accurate cycle-by­ chronization between error handling and instruc­ cycle model of the system for most instruction oper­ tion execution. In the instruction execution model, ations. Equally important, the model was written at the E-box accepts operands, then computes and a high level, which made it easy to modify and use delivers results for storage. If an error occurs that to experiment with different feawres before they directly affects one of these steps, the error is were added to the design. synchronous to the execution of that instruction. Asynchronous errors do not directly affect any of Cy cle Time these steps and are treated as interrupts, i.e., pro­ A perennial CPU design issue is the trade-off cessed after the E-box completes an instruction but between cycle time and cycles per instructions. In before it starts another instruction. a VAX system, the cycle time is often limited by the A synchronous error causes a trap to occur in RAM speed in the and cache. We mod­ the E-box when the E-box requests data from the eled a machine at 8 ns and one at 16 ns for the VAX subsystem with the error. Since such data can he 9000 system. At 8 ns, the pipelines became longer. unavailable as a result of virtual access problems, the E-box is ready to deal with exceptions at Although the peak throughput almost doubled, that time, and errors can use the same pipelined the model showed that the net performance g:1in mechanism. did not offset the risks associated with the shorter We do not differentiate between those syn­ cycle time. chronous errors that affect computation in the E-box and those that do not. Instead, if the program /-stream Sy nchronization visible state of the machine has not been modi­ The VAX architecture requires that changes to the fied, the instruction is backed up to the beginning instruction stream be synchronized with an REI and restarted . Performing this task is not a prob­ instruction. This synchronization makes it easier to lem, since the state is normally not changed until implement an instruction cache that is separate the result is stored at the end of the instruction. from the main cache. To synchronize, either all Errors occurring in early pipeline stages are easily memory writes can be watched or the J-cache can recoverab.le. In a few cases, memory and registers he cleared on every REI. The first alternative entails could have been modified early and, as a result, high hardware costs, and the second c:1n affect be affected by the error. Status flags indicate if this performance. However, the model showed us that has happened. the performance impact would be minimal if the

22 Vo l. J No. .; Fu iii'J'JO Digital Te cbnicaljournal Design Strategyfo r the VAX9000 5ystem

!-cache was refilled from the main cache rather than We also used the performance model as a verifi­ from main memory because the critical parameters cation tool. The model provided us with early were the main cache bandwidth and the !-cache warnings when a feature did not function in the invalidation time, rather than the refill latency. model, or when the cycle count differed from the count in the gate-level simulation. For example, BranchPred iction from the model, we became aware of problems in The branch prediction scheme used in the the design of how conflicts between instructions VAX 9000 system was analyzed in great detail. in specifier processing were handled. Periodically, We investigated the use of multiple history bits to we compared the performance model to the logical improve the effectiveness of branch prediction. model. Both models were subjected to the same In all cases, the use of extra bits provided less than instruction sequences. Deviations of more than a I percent improvement in system performance. ±5.0 percent were investigated. Some design bugs Furthermore, no multiple bit scheme could be were found that did not affect the results of the pro­ implemented without increasing cycle time gram but which did keep performance features because multiple history bit branch prediction from working properly. The average deviation was schemes update status each time a branch is on the order of ± 1.0 percent. encountered. Therefore, we chose to use a single­ Performance tests are among the first programs bit technique in the VAX 9000 design. Unlike multi­ run on a functional prototype. The VAX 9000 sys­ ple bit schemes that read and write history bits tem performed almost as expected. Table 2 com­ for each branch instruction encountered, the single­ pares the actual performance of a VAX 9000 system bit technique updates the history bit only when the to its predicted performance for a small sample of prediction is wrong. The single-bit scheme is both modeled programs. The accuracy of the predictions faster and simpler. highlights the increasing importance of models in the modern engineering process. Cache Pa rameters

The main data cache was accurately modeled. The Table 2 Performance Measurements VAX 9000 system uses a first-in first-out (FIFO) block of a VA X 9000 System replacement scheme. The performance model pre­ dicted that a true least recently used replacement Predicted Measured Program Name (VUPs*) (VUPs* ) policy would provide an insignificant improvement in performance over the FIFO method. Also, a true HANOI 28. 54 25.53 recently used policy requires that status be least FFT45 36.87 37.85 read and written for each cache access. In con­ GAUSS 32.72 32.57 trast, the FIFO replacement policy updates status only when a cache miss has occurred. Further, the WHETS 27.78 27. 17 update can be done in parallel with the writing of WHETD 34.48 34.89 data into the cache block. Although the 128-byte cache block provided a better cache hit, we chose • Performance measured in VAX units of performance (VUP). where the performance of the VAX·11/780 system = 1.0 VUP. the 64-byte block because it produced better system level performance. We chose two-set associativity because the model clearly indicated that performance would degrade Ve ctor Performance

with a direct-mapped scheme. The model also pre­ Ve ctor processing was modeled using graphical dicted that a four-way set associative cache would descriptions of the pipeline. The graphical descrip­ not improve performance enough to justify the tions were essentially critical path method schedul­ extra hardware, design complexity, and cycle time ing charts. This approach is reasonable because penalty. vector processing makes regular demands on sys­ The data bypass mechanism, the write queue, tem resources. In fact, the regularity of resource and the parallel translation buffer fix-up mecha­ demand patterns was a major reason that vector nisms were implemented after the performance processing techniques were developed. By using model indicated significant performance gains the pipeline schedules, we realized that data should

would he achieved from these features. he prefetched to ensure good vector performance.

DiJ:itafTe cbnicaljournaf Vo l. .!.. No , htff f')')IJ 23 VAX 9000 Series

Performance Me asurement Acknowledgments Table 5 compares the VAX 9000 scalar and vector Many people contributed to reaching the VAX 9000 processors performance to other members of the p<.:rformance goals. The authors would especially VAX family of processors. like to thank David Orbits, whose advanced devel­ opment work on high-performance VAX designs became the basis for the performance model; and Ta ble 3 Performance of the VAX 9000 Scalar and Vector Processors Bill Grundmann, Rick Hetherington, John Murray, Bill Smith, and David Webb, who comprised, VA X 9000 VA X 9000 with the authors, the original VAX 9000 architec­ VA X 8550 Scalar Vector ture team. Program System Processor Processor Name (VUPs*) (VUPs*) (VUPs*) References A3D 6.55 65.54 77.45 I. ]. Murray et al., "VAX Instructions That Illustrate DYFESM 5. 12 31 .88 40.49 the Architectural Features of the VAX 9000 C: Pt!," EMIT 5.86 41 .65 79 .86 Digital Te chnical journal, vol . 2, no. 4 (Fall CFFT2D 5.52 25.76 64. 18 1990, this issue): 25-42. BMK8A1 5.45 30.65 83.84 2. M. Adiletta et al. , "Semicomluctor Technology MXM 5.93 40.81 269.32 in a High-performance VAX System," Digital Te chnical jo urnal, vol. 2, no. 4 (Fall 1990, this • Performance measured in VAX units of performance (VUP), where the performance of the VAX-11/780 system = 1.0 VUP. issue): 43-60. 3. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical The vanattons in these performance numbers Engineering and Computer Sciences, University indicate that significant performance improve­ of California, Berkeley. ments can be achieved by using applications that 4. D. Clark, "Pipelining and Performance in the rake advantage of machine rcsourccs. The numbers VAX 8800 Processor," Architectural Support also highlight opportunities. By modifying appli­ fo r Programming Languages and Op erating cations ro capitalize on machine features, large per­ S:vstems (ACM, October 1987). formance gains may he realized. Performance gains of 100 to 200 percent are often realized and may 5. C. Wiecek, "A Case Study of VA X-II Instruction substantially extend the lives of older programs. Set Usage for Execution," Proceedings Vector applications tend to fall into three cate­ of the Sy mposium on Architectural Support gories. The first category generally does not contain fo r Programming Languages and Op erating 1982): 177- 184. much parallel content. This category is represented Sys tems (ACM, March by A5D and DYFESM in Table :� . Vectorizing such 6 . .J. Emer and D. Clark, "A Characterization of programs improves performance by a modest Processor Performance in the VAX-Il/780, " 0 to 50 percent. Programs E.\IIT and CFFT2D in Proceedings of the 11th Annual Sy mposium on Table 5 represent the second category, which are (Ann Arbor: June 1984): applications of moderate parallel content. Applica­ 301 -310. 50 150 tions in this category realize a ro percent 7. VA X Ve ctor Processing Handbook (Maynard: performance gain when vectorized. Applications Digital Equipment Corporation, Order No. in the third category, highest parallel content, EC-H04 19-46, 1989). demonstrate performance improvements of more 8. R. Brunner and D. Bhandarkar, "Vector Exten­ than 150 percent when vectorized. Programs sions to the VA X Architecture," Proceedings BMK8AI and in Table 3 arc examples of this MXM of CO!'vtPCON 'YO (San Francisco: Spring 1990). class of application. Often, modest code changes can realize dramatic performance improvements. By simply redefining array dimensions or loop specifications, an applica­ tion can move from the first category to the third category.

24 v'li l. .2 No . .f Fall 11)')0 Digital Te chnicaljo urnal john E. Murray Ricky C. Hetherington RonaldM. Salett

VAX Instructions Th at Illustrate the Architectural Features of the VAX 9000 CPU

The VA X 9000 system is Digital's largest and most powerfu l VAX system. As such, it offe rs many unique features that required the use of advanced technology and innovative architecture in the design of the system. Overall, the VAX 9000 micro­ architecture produces a high level of sy stem performance and the lou'est cycle time of any VA X processor, i.e., less than fi ve cycles per instruction. Th ree sections of the l'ltX 9000 CPU-the instruction fe tch and decode unit (!-box), the execution unit (£-box), and the data cache and main memory inte1jace unit (M-box)-are illustrated in this paper through descriptions of a small sample of VAX instructions. Th ese instructions are discussedin relation to their flow through the pipeline, how their architecturalfe atures combine to work on a single macro instruction, and how various stages of the pipeline interact.

In October 1989. Digital inrroduced its VAX 9000 preferch, hardware translation buffer fix-up unit, family of high-performance scalar, vector, and par­ write address buffer and conflict checker, multi­ :tlld processors. The VAX 9000 system is designed ported write-back cache, independent arithmetic ro be expandable from one ro four processors, with units, and separate issue and retire queues. These an optional integrated vector facility available on features are pipelined and do not interact in a each processor. The design team obtained high straightforward way. Many stages are not directly levels of performance with advanced technology linked to the subsequent stage bur feed a queue u and innovative architectural fearures. The tech­ or first-in first-out (FIFO) buffer. The subsequent nology provided a platform that has the shortest stage works on the output of the FIFO buffer. The cycle rime for any VAX processor. Most VAX proces­ pipeline is not a fixed-length and many operations sors average ten or more cycles per instruction, are done in parallel. whereas the architectural features of the VA X 9000 The architectural features do not function totally system reduce that average below five. independent of one another. In fact, the highest The VAX architecture is a complex instruction set level of performance is achieved when all the units arch itecture. VAX instructions vary in length and function in harmony. This paper highlights the number of operand specifiers. The opcode may be implementation of the macropipeline found in the one or two byres long. The number of specifiers three major subsystems of the VAX 9000. These is implied by the opcode. Each specifier 's length is subsystems are the instruction fetch and decode determined by the specifier type, and the length can unit (1-box), the execution unit (E-box), and the d::tt:1 1 vary by up to 17 bytes. Although the VAX 9000 cache and main memory inrerface (M-box). implements a large number of instructions in a The design team for the VA X 9000 system's single cycle, some instructions need to be imple­ !-box evolved a cost-effective subsystem that our­ menred in tens of cycles. In these cases, microcode performs all previous VA X systems. As shown in J.Ssiswnce is required. To increase performance, Figure 1, the !-box processes the majority of instruc­ many features were included in the VA X 9000 tions in just one cycle. lt combines a single cycle system that have not been implemented in prev i­ access virtual instruction cache with a 25-byre ous VAX systems. The system contains a virtual instruction buffer and an instruction clecocle cross instruction cache. a branch pn:diction cache, bar that can decode three specifiers per cycle. To multiple specifier evaluation units. deep instruction minimize cycle-wasting stalls. a branch prediction

DiRilal 1'ecbnicaljournal H>l. .! .Yo. ·I Fa ii i'J')Ii 25 VAX 9000 Series

unit handles transitions from one code block to first cycle, the M-box receives and prioritizes vir­ another. In addition, the operand processing unit tually (or physically) addressed memory requests. receives and processes specifiers from the decode The M-box then indexes the translation buffer to unit. The specifiers are passed either to the E-box as produce a 33-bit physical address and to perform pointers, literal data or addresses, or to the M-box protection and validity checks. The second pipe­ as virtual addresses. lined cycle involves data cache access, data align­ Figure 2 illustrates how the front end of the ment, if required , and port response. There are M-box translates addresses by using either a trans­ numerous architectural features within both seg­ lation buffer or an autonomous virtual-to-physical ments that are targeted at high bandwidth for address translation unit. Physical addresses for prefetching and storing scalar and vector operands. reads are used to access a two-way associative To illustrate the various features of the VAX 9000 write-hack cache and to fetch data from memory microarchitecture, we have selected the code i through the system control unit (SCU), if the data sequence shown in Figure 5. In the fo llowing sec­ is missing from the cache. Read data is returned to tions, we discuss each instruction as it progresses the E-box . Write addresses from the operand pro­ through the pipeline as if it were the only instruc­ cessing unit are translated and queued by the M-hox tion in the pipeline. We then sununarize by consid­ until the E-box provides the data for the write. ering the same instructions as a block of code. The E-box of the VA X 9000 CPU performs aU scalar operations. As shown in Figure ), the E-box VAX InstructionAD DL2 is a pipelioed design that incorporates a micro­ The ADDL2 instruction uses general-purpose regis­ sequencer to control functional unit operation. ter R8 as an address ro memory. The contents of Other dedicated control logic directs the flow that location are added to general-purpose register through the pipe stages. R7, and the result is written back to the same loca­ A multiported register file provides general­ tion in memory. The instruction is encoded in three purpose registers and temporarily holds memory bytes: opcode, register, and base register. data. The data is processed by one of the four arithmetic functional units. Results pass through a Cy cles One through Th ree retirement to the register file or the If we assume that the ADDL2 instruction is the first M-box data cache, as shown in Figure 4. Multiple instruction either in an routine or follow­ VA X instructions arc executed concurrently in the ing a context switch, the is gener­ E-box pipeline. The primary goal of the E-box is ated by the E-box and passed to the I-box on a 32-bit to produce a 32-bit result each cycle, which allows bus. The program counter is latched and used to the majority of the simple, but most frequent, VA X access the virtual instruction cache during cycle instructions ro be executed in one cycle. This goal one. The virtual instruction cache contains up to is achieved when four requirements are met. First, 8 kilobytes (KB) in 32-byte blocks and 8-byte lines the !-box must have conunands available for the of instruction stream data. £-box. Second, operand data, often from the M-box Bits < 12:3> of the program counter's prefetch data cache, must be available. Third, pipelined or buffer are used to access an 8-byte line from the single-cycle latency functional units are required virtual instruction cache. Bits < 12:5> are used to for single-cycle throughput. Finally, results must access a tag, a valid block. and four quadword valid be transferred from the functional units. E-box bits. The tag is compared with bits <31: 13> of the features, such as queues, data bypass paths, and program counter's prefetch buffer. If the tag and the powerful arithmetic units, help the system attain bits match, the block and the quadword within the a high-performance level. Stalls arc avoided and block are valid, and the instruction is in the virtual each instruction is executed in a minimal amount instruction cache (i.e., a hit). Bits <2:0> of the pre­ of time. fetch buffer are used to rotate the quadword fo r the The M-box of the VAX 9000 CPU is the primary opcode byte to he loaded into byte 0 of the !-buffer source of memory data. Therefore, it contains the at the encl of cycle one. Similar to the VAX 8650 virtual address translation buffer and the data system, the first hyte of the !-buffer is the operation cache. The M-box is multiported ami pipelincd with code (opcode) of the instruction." two autonomous pipeline segments. Each segment The ADDL2 is three bytes long and normally occupies one machine cycle, and the cache access fits in one line of the virtual instruction cache. If · latency is, therefore, two cycles long. During the the ADDL2 instruction crosses a line boundary, a

. 26 Vol. .! No. q Fa/1 1')')0 Digital Te chnicaljo urnal E-BOX RESULT I-BOX DATA

M-BOX IB DATA

S2 POINTER

DEST POINTER

�------�------�------� FETCH STAGE DECODE STAGE SPECIFIER STAGE KEY VIR - VIRTUAL INSTRUCTION CACHE U PC - UNWIND PROGRAM COUNTER SL - SHORT LITERAL SL D - SHORT LITERAL DECODE Sl - SOURCE 1 D PC - DECODE PROGRAM COUNTER GPR - GENERAL PURPOSE REGISTER R1 -REGISTER 1 S2 - SOURCE 2 S PC - SPECIFIER PROGRAM COUNTER GPRS - GENERAL PURPOSE REGISTERS R2 - REGISTER 2 DEST - DESTINATION BP - BRANCH PREDICTION XGPR - X GENERAL PURPOSE REGISTER R3 - REGISTER 3 IB - I-BUFFER PC - PROGRAM COUNTER YGPR - Y GENERAL PURPOSE REGISTER DISP - DISPENSER P PC - PREFETCH PROGRAM COUNTER OPU - OPERAND PROCESSING UNIT OP D - OP DECODE

Figure 1 Block Diagramof the VAX 9000 Sys tem /-box VAX9000 Series

CONTROL LOGIC

MICRO­ SEQUENCER I-BOX QUEUES V-BOX

M-BOX

REGISTER FILE

Figure 2 Front End of the VAX 9000 Sys tem M-box

MISS ::> I-BUFFER m � �

TRANSLATION 9 OPU > ��BUFFER b

::: E-BOX II r=::> TRANSLATION v BUFFER f.- � FIX-UP SEQUENCER � r:: � I--

FIX-UP f �v

Figure 3 Block Diagram of the VAX 9000 .�ystem E-box

28 Vo l. l No. 4 Fa ll /')')IJ Digital Te cbnicaljournal VA X Instructions Th at Illustrate the ArchitecturalFe atures of the VA X9000 CPU

E-BOX 64

I-BUFFER OPERAND 32 CACHE PROCESSING UNIT

OPERAND PROCESSING UNIT I-BUFFER 64

E-BOX E- BOX WRITE BUFFER

M-BOX 32

E-BOX WRITE BUFFER WRITE QUEUE

WRITE BACK MAIN MEMORY -f:64 FILL BUFFER

Figure 4 Cache Unit of the VA X 9000 System M-box

subsequent cycle is required to access the second should be routed w the register/pointer unit and line. The average VA X instruction is 3.8 bytes long. that the memory specifier should be routed to the Therefore, a virtual instruction cache hit delivers operand processing unit. about two instructions to the l-buffer.6 In parallel with the XBAR decode process dur­ Other VA X processors generally require a cycle ing cycle two, the program coumer is passed to the to decode the opcode and one or more cycles to E-box from the 1-box. The opcode is used to address decode each subsequent specifier.7.H However, the the fork random-access memories (RAMs) in the VA X 9000 CPU's instruction decode cross bar can E-box that provide a fork address to the microse­ decode the vast majority of common instructions in quencer. At the end of cycle two, the decoded bytes a single cycle. are shifted out of the !-buffer, and the subsequent If the three bytes of the ADDL2 instruction were instruction is presented to the XBAR in cycle three. loaded into the !-buffer at the end of cycle one, the The fork address from the 1-box is then used to bytes would be decoded during cycle two. The address a fork RAM in the E-box. For each opcode, decode unit (XBAR) passes data from the !-buffer to the fork RAM provides an entry address into the a short literal unit, a register/pointer unit or an control store, indicates which functional unit operand processing unit. As the opcode and speci­ should begin the execution, and specifies how fier bytes are decoded in parallel, the XBAR deter­ many source operands are needed in the first cycle. mines in less than a cycle that both specifier bytes The fork address is modified when an instruction

68 57 co 0080 22 1$: ADDL2 R7, (R8) 53 6044 00 41 0083 23 SUBF3 #0,5, (RO)[R4]. R3 59 85 9999A999 535940C2 8F 45FD 0088 24 MULG3 #2345.5, (R5)+. R9 E3 00000121' EF OD E4 0095 25 BBSC #13. BDATA. 1$

Fig ure 5 VA X Instructions That Illustrate the Major Features of the VA X 9000 System

Digital Te cbnicaljournal Vo l. 2 No. 4 Fa ll f

is restarted after it was interrupted before comple­ required. The M-box includes a feature that adds tion. Memory management faults on the instruction four addresses to E-box or operand processing unit stream also modify the fork. At the end of cycle two, addresses, if the size and alignment of the request the fork RAM data is latched in a fork queue, and the crosses a quadword boundary. Other VA X systems instruction program counter is latched in the pro­ trap on unaligned accesses using E-box cycles gram counter queue. and require using microcode to generate the incre­ The register/pointer unit accepts the register and mented address and subsequem fetch. specifier byte at the end of cycle two. During cycle In parallel to the arbitration process, virrual three, the register/pointer unit passes two source bits 31, < 17:09> index the 1024-entry translation pointers (general-purpose register R7 and memory buffer. The translation buffer is a direct-mapped, data) and the destination pointer (memory destina­ associative memory that contains the results of tion) to the E-box. The source one pointer points to the most recent 1024 translations. Bits < 30: 18> general-purpose register R7. The memory data will are compared, validated, and protection-checked be returned eventually to a 16-bit deep circular against the tag field. The physical frame number is queue, called the source list, in the E-box. The regis­ a 24-bit field that is appended to the virtual address ter/pointer unit tracks the source list pointers and bits <9:0> to create the 33-bit physical address. The allocates a source list entry to the memory data. The self-timed RAM used for the translation buffer is a source list address is passed to the E-box and the 1024 by 4 self-timed RAM with a 4.5 nanosecond operand processing unit. The destination pointer (ns) access time. simply indicates that the result of the instruction Protection checking occurs during the latter por­ goes to memory. tion of cycle four. The example we are discussing is Further, during cycle three, the operand process­ a request for a read and write check. Therefore, ing unit generates the memory address and passes both read and write access are checked. Fault indi­ it to the M-box. For the register deferred specifier cation is forwarded with the request to the data (R8), the operand processing unit accesses its local cache and subsequently, with the data, ro the E-box. copy ofR8 and passes it to the M-box, together with If the request has a valid entry in the translation the source list tag received from the register/pointer buffer and no protection violations exist (i.e., trans­ unit and a control function that indicates the mem­ lation buffer hit), a data cache access is required in ory location is to be read and then written. cycle five. The fork queue is a cyclical, eight-entry FIFO The two source pointers and the destination buffer that is flushed for interrupts, exceptions, pointer from the 1-box are latched in the source and or incorrect branch predictions. For the ADDL2 destination queues, respectively, at the start of cycle RAM instruction, the queue passes part of the fork four. The source queue holds 16 entries and can data to the , which is idle and receive 2 entries per cycle. The destination queue awaiting a valid fork, early in the third cycle. Fork holds eight entries. Both queues are circular FIFO RAM data is used to generate the appropriate con­ queues that can be fl ushed with the fork queue. The trol store address for all control store RAMs. The two source pointers are also latched in the source RAM remaining fork data is passed to the issue operand logic at the start of cycle four. The source control by the end of the third cycle. operand logic determines which two source pointers to use each cycle. The pointers can come Cy cle Fo ur from the source queue, the 1-box, the microword , At the start of cycle four, the M-box receives a com­ the register log, and several special functions. In this mand from the operand processing unit to perform example, the two pointers are selected directly a read with a write-check. The M-box must read a from the latched I-box pointers because using the longword from memory, send the longword to the source queue would have required an extra cycle. E-box, and check for write access. The command is The selected pointers address the register file accompanied by a 32-bit virtual address, a tag field, and are passed to the issue logic early in the fourth context (size of operand), and the request signal. cycle. The register file contains the 15 general­ Arbitration for access to the translation buffer purpose registers, RO through Rl4. These registers occurs every cycle. If the operand processing unit can be written by either the £-box or the !-box for wins arbitration, the command is decoded and the autoincremem or autodecremem specifiers. The comext is checked against the starting address to first pointer accesses general-purpose register R7. determine if additional virtual addresses are The contents of general-purpose register R7 are

30 Vo l. 2 No. 4 Fa ll /

passed to the data distribution logic by the end of A write check inserts the physical address and cycle fou r. The second pointer accesses one of the a number of status bits into the write queue. The 16 locations in the source lis1. The source list is a status bits are termed valid, fa ult, twice (asserted for queue fo r source operand data that is written by the the final entry when writing unaligned operands !-box, with immediate or short literal data, or by the that require more than one address), last, PCD M-hox, with memory data. The pointer is used to (page crossing danger alert), and blocked. The write access the appropriate source list register, and the check illustrated in this ex;tmple has a valid and data is passed to the data distribution logic. last flagasserted. The issue control uses the fork RAM data and the The cache tag is accessed in the middle of cycle source pointers to determine if the instruction five. The cache tag srore is two-way set associa­ can he executed. The issue control checks that the tive, with 1024 entries per set . Each entry repre­ target fu nctional unit is ready and that all the sents a 64-hyre block of memory data. The tag store required source operands are available. In this also contains 16 valid bits (i .e., one per longword) example, the integer unit is ready, the first operand and a written bit because memory update requires a (i.e., the general-purpose register R7) is available, write back. The cache rag is used to determine if the bur the second operand (i.e. , memory data) is not. requested data resides in the data cache and, if not , Because normal issue cannot occur without the whether the cache data held there needs to be writ­ second operand, we created a special issue control ten hack to memory. to handle this case. When issue is prevented only If a cache hit occurs, the data, with an asserted by the lack of a single memory operand , the instruc­ response line and tag, is sent to the £-box. The tag tion is "issued with bypass." To save an operational field tells the E-box where to place the field in the cycle, when the M-box delivers the missing source list . In cycle five, the issue logic asserts a operand, that operand bypasses the source list and microword hold signal throughout the E-box. passes immediately to the waiting functional unit. Because execution did not occur in cycle five The issue control signals the fork and source queues and the latched microword was not used, the that entries were used and can now be removed. microword latches must be held until execution can occur. Cy cle Five Cycle five begins with the cache clara and rag self­ timed RAMs latching the physical address. The Cy cles Six through Te n priority request was selected by the cache control At the start of cycle six, the M-box data is latched in the latter portion of cycle fo ur. The default prior­ to the data distribution logic. The data is immedi­ ity request selection is the write queue. However, ately passed to the integer unit, where the add oper­ if the default is used and a translation buffer hit ation is performed . The results of that operation are occurs, the current address from the translation sent to the retire logic by the end of cycle six. The buffe r is used. The first stage of the M-box, or the issue logic deasserts the microword hold signal to translation buffer stage, is referred to as the front allow subsequent microwords to become latched. end . II provides the cache with a 4-bit cycle-identifi­ The issue logic also makes an entry in the eight­ cation field that identifies the command type and location result queue. The result queue is used to port . In addition, a context field provides the cache maintain write ordering when multiple fu nctional with the data size. The request for the second speci­ units are operating in parallel and also acts as a fier of the ADDL2 is a cache read and a write check. scoreboard for register conflicts. A general-purpose The write queue is a key fe ature of the M-hox and register is not a valid source operand if the register is the VA X 9000 system . The write queue is an H-entry in the result queue waiting to be updated by a func­ FIFO buffe r that holds pretranslated operand mem­ tional unit. When the functional unit specified by ory destination addresses and allows the operand the top entry of the result queue completes the prou:ssing unit to continue prefetching operands operation, the results are retired and the queue after memory destination operands. The write entry is discarded. The integer unit always com­ queue is designed to be a content-addressable pletes in a single cycle. Therefore, the ADDL2 memory that checks fo r memory conflicts as sub­ instruction is discarded fro m the result queue by sequent memory source operands are accessed. the end of cycle six. Entries in the write queue arc discarded when the In cycle seven. the retire multiplexer selects the E-hox completes execution and a successful cache integer unit result data ami sends it to the :vi-box to write occurs. be written. A request signal for an op write also is

Digital Te chnicaljo urnal 11•1. .! :vo. -i Fa ll /

sent fo r the M-hox to iniri:Hc the write. The :VI -box assL"rted in this cycle as a result of the tag store indicates that the address translation was successful lookup cycle in cycle nine. The tag store is upd:J.tcd to ensure that a memory management fa ult does not only if a valid bit had to he asserted . However, a occur. A signal that the instruction is done removes partially valid and written cache block that may the instruction program counter from the program require SL"tting the appropriate valid hit can be counter queue and indicates successful address written. Each line of the cache is protected with translation completion to the 1-box. byte parity. However, because it is a write-b:J.ck At the start of cycle eight, the E-box sends the cache, the reliability of the machine is significantly results of the ADDL2 instruction to the M-box to he enhanced using an error-correcting code. The written into the data cache. 'fhe write data interface check-bit pattern for the error-correcting co<.k from the E-hox to the M-hox is :)2 bits wide. The storage is generated and stored separately. M-box has two 32-bit data buffe rs that receive the write data and hold the data until the tag status is VAX Instruction SUBF3 appropriate fo r the write to occur. The E-box sig­ The Sl!BF3 instruction subtracts one half of the nals the M-box to perform the write. This signal F _tloat fo rmat number addressed by general-pur­ docs not affect the front end of the M-box because pose register RH and indexed by general-purpose address translation has already occurred. The signal register R9. The resultant number is placed into is a request to the cache control and arbitration for general-purpose register R3. The instruction is an op write. The top entry of the write queue holds encoded in five bytes: opcode, short liter:J.l, index the status and complete physical address for the register, base register, and destination register. write destination of the ADDL2 instruction. Because this is a modify operand, write access is checked Cy cle One and reported to the E-box when the operand is read As with all VA X instructions, the first cycle of the from memory. SliBF3 may be either a virtual instruction cache A two-cycle cache write starts in cycle nine. The access or a simple shift to the low-order bytes of the first cycle is the lookup cycle. The cache control !-buffer. In a few cases, the instruction not in the selects the write queue address to address the cache. Consider an example where the SUBF3 cache. Before the write can occur, the cache block instruction is preceded by a long instruction that must be dedicated to this particular physical is gradu:J.llydecoded for sn-cral cycles. As a result, address. the SliBF3 instructions cross a virtual instruction 'fhe cache rag store and data cache are read in cache block boundary. The virtual instruction cache parallel. If the operand is unaligned or is less than is accessed for instruction stream data every cycle, a longword , the cache data line read during the but the address is incremented only when the data lookup cycle is captured and merged in the next is latched in one of three instruction stream buffers, cycle with the data from the E-box. To ensure data !-buffer (9 bytes), I-bex (H byres), or l-hex2 (8 bytes). consistency, data is allowed to exist in the cache in If only one byte or if no valid bytes exist in the one of two states, read only or written. The SCL' !-buffer, the virtual instruction cache data is loaded controls the data in up to four individual C:Pl 1 directly into the !-buffe r. If the I-bex is empty and caches. Read-only data may be valid in multiple the !-buffe r contains between two and eight valid caches, but written data may only exist in one byres, the virtual instruction cache dat:J. is merged cache. Thus, before an M-box can change data, it with the the !-buffer d:J.ta and the remaining bytes must askpermission from the SCll. are loaded into the I-bex. In this example, the operand is longword­ If there is a virtual instruction cache miss (i.e. , the aligncd , and the write occurs in cycle ten if the data is not in the virtual instruction cache) and there block in the cache tag store has a valid and written are no va lid bytes in the I-bex, the 1-hox passes a status. The write also occurs for aligned longwords request to the M-hox for data. Generally, the request and quadwords in cycle ten if the cache block is will be fo r a virtual instruction cache block, i.e. , completely invalid. If this cache block has a read 32 byres, because the 1-box decodes instructions status (i.e., valid and not written), :1 command is sequentially across a block boundary. However, a sent to the SCLI to request permission to write. The branch or interrupt nuy direct the decode into the write is del:J.yed until the SClJ responds with middle of :1 block. In such a case, the 1-box requests approval to write. the remainder of the t:J.rget instruction stream in Cycle ten is the cache data write cycle. Write that block. The first cycle of the SliBF :) accesses the enables to the self-timed I{ AMs in the data cache :1rc virtual instruction cache only ro find what data is

Vo l. .! No. '-1 Fa ll f

not in the cache. The address and the request then The operand processing unit passes the tag, are passed to the M-box. with the address for the indexed memory specifier In the next cycle, the result of the virtual instruc­ request, from the register/pointer unit to the tion cache tag match is signaled to the M-box M-box. The address is generated by the in through an !-box abort signal. If a virtual instruc­ the operand processing unit. In parallel with the tion cache miss has occurred, the block in the operand processing unit and register/pointer unit, virtual instruction cache is cleared and the !-box the short literal expansion unit takes the 6-bit field awaits the data from the M-box. The M-box passes and expands it to a 32-bit F _floating number. eight bytes of data to the 1-box. The third cycle During cycle four, the short literal is written after a virtual instruction cache miss is found in the through the 1-box data bus to the relevant entry 1-box when hits occur in both the translation buffer in source list. Issue control can issue with bypass and the cache. The data sent to the !-box is written because only the memory data for operand two is and read through the virtual instruction cache self­ missing. timed RAMs to the !-buffe r or the I-bex. The virtual The E-box stalls until the memory data arrives. instruction cache's tag, valid block, and the relevant Because the 1-box and the M-box generally are func­ valid quadword are written next. If the !-buffer is tioning ahead of the E-box, memory stalls are short empty, the eight bytes of SUBF3 instruction are or nonexistent. In this example, the memory data written into the !-buffer. In subsequent cycles, the arrives at the end of cycle fi ve, as was the case with fo llow ing instruction stream quadwords are writ­ the ADDL2 instruction. ten into the virtual instruction cache, with a new In cycle fo ur, the M-box operates for the SUBF3 quadword valid bit, until the end of the block is instruction in a similar manner to its cycle four reached. The data is also available to the 1-buffe r. activity for the ADDL2 instruction. At the start of Cycle one of the SUHF3 instruction cou ld be the cycle, a command, address, context, and tag fetching the instruction stream from the virtual field are sent from the operand processing unit to instruction cache, as described for the ADDL2 the M-box. The command is a simple operand read. instruction, or it could be already in the !-buffer Arbitration occurs early in the cycle. The trans­ (e .g., bytes <8:3>) fo llowing the ADDL2 instruction lation buffer is then accessed, and the physical (i.e., bytes <2:0>). In the latter case, the SUBF3 address is sent to the cache. instruction would be shifted into the lower bytes Cycle five begins when the data cache receives as the ADDL2 instruction is shifted out. the physical address for the operand processing unit to read . The tag store lookup and address Cy cles Tw o through Eigh t matching are performed simultaneously with the In cycle two, the SUHF3 instruction is completely data read , and the data is available to the E-box at decoded and shifted out of the !-buffer. As a result, the end of the cycle. If the operand read results in a the fo llowing actions occur: cache miss, the M-box must assemble a command and an address, which are sent to the SCU to enable • The fork address is passed to the E-box. the SCU to access a 64-byte block of memory data. • The short literal is passed to the short literal In addition, the data cache tells the scu which set expansion unit. the cache will replace with the new cache block. Jf

• The base and index registers arc passed to the the current cache block contains valid and written operand processing unit. data, the block must bewri tten back to main mem­ ory before the new cacheblock arrives. • The destination general-purpose register R3 The sends a command and an address back and the t\vo sources are passed to the register/ scu to the M-box when the memory data is ready. The pointer unit. send takes approximately 26 cycles and is fo llowed , During cycle three, the register/pointer unit allo­ within a short period of time, by eight cycles of data cates the next available entry in the source list ro the transfer. Each cycle is 8 bytes long. The requested short literal and the subsequent entry in the indexed quadword is returned first to respond to the memory reference. The E-hox is informed of these requesting port during the first cycle of the cache allocations as pointers to the relevant entries are refill. On the eighth cycle of cache refill, the tag passed to the pointer queues in the source one and store is updated. source two pointers. The register/pointer unit also The floating point fu nctional unit is started in passes the destination register to the destination cycle six, as specified by the fo rk RAM data. Both queue in the E-hox. source operands are delivered, and the microword

h71/ /'J'}IJ Digital Te cbuicaljo m·nal Vt>l. 1 No. 4 VAX9000 Series

indicates a SUBF operation. The floating point unit in bytes <8:5>. The fo ur remaining bytes of the requires two cycles to perform the SUBF operation . immediate specifier could be valid in the I-bex and Unpacking and alignment occur in the first cycle. the rest of the instruction could be contained in the The floating point unit signals the issue control that I-bex2. At the end of cycle one, the fi rst four bytes the result wiJJ be available at the end of the follow­ are shifted to the low four byres of the 1-buffer. The ing cycle. The issue control enters the general­ next fo ur bytes are merged from the I-bex to the purpose register R3 destination but must wait high fo ur bytes of the !-buffer. The I-bex is now another cycle before beginning retirement. If the empty, and the bytes in the I-bex2 can be loaded next instruction requires that the fl oating point unit into the I-bex. and the operands be available, the instruction Because the MULG 3 instruction has a 2-byte-long would be issued in this cycle because the floating opcode, the only decoding necessary in cycle two is point unit is fu lly pipelined. to note the 2-byte length and shift our the ft rst byte The second execution cycle occurs in cycle so as tO align the specifiers to be the same as a single seven. The floating point unit adds, normalizes, byte opcode instruction. The specifiers are then in rounds, and packs. The result is latched in the fl oat­ bytes of the !-buffer. As the first opcode byte ing point unit at the end of the cycle, and the issue (in this case, #FD) is shifted out, the next valid byte control discards the top entry from the result queue in the I-bex is merged into byte 9 of the 1-buffer, to retire the data. which leaves seven valid bytes in the I-bex. In cycle eight, the retire multiplexer selects the Decoding really begins in cycle three. The fork's floating point unit result data and sends that data to address is sent to the E-box, and bit <8> is set to the data distribution logic. The data distribution indicate a 2-byte-long opcode. The ft rst five bytes logic holds the result, which will be written into of the immediate specifier are passed to the general-purpose register R3 in the register file dur­ operand processing unit. The fi rst byte also is ing the next cycle. The write is purposely delayed passed to the register/pointer unit fo r source list to permit it to be aborted if an arithmetic fault allocation. The five bytes shifted out of the !-buffer occurs. By holding the result in the data distribution are replenished from the I-bex, which leaves two logic, result bypassing into the data path can act as va lid byres in the I-bex . a source operand. The result is written into the reg­ In cycle four, the register/pointer unit allocates ister fi le at the beginning of cycle nine. the two entries in the source list fo r the immediate VAX InstructionMU LG3 G_floating number by passing a source one pointer to rhe E-box and the tag to the operand processing The MULG3 instruction takes the G_format floating unit. The operand processing unit passes the first number, addresscd by general-purpose register R5, longword of the immediate G_floating number to from the instruction stream, multiplies it by the the unit's output buffer. immediate constant 2345.675, which is also a The next four bytes of the immediate are passed G_format number, and puts the result in general­ from the !-buffer to the operand processing unir. purpose registers R9 and R 10. General-purpose The remaining two valid bytes from the I-bex are register R5 also is incremented by eight as a side merged into the !-buffe r. The I-bex is then loaded effect of the specifier evaluation. The opcode is with eight bytes from the virtual instruction cache. 2 bytes long, the constant is a nine-byte immediate specifier, and rhe autoincrement and register speci­ In cycle five, the autoincrement and register fiers are each a single byte. Thus, the instruction is specifiers are decoded and the remaining bytes of encoded in 13 bytes. the instruction are shifted ou t. Five bytes from the I-bex are merged with the four valid byres in the Cy cles Onethrough Five 1-buffer. The autoincrement general-purpose regis­ As in cycle one of the SUBF3 instruction, the MULG 3 ter R5 is passed to the operand processing unit instruction can either be a virtual instruction cache and the register/pointer unit, which also receives access cycle or part of the instruction already can be general-purpose register R9. The first longword of in the !-buffer and shifted to the least significant the immediate specifier is passed from the operand byte as the previous instruction is shifted out. For processing unit output buffer, through the 1-box, to example, if the previous instruction is the SUBF3 the source list entry allocated by the register/ #0. 5 (RO)[R4] R3 in bytes <4:0> of the !-buffer, the pointer unit. The second longword is passed to the fi rst four bytes of the MULG3 instruction could be operand processing unit output buffe r.

34 Vo l. 2 No. 4 Fa ll /')')() Digital Te cbn icaljounzal VAX Instructions That Illustrate the Architectural Features of the VAX 9000 CPU

The first microword is accessed and distributed (the fix-up unit processes double misses), a memory throughout the E-box. The microsequencer uses management fa ult, or a cache response. The cache the fast fields of the microword to generate the final response, which is the event most likely to occur, control store address for this instruction. The signals the state machine to return to idle and pre­ microinstruction is not issued because it requires pare for the next miss. Hardware control external two source operands and the second source pointer to the ftx -up unit writes the entry into the trans­ is not yet available. lation buffer, and the original request is retried. This time there is a translation buffer hit, and the Cy cle Six physical address is sent to the cache. Single misses In cycle six, the register/pointer unit allocates two in the translation buffer require seven cycles to pro­ source list entries for the autoincrement specifier, cess. A double miss requires 13 cycles, assuming passes this information to the E-box in the source data cache hits occur. one pointer, and passes a tag to the operand pro­ The issue control asserts the microword hold cessing unit. The general-purpose register R9 is signal to force the microword latches to hold the passed to the E-box as the destination pointer. first microword until it can be executed. The micro­ The operand processing unit accesses general­ sequencer regenerates the control store address of purpose register R5 and passes it, with a tag and a the second microword each cycle until the execu­ quadword read request, as an address to the M-box. tion stall ends. In parallel, the operand processing unit writes general-purpose register R5, incremented by 8-byte Cy cles Seven through Th irteen lengths in the unit's output buffer. The second long­ Cycle seven is the data cache read cycle for the word of the immediate specifier is written to the quadword operand processing unit request that source list at the relevant entry. was translated in the previous cycle. The VAX 9000 The operand processing unit sends the M-box a system has a 128KB data cache, with a block size of read request quadword for the double-precision 64 bytes and access width of 8 bytes. The 64-bit floating point operand . If the address is on a quad­ access width matches the 64-bit data path to the word boundary, the front end of the M-box will not E-box, which was constructed to provide high produce any additional virtual addresses because bandwidth for double-precision operand transfers. the operand will not cross a page boundary or a When a cache hit results for the read of an aligned cache line boundary. If there is a miss in the trans­ quadword, both the normal response line and the lation buffer for this reference, all other arbitration quadword response signal are asserted to alert the stops and control are given to the state machine of E-box that the M-box is sending a quadword of data. the translation buffer fL"X-upun it. In cycle seven, general-purpose register R5 of Bits < 31 :09> of the request are captured by the both the E-box and !-box is written with the incre­ translation buffer's fix-up unit in parallel with the mented value. In addition, both source pointers translation buffer RAM's access to achieve an early and the first source operand are available to the start on miss processing. The fork to the state issue control. Because only the second operand is machine is sensitive to bits < 31 :30> of the virtual missing, the microinstruction can be issued with address. Therefore, when a translation buffer miss bypass awaiting memory data. occurs, a constrained control word flow begins The quadword operand is available to the M-box based on the values of bits <31 :30>. Because this is at the end of cycle eight. The low longword is a user mode, the value is zero. Therefore, on the latched in the data distribution logic of the E-box, first cycle following the translation buffer miss, the and the high longword is held in the M-box. virtual page number is compared against the PO In cycle nine, the quadword operand is written length register, POLR. On the next machine cycle, into the register file at the two source list locations the POBR (i.e., base register) is added to the virtual allocated by the operand processing unit. However, page number ro create the system virtual address of the low longword is available as a source immedi­ the process page table entry. The fix-up unit acts the ately. The low longword of the short literal operand same as any other port into the translation buffer, and the low longword of the memory operand are and makes a virtual read request with an aligned passed to the multiply functional unit at the start of longword context. The state machine is controlled cycle nine. The multiply unit performs the first by a microword that branches to itself until one of cycle of execution, which includes· unpacking and three events occurs: a miss in the translation buffer multiplying the most significant bits of the two

Digital Te cbnlcal]ournal Vo l. 2 No. 4 Fa ll 1990 35 VAX9000 Series operands. Issue comrol drops the microword hold opcode, one short literal position, five for the base signal to allow the second microword to be latched. address (a 4-byte displacement off the program An entry, which specifies general-purpose n:gister counter), and one displacement. R9 as the destination for the low longword of the result, is made to the result queue. The second Cy cles One and Tw o microword is issued because the multiplier requires Cycle one for the BBSC can be fetching the instruc­ the next half of each source operand and both are tion stream from the virtual instruction cache, as available from the register file. described fo r cycle one of the ADDL2 instruction, or The microsequencer then attempts to generate a it already can be in the 1-buffe r (e.g., bytes <8:3>) new control store address from the next entry in and the I-bex (i.e., bytes <7 6>) following the the fork queue. If no new fo rks are available, the MULG 3 (i .e. , bytes < 2:0> ). In the latter case, the microsequencer remains idle. BBSC instruction is shifted into the lower bytes as In the tenth cycle. the multiply unit receives the the MULG 3 instruction is shifted out. high longword of both source operands. The sec­ The decode of the BBSC begins with passing the ond execution cycle is performed, which includes short literal, number 13, to the short literal expan­ unpacking and three simultaneous multiplications sion unit and the program counter/relative base of the appropriate combinations of the most and address to the operand processing unit. Informa­ least significant bits of the two operands. The multi­ tion on both specifiers is passed to the register/ plier signals the issue control that the result will be pointer unit. In this cycle, the fo rk address is also available in the following cycle. The issue control passed to the E-box. The fork address is modified makes an entry, which specifies general-purpose for field instructions if the base is a register. There­ register R 10 as the destination for the high long­ fore, passing the fork address is delayed until the word of the result, in the result queue. The multiply base specifier is decoded . In this example, the base functional unit is fully pipelined and could be issued is decoded in the cycle after the opcode is received. in this cycle to start subsequent operations. If the base is a register, the field instruction takes a Cycle eleven is the third and final execution cycle. different microcode flow. The multiplier accumulates the four products it During cycle two, the decoder passes the pro­ produced in the two previous cycles, rounds, and gram counter decoder for the program count of packs the final double-precision result. The issue the instruction to be decoded to the operand pro­ control discards the top entry from the resuh queue cessing unit. The program counter is passed to the to retire the low longword of the result. operand processing unit and the E-box in the first In cycle twelve, the retire multiplexer selects the decode cycle. Whenever a specifier is passed to the multiply unit result data and sends it to the data dis­ operand processing unit, the XI3AR also sends a tribution logic. The issue control discards another specifier offset delta. When the delta is added to the entry from the result queue to retire the high long­ program counter's decoder, the address of the last word of the result. The low longword of the result is byte of the specifier plus one is produced. written into the register file's general-purpose regis­ As the short literal and program counter/relative ter R9 in cycle thirteen . The high longword of the specifiers are decoded, they are discarded from the result is written into general-purpose register R 10 in !-buffer. The BBSC displacement is shifted to the the next cycle as the instruction is completed . first byte of the !-buffe r. The data arriving from the cache is merged into bytes <8:2>, and the other VAX InstructionBBSC byte is placed in the I-bex. The BBSC instruction tests a bit in memory, The branch prediction unit begins operating branches if the bit is set, and clears the bit. The during the first decode cycle. A prediction for the BOATA is the base address in memory with the branch must accompany the fork address sent to number 13 position-bit offset. The majority of VAX the E-box. The prediction is made by using the field instructions have a position offset of less than program counter to access a branch prediction 64 bits. Therefore, the VAX 9000 system's J-box cache and determine how the branch behaved the prefetches the quadword addressed by the base. last time it was decoded (i.e., one history bit). If As with all conditional branches, the result of the the branch is in the cache, the prediction is that test is predicted and the VAX 9000 system's J-box the branch will behave the same as the last time. If continues to fetch instructions along the rredicted the branch is not in the cache, a prediction is made path. The BBSC is encoded in eight bytes: one based on the normal behavior of this conditional

36 Vol. .2 No. 4 1-Ctff 1')')0 Digital Te chnicafjournaf VAX Instructions TbatIllustrate the ArchitecturalFe atures of the VAX 9000 CPU

branch. For example, a BEQL (58 percent) and a indicates the correctness of the prediction as soon BBSC (73 percent) normally do not branch, whereas as possible. For simple branches, the E-box could a BNEQ (62 percent) normally branches. If the BBSC indicate that the prediction is incorrect before the instruction is in the cache and branched last time, branch is fully decoded. this information is indicated to the E-box, with the If a hit occurs but the history bit shows that the I-box prediction given as true. branch was taken last time, the branch prediction unit latches this state information and stops the Cy cle Th ree decoding of the subsequent instruction stream by In this cycle, the register/pointer unit allocates one clearing the !-buffer and the I-bex. The program entry in the source list for the position specifier and counter of the subsequent instruction is stored in three entries for the base specifier. The unit then the program counter's unwind buffer. The program passes the source one, source two, and destination counter's target address, which is received from the pointers to the E-box. branch prediction unit cache, is passed to the pro­ In the operand processing unit, the address of the gram counter's prefetch buffer. The target address last byte of the specitler plus one is ftr st calculated that is later provided by the operand processing using the program counter of the instruction and unit may be discarded. The branch displacement the delta provided by the XBAR. The displacement and instruction length from the branch prediction from the instruction is then added to this calcula­ cache are latched. For the following discussion on tion. The result is latched in the operand processing the remaining cycles in the BBSC instruction, we unit's outpur buffer and passed to the M-box. The have assumed that the BBSC instruction is a branch operand processing unit also passes a quadword, prediction hit and that the branch was taken the last field modify function, and the source list tag. time decoding occurred. The short literal expansion unit extends the size of the position specitler to a longword and latches it Cy cle Fo ur in the unit's output buffer. In this example, the In cycle four, both the operand processing and extension is done with zeros. The XBAR passes the short literal expansion units contain data to be branch displacement byte and an updated value of passed to the source list. The operand processing the program counter's delta to the operand process­ unit normally has the higher priority of the two. ing unit. The delta of the program counter and the Therefore, the short literal expansion unit will stall. branch displacement are also sent to the branch The operand processing unit passes the base prediction unit as instruction lengths. The BBSC address to the source list through the 1-box. In the instruction is completely decoded, and the opcode operand processing unit, the new delta of the pro­ and displacement are discarded from the !-buffer. gram counter is added to the program counter, the The branch prediction unit does most of its work sign of the branch's displacement is extended from during the last decode cycle of a branch. For the a byte to 32 bits, and the two are added to produce majority of conditional branches, the last decode the new target address. The result is latched in the cycle is also the first. operand processing unit output buffer. The branch prediction cache contains 1024 The virtual instruction cache is accessed for the entries. Each entry has a history bit, a 32-bit target target instruction. If the instruction is in the vir­ program counter, a 6-bit instruction length, and a tual instruction cacbe, it is passed to the !-buffe r. 16-bit branch displacement and its tag. The entries However, there is a gap in the pipeline because no are addressed by bits 9 through 0 of the program instruction can be decoded this cycle. counter's decoder. If the tag matches bits <31 : 10> The displacement and instruction length from of the program counter's decoder, the entry is the branch cache are compared with the actual dis­ assumed to be the entry, or a hit, for this branch. placement and instruction length. Normally, these If a hit occurs and the history bit shows that the lengths match. However, if they are different, the branch was not taken last time, the branch predic­ target address from the branch prediction unit tion unit latches this state information and allows cache is probably incorrect. The fetching and the subsequent instruction stream to be decoded. decoding of instructions must wait until the The operand processing unit produces the target operand processing unit provides the correct address as soon as it is not busy. The target address address. must be stored in the program counter's unwind At the start of cycle four, the M-box receives buffer in case the prediction is incorrect. The E-box a request from the operand processing unit. This

Digital Te cbt�icaljounUII VtJ/. .! Nu. 4 Full 1'}'}0 37 VAX9000 Series

request differs from all requests previously tion or for subsequenr branches to be decoded . The described in that it contains a command that gets unit predicts a maximum of three branches before it special treatment in the M-box. The command is stalls decoding to resolve the first branch. an "opu read with write check no block." As the address xxxx..xxx5 is accessing the trans­ The command is used because the VAX 9000 CPU lation buffer, the fi nal address is produced by contains an optimization that enhances the perfor­ adding 4, which makes a translation buffe r request

mance of bit field instructions. With this command, (i.e., addr = xxxxxxx9) through the sequencer port the op<:rand processing unit prefetches a quadword in cycle six. The three translation buffe r accesses of data, starting from the address pointed to by the are contiguous and interruptible. Data alignment is base, without looking at the value of the position performed by the M-box, but the alignment is con­ operand . Hopefully, the majority of bit fi elds are strained to longwords. When an unaligned quad­ within 64 bits of the base. The special command word is detected, the front end of the M-box alters tells the M-box that if a fa ult should occur, it should the context field that it passes to the data cache pass the fa ult, with an operand, to the E-box and unit. The quadword request is effectively broken not close down the operand processing unit port or into two unaligned longwords, which are properly put a lock on the fa ult parameters. The command is rotated into the low longword of the quadword an unaligned quadword operand and, as such, interface and sent to the E-box independently. requires that the M-box produce additional virtual Cycle five is the data cache read cycle for the fi rst addresses to correctly access the cache. A quad­ unaligned longword. Because the starting address is word is unaligned when bits <2:0> are nonzero. x:xxxx:xxl, the entire longword is contained in the For this example, we have assumed that the starting cache line. Therefore, one additional rotation cycle address is x:x .."Lx xxxl. is all that is required before the data is sent to the Specialized hardware in the front end of the E-box. The M-box pipe is effectively lengthened by M-box detects if the starting address requires a cycle when it is performing unaligned operations. sequencing (i.e., the addition of a constant of 4 to Because cycle five is a data cache read cycle, no the current address) and how many sequenced response is issued to the E-box. In addition to the addresses are necessary. In this case, three addresses data cache read, the physical address is placed in are required. The first is the starting address (i.e., the write queue. A memory write is required after addr = xxxxxxx l ), which is received from the the bit is tested. A status bit for a new quadword is operand processing unit. As the starting address is set in the write queue. The new quadword indicates accessing the translation buffer, a constant of 4 is that this is the starting address of an operand and added and the sequence port requests a virtual writes should not rake place until an entry appears address (i.e., addr = xxxxxxx5) from the translation in the write queue with a last bit assertion. buffer at the start of cycle five. Because the first operand is written into the The issue control uses the fo rk RAM data to deter­ source Jist, the operand is available ro the integer mine that the integer unit and two source operands unit at the start of cycle six. The microword hold are required. Because only the first operand is miss­ signal is asserted to hold the fi rst microword during ing from the source list, the instruction is issued the stall. The microsequencer regenerates the con­ with bypass. The microsequencer generates the sec­ trol store address of the second microword. ond control store address based on the fast access fields of the first microword. Cy cles Six through Nine In cycle six, the data cache is read again with

Cy cle Five address xxxxxx:x 5, which is the same cache line Decoding the target instruction stream begins in read in cycle five. However, because the context is cycle five. The operand processing unit sends the a longword, one additional byte of data must be target address to the branch prediction unit through read from the cache to satisfy the request. Also, in the program counter's target address. However, as cycle six, rotation of the data read in cycle five is noted earlier, the target address sent is discarded. completed, and the M-box responds to the E-box. Because the operand processing unit does not use Finally, address xxxxxxx5 is placed in the write the 1-box data register, the short literal expansion queue. unit can pass the short literal to the source Jist. By using source pointers from the source queue, The branch prediction unit now waits either for the position and base address operands are selected the E-box to indicate the correctness of the predic- by the fork RAM and passed to the integer unit. If

38 Vo l. 2 No. -4 Fa ll /'J

the base address operand page fa ults, the base speci­ must ignore the prefetched quadword and initiate fier was an indirect specifier. The M-box returns a byte-read directly to the M-box for the appro­ the data to the E-box as fa ulty, rather than returning priate byte. The third branch checks whether the the indirect address to the !-box. The return ro the position is greater than 31 bits. This check is used to E-box results in a memory management page fa ult. determine which prefetched longword to use when Both operands are saved in registers within the the position is in the 64-bit range. In this example, integer unit. Also, the position is divided by eight the bit position of 13 means that the bit is in the low by shifting, and is saved . The source pointer used to longword and no page fa ult is assumed. get the base address as source two is incremented Issue control determines that only the source two and used to select the next source list entry, which operand, which is the high longword, is missing. is the low longword of the prefetched quadword The fourth microworcl is issued with bypass. and field. Issue control determines that only the source the microsequencer generates the control store two operand is missing and issues the second address of the fifth microworcl. microword with bypass. The microword hold sig­ In cycle nine, the M-box delivers the high long­ nal is deasserted and the microsequencer generates word and execution begins inunediately. The the control store address of the third microword . encoder in the integer unit clears the correct bit in Cycle seven begins with a data cache read of the low longword of the field. The microsequencer address xxxxxxx9. The rotator is spinning the generates the sixth control store address, and the three bytes <7:5> of interest from the cache read next cycle is issued normally. in cycle seven to the correct position. No response is issued to the E-box because this unaligned refer­ Cy cles Te n through Fifteen ence requires two data cache reads to fulfill. The In cycle ten, the E-box initiates a byte write to the address xxxxxxx9 and the last bit are inserted into M-box. Data is passed to the M-box, and the appro­ the write queue. The M-box delivers the required priate byte is shifted to the low byte location. The longword, and execution begins immed iately. The sixth and final microinstruction is issued normally. second execution cycle calculates the target byte In cycle eleven, the M-box receives an explicit address. The position, divided by eight, is added to E-box write request to retire the BBSC instruction the base address. The microsequencer generates with a memory write. Explicit writes differ from the fourth control store address by using the next writes initiated by the 1-box in that the E-box sup­ address field of the microword. No operands are plies a virtual address with the data, whereas the selected for the next cycle, and the next instruction I-box provides a virtual address and the E-box sub­ is issued normally. sequently provides the clara for 1-box v.·rites. How­ Cycle eight is a rotation-only cycle. The one byte ever, three entries exist in the write queue for the <8> of interest, read from the cache in the previous prefetched quad word. These entries were placed in cycle, is rotated into the correct position (i.e., byte the queue for memory conflict-checking purposes <0:3> ), and the M-box sends the data to the E-box and cannot be used for writing purposes because by issuing a response. only a byte of clara is being written and not a quad­ The third execution cycle uses the bit position to word. The write field command from the E-box set up the special encoder in the integer unit and forces the write queue control to discard the three clear the appropriate bit. The source two register entries. The front end of the E-box accesses the fi le pointer is incremented again to select the high translation buffer and checks fo r write success longword from the source list. This microword during this cycle. If the write is successful, the phys­ branches on three comlitions determined by hard­ ical address and the context of the byte are sent to ware functions. The first condition indicates if the the data cache. low longword of the prefetched field has a page The fi nal execution cycle determines if the fault. If a fa ult does exist, the microword flow branch prediction was correct. The bit specified checks whether the longword is needed or not. As by the correct position is shifted to the least signi­ noted earlier, the longword was prefetched in fi cant position in the shifter, where it can be used the hope that the bit position was within the first for a macrobranch comparison. The macrobranch 64 bits of the base. If the bit is not within the first result is compared to the I-hox branch prediction longword , the page fault can be disregarded . The in cycle twelve. The microword also ind.icates that second branch checks whether the position is the microsequenc<.:r should start forking for new gr<.:ater than (l_) hits. If it is greater, the microcode macroinstructions.

Vo l. 1 t\iJ. ·I Digital Te cbnica/jourual P( /1/ /'J')IJ .19 VAX9000 Series

Cycle twelve is the data cache lookup cycle fo r E-box. This process c:vens the tl ow through the the byte-write operation. The data size is less than a pipeline and keeps the E-box busy. Figure 6 illus­ longword. Therefore, the byte that is to be written tratc:s the code block as it moves down the pipe. must be merged with the seven unaffected bytes of The first stage is the virtual instruction cache the cache line. :tccess, or fetch. stage as the instruction is read from Two signals are sent to inform the 1-box of the the virtual instruction c:.tche. Some instructions branch prediction status. The branch valid signal do not need an actual virtual instru ction cache indicates that a branch prediction validation has access but are in the !-buffer from :.t previous virtu:.tl occurred, and the branch signal indicates ifthe va li­ instruction c:.tche fe tch. The instruction decode dation was correct. takes place in the decode, or XBAR, stage. The The branch prediction logic receives the branch !-buffer is shifted and the fork RAI'

.:; Fa ll /')')0 40 Vo l. .2 No. Digital Te chnicaljo urnal VA X Instructions That Illustratetbe Architectural features of tbe VA X 9000 CPU

2 3 4 5 6 7

VIC

XBR/IBUF

OPU/FPL/S_L __ -'-_

TBfi_ OATA

CACHE/ISSUE/READ_SLIST

EXECUTE- INT/MUL/FLOAT

RETIRE

WRITE-GPRjQU EUE

WRITE-DATA

KEY: � ADDL2 � SUBF3 � MULG3 � BBSC

Figure o VA X 9000 Instruction Pipeline

Overall some part of this set of instructions is address translations. However, if the required trans­ being worked on for 22 cycles. After cycle nine the lation is not contained in the translation buffer, the !-box can be prefetching and decoding the instruc­ fix-up unit autonomously creates an entry, which tions after the branch, either the instructions eliminates the usual latency involved when an directly fo llowing the branch or instructions that E-box is used to translate addresses. The M-box also are branched to. The number of retire cycles used or pretranslates write addresses and stores them in the wasted by a sequence of instructions is a good mea­ write queue for subsequent access and confl ict sure of the time taken to execute those instructions. checking. The 128KB, two-way associative, write­ If the prediction is correct, these four instructions back cache provides a very low miss rate, high execute in I '5 cycles; but if the prediction is incor­ bandwidth, and low latency. The queues of rect, these instructions take 21 cycles. decoded instructions allow the 1-box pipeline to be less tightly coupled to the E-box. The multiple Conclusion functional units in the E-box allow multiple VAX instructions to be executed in parallel. The archi­ The various advanced architectural features con­ tectural features of the VAX 9000 interact to pro­ tributed to the low number of cycles required for duce a CPU that executes VAX instructions in the an average VAX instruction. The virtual instruction least number of cycles or ticks. The low number of cache provides a high bandwidth of instruction ticks per instruction, combined with the short cycle stream to the !-buffer (8 bytes per cycle) and time, produce the highest performance VA X system requires a much lower bandwidth from the M-box now available. (8 bytes every 12 cycles). The large !-buffer presents 9 bytes of instruction stream for decoding. The Acknowledgments instruction decoder (XBAR) delivers up to three The design and implementation of the VA X 9000 specifiers per cycle. The operand processing unit was a team effort with contributions from many calculates operand addresses and branch target people, including: Dave Fite, Mark Firstenberg, and addresses. The branch prediction unit accurately Mike McKeon, who were responsible for the !-box; predicts the majority of branches, and instruction Dave Webb, Maurice Steinman, joe Macri, Brad decoding continues down the predicted path so Hollister, and Basheer Ahmed, who designed the that no time is lost waiting for results from com­ M-box; Bill Crundmann, Larry Herman, Ginny pares. The !-box prefetches and decodes up to five Lamere, Elaine fi,te, Oan Stiding, hken Samberg, instructions ahead of the E-box. The translation Mark Haq, ancl .\1att Adilctta, who were members of buffer contains up to 1024 virtual-to-physical the E-box design team.

Di?,ital Te clmicaljournal l!rJ/. l Vo. 1 Fu /1 /')')0 ..Jt VAX9000 Series

References 5. M. Troiani et al., "The VA X 8600 J-box, A Pipelined Implementation of the VA X Archi­ I. D. Marshall and ). McElroy, "VAX 9000 tecture," Digital Te chnicaljoumal, vol. 1, no. 1 Packaging-The Multichip Unit," Proceedings (August 1985 ): 24-42. of COJ'v!PCON '90 (San Francisco: Spring 1990): 54-57. 6. ). Emer and D. Clark, "A Characterization of Processor Performance in the VA X-11/780," 2. T. Fossum and D. Fite, "Designing a VA X for High Proceedings of the 11th Annual SJ mlposium on Performance," Proceedings of COMPCON '90 4 (San Francisco: Spring 1990): 36-43. Computer Architecture (Ann Arbor: June 198 ).

3. T. Leonard, VA X Architecture Reference Manual 7. T. Fossum, ). McElroy, and W. English, "An Over­ (Bedford: Digital Press, Digital Equipment view of the VAX 8600 System," Digital Te chnical Corporation, 19H7). journal, vol. 1, no. I (August 1985): 8-23. 4 .). Murray et al., "Microarchitecture of the 8. S. Mishra, "The VAX 8800 Microarchitecture," VA X 9000," Proceedings <>[ COM PCON '90 (San Digital Te cbnicaljoumal, vol . I, no. 4 (February Francisco: Spring 1990): 44-53. 1987): 20-33.

42 Vo l 2 No. 4 Fa ll I'J')O Digital Te cbnicaljoun.al Matthew]. Adiletta Richard L. Doucette john H. Hackenberg Dale H. Leuthold Dennis M. Litwinetz

Semiconductor Te chnology in a High -performance VAX Sy stem

The VAX 9000 system is the newest member of Digital's VAX fa mily of computer systems. The9000 is a high-performance ECLproce ssor, with a very fa st, 16-nano­ second cycle time. To achieve thishi gh level of performance, a new generation of semicustomand custom integratedcircuits was required fo r the scalar CPU and the vector processing op tion. Goalsfo r circuit density, performance, and skew mainte­ nance werefu lfilled with the development of a high-speed gate array, sp ecial custom chipsused in keyap plications, and a high-speed RAMemploying a newarch itecture.

The semiconductor requirements for the VAX 9000 The STGx chip was implemented using a silicon system posed a number of challenges for Digital's compiler technique. The MULx and DJVx chips Integrated Circuits Development Group. Those mwere implemented using a standard cell design requirements included a tremendous number of approach. Statistics on 9000 system chip design are equivalent logic gates (1,037,400 gates) and a large given in Table 1. amount of RAM in the processor (3,280,000 bits). This paper describes the VAX 9000 MCA Ill gate Moreover, the project's performance goal of over array, the development of each of the five custom 30 VA X-11/780 units of performance (VUPs) chips, and the STRAM architecture. Before our dis­ required the development of state-of-the-art semi­ cussion of the gate array, we present a brief conductors and the use of innovative techniques to overview of the semiconductor technology used design them. to fabricate the array and the custom chips. Given the project's goals, the IC technologists evaluated several competing semiconductor tech­ Semiconductor Te chnology nologies and decided to implement most of the logic within the 9000 system in a high-speed, high­ In 1985, the VAX 8800 series was Digital's largest density, 10,000-gate array. The gate array provides and most powerful system, offering single-CPU per­ a broad range of speed and power-dissipation formance of eight VUPs. The 8800 CPU logic was options. Working with Motorola, the IC Group first Motorola's Macrocell Array I (MCA I) gate array, engineered the base 10,000-gate macrocell array which was fa bricated in MOSAIC I bipolar technol­ (MCA), which is implemented in Motorola's MOSAIC ogy. In comparison, the VAX 9000 goal of 30 VlJPs III process. Logic engineers then designed the 77 was aggressive, and the IC Group realized a new different gate array chips (options) on the base semiconductor technology was required . array, using a rich library of logic functions and a set At the start of the project, the technologists evalu­ of automated place and route tools. Additionally, ated semiconductor vendors to determine what they designed five custom chips, invented a fa st was the "best" technology available to implement cycle time, self-timed random access memory the new system. CMOS, BiCMOS , bipolar, and GaAs (STRAM) architecture, and designed a multichip unit IC technologies were evaluated. Among the factors to imerconnect all these high-performance !Cs.' considered were logic density, gate delays, on- and Four different design methods were used to off-chip interconnect delays. mam.1facturing risks, implement the chips. The MCAx chips employ a gate and product delivery. array design technique. The cnxx, the VRGx, and Although very high gate densities were available the Sl"RAM chips required a fu ll custom approach. with CMOS technology, the delays proved

Digital Te chnicaljournal Vo l. .! No . .:j Fa ll /')90 43 VAX 9000 Series

Table 1 VA X 9000 Chip Statistics

Die Size Signal RAM Power Chip Description (Millimeters) Pins Count Bits (Watts)

MCAx MCA Ill gate array chip 9.8 X 9.8 256 40. 1 K 30 CDxx Clock distribution chip 6.2 X 6.2 170 7.2K 13.9 STGx Self-timed reg ister file chip 9.8 X 9.8 152 29.3K 17.8 MULx Multiplication chip 9.8 X 9.8 182 48.4K 30.9 DIVx Division chip 9.8 X 9.8 112 29.2K 23.9 VRGx Vector register file chip 9.8 X 9.8 198 76.0K 9216 24.9 1KSR 1 K x 4 self-timed RAM 4.9 X 3.6 33 28.0K 4096 2.4 4KSR 4K x 4 self-timed RAM 6.4 X 4.2 35 103.0K 16384 2.4

to be too slow ro meet the cycle time requirement. of the manufacturing steps were new, most of them Also, the CMOS output circuits could not drive sig­ were based on previously proven techniques. The

nals off-chip into a 50-ohm transmission line as group therefore concluded that MOSA IC 111 was quickly as a bipolar transistor, which limited the best suited tO meet the challenges of the VAX 9000 speed of signal between IC:s. system. BiCi\·JOS offers the advantage of highly dense The MOSAIC Ill process is an advanced silicon CMOS coupled with bipolar drive capability. How­ bipolar process which yields a transistor structure ever, the technologies available at the time were with a polysilicon base. emitter and collector elec­ optimized for the best CMOS with a com­ t�·odes, polysilicon resistors, and three layers of promised bipolar device. This approach limited the metalization. Compared to the MOSAIC l device overall performance of the circuit to a level roughly used in the 8800, the critical collector-base junction equivalent to that of previous generation bipolar of this transistor structure takes up approximately devices, which would not be aggressive enough ro 50 percent less area, as shown in Figure I. Com­ meet the CPU performance needs. bined with shallower junctions and reduced base Gallium arsenide (GaAs) ICs offer a theoretical resistance, the intrinsic device performance was performance advantage of between two and three improved by a factor of three. Further, the poly­ to one over s)licon implementations. The group silicon resistor produced with this process has far found IC densities were lower than those of bipolar lower parasitic capacitance than the MOSAIC l devices, however; and the on-chip speed advantage monosilicon resistor. Some key performance mod­ was countered by the need for more off-chip sig­ eling parameters and density metrics are provided

nals in the critical paths of the CPU. Also, because with the figure. the manufacturing technology of GaAs ICs was The VA X 9000 packaging imposed other require­ immature, very few companies had attempted to ments on the semiconductor technology. Power sell GaAs into the commercial marketplace. So dissipation increased from 5 watts for the MCA I to while this technology was considered for a rime in �0 watts for the MCA Ill because of the increase in some applications where alternatives also existed, gate density from 1,200 to 10,000 gates. Therefore it GaAs were eventually dropped from consideration was determined that all chips should be mounted because of the uncenainty of availability. directly to the multichip unit cold plate for opti­ The IC Group also studied Motorola's third mum cooling. For manufacturing economy, it was generation of their oxide-isolated self-aligned desirable to bond the multiple leads of the chip implanted circuits (MOSAIC Ill) bipolar technology.2 directly to the pads on the high-density signal car­ (HDSC). CPU Ir offered a factor of six in speed advantage over rier Consequently, all chips must be the previously used MOSAIC I technology and had provided to the multichip unit assembly site in a the potential of prov iding eight to ten times the tape automated bond (TA B) package. As shown in logic density. Although not as dense as CMOS or Figure 2, chips are mounted in a plastic carrier suit­ BiCMOS, MOSAIC Ill was much faster than either of able for automated handling, and the surface of the those technologies and much denser than any avail­ die is protected from mechanical damage with an able GaAs technology In addition, although many epoxy encapsulent.

44 Vn/. .2 filii. ..; Fa ll 1')')11 Digital Te chnicaljo urnal Semiconductor Te chnology in a High-performance VAXSy stem

MCA JOK Gate Array number of logic cells for a given signal pin count are A high-performance emitter coupled logic (ECL) available for the logic designers. Te chnologists eval­ gate array with 10,000 equivalent gates and 256 uated several key factors to determine the gate array inputs/outputs has been developed for the VA X physical layout and to ensure its success: 9000 system. The gate array design approach used • Area of the silicon chip versus yield in the VA X 9000 system ensures the shortest possi­ ble turnaround time from option ma-;kto hardware, • 110 pad pitch thereby reducing the system design time. In this • Maximum power dissipation approach, cell boundaries are defined with all tran­ sistors and resistors fu, within the cells. When a • Speed of the gates cell function is selected from a predefined cell • Maximum number of logic cells library, the cell customization occurs at the metal between the transistors and resistors. Then, to Successful trial layouts of the IOK ECL gate array define the function of the gate array option, the floor plan were completed before any VA X 9000 metalization between cells is customized. This options were started. approach allows the semiconductor foundry to The gate array floor plan, shown in Figure 3, build many wafers up ro the customizarion level; comprises a central core area of 414 major (M) cells, when a gate array is to be built, only the custom divisible imo quarter cell functions, arranged in an metal is required. As noted above, 77 different lOK array of 20 rows and 21 columns, less 6 sires for the ECL gate array options are used in the VA X 9000 sys­ master bias generators and special clock generator tem. This gate array has a rich selection of logic cells circuits. The number of transistors used in a quarter with different power settings for the logicians to cell is based on the logic cell most frequemly used use to meet performance and power requirements. in the lOK ECL gate array, the scan larch. A ring of Using Rent's Rule, technologists maintained a bal­ 200 output (0) cells is interspersed with 224 inter­ ance between the number of gates and the package face (I) cells. The ring surrounds the imernal cells J/0 count. This balance ensures that a maximum and imerfaces the pad drivers with the internal

MOSAIC Ill P+ POLYSILIC ON N+ POLYSILICON ����� _..) NPN TRANSISTOR 1 POLYSI LICON RESISTOR / I �_..... ,... I �-----� _...... - I / / C-B JUNCTION AREA _..... �--MOSA� I -��--�)·::::::::::::�

MONOSILICON RESISTOR NPN TRANSISTOR

MOSAIC I MOSAIC Ill

NPN Fr: 5 GHz 16 GHz R0: 1475 ohms 400 ohms CJc: 50 II 20 ff CJE: 45 II 24 ff CJS: 185 ff 54 If DRAWN EMITTER SIZE: 31'm X 41'm 1.751'm x 4!Jm METAL 1 PITCH: Bl'm 4.5!Jm METAL 2 PITCH: 151'm 71'm METAL 3 PITCH: - 121'm

Figure I Comparison of MOSAIC Ill and MOSA IC I Deuices

Di�ilal Te cbnicaljournal Vo l. .! No. 4 Fall /')'JIJ 45 VAX9000 Series

cells. The 256 t/0 pad cells along with the J04 Ta ble 2 Comparison of Number of Cells power pads are located around the perimeter of the and Delays in the VAX 8800 and VAX 9000 Gate Arrays IOK gate array. The mctalization system uses three interconnect layers. The customized routing chan­ VAX 8800 VA X 9000 nels reside on the first and second metal layers with Gate Array Gate Array interconnecting vias between the two layers of Internal major 48 metal. The top metal layer and parts of metal I and 2 414 cells provide power and ground distribution. Output cells 26 200 The lOK ECL gate array used in the VAX 9000 is Input cells approximately ten times more dense than the ECL 25 224 gate array used in the VA X 8800 system. The gate Input cells 1.05 nano- 175 pico- gate delay seconds seconds delays in the 9000 are improved six times over gate (high power) delays in the VAX 8800. Ta ble 2 compares the IOK Metal delay 2.6 pico- 1 .3 pico- Ec.Lgate array used in the \AX 9000 to the ECL gate (fall delays) seconds seconds array used in the VAX 8800. per mil per mil Previous gate array designs. in general, have provided only two le,·els of series gating, thereby limiting the complexity of functions that can be All current switches within the array are pow­ designed with one current switch. Within this gate ered from the main supply voltage VEE I. Three­ array, three levels of series gating Jt borh internal level-series gated functions are implemented in the and output macrocells provide addition:�! "AND" VAX 9000 gate array option, which requires VEE I (product) gate fu nctions at very high sreed with to be set to -5.2 V. Input cells are powered from a one switch delay and at a lower power level . Fig­ second, lower supply voltage VEE2 ( - 3.4 V) to save ure 4 compares three-level series gating and two­ power. The output emitter followers of M, I, and cells as well as series-terminated level series gating for a "2-3-4-4 AND/OR " logic 0 ECL (STECL) function (internal gate). Ta ble 3 lists the differences output followers employ constant current source in typical gate performance for a low power gate. pul ldowns to VEE2 to save power. The constant cur­ The table also compares low power gate and high rent source pulldowns minimize the sensitivity of power gate. Notice the power difference between AC performance to variations in power supply. This the two-level and three-level high power gate. same termination scheme was used in VAX 9000 custom chips. One of the technologists' main goals was ro mini­ mize power consumption of each macrocell while obtaining the highest possible performance from the IOK ECL gate array. The overall !OK ECL Gate Array power is limited to 30 watts because of the cooling requirements, the internal power distribu­ tion, and the current density limits on power pins. A unique feature included in the !OK ECL gate array that rrevious gate arrays do not have is series­ terminated ECL (STECL) omputs. STECL outputs

Table 3 Comparison of Two-level and Three-level Series Gating

Two Levels Three Levels of Gating of Gating

Gate delay from 300 picoseconds 250 picoseconds input pin A to output pin YA (low power) Low power gate 9.88 milliwatts 8.84 milliwatts Figure 2 Chip in TA B Pa ckage Mo unted on High power gate 18.20 milliwatts 13.00 milliwatts Plastic Carrier and Encapsulated

46 Vol 2 No. 4 Fa ll /'J'Jfl Digital Te cbn icaljom-nal Semiconductor Te chnology in a High-performance VAX System

Figure 3 Photomicrograph of the Gate Array

include a constant current source pulldown and a reference clocks. The chip also supplies clocks to series terminating resistor. This feature allows the all STRAMs on the unit. Each of the STRAM's four elimination of off-chip termination resistors used groups of SL'< clocks can be programmed to one of in conventional 50-ohm ECL outputs. STECL out­ eight possible clock phases. This flexibility in pro­ puts allow shorter interconnections between chips gramming allows the system designer to select the on the multichip unit because the chips can be appropriate clocks for STRAMs in order to meet placed closer to each other, thm improving perfor­ system timing requirements. mance. Another advantage of using STECL outputs In addition to prov iding the functions above, over 50-ohm outputs is that less than half of the the design goals for the CDxx project included the simultaneous switching output noise is coupled to fo llowing: unswitched outputs. All custom chips used in the • Minimize the space occupied by the chip on the VA X 9000 employ STF.Cl.term ination. multichip unit

Clock Distribution Chip - CDxx • Provide scan control and scan distribution

The major function of the clock distribution chip • Include a wideb:md amplifier (CDxx), shown in Figure 5, is to distribute master • Ensure low clock skew and reference clocks to each MCA on a multichip unit. There are eight pairs of differentialmaster and • Provide a temperature-detecting circuit

Di�ital Te cbuicaljournal 11Jl .! No . q Fa /1 1990 47 VA X 9000 Series

,------�

vee

'------��YA� ------+- vs@ VBB1 VBB1

ONE LEVEL OF GATING .----- vee

vee

VBB3

VEE1 �------�------'

THREE LEVELS OF GATING

Figure 4 Tw o-leuel Fu nctions uersus Three-leuel Fu nctions

48 Vo l. .2 No. 4 Fall I')'JI! Digital Te cb nicaljournal Se miconductor Technology in a High-pe�tormance VAX Sy ste�n

HOT CIRCUIT

Figure 5 Photomicrograph of CDx.-.: Chip

Minimizing the real estate occupied by the chip Each coxx receives its scan control signals from the was complicated by additional fu nctions located on previous CDxx in the chain or from the service pro­ the CDxx, such as scan and the temperature detect­ cessor. As shown in Figure 5, there are three scan ing circuits. The minimization was accomplished rings located on the CDxx. Ring 12 is a 16-bit ring by employing a custom chip design approach in reserved fo r the CD)C'< STRAM clock generation con­ which each element (cell) is optimized and then trol ring. This ring controls the STRAl'•lclock phase manually placed and routed to achieve a compact selection and enable fo r each of the four STRAM des ign. As it turned out, the size of the chip was not clock groups. Ring 13 is a 14-bit ring reserved for the determined by the amount of real estate needed to CD)C'< scan control. Data is shifted into this ring and implement the circuits, but rather by the number of then loaded into CDxx control registers. Ring 14 is a pins required to communicate to the rest of the 47-bit ring reserved for the CDxx information scan multichip unit. ring. Data is loaded into this ring from CDxx data Since a CDxx is mounted on every multichip unit registers and shifted out ro the service processor. in the CPU, the scan distribution and control logic The design of the widebaml amplifier was are located on this chip. The CDxx chips in the sys­ prompted by the need for the clock distribution tem are chained together on the system scan bus. chip to receive two differential sinusoidal master

Digital Te cbnicaljournal Vo l. .! No. ·I Fa ll I'J'JO 49 VAX 9000 Series

and rcfc.:n:ncecl ock signJis as inpurs. These.: signals time is typically less rhan 4 ns, and rhe write time arc.: transformer coupled from the clock source. is typically less rhan ':i ns. Figure () is a photomicro­ The master clock runs at one L"ighrh the systL"m graph of the STG x chip. cycle rimL". and the reference clock runs at the sys­ The STGx is a 64 -word by 18-bit LCL register file tem ncle rime. The wideband amplifier receives containing three wrire ports and rwo read ports. diffe rential sinusoidal signalls of relatively small The 64 words are separated into fou r 16-word by amplitude - less than 125 milli\·olts peak to peak ­ 18-bit storage array sections. Each of the four stor­ and transforms them ro lOOK ECL levels on output. age banks has dual read capabi lity. Storage bank one Th<.: design of the input circuits meets these crite­ has dual write capability; storage banks rwo and ria and rypic::�lly fu nctions with inputs less rhan three have triple write capability; and storage bank 65 millivolts. four hJs single write capability. Simultaneous write All rhe clocks are distributed by the COxx as pairs access to the array is possible through all pons wirh of diffcrc:ntial signals. The distribution of these correct results occurring; the only exception is in clocks is, of course, ro be done with minimal clock the case of writes to the same location from multi­ skew. Clock skew is the difference in del::�y time ple pons, which is an undefined operation. A write berw<.:c:n different clock outputs measured from a followed by a read access to the array - even to rhe common point. The common point in this case is same address - is possible w irh correct results the numbc:r of master dock inputs to the chip. To occurring. The chip has two clock inputs for con­ maintain low clock skew, technologists designed trolling reads and writes. fast gates and minimized the number of cascaded One requirement for rhe design was to include a gates in the clock path. Also, all the metal that inter­ self-rimed write capability so that the system need connects the cells in the clock path is controlled for nor provide properly timed write pulses ro rhe chip. equal delay. As a result, the measured clock skew In rhe system, rhe chip is clocked with STRAM is less than 100 picoseconds on a chip for master, clocks for reading and writing. The design uses reference, and STRAM clocks. The delay of master these clocks to latch read address information, to clock input ro output is less than I nanosecond (ns). latch write add ress information, and to latch input The: temperature-detecting circuit on the CDxx data. In addition, the design rakes the leading edge warnsrhe system when a device junction tempera­ of the write clock ro generate a delayed write pulse. ture approaches rhe maximum allowed tempera­ The delayed write pulse is used to write the appro­ tun: on a multichip unir. As implemented, the priate word in the 64-word by 18-bir array, raking circuit is controlled from the system console. The into account rhe rime needed ro decode the wrire console loads rhe CDxx with a number that repre­ address. sents rhe temperature rhe circuit musr use as a point The design sryle used to implement rhe self-rimc.:d of comparison. If rhe junction temperature of rhe register file chip is simiiJr ro a silicon compiler tech­ Cl)xx is higher than the programmed value, the cir­ nique. The chip's storage area is made up of four cuit trips and notifies the console of a temperature arrays. The input address register for borh read and problem. The console rhen rakes corrc.:criveacrion . wrire ports, the inpur dara larches. and rhe dat::l out­ pur drivers are arrangements of c<:lls in strips. The Self-timed Register File Chip - STGx placement and routing of these arrays and strips was The self-rimed register file chip (sn; x) is employed procedurally performed using custom layom tools. Once rhe blocks were:assembled and placed, inter­ in the VAX 9000 to provide four register banks accessible through muhirle read and write pons. connecrions among blocks, strips, and pins were ·rhe four banks incluJe a microcode scratch-pad then routed manually. register hank, rhe VA X generJl-purpose register set, a memory Jara register storage bank, and an Multiplication Chip -MULx instruction data register bank. The performance The architecture of the scalar processor defined an requirements for rhe STC x were quite rigid and integrated floating point processor. Unlike most guided several key design tkcisions, including den­ RISC processors, which off-load all floating poinr sity and layout. The read access time was ro be less operations ro a separate tl oating point processor, VA X than ':i ns. The write access time was to be less than rhe 9000 sysrem handles fl oating point opera­ 6 ns. Ln orher words. rhe chip must read or write tions within the E-box.1 The multiplication unit any one of irs 6.:j locations in ':i or 6 ns. respectively. therefore supports horh inr<:ger and tl oaring point Borh goals ha\'e been met. In fact. rhe read access fo rmats. To achieve this support, a custom chip was

l'n/. .! .\"o. .; Fall I')<)O Digital Te cbnicaljournal Se miconductor Te chnology in a High-JH!r(ornwnce VA X -�),stem

Figure G Photomicrograph of STGx Ch ip required that provided superior performance. spe­ Three �l l'Lx chips werc r<:quired in the scalar cial logic gates. and improved density. Custom chip processm to achieve doubk-prccision r<:rformancc technology provided enough dcnsity to accommo­ in which every 64 ns a ')6-bit multiplication could date a .12 -bit by :)2-bit. cight-logic-l<:vcl multiplica­ complete. Each Ml'l.x chip has two .12-bit input data (Mlll.x). Ml !Lx tion array in a singlc chip To minimize the buses. The chip is also employed to perform cost and time of custom design. designers employed all integer multiply operations in a single 16-ns standard cell design techniques in which the cell cycle. height was fixcd anu thc width could vary to take The scal::irprocessor, which has .1 2-bit-wide data advantage of packing dcnsity. By constraining paths, delivers double-precision input data in two the design in this fa shion. the High Performance cycles. In the first cycle, each MllLx consumes the

Systems Group's <.AD suitc could be employed to most significant high bits of c:Kh operand. A II three place and route the chip. Special logic gates MULx chips latch this <.bta while also unpacking eliminated thrcc logic lcvds. and high-powered fast it, multiplying it, and then latching the product. Ml'Lx gates provided the pnfmmancc to permit a .1 2-bit One of the chips' results is then s:1ved. In the by :)2-bit multiph· opcration in less than 9 ns. Fig­ second cycle. the n.:maining dou hk:-prccision dat:I, un: I shows a photomicrograph of tile \l ll.x chip. the least significant low bits. is consumed, and each

Digital Te cbnicaljournal \i,f. .! .\iJ. 1 hiii i'J'Jii ')[ VAX 9000 Series

- - - . -

�-

;--..,..:.--,...� ��.;.,...._:;..., I MULTIPLIER ARRAY """"""'.,M, ,:... .-��

...... ,...

Figure 7 Photomicrograph of M UL:xCh ip

MULx chip unpacks the data and performs a unique are delivered, each MLJLx has an additional person­ multiply: operand A high bits and operand B low ality bit for indicating whether the MULx is in the bits; operand A low bits and operand B high bits; V-box or E-box. and operand A low bits and operand B low bits. The MULx chip, as used in both the scalar and III An 1\KA gate array accumulates all these vector processors, is a 32-bit by 32-bit ECL parallel results, and another rounds and packs the bits into a multiplier which is fully pipelined for a 16-ns cycle

VA X fl oating point product. Since each ivl ULx needs time. It performs both two's complement and sign/ ro know which partial product it must compute in magnitude multiplication. In a single cycle, the chip

the second cycle, two personality bits are included unpacks VA X floating point formats F, D, and G, or that are loaded by means of the system scan chain. integer formats long, word, and byte; performs MULx chips are also used in the vector processor. exponent calculations and sign handling; and com­ The vector processor (V-box) has 64 -bit-wide data pletes up to a 32-bit by 32-bit multiplication. paths. Four MULx chips are employed ro complete a If the operation is double precision, the 64 -bit double-precision multiply every 16 ns. Since the result is a partial result. It must be accumulated with operand unpacking di ffe rs between the scalar and three other partial results to form the double-preci­ vector processors as a result of how fast operands sion, correctly rounded, and normalized product.

52 Vo l. 2 No 4 Fa ll 1')')0 Digital Te chnicaljounwl Semiconductor Te chnology in a High-performance VAX System

If the operation is an integer type, then the 64-bit tion, by algorithmically placing the array, signi­ two's complement result is the VA X integer product. ficant density improvements were realized . This Along with producing this integer product, MULx program provides a Wallace-Dadda implementa­ also produces the correct condition codes. Integer tion that logically reduces 32 rows in 8 logic levels, operations require one machine cycle to complete. and consumes as many initial summand bits. It Operands are not latched at input. Instead they are also uses the least number of full adders as theoreti­ immediately unpacked and sent to the multiplica­ cally possible, while delivering the least significant tion array. This multipurpose array then produces a 32 bits of sum and carries at least one full logic level set of sum and carry product vectors. These vectors sooner than the most significant bits. are then added in a full carry lookahead adder (CLA). This adder comprises a 31-bit adder and a Division Chip- D/Vx 32-bit adder, cascaded . The produced sum is the The iterative divide function performed by the divi­ 64-bit product, which is then latched. The output sion chip, DIVx, requires a significant amount of of the latch is used to compute integer-type con­ hardware, the density of which a standard cell chip dition codes. affords. Two gate arrays would be required to per­ The integer instructions supported include VAX form the same function, in which case a timing­ MULB, MULW, and MULL. EMUL is also directly sup­ critical path crossing would occur between the two ported, along with the Z and N bit condition codes. chips. Therefore, the IC designers implemented the Finally, to assist in H format-type multiplications, DIVx chip as a standard cell design by building a true 32-bit by 32-bit magnitude multiplication is on the techniques developed for the MULx chip also supported, called EXTMUL (extended multiply). described above. Also, like the MULx design, the There is a 64-bit data path back into the E-box for goals for the D!Vx design project were to optimize EMUL- and EXTMUL-type operations. performance and minimize real estate use by fitting Six features of the MULx design that improve per­ the iterative divide function in a single chip. formance and minimize logic should be noted. The IC designers employed a standard cell tech­ First, unlike traditional designs, the MULx design nique in which four horizontal sections are defined, does not include Booth recoding of the multiplier each section having a different number of columns. operand. Booth recoding offers no logic savings Reference cells are located in the center row of each either in timing or real estate when the multiplica­ section and provide ECL reference voltages to the tion array reduction scheme is optimal. Second, a cells above and below in that section's columns. Baugh-Wooley two's complement algorithm was Placement was driven for performance, with quo­ used to implement integer multiplication.' Third, tient selection logic being distributed to where it engineers designed special full adder logic gates to was required. This method made for an irregular integrate multiplication summand generation into structure, as can been seen in Figure 8. the full adder cell and to eliminate the need for an The VA X 9000 system optimizes both multiplica­ additional logic level. Fourth, a unique multipli­ tion and division by providing separate functional cation reduction algorithm was developed which units. Each functional unit performs both integer provides the initial routing advantages of a Wallace and floating point operations. This approach differs " 6 tree, with the minimal logic of a Dadda tree. · Fifth, from the one taken by most processor architects, a ripple is formed in the reduction array. The ripple who conceptually link multiplication and division. facilitates the start of the least significant 31 -bit Usually, are chosen that can share hard­ CLA addition at least one logic level sooner than ware at the expense of the performance of either the most significant 32 bits and does not require a operation. The separate division unit in the 9000 carry-in input to the upper 32-bit adder. Finally, by provides superior performance for both integer and developing a very fast 4-3-2-1 AND/OR gate, engi­ floating point operations. The DIVx chip is also neers were able to remove two additional logic used by the V-box to perform very fa st vector divi­ levels in both CLA adder networks. sion operations, as shown in Ta ble 4 . To avoid bugs in the array design, since bugs in an Division is an iterative process. Unlike the case of array consisting of 1000 full adders could have sig­ multiplication, one cannot predict the summands nificantly affected the product shipment schedule, and then reduce the summand matrix. The two engineers developed a program to logi­ approaches to division most commonly used are cally interconnect and physically place the array. the Ta ylor Series convergence algorithm and a sub­ � Any bugs would be algorithmic and not random, tract and shift algorithm. The algorithm employed and algorithmic bugs should be obvious. In addi- in the 9000 is a variation on the subtract and shift

Digital Te cbnicafjournuf Vo l. 2 No. 4 Fa ll 19')1! 53 VAX9000 Series

Table 4 Division Performance and a truncated version of the exact divisor. This imprecise quotient digit is corrected when the next Time guess quoticnt digit is selected . The selected digits Data Type Cycles (Nanoseconds) may be positive or ncgative. The positive digits are Integer: byte 3-4 48-64 accumulated in a positive-value shift register. The word 3-5 48-80 negative digits are accumulated in a negative-value long 3-8 48-128 shift rcgistcr. The final corrected binary quotient is Floating then formed by subtracting the negative register point: F-format 7 112 D-format 13 208 from the positive register. G-format 12 192 The algorithm is based on a signed digit notation scheme. To determine two quotient bits, the bits may be chosen from a digit set that includes

method, which allows for savings in hardware as { -2, -I, -0, + 0, + 1, + 2 }. The digit set is simply an well as increased performance. expanded fo rm of the common nonrestoring digit Jn this method, an imprecise quotient is selected set that typically uses { -1, 0, + 1}. In nonrestoring based on a truncated estimated partial remainder algorithms, the quorient is normally corrected as

Figure 8 Ph otom icrograph of D!Vx Chip

.. Fa ll Digital Te chnicaljo urnal 54 Vol. 2 No J /')')0 Semiconductor Te chnology in a High-performance VAX System

needed; whereas here, it is not corrected until the register file had to perform read and write at dif­ entire iterative process is completed. The next sig­ ferent addresses within a single 16-ns clock cycle. nificant difference between this division technique These requirements could not be met with available and the nonrestoring method is that the quotient memory and logic chips, thus necessitating the bits selected are based on an estimate of the partial development of a fully custom vector register chip. remainder and divisor rather than the exact values. The vector register file is 64 bits wide and con­ The first advantage of this method is that an esti­ sists of 16 vector registers with 64 elements each. mate can be obtained faster than the exact value. The vector register chip, VRGx, was developed as an Second, a truncated estimate is acceptable, rather 8-bit slice of the 64-bit vector register file. The chip than a full-width estimate. Consequently, this contains 9216 bits of RAM for data storage and the method saves a significant amount of hardware and cross-bar logic (6000 equivalent gates) that allows increases the speed of the operation. If one were to access from the five read ports and three write complete each partial remainder, up to three addi­ ports. Integrating the register memory and the tional chips would be required and the delay would cross-bar logic on the same chip allowed timing to more than double. be optimized so that the system timing require­ The trick to the method lies in the quotient selec­ ments were met. tion. The selection is based on partial remainder range transformations which guarantee that a VR Gx Ch ip Physical Features and quotient digit selected in one iteration may be cor­ Organization rected to the exact quotient digit on the next The VRGx chip is fabricated using the MOSAIC III ECL iteration. Therefore, although six quotient digits process, which was not designed as a memory pro­ are determined per major iteration, an additional cess. Coordination with the vendor resulted in the minor iteration is required to guarantee the least addition of an implant step for the memory-cell­ significant digit of the major iteration. The major bit line emitters. Key features of the process are and minor iteration terms refer to the architecture three metal interconnect layers, oxide isolation, of the divide iterative hardware. The OIVx produces and polysilicon emitters with a drawn width of six quotient bits per machine cycle. This is a radix 1.75 microns. 64 division technique. However, the high radix Figure 9 shows the locations of the major circuit division is accomplished by overlapping lesser blocks in the VRGx chip. The major blocks of the radLx divisions. In particular, there are three sets of VRGx chip are five read ports, three write ports, radix 4 division groups. The first two sets are over­ and 16 vector registers in the RAM bank array. The lapped, so that the critical path through the radix block diagram, Figure 10, shows the main data 64 division is actually the critical path through two paths. The 16 vector registers are implemented as radix 4 divisions. A minor iteration is the path 64-word by 9-bit single port RAMs. Eight bits are a through one radix 4 division group. A major itera­ slice of the 64-bit vector register ftle and the ninth tion is the path through the overlapped set of two bit is for byte parity. radix 4 division groups, followed by the final radix 4 group. It is important to note that extra iterations Timing do not adversely affect the corrected quotient. A register RAM can be read from one address and Finally, to produce the corrected quotient, the set written from a different address in one 16-ns clock of negative quotient digits is subtracted from the cycle. This dual operation is made possible by a 2 set of positive quotient digits, where each digit is to 1 multiplexer on the RAM address inputs. The properly radix 2 weighted, based on the order of read address is applied during the first portion of selection. (That is, the first quotient digit selected is the cycle, and the write address is applied during the most significant bit of the correct quotient.) the second portion of the cycle. Splitting the clock cycle into read and write portions eliminates Ve ctor Register File Ch ip - VRGx conflict between read and write ports in the event The VAX 9000 architecture adds vector instructions that a single register RAJVl is selected for both read to the standard VAX environment, thus a vector and write. Read data is held in a latch during the sec­ register file was required. There were two primary ond portion of the cycle and is unaffected by the design requirements for the vector register file. write operation. First, the register file and associated cross-bar logic A single clock cycle consists of nonoverlapping had to fit in a single multichip unit; and second, the clock phases A ami B. Latches on the read and write

Digital Te chnicaljo urnal Vol. 2 No. 4 Fa llf'J'JO 55 VAX 9000 Series

Figure 9 Photomicrograph of VRGx Chip pon inputs are clocked by phase A, and read port A read port consists of an enable, a 4-bir register output latches are clocked by phase B. For a read select, a o-bit vector element address, and a 9-bit operation initiated on phase A, the output read data output. An enabled read port applies a register becomes valid during phase B. select code that points to a particular RAM bank. At that RAM bank, a')to I multiplexer selects the vec­ Cross-bar Logic tor element address from the active read port and

Cross-bar logic in the RAM bank array makes each of applies it ro the read add ress of the RAM. Then the the 16 vector register RAMs independently accessi­ RAM output passes through a 16 to l multiplexer ble from the read and write ports. Enable inputs on controlled by the register select code, so that the the ports prevent invalid addresses from contlicring selected RAM output reaches the output of the active with intended addresses. Read and write ports may read port. point to the same register RAM, bur different write A write port consists of an enable, a 4-bit register pons may nor point to the same RAM. Also, differ­ select, a 6-bir vector element address, and a 9-bir ent read ports may on ly point to the same RMvl if the write data input. An enabled write port applies a vector element address is the same. All conflicts register select code that points to a particular RA.M must be resolved external to the chip. bank. At that RAM bank, a 3 to I multiplexer selects

Fa ll 56 Vo l .! 1\'o. 4 /

5x 5x

SEL SEL ADDR<5:0> -

- ;------r - ---, READ SEL<3:0> - READ PORT I I 16:1 PO RT 91 MUX ou T ADDR 6 I 5:1 ENABLE AR DO D0 -8 0· - I MUX I f-A- ______.. I I 3x

ADDR I 6 I 3:1 DIN<8:0> - I / AW MUX I I I I ADDR<5 0> - RAM WRITE I 64 X 9 I PORT DATA SEL<3:0> - 3:1 I Dl MUX I SEL -+ ENABLE - I I I I

RAMt BA NK I L----I ------__j

RAM BANK ARRAY. 16x

Figure 10 VRGx ChipBlock Diagram the vector element address from the active write tial, and low sensitivity to alpha-particle-induced port and applies it to the write address of the RAM. soft errors. The cell has one limitation: excess Also, a 3 to I multiplexer selects the write data charge storage due to write current can delay sub­ from the active write port and applies it to the RAM sequent writing to the opposite state. This problem data input. is eliminated with a special bit line current steering circuit that makes write current state dependent RAMTe chnology (Figure 11). The normal transistors in an ECL process are of the The SCR memory cell in Figure 11 is written by NPN type, where the collector is a buried N-doped applying a high current (four times read current) to region. For memory cells, a lateral PNP transistor is the "off' bit line emitter. The current steering tran­ placed in the same collector region, and the com­ sistors prevent this current from reaching a bit line bined structure has the latching characteristics of a emitter that is already "on." Thus, attempting to silicon controlled rectifier (SCR). The memory cell write a cell that is already in the desired state does array in the 64 by 9 register RAMs is implemented not result in any additional cell current beyond the with ECL SCR memory cells. normal read current, and no additional charge stor­ The SCR memory cell shown in Figure II consists age occurs. of two cross-coupled SCR structures. Extra NPN emitters connect to the bit lines and provide a Other Ch ip Features means of writing and sensing the celL The "on" side Other noteworthy chip features include scan logic, of the cell saturates, allowing the bit line emitter to parity error detect logic, and a data pipeline for conduct in the inverse mode. Inverse gain of the bit write port 0 data. Scan operation gives access to the line emitters must be limited to avoid excessive register RAMs. In a single scan-in and scan-out oper­ leakage into the unselected cells. An added process ation, it is possible to read five registers and to write step applies a special base implant to the bit line three registers. emitters only to control their inverse gain. Parity checking logic is used to detect input Advantages of the SCR cell include good density, errors and set error flags.There is a parity check on low standby power, large sense voltage differen- the 9-bit write port data inputs. Another parity

Digital Te cbnicaljournal Vo l 2 No . 4 Fa ll 1990 57 VAX9000 Series

1.51 � .---..----. � 0.51 � 0.51

VA VA KEY:

WC - WRITE CONTROL UWL - UPPER WORD LINE BL - BIT Ll N E (LEFT) BR - BIT LINE (RIGHT) LWL - LOWER WORD LINE VA -VOLTAGE REFERENCE

Figure 11 SCR Memory Cell with Bit Line Current Steering Circuit checker is applied to address and control inputs. for the VAX 9000 knew that unless some architec­ These are assigned to three parity groups, with a tural improvements were made to the traditional parity bit input for each group. static RAM, much of the RAM performance improve­ The write port 0 data pipeline allows a delay of ments would be lost in the wiring interconnect. one. two, and three clock cycles to be selected, They also realized that Digital's memory suppliers RAM delaying the write port data as necessary to resolve would have to be convinced that a new archi­ register access conflicts. tecture would be marketable to their other cus­ tomers. After several design iterations, the tech­ Self-timedRAM nologists submitted a set of specifications for a In the VA X 9000 system - as in any high-perfor­ synchronous, self-timed RAM (STRAM) to several mance CPU - fast memory is used for cache and suppliers for their review. After extensive market control store applications. Engineers traditionally surveys, our memory suppliers agreed that this new use very fast static RAMs within the CPU for mem­ architecture could eventually become a new stan­ ory. Logic designers, however, have long recognized dard for high-speed static RAMs. that CPU performance is often limited as a result of The VAX 9000 system requires two configura­ the time needed to access data in these RAMs. This tions of the STRAM device: I K words by 4 bits, limitation is not only the result of the access time and 4K words by 4 bits. A block diagram of the and write cycle performance of the devices them­ STRAM is shown in Figure 12. The STRAM is similar selves, but also of the off-chip circuitry and inter­ to the traditional RAM in that it has chip select, input connect used for write pulse generation and address and data, and output data. However, the distribution. The logic designers and technologists STRAM also has several nontraditional inputs such

58 Vo l 2 No. 4 Fa ll /'J'JO Digilal Te ch nicaljo urnal Semiconductor Te chnology in a High-performance VAX System

as write, a diffe rential clock, and a reference voltage Input and output latches are clocked on opposite (Vbb). Latches added to all inputs and ourputs edges of the internal differential clock buffer. Tim­ provide pipelined timing. An internal write pulse ing diagrams are shown in Figure 13. On a fa lling generator controls write operations and eliminates edge of CLK H, data and address inputs flow into the the need to generate and distribute the write pulse RAMar ray. signal externally on the module. Also two optional If write is asserted during the next rising edge output configurations are provided: a 50-ohm drive of CLK H, then a write cycle is initiated, and the open emitter for standard parallel termination on input data is stored in the memory at the address the module, and a resistor and pulldown current presented at the ADR inputs. At the same time, the source which is wired externally to implement data is passed through the multiplexer and the out­ STECL or on-chip source termination. putlatch. The clock buffer design allows inputs to be If write is deasserted on the rising edge of CLK H, driven differentially from off-chip to minimize then the STRAM is in a read cycle and input data is clock skew. The clock buffer is also designed to ignored _ The data stored in the RAM at the address accommodate customers who are not greatly con­ presented at the ADR inputs flows out to the multi­ cerned about skew or who may be more concerned plexer and output latch. about conserving routing area. One input of the If chip select (CS) is deasserted prior to the rising clock buffer may be tied to the output pin of the edge of CLK H, then write and read operations are reference generator which provides the standard disabled and the output latches are reset low. ECL threshold voltage (Vbb), allowing the other For proper operation of the STRAM , certain input of the clock buffer to be driven in a single­ timing requirements must be fulfilled. The write ended mode. operation is terminated by either the falling edge of

DIN<3:0>H RAM ARRAY DOUT RAM <3:0>H 2M X 4 DIN DOUT <3:0><3:0> ADDR WR EN ..------1

ADDRH

DO<3:0>H

WRITE CLOCK H WRITE L PULSE GENERATOR

��------� ENABLE H

CS L DLY CLOCK H CLK H

D CLK H

0 CLK L

Figure 12 STRAMBlock Diagram

Digital Te cbnicaljournal Vo l. 2 No. 4 Fa /1 1990 59 VAX9000 Series

NOTE: CLOCK HIGH STATE MUST LAST LONG ENOUGH TO COMPLETE A WRITE CYCLE I'" "'I CLK

WRITE

ADDR, DIN, CS

DATA OUT 1 WR W& 2RD Wffo;l 3RD I

KEY:

0 RD - READ OPERATION CYCLE 0 1 WR - WRITE OPERATION CYCLE 1

Figure 13 STRAM Timing Diagrams

CLK H or by the internal write pulse generator, References whichever occurs first. Therefore CLK H must be 1. D. Marshall and ]. McElroy, "VAX 9000 asserted long enough to ensure that data is properly Packaging, The Multi-Chip Unit," Pmceedings of written into the memory array. The internal write '90 (Spring 1990). pulse generator provides an output having the COM PC ON "MOSAIC proper duration as determined by a string of gates. 2. P. Zdebel et al., lll- A High Perfor­ Also, the assertion of the internalwrite pulse sig­ mance Bipolar Te chnology with Self-Aligned nal must be delayed by an amount equal to the inter­ Devices," Proceedings of IEEE 1987 Bipolar nal access time of the RAM . In this way. the correct Circuits and Te chnology Meeting VAX data is stored, and not the data previously stored in 3. D. Fire and T. Fossum, "Designing a for High the input registers. The delay is accomplished by Performance," Proceedings of COMPCON '90 the row delay circuit, which is also simply a string (Spring 1990). of gates. These features give the STRAM its "self­ 4. C. Baugh and B. Wo oley, "A Two's Complement timed" nature. Parallel Array Multiplication Algorithm," Sh011

No te at COMPCON 73, 7th Annual IEEE Acknowledgments Computer Society International Co nference (February 1973). The authors would like to acknowledge the fo llow­ ing individuals who participated in and contrib­ 5. C. Wallace, "A Suggestion for a Fast Multiplier," VAX uted to the success of the 9000 project: Jerry 1 EEE Transactions on Electronic Computers, Weisbach, Andy Moroney, Bob Haller, Mar c Vol. EC-13 (February 1964): 14- 17. Lamere, Mark Hamel, To m Senna, Dave McCall, 6. L. Dadda, "Some Schemes for Parallel Patty Kroesen, Rick Jones, jim jensen, Te rry Multipliers," Colloque sur l'Algebre de Boote Skrypek, Eugene Marteney, Paul Guglielmi, Elaine Oanuary 1965). Fire, Larry Herman, Bill Grundmann, Mark 7. K. Hwang, Computer Arithmetic Principles, Pascarelli, Fran Richard, Linda Greska, Jack Mason, Architecture, and Design (New Yo rk: john Wiley Chris Caiazzi, Roger Dame, Mike Normand Steve and Sons, 1979): 213-283. Sullivan, Rob Rcinschmidt, Bob Bechdolt, Mike Warder, Mike Hickman, Brian Sadler, Wayne Nunn, Rita Wes pi, Gene Ye e, Bruce Smith, Alisyn Emerson, Jim Glanville.

60 Vol. 2 No. 4 Fa ll 19')0 Digital Te cbnicaljounwl Richard A. Brunner Dileep P. Bhandarkar Francis X. McKeen Bimal Patel Wi lliam]. Rogersjr. GregoryL. Yo der

Ve ctor Processing on the VAX 9000 Sy stem

Th e VAX 9000 system provides thefirst emitter-coupled logic (ECL) implementation of the VA X vector architecture. The op tional vector processor on the VAX 9000 system addresses the computing needs of numerically intensive applications with a peak performance of 125 MFL OPS for double-precision calculations. Th e innovative design of the vector registerfi le allows the vectorprocessor to overlap the execution of up to three vector instructions. Supported by both the VMS and ULTRIX op erating systems, the vector processor on the VA X 9000 system provides fo ur to fi ve times performance improvementfo r vectorizable applications over its scalarprocessor.

For a long time, vector processing was the domain This paper describes the VAX vector architecture of large, expensive such as the and its implementation by the VAX 9000 V-box. The -1.1 However, with the availability of low cost, first part of the paper discusses the architectural pipelined floating point arithmetic chips, and the model that all VAX vector processors must follow. maturation of vectorizing , vector pro­ The second part shows the actual realization of this cessing has become a mainstream technology for architecture in the VAX 9000 V-box and explains the scientific applications.2 Applications that can bene­ innovative techniques the V-box uses to achieve fit from vector processing include finite element good performance. The paper concludes with analysis, signal processing, and computational fluid preliminary vector performance numbers for the dynamics. The recent addition of integrated vector VAX 9000 system on some standard vector bench­ processing to the VAX architecture and its imple­ marks and a number of vector code examples. mentation on the VAX 9000 system provides these applications with an improvement in execution VA X Ve ctor Architecture time of four to five times over that of a VAX 9000 sys­ The VAX vector architecture defines the instruction tem without vector processing. Ve ctor processing set, registers, and behavior that all VAX vector extends the performance range of VAX systems. The vector processor on the VAX 9000 system, implementations, such as the VAX 9000 V-box, must ' referred to as the V-box, is the first emitter-coupled follow. The vector architecture effort started in logic (ECL) implementation of the VAX vector archi­ December 1985. At that time several CPU develop­ tecture. The definition of the architecture and the ment projects were well underway, including the development of the V-box started in 1986, two years VAX 9000 system. With the expectation of provid­ after the design of the rest of the VAX 9000 CPU. ing four to five times performance improvement Thus, the design of the V-box was synergistic with for vectorizable applications, Digital decided to add the definition of the VAX vector architecture. The vector processing to the VAX 9000 system, even major goal of the V-box design was to provide though the system was in an advanced stage of adequate vector performance (four to five times development. A decision also was made to provide speed-up over scalar) without impacting the design a complementary metal oxide semiconductor of the remainder of the VA X 9000 CPU and the (CMOS) implementation of the architecture on the " memory subsystem, which were too far along in VAX 6000 Model 400 system. development to change. With vector performance Because both systems could not tolerate major comparable to a CRAY -1 and a peak performance of changes without a major slip in schedule, the archi­ 125 MFLOPS for double-precision calculations, the tecture required an approach that made few V-box fulfills this goal. changes to the scalar processor - that part of a VA,'\

Digital Te L·hnicaljournal V!JI. 2 No. 4 Fa ll 1990 61 VA X 9000 Series

processor that executes the regular VA X instruction seven classes. The seven basic instruction groups set. Furthermore, because not all applications and and their opcodes are shown in Ta ble l. markets can benefit from vector processing, Digital Within each class, all instructions have the same decided not to require vector processing on every number and types of operands, which allows the new VAX processor. Therefore, vector processing is scalar processor to use block-decoding techniques. offered as an optional capability. The scalar proces­ The differences in operation between the individ­ sor decodes vector instructions and passed them ual instructions within a class are irrelevant to the to its associated vector processor. All processing scalar processor and need only be known by the of vector instructions is handled by the vector pro­ vector processor. Important features of the instruc­ cessor. Mechanisms are provided for vector-scalar tion set are synchronization and handling of vector exceptions • Support for random-strided vector memory data by the scalar processor. through gather (VGATH) and scatter (VSCAT) Although the architecture had to account for the instructions implementation constraints of both ongoing CMOS and ECL projects, it had to be general and flexible • Generation of compressed IOTA vectors (through enough to allow future, more integrated implemen­ the IOTA instruction) to be used as offsets to the tations at higher performance. The architecture gather and scatrer instructions also had to minimize its impact on the existing VMS • Merging vector registers through the VMERGE and ULTRIX operating systems because major instruction changes could significantly delay software support • The ability for any vector instruction to operate for vector processing. under control of the vector mask register BasicAr chitecture Additional control information for a vector The VAX vector architecture uses a vector-register­ instruction is provided in the vector control word based design first pioneered by Seymour Cray.1 (shown as cntrl in Table 1), which is a scalar There are 16 vector registers, each of which holds operand to most vector instructions. The control 64 elements; an element is 64-bits. Instructions word operand can be specified using any VAX which operate on longword integers or F _floating . However, VAX compilers gener­ point data, only manipulate the low-order 32 bits ally use immediate mode addressing (that is, place of each element-sometimes referred to as long­ the control word within the instruction stream). word elements. The format of the vector control word is shown in A number of vector control registers control Figure 1. which elements of a vector register are processed The Va , Yb,and Vc fields indicate the source and by an instmction. The vector length register (VLR) destination vector registers to be used by the limits the highest-numbered vector register ele­ instruction. These fields also indicate the specific ment that is processed by a vector instruction. The operation to be performed by a vector compare or vector mask register (VMR) consists of a 64-bit mask, convert instruction. The MOE bit indicates whether in which each mask bit corresponds to one of the the particular instruction operates under control of possible element positions in a vector register. the vector mask register. The MTF bit determines When instructions are executed under control of what bit value corresponds to "true" for vector the vector mask register, only those elements for mask register bits. It allows a compiler to vectorize which the corresponding mask bit is true are pro­ if-then-else constructs. The EXC bit is used in vector cessed by the instruction. Ve ctor compare instruc­ arithmetic instructions to enable integer overflow tions set the value of the vector mask register. and floating underflow exception reporting. The The vector count register (VCR) receives the Ml bit is used in vector memory load instructions to number of elements generated by the compressed indicate modify-intent. Figure 2 shows the encod­ IOTA instruction, which is similar to COMPRESSED ing for some typical VAX vector instructions. IOTA on the CRAY-2.1 All VAX vector instructions use two-byte extended opcodes. Any necessary scalar Ve ctor Execution Mo del operands (e.g., base address and stride for vector With the addition of vector processing, a typical memory instructions) are specified by standard VAX VAX processor consists of a scalar processor and an scalar operand specifiers. The instruction formats associated vector processor; the two are referred to allow all VA X vector instructions to be encoded in as a scalar/vector pair. A VA X multiprocessor system

62 Vo l. 2 No. 4 Fa ll 1990 Digital Te cbnicaljournal Ve ctor Processing on the VAX 9000 Sy stem

Table 1 VAX Vector Instruction Classes

Vector Memory, Constant-stride Vector-scalar Double-precision Arithmetic opcode cntrl, base, stride opcode cntrl, scalar

VLDL Load longword vector data VSADDD O_floating add VLDQ Load quadword vector data VSADDG G_floating add VSTL Store longword vector data VSCMPD O_floating compare VSTQ Store quadword vector data VSCMPG G_floating compare VSDIVD O_floating divide

Vector Memory, Random-stride VSDIVG G_floating divide opcode cntrl, base VSMULD O_floating multiply VSMULG G_floating multiply

VGATHL Gather longword vector data VSSUBD O_floating subtract VGATHQ Gather quadword vector data VSSUBG G_floating subtract VSCATL Scatter longword vector data VSMERGE Merge VSCATQ Scatter quadword vector data Vector-vector Arithmetic

Vector-Scalar Single-precision Arithmetic opcode cntrl or regnum opcode cntrl, scalar VVADDL Integer longword add

VSADDL Integer longword add VVADDF F _floating add VSADDF F _floating add VVADDD O_floating add VSBICL Bit clear longword VVADDG G_floating add VSBISL Bit set longword VVBICL Bit clear longword VSCMPL Integer longword compare VVBISL Bit set longword VSCMPF F _floating compare VVCMPL Integer longword compare VSDIVF F _floating divide VVCMPF F_floating compare VSMULL Integer longword multiply VVCMPD O_floating compare VSMULF F _floating multiply VVCMPG G_floating compare VSSLLL Shift left logical longword VVCVT Convert VSSRLL Shift right logical longword VVDIVF F _floating divide VSSUBL Integer longword subtract VVDIVD D_floating divide VSSUBF F _floating subtract VVDIVG G_floating divide VSXORL Exclusive-or longword VVMERGE Merge IOTA Generate compressed IOTA VVMULL Integer longword multiply vector VVMULF F _floating multiply VVMULD O_floating multiply

Vector Control Register Read VVMULG G_floating multiply opcode regnum, destination VVSLLL Shift left logical longword VVSRLL Shift right logical longword

MFVP Move from vector processor VVSUBL Integer longword subtract VVSUBF F _floating subtract Vector Control Register Write VVSUBD O_floating subtract opcode regnum, scalar VVSUBG G_floating subtract VVXORL Exclusive-or longword

MTVP Move to vector processor VSYNC Synchronize vector memory access

Digital Te chllicaljournal Vol. 2 No. 4 Fall /990 63 VAX9000 Series

15 14 13 12 11 8 7 4 3 0 EXC MOE MTF Ml 0 VNCONVERT FCN VB VC/COMPARE FCN

Figure 1 Ve ctor Co ntrol Wo rd

comprises a number of tht:st: scalar/vector pairs. ever, the asynchronous execution does cause the Asymmetric configurations can exist when only reporting of vector exceptions to be imprecise. some of the VA X processors in a multiprocessor Special instructions, which are described in the system contain a vector processor. Synchronization section, are provided to ensure For good perform ance, the scalar processor oper­ synchronous operation when necessary. ates asynchronously from its vector processor Both scalar and vector instructions are initially whenever possible. Asynchronous operation allows fe tched from memory and decoded by the scalar the execution of scalar instructions to be over­ processor. If the opcode indicates a vector instruc­ lapped with the execution of vector instructions. tion, the opcode and necessary scalar operands are Furthermore, the servicing of interrupts and scalar issued to the vector processor and placed in its exceptions by the scalar processor does not disturb instruction queue. The vector processor accesses the execution of the vector processor, which is memory directly for any vector data that it must freed from the compk:xity of resuming the execu­ read or write. For most vector instructions, once the tion of vector instructions after such events. How- scalar processor successfully issues the vector

ASSEMBLER FORMAT:

VVEOLF V6,V7 ;IF V6[i] = V7[i] THEN VMR[i] = 1, ELSE VMR[i] = 0 ; (VVEOLF IS A VVCMPF PSEUDO·OPCODE) VVADDF/1 V1,V2,V3 ; V3 = V1 + V2. DO ADDITION UNDER CONTROL OF VMR : WITH MATCH = 1 VSMULF/U R4,V4,V5 ; V5 = R4'V4 WITH UNDERFLOW EXCEPTION CHECKING ENABLED INSTRUCTION FORMAT: VVCMPF cntrl.rw ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD VVADDF cntrl.rw ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD VSMULF cntrl.rw, src.rl ; INSTRUCTION CONSISTS OF OPCODE, CONTROL WORD, AND SCALAR SOURCE ENCODING IN MEMORY:

- BYTE ,- -, FD 0 ::> TWO-BYTE OPCODE FOR VVCMPF C4 :1 8F :2 OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD) :3 CONTROL WORD <7:0>: COMPARE FCN IS EOL AND V7 IS A SOURCE :4 CONTROL WORD <15:8>: V6 1S A SOURCE 5 :6 ..J. TWO-BYTE OPCODE FOR VVADDF

:7 •- OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD) :8 CONTROL WORD <7:0>: V3 IS DESTINATION AND V2 IS A SOURCE A SOURCE, MASKED OPERATIONS ARE ENABLED, AND MATCH = 1 :9 CONTROL WORD <15:8>: V1 IS TWO-BYTE OPCODE FOR VSMULF ::C: J OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD) :D CONTROL WORD <7 0>:V5 IS DESTINATION AND V4 IS A SOURCE

:E _,_ CONTROL WORD <15:8>: VA IS IGNORED. UNDERFLOW EXCEPTION CHECKING IS ENABLED :F OPERAND SPECIFIER FOR REGISTER MODE WITH SCALAR DATA IN R4

Figure 2 Ve ctor Instruction Encoding

64 Vo l. 2 No. 4 Fa l/ /<)<)0 Dtgilal Te chn icaljo urnal Ve ctor Processing on the VAX 9000 Sy stem

instruction, it proceeds to process other instruc­ vided with a prograrruning model that ensures tions and does not wait for the vector instruction to correct results regardless of the extent a particular complete. An execution model is shown in Figure 3. implementation chains or overlaps. This approach When the scalar processor attempts to issue a differs with respect to some other existing vector vector instruction, it checks to see if the vector pro­ architectures, such as the IBM S/370 vector archi­ cessor is disabled - that is, whether it will accept tecture, which give the appearance of sequential further vector instructions. If the vector processor instruction execution.6 is disabled, then the scalar processor takes a "vec­ A VAX vector implementation may have its own tor processor disabled" fault. An operating system memory management hardware, translation buffer, handler is then invoked on the scalar processor to and cache; or it may share those of the scalar pro­ examine the various error-reporting registers on the cessor. In high-end vector implementations, such as vector processor to determine the disabling con­ the VAX 9000 system, the vector and scalar proces­ dition. The vector processor disables itself to report sors are tightly coupled. The problems of limited the occurrence of vector arithmetic exceptions or chip area and translation buffer and cache coher­ hardware errors. The operating system disables the ency can be lessened by allowing high-speed mem­ vector processor, usually to indicate the unavaila­ ory management hardware and cache to be shared bility of the vector processor, by writing to a privi­ by both vector and scalar processors. For other leged vector register. If the disabling condition can implementations, such as the VAX 6000 Model 400 be corrected, the handler enables the vector proces­ system, the vector and scalar processors are not so sor and directs the scalar processor to reissue the tightly coupled, and there is a performance advan­ faulted vector instruction. tage in allowing separate memory management Within the constraint of maintaining the proper hardware and cache.1 Little additional effort is ­ ordering among the operations of data-dependent essary by an operating system to support separate instructions, the architecture explicitly allows the vector memory management hardware and cache. vector processor to execute any number of the A vector processor can treat vector memory instructions in its queue concurrently and retire management exceptions (MME) in a synchronous

them out of order. Thus, a VAX vector implementa­ manner, as the VAX 9000 V-box does. Once the tion can chain and overlap instructions to the scalar processor issues a vector memory instruc­ extent best suited for its technology and cost­ tion, it pauses until the vector processor deter­ performance. In addition, by making this feature an mines whether an MME will be encountered by the explicit part of the architecture, software is pro- instruction. If an MME will occur, then a precise

PHYSICAL MEMORY 16 GB

INSTRUCTION OPCODE, CONTROL WORD STREAM INSTRUCTIONS VAX SCALAR DATA CPU DISABLE/STATUS

DATA STREAM

VECTOR DATA

Figure 3 Ve ctor Execution Un it

Digital Te cbnical]ournal Vo l. 2 No. 4 Fa ll 1990 65 VAX 9000 Series

exception is taken on the scalar processor and the Ve ctor arithmetic exceptions are reported in an appropriate operating system handler is invoked. imprecise manner by vector processor disabled If no MME willoccur, the scalar processor proceeds fa ults. When an exception occurs in the processing to process other instructions and the vector proces­ of a vector element, the vector processor records sor completes the memory instruction. In the case the exception in both a privileged exception regis­ of referencing a unity-strided vector, which occurs ter (the vector arithmetic exception register, VA ER) most frequently, the MME checking takes only and in the corresponding element of the destination a short time at the beginning because the vector vector register specified by the instruction. The vec­ is contained in two or less pages. (MME checking is tor processor then disables itself from receiving done at the page level .) fu rther vector instructions. However, the vector processor continues to execute the instruction that Co ntext Switching encountered the exception to completion by pro­ Because of the asynchronous operation of the vec­ cessing the remaining vector register elements. tor and scalar processors, the vector context state of As stated earlier, memory management excep­ a process is separate from its scalar comext state. tions can be reported precisely by a VAX vector Thus, it is possible for an operating system to swap processor to its scalar processor, as the VAX 9000 in a new process to the scalar processor while V-box does, and the scalar processor takes a normal allowing the ve ctor context of the previous process VAX memory management fa ult. Exception infor­ to remain on the ve ctor processor. When the previ­ mation is placed on the stack in the same format as ous process is swapped out, the vector processor is for scalar memory management exceptions. The disabled by the operating system to prevent other use of the same format minimizes the effort needed processes from accessing this vector context. by an operating system to support these exceptions. If the subsequent processes do not use the vec­ Memory management exceptions were extended tor processor, then the operating system avoids fo r vectors to include two new exception para­ the overhead of saving and subsequently restoring meter bits: vector I/O space reference and vector 8 kilobytes (KB) of vector context state for the orig­ aligrunent fa ult. A vector I/O space reference occurs inal process. If another process does use the vector whenever an attempt is made to load or store vector processor, the operating system must reenable the data to I/O space. Because of the performance vector processor, save the vector state of the origi­ degrada tion of unaligned memory data, a vector nal process, load the vector context of the new alignment fa ult occurs whenever an element being process, and, finally, make the vector processor accessed by a vector memory instmction does not available. This fu ll context switch can take up to begin at an address that is an integer multiple of the 100 microseconds on the VAX 9000 system. length of the element in bytes. For example, a long­ Assuming that only a few processes require the word (4-byte) element in memory should begin at vector processor, it is likely that when the original an address which is an integer multiple of 4 bytes. process is rescheduled to the same scalar/vector pair, the process will find its vector context state Sy nchronization residing on the vector processor. By using this tech­ nique, which is referred to as "cheap vector context In most cases, it is desirable for the vector processor switching," both the VMS and ULTRlX operating sys­ to operate asynchronously with the scalar proces­ tems reduce the time required to swap in a process sor to achieve good performance. However, there that uses the vector processor. are cases in which the operation of the vector and scalar processors must be synchronized to ensure Exceptions correct results. Rather than forcing the vector pro­ Most of the exceptions encountered by VAX vector cessor to detect and automatically provide synchro­ instructions are identical to those that occur for nization in these cases, the architecture provides VAX scalar instructions. The arithmetic exceptions special instructions, which software can use, to are exactly the same. The memory management accomplish the synchronization. Some of these exceptions have been extended to include two new instructions are discussed below. Software must vector exceptions: vector IIO space reference and determine when to use these synchronization vector alignment fa ult. As in the VAX scalar architec­ instructions to ensure correct results or establish ture, the reporting of floatingunde rflow and integer exception checkpoints. Given the necessary sophis­ overflow exceptions can be disabled by setting the tication of vectorizing compilers, this requirement EXC bit in the vector control word. is not onerous.

66 Vo l 2 No . 4 Fall 199 0 Digital Te cbnicaljournal Ve ctor Processing on the VAX 9000 System

Ve ctor and scalar memory references may be of a 64-bit data path, which brings instructions and issued simultaneously. Therefore, these references data to the vector unit, and a 32-bit path, which must be synchronized to prevent a conflictfrom sends data to the scalar unit. AU vector memory occurring when accessing loca­ instructions send data through this data path. tions. This synchronization is provided by the As Figure 4 also shows, the V-box is composed of MSYNC function of the MFVP instruction. Once the the folJowing subunits: vector register unit, vector MSYNC function is invoked, the scalar processor add unit, vector multiply unit, vector mask unit, does not issue further instructions until all pre­ vector address unit, and vector control unit. Each of vious vector and scalar memory references have these subunits can function in parallel, which completed. allows up tO two vector arithmetic instructions Because the vector and scalar processors execute and one vector memory instruction to be executed asynchronously, software cannot determine when a simultaneously. Crucial to this instruction over­ vector exception will be reported. However, soft­ lapping ability is the vector register unit, which ware requires that exceptions be reported at certain supports up to eight simultaneous accesses from checkpoints. For example, exceptions incurred in a the other subunits. procedure must be reported within the context of Physically, the V-box resides on the same planar that procedure before another procedure is calJed. board as the remainder of the VAX 9000 CPU. Three This exception reporting synchronization is pro­ multichip units (MCUs) are reserved for the V-box, vided by the SYNC function of the MFVP instruction. which is a field-installable option. The V-box com­ Once SYNC is invoked, the scalar processor does not prises 25 ECL Motorola Macrocell Array Ills (MCA3)7 issue further instructions until the exceptions of (For brevity, a macrocell array is referred to as a previous vector instructions, if any, are reported. "chip" in this paper.) The operation of these sub­ units and the techniques used to enhancetheir per­ VAX 9000 Y-boxOverview formance are described in the following sections. The VAX 9000 V-box is one of four tightly coupled, parallel function units that compose the VAX 9000 Ve ctor Control Un it CPU. As such, it shares, with the rest of the CPU, The vector control unit receives and coordinates both the large 128KB data cache and the very fast the execution of vector instructions within the address translation hardware. As a result, the V-box V-box. The VAX 9000 scalar execution engine has very fast access to memory data. The V-box is (E-box) transfers both an encoded version of the connected to the CPU through the scalar execution vector instruction and the necessary scalar data to unit as shown in Figure 4 . This connection consists the unit, which loads the instruction and data into a

VECTOR REGISTER MASK! VECTOR UNIT ADDRESS CONTROL 1-----l� 1--lloi UNIT

Fig ure 4 V- box Organization (with VAX 9000 CPU)

Digital Te cbnicaljourna/ Vo l. ,! No. 4 Fa ll/l)'JO 67 VAX 9000 Series

circular queue as shown in Figure 5. The queue can are ro be used by already executing vector instruc­ buffer a few pending instructions while the remain­ tions in the next cycle against the vector register ing Y-box subunits are executing others. Without addresses required by the new instruction. If none the queue, the V-box could not accept pending of the addresses overlap then the instruction is free instructions when all of its subunits are busy, thus, to issue. If an overlap does exist, the instruction is propagating a stall condition to the scalar execution held until the next cycle, when it can then be issued unit and resulting in poor performance. to the appropriate subunit. (The Jack of significant The scalar data that is required by a vector cycle delay in this case is due to the optimal design instruction is placed in the queue one location of the vector register unit.) If there are no register

behind the instruction quadword. Whenever the conflicts, the instruction is issued immediately to queue contains two entries, the vector control unit the appropriate subunit. returns a signal to the scalar execution unit and As the vector control unit issues the instruction to requests that subsequent instruction issue be the subunit, it also sends scalar source operands, delayed until the number of entries in the queue if any, and the addresses of the vector registers has diminished to one or less. The queue is cir­ required by the instruction to the vector register cular in nature and wraps around to the beginning unit. The vector register unit latches the scalar data automatically. for the duration of that instruction. For each cycle When an instruction is loaded into the queue, a of the instruction's execution, the register unit then pointer directs the instruction to the decode logic sends the necessary scalar and register data to the shown in Figure 5. If there is enough instruction appropriate subunit. The vector control unit also data available in the queue and the necessary sub­ contains the vector length register and sends a copy unit is not busy, then the vector control unit sends of it with every instruction that is issued to a sub­ the instruction data from the queue to the register unit. By supplying each subunit with a copy of conflict logic. The register conflictlogic determines the vector length register, writes to the register by if the vector registers required by the instruction are MTVP instructions do not affect instructions cur­ already in use by the other subunits, a condition rently executing under the register's previous value. called register conflict. The determination is made Without this mechanism, writes to the vector by comparing the vector register addresses that length register would be delayed until previously

E-BOX VECTOR BUFFER DATA SCALAR DATA TO VECTOR SOURCE/DESTINATION VECTOR REGISTER FILE REGISTER ADDRESSES ADD VECTOR INSTRUCTION MUL

GEN

NO VECTOR NO CONFLICT REGISTER CONFLICT ISSUE NEW CONFLICT 1---­ INSTRUCTION '-----1 CHECK BUFFER LOGIC BUFFER VALID COUNTER ISSUE INSTRUCTION NEW ISSUE INSTRUCTION DECISION LOGIC

Figure 5 Ve ctor Control Un it

68 Vo l. 2 No. 4 Fa ll /'-)')0 Digital Te cbnicafjounwl Ve ctor Processing on the VAX 9000 Sy stem

executing instructions had finished, which would unit, reads occur during the first half of the cycle, result in poor performance. and writes occur during the last half. A write and Upon reaching the subunit, most vector instruc­ read enabling signal is generated for each register tions execute at one cycle per element, after the bank every cycle. Each cycle, data is selected from initial pipeline latency. However, the vector divide one of the three write ports to be written into any instructions (VSDIV and VVOJV) execute at a varying enabled register banks. Write port 0 has a four-stage number of cycles, depending on the floating point pipe to buffer data coming from the E-box, through format (F, D, or G). (To simplify the vector control the control logic, which cannot be written due to a logic, no other vector instructions are issued once register bank conflict. The vectOr register file also a vector divide starts.) Results are returned to the has three scalar registers (one each for the vector vector register unit or vector mask unit as they are address-mask unit, vector add unit, and vector mul­ generated, depending on the instruction. tiply unit) to hold scalar source operands for vector­ As described earlier, microcode in the scalar exe­ scalar instructions. Write port 0 is used to write cution engine encodes vector instructions into an these registers. Each enabled read port selects an instruction quadword before passing them to the element from one of the 16 register banks or scalar V-box. Ta ble 2 shows the high-order 32 bits of the registers (for vector-scalar instructions) and trans­ format used for every instruction sent to the V-box. fers it to one of the other subunits. This quadword contains fields that indicate the The vector register file uses a technique referred instruction, appropriate V-box subunit to execute to as "barber poling" to improve the use of chaining the instruction, and format of the vector control and overlapped instruction execution. As Figure 7 word. The low-order 32 bits of the instruction quad­ shows, barber poling spreads each architecturally word contain the vector control word for the vector defined vector register across all vector register instruction. The instruction quadwords present the banks. Elements are laid out such that the first V-box with a fixed format instruction that smoothly vector element of each vector register is in location fits into a fi Xed-length instruction queue, requires 0 of the same physical register bank and element b little subsequent decoding, and has fields that can of vector register nisin location b of vector register be directly gated to selection logic. As a result, the bank ({n +b]modulo 16). time needed by the V-box to decode vector instruc­ By using this technique, a vector register conflict tions is reduced and performance is increased. causes the vector control unit to delay the issuing of a new vector instruction for no more than three Ve ctor Register Unit cycles. If the more standard technique of placing all The vector register unit or file, as its name implies, elements of one vectOr register in the same bank contains the logic and fast memory that imple­ were used, a vector register conflict could cause ment the 16 VA X vector registers on the V-box. The the execution of a new instruction to be delayed by block diagram of the vector register file is shown in 64 cycles. The 64-cycle delay would have frustrated Figure 6. The vector register file has three write attempts at overlapping and severely degraded the ports and five read ports. By using the innovative vector performance of the VAX 9000 system. technique described below, these ports provide the multiple accesses needed to feed two operands per Ve ctor Add Un it cycle to the vector add and multiply units, and one The vector add unit executes most vector instruc­ operand to the vector address-mask unit. This unit tions, including both floating point and integer is the single largest contributor to the excellent vec­ addition, subtraction, comparison; vector convert; tor performance of the VAX 9000 system. vector shift logical; vector logical operations; and The file consists of 16 vector registers. Each vector merges. For brevity, these instructions are register contains 64 elements, and each element is referred to as add-class instructions. One of the 72-bits wide (64 data, S parity). The vector register challenges in designing the vector add unit was the file is implemented as a byte-sliced custom chip, need to perform both integer and floating point which has a single parity bit per data port. Three arithmetic. writes and five reads to the file can occur simulta­ The organization of the vector add unit is shown neously in any cycle. All writes must be to different in Figure 8. It is a pipelined structure that comprises register banks. However, multiple reads can occur two identical chips for unpacking and aligning to the same bank if the same element is required by operands (VI:'SAand VI'SB); one chip for performing each read access. Internally to the vector register arithmetic and logical operations (VFAD); and a

Digital Te chnicaljo urnal Vo l. 2 No. 4 Fa lf /')')() 69 VAX9000 Series

Table 2 Encoded Instruction Quadword (bits <63:32>)

Vector OPCODE Control Word Type Dispatch Type Instruction <39:32> <42:40> <46:43>

VVSUBF!VSSUBF OF9 2/6 0 VVSUBG!VSSUBG ODB 2/6 0 VVSUBD!VSSUBD OD2 2/6 0 VVSUBUVSSUBL OF6 2/6 0 VVCMPL!VSCMPL OF5 3/7 0 VVSLL!VSSLL 034 2/6 0 VVSRL!VSSRL 026 2/6 0 VVBISUVSBISL 086 2/6 0 VVBICL!VSBICL 08E 2/6 0 VVXORL!VSXORL 088 2/6 0 VVMERGE!VSMERGE OAE 5/1 0 VVADDD!VSADDD 092 2/6 0 VVADDF!VSADDF 089 2/6 0 VVADDG!VSADDG 098 2/6 0 VVADDL!VSADDL 086 2/6 0 VVCMPD!VSCMPD OD5 3/7 0 VVCMPF!VSCMPF OFD 3/7 0 VVCMPG!VSCMPG ODD 3/7 0 VVCMPD!VSCMPD OD5 3/7 0 VVCVTDF 01 1 4 0 VVCVTDL 016 4 0 VVCVTFD 03A 4 0 VVCVTFG 038 4 0 VVCVTFL 03E 4 0 VVCVTG F 019 4 0 VVCVTG L 01 E 4 0 VVCVTLD 032 4 0 VVCVTLF 031 4 0 VVCVTLG 033 4 0 VVCVTDL 017 4 0 VVCVTFL 03F 4 0 VVCVTG L 01 F 4 0 VVMULL/VSMULL 003 2/6 1 VVM U LF!VSMULF 004 2/6 1 VVM ULD!VSMLILD 005 2/6 1 VVM U LG!VSMLILG 006 2/6 1 VVDIVF!VSDIVF ooc 2/6 4 VVDIVD!VSDIVD OOD 2/6 4 VVDIVG!VSDIVG OOE 2/6 4 VLDL 001 0 2 VLDQ 002 0 2 Block load ooc 0 2 VSTL 003 0 2 VSTQ 004 0 2 VGATHL 005 1 2 VGATHQ 006 1 2 VSCATL 010 1 2 VSCATQ 01 1 1 2 IOTA 012 0 2 Load VLR 007 0 2 Load low VMR 009 0 2 Load high VMR OOA 0 2 Store low VMR OOD 0 3 Store high VMR OOE 0 3 Store unalig ned address 013 0 3 Load VPSR 014 0 2 Load VA ER 015 0 3 Store VA ER 008 0 3 RESET OOF 2 3

Bits <63:47> are reserved.

70 Vo l. 2 No . 4 Fa ll 1990 Digital Te chnicaljournal Ve ctor Processing on the VAX 9000 Sy stem

FROM VML FROM CONTROL FROM VAD VML RESULT VCT WT DAT VAD RESULT WPORT 2 WPORT O WPORT 1

LATCH LATCH SREG4 LD I � SREG2 LD I SCALARI 4 1 I SCALAR 2 S4 � SREGO LD I 1 I SCALAR 0 S2 I I H I so ' WRPORTO CNF SEL -\ / I \ SELECT WRITE DATA FOR EACH REG BANK FROM WRITE PORTS I

REG BANKO WT EN REG BANKO AD EN I WRITE j � READ WPORTS O -2 ENABLE ENABLE A PORTS 0-4 REG BANK15WT EN REG BANK15 RD EN LOGIC MEMORY LOGIC ARRAY REG BANKO WT ADA REG BANKO RD ADA � I WRITE READ - WPORTS 0 -2 - ADDRESS ADDRESS RPORTS 0-4 REG BANK15 WT ADA REG BANK15 RD ADR l LOGIC LOGIC

SELECT DATA FOR EACHl READ PORT! FROM REG BANKS I \ S so --=:J I S2 "-1 I 4 ---r I SCALO SE L -\ I -\ I .\ I SCAL2 SE L SCAL4 SEL I

I I I I �I eRPORTO eRPORT1 eRPORT2 eRPORT3 RPORT4 TO MASK LOGIC TO VML LOGIC TO ADDER LOGIC

Figure 6 Ve ctor Register Unit

remaining chip for normalizing, rounding, and packing the result (YFPK). The data paths between the chips are al1 64 -bits wide. The pipeline latency through this unit for both single-precision (integer and F _floating) and dou­ ble-precision (G_floating and D_floating) formats is only three cycles. Thus, the vector/scalar cross-over number for add-class instructions is quite small (that is, the minimum number of vector elements needed for the V-box to surpass the performance of the remainder of the VA X 9000 CPU for this class of instructions.) As a result, the V-box achieves good performance for add-class instructions with small­ sized vectors and large-sized vectors (large-sized vectors being naturally favored by the technique of pipelining). BANK 0 BANK 1 BANK 2 BANK 13 BANK 14 BANK 15 When the vector add unit begins to execute an instruction, it receives two source elements from Figure Barber Poling the vector register unit each cycle. The elements are 7 latched into the unpacking logic, one clement for

No. 4 Digital Te cbnical}ournal Vo l. 2 Fa ll I'J'JO 71 VAX 9000 Series

each of the two chips. During the next cycle, each The complexity of skipping over masked-out ele­ unpacking chip concurrently unpacks and aligns ments would have added extra cycles of pipeline its source element, if necessary, and forwards the latency and resulted in less performance for small­ result to the addition or logical-operation logic, sized vectors. For masked as well as unmasked depending on the instruction. Within the same instructions, the vector add unit operates from the cycle, the addition chip uses the two sources from first up to the last element (as indicated by the the unpacking logic to generate a result, which is vecror length register) of both source registers. The then latched. actual masking of results is handled by the vector During the final cycle, the result is sent to the control unit, which blocks the vector register unit packing chip, which normalizes, rounds, and packs, from receiving masked-out results as they are if necessary, the result and sends it to the vector being sent by the vector add unit. However, the register unit to be written. Exception checking and packing chip does use vector mask register bits to reporting are also done in the last cycle by the pack­ suppress exception generation for results that are ing chip, which maintains the vector add unit's masked out. copy of the vector arithmetic exception register (YAER). When the instruction completes, the vector Floating Point Op eration When executing vector add unit sends its VA ER copy to the vector mask unit floating point instructions, the unpacking logic to be merged with the VA ER copy from the vector takes the various fields of a floating point element multiply unit. and expands and rearranges it into a more conve­ The vector add unit does not differentiate nient format for the addition logic, i.e., the elemem between masked and unmasked vector instructions. is "unpacked." As a result of this process, the addi-

64 SOURCE A --- -- 1- - VFSA VFSB 1 I I

__L ����------� �------L�MASK BIT I I VFAD II

EXPONENT I

VFPK I VAER TO VMKB

ADDER LOGIC

Figure 8 Ve ctor AddUn it

72 Vo l. 2 No . 4 Fa /1 1990 Digital Te cbnicaljournal Ve ctor Processing on the VAX 9000 Sy stem

tion logic is simplified because all VAX floating point right count to the aligmnent logic, which then formats (F, D, and G) are unpacked into an identical sends the shifted result to the add chip. The add format. The unpackinginvolv es decoding the sign, chip operates on the low-order 32 bits of these inserting the hidden bit, and rearranging the frac­ elements and passes through the high-order 32 bits tion bits. For all VAX floating point formats, the unchanged to the packing chip. For logical shift­ fractional part is expanded to 56 bits. (F_floating left instructions (VVSLLL and VSSLLL), the low-order and G_floating are expanded with zeros on the 32 bits also pass through the add chip unchanged. right.) The fractional part is then surrounded on the On the packing chip, the floating point normalize right with two guard bits and a rounding bit to logic performs to do logical shift-left operations. form a 59-bit fraction. The overflow and guard bits The shift count is passed to the normalize logic ensure the accuracy of rounded results. fr om the unpacking logic during the first cycle. For After the elements are unpacked, the unpacking all other integer and logical instructions, the nor­ chips align the elements by taking the fractional malize count is fo rced to zero to pass the add chip part of the smaller magnitude number and shifting result through. Finally, just before sending the result it to the right until its exponent is equal to that of to the vector register unit, the packing chip checks the larger magnitude number. Each unpacking chip for integer overflow exceptions. also receives the exponent bits of the other chip's element. Therefore, the alignment process can be Merge lnstrnctions For vector merge instructions done in parallel before the elements are sent to the (VVMERGE and VSMERGE), the unpacking chip with addition logic that requires the alignment. If during the masked-out element, based on the appropriate the alignment of an element for a vector floating vector mask register bit, zeros that element out point subtract instruction, a one is shifted out of the before sending it to the addition logic. The addition 59-bit fraction field, then a "sticky bit" is generated. logic adds the zero to the other element, which has This sticky bit is used by the addition logic in the the effect of passing the value of the other element next cycle as a carry into the subtraction. on to the packing chip. The unpacked, aligned elements are then sent to the add chip, which produces a result and then par­ Ve ctor Memory Op eration tially normalizes the result before sending it to the Because vector applications tend to issue many packing chip. Again, if the shifting during normal­ vector memory instructions, the execution time of ization shifts a one out of the fraction field, a sticky these instructions is a critical factor in the perfor­ bit is generated. Finally the partially normalized mance of a vector processor. Therefore, the V-box result and the second sticky bit are sent to the pack­ was designed to minimize the execution time by ing chip which completes the normalization and taking advantage of the VAX 9000 CPU's large 128KB rounding and adjusts the exponent field accord­ data cache, by prefetching vector data, and by ingly. To save an extra cycle, the packing chip com­ fe tching it in blocks instead of element by element. putes two exponents values, one for each value of Memory requests by the V-box are sent through the carry-over in the rounding process. Final selec­ the VAX 9000 CPU to the cache and address trans­ tion of the exponent and its exception is done using lation hardware (M-box) of the VA X 9000 CPU. The the actual carry-over of the rounding logic. The M-box translates the 32-bit virtual addresses for vec­ proper exponent and the normalized fraction are tor data into physical addresses and accesses the then rearranged into the appropriate floatingpoint proper locations in the data cache. The vector format, and the assembled element is sent to the address-mask unit generates the virtual addresses vector register unit. for the vector elements. For vector load and gather instructions, the vector data is returned to the Integer and Logical lnstntctions For vector inte­ V-box through the E-box, and written to the proper ger and logical instructions, the elements bypass the vector registers. The M-box returns 64 bits of data alignment logic and are sent to the add chip (VFAD) each cycle. For vector store and scatter instructions, for all but the logical shift right instruction (VVSLRL the vector elements are sent through the E-box to and VSSLRL). For logical shift right instructions, the the M-box. Although the vector register unit is alignment logic does the shifting because the shift­ capable of sending 64 bits at a time, the E-box need ing circuitry is already needed for the alignment of only forward 32 bits per cycle to the M-box. The fractions in floating point elements. The exponent M-box requires two cycles to write the cache and unpacking logic is used to pass on the logical shift does not actually write the 64-bit data until the

Digita1 1ecbnicaljournal Vo l. 2 No. 4 hill I')'JO 73 VAX9000 Series second cycle. (The first cycle performs the cache tag For vector memory instructions, the vector lookup.) Because the V-box implements synchro­ address-mask unit receives the base (starting mem­ nous memory management exception reporting, ory address of the vector) and stride (distance once a vector memory instruction begins execu­ between vector elements in memory) of the instruc­ tion, no other vector instruction may be issued until tion from the vector control unit in an indirect the memory instruction completes. manner through the vector register unir. Both the The VAX 9000 CPU prefetches vector data. This base and stride are 32 bits long. mechanism is used to move data from the main For most vector load and store instructions, the memory to cache in a manner which optimizes memory addresses for the vector data are generated memory bandwidth. By using this method, a 25 in an iterative fashion. During the first cycle of exe­ percent improvement in the performance of vector cution, the base address bypasses the address adder load instructions is achieved. The preferching starts and is immed iately sent to the M-box to request the when the scalar microcode on the VA X 9000 CPU fi rst element. Concurrently, the base and stride are checks the stride of a VLDQ instruction. If this stride added together by the address adder and latched to is 8 bytes long (quadwords are contiguous in mem­ provide the address of the next elemenr. In the next ory), the microcode converts the instruction into a cycle, the latched address is sent to the M-box and block load instruction and sends it to the V-box. to the address adder, where it is added to the stride The block load instruction directs the V-hox to issue to generate the next address. The process repeats a series of block load requests for vector data. A until all element addresses have been issued. In block load request moves an entire cache block tandem with the address generation, the vector from the memory into the vector registers. These control unit directs the vector register unit to send blocks are loaded into both the cache and the vector or receive the appropriate vector register element. registers when they come from main memory. For vector gather and scatter instructions, the (Bypassing the cached to load the vector registers memory addresses for the vector data are also directly reduces the effect of a cache miss for vector issued in an iterative fashion. During the first cycle data.) Otherwise, the memory requests are done for of execution, the base address is sent to the vector one register element at a rime. address unit. In the second cycle, the vector control In addition to converting the VLDQ to a block unit directs the vector register unit to send the first load instruction, the scalar microcode also issues element of the offset vector to the vector address preferch requests ro the M-box. The M-box deter­ unit, which adds it to the base and latches the result. mines if the data is valid in the cache. If so, no fm­ In the third and subsequent cycles, the resulting rher action is taken on the request. If not, the data address is sent to the M-box while the base and next is requested from main memory. In this manner offset are added together. The process repeats until several prefetch requests are started in successive all element addresses have been issued. In tandem cycles. This method results in multiple memory with the address generation, the vector control unit banks being used in parallel. Vector data comes directs the vector register unit to send or receive the back to the cache at a rate of 500 megabytes appropriate vector register element. {MB) per second. The microcode stops issuing For masked vector load and gather instructions, prefetch requests when all the vector data has been addresses for all elements, masked and unmasked, requested. This ensures that the requests from the are sent to the M-box. However, for masked-our V-box do nor encounter many cache misses. elements, the request is modified from read to read no-op (i.e., do not actually perform the read). Ve ctor Address-Mask Unit This process prevents the M-box from raking cache The vector address-mask unit performs the address misses and address translation exceptions on generation and memory requests needed to exe­ masked-out elements. For masked-our elements, cute the vector memory instructions VLD, VST, the M-box returns a dummy value to the V-box, VSCAT, and VGATH. It also contains the vector mask which blocks the value from being written to the register and support logic for masked instructions. vector register unit. The vector address unit directs Further, it contains the complete vector arithmetic the control unit to block writes, based on the value exception register {VAER), which it updates based of the appropriate vector mask register bit. on the status sent by the vector add and vector mul­ For masked vector store and scatter instructions, tiply units. although both masked and unmasked elements

4 74 Vo l. 2 No. Fa ll1990 Digital Te cbnicaljounral Ve ctor Processing on the VAX 9000 Sys tem are read from the vector register unit, masked-out these chips reside on the VML multichip unit of the elements are stopped from reaching the M-box. The VAX 9000 CPU. The custom multipliers and divider vector address unit, based on the vector mask regis­ are identical to those used in the scalar execution H ter, causes the E-box to discard the masked-out engine (E-box). element instead of forwarding it to the M-box. Multiplication By using fo ur parallel multipli­ As described earlier, a VLDQ instmction with a ers, the pipeline latency through the multiplica­ stride of 8 bytes (unity stride) is converted by the tion logic for both single precision (integer and VAX9000 scalar processor into a block load instruc­ F_floating) and double precision (G_floating and tion when sent to the V-box. The vector address D_floating) is only three cycles. Thus, the vector/ unit, in turn, issues a number of block toad requests, scalar cross-over number fo r multiplication is quite each of which is for 64 bytes of data, to the M-box small. As a result, the V-box achieves good perfor­ with the appropriate address and selection bits. mance for vector multiply instructions with small­ There are eight selection bits, one for each quad­ sized vectors as well as large. As a double-precision word in the block, which tell the M-box whether to vector multiply instruction executes, two 64 -bit return the corresponding quadword to the V-box elements are received from the vector register unit for that block load request. Generation of these each cycle and are latched in the four custom selection bits by the vector address unit is com­ multipliers, each of which does a 32-bit by 32-bit plicated because the starting address of a vector in multiplication. memory is not aligned on a block boundary (i.e., As shown in Figure 9, the element bits are dis­ starts within the middle of a block). The bits also tributed in such a way that one multiplier operates depend on the vector mask register (for masked on the high-order bits of both elements; one multi­ block loads). plier operates on the low-order bits of operand one To handle unaligned, masked block loads, the and the high-order bits of operand two; one multi­ vector address unit must ge nerate selection bits that plier operates on the high-order bits of operand one deselect those quadwords which are not part of the and the low-order bits of operand two; and one vector but lie within the same blocks as the first multiplier operates on the low-order bits of both and last elements of the vector. In addition, it must elements. deselect those quadwords within the vector that During the next clock cycle, each of the four mul­ are masked out by the vector mask register. Both of tipliers unpacks its inputs and sends them through the above requirements are handled by using an a large multiplication array, which produces one extended version of the vector mask register to 64-bit partial product and latches the product. generate the selection bits. This process involves During the third cycle, the pack chips (VMLA and conceptually extending the vector mask register on VMLB) add the four 64 -bit partial products together both ends with enough selection bits so that each to produce one result and prepare the result to be quadword has a corresponding selection bit. For written back to the vector register unit. In this example, a vector starting at the last quadword of cycle, the four partial products are shifted accord­ one block requires that seven selection bits be ing to their weight. We ight is determined in relation added at the beginning of the vector mask register to which bits the multiplier usee! to produce a and one bit be added after the end. result. For example, the multiplier that operated on the high-order 32 bits (most significant bits) of both Ve ctorMultiply Un it elemems produces the most significant partial The vector multiply unit performs all of the vector product bits, and the multiplier that operated on multiply and vector divide operations defined by the low-order 32 bits (least significant bits) of both the VAX vector architecture: VVMUL, VSMUL, elements produces the least significant partial VVDIV, and VSDIV . The unit can perform either one product bits. The partial products must be aligned multiply instruction or one divide instruction at a or shifted properly before they are added together. time, but cannot perform both types of instruc­ Once the partial products have been added, the tions simultaneously. In addition, the unit performs final product is then rounded, normalized, and exception checking and reporting, as required, packed into the appropriate VAX integer or floating including floating overflow, floating underflow, and poim format before being written into the vector divide by zero exceptions. The unit consists of register unit in the next cycle. four custom multipliers: a custom divider, a divide The process and pipeline stages fo r single-preci­ unpack chip, and two packing chips. Physically, sion multiplication (VYMULF and VSMULF) are

Digital Te chnicaljo urnal Vo l. 2 No. 4 ftttl 1990 75 VAX9000 Series

VREG_SOURCE1 [31 OJ VREG SOURCE1 [63:32J VREG SOURCE1 [310J VREG_SOURCE1 [63 32J VREG_SOURCE2 [31 :OJ VREG SOURCE2 [63 32J

CUSTOM MULTIPLIERS

PARTIAL_PRODUCT1 [47:0J PARTIAL _PRODUCT1 [63:0J PARTIAL_PRODUCT1 [63 OJ PARTIAL_PRODUCT1 [63:32J VMLAIVMLS

147 ol RESULTS FROM DIVISION ACCUMULATION + � j63 FINAL PRODUCT 3�1

COMMON (TO BETWEEN DIVU) MULTIPLIERS (FROM EXCEPTION DATA AND AND DIVU) FINAL EXPONENT FROM DIVIDERS EXPONENT LOGIC

VML_RESULT [63:0J TO VREG

Figure 9 Ve ctor Multiply Un it

similar to the process used for double-precision ingly. The normalize count (i.e., shift count) is used multiplication. However, in single-precision multi­ to select the correct final exponent. Overflow and

plication, only one multiplier chip is needed ro pro­ underflow exception checking can only be detected duce the result and the pack chips do not need to and reported after the final exponent is selected. If sum the partial product. Integer multiplication is an exception is detected, then a reserved operand is slightly different from floating point multiplication written to the appropriate vector register element. because it does not need to be accumulated or The first stage of the exponent logic also checks for rounded. Thus, the correct product is produced divide by zero and reserved operand exceptions. by one multiplier. The result bypasses the accumu­ lation and rounding logic and proceeds directly Division Ve ctor division is a variable-cycle func­ into the packing logic to be sent to the vector regis­ tion. The number of cycles depends on the format ter unjt. of the operands. The custom divider is capable of The exponent handling for both multiplication producing six quotient bits per cycle. Therefore, and division is performed by the same logic on the F _floatingpoint division is performed in 7 cycles, packing chips. Depending on the instruction being G_floating point in 12 cycles, and D_floating executed, the exponent is either added (multipli­ point in 13 cycles. Because of the variable number cation) or subtracted (division). The result of this of cycles in a divide instruction, no other instruc­ operation is then piped to the next stage and the tion can execute in the V-box while a divide is in position of the hidden bit is determined. If the frac­ process. Also, because of the iterative nature of divi­ tional portion of the data must be shifted to ensure sion (i.e., one division must be completed before the hidden bit is in the correct position, the expo­ another can be started), the instruction cannot be nent is then incremented or decremented accord- pipelined.

4 76 Vol. 2 No. Fa/1 /'J'J{) Digital Te cbnicaljounwl Ve ctor Processing on the VAX9000 System

As a vector divide instruction executes, two class instruction, vector multiply instruction, and 64-bit elements are received from the vector regis­ vector memory instruction. Unlike the VAX 6000 ter unit each cycle and are latched in the divide Model 400 system, vector register conflicts between unpack chip. The elements are unpacked, and the these instructions have little effect on overlapping.; fractional portion of the elements is sent to the etJS­ With the VA X 9000 system, a conflict only delays tom divider in 32-bit slices. The exponent portion the execution of the subsequent vector instruction is sent to the shared exponent logic on the packing by one or two cycles at most. chips, as described in the Multiplication section. However, the overlapping behavior of the V-box During this cycle, time-critical values, such as com­ is sensitive to the issue order of vector instructions. plemented element values and first-cycle quotient If two vector instructions executed by the same bits, are calculated and forwarded to the custom V-box unit are issued one after the other, the second divider. instruction is delayed until the V-box unit has fin­ When the divider receives the data, it uses an ished executing the first. In addition, vector instruc­ iterative algorithm to produce six quotient bits per tions issued after a vector memory instruction or cycle. The quotient bits produced are then sent to divide instruction, do not begin execution unti.l the the packing chips, which may have to increment previous instruction completes. A general ru le in the quotient, depending on the value of subsequent scheduling code for the VAX 9000 V-box, is to gen­ quotient bits. The divider instructs the quotient erate, whenever possible, instruction triples, where accumulation logic whether or not incrementing is the first two instructions are a vector add-class and necessary. The partial quotient, once decided, is vector multiply instruction and the last instruction held in a bank of latches until all the quotient bits is a vectOr memory or vector divide instruction . are received. When the entire quotient is available, Failing that, at least one vector add-class or vector the result is rounded, normalized, and packed by multiply instruction should be issued before a vec­ using the same logic path as multiplication. A mul­ tor memory or vector divide instruction. tiplexer switches this packing logic between the The fo llowing code examples demonstrate the multiplication and division logic. usage of the VAX vector instruction set and the over­ lapping behavior of the VA X 9000 V-box. (Note: It Performance Characteristics should be assumed in the examples that all arrays As of this writing, testing of the vccror performance are 8-byte double precision.) of the VAX 9000 system has only just begun. How­ In the fo llowing DAXPY inner loop example, the ever, some preliminmy results are presented in first two VLDQ instructions do nor overlap. How­ Ta ble 3. We expect that these results will improve ever, the VSMULD, VVADDD, and VSTQ instructions as testing continues and more code is optimized do overlap. to take advantage of the chaining and overlapping Do i = 1 , 64 provided by the V-box. DY ( i) = DY ( i) • DA x OX ( i) enddo Chaining and Overlapp ing vecrorizes as: Because of the design of the vector register unit, the V-box can concurrently execute a vector add- VLDQ ox , K8 , vo ;Load vector OX VLDQ/M DY , K8, V2 ;Load ve ctor DY ;wi t h mod ify inten t VSMULD DA , VO , V1 ;V1 = DA*DX Table 3 VAX 9000 Model 21 0 Preliminary VVADDD V1 , V2 , V3 Performance Double-precision ;V3. V1 .DY VSTQ V3 , MFLOPS, Uniprocessor DY ' K8 ;Store vec tor DY The first two VLDQ instructions do not overlap in Size Vector the fo llow ing MERGE example, Peak rate NA 125 = LFK (Geometric mean) 441 13.2 Do i 1, 64 LFK (Arithmetic average) 441 20.6 a( i) = b(l) - c( i) if (a(i) . t. 0) then LINPA C K 10002 80 g b(i) = a( i) FFT 4096 26 else Convolution 150 X 1500 99.15 b( i) = di) Matrix multiply 642 111.3 6 endif enddo

Digital Te cbllicaljournal Vol. 1 No. -1 Fa ll 1990 77 VAX 9000 Series

vectorizes as: vecwrizes as:

VLDQ b, #8 , vo ;Load vector b VLDQ a, #8 , VO ;Load vector a J\ VLDQ c, #8 , V1 ; Load vector c VSEQLD # XO ,VO ; Test a( •) for zero and VVSUBD VO , V1 , V2 ;b-e ; set mask. CVSCMP pseudo­ VSTQ V2 , a, #8 ;Store vec tor a ;op doing Equal test> 1\ VSLSSD # XO , V2 ; Test a ( • ) and set rnas k IOTA #8 , V1 ;Make compressed ;in VMR .

VLDQ a, #8, VO ; Load vee tor a 1\ VSLSSD # XO , VO ; Test a( • ) and set mas k Summary ; in VMR .

78 Vo l. 2 No. 4 Fa l/1990 Digital Te cbntcaljournal Ve ctor Processing on the VA X 9000 System

5. References CRA Y- 2 Compute-r Sy stem Fu nctional Descrip­ tion (Cray Research, Inc, 1985 ). 1. Russell, "The CRAY-1 Computer System," 6. 1978): W. Buchholz, "The IBM System/370 Vector Archi­ ACM Pro ceedings, vol. 21, no. 1 (January 25, 1 63-72. tecture," IBM Sy ste-ms jo urnal, vol. no. (1986): 51 -62 . 2. VA X Vector Processing Ha ndbook (Maynard: "VAX 9000 Digital Equipment Corporation, Order No. 7. D. Marshall and ]. McElroy, Pack­ EC-H0419-46/89, 1989). aging-The Multichip Unit," Proceedings of '90 (!EEE, Spring 1990). 3. COMPCON R. Brunner, VA X Architecture Reference Ma nual 8. (Bedford: Digital Press, Order No. EY -F576 E-DP, M. Adiletta et al., "Semiconductor Te chnology VAX 1990). in a High-performance System," Digital Te chnical jo urnal, vol. 2, no. 4 (Fall 1990, this 4. D. Fenwick et al., "A VlSI Implementation of issue): 43-60. the VAX Vector Architecture," Proceedings of '90 (IEEE, 1990). COMPCON Spring

Digital Te cbntcaljournal Vo l. No. 4 Fa ll 1990 2 79 Peter B. Dunbeck Richard]. Dischler ja mes B. McElroy Frank]. Swiatowiec

HD SC and Multichip Un it Design and Ma nufacture

The VAX 9000 system effectively integrates state-ofthe-art packaging and inter­ connects with advanced integratedcircuits to achieve a short machine cycle time (16nano seconds) and a high rate of instruction execution. To meet highjrequency electrical signal and pin count requirements fo r the system, engineers chose tape automated bonding technologyand consequently conceived and develop ed the high­ density signal carrier (HDSC). Tbe HDSC offe rs densities three to fi ve times greater than conventional printed circuit boards. Th is unique technology ismanufactured using semiconductor and advanced tecbniques. The HDSC is at tbe heart of the multich zp unit, a bigh-performance logic module, with wbicb tbe VAX 9000 CPUs and system control unit are constructed.

Over the past decade, advances in the performance 16 nanoseconds (ns), allows the system to operate at of integrated circuits (ICs) have outpaced advances 30 VA X units of performance (VUPs). in packaging and interconnect technologies. Thus a To transmit electrical signals quickly between high-performance mainframe with conventionally chips, wiring paths must have controlled ratios of packaged bipolar integrated circuits would experi­ wire size to distance from voltage planes. These ence interconnect delays that account for more impedance-controlled paths allow radio-frequency than 50 percent of the system cycle time. Key to computer signals to propagate with minimal dis­ optimizing high-end mainframe performance, then, tortion. Prevention of noise on the signals is is the effective integration of state-of-the-art pack­ paramount and many details of the physical imple­ aging and interconnects with advanced integrated mentation, including spacings between wires, are circuits. The high-density signal carrier (HDSC) and critical to ensuring signal integrity. the multichip unit (MCU) are proprietary tech­ To meet the cycle time goal, high-frequency elec­ nologies that shrink interconnect paths and thus trical signal concerns needed to be considered in reduce the distance and electrical loading of signals the design, concerns that would have been negligi­ between chips. These technologies use conven­ ble for slower speed signals. Due to the physics of tional semiconductor and printed circuit board electrical fields, as electrical signals switch at high (PCB) equipment in many areas of manufacturing frequencies, they succeed in holding their shape to improve reliability at a competitive cost. The (data) only if they are fed power extremely quickly, result is shorter machine cycle time and higher and if they are given short paths of uniform proper­ instruction execution rate. The VAX 9000 CPUs and ties on which to travel. Due to the amount of power system control unit (SCU) are constructed entirely and the short amoum of time a signal is given tO of multichip units on large planar modules. The SCU arrive on chip, conventional chip carrier packages is composed of arrays of 6 multichip units, and the were disallowed for the VAX 9000 system. The sig­ CPUs are composed of arrays of 16. nal paths had to be very short to be virtually noise­ less. To achieve this objective, engineers decided to Multicbip Un it Design Goals enhance tape automated bonding (TAB) technology Beginning at the concept level and throughout the with a ground plane for electrical control of the development and test phase, signal integrity con­ wire impedances (paths). This reduction in chip siderations guided the development of the HDSC package size also allowed all of the chips fo r the sys­ and the multichip unit. Designers had to ensure tem to be packaged into a tight area. Consequently, that the fast signals would not be disturbed by to fit wires between chips, extremely dense HDSC noise. The cycle time goal for the VAX 9000 system, technology was conceived and developed.

80 Vo l. 2 No. 4 Fa ll 1990 Digital Te cbnicaljou,-nal HDSC and Multich ip Un it Design and Ma nufacture

The multichip unit also required careful thermal All integrated circuits in the multichip unit are design attention because each chip consumes up to attached to the HDSC by a tape automated bonding 30 watts. Moreover, most multichip units contain (TAB) process. The VAX 9000 system uses four types four to eight of these chips plus self-timed RAMs of ch ips, all of which have emitter coupled logic (STRAMs). The key to success for the VAX 9000 (ECL): gate arrays, custom chips, and two types of program was balancing the trade-offs between per­ STRAMs. At each chip site, a cutout in the HDSC formance requirements and technology develop­ allows the chip to directly attach to the baseplate. ment risks. The signals on and off the multichip unit are carried To meet the electrical and density requirements by four signal flex connectors which attach to the for the machine, engineers specified the following perimeter of the HOSC. The signal flex connector for the multichip unit: provides a separable interface to the planar board 1. Series-terminated output drivers were required and extends the controlled-impedance electrical on chip. Therefore, external resistOrs are not environment of the HOSC. Power is brought needed on the multichip units or programmed through two power connectors attached to oppo­ into the design elsewhere. These external resis­ site sides of the HDSC. The signal flexes, the power tors take up space and lower reliability. connectors, and the baseplate are attached to the multichip unit housing. The housing provides the 2. TA B was specified for manufacturing reasons. structure for the multichip unit and holds the com­ Short TAB tape was required to reduce switching ponents needed to position and wipe the signal noise on chips. Noise would have been generated flex. The chips and HDSC surface are covered by a if the TAB wires were longer. In the case of the plastic lid. noisiest chips, a ground plane was added to the The high-powered chips are efficiently cooled by tape to reduce noise. a short conductive path through the back of each

3 HDSC etch had to be two routing layers of IS­ chip. The thermal power is conducted from the micron by 9-micron wires on 75-micron centers chip to the baseplate and into a pin fin heat sink to meet the density, resistivity, crosstalk, imped­ over which air is impinged to remove the heat. ance and other goals. The follow ing sections describe the implementa­ tion of the technology. 4. Four power planes, each one powered from two sides, were required to distribute three voltage 1be HDSC Design and rails with acceptably high conductivity. Manufacturing Process

S. Thin dielectric separates the power planes and The goal for the HOSC project was to produce a produces high capacitance which fil ters noise high-density, high-performance, manufacturable and improves performance. This capacitance printed circuit board. This goal was achieved. The eliminates the need fo r discrete parts which con­ density of the HDSC is three to five times greater sume valuable space and lower reliability. than that of conventional printed circuit boards. 6. Impedance control of the connectors on the Even at this density, the HDSC maintains the signal multichip unit was needed to prevent signal dis­ integrity of bipolar integrated circuits with edge turbance. Rules were generated fo r the number speeds of 200 picoseconds. This section describes of ground pins. how the manufacture of the HOSC pushes the limits of printed circuit board and semiconductor equip­ The heart of the multichip unit is the H DSC. The ment into new types of applications. We also HDSC is an imerconnect technology consisting of address the integration of computer-aided (CAD) nine metal layers separated by polyimide dielectric tools, process controls, and test feedback, which and mounted on a copper baseplate. The top metal helped us to achieve the results we sought. layer is a pad layer used to solder-attach all of the integrated circuits and connectors. The four metal HDSC Te chnology layers below make up the signal core. The signal As noted earlier, the HOSC has nine copper layers core is a controlled-impedance, dual buried strip­ for power and signal distribution. The insulating line interconnect system used to wire all integrated material, polyimide, has a low dielectric constant circuits to each other and to the connectors. The of 3.5 as compared with oxide or nitrides used in power is brought from the perimeter of the HOSC to integrated circuits or as compared with ceramic, the integrated circuits through the bottom four which is used for hybrid circuits. The interconnect metal layers. is laminated to a copper baseplate to provide

1\i; Fall Digital 1ecbnicaljournal Vol.2 4 /'J'JI) 81 VAX 9000 Series

mechanical structure as well as attachment of the CORE PROCESS FLOW multichip unit heat sink. r------I The conducting layers consist of the following: 1 SIGNAL CORE POWER CORE

• Two layers for signal distribution I: : � • � • METAL LAYERS METAL LAYERS • Two layers that serve as signal reference planes I 4 4 • POL YIMIDE LAYERS • POL YIMIDE LAYERS I I 5 5 • COPPER LINES ETCHED • WHOLE PLANES I • Four layers for power distribution I • VIAS I I • One layer with bonding pads to attach the TA B I I and connectors TEST* TEST*

The signal distribution is a single x-y pair that ______: I I 1 ______I ___I j : uses the reference planes to create a dual strip­ l line interconnect. This interconnect provides a ASSEMBLY PROCESS FLOW controlled-impedance signal path with minimal ------, crosstalk. Ta ble 1 lists the electrical and physical I • SUBSTRATE REMOVAL I • LAMINATION design parameters of the HDSC . I • DRILL I I • LASER CUTTING I I • PAD LAYER Process Overview • BASE PLATE I---�--J I The HDSC is manufactured by two types of pro­ cesses: core processing and assembly processing. Figure 1 is a diagram of the HDSC process flow. The core process, described funher below, uses semiconductor manufacturing equipment and is similar to the manufacturing process for the back TO MCU end of an integrated circuit. Two cores are manu­ factured: a signal core for strip-line signal inter­ connect, and a power core for the four planes Figure 1 HDSC Core and Assembly Process Flo w (or layers) that distribute power throughouc the finished HDSC . Core Processing The process for the manufacture The second process, assembly, uses advanced of the signal and power cores, or the core process, printed circuit board techniques to laminate and consists of alternating between copper deposition interconnect the signal core and power core. The and polyimide coating until the completed inter­ completed HDSC has solder pads to accept the outer connect layers are built on the metal wafer. The pro­ lead bond of TA B integrated circuits, signal flex, and cess is performed on a metal substrate shaped like a power t1ex. The HDSC is tested with a custom flying 6-inch semiconductor wafer. Copper layers are probe tester. Tests are made to ensure the HDSC is deposited by a combination of sputtering and plat­ functional and meets electrical parameters. ing techniques. Patterns in the copper that become signal traces are generated by a semiconductor phorolithographic technique. First, a photoresist is Table 1 HDSC Physical and Electrical Design Parameters applied to the metal wafer. The resist is then exposed to the pattern in the mask that is held by Line pitch 75 microns the semiconductor wafer aligner. This pattern is Line width 18 microns then developed in the resist and etched into the Line thickness 10 microns copper. The remaining copper thickness is then

Dielectric thickness 25 microns added by plating. Another resist pattern is devel­ oped over the plated signal traces to define where Dielectric constant 3.5 Line impedance 60 ohms a copper connection between interconnect layers Line resistance 1/0 ohm/centimeter will occur. This connection is called a via post, and it is also formed by a plating process. Crossover capacitance femtofarads 3.6 Polyimide is spun on to the wafers by integrated Crosstalk percent maximum 5.1 circuit photoresist spin tracks. The relatively thick Propagation delay 66 picoseconds/ polyimide (25 microns at signal layers) helps to centimeter planarize the surface of the wafers and also to cover

Fall 82 Vo l. 2 No. 4 I

the patterned copper lines and copper posts. Semi­ engineers estimated the expected manufacturing conductor photolithography equipment is also distribution of critical signal core dimensions. Once used to generate patterns in resist through which HDSCs were manufactured, actual processing dis­ holes (extensions of the via post) are plasma-etched tributions were fed into the model which predicted in the polyimide. These vias are filled at the next yield against the specification. Based on this, the copper deposition to create a connection between processes were adjusted to maximize yield. Models patterned layers. and electrical results were verified by time domain Both signal cores and power cores are electrically reflectance (TDR) and resistance-inductance-capaci­ tested to ensure electrical functionality. tance (RLC) measurements. The high-frequency measurements necessary to Assembly Processing To complete an HDSC, a characterize the interconnect were extremely sensi­ signal core is matched to a power core. The metal tive to probe card inductance and capacitance and wafer that acted as a substrate is then removed, and probe contact resistance. Custom test fixtures were the signal and power cores are laminated together. designed to perform the measurements. Connections from the power core to the signal core High-frequency test measurements in the pro­ are made by drilling and then plating up through duction environment are not practical, but we the drilled holes. The plating and etching processes determined that resistance and capacitance testing used to form the plated-through holes also produce could be used instead to verify the HDSC signal copper pads on the bonding layer. Solder is integrity. A production flying probe tester was screened and reflowed onto the bonding pads. Die developed to test HDSCs. Once again, probe para­ site holes which provide openings for bonding the sitics were large compared to the type of measure­ chips to the baseplate are cut through the laminated ments necessary. Custom probe design, calibration signal and power cores (HDSC decal). A laser cuts methods, and software to drive the tester again out the die sites and trims the HDSC decal to its were necessary. From HDSC graphic design files, final size. The assembly is complete when the inter­ test capacitance Limits for every signal net are gener­ connect is laminated to a baseplate which provides ated, which ensures electrical fu nctionaLity as we ll mechanical structure. as signal integrity. In addition, the resistance of every plated-through hole is measured, the integrity HDSC Te st Process of power planes verified, and resistance and leakage The goal of the HDSC test process was to ensure measurements performed on test structures within that the physical technology met the VAX 9000 sig­ the HDSC. nal integrity requirements, discussed earlier in this paper. Equally important was to ensure that the CA D The VAX 9000 system design includes net technology was manufacturable and verifiable lists and graphical files of the HDSC masking layers. (measurable). Engineers had to accurately convert To this data is added process control monitors, design information to masks and to verify HDSC alignment marks, and pattern modifications electrical results by testing. We therefore developed required to meet the process design rules. After modeling and measurement techniques to establish modifications, all data is verified by software that physical and electrical design rules; implemented checks design rulesand electrical rules. CAD tools to verify these design rules; developed The data is used in a variety of applications. First software to generate test vectors from the CAD data­ it is converted to pattern-generation format so that base; and also developed production HDSC testers. from this data masks can be written and an inspec­ tion file can be generated to verify the mask-making Modeling and Measurement Te chniques To deter­ process. The information is also used to drive mine what trade-offs wouLd be required between numerically controlled equipment, such as the the VA X 9000 signal integrity and the HDSC physical drills and lasers that perform die site cut outs. manufacturing capabilities, engineers needed both Finally, it is used to create a net-capacitance test file modeling and emp irical measurement techniques. which drives the HDSC production testers. A software tool based on Monte Carlo Analysis was developed that could drive a three-dimensional MCU Design and Manufacturing capacitance model. This tool predicts electrical The multichip unit (MCU) takes full advantage of parameter sensitivity to different physical pro­ the integrated circuit and HDSC technologies to pro­ cessing variations. Early in the project, processing duce a high-performance logic module. The major

DiRital Te chnicaljo urnal Vn l. 2 Nn. 4 Fa ll /

components of the multichip unit are shown in arrays (up to 8), the STRAMs (9 replace I gate array, Figure 2. The components. their functions. and the 24 replace 3 gate arrays), ancl the HDSC . assembly and test process are discussed in the next two sections. Table 2 summarizes the multichip unit TAB Sem iconductors specifications. All semiconductors in the multichip units are inte­ All units have certain features that are fixed, grated circuits. Discrete devices and passives which regardless of logic design. These include the clock consume more space and display lower reliability distribution chip, serialization pattern, signal con­ are not used. TAB is a chip-co-substrate interconnect nector, power connector, housing and heat sink. made of layers of copper and polyimide film. The The VA X 9000 system uses 20 unique logic design copper signal lines are patterned ro mate with gold implementations, or options. The multichip unit bumps on the IC: perimeter and with solder-plated features that make an option unique are the gate pads on the HDSC.

MCU LID

SIGNAL FLEX CONNECTOR

ELASTOMER TRAY SUBASSEMBLY

POWER CONNECTOR

RETAINER SPRINGS � IL

HOUSING

ELASTOMER SUPPORT BEAM

PIN FIN HEAT SINK

Figure 2 Exp loded View of an MCU

84 Vo l. 2 No. ·i Fa /1 19')0 Digital Te chnicaljo urnal HDSC and Multicbip Un it Design and Ma nufacture

Table 2 Summaryof MCU Specifications The clock distribution custom chip (CDxx) and the STRAM use single-metal-layer tape with poly­ Maximum power 270 watts (air cooled) imide die.l.ectric. The CDxx has 252 total leads dissipation with 84 power and ground leads. The COx:<. has the Maximum IC junction degrees Celsius 85 same pin pitch as the gate array. The VAX 9000 temperature @ 25 degrees Celsius room temperature system uses two sizes of STRAMs (IK by 4 bits and 4K by 4 bits) that have TA B tapes of diffe rent sizes. Maximum number of 72 VLSI chips The STRAJ'vls also use a single-metal-layer tape with 48 leads. The minimum ILB pitch is 250 microns Minimum chip lead 200 microns pitch and the minimum OLB pitch is 450 microns. Single­ Size metal-layer tape was selected for these devices Plane 14.12 x 13.21 centimeters because it was less expensive than two-metal-layer, Height 5.44 centimeters and two-metal-layer tape was not needed because of Minimum pitch on 1 4.38 x 1 3.46 centimeters the shortened lead lengths on the STRAMs. Single­ planar module metal-layer tape was acceptable fo r the CDxx chip Weight 1 .59 kilograms because all the outputs are diffe rential and syn­ (with heat sink) chronous. Noise cancellation was guaranteed. Clock input frequency 320 to 580 megahertz All devices that use TAB are shipped in a 35-milli­ Signal 1/0 per MCU 800 meter slide carrier. The devices are encapsulated in Signal rise time 600 picoseconds epoxy to minimize infiltration of moisture or corro­ Voltage levels 2 plus ground sive ions and to reduce damage due to handling. Maximum current 40 amperes per voltage The back sides of the chips are bare silicon because level

INTEGRAT ED The TA B for the gate array and most of the custom CIRCUIT chips is a two-metal-layer tape with 360 leads. A cross section of the gate array TAB is shown in 200 MICRON Figure 3. The dielectric layers are polyimide fi lm. PITCH One metal layer contains the etch lines used fo r

both power and signal 1/0 and the leads to bond to the chip and The other layer is a reference INNER LEAD HDSC:. BOND plane to establish controlled impedance and to minimize the inductance of the power and ground TAB paths. The reference plane is connectec.J to the ground leads by vias etched through the Kapton HIGH DENSITY and plated up. The gate array power is brought in SIGNAL CARRIER through 104 power (two voltage levels) and ground leads. All of the signal leads are 60 ohms control led impedance. The leads are 35 microns thick and cop­ per coated with about 0. 5 micron of electroless tin. ENCAPSULATION As shown in Figure 3, the TA B has an inner lead bond (I LB) pitch of 100 microns and an outer lead bond (OLB) pitch of 200 microns. The total span of OUTER LEAD BOND the TA B when mounted is 2.4 centimeters. The leads are fo rmed near the OLB to provide strain relief for the !LB and to protect the OLB from thermal stress. ·ro minimize propagation time and noise, the length of the signal lines h:.�s been minimized by keeping INTEGRATED CIRCUIT the OLB pitch to the minimum compatible with manufacturing processes. The bond at the IC (at the

ILB) is a gold-tin eutectic fo rmed by gang thermal Figure 3 Isometric of a Gate A n-ay compression bonding. Showing Fea tures of tbe TAB

Fall Digital Te cbnicaljournal Vo l. .! ,\(>. 4 1<)<)0 85 \J\X9000 Series

no plating is required for epoxy die attach. The into a plastic housing. The connector plugs into flat epoxy die auach is filled with microscopic particles blades on the bus bar of the planar module assem­ to enhance the thermal conductivity while main­ bly. The decoupling capacitors on the power flex taining electrical isolation bet ween chips. circuit fi lter the medium-frequency switching noise

on the MCU and the MCU power bus. Signal Flex Co nnector The signal flex connector is a high-density, con­ ThermalDesi gn trolled-impedance connector used to transmit sig­ The multichip unit was designed from conception nals between the HOSes :md the planar module. to provide an efficient cooling path fo r the inte­ Each multichip unit has four flex connectors with grated circuits. Figure 5 shows a cross section of the a combined signal I/O of 800 in an area less than 40 square centimetcrs. Figure 4 shows a cross sec­ PLANAR MODULE tion of one signal flex connector. The body of the connector is a two-metal-layer flex print with 50- and 60-ohm signal lines. The ground plane in the flex circuit is used as an AC return path. No power is carried through the signal flex. The signal plane contains 200 etch lines with a raised gold bump on each at the planar module interface. The connec­ tion to the HDSC is a solder bond similar to the sol­ der bonds for the TAB devin:. A window is opened through the polyimidc to al low the formation of cantilevered, exposed, solder-plated leads. The raised bump on the flex circuit concentrates the contact force into a small area. The bump is solid copper that is plated over with nickel and hard gold. The force on the bump is generated by com­ pressing a molded silicone rubber elastomer. The compression of the connector causes the tk x frame to engage a cam on the housing and wipe the.: contacts across the planar module pads. The con­ SIGNAL FLEX ELASTOMER FLEX CIRCUIT nector is compressed, nominally, 1.27 mm and CIRCUIT BUMP wipes 0.46 mm . The bottom of the elastomer mates with a tray which has a contoured surface to vary the compression along the length of the elastomer. This contoured surface improves the uniformity of ELASTOMER the force that the humps exert on their pads. The ELASTOMER TRAY connector has been designed to gem:rate 100 grams minimum load on all bumps. The wipe action and the bump force of the connector minimize the effect of dust and environmental fi lms on the.:mat­ ing surfaces.

Power Co nnector

The power consumed by the multichip unit IS brought in through two power connectors mounted on opposite sides of the I!DSC . The connector is composed of a flex circuit, a connector, and decou­ pling capacitors. The flex circuit is solder honded to large pads on the IIDSC surface. The flex has three copper conductive planes separated by polyimide dielectric. The connector has st::�mpcd metal con­ Fig ure ,j Signal Flex Connector with tacts soldered into the llcx circuit and assembled Detail of Bump

Vo l. 2 No. ·1 Fa ll /'J')O Digital Te cbnicaljountal HDSC and MultichtP Unit Design and Ma nufacture

multichip unit. The heat dissipated by the chips is conducted through the silicon and the die attach into the baseplate. As mentioned above, the die attach is an epoxy heavily filled with microscopic BASE �--: PLATE diamond particles to increase thermal conductivity. '" - · · The heat spreads out in the copper alloy baseplate r--'------:'------'-, PIN FIN HEAT SINK and is conducted across a dry interface to an alu­ minum base of the pin fi n heat sink. The heat sink has 600 aluminum pins, each 0.20 centimeters in diameter, pressed into the base. Air plenums in the cabinets direct at least 14 .6 liters per second of air into each multichip unit heat sink. The thermal resistance for a 30-warr gate array is less than 2.0 watts per degree Celsius which gives a junction temperature of 85 degrees Celsius with room air at 25 degrees Celsius. This low junction temperature is a critical part of the high reliability of the multi­ chip unit.

Figure 5 Thermal Pa th Clock Distribution The system clock on the VA X 9000 system is distributed to each of the multichip unit clock TAB andFlex Circuit Bonding distribution chips (CDxx). The CDxx generates 40 The insertion and soldering of leads is the most differential outputs which are routed through critical step in the multichip unit manufacturing equal-length etch to the other chips. The CDxx also process. Single-lead and multiple-lead gang bonding distributes and controls the scan lines that test the approaches were both considered. Gang retlow sol­ unit both in manufacturing and in the field. The dering is an effective way to achieve repeatable, reli­ scan lines also allow the unit serial number and revi­ able connections for both the TA B semiconductors sion status to be read by the system console. and the signal tlex circuits. Early development work on manual machines required operator action for Multicbip Un it Manufacturing lead forming, lead alignment, and gang bonding. Figure 6 shows the manufacturing process flow, Today, critical process parameters- time, pressure, which has three major work centers: temperature - are computer controlled to speci­ fied values, and the process uses tools to assist the • 54-class assembly and inspection operator in material movement and vision systems

• PlOOO assembly and inspection to improve alignment of leads. Before bonding, the leads are covered with a low activation flux which • Te st and diagnose is removed later in the process. In the 54-cla ss process, TAB semiconductor devices arc assembled to the HDSC substrate, result­ Die Attach ing in the subassembly known interm.lly as a 54- Another critical manufacturing step is the die attach class module. In the P 1000 process, connector and process. The excellent thermal performance of the housing components are assembled. At the last multichip unit is achieved by following these steps: major center, the test process, final units are tested and, if necessary, diagnosed. A shop floor control • Careful control of the die attach materials with system tracks the units through the line and pro­ feedback to our suppliers. vides critical component and process trace infor­ • Surface cleanliness specified and also managed mation. In addition, this control system is used to with our suppliers. monitor process parameters to ensure control of the line and consistent product quality. • Dispensing of epoxy. The filled epoxy is dis­ The following section provides insight into pensed by an x-y table that is computer con­ several of the process technologies we used to meet trolled to supply the correct pattern for the the manufacturing goals of the VA X 9000 system. particular multichip unit type.

Digital Te clm icaljournal Vu /. 2 No. 4 Fa ll /'J'JO 87 VAX 9000 Series

END OF 54 CLASS ASSEMBLY START OF P 1000 ASSEMBLY

ALIGN HDSC TO HOUSING SHIP

Figure 6 Manufacturing Process Flo w

• Establishment of bond line thickness and epoxy short removal or single-point bonding. Over time, cure. Bond line thickness is accomplished by we believe that our materials and processes can be mechan ically applying pressure while curing in controlled ro the point at which inspection and a purged belt furnace. repair can be dramatically reduced.

Inspection Final Te st

To ensure that all soldered leads are reliably The goal of our test rrocess was to ensure that bonded, leads must be inspected for shorts, mis­ multichip units wou ld operate successfully in a alignments, opens, and weak joints. Shorts and mis­ system environment. Since no test equipment alignments are discovered by an automated vision m:mufacturer offered a system that met our needs, system that ca lls marginal points to the operator's we developed our own by working with several attention. The operator can then determine if Digital groups as we ll as outside suppl iers. The repair action is warranted. Inspection for opens and system contains three major stations. The first weak joints is done by striking the leads with a pulse provides alignment information and can ;�lso read of laser energy and then measuring the thermal visual serial and part numbers. In the second sta­ decay profile. Repa ir is typically made by localized tion, low voltage shorts are determined between

H8 Vo l. 2 No. 4 Fall 1')')0 Digital Te cbnicaljournal HDSC and Multichip Unit Design and Manufacture

nearest neighbor leads. This step supplements our cess that begins with advanced development and inspection fo r shorts described above. In the final continues through volume manufacture. The HOSC station, we test fo r connectOr opens, thermal mea­ and multichip unit technologies have successfully surement (die attach integrity), scan chain integrity, achieved the volume manufacwring phase. Using and scan pattern data. The scan pattern testing is the products and technologies described done in several bursts of the clock at system speed . in this paper, we have played a key role in the intro­ In addition, diagnose capability is provided by fly­ duction of the VA X 9000 system to the marketplace. ing probes, voltage and clock margining, and a ther­ Extensions of this manufacturing process will mal chuck tovary temperature. ensure that this technology base can be applied across a wide spectrum of products of both higher Conclusion and lower performance. Successful use of advanced interconnect teclmolo­ gies requires a seamless phased development pro-

Fu ll 1'.)')11 Digilal 1i.•cbnicaljo urnal Vo l. .2 No. 4 H9 Matthew S. Goldman Paul H. Dormitzer Paul A. Leveille

Th e VAX 9000 Service Processor Un it

The VAX 9000 servicepro cessor unitprovi des thefr ont-end seruices needed to support a highly available and reliable mainframe system. The unit is close�y linked to the VAX 9000 system to provide realtime detection and recovery of system fa ilures. Ho wever, the unit is independent enough to be isolated fo r maintenance without affecting normal system processor op eration. Th is combination is a firstfo r VAX systems. The service processor also provides various debuggingfe atures that were essential fo r development and ear�)' manufa cture of the VAX 9000 system. Th ese fe atures utilize a system-wide scan architecture to achieve direct access to machine­ state, which provides extensive visibility and control of system logic fu nctions. The inclusion and use of such a scan architecture is a newfe aturefo r a Digital processor.

The VA X 9000 service processor unit (SPU) is • XMI-ro-system control unit adapter interface test designed w provide a dedicated subsystem for ser­ • Symptom-directed diagnosis support vice and maintenance support for the VAX 9000 family. The SPU serves two distinct roles. It func­ In addition to its use as the front-end processor tions as the familiar operator interface (i.e., VA X for the VAX 9000 system, the SPU wJs embedded console) and as a maintenance vehicle used lO diag­ in several manufacturing and engineering rest nose and isolate system processor hardware fa ults. vehicles. In the Debugging Features section of this The SPU performs the following major front-end pJper, we describe how the SPU was used as a services: debugging tool during VAX 9000 product devel­ opment and the various debugging features we • System initi:llization provide to help locate design and fabrication • Power system control and monitoring problems. A mJjor goal of the SPU WJS to perform system­ • Environmental monitoring wide error detection and recovery functions for the

• Clock control and monitoring VAX 9000 processor. In the Error Handling section of this paper, we detail the types of errors that the • VAX 9000 operating system access to SPU mass SPU handles arid how error detection, reporting, storage devices (disk and tape) and recovery occurs.

• Remote diagnosis port support Another of our design goals was to be able to service the SPU without adversely :�ffecting the • System error detection, recovery, and reponing operation of the system processor. This feature was needed to support the high availability require­ The SPU also provides or assists in the fo llowing ments of a mainframe system. To meet this goal, we system diagnosis functions: designed mechJnisms to enable the VA X 9000 oper­ • SPU module self-tests ating system to determine that the SPU is not func­ tionJl (whereupon the operating system takes the • Scan system diagnostics appropriate action to secure its own operation),

• Clock system diagnostics as well as recognize and reintegrate with the SPU when the SPll is functional again. • Scan pattern structural diagnostics If the VAX 9000 operating system Jttempts to • Structure cell (e.g., self-rimed random-access access one of the SI'U-based processor registers and memory [RAM]) diagnostics the SI'U does not respond, the fa ilure is detected by

90 Vo l. J No. -i Fa ll /')')0 Digital Te chnicaljo urnal The VAX 9000 Service Processor Unit

using the usual register time-out mechanism. How­ tests are performed. The SPU's operating system ever, because the SPU is responsible for system error then boots automatically and signals its availability handling, SPU failures must be detected quickly to to the VA X 9000 operating system. enable the SPU to respond to a system error should The SPU is designed to continue operation even one occur. Consequently, we developed a keep­ if the SPU primary storage device, an RD54 alive protocol with which the VAX 9000 operating Winchester disk drive, fails, which further increases system can determine SPU failures without relying the availability of the SPU. For customers who on operating system accesses to SPU-based pro­ require data security and high availability, we cessor registers. The keep-alive mechanism is designed a system configuration option that does described in more detail under the Error Handling not use a disk drive. In this case, the SPU boots from section of this paper. Both the time-out and keep­ TK50 cartridge tape. The SPU fu nctions that require alive mechanisms work regardless of whether the a disk drive for data storage (e.g., SPU-generated SPU has an unexpected failure or undergoes a sched­ error logs) are disabled in this configuration. uled power-down. Should the SPU require service, field upgrades SPU Architecture may be performed easily and quickly because of the A block diagram of the SPU architecture is shown in modularity of the hardware, which is primarily Figure 1. The service processor module, scan con­ VAXBI bus interface-based adapters. The VAXBI trol module, and power and environmental monitor backplane minimizes downtime because modules were designed uniquely for the VA X 9000 system. can be removed or inserted without requiring reca­ The disk controller, tape controller, as well as the bling. When power to the SPU is restored, SPU self- memory daughter board were available from other

DISK TAPE/NETWORK" 14------' CONTROLLER CONTROLLER * (11031 KFBTA) (11 034 DEBNK) Nl

VAX 81

SCAN SERVICE SPU MEMORY CONTROL POWER AND PROCESSOR 16 MBYTES TO/FROM MODULE ENVIRONMENTAL MODULE ECC REST OF PCS MONITOR (12051 SPM) (12050 SCM) (11 060 PEM) SPU OS FIRMWARE I FIRMWARE

POWER CONTROL SYSTEM SJI

PlY !-----'

"NI CONNECTION USED DURING SYSTEM PROCESSOR DEVELOPMENT ONLY

Figure 1 VAX 9000 SPU Block Diagram and interconnects

DiRilttl Te cbnicaljournal VtJ/. 2 No 4 Fa/1 /'J'J() 91 VAX 9000 Series

Digital products. Every SPlJ VA XBI adapter provides interface, the system processor may also interrupt 1 its own built-in self-test diagnostics. the SPU when the processor needs service. This SPU hardware is based on either industry-proven type of interrupt request is known as an attention. (e.g., 74 00-series TTL components, complementary The SPU is integrated into the system cabinet to metal oxide semiconductor [CMOS] gate arrays) better meet the performance requirements neces­ or Digital-proven technology (e.g. , VAXBI, Digital sary for system error recovery and VAX 9000 oper­ custOm CMOS devices) to ensure that the unit is an ating system boot. Cabinet integration substantially effective debugging platform for a system processor decreases interconnect distances to processor logic based on leading edge technology. As a result, the and ensures that all cables are kept internal to the inherent risk and learning curve associated with cabinet. Another reason for choosing the VA XBI new technology were avoided and the SPU was backplane card cage is that its form factor is small, ready and available during the VA X 9000 system which reduces the cabinet area needed (cabinet area protOtype debugging process. is always in high demand), yet the user-definable The SI'U also was made available to manufactur­ zones provide the high pin density required for ing process and tester groups (e.g., multichip unit interconnects (i.e. , 180 110 pins per VAXBI slot). tester) fo r use with their designs. The advantages to this approach were that technicians became fam il­ Communication Pa th iar with the same subsystem that would be used in The SPU communicates with the system processor the VAX 9000 family, and the test programs could using the SJ I. This interface is used to load the pri­ be transferred for use in other test envi ronments mary bootstrap into the VAX 9000 main memory, that also used the SPU, including the VA X 9000 sys­ transfer error and machine-check information to tem itself. the VA X 9000 operating system, provide file trans­ The service processor module is the primary fer access between the VAX 9000 operating system processing element of the SPU and is the VAXBI host and the SPU's RD54 disk drive, access system main adapter. Based on the MicroVAX 78032 chip and memory, and access system i/O registers. several custom-designed application-specific inte­ The VA X 9000 operating system accesses the SPU grated circuits (e.g., SPll-to-system control unit as if it were a standard J/0 device. The SPU is an adapter, SPU ), the module con­ independent subsystem and does not rely on the tains all the hardware necessary to store and execution unit of the system processor to be a con­ execute the SPU operating system . The on-board sole processing engine, as was done in previous firmware contains a VA X standard console interface VA X systems. There are several benefits to this to load the SPU operating system during initializa­ design approach. Each CPU has equal access to the tion and to assist in subsystem debugging. The SPU­ SPU and may interrupt the SPU to request service. to-system control unit interface (SJI) connects the In addition, the SPU may interrupt any of the CPUs service processor module to the system control unit to request an operating system service. The SPU and is the primary communication path between may be used as a debugging tool during system pro­ the SPU and the VAX 9000 operating system. cessor debugging because it does not require that The scan control module is the control interface any portion of the system processor be operational. to the VA X 9000 scan system, which is the visibility The fact that the SPU could be used as a debugging and maintenance path to the system processor. Like tool was an extremely important benefit for the the service processor module, the scan control VA X 9000 system debugging effort. The debugger module is based on the MicroVAX 780)2 chip ami did not have easy access to the logic elements s<:veral custom-designed application-specific inte­ because of the advanced packaging and circuit inte­ grated circuits (e.g., scan control chip, scan gration of the VAX 9000 system. Therefore, SPU ser­ distribution chip). On-board firmware provides vices were utilized in lieu of logic probes. Further, high-level fu nctions that allow the service processor because the SPU no longer uses the CPU for system module to continue processing while scan-related access, console support microcode (i.e., the collec­ operations, including logical-to-physical signal tion of microcode procedures traditionally used for translations, are performed concurrently by the access to the system processor, memory, and J/0 scan control module. The scan interconnect (SCI) registers) is not required . The benefit of this process connects the scan control module to the system is that valuable VAX 9000 control store space could processor (i.e., one to fo ur C:PUs and the system con­ be used for system microcode or to reduce the con­ trol unit) and the master clock mod ule. Using this trol store size. For example, in the VAX 8650 system ,

92 Vo l. J No. 4 Fa ll /<)<)() Digital Te cbnicaljournal The VAX<)000 Sen lice Processo r Unit

console support microcode occupies approxi­ Table 1 RXFCT/RXPRM and TXFCT/TXPRM mately 180 microword locations. Functions VAX 9000 operating system access to the SPU is RXFCT/RXPRM Functions through the VA X console register set. We extended (SPU to System Processor) the VA X console register set to provide access to the enhanced capabilities of the SPU. Additional regis­ Remove processor ters include transmit function request and param­ Add processor Mark memory page bad eter and receive function request and parameter (i.e., TXFCT, TXPRM, !L.'<.FCT, RX PRM). Ta ble l lists Request pages of memory the functions provided by these registers. Send error log entry SJ I communications are in the form of 14-byte Send OPCOM message packers that contain the command (i.e., function), Get datagram buffer address, and data. Packets are sent and received Send datagram over two 8-bit data paths that provide fu ll duplex Return datagram status operation. Data transfers peak at 3. 5 megabytes Set keep-alive state (MB) per second for quadword transfers. Abort datal ink VAX When the 9000 operating system executes a Error interrupt Move_ro/from_ Processor _Register instruction that specifics an SPU register, the system control unit TXFCT/TXPRM Functions (System Processor to SPU) sends an I/O command p::tcket, through the SJ I, to SPU the SPl! to initiate the system request. Then the Get hardware context (of a halted CPU) typically uses an interrupt command packet, which Virtual block file operation generates an interrupt to the specified CPU. The (access to SPU disk and tape) two other packet types are Keep-alive and error correction code. Send datagram Return datagram status Visibility Pa th Switch primary In the development and manufacture of a com­ Reboot system request plex computer system, extensive testing methods Clear warm start flag must be available to ensure functional operation Clear cold start flag and product quality. Design engineering no longer Boot secondary processor can use manu::tl probing techniques in prototype Halt CPU and remove from available set debugging. Space limitations have resulted from Halt CPU and keep in available set advanced packaging and the close pitch of inte­ Console quiet grated circuit !IO pins, which is due to high integra­ tion lewis. Failure isolation must be performed in Set interrupt mode the manufacturing process, often without an exten­ Abort datal ink sive knowledge of the machine design. Reset 1/0 system A separate visibility and control path in the sys­ Disable vector unit tem processor of the VAX 9000 system provides Set keep-alive state nearly 100 percent visibility to the machine-state. Start processor The visibility path eliminates the need to select a Margin power subset of visibility points to meet all test needs, as Margin clock was done with previous VA X systems. In addition, Fault signal the path allows designers to directly alter the entire Start error window machine-state, which is a major advantage for End error window VAX design and process debugging. A 9000 uni­ Report error in window processor (i.e., one CPU and system control unit) Get error log entry contains over 26,000 access points. Get unmarked error log entry and mark The path is called the VA X 9000 scan system and Enable halt restart is controlled by the service control module. The Get 1/0 physical address memory map configuration scan system is the foundation for direct access Get physical address memory map configuration by prototype debuggers, system error recovery

Vo l. /'J'J() Digital Te cb nicaljournal 1 No. -i Fu ll 9:) VAX 9000 Series

software, and diagnostics to observe and alter the by issuing scan clocks, which serially shift system VAX 9000 machine-state. Some functions provided data to the scan control module. System state is by the scan control module and supporting SPU changed when the scan control module drives new software are data to the system latches while issuing scan clocks. An architectural feature permits each multichip • Load and save processor state unit to generate an attention interrupt directly to

• Scan pattern execution the scan control module over the scan data return line. Attentions notify the SPU of system events, • Continuity testing of the processor's scan such as processor errors, memory self-test comple­ hardware tion, CPU halts, and keep-alive responses.

• Multichip unit type and revision information System diagnostics can diagnose the SCI by using extraction the same control signals as used for scan system operation. Dedicated logic and special routing of Processor attention notification • the scan lines provide failure isolation. Stuck-at A block diagram of the VA X 9000 scan system faults and disconnect conditions can be isolated to is shown in Figure 2. The scan control module the multichip unit. connects to the system planar module over the SCI. Scan and clock distribution logic, contained in a Debugging Features macrocell array on the pl:mar module, distributes In addition to its use as the VAX 9000 front-end data and control signals over the scan bus to each of processor, the SPU provides a variety of features the multichip units. A clock distribution chip at the for debugging and troubleshooting multichip unit hub of each multichip unit further distributes the logic configurations. These features were required scan bus signals to the macrocell arrays, which are because all multichip unit logic visibility and con­ integrated circuits that contain system logic. trol is handled through the SCI, which connects As shown in Figure 3, the state devices within a directly to the SPU. The use of scan larches to access macrocell array are scan latches. The latches are internal logic states is a first for VAX systems and connected serially to form a ring or chain by con­ challenged the designers to define and deliver the necting the Scan_Data_Out line of each latch to the necessary tools and features to assist the multichip Scan_Data_ln line of the next latch. The end links unit debugging effort. Furthermore, the features are connected to the clock distribution chip. When provided by the SPU had to apply tO various tester the system clocks are running, data is loaded into environments, ranging from single multichip units the latch from the system data input. During scan mounted in probe stations to full system config­ operation, system clocks are not active. Generated urations. Additional requirements to support the by the scan control module, the scan clocks load the clock and power system test stations made it clear latch with data from the scan data input. Conse­ that the SPU would have to be adaptable to a variety quently, the scan control module reads system state of environments.

SERVICE PLANAR PROCESSOR MODULE

SCI

SCAN SCAN CONTROL scD· DATA MODULE RETURN

MCUn MCUO

SCAN DATA IN AND CONTROL

'SCD - SCAN AND CLOCK DISTRIBUTION LOGIC

Figure 2 VAX 9000 Scan System

94 Vo l l No. 4 Fa llf

Generic SPU Environment before full system configuration support was To satisfy the SPU's fundamental requirement of needed. In particular, the command language being adaptive to diffe ring environments, test sta­ interpreter had ro be ready to provide the basic tions had to comply with the physical interfaces SPU functions, such as file manipulation, com­ provided by the SPU in the VA X 9000 system. We did mand procedures, symbol and expression evalu­ not have the resources to develop tester-specific ation, and command recall. interfaces, so it was agreed by all development • Te chnicians working at the test stations had an groups that testers would 's comply with the SPU opportunity to develop an understanding of the system environment, as shown in Figure 4. SPU's operation that was then carried forward to This generic environment allowed SPU hardware other SPU-based debugging environments. :md software development to concentrate on sup­ the needs of the VA X 9000 system and to • Economies of scale existed because one front­ receive valuable feedback and debug time from end development effort supported both in-house rhe test station groups prior to system processor test stations and the final product. availability. Several major benefits were achieved • Many of rhe primitive debugging features devel­ with this approach: oped for rester use were found to be just as • Because the SPU software had early exposure valuable during actual system debugging, partic­ in rhe rest srarion environments, the software ularly the fundamental commands that allow was debugged and tested to an acceptable level direct control of the SCl signals.

SYSTEM SCAN DATA IN D LATCH .._ DO SCAN DATA IN SOl SYSTEM_DATA - 01 0

HOLD -c HOLD SYSTEM OATA OUT PUTS SCAN_DATA_IN - SOl SYSTEM DATA IN SYSTEM LATCH SCAN_A_CLK - SCK DATA D c..,-. 0 OUT - P-J-..

SYSTEM A CLK(-{ LD - LD SYSTEM_B_C LK --c OR '- SCAN_B_CLK SCAN DATA_OUT SYSTEM D DATA IN (a) Macrocell SOl Os

SCAN SCAN DATA LATCH OUT

(L) l:xample of ScanRing Routing

0 HOLD 8

(b)Re gister Tra nsfe r Let,el Boc�JI

Figure 3 Scan Latch

Digital Te cbnicaljournal Vo l. .! No. -i Fall I'J'JII 91 VAX9000 Series

Primitive Debug_f?ingFe atures the power, clock, and SPU subsystems as parr of the One of the most basic features offered by the SPU is reliability and design verification test plans. the ability to directly :1ccessthe internal registers of Other low-level commands provide the means the clock and power systt:ms by using the FXAMI:-..JE for debugging and troubleshooting the scan paths. and DEPOSIT commands. When combined with the For example, the SET and SHOW commands permit general command language of the SPU, these com­ individual control of the SCI signals. t ·sing these mands allow clebuggers to create procedures to commands, the SCI can be observed statically and control, rest, or interrogate various components of be stepped through its operations. Precise control the VA X 9000 system. For cx:.�mple, command pro­ of the SCI signals provides easier debugging of the cedures h:.!\'C been created to monitor and exercise scan paths in the early multichip units, primarily

1 TO 5 TEST STATIONS OR VAX 9000 SYSTEM I I I I

SPU .. FRONT END .. I I CTY ..- TTY CONTROL

MCU(s) UNDER RTY .._ Nl TEST

RD54 DISK

SCI

scu - CPU3 � SCI ADAPTER SCAN CPU2 CONTROL - CPUl '-- CPUO CLOCK SCI �YSTEM CLOCK MCM ADAPTER CLOCK SUBSYSTEM - POWER POWER SUPPLI IES POWER CONTROL AND SENSORS SUBSYSTEM PCI

KEY: MCM - MASTER CLOCK MODULE SCU - SYSTEM CONTROL UNIT PCI - POWER CONTROL INTERFACE CTY - CONSOLE TERMINAL RTY - REMOTE TERMINAL MCU - MULTICHIP UNIT SCI - SCAN INTERCONNECT Nl - NETWORK INTERCONNECT

Figure 4 Generic SPU Environment

96 Vo l. 2 No. 4 Fa ll /'J')O Digital Te chnicaljo urnal Th e VAX 9000 Service Processor Unit

because static signal operation gives more isobred • All register transfer-level signal names corre­ feedback than an interconnect that runs at fu ll spond with those present on logic schematics, speed through many state changes. including the logical design hierarchy. This cor­ Once the single-step operation of the SCI was ver­ respondence makes the relationship of displayed ified, the fu ll-speed operation of the scan logic in signal names and schematic signal names easy the multichip unit could be tested . Commands that (e.g., %CPU0. VA P.VAPO.ALU_FUNCT!ON_H ). collectively display and modify the scan latches of The translation from a logical signal to its associ- a single macrocell array were an effective way to ated scan latch uses clara structures supplied in a verify the operation of this logic. Latch data are configuration database file, which is loaded into displayed in their physical fo rm as a string of hexa­ SPU memory during SPU initialization . All CPUs decimal digits, the length of which varies from one with identical mu ltichip unit configurations (i.e., macrocell array to the next, in the range of 100 to same CPU revision) share the same configuration :)00 bits. Provisions also exist ro select scan clock database memory image. The system control unit rates ranging from 3200 nanoseconds (ns) per bit to always req uires its own database. Only two CPU the fu ll 100 ns per bit operation. revisions can be supported at one time because of High-level Debugging Fe atures SPlJ memory constraints for storing the separate The use of the scan architecture as a means for configuration . However, by providing for initializing and debugging the VA X 9000 processor two CPU revisions, the needs of single and dual CPU was a first for a VA X front-end. Because physical configurations were completely satisfied . Further, larch information is cryptic and diffi cult to use, we it was possible to upgrade homogeneous triple and designed the SI'U to provide the necessary trans­ quadruple configurations in a stepwise manner. lation from a logic signal name to its corresponding scan larch in the machine. We modeled the SPU's Macrocode Execution of DECSIM, user interface after the user interface Initial system-le\'el multicbip unit configurations one of Digital's logic simulation utilities. Engineer­ consisted only of a scalar CPU. The system control DECSIM ing used the interface during the VA X 9000 unit was not yet available as a result of the extended design phase and was already familiar with its user simulation of the design. Fortunately, we had antici­ interface. pated the possibility of running partial configu­ The majority of the user interface development rations and could provide modes within the SPU work involved the EXAMINE DEPOSIT and com­ software to redirect commands that normally mands and their associated data structures that access main memory (e.g., EXAMINE, LOAD) to resemble the procedure usesd in the DECSIM sys­ access the CPU's 12H kilobyte (KB) system cache tem. These commands provide access to the more or SKB virtual instruction cache instead. The first than 26,000 signals accessible through the scan sys­ VA X macro-instructions were loaded and executed tem in a uniprocessor system . The SI'U also main­ on the VA X 9000 system using this technique. An tains the design hierarchy of the signals, which additional fea ture, which involved minor hooks in permits signals to be referenced as they appear on the system microcode. provided a means for the the pages of rhe logic schematics. A watch point VA X instruction set diagnostic, EVKAA, to commu­ and trace point capability, modeled after similar nicate with the console terminal through scan features in the DECSitvl system , simplifies the task of attentions rather than by using the system control monitoring state changes in the machine. Because unit. Thus, the diagnostic could run to completion. the processor clocks are single-stepped, signals which change state are displayed automatically. D Advanced DebuggingFe atures Using the ECSiivl system as the model for these Sl'l l features produced two advantages: Although not obvious aids to VA X 9000 debug, the following features were indispensable or, at the • Designers moved from the simu lation environ­ least, reduced debugging time and effort: ment (i.e., using the DECSIM system) ro actual debugging (i.e., using the Sl'll) with virtually no • A character-cell windowing capability that training. Air hough the precise syntax of the SPU's allows system microcode sources ro be automat­ commands is not always identical to the syntax ically located, displayed, and updated on the of DFCSIM commands, the concepts are the same. screen as the system is single-stepped. We mod­ Therefore, fi rst-time users overcome the diffe r­ eled this fe ature after the VAX debugger's win­ ences quickly. dowing capability because most VAX engineers

. . Dip,ital Te cbnicaljoro-nal Vol. .! No j Fa ll /'J'JO 97 VAX9000 Series

are fa miliar with this capability. Windowing module, the power and environmental monitor, eliminated the need for hard-copy microcode the disk controller, and the tape controller. It also listings and the logistical problems associated reports errors in various parts of the VAX 9000 with their use. system, such as the system control unit, the CPI ·s, the memory system , the master clock module, and • By connecting the SPU to the engineering net­ the power and environmental systems. Because fa il­ work during development, timely updates of ures in any of these subsystems can incapacitate the SPU software were made possible. This kept the VAX 9000 system, none of them reports its errors VA X 9000 debugging effort, which was occur­ directly to the VAX 9000 operating system . ring simultaneously on several systems, up to date with the latest SPU software fixes and Errors The disk controller, tape controller, enhancements. Together with the multisession SPU and scan control module use the VAXBI VA X port capability of the SPU operating system, the use protocol to report errors. The power and environ­ of the network made remote debugging a reality mental monitor passes error information to the ser­ throughout the VAX 9000 debug phase. vice processor module through its private bus, the • 13ecause the SPU had to initialize the VA X 9000 SPU-to-power control system interface. system thousands of times during system debug­ ging, the unit was designed to perform system Environmental Exceptions The power and envi­ initialization as efficiently as possible. For exam­ ronmental monitor monitors the regulator intelli­ ple, the loading of structures (e.g., control stores gence cards, airflow sensors, and temperature or cache tags) was optimized by overlapping the sensors throughout the system. When it detects any operation of three MicroVAX-based processors : problems in operating voltages, currents, tempera­ the service processor module, the scan control tures, or airflow, it notifies the service processor module, and the disk controller. operating system , which logs the error condition.

The debugging fe atures located early design and Clock Exceptions When the master clock module fabrication problems in the clock, power, scan, and detects an error in either the clock phase or the processor logic areas. Ultimately, the features were clock frequency lock, it generates an attention to used to initialize and run the first VA X 9000 system. the scan control module, which interrupts the ser­ vice processor module. The SPU operating system Error Handling logs the error condition. To support high system availability, accurate and timely error detection and logging is required. Memory Error Correction Code Events The main Error data collection cannot depend upon host sys­ memory of the VA X 9000 system contains error­ tem availability, and the data must be available when correcting logic to correct single-bit errors and the system is not functional. Therefore, an indepen­ detect double-bit errors. When a memory location dent service subsystem that can collect data from all with a single-bit error is read, the system control system components, render it into a useful format, unit corrects the error and passes the corrected data and store and display the information is needed . to the requesting device. It also writes an SPU regis­ The service subsystem must also be organized in ter with the error type and the failing memory such a way that if it fa ils, it does not directly cause address. The SPU operating system writes this infor­ system processor fa ilures. Repair, reboot, and sys­ mation to the error log. If the system control unit tem reintegration must occur without interfering detects a double-bit error or reads a marked-bad with system processor operation. The SPU meets location, it passes the bad data, marked as bad, to these requirements; it is a fully independent com­ the requesting device and notifies the service pro­ puter that runs its own operating system with dedi­ cessor operating system, which logs the error. The catec.J peripherals. The SPU performs system-wide bad dat::1 is handled locally by the requesting device, error detection and reporting fu nctions and pro­ usually by generating an error of its ow n. vides advanced error recovery features for the system processor. CPU and Sys tem Control Unit Errors When a CPU detects an error in a parity checker, it attempts to Error Detection come to an instruction boundary and halt. Once The SPU reports errors in its own VA XBI adapters, it has halted, the CPU sweeps its cache. When the the service processor module, the scan control cache sweep is completed, the CPU asserts an

98 Vo l. 2 No. 4 Fall I'J'JO Digital Te cbnicaljournal Th e VAX 9000 Service Processor Unit

attention to the scan control module to inform the be used to determine the approximate time of the SPU that recovery is required. When the system crash. Errors are logged regard less of the state of the control unit detects an error, it first asserts a fatal system processor. As a result, information is avail­ error signal to each of the CPUs, and then asserts an able for analysis even in the event of a total proces­ attention. When the CPUs receive the fatal error sig­ sor fa ilure. The error log file may also be transferred nal, they attempt to come to an instruction to TK50 tape for off-site analysis. boundary and halt. Once halted, the crus assert The error formatter passes error information to attention lines to the scan control module. The the VAX 9000 operating system by copying the error caches are not swept since their path to memory, log entry to system memory and then invoking the the system control unit, is not working. RXFCT function to notify the VA,'{ 9000 operating system that the entry is available. Should the operat­ Keep-alive, Timeout To ensure that a CPU is not ing system not respond to this notification, the hung by an undetected error, the SPU periodically error formatter assumes that the operating system sends a keep-alive interrupt to each CPU. CPU has crashed and writes the error log entry to a tem­ microcode services the interrupt at the next macro­ porary data ft.le. When the VAX 9000 operating sys­ instruction boundary by asserting an attention to tem reboots, it notifies the SPU by using a TXFCT the scan control module. If the CPU should be hung function. The error formatter then reads any saved by an undetected error, the SPU times out while it error log entries from the data file and transmits waits for the keep-alive reply attention and, thus, them to the VAX 9000 operating system. This proto­ determines that there has been an error. Similarly, col ensures that all collected error data is eventually the primary CPU monitors the SPU by sending it a reported in the system error log. keep-alive request through the TXFCT register. If the The error formatter also maintains a SPU port to SPU does not respond to this request within a time­ which any process running on the SPU may con­ out period, the VAX 9000 operating system assumes nect. Connected processes receive copies of all that the SPU is hung and reboots it using a VA XBI error log entries as the entries are logged . This port reset. When the SPU reboots, it reintegrates itself is used by EWKCA, the symptom-directed diagnosis with the rest of the VAX 9000 system without inter­ tool, which analyzes errors as they occur and fering with system operation. determines which system components might have caused the failure. The port is also used for system ErrorRe porting debugging by the error insertion program to verify When errors are reported to the SPU operating sys­ that errors are being logged and analyzed correctly. tem, the error formatting facility logs the error information locally and reliably transmits it to all Snapshots In addition to its error logging facili­ intended receivers. The error formatter maintains ties, the SPU operating system provides the ability to the error log fi le ERRLOG.SYS on the SPU RD54 take "snapshots" of the system processor state. The drive, passes error log entries to the VAX 9000 oper­ snapshot file provides a detailed record of system ating system to be logged in the system error log, context, which allows engineers to take a snapshot and also passes the entries to any SPU software that of a hung system and reboot it, and then analyze the requests them. The error formatter writes the error snapshot file while the system proceeds to perform log file using the SPU operating system disk I/O func­ other useful work. The snapshot display utility is tions, passes the error log entries to the VA X 9000 used to examine the data in a snapshot file. In addi­ operating system using an RXFCT function. and tion to formatting the data in the snapshot file, the passes the error log entries to other SPU processes snapshot display utility can be used to examine any using the SPU port protocol. If the RD54 drive is not scan latch in the fi le, by name, in the same fashion as available, which prevents access to the SPU error the console EXAMINE command is used on the log, the error formatter continues to send error log actual hardware. The data available in a snapshot entries to the VA X 9000 operating system and to file is summarized in Ta ble 2. other sru processes. The SPU error log contains all the error log entries Error Recovery collected by the SPU (but not those collected by the The high level of visibility achieved by the scan VAX 9000 operating system) and time stamps, system allows the SPU to provide extensive error which are logged every ten minutes. Should an SPU recovery facilities for the VAX 9000 processor. operating system crash occur, the time stamps may SPU-based recovery offers several advantages over

Digital Te cbnicaljournal Vo l. 2 No. 4 Fu l/1')90 99 VAX 9000 Series

Ta ble 2 Snapshot File Contents traditional microcode-based error handling. The CPU hardware resources that might otherwise be Revision Section used for error handling were available for the logic

All multichip unit revisions designers to improve the system performance. Because the error data is processed external to the All SPU adapter revisions failing component, the recovery process itself is Microcode revisions not suspect. Finally, because the system clocks are All XMI adapter revisions stopped while recovery takes place, erroneous data All VA XBI adapter revisions does not propagate throughour the system. Traditionally, many microwords in the CPU Power Section control store (approximately 500 in the VAX 8600 All power control system registers system) are used for error recovery microcode. "Sense power" results However, because the SPU is responsible for VAX 9000 error recovery, additional control store Clock Section space is available for instruction microcode. If this had not been the case, we might have had to make a All master clock module registers space trade-off between instruction and recovery SPU Section microcode, which coul.d have resulted in more emulated instructions and a performance penalty All SPU-to-system control unit adapter registers for VAX instruction execution speed. Because the scan system allows the SPU to deter­ 1/0 Section mine the state of every scan latch in the CPUs and XMI device error registers system control unit, logic designers were able to VA XBI device error registers place error detectors anywhere in the design

XMI-to-system control unit error registers without organizing the detectors into microcode­ readable error registers. As a result, significantly System Control Unit Section more error detectors were used for precise error analysis than would have been possible if the scan All scan latches system were not available. Each VA.,'\ 9000 CPU con­ Last 50 entries from system control unit micro tains over 450 error detector latches. program counter history buffer Several advantages are derived from performing All cache tags error recovery independently from a failed compo­ All other logical structures (e.g., control stores) nent. The most obvious advantage is that hardware, Configuration database version which may be failing, is not used to control the 1/0 physical address memory map recovery. Once the system processor state has been Memory physical address memory map scanned out into SPU memory, analysis is a function

Nonexistent physical address memory map of software running on a known good processor. The SPU analyzes the data and then scans a cor­ CPU Section (Repeated Once for Each CPU) rect state into the system processor. The entire process is performed while the system clocks have All scan latches been stopped. Therefore, processor errors cannot Last 50 entries from program counter history buffer cause "error loops;" that is, the error recovery All cache tags process itself gets errors from a corrupt processor All general-purpose registers state. SPU-based error recovery can completely All internal processor registers reset a corrupt system, regardless of the degree of All other logical structures (e.g., control stores) corruption. Top 50 longwords of current mode stack The VA.,'\ 9000 error-handling facility takes

Top 50 longwords of interrupt stack advantage of many advanced software features that are available in the SPU operating system. It uses 32 bytes of instruction stream around each configuration database information to access sys­ program counter in history buffer tem processor signals by name rather than by scan Configuration database version ring locations. Thus, one version of the error han­ 50 micro program counters, collected by stepping dling code can handle several different physical the clocks processor variations. The error handler also uses the

100 Vo l. 2 No. 4 Fall 1<)<)0 Digital Te cbnicaljounwl The VAX 9000 Service Processor Unit

SPU operating system structure access routines to processes the data from the data collection phase read and write the processor structures, again, by into an error log entry, posts the entry, and cleans burying the physical implementation in the config­ up the data structures that will be used to recover uration database. As a result, the error handler from the next error. can look at the architectural features of the VAX pro­ Errors that are too severe for the error handler to cessor rather than at the gate-level design of the handle are signaled to the SPU command inter­ VAX 9000 system when performing error analysis. preter, which can run command scripts to com­ The benefit of this approach is that recovery proce­ pletely reinitialize the machine and reboot the VAX dures are based on the system architecture, rather 9000 operating system. Examples of such severe than on the machine implementation. errors are bard errors that prevent VAX 9000 oper­ One of our design goals for the VAX 9000 error­ ating system machine check code from running and handling system was to recover from most errors errors that cause a CPU to fail its macrostep. in under 500 milliseconds. Longer delays increase the probability that I/0 devices will time out while Summary waiting for the operating system to respond to The SPU is a dedicated subsystem for service and requests and cause the operating system to crash, maintenance support for the VAX 9000 fa mily. It is even if the error-handling system successfully closely linked to the VAX 9000 processor to provide recovers from the error. The error handler meets system error recovery. It also presents a high-level this goal by taking maximum advantage of the interface with which debuggers may observe and multiprocessing capabilities of the tightly coupled control system processor activity. Through the use hardware design of the service processor module of a system-wide scan architecture, the SPU pro­ and scan control module. Error recovery is split into vides access to nearly roo percent of processor a multistep process that keeps both SPU processors machine-state. Finally, the use of the SPU in various working on the problem simultaneously. tester environments greatly assisted the multichip The error handler recovers a failed system in five unit debugging effort and provided advanced train­ phases: data collection, data analysis, error recov­ ing for VAX 9000 system debuggers. ery, macrostep, and cleanup. In the data collection phase, the scan control module scans out all scan Acknowledgments CPU rings of the failed or system control unit. In the The authors wish to thank Michael Evans, the SPU analysis phase, the scanned data is used to deter­ project leader, whose drive and ambition provided mine which architectural features of the system the force behind the project's success. We also wish have been corrupted (e.g., caches, general-purpose to acknowledge the other members of the SPU registers, internal processor registers, microcode design team: Karen Barnard , Stephen Conway, stores, and the translation buffer). David D'Antonio, Susan DesMarais, and Brian Rost. In the recovery phase, the error handler attempts to restore the system to a state in which no soft­ Reference ware-visible data is corrupt. Therefore, the soft­ ware running on the VAX 9000 system, including 1. D. Chin et al., "The Unique Features of the VAX the operating system, is unaware that an error has 9000 Power System Design," Digital Te chnical occurred. The error handler determines whether jo urnal, vol. 2, no. 4 (Fall 1990, this issue): the system state can be restored successfully or if 102-117. a machine check must be generated to allow the VAX 9000 operating system to attempt to handle the error on a higher level. It then restores the CPU to a known good operating state, by using latch data from the configuration database, and corrects any corrupted software-visible data. In the macrostep phase, the error handler turns on the system clocks to allow the failed CPLI to attempt to macrostep one instruction. If the macrostep completes successfully, the recovery is considered successful and system operation is allowed to continue. In the clean-up phase, the SPU

Digital Te chnicaljournal V(J/. 2 No. ·4 Fa ll /1.)1.)0 101 Derrick]. Chin Barry G. Brown Charles F. Butala Luke L. Chang Steven]. Chenetz Gerald E. Cotter Brian T. Lynch Thiagarajan Natarajan Th e Un ique Features Leonard]. Salafia of the VAX 9000 Power Sy stem Design

The VA X 9000 series rep resents Digital'sfi rst implementation of a mainframe com­ puter system. To be competitive in this market, the power system fo r the VA X 9000 series had to providehi gh system availability To meet this goal, the sy stem includes fe atures neither considered norfo und in previous large Digital computer systems. Some of these fe atures are the use of redundancy in parts of the design and the addition of more power system diagnosis capabili�y fo r quickerfa ult isolation and fa ulty unit replacement. Otberfe atures provide competitive advantages in specific marketplaces, such as meeting low harmonic distortion for AC input current, which is an emerging European AC power qualiryst andard. Simulation tools, wbich are used more prevalent()' in digital logic, were used to improl!e the power design.

The two key requiremems of the VAX 9000 power lators should be used. A large number of regulators system are high availability and the inclusion of in a power system can cause the mean time between competitive features. High availability for rhe power failures (MTBF) to be lower than desired. Therefore, system means we had to achieve the highest unit we chose to use redundant regulators in the power regulator reliability possible by using the appropri­ system architecture for improved availability. ate technology available. Further, we had to deliver Another means of increasing the MTBF was both more power system and cabinet environmen­ achieved by improving the load sharing among the tal monitoring and diagnostic capability that could parallel regulators that power a low-voltage current reduce the time spent in isolating and replacing a load. With this feature, no one regulator operates malfunctioning unit. Competitive features mean at a percentage of maximum rating much higher designing into the system features that would be than its parallel regulators, which eliminates the either better than expected or advantageous to the higher operating temperatures that can occur and, VAX 9000 system in certain markets. as a result, lowers the MTBF. A full discussion of all the methods used to meet High regulator reliability results from good cir­ these requirements is too long for this paper. There­ cuit design. Three examples of the unique simula­ fo re, the discussion in this paper focuses on some of tion features that were used as checks on circuit the unique applications of the power technology designs are discussed in the Simulation section of and tools used in the design of the VAX 9000 system: this paper. In one case, simulation pointed the way to a circuit problem that was not initially apparent. • Power system architecture In another case, simulation was used to verify on

• Improved load sharing paper that the number of regulators chosen to power a specific load was sufficient. • Simulation High availability can be achieved by reducing the

• Increased control and monitoring time to isolate a system problem and replace the malfunctioning unit. A power and cabinet moni­ • Low harmonic distortion toring module, EMM, fulfilled this purpose in the One of the issues we had to decide in designing VAX 8000 systems. The power control subsystem, the power system architecture was how many regu- PCS, used for this purpose in the VAX 9000 systems,

102 Vo l. 2 No. 4 Fall /'J'JO Digital Te chnicaljo urnal The Unique Features of the VAX 9000 Power System Design

expands on the diagnostic and monitoring features The basic power system architecture for the of the EMM. V�'<9000 Model 200 and Model 400 series is shown Meeting emerging European AC power quality in Figures 1 and 2, respectively. Power processing in standards was viewed by the European sales each model occurs in two distinct stages. First, an force as a distinct competitive advantage for the AC front end processes and converts AC utility input VAX 9000 system. A proposed standard we wanted power to high-voltage DC, which is then bused to meet was to achieve low harmoruc distortion of about the power system. Second, DC-to-DC switch­ the input AC current wave form, which was met ing regulators convert the high-voltage DC to low­ in the utility power conditioner (UPC) front-end voltage outputs, which are then distributed through design of the power system. High availability was high-current-carrying busbars to the various logic designed into the UPC through such features as loads. An intelligent power control subsystem (PCS) redundancy and increased immunity to power line provides control, sequencing, monitoring, and disturbances from a commonly accepted industry diagnostic capabilities. Dedicated bias regulators, practice of one AC cycle to teo AC cycles. which are powered from the high-voltage DC, provide housekeeping control (i.e., low power) and VAX 9000 Power System Architecture start-up power to each bank of output regulators. The discussion of the power system architecture The high-voltage DC bus permits low-voltage out­ will focus on some of the architecture's major put regulators to be added or removed for diffe rent features: power zoning, N + 1 redundancy, and system configurations. The high-voltage DC bus also decoupling. can be backed up with a battery unit that produces • Power zoning enables parts of the system to be high-voltage DC from 48-volt batteries through a powered off for maintenance while the rest of step-up switching regulator. This approach allows the system remains operational. any specific low-voltage output to be produced, as needed, during the battery back'l.lp period without • N + 1 redundancy provides higher perceived using specific battery-to-logic voltage output DC-to­ system availability to counteract the impact of DC regulators. The battery required to backup the low system mean time between failures, which is entire computer system would be larger than the a result of the large number of regulators. computer itself. Therefore, diodes are inserted into

• Decoupling major sections of the power system the high-voltage DC distribution to partition the allows future upgrades to be made without high-voltage DC bus, and only sections, such as the requiring significant changes to the rest of the memory refresh operation and PCS control, are system. backed up.

PCS (POWER CONTROL SUBSYSTEM)

/ / / ENVIRONMENTAL / MONITORS UTILITY POWER

120/208 VAC 3 PHASE

� �

Figure 1 VAX 9000 Model 200 Series Power System

Digital Te cbnicaljournal Vo l. 2 No. 4 Fa ll 1990 103 VAX 9000 Series

PCS (POWER CONTROL SUBSYSTEM)

Figure 2 VAX 9000 Mode/ 400 Series Power Sys tem

Power Zoning system. Therefore, the worst load imbalance does The power-zoning feature meers rhe maintain­ nor justifythe added complexity. ability and high availability goals in the VA X 9000 The diode does nor have a significant impact on Model 400 series of triple and quadruple proces­ overall power load re liabiliry because conservarive sors. In the power system's configuration, a pair of deraring of rhe diode results in a lower diode oper­ dual processors can be powered off for mainte­ aring temperature and hence higher rel iabiliry. nance, while the remaining powered-on processors We were concerned that power zoning could maintain system operation. have an impact on rhe resr of rhe system as a result A quadruple processor configuration is not com­ of powering down part of the system. However, posed of two identical dual processors. Some func­ analysis of the results showed rhar such a concern tions of a quadruple processor are not replicated. was unfou nded. The high-voltage DC bus has rela­ The system control unit, the memory, the service tively long time cons tams (i.e., slow to react to processor unit, and the PCS are common ro both changes). Therefore, turn-on and turn-off transients dual processors. Therefore, these functions are on the bus are smooth and gradual and do not powered up by either front end . The high-voltage generate quick-changing electromagnetic fi elds that DC power bus is diode OR 'd from either AC power could affect the operation of the sections of the source, through the dual diode, CR 1, and then fed to system that are still functioning. the ourput stages that power the common elements listed above. N + 1 Redundancy The diode-OR process in the VA X 9000 system Each processor in the VA X 9000 power system uses does not provide for active loadsharing. Active approximately 400 amperes from each of the two loadsharing between each AC from end increases supply voltages. The ratings of the power semi­ the overall actual power system reliability because conductors used in the outputs of the OC-ro-DC it ensures that each AC front end supplies half the regulators deliver an optimal regulator rating of load. Othenvise, one AC front end could take most approximately 240 amperes. Based on these rat­ of the load (and be stressed higher), which would ings, powering a CPU in the VAX 9000 system would leave the other unit roo lightly loaded. However, require two regulators for each voltage. However, acrive load sharing is complicated by the physical in a large system, such as the VAX 9000 system, the distances between the AC front ends and the com­ number of regularors can quickly add up, which plex handling of faults and parcial faults in each would result in an equally quick drop in overall AC front end. The load of the common elements in system reliability. Powering two CPUs from the the VA X 9000 system is only 20 percent of the total same voltage bus reduces the number of regulators.

104 Vo l. 2 No. 4 Fa /1 19')0 Digital Te cbnicaljournal Th e Un ique Fea tures of the VAX 9000 Po wer Sys tem Design

Redundancy is then used to minimize the impact unit. This reliance has a significant impact on the of the large number of regulators in the bus. design of the regu lator, the regulator response time, By using redundancy, additional regulators on a and how the regulator handles the fa ults that can voltage bus increase the perceived time between cause a fa ilure. Fast regulator response (the time it comrlere fa ilures. takes to respond to a change in input or output) is For example, consider a voltage bus that requires needed to ensure that the output voltage does not two regulators to supply the load current. A fa il­ dip roo much when each regulator picks up its ure in either regulator causes a complete fa ilure. share of the load from the f:J. iled regulator. How­ If another parallel regulator is added to supply ever, the fas ter response time makes it more diffi­ the load current, the probability of a complete cult to keep the control functions of the unit stable. failure significantly decreases. In this case, if one Moreover, the regulator input voltage range is regu lator fa ils, the other two could supply the load. designed to be relatively wide to tolerate wide The statistical probability that another failure swings in the high-voltage DC input. would occur before the fa iled regu lator is replaced When one regulator in a bank of regulators oper­ is very small. ated in parallel fa ils, the output bus voltage dips A system of N regu lators at an individual failure until the other regulators, which are connected in rate of lambda (i\)would have a system failure rate parallel, can react and pick up the load currents. of N rimes i\,or an MTBF of 1 divided by Ntimes i\.1 The magn itude of the dip depends on the time the The actual calculations are input fuses in each regulator take to open and on the values of the input capacitors and the distribu­ i\ N X i\ (total) = tion impedances. Fast-opening fuses allow smaller voltage dips but or are more prone to fa lse nuisance openings. Slow ­ MTBF = lli\(total) = 1/(N Xi\) opening fuses do nor open for normal or nuisance The fa ilure rate calculation for a system that con­ surges, but allow a greater voltage dip. Large values tains one regulator more than required (N + l) is of input capacitance provide the energy to open the fuses quickly, but the voltage recharging of the i\ (N + I) X N Xi\X i\ (total observed) = I A I{(N+ l)xi\} +(Nxi\)+u) capacitors is longer. high distribution impedance decouples the fa ults from other units but has a high power loss. MTT3F(observed) = ( ( (N + 1) Xi\]+ Simulation and resting showed that the wide (N Xi\)+ u) I I (N + I) X N Xi\Xi\ ) inpm range design of the regulators is sufficient to It shoukl be noted for the above equation, that u tolerate the high-voltage input dips caused by other equals I divided hy the time between fault and fa ul ts. The regu lator control and response rime repair (service interval). keep the low-voltage OC outputs within specifica­ Using this calculation, if a bus required 4 regu­ tion when the input voltage is within its range. lators and each regulator had an MTBF of 400,000 Other faults within the regulator can cause it to hours, the observed MTBF would be 100,000 hours. fa il, but the load is picked up by the other regula­ T"he observed MTBF with five regu lators (i.e. , N +I) tors, operating in parallel, on the bus. Clearly, fa ults would be 23,9H9,000 hours, which is 239 times such as a permanent short on the output bus, cannot longer than the four regulator case. The maximum be survived . Because the low-voltage output regula­ time between the fault occurrence and repair would tors operate in parallel and in an N + I redundancy be 2 weeks, or 336 hours. The observed MTBf is mode, the output voltage is not affe cted by most so large, compared to other elements in the system, common single-fault cond itions in the power sys­ the redundant regulators have an extremely small tem hardware. effect on the overall reliability. The number of redundant regulators per output Decoupling voltage bus is limited to one in the VAX 9000 power A key feature of the power system's architecture is system for sp:.tee, weigh t, and cost reasons. N is the that each major subsystem is relatively decoupled number of regulators required to supply the maxi­ from the other subsystems. Decoupling permits mum current of a bus, and the addition of one more e:1ch subsystem to be designed for its own require­ regulator is called N + I redundancy. ments and to be changed or upgraded as the N + I redundancy relies on the good regu lators requirements change (e.g., more cost effective, on the output bus to pick up the load from the fa iled improved technology, or different output voltage).

Fa /1 1')')0 Digital Te cb niculjournal v,,r 2 No. ·I 10) VAX 9000 Series

provided the interface and critical fu nction remain observed system power outages over standard the same. For example, two significantly differ­ Digital systems This feature improves system em cost and performance options, H7392 or H7390, availability to the customer. fo r the AC front end can be used in different config­ Harm onic Distortion The power system's design urations, and the rest of the power system does not had to meet the increasing restrictions on the inrn­ need to be changed . Thus, power platforms can be face with the public power utility and be able to flexibly tailored to meet the needs of diffe rent com­ withstand the occasional availability of only poor puter systems. power. Utility power is generated as a relatively Achieving Low Ha rmonic Distortion pure (i.e., low harmonic distortion) sine wave. AC front ends and power supplies must convert this The AC front end of the VAX power system 9000 sine wave of voltage ro a ripple-free DC voltage fo r processes and converts public utility AC power to ultimate consumption by the logic chips within the high-voltage DC. Our goal was to design the AC computer system. Standard methods used for this front end to be highly reliable, have a high availabil­ conversion create a nonlinear load on the sine wave ity, and meet the emerging European AC power of voltage. This nonlinear load distorts the utility's quality standards. One of those standards is to have sine wave of voltage for other users, because of the low harmonic distOrtion of the input AC current distribution system impedance, and usually appears waveform. These features were essential to support as interference for other users. In Europe, the the VAX system's entry into the mainframe 9000 occurrence of this type of interference is planned computer marker. We also decided tO meet the low to be limited by restricting how much nonlinear harmonic distOrtion standard of the AC front end load current an AC fr ont end can have. Therefore. because the European marketing and sales force we had to design a unique circuitry that could viewed compliance with this standard as a distinct convert AC power to DC power at 20,000 watts competitive advantage. without high levels of current distortion to meet this European requirement. Design Fa ctors A design based on commercially available conrrol The dominating design factor for the AC front technology could not meet the stringenr technical end was the size of the input power level, which requirements of high overall conversion efficiency was approximately 20,000 watts. This size signifi­ and stability of operation because conventional cantly exceeded the power levels of previous AC AC-to-DC circuitry produces up to 30 percent dis­ circuit designs for a single unit. The high power tortion. Our goal was to comply with emerging consumption was a result of the use of 250,000 European requirements of harmonic current distor­ emitter-coupled logic (ECL) gates in the CPU and tion levels in the 5 percent range. However, at the 512 megabytes (MB) of memory. time we were designing the system, no circuitry at this power level existed in the power conversion High Reliability and Availability To achieve high industry. Therefore, we had to develop a unique reliability, we used conservative power derating lev­ pulse-width modulator (PWM) circuit and control els and good thermal management for key devices. equations for the input power conversion stage, Typically, the device voltage ratings used are 80 which is shown in Figure 3. percent of rating. The main switches and rectifiers The pulse-width modulator combines the advan­ used in the power stages used 40 percent of rating. tages of low switching frequency, which reduces Current derating is also conservatively placed at 40 switching losses in the converter, with exception­ percent. Stress is lessened because of lower device ally short response time to all input line voltage fu nction temperatures, which results in a longer disturbances and to rapid changes in the required operational life, which equates to higher reliability. computer power. The final design produces We designed two approaches to attain high less than 5 percent total harmonic distortion of availability. First, redundant circuitry was used for the input line current when the UPC is operated the AC-to-DC circuit fu nction. Second, we increased at 20,000 watts load. The uniqueness of the PWM immunity-to-line outage from the standard practice increased the immunity-to-line voltage outages of one cycle of outage protection to ten cycles. The from one cycle of outage protection to ten cycles. increase from one cycle to ten cycles of outage Furthermore, the increase was achieved with­ immunity provides the VA X 9000 system with a out a corresponding tenfold increase in storage 300 percent improvement in mean rime between capacitors.

106 Vo l. 2 No. 4 Fa ll /')')0 Digital Te chnicaljournal The Unique Fea tures of the VAX 9000 Po wer Sy stem Design

OUTPUT SWITCH AC � AC RECTIFIER FILTER o----- INPUT

FAST DISCHARGE

AUX AC POWER AND POWER LINE MONITOR

TO UPC CIRCUITS

DIGITAL POWER BUS RIC AND TOTAL CFFBUS INTERFACE

Figure 3 UPC BlockDiagram

Flexible Line Cora this method essentially uses equipment that The high power level and the requirements for a was designed to function as standalone regulated flexiblelin e cord and plug required that the Under­ voltage sources. By adding external control loops, writers Laboratory (UL) and Canadian Standards the equipment is fo rced to provide identical out­ Association (CSA) agencies expand the regulations put voltages, as measured at some defined point that governed the size of power cordage allowed in in the system. If precise voltage matching is not a computer room . A flexibleli ne cord connected to achieved, whichever supply had the higher voltage the AC service is a requirement by Digital for all its consumes the load, up to its overcurrent sense products. This feature is deemed valuable because it point. Thus, equal load sharing cannot happen. is used both to facilitate the initial installation of the Individual external controllers are required for compmer and possible relocation at the cuswmer\ each converter, which makes the system more site. Although delays can occur while waiting for a complex. The VAX 9000 system requires up to five national agency to amend one of its national regula­ converters per bus, and we could not achieve better tory codes, the approvals were received in time w than 20 percent power sharing between modules maintain the project's schedule. by using this method. No traditional methods could support the number of converters in the VA X 9000 Improving Load Sharing system. Also, most methods had a master-slave rela­ Detailed stress analyses show that when regulators tionship that precluded maximizing a regularor's are operated in parallel, maximum reliability is reliability potential. achieved when the load current is shared equally among them. Ne w Approach As a result of the limitations of the traditional meth­ Traditional App roach ods, we developed a new, less complex approach A traditional approach torunning regulators in par­ to current sharing between parallel converters. allel may be seen in VA X 8000 series machines. Although developed specifically for the VA X 9000 In these processors, regulators that are designed for program, the features and utility of this approach standalone operation are placed in a parallel con­ have universal application. The essential techno­ figuration. Current sharing is forced by modifying logical shift from prior practice is that in this system each supply's individual reference voltage through the regulators are current sources rather than external monitoring and control. In the case of voltage sources. VA X 8000 machines, a maximum of four units We designed the current sources to have a com­ may be coupled in this way. Figure 4 shows that pliance range that covers a band of voltages thar are

Fa /1 /'J'JO Digital 'fecbnicaljournal Vo l. 2 No. 4 107 VAX 9000 Series

I I CONVERTER CONVERTER CONVERTER

INTELLIGENT INTELLIGENT INTELLIGENT CONTROL UNIT CONTROL UNIT CONTROL UNIT (ONE PER (ONE PER (ONE PER MODULE) MODULE) MODULE)

INTERNAL? INTERNAL� I INTERNAL? REFERENCE REFERENCE REFERENCE AND ERROR AND ERROR AND ERROR AMP AMP AMP

cu RRENT SENSE -·� < �

POWER CONTROL SYSTEM VOLTAGE CONTROL

LOAD

Figure 4 Load Sharing by Vo ltage Co ntrol of Vo ltage Sou rces

norm:tlly fou nd in logic circuits. By making the regulator acts as a current source, the system acts as regulator outputs fully fl oating, the VA X 9000 a controlled and regulated voltage source. Because system requ irements for +')-volt, - ).4-vol t, and the voltage control loop only contains one pole, the - 5. 2-volt buses are met with only one regulator bandwidth of the control loop can be increased by design, rather than a separate design for each up to a factor of at least 15. As a result, the substan­ voltage. The VAX 9000 design is simpler and has a tially high current change requirements imposed lower manufacturing cost. The regu lator is voltage by high-speed memories, such as those used in the and polarity "blind" over its compliance range, and VAX 9000 system, can be accommodated. any number of regulators may operate in parallel to provide any amount of power required at any Principle of Op eration

voltage within the compliance range. Also, this A two-transistor forward regulator is shown in method automatically compensates for the effects Figure 6. In this regulator, S I and S2 are switched of stray resistances and different path lengths from individual regulators on a bus. The basic features of this new approach are CONVERTER CONVERTER CONVERTER shown in Figure '). Individual regulators behave as ( extcrnally programmed current sou rces controlled CD ) CD by a common control signal, such that each regu­ \.) lator dclivers the same current. If the outputs are l2 l3 connected to a common load, the current in that tL- 1,------+------�t

load is the sum of the individual regu lator ou tput V = ( + + x LOAD l 1 l2 l3 ) Z curn:nts. The resulting voltage that appears across the load is the product of that current and the eq uiv­ � alent resistance of the load. Furthermore, if that i' � CUORE'T voltage is compared with a reference voltage in a LOAD �CO,TROC convenrional error amplificr and thc resulting error signal is used to derive the regulators· external pro­ gramming source, then a volrage control loop exists Figure 5 Load Sharing by Current Co ntrol around the regulator system. Thus, although each of Current Sources

IOH Vo l. .! No. ·4 Fa // 1')')1! Digital Te chnicaljournal The Unique Features of the VAX 9000 Power System Design

into conduction simultaneously, which causes the of that constant current, and whatever load is current to flow in the primary winding of trans­ placed across those terminals (i.e., Your) would be former Tl at a level that is directly proportional to determined by the load value. By using an error the output currenr lout plus the slope of the current amplifier and reference, Vcontrol can be made a due to Lour. This current also flows in the primary variable quantity. Therefore, rhe regulator transfer winding of current sense transformer T2. The function can control its output current to any level resulting current that flows in T2 secondary wind­ necessary to produce the desired voltage. In such a ing develops a voltage across the load resistor, RL, system, a control voltage, which is derived from a which is amplified in Al and applied to the input of single error amplifier and reference, can be used as comparator CI. Therefore, at this point, a voltage the control input for several regulators that are pulse appears, the amplitude and shape of which running in parallel. Thus, the current from multiple are directly proportional to the current flowing regulators that feed a common bus can be shared. in the output choke Lout during the Sl-to-S2 con­ duction period. Increased Control andMo nitoring A conventional reference source/error amplifier In the VAX 8000 series, power and environmental combination is placed across the output of the sup­ monitoring and control is provided by the H7188 ply. The resulting error signal, called Vcontrol, is environmental monitoring module (EMM). In the applied to the other input of comparator Cl as a DC VAX 9000 system, these functions are provided by level. The comparator is followed by gating and the power control system (PCS). drive circuits to the power switches. Switching is initiated by a pulse within the gating Basic Design of EMM and PCS circuit that drives the power switches on. The cur­ The EMM monitors the DC-to-DC regulator control, rent flows in the output choke, Lout, and a propor­ air flow sensor, and cabinet temperature. It is also tional voltage appears at the output of the amplifier the interface between the system console and the A I. As this voltage ramps, it crosses the threshold power system. Conceptually, the EMM functions as a set by Vcontrol at the Cl input. The comparator peripheral device to the console similar to the way output then changes state and causes the drive pulse an intelligent disk conrroller is a peripheral ro a to the switches ro cease. CPU. The EMM is a single module that plugs into a If Vcontrol were a fL-xed value, the system would power back panel. be a constant current source. Therefore, the voltage The res is a distributed data acquisition and that would appear at its output would be the result control system. It also interfaces between the power and environmental systems and other parts of the computer system. The PCS takes commands from, and reports status changes ro, the service processor unit. However, in the PCS, the conceptual model of the EMM is extended to provide additional support in hardware and firmware to off-load the service processor unit and to simplify the software inter­ face to the PCS. The PCS includes many features that enhance testability, fault coverage, fa ult isolation, and system availability. The relationship of the res modules to one another and to other system com­ lOUT ponents is illustrated in Figure 7. There are five PCS modules:

• Power and environmental monitor (PEM)

• CPU regulator intelligence card (crURIC) TO C2 THROUGH N • l/0 regulator intelligence card (JORIC)

• Signal interface panel (SIP)

Figure 6 Tw o-transistor Forward Regulator • Operator control panel (ocr)

Digital Te chnicaljo urnal Vo l. 2 No. 4 Fall /'J')Ii 109 VAX 9000 Series � � TO OTHER POWER BACKPLA ES POWER BACKPLANE . � POWER BACKPLANE

(f) a: 0 (.) <( Cll a: a DODO (f) (f) CX)CX)CX)CX) z :::J"'"' � "'"' (f) (f) a: (l_('-1'-1'-1'-1'-<(<( w (f) 0 UI II II aim(l_ 1- > 3: (f) 0 0 :.E N --' CXl LL a: "' w 1'- a: I I ;;:{ 1-

I BULKHEAD f.tt IORIC BACKPLANE IORIC BACKPLANE � BULKHEAD I

(f) (f) l0 l ! a: !llllla: JJJ Ol 0 0 1- 1- � Cll U1'

r-�(f) 1- 3: (.) � a: (f) w a: w w � Cll z (l)<( 1- a: 1- (l_ (f) (f) � Cll � w z (f) (f) (l_ <(<( Cll 0 <(� � � (f) (f) (f) (f) r- 3: 0 mm �<(�� 0 (.) aim 0 � (l_ (f) <( 0 mm aim 0 0 :::;: :::J 0 g '-'- (.) (l_ (.) 0 :::;: (f) 0 '

BBU I �Lf ,(,-1 BBU 1 TEST SW _.J OCP I. SIP 1-J BBU 2 I ,...... H � ----1 I"""T"""'PEMt"""T""'T"t tt-_0 SPM 1 - ��OTHER ,_:___ - H7214 SPU MODULES

SPU Bl BACKPLANE

KEY:

RICBUS •TWO-WAY SERIAL COMMUNICATION • POWER SYSTEM MASTER CLOCK • CONTROL SIGNALS D PCS MODULES D OTHER SYSTEM COMPONENTS

Figure 7 PCS Block Diagram

110 Vo l. 2 No. 4 Fa ll /')')IJ Digital Te cbnica/journal Th e Unique Fe atures of the VAX 9000 Po wer Sy stem Design

Comparison of PCS and EMM rectly. The PEM then returns a status byte, which The differences between a power system that uses describes any error that occurred during command an EMM and one that uses the PCS are illustrated in execution, to the service processor unit. Table 1, which details the functions of each system. The PASSTHRU command allows the service pro­ Five-to-ten EMMs would be required to control and cessor unit's software to send commands directly monitor a system of the size and complexity of the to the specified regulator interface card modules. VAX 9000 system. Additional modules would be This command bypasses the PEM and allows the required to support some of the fu nctions provided operator to use other PCS fa ult isolation fu nctions by the signal interface panel and PEM module. that are not used by the service processor unit in normal operation. The PASSTHRU command is used Command Set Enhancements In comparison to for fault isolation purposes only and is not required the EMM, the PCS offers an enhanced command set. for normal system operation. The I'Ei'<'l commands of READ, WRITE, BIS, BIC, and MEASURE provide the same capabilities that the Measurement Accuracy The EMM.'s best measure­ seven EMM commands READ, WRITE, BIS, BIC, ment of accuracy is ±54 millivolts, which is MEASURE, EXAMINE, and DEPOSIT do. In addition, achieved when it is using an 8-bit analog-to-digital the PEM supports six commands that are not imple­ converter. The CPURIC measurements are sub­ mented by the EMM command set: DOWNLOAD, stantially more accurate and repeatable for several MAHGLN, SENSE, POWERON, POWEROFF, and reasons. The CPURIC uses a 12-bit analog-to-digital PASSTHRU. converter that is calibrated for offset and gain by the The regulator interface card firmware supports automatic calibration routines that run during the DOWNLOAD command, which allows the power-up self-test. To filter out noise, each parame­ service processing unit's software to update, with ter is measured 64 times. These measurements are some restrictions, the PEM 's or regulator interface averaged by the firmware before the parameter is card's on-board EEPROM with new firmware. Thus, used by the monitoring or sense commands. the need for Customer Services to replace EPROMs Through comparison measurements with a volt­ in the field if the firmware needs to be updated is meter and a thermometer, the CP RIC measure­ reduced because the latest PCS firmware is stored ments have proven to be repeatable. Also, the on the service processor unit's load device. measurements are accurate to better than 10 milli­ The MARGIN, SENSE, POWERON, and POWEROFF volts, when measuring voltage, and to within one commands off-load work and complexity from the degree Celsius, when measuring temperature. service processor unit's software. By using these Diagnostics una Testabili�y Support The EMM commands, the service processor unit never needs provided some visibility into the power and envi­ to interact directly with the regulator interface card ronment system of the VAX 8000 series of computer modules during normal system operation. Thus, the systems to aid diagnostic and testing. The hard­ amount of software required by the service proces­ res ware and firmware extend the fu nctionality of the sor unit to perform these functions is reduced. All EMM with fe atures such as hardware loopback regulator interface card interaction is handled by circuitry which, when combined with di;tgnostics the PEM firmware. included in the firmware, provide better fault detec­ The MA RGIN command causes the PEM to margin tion and isolation than the EMM. the specified bus voltages by ± 5 percent for fa ult isolation pu rposes, such as trying to aggravate an Enhanced Supportfo r Increased Sys tem Availa bili�J' intermittent CPU hardware problem by reducing a The features designed into the power system and

logic supply voltage by 5 percent. In response to the the res hardware and firmware support N + 1 SENSE command, the PEM returns a record that con­ power buses and bias power supplies. The PCS also tains the specified pO\:ver or environmental data to supports the partitioning of power. The PCS allows the service processor unit. certain cabinets in a VAX 9000 Model 430 or Model The I'OWERON and POWEROFF commands cause 440 system to be powered-off for maintenance or the rEM firmware to turn the specified power buses repair, while the remainder of the power system on or off in the proper sequence. When executing continues to function to provide system availability all of these commands, the PEM fi rmware must send at reduced performance. The PCS recognizes when messages to one or more regulator interface card power is reapplied to these cabinets ancl notifies the modules and perform extensive error checking to service processor unit. The system then can be verify that the power sequencing is proceeding cor- reconfigured to include these cabinets.

Fall Digital Te cbnicaljounzal vb l .2 No. 4 /'J'JII Ill VAX9000 Series

Table 1 Com parative Functions of the Environmental Monitoring Module and the Power Control System

EMM in the VA X 8600 System PCS in the VA X 9000 Model 440

Digitally controls seven DC-to-DC regulators Provides analog and digital control of up to configured in five buses 29 H7380 DC-to-DC reg ulators, configured in eight power buses Measures and monitors four cabinet air Measures and monitors ten cabinet air temperature temperature thermistors thermistors

Monitors two air flow sensors Monitors 20 air flow sensors Controls and monitors one H7231 battery Controls and monitors two H723 1 battery backup unit backup units Measures and monitors one ground current input Measures and monitors two ground current inputs Provides voltage sequencing in hardware Provides voltage sequencing in hardware and software

Displays up to 16 unique shutdown codes on four Displays over 80 unique shutdown codes on a magnetic indicators diagnostic display Measurement accuracy Measu rement accuracy voltage: ±54 millivolts voltage: ±10 millivolts temperatu re: ±2 degrees Fahrenheit temperature: ±1 degree Fahrenheit Digitally controls ±5 percent voltage margining Provides analog and digital control of ±5 percent for eight DC-to-DC regulators voltage margining for eight power buses

Measures and monitors 12 DC-to-DC regulator Measu res and monitors eight power bus voltage outputs voltage outputs

Monitors ten DC-to-DC regulator "mod ule OK" Monitors 29 H7380 DC-to-DC regulator signals "module OK" signals

Monitors 36 H7382 bias power supply "module OK" signals

Monitors up to ten H7214 and H721 5 "module OK" signals used for 1/0 and service processing unit power

Monitors up to 16 H71 89 "mod ule OK" signals in the optional bus interface expansion cabinets Provides bus overcurrent protection and monitoring for eight power buses

Measures the output current from 29 H7380 DC-to-DC converters

Provides N+1 support for eight power buses Monitors the environmental status from four optional bus interface expansion cabinets

Monitors the status from three H7386 overprotection modules Monitors seven status lines from each of the two utility port conditioners Controls and monitors the operator control panel

The VA X 8600 system consists of two cabinets The VAX 9000 Series 440 is a quadruple CPU of which one was monitored by a single EMM. configuration of up to eleven cabinets. A PCS configuration composed of eight CPURICs, two IORICs, one operator control panel, one signal interface panel, and one PEM is required to support this system.

Fa /1 /')1)(1 112 Vu / 1 Nu. i Digital Te ch nicafjournal The Unique Features of the VA X CJOOO Po wer Sys tem Design

Desif!.n fo r Further !mprouement The Ei"vHvl uses Because simulation models for many of the cir­ the acrual analog-to-digital converter ro represent cuit components were nor yet available, we cou ld temperature, voltage, and current. However, the nor simulate the design. Therefore, we built the PCS represents voltage, temperature, and current in circuit without simulation. The resulting frequency a format rhar is independent of the actual analog­ response was lower than expected, and the circuit ro-digical converter va lues. Future upgrading of tended to oscillate ar the maximum output current measurement circuitry can be done without modi­ limit. Multiple attempts to improve the hardware fying the service processor unit's software. proved ineffective and time-consuming because we did nor know the cause of rhe problem. Then, rhe Power .�)!stem Te st Programs We developed exten­ actual schematic of the linear regu lator controller sive power system test programs by using the internal circuit became available, as did SPICE mod­ programmable console command language. These els of components. scripts provided step-by-step control of power We ran an accurate SPICE model bur did not find sequencing and margining, and proved extremely anything outstanding on the gain/frequency plots. invaluable in processor system debugging, system Next, we tried ro find rhe cause by exaggerating qualification, and manufacturing and fidd resting. some simul:.tred changes. such as removing the The rests were developed through a cooperative current limit amplifier portion of the circuit from effort of design engineering, manufacturing engi­ the controller. With this change, we fo und that the neering, and fi eld engineering. gain was close to being the same ar two different frequencies, '5 kilohertz (KHz) and 40 KHz. This Simulation similarity meant that if the phase margins were The usc of simulation in power converter design correct, instability might exist. To prevent this possibility, we decided ro increase the gain of the is not as advanced as the use of simulation tools in regulator circuit belo'vv 30 KHz by making simu­ digital circuits. The level of complexity and number of parasitic clements in power devices have pushed lated modifications ro the circuit. With these modifications, the gain plot below 30 KHz computer CPU requirements beyond the reach of increased and rhe waveform evened out close to many power circuit design groups. However, as what we wanted it to be. We then modified the more computer power is becoming avai lable ar a lower cost. simulation is being used increasingly ro hardware and achieved the desired performance. improve power circuit design . The simulation tool However, we would have saved a substantial most wide ly used is Simulation Program with Inte­ amount of rime if we could have simulated the cir­ cuit before we built it. grated Circuit Emphasis (SPICE) bec:.t use of its abi lity to be configured ro :.tny circuitconfiguration.2 In this section, we illustrate the benefits of sim­ Improving Simulation Accuracy ul:.trion in the VAX 9000 power system design. We In switching regulator design, parasitic (small, provide examples of the use of simulation for cor­ undesirable but existing) elements of seemingly recting designs, improving circuit designs through negligible values, such as printed circuit board etch inclusion of parasitic elements, and transient inductances and transistor capacitances, can have analysi.'i. a significant impact on the behavior of rhe circuit. For accurate simulation these elements must be Simulation to Correct a Design included in the simulation models. An example is Simulation was used to correct a design in the linear shown in the design of the output stage of rhe post regul:.ttor rhar was dn·doped for the H7:)H2 H73HO regu lator. bias supply used in the VA X 9000 power system. We wanted the regu lator ro rake a high-voltage The design required that rwo regulators operate in DC input and produce a low-voltage (i .e. , 3.4-volt parallel for redundancy purposes. We w:.tnted ro DC to 5. 5-volt DC) regulated ou tput. Figure 8 shows achieve good transient response by keeping the how this process is done by changing the high­ output volt:.tge within operating tolerance should voltage DC input voltage to an AC square wave one of the rwo regulators fa il. fkcause good tran­ through turning the transistors, Q I and Q2, on and sient response depends on good frc..:qucncy loop off. The transformer. T I. steps down the AC square response, we had ro determine the optimum fre­ wa\·e and is followed by an output section for recti­ quency response for the circuit. fication and fi ltl'ring.

Di�ita/ 'f(:chnicaljournal l·bl. .! Nn. · I /·/ill I'J'JO I I.) VAX 9000 Series

300VDC ---,------, 50V 5V

w D2 0 MUR460 > Ci (jj T1 s 0 > .. TO RECTIFIER 0 AND FILTER � (f) II s 0 >

OV

D1 MUR460 TIME (200 NANOSECON DS/DIVIDE)

Figure 10 Vds (Q l) Measured Tu rnoff

Figure 8 H7380 OutputSwitching Stage Figure 11 shows a more accurate model of the output stage because the L 1 through L4 etch induc­ The initial model of the H7380 inverter stage used tances and C 1 and C2 transistor capacitances are simple component models and did not consider any included. The current source, !PULSE, and the printed circuit board inductances or transistor resistor, RT , approximate the transformer. Figure 12 capacitances because they seemed negligible com­ shows the result of the simulation model that pared to other elements. We noted a discrepancy in includes the L and C values shown in Figure 10. the voltage across the transistor Q 1 (Vds) during the When the simulation and the measured data are turn-off process between the simulated waveform, correlated, the advantage of accurate simulation shown in Figure 9, and the measured waveform, becomes apparent. By using worst-case values fo r shown in Figure 10. the circuit parameters, the simulation can deter­ Figure 9 shows that the voltage is initially zero mine the maximum peak voltage. The model while the transistor is conducting but rises to 200 depicted in Figure 12 shows that a device capable volts when the transistor is turned off. Figure 10 of withstanding the expected 240 volts is needed. shows that ringing occurs as the voltage approaches Reliance on a less accurate model without para­ 200 volts, with an overshoot to 240 volts. The sitics could lead to the selection of a device capable ringing and overshoot, not shown in Figure 9, of withstanding only 200 volts. Thus, accurate are caused by the circuit board inductance, trans­ simulation allows the correct components and former leakage inductance, and the capacitance of component ratings to be chosen and ensures a the transistor. robust design.

PLOT 1 TIME V(40,3) Tra nsientA nalysis

2.50 A memory system that includes dynamic random­ access memory (RAM) chips presents a difficult 2.00 transient load problem to its power supply. The 1.50 X problem arises from a combination of very high (f) changes in dynamic RAM supply current and cur­ � 1.00 rent change rise times that are typically more than � 0.50 a thousand times faster than the reaction time of a

o.oo L.....__ ...... __ .....,_...�.____.___ _.___ _. power system. The result is a temporary change in 0 2 4 6 8 10 the load supply voltage. To handle these fa st current 7 SECONDS X 10' edges, high-frequency capacirors are mounted on memory boards near the dynamic RAMs. Also, low­ Figure 9 Vds (Q l) Simulated Tu rnoff frequency, electrolytic capacitors, which provide a without Parasitics source of local charge storage, are mounted on the

114 Vo l. 2 No. 4 Fa l/ 1')')0 Digital Te cbnicaljoumal The Unique Fe atures ofthe VAX 9000 Power System Design

PLOT 1 TIME V(40.3)

2.50

L1 2.00 15NH 0 1.50 X Ul 1.00 � D2 0 MUR460 > 0.50

0.00

+ L3 0 2 4 6 8 10

SECONDS X 10 7 L2 25 NH 25NH Figure 12 Vds (Q I) Simulated Tu rnoff D1 MUR460 with Pa rasitics

frequency capacitors were also proposed. The power supply was expected to be ready for load L4 testing before the memory or the busbars would 15NH be available. Therefore, we had to verify that this design could keep the memory supply voltage within operating tolerance. We verified the design SIMULAT ION MODEL OF OUTPUT CIRCUIT WITH PA RASITICS by simulating the performance of the power system and measuring the performance of the actual power supply with a simulated load . Figure 1 I Final Model ofH7380 Output Switching Stage Power Supply Operating Vo ltage To lerance The memory designers specified the operating tolerance memory boards to handle the magnitude of the of the dynamic RAM supply as + 5 volts, ± 10 per­ change. The capacitors help keep the supply voltage cent. Using 10 percent as the supply tolerance within its operating range until the power supply budget, the supply designer made the allocations can react and sufficiently change the current it sup­ shown in Ta ble 2 to all the factors that would cause plies to the memory to stabilize the supply voltage. the load voltage to deviate from its nominal value of An adequate supply design with specified capaci­ + 5 volts. As can be seen from this table, the sum of tors can keep the supply voltage within its operat­ x andy must be less than 350 millivolts or 7 percent ing tolerance. Simulation is used to determine the of + 5 volts. correct mi..'< of high and low frequency capacitors and the number of regulators required to support Memory Load The dynamic RAM supply current this high transient load. was calculated ro be a steady-state pulsed current Another power supply problem arises from the of 256 amperes that would last for 92 nano­ use of N + l redundancy for parallel regulators. seconds (ns) and with rise and fa ll times of 20 ns, When one of the regulators in a parallel regulator as shown in Figure 13. The initial pulse magnitude configuration fa ils, the remaining regulators must was 1024 amperes. be able to rake on the load from the fa iled regulator and keep the supply voltage within operating toler­ Table 2 Supply Tolerance Budget Allocation ance. Because the remaining regulators cannot react instantaneously, the load voltage drops until a Causes of Percentage Voltage Deviation Millivolts of +5 Volts sufficient increase in current can be provided by the remaining regulators. Regulator tolerance 100 2 For the VAX 9000 series memory system, a pro­ Back panel distribution 50 posed dynamic RAM power supply design consisted Tra nsient load with two X of three H7380 DC-ro-DC regulators, which would reg ulators

operate in parallel (including N + I redundancy) Failure of one reg ulator y and be connected to the memory through power To tal deviation budget 500 10 distribution busbars. The numbers of high- and low-

Digitlll Te chnicaljo urnal Vol. J No . 4 Fa /1 11)')0 liS VAX9000 Series

modeled as a current source, Gout, controlled hy

the regulator fe edback voltage, Yf Cout and Rout represent the regulators combined output capaci­ tors and resistors. Most of the other elements in the OA model are determined from component specifica­ I-288 NS� tions. The relationship between Gout and Vf was f--12.96 MICROSECONDS � determined by laboratory measurements on a regu­ KEY: lator and resulted in the following equations. For A - AMPERES NS - NANOSECONDS two regularors,

Gout = 339 X VJ= 339 X (V8 - 2.5) Figure I3 VAX 9000 Model 400 Series Mem ory For three regularors, Power Sys tem Dynamic RAMLoad Gout = 678 X Vf = 678 X (V8 - 2.5)

Memory Po wer Sys tem SPICE Mo del In the SPICE The load is represented as two current sources, lA model of the supply, busbar, load and capacitors and IR, the characteristics of which were obtained that is shown in Figure 14, the three regulators are from the loads shown in Figure 13.

21

ROUT

GOUT lA IR RESR -.l, -j.,

C2

VR =

DC

- - KEY:

R1 1 2 10K GOUT 21 22 12300U IC=5.0 C1 2 0 0.6N IC=2.5 RESR 22 23 1M R2 2 3 10K LESL 23 0 2.4N R3 3 4 20K RBB 21 24 300U C2 3 4 18P IC=5.0 LBB 24 1 150N R4 4 5 1K CHF 1 26 1.3M VR 6 0 DC 5 RHF 26 27 21U R5 5 7 2K LHF 27 0 1.4P C3 7 8 68N IC=3.0 CLF 1 25 108.8M VG 8 9 DC 2.5 RLF 25 28 400U R7 9 0 10MEG LLF 28 0 0.3N R6 9 10 10K \A 1 0 PULSE 0 512 A ONS C4 10 0 0.757N 20 NS 20 NS 92 NS 288 NS GOUT 0 20 POLY(1) 10 0 0 678 IR 1 0 PULSE 0 512 A 0 NS D1 20 21 DIODE 20 NS 20 NS 92 NS 12.961' S ROUT 21 0 17K

Figure I4 SPICE Model of VAX 9000 MemOtJI Power Sys tem

2 116 Vol. No. 4 Fa/1 1')')0 Digital Te cbnlcaljournal The Unique Fea tures of the VAX 9000 Power Sys tem Design

Simulation and LahoratmyMea surements The Fa ilure of One Regulator When one of the three two previously stated conditions of interest result­ regulators fails, the other two regulators cannot ing in large load voltage changes are the transient meet the increased load instantaneously. As a result, load with two regulators and the fa ilure of one the load voltage drops until the two regulators can regulator. increase their output current sufficiently to reverse For transient loads, a larger voltage change the direction of the drop. The SPICE model for this occurs with two regulators rather than with three condition was run and the load voltage of the drop because two regulators take longer than three to was predicted. Laboratory measurements were adjust the supply current to the new load value. then taken with the simulated load and one regu­ lator was turned off. Both the predicted and mea­ Simulated Load For laboratory measurements, sured waveforms had the same shapes, peak the actual dynamic RM•I load, as shown in Figure 13, magnitudes (100 millivolts), and times of occur­ is difficult to design and build in a reasonable time rence of the peak (200 microseconds) after the because of the magnitude and rise time combina­ regulator was turned off. Therefore, we concluded tion. However, a load with a much slower rise time that the proposed design could meet the load could be easily built. Such a load, (I in Figure 14) is requirements. expected through the busbar as the capacitors and busbar slowed down the fast edges of the dynamic References RAI'vl loacl . This simulated load was built and con­ nected to two regulators. The predicted waveform I. P. O'Connor, Practical Reliabili�J' Engineering and the measured waveform showed that the initial 2d ed. (New Yo rk: John Wiley and Sons, 1985). shapes of the peak change, the peak magnitudes 2. SPICE is a general-purpose circuit simulator (80 millivolts), and the times of occurrence of the program developed by Lawrence Nagel and peak (300 microseconds) were all similar. However, Ellis Cohen of the Department of Electrical Engi­ we could not measure the overshoot and ringing neering and Computer Sciences, University of after the peak because the busbar was not available. California, Berkeley.

Digital Te cbnicaljounzal Vo l. 2 No. 4 Fa ll 1')')0 117 Donald F. Hooper John C. Eck j

Sy nthesis in the CAD Sy stem Us ed to Design the VAX 9000 Sy stem

Th e design of the VAX9000 system represents a sixfo ldinc reasein complexity over the VAX 860018650 sy stem. Th is increased complexi�y posed a significant challenge because of the concurrent need to shorten the duration of the project design cycle and convert all high-performance systems computer-aided design (CAD) softwarefr om the DECS YSTEM-20 system to the VAX system. As part of the task of meeting these challenges, the CAD Group proposed the implementation of a design methodology that used logic �ynthesis fo r the first time in the development of a majorproduct fo r Digital. Th e primary objectives of this methodology were to increase the productivi�J' of the logic designers and to reduce the number of erro rs introduced during conversion of high-level designs into gate-lel!e/ structuraldesi gns.

Methodologies transformations of Boolean logic to reduce gate counrs and improve critical timing paths.1 How­ Previous Me thodology ever, this program has had only limited success and In the previous development methodology, as is not really usable as a released computer-aided shown in Figure I, logic designers specified high­ design (CAD) product. For example, the program level designs on paper, and simulation engineers does not deal with selections of cells for com­ transferred this rendition inro a behavioral model . binational logic nor does it consider the myriad Te chnology engineers developed the gate-level problems involved in assembling a database for a cells. After the cells were defined and characterized buildable gate array chip. for fu nction and timing, the logic designers gener­ During 1984 and 1985, new ated schematic drawings by using graphical bodies (AI) and synthesis ideas were being developed. Uni­ that represented the cells. versities and technical communities were exploring As changes were made to the schematics, the sim­ the potential of object-oriented databases, rule­ ulation engineers attempted to reflect these in the based AI, data flow design entry, and algorithmic behavioral model . Finally, a gate-level simulation minimizations. We began the prototype develop­ model was assembled from the completed schemat­ ment of our system for integral design (SID) at ics to verify that the design represented a valid VAX approximately the same time as the ideas for the system. This process was extremely laborious, VAX 9000 hardware architecture were beginning to error-prone, and rime-consuming. Therefore, we be developed. In 1985, the SID program became an concluded it could nor be used to develop the VAX internal CAD product for use in the development 9000 system, which is a 700,000 gate design and for of the VAX 9000 system. By combining the most which the technology cells would not be defined ad vanced rule-based AI techniques with an object­ and characterized until late in the design stage. oriented database, the core SID was designed to be a repository of logic design knowledge. We hoped Logic Sy nthesis that, over the years, SID wou ld mature to perform Our early research into logic synthesis began in many highly repetitive logic design tasks at an 1982. Over the next two years, we explored new expert level. synthesis ideas and constructed prototypes to From 1985 to 1988, the capabilities of the SID sys­ determine the feasibility of those ideas. For exam­ tem gradually improved until it was producing gate ple, one of our early logic minimization efforts was array chips that met the VAX 9000 machine cycle a program that emulated Brown's Laws of Fo rm for time, power, and electrical rules requirements.

118 Vol. .2 No. 4 Fa ll 1990 Digital Te cbnicaljournal Sy nthesis in the CA D Sys tem Us ed to Design the VAX 9000 System

TECHNOLOGY CELL DEFINITION

BUG REPORT TECHNOLOGY BEHAVIOR MODEL CHARACTERIZATION TEXT EDIT

BUG GATE-LEVEL REPORT SCHEMATIC ENTRY

PLACE ROUTE

BUG REPORT GEN ERATED

Figure 1 Previous Design Methodology

New Methodology technology engineers are defining the technology The VAX 9000 development methodology, shown cells. In parallel with these activities, synthesis in Figure 2, circumvents the need to wait for the knowledge engineers are writing rules to transform technology cells to be completely specified before the RTL design into technology cells. These three beginning logic design. This methodology uses activities should be completed at the same time, schematic entry and simulates the technology­ at which point, synthesis produces each of the independent, register transfer level (RTL) bodies. VAX 9000 system's 77 gate array chips. The goals The RTL library for this type of entry includes for the synthesis program were to MUXes, latches, adders, comparators, incrementers, • Simplify design entry and thereby reduce sche­ decoders, and simple Boolean gates. The entry is matic complexity by a fa ctor of 4 extracted to a common database format, called CADEX , from which a simulation model is built. A • Generate 90 percent of the VAX 9000 system's behavior modd still exists, hut its hierarchy logic through synthesis matches the RTL schematic hierarchy at key physi­ • Reduce the number of simulation errors intro­ cal boundaries. Thus, simulation models can be duced in the design built that consist of a hierarchy of mixed behavior and RTL models. • Reduce the number of electrical ri1les violations While logic designers are creating the RTL design, in the design

Digital Te clm.icaljournal Vo l. 2 No. 4 Fa/1 /'J()O 119 VAX 9000 Series

To generate a database for a buildable gate array • Make it easy to detect whether the tool has per­ chip, the synthesis tool is required to formed well

• Read technology-independent input standard • Simplify the improvement of the tool net list fo rmat, which can be in OECSIJVI behav­ ioral notation or CADEX common database SID Database format The design of the SID database is fundamental to the robustness of the CAD system. Previous CAD data­ • Minimize Boolean gates through state-of-the-art bases have all assumed that the data is stable at the minimization techniques time that the CAO tools are working with it. Simu­ • Improve timing-critical paths through Boolean lation, timing verification, design rule checkers transformations, cell/pin selections, power set­ (ORCs), and many other CAD tools assume that net tings, and net load allocations lists and components are fi xed and unchanging. In synthesis, although the data is maintained in • Choose the best available technology cells based a form that makes it easy to update its parameter on timing, size (area), and power estimates values, the basic structure of gates, pins, and nets • Insert the clock system for the gate array chip remains the same. However, throughout most of the synthesis process, the basic structures are in a state • Insert testability access logic for the service pro­ of change. In fa ct, it is a characteristic of synthesis cessor unit that logic functions are removed and replaced with • Obey all electrical design rules for the gate array new, fu nctionally equivalent logic. Because of this chip difference, we designed basic data structures and

BUG REPORT TECHNOLOGY BEHAVIOR MODEL CELL DEFINITION TEXT EDIT

RTL BUG REPORT TECHNOLOGY SYNTHESIS RULES SCHEMATIC CHARACTERIZATION TEXT EDIT ENTRY

SYNTHESIZE PLACE (LOOP BACK) ROUTE SET POWER

BUG REPORT GENERATED

Figure 2 VAX 9000 Deuelopment Me thodolof!J'

120 Vo l. 2 No. 4 Fa ll f

manipulation functions that would allow eff icient hierarchy) looking for electrical rules violations or removal and replacement of logic. logic fu nction redundancy, and resting for timing­ We did use the primary objects of other CAD critical path relationships. To perform these tasks, systems: gates, pins, and nets. However, we made a we added a series of multidirectional pointers to the distinction between the definition of an object and SID database objects by using LISP capabilities. its use or instance. Also, because we wanted these When an object is declared as a symbol in the LISP objects to be used at very high (i.e. , behavioral and programming language, pointer management is RTL) levels and at the gate level, we renamed them included automatically. The LISP langu:tge is well as models, ports, and signals. The primary database known in the industry for irs use in AI applications, objects for Sll) are bur it has a reputation as being slow. Our special handling of direct database pointers enabled us to • Modeldef. The modeldef is the definition of a produce a application that resulted in excellent logic fu nction element. Analogous to a vendor LISP run-rime performance. data sheet, modeldef contains parameters that Once the data structures and their pointers were describe its function, timing, power, size, and defined, we began to create a rich set of database other general information. All bounded blocks access functions that had to be fa ilproof. Therefore, of logic fu nction, from high levels of hierarchy we wrote functions insert and remove the (e.g., floating point unit) to low levels (e.g., to instance objects to ensure that the database pointer simple Boolean gates), are kept as modeldefs. connectivity was properly maintained. These func­ Typically, modeltlefs are used multiple times and tions allowed us to effectively perform a many­ used more at the lower levels of the database fo r-many replacement of modelinsts with a single hierarchy. For example, in the VAX 9000 system, command. there are two cache data multichip units, eight Other secondary objects were defined to contain multiplier chips, and many thousands of two­ such types of info rmation as synthesis knowledge input NOR gates. (i.e., rules and groups of rules), general technology • Modelinst. The modelinst is a use of a modeldef characteristics (i.e. , the maximum number of cells that contains only those parameters unique to on a chip), and general project-specific character­ itself. For example, two instances of a two-input istics (i.e., the cycle rime of the machine). OR cell may be in different places on a chip The synthesis knowledge in the form of rules and, therefore, have diffe rent placement desig­ occupies the majority of SlD-compiled code and nators and timing characteristics. Each mod­ over 10 megabytes (MB) of run-rimememory. elinst points to its motleldef definition to inherit Rule Language the set ofcommon definition parameters. Based on research and the perceived complexity of • Porrdef. The portdef is the definition of an inter­ the task at hand, we estimated that, to perform syn­ fa ce to or from a modeldef. Portdef contains thesis at an expert level. possibly thousands of rules parameters that describe its fu nction, timing, wou ld have to be written. data width, and other general information. Jn researching current AI literature, we deter­ • Portinst. The porrinst is an interface to and from mined that existing rule languages were either too a modelinsr. Portinst contains parameters unique cryptic or too verbose to allow us to write and to itself, such as timing and power settings. maintain a large rule set in a short time frame. Also, Each poninst pointsto its corresponding portdef we preferred to write more powerful rules than definition to inherit the set of common defini­ those of previous rule-based systems. We wanted tion parameters. each rule ro be used fo r making complex decisions and logic transformations based on timing, size, • Signal. The signal is the means of connectivity power, and logic connectivity. The rule does not among modclinsts and between hierarchical "think." Instead, it mimics a logic designer looking partitions. As shown in Figure 3, this connection at the characteristics of some pre-existing design, is established through the interface portinsr or who then changes the design to improve it or make portdef. For behavioral logic, the signal acts as a it more compatible with the new technology. The data flow arc; for RTL logic, the signal acts as a bus; rule, for example, tests whether A and B are true, and for gate-level logic, the signal acts as a net. and if so, performs transformation C. Synthesis rules must he able to walk the database Based on these needs, we began deve.loping the in any direction (i.e. , backward, fo rward, through language Ruldorm as the means for approxi-

\'()/. Fall 1')')11 IJigilal Te cbn icaljournal J tYo. -1 121 VAX 9000 Series

KEY:

P = PORTDEF p = PORTINST

Figure 3 SID Database Objects

mating the designer decision and logic synthesis For more complex operations, we also allowed task. With this approach, the rule would mimic LISP functions to be called by prefixing them with what a logic designer had done once and that action the keyword LISP, or by insertion of a LISP expres­ could be repeated again automatically in similar sion. Thus, if the rule language cannot implement circumstances. a required function, a LISP algorithmic routine is In Ruleform, rules have a left side for decisions cal led. We used algorithmic transforms in the gener­ and a right side for transformations, e.g., OPS-5, as ation of adder carry-lookahead. do other rule-based languages. However, to make rules easier to read and write, Ruleform uses English Rulefo rm Database Access language sentence structures to describe both tests Because the database could be traversed in any and actions. The following predicate forms are used direction for any arbitrary distance through the for left-side rests: multidirectional pointer system, rules had to have • Dbobjecr verb the same traversal capability. Therefore, the

• Dbobject verb dbobjecr dbobject of the Ruleform language is a shorthand notation of the "database walk." Dbobject can be • Adjective dbobjecr verb used in a sentence to compare two database objects • Adjective dbobject verb dbobject by walking to both of them and using a predicate

Ve rbs are words such as IS, ARE, =, >, for the comparison. IS_BOOLEAN, !S_A_NUMBER; adjectives are words Had the database access been implemented in such as ANY, ALL, NO. Dbobjects are database pure LISP programming notation, the sentence objects or the parameters of these objects. form would be lost in the many levels of expres­ The command forms used for right-side actions sions enclosed in parentheses. One test would are corrunaml dbobject and command dbobject occupy many lines of code and would read more preposition dbobjecr. Commands are words such as like a software program than an English sentence. INSERT, REMOVE, REPLACE, MODIFY; prepositions In this case, the chain of thought of the rule writer, are words such as WITH, TO , FROM. The dbobject the purpose of which is to capture the step-by-step can be any of the primary database objects, sec­ thoughts of a logic designer in words, would proba­ ondary objects, or their parameters. bly be broken.

122 Vo l. 2 No. 4 Fa ll 1990 Digital Te cbnicaljournal Sy nthesis in the CA D Sy stem Us ed to Design the VAX 9000 Sy stem

To improve the comprehension of the notation computer-based tool to develop a design, it must be used for identifying the database object, we devel­ able to measure cell counts, power and timing, and oped an .. compare alternative implementations against bud­ notation for the walk to a database object. We also gets of cell counts, power, and timing. To find the developed functions that would compile this nota­ critical path, a computer-based synthesis tool must tion into a LISP expression, complete with all the perform timing analysis in the same way that the appropriate declarations for the most efficient run­ traditional timing verification tool does. In a sense, time performance. Figure 4 shows the use of this the synthesis tool must preverify the decision notation. before casting the synthesis transformation in con­ Further, we incorporated into Ruleform a param­ crete. Therefore, for a computer to do logic design, eter definition mechanism that allowed us to define we had to analyze the steps that had become auto­ any arbitrary parameter, name what object it matic and intuitive, break those steps down, and would attach to, and then use the name of the formalize them in minute detail. parameter in the Ruleform database access. This greatly expanded the role of synthesis in that it Rule Example could now be used for passing controls and infor­ Consider an example of a simple cell-mapping rule. mation to other CAD cools through parameters. The purpose of this rule is twofold: pick the most Parameters relieved the designers of much tedious appropriate cell fo r a configuration of Boolean work, such as identifying clocks, logically equiva­ gates and attach the most critical path signals to lent signals to the placement program CUT , and the those input portinsts that have the fa stest propaga­ parity generator and checker signals to the diagnos­ tion delays through them. tic program, called HIDE. A designer might determine the critical path from experience or through trial and error. The designer Wr iting the Rule also might actually count loads on signals and add Many of the tasks logic designers perform become estimated signal delay to gate delay of all paths automatic and intuitive over time. However, fo r a that might involve the timing-critical piece of logic.

means

MOD EL g et the name of the modeldef of the current instance

INPUTS g et the inputs of the curren t model inst

SIGMAL -2MD- IMS g et the sig nal of the second input of the current modelinst

IMSTAMCES-DR IVERS-SIGMAL -21'1D-IMS g et the instances whose outputs are the driving pe rt inacities of the sig nal of the second input of the current model inst

SOURCES g et the ins tances whose outputs are the driving pertinacities of the sig nals of the inputs of the curren t model inst

DUSTS g et the instances whose inputs are the load per tina cities of the sig nals of the outputs of the current model inst

MODEL-SOURCES get the name of the mode ldefs of the source model insts of the curren t modelinst

Figure 4 Example of a LISP Expression

Digital Te cbnicaljozwnal Vo l. 2 No. 4 Fa ll 1')')0 123 VAX 9000 Series

These alternatives are all very time-consuming. We The rule also checks whether signal outputs of decided a computer is best suited to do this type of the AND� have more than one load. The rule allows work. In SID, a timing analysis routine is run repeat­ the transformation if the critic1l path has more than edly, as the database changes, to set timing parame­ one load, but disallows the transformation if the ters on every portinst of the design. The product of noncritical paths have more than one load at the these calculations is a timing debt number set on output of the AN Ds. Thus, duplication of the source ewry portinst. If the number is positive, the path is AND logic is prevented except when absolutely nec­ over budget (i.e., is in timing trouble) by the number essary. e.g., to remove a wire delay to increase the in picoseconds given. If the number is negative, the speed of a critical path. path is under budget (i.e. , has slack). The timing debt number allows the rules to access the timing Organization of Rules into Rule Bases debt paramctl'fS tofind critical paths. The quantity of rules required that the rules be orga­ In the example shown in Figure 5. four Boolean nized into groups, c:tlled rule b:�ses. As we defined gates exist as a tree in the middle of a gate array cell. the minute steps of the logic design tasks, it was The dest-side gate is a three-input OR , and the apparent that groups of rules were separated by source-side gates are two-input ANDs. The entire levels of abstraction, as depicted in Figure 6. For cycle time of the machine depends upon the most example, a sequence of logic design can be charac­ timing-critical path, which runs through the first terized as a progression through the levels: input of the second AND gate. Because this rule replaces four gates, it has higher behavior ---7 RTL ---7 Boolean ---7 technology map ---7 wiring, tweak parameter set placement priority over other rules that replace fewer gates. ---7 ---7 ---7 When the rule arbiter is called with the OR as the route ---7 power au just current instance, the arbiter executes the left side of We organized the rules by type of activity into the rule (i.e., the first part up to the arrow). The left rule bases. We also developed a run-frame sequene<: side of the rule checks that the current instance is a process that would apply these rule bases. in order, OR three-input and all source instances are two­ from behavior through detailed adjustments :u the input ORs. It then chooses the most critical path technology level. The rule bases are from among the inputs of the sources and notes the other inputs of the sources that were not critical. • Behavior rule base, which contains rules to Because all of the tests in the left side of the rule expand behavior and RTL instances. These returned true, the rule is said to have "fired." The rules transform high-level instances of adders, right side of the rule may now be applied. incrementers, comparators, decoders, encoders, DECSIM The right side removes the current instance, and behavior expressions into generic i.e., OR , and inserts the cell with the most crit­ Boolean instances. They also perform simple ical path connected to the input that has a fast bit replication for data path instances, such as propagation delay to the output. By removing the 32-bit MliXes. OR , the destination of the ANDs is removed. • Optimize rule base, which contains rules to l�Ei'v!OVLIF_NO_DESTS then removes these r\NDs. transform Boolean logic for minimization or The actual rule that does this transformation is timing improvements. These rules performed the well-known !)'morgan, distributive. and Cdefrule "mapCELLn ") :�ssociative transformations to mold networks (mode l has_profi le '(or3) ) of generic Boolean instances into a configuration (al l models-of-1st-sources-of-inputs have_profile '(and2) ) that is best suited to map into the cells of the (found{critical }-inputs target technology. is_mos t-critical 1) • (any {tagged }-inputs Map rule base, which contains rules to transform are_not_in {critical}) generic Boolean and bit data path instances to --> technology cells. This rule base actually was (rep lace *instance• wi th divided into two rule bases, one that mapped out =

(remove_ if_dests sources ) • Wiring rule base, which contains rules to improve timing by loading and logic adjustments

124 Vo l. 2 No. 4 Fa lf J

-- -- MOST CRITICAL PAT H CELL N

Figure 5 Critical Pa th Map

and rul<::s to detect and correct electrical rules paths. The timing calculations for this rule base (ERC) violations. use actual routed wire delays.

• Tu ning rule base, which contains rules to adjust TheLar ger Physical Design Process power on intersections of multiple timing-criti­ SlO is part of a larger chip physical design process cal paths. that includes synthesis, placement, and routing. • Parameter rule base, which contains rules to set The entire process is linked together in a two-pass parameters fo r the placement and route pro­ process, called loopback. grams. These rules include setting parameters The first pass of loopback accomplishes three to identify logically equivalent signals, setting goals: the initial synthesis of the chip based on esti­ pin groups to force collections of rins to be mated interconnect delays, placement of those syn­ near each other in placement, and weighting thesis results, and routing, which includes accurate parameters to force timing-critical signals to interconnect delay calculations. be shortened. The second pass of loopback accomplishes the final synthesis with much more accurate inter­ • Placement and route, which are not rule bases connect delays and a high probability that the subse­ but CAO tools separate from the synthesis tool. quent physical instantiation will achieve all design Placement and route of Motorola Macrocell goals for timing, space, and power. The final synthe­ Array Ill (MCA3) gate arrays occur here in the sis made changes only where required. Final place­ overall design sequence. ment began with the results of the first pass of • Power rule base, which contains rules to adjust loopback, except where ch:mges were made, and power on gate array cel ls and cell output fol low­ routing rerouted only those nets that had been ers. After initial placement and routing, a more modified. Our objective was to limit the number of accurate assessment of signal delay can be made. passes through loopback to two and avoid endless The power rules track the power distribution cycles through the CAD tools. of the gate array cells and the contribution of The placement process itself consists of three each cell ·s current settings to the power budgets major phases. The global phase, called gravity col­ fo r ten regions of the chip . These ru les adjust lapse, attempts to achieve relative orientation of current upward to improve the sreed of critical the various gates and disregard density. The distri-

Digital Te cbnicaljournal 1-'r>/. 2 No. 4 Fall 1990 125 VAX 9000 Series

BEHAVIOR RULE BASE

ADDER. INCREMENTER, DATA PAT H BIT DIAGNOSTIC COMPARATOR, ETC. REPLICATION LOGIC RULES RULES RULES

OPTIMIZE RULE BASE Q (QQ I I

BOOLEANI TRUTH TABLE MULTIPLEXERI RULES RULES RULES

SPECIAL RULES

Figure 6 Knowledge Representation Hierarchy bution phase, called regridding, attempts to assign was finished. This approach avoided the problem the gates to available positions and maintain the of developing a "back-annotation" process, which desired orientation. The multipass final placement would be required if modification of the existing phase swaps cell locations, gates, nets, and pins to database were attempted and would be a complex reduce weighted net lengths but still adhere to the process, given the number of changes made. Also, technology-supplied design rules. because most of the schematic source design is in During the local placement phase, synthesis-sup­ RTL format, many, if not most, of the placement­ plied net weightings and equivalent net parameters introduced changes are not visible in the schematic. are utilized. The net weightings are part of the com­ Placement results were entered into our plex algorithm used to determine whether a poten­ internally developed routing CAD tool, called tial swap of a cell location, gate, pin, or equivalent Chameleon because of its ability to adjust to its net is beneficial. The equivalent net parameters environment. Chameleon is a highly rules and allow the placement program to detach a net from parameter-driven tool. It was used to route all one pin and reattach it to another pin to supply an 77 gate arrays, all 22 multichip units, and both the equivalent signal. This process was a particularly CPU and system control unit planar boards of the common occurrence because synthesis had to sup­ VAX 9000 system. For the 77 gate arrays in the ply the same or complementary signals to many VA..'( 9000 system, the router achieved berrer than destinations and still adhere to a technology-driven 99 percent completion. Further, the average net limitation of no more than four loads from any was routed to between 101 and 102 percent of one source. its Manhattan net length. (Note: Determination of Because the placement process introduces so Manhattan length is somewhat ill-defined when many changes in pins, gates, and nets, we felt it copper-sharing is allowed, and some net segments was prudent for the placement program to simply are common to multiple source-destination paths.) regenerate the CADEX database format when it The routing was so efficient, with regard to length,

126 Vo l. 2 No. 4 Fa ll/990 Digital Te chnicaljo urnal Sy nthesis in the CA D System Us ed to Design the VAX 9000 System

that the synthesis and placement programs could hoped to be able to use the same process in the new assume that determination of a net's end points process, but were prohibited from doing so by defi­ would effectively determine its eventual routed ciencies in the CAE2000 system. As a result, the length and, by extension, its interconnect delay. overall CAD process had to be repeated from the Upon completion of the actual routing, inter­ beginning many times. connect delay calculations were made for every net As with any new development tool, we experi­ source-load combination by a router-related pro­ enced setbacks. for example, we developed the gen­ gram. We did these calculations at this point in eration of an adder that algorithmically picked each the process for two reasons. First, all the necessary gate of the carry-lookahead fo r optimum configura­ data was readily available in the router's database. tion of the Boolean trees with respect to fan-out and Second, accurate delay calculations were needed path delays. We then wrote Boolean optimization by synthesis during the second pass of loopback to rules that attempted to merge ORs together as if no verify the assumptions made during the first pass fan-out existed between them. However, when we and make any adjustments necessary to achieve tim­ did this, the carry-generate, least-critical OR paths ing and power constraints. For those few connec­ merged, which forced the more critical AND-OR tions that were not routed completely, calculations combinations into simple one-stage cells, with extra were made based on the Manhattan length and a levels of wire delay between them. Within a few small contingency factor. weeks, we were able to correct the faulty rules to also consider the timing-critical paths, which Problems allowed the adder to improve along the carry­ Developing and using the new synthesis design propagate paths. Eventually, by working on the cell methodology was not without problems. We were selection, load allocations, and power setting, we able to fix some of these problems for the first were able to produce a 64-bit adder at 3.2 nano­ VAX 9000 system generation. However, because of seconds (ns) in the MCA3 technology. The best hand time and resource problems, others were deferred design was 3.3 ns for a 59-bit adder. At the time, to the next project. the logic designers estimated that an extra stage Digital's previous CAD system ran on a combina­ delay, or 3.7 ns for the 64-bit manual design, would tion of 36-bit DECSYSTEM-20 computers and 32-bit be required. VAX computers. In switching completely to the VA X Since synthesis was new, there was a great deal of system for all CAD processes, we had to rewrite skepticism as to whether it could perform as well as much of the DECSYSTEM-20 computer's existing a manual design. Some logic designers never gave it code and replace the Stanford University design a chance. Other designers encountered early prob­ system (SUDS) schematic drawing program with the lems with it or experienced schedule pressures and, CAE2000 system. In replacing SUDS, we lost nearly as a result, resumed using hand-design methods. one million lines of code, which was used for such However, most designers stayed with the process tasks as wire listing, drawing, back annotation, and until it produced acceptable results. In the process, electrical rules checks. Some of these were available they supplied feedback and algorithms that were in the CAE2000 system, but others had to be devel­ converted into additional SID rules. This work was oped external to that system. crucial to the evolution of the program. Only by We designed a common database format, called adding new rules provided by logic designers could CADEX, that allowed the CAD tools to communi­ SID be improved for future designs. The successors cate with one another. for example, a design could to the VA X 9000 system will reap the major benefits be extracted from CAE2000 drawings to CADEX, from this work, through improved designer pro­ which would supply the design to SID. In turn, ductivity and time savings. SID would write CADEX output, which would be The effect of timing constraints and the general read by the placement program cut. This program accuracy of timing calculations on synthesis results would then write CADEX output. and the cycle cannot be underestimated. We required that timing would continue. New libraries had to be created for budgets be specified to every chip to indicate the HTL schematic bodies and MCA3 cells. Data formats timing criticality of each I/O pin. The budgets were and parameters had to be defined for passing infor­ specified from the i/O pin to latches of either of the mation between the CAD tools. two clock phases (TA or TB). Default budgets were In SUDS, results could be written back to the applied on paths between latches that were con­ drawing program (i.e. , back annotation). We had tained on the same chip. If a budget was missing,

Digital Te chnicaljo urnal Vo l. 2 No. 4 Fall 1990 127 VA X 9000 Series

SIDconsidered rhar the speed of the path did nor Table 1 Ratio of RTl logic Complexity to matter. However, as we ran the paths without Gate Logic Complexity SID budgets, we frequently fo und that had designed RTL Logic Gate Logic the path to bt: too slow. Wt:then had ro specify a Complexity Complexity budgt:t for that path and repeat the testing process. E-box A better alternative would have been to have a tool 4.73 1-box 4.92 set the initial budgets for al l I/O pins and for the user M-box ro modify budgets as necessary. 4.40 The accuracy of timing calculations is another V-box 3.17 factor in the design results. Because SID used worst­ Av erage 4.30 case gate delays rather than rise and fall delays, its calcula tions of timing debt generally produced incorrect numbers that indicated timing problems The average ratio of 1 to 4.3 is interpreted to that did nor actually exist. Triggered by this inac­ indicate that 23 percent of the logic design work curate timing datJ, the rules generated duplicated (not counting placement, routing, simulation, and logic to reduce signal load fan-outs and needlessly timing verification) was done by logic designers and increased chip power. 77 percent by synthesis. Ultimately, synthesis methodology enabled the Another perspective is gained when we consider CAD system to produce accurate gate array designs. the amount of synthesis rules that were applied. With the Ruldorm language, we improved the rules The number of rules varies tremendously in relation to meet the changing necds. In the mJjority of cases, to the impact. For example, the adder-generation rules that were not in time for one designer bene­ rule ' which takes about 1 CPU minute to complete fi ted many other designers at a later design stage. for a 32-bit adder, performs the equivalent of Using an ECO (engineering change order) process, approximately 4 person-days of work. On the other we adjusted thc placements and power and extreme, a parameter-setting rule that is rested and improved the riming-critical paths by requesting applied in .1 CPU seconds performs the equivalent specific power sctt ings or fan-outs that the regular of 15 to 30 person-seconds of work. execution of SII) does nor normally generate. Ta ble 2 shows the approximate number of SID rules applied during synthesis runs. The rules per­ Results formed different categories of activities for the 77 Approximately 9.) percent of the gate-level database VAX 9000 system MCA3 gate array chips. RTL DECSIM, was synrhesiZL:d from source design, Thirteen bugs caused by synthesis were found and microco<.k truth tables. The other seven per­ in the gate-level simulator. These bugs were either cent was implemented in the source schematics in typographical errors in the rules or incorrect CLEGOs. the fo rm of technology cells, known as interaction within a set of rules. Although each rule Since the RTL often is quite similar to the finished was tested independently for correctness through gates, this percentage is not an accurate reflection of the amount of work involved. The RTL bodies Table 2 Activity Categories for the were quite simple and without technological VAX 9000 Gate Array Chip aspects, such as strange polarity inversions and clock connections. However, they did require that Rules Number the designer specify a I. I data paths and control logic Expand behavior instances to bit level 28,567 in the true and fa lse sense; e.g., A and B but not C Minimize and optimize Boolean logic 11,550 feeds the select to 32-bit data path MUX. Initially select macros and macropins 58,597 RTL The ratio of database size for bodies com­ Detect and correct electrical and design 7,392 pared to synthesized gates is a better measure of rule violations how the design entry was simplified through the Improve timing by loading and logic 85,008 use of RTL. schematics and SID synthesis. The com­ adjustments parison was done for CADEX file sizes of the RTL Set high-power on common-critical paths 3,850 165,000 designs versus the synthesized gate designs just Set parameters needed by other CAD tools RTL prior to placement. The ratio of logic complex­ Set high-power on timing-critical paths 24,024 ity versus gate logic complexity, fo r each CPU box, To tal number of rules applied 383,988 is shown in Table I.

128 Vo l. 2 No. 4 Fa ll f

Ta ble 3 Average Run-times for MCA Chips

MCA Chip Average Run-time

Synthesis 30 minutes to 3 hours Placement 4 to 10 hours Route 4 to 12 hours Delay calc 1 to 2 hours

Digital Te clm icaljournal Fa ll /')')(! Vol. 1 No. ·i 129 Ka ren E. Barnard Ro bert P. Ha rokopus I

Hierarchical Fa ult Detection and Isolation Strategyfo r the VAX 9000 Sy stem

Th e VAX 9000 system was designed to compete in the mainframemarket. Ma inframe customers not on�y require high processorpe1j ormance and throughp ut, but also a system which is reliable and always available. Th is paper demonstrates how the newly implemented scan system, in conjunction with scan pattern testing and symptom-directed diagnosis (SDD) , is essential to satisfy these needs. SDD is the use of on-line error detectors and state information saved at the time of an error to isolate the fa ult that caused the error. Th e scan system of the VAX 9000 system allows individual state elements in the processor to be set and sensed, and is the basisfo r fa ult detection and isolation.

As computer technology becomes more advanced, behave as the VA X architecture mandates. Micro­ designs are becoming more dense. Density implies diagnostics execute from control store random­ volume. For example, the typical chip on Digital's access memory (RAM) and have access to internal most advanced CPU, the VAX 9000 system, contains state elements of the CPU. Although microdiagnos­ 8000 gates. The VAX 9000 logic packaging also rics were intended ro provide better fault isolation compounds the diagnostic problems by making the than macrodiagnostics, isolation is still not optimal gates physically impossible to reach with a logic because the access points comprise only a fraction analyzer. of the total CPU. Since both types of diagnostics As the gate volume increases and the logic provide imprecise fa ult isolation, an engineer must becomes less accessible, the problems increase for be highly skilled in the analysis of both the code manufacture of the design, debug of the prototypes, and the internalworki ngs of the CPU to repair a bro­ and repair of the machine in the fi eld. The debug­ ken machine. ging stages and the hardware repair process require To avoid the rime-consuming process of manual that fa ults be found quickly and accurately to fa ult isolation, the field engineer often extensively ensure that downtime is minimal and valuable replaces modules, which is a costly repair method. resources and inventory are not wasted. This paper Historically, this practice has been a problem not presents the solution used in the VA X 9000 system only for Digital but for the entire computer indus­ for detecting and isolating hard and intermittent try. Another disadvantage is that these diagnostics faults. The diagnostic solution for the VAX 9000 are executed using the suspect logic, which can system comprises the scan system, tools to generate produce incorrect test results. Finally, both diagnos­ test data, utilities to submit the rest data to the scan­ tics often fa il ro provide the desired fa ult resolution nable logic, and an to record symp­ because of fa ult propagation. The fa ult spreads toms and produce a callout over time. across module boundaries because a typical test executes several instructions and each instruction TraditionalFa ult Detection and requires one or more CPU cycles before the results Isolation Methods are analyzed. Excluding the Micro VAX chip, all VA XCPl s designed prior to the VA X 9000 CPU are supported by macro­ Solution fo r a High -volume Design diagnostics and microdiagnostics. The solution used in the VAX 9000 system to Macrodiagnostics execute from the system's main improve fault detection and isolation preserves the memory and verify that the CPU can successfully role of the fu nctional diagnostics and addresses the

Vo l. 2 No. 4 Fa/1 1990 Digital Te chn icaljom-nal HierarchicalFa ult Detection and Isolation Strategyfo r the VAX 9000 System

inherent problems of those diagnostics. An integral processor unit and the scan system's basic compo­ scan system addresses the density and packaging nents do not contain any faults. issues. The CAD processes that automatically gener­ After the tests are executed, the service processor ate test data address the complexity and schedule unit's operating system is booted and the scan hard­ issues. The scan system also reduces the problem core diagnostic is mn. The hard-core diagnostic of fa ult propagation by providing a mechanism to assumes that the system has passed the power-up stimulate the machine and examine the results after diagnostics and, therefore, the scan controller will just one machine cycle. Further, more direct control operate properly. The hard-core diagnostic tests the over most of the internal logic elements improves components of the scan system that reside on the test coverage and fault isolation. CPU and system control unit planar modules. If the scan system is broken, the machine cannot be initialized. When the system passes the hard-core Increased Visibility diagnostic, the scan pattern diagnostic tests the In the VAX 9000 CPU design, multichip unit technol­ integrity of the scannable system logic. ogy created a machine that could not be built or The functional diagnostics and system exercisers diagnosed without a marked increase in visibility are run Jfter the scan pattern diagnostic has verified points. In previous VAX systems, the number of visi­ that no structural problems exist. In the testing bility points fo r microdiagnostics varies from one hierarchy of theVAX 9000 system, fu nctionJl diag­ machine to another. For example, the VAX 8700 nostics are as important as structural testing. Func­ system has just over 150 visibility points, and the tional diagnostics verify that the design represents VAX 8600 and 8650 systems have over 3000 visibil­ a valid VA X system. ity points. Most points are read-only points, and the diagnostic processor has limited direct control over Timely Te sting initializing individual CPU state elements. In con­ In the course of developing a system design, several trast, the VAX 9000 scan system provides access to revisions of the system components are usually over 20,000 internal machine state elements for required. Each revision represents a different both reading and writing and direct access to all machine for testing purposes. To test each revision internal RAM and register structures. The design of of the VA X 9000 design in a timely manner, a tool, the VA X 9000 system significantly improves CPU called Scan Environment Patterns for Test Jnd and system control unit logic visibility. Repair (SCEPTER), wJs developed to generate test The scan system is used for diagnostic purposes data. SCEPTER, an automatic pattern and test gen­ ancl to initialize the state of the CPU and system erator, takes JS input the data used to manufacture control unit. Because the number of windows into the multichip units and structural models that simu­ the system are increased, the design can be parti­ late the design. Both inputs are produced during tioned into smaller regions, which improves fault the logic design process. isolation. The fault detection and isolation strategy depends on components that vary in complexity SCEPTER Process from a simple scan latch to automatic generation of Pattern generation fo r the VAX 9000 system is a the test data. recursive process that initializes the scannable An important feature of the scan system is that latches in a logic model, simulates one or more the scan latches can be influenced by the scan sys­ system clocks, and reads the contems of the scan tem and by the system logic. Therefore, the scan latches. The contents are read as variable-length system functions can be tested and verified inde­ vectors, one bit for each scan latch in the design. pendently of the system logic. However, if the scan The vectors to initiJlize the logic, the timing defini­ system is not functioning properly, the scan pattern tions Jnd the expected result, and mask vectors Jre produce valid results. Therefore, diagnostic cannot written by SCEPTER into an ASCII file. The scope of scan system fa ults must be fixed before running the the testing fo r J particular file is determined by the scan p::merns. scope of the model that was input to SCEPTER. SCEPTER is quite flexibleand cJn generJte data files Te sting Hierarchy thJt target a Motorola Macrocell Array III (MC:A:)), Tes ting for the VAX 9000 system begins with diag­ a multichip unit, or CPU. nostics that Jre run automatically when the system The SCEPTER data files are translated into binary is powered up. These tests ensure that the service formats to reduce the size of the fi les and to allow

DiJ!, ital Te cbnicaljournal Vo l. 2 No. 4 Fall /<)<)() 131 VAX9000 Series

one pattern generator to satisfy the requirements related ro how fa r apart the scan latches arc placcu. and abilities of every test environmenr. The envi­ The spacing of the scan latches also affects the isola­ ronments that usc data generated by SCI:PTER arc tion callout. The number of components involved in a callout can be decreased if scan latches arc situ­ • Trillium's Micromastcr Plus, the MCA tester used ated such that, on input to a chip, a signal feeds a by the MCA3 ,·endor and Digital for the manufac­ scan latch prior to the signal being used in combi­ turing test process. The tester ensures that the national logic. Figure I illustrates the callout that chip internal logic is fault-free prior mounting 10 occurs if a fault is detected on either of the two sig­ it on the multichip uni£. nals shown. • MCU tester, which is a comprehensive multichip The callout for signal A, where the chip is insu­ unit testn Digital developed for the manufactur­ lated with scan latches, is 30 percent smaller than ing test process. This multistation high-volume the callout for signal B, which does not have tester has access ro the multichip unit's J/0 pins boundary scan latches. The difference is even larger and can probe the multichip unit and MCA pins. for signals that converge on a common area of com­ binational logic. • Manual probe station, which started out as an engineering tool but has evolved into auxiliary diagnosis too in the multichip unit test process. Us ing Boundary Sca n to Improve It is used ro diagnose multichip units that have Fa ult Isolation been removed from a VAX 9000 system because The example in Figure I does nor illustrate what of suspected fa ults. happens to the isolation callout in the case of a multibit signal that communicates with several • 9000 kernel environment is used by Engi­ VAX chips. Figures 2 and 3 demonstrate the effect that neering, Manufacturing, and Customer Services fa n!N has on fa ult isolation. Figure 2 has boundary to verify MCU installation. scan, and Figure 5 does not. The fo llowing discus­ Pa ttern Generation Process sion centers on these two figures. The callout in Figure 3 includes an extra compo­ The automatic test ami pattern generation (ATPG) nent, i.e., , fo r each chip. If a process begins by determining what portion of scan latch is placed between the combinational the design is to be tested. A computer-aided design logic and the chip boundary, the callout list can (CAD) tool partitions the models into chunks be reduced. The reduction between the callouts in before SCEPTER is run. A chunk can consist of an Figures 2 and 3 is 55 percent. MCA, a multichip unit, one or any portions of crt ·, The hierarchical test strategy, which was detailed these units. earlier, confirms the following items as good on the Typically, one multichip unit is considered to be MCU tester prior to installing the multichip unit on the targeted device for tt'sting, and any multichip the planar module: units that communicate with the targeted multichip unit also are included in the rest process. • Chips I, 2, 5, and 5 Once the model is established, the recursive pro­ cess of initializing, simulating the system clock, and • MCUx HDSC reading the results continues until the generation • MCUy HDSC algorithms are satisfied that no additional fa ults can be detected. The simulation time can be lengthy in • MCUx flex connectors this process, and, therefore, reduced coverage lim­ • MCUy flex connectors its are set for multichip unit pattern generation. MCA pattern generation usually produces better Although these items are confirmed as good, a than 98 .5 percent coverage. fa ult can still exist on the planar module or in the flex connectors. Because the flex connectors are Basic Th eory of StructuralTe sting moving connectors and subject to abrasion, they As discussed earlier, the scan system narrows the have the highest probability of breaking and are, scope of testing to an area as small as one system therefore, the weakest link in the mulrichip unit clock cycle. If the scan latches are strategically assembly. placed, a fa ult's source can be pinpointed to a rela­ As such, these connectors require the greatest tively small region. The size of the region is directly protection and deserve a high-level of suspicion

132 Vo l. 2 No. 4 Fa ll I'J'JO Digital Te cb nicaljournal Hierarchical Fa ult Detection and Isolation Strategy fo r the VAX 9000 Sy stem

MCU 1 MCU 2

MCAx MCAy

FAULT" SIGNAL_A_H <0>

1 CLOCK CYCLE MCAy MCAx

SIGNAL_B_H <0>

1 CLOCK CYCLE

"SIGNAL A: ··siGNAL B:

MCAx MCAx MCU 1 HDSC MCAx COM BINATIONAL LOGIC CPU PLANAR MCU 1 HDSC MCU 2 HDSC CPU PLANAR MCAy MCU 2 HDSC MCAy COMBINATIONAL LOGIC MCAy

Fz�e,ure 1 Examp le of Two Fa ults with and without Scan Latches on the Module Boundaries when isolating faults. As a result of testing in the patterns are rerun. If the run is fa ilure-free, then the manufacturing process, the chip's internal logic and cause of the fault is on the removed multichip unit. the HDSC systems can now be temporarily removed The defective unit is sent to a repair depot for diag­ from the callout. If a fa ult occurs, boundary scan nosis and repair. latches can isolate the fault to one signal. Without For discussion purposes, if additional faults are boundary scan, all three signals have to be included detected on one of the multichip units during the in the callout because the fault source cannot be testing, then that multichip unit should be the first accurately pinpointed. one to be reseated or swapped. Thus, the highest In this example, provided that no other fa ults are number of faults can be eliminated for one MCU present, each multichip unit's flex connector has replacement. The remainder of the repair verifica­ an equal probability of having caused the fault. For tion procedure would be the same as in the case every inch of flex connector, there are 30 signal where no other fa ults existed. connections. A minor piece of debris on a con­ nector can potentially cause a number of failures. Scan Pattern Diagnostic Therefore, the least costly repair strategy is to reseat The scan pattern diagnostic is a utility that resides the multichip unit after cleaning the flex connector on the service processor unit's system disk and and the pads on the planar module. processes the structural test data generated by

If the fault is still present after reseating the multi­ SCEPTER. The scan pattern diagnostic runs each fi le chip unit, the multichip units are swapped and the based on the user's input.

Digital Te cbnicaljournal Vo l. 2 No. 4 Fa ll 1990 VAX 9000 Series

MCUx MCUy

CHIP l CHIP 4

FAULT

CHIP 2 CHIP S

CHIP3 CHIP 6

CALLOUT FOR SAME FA ULT USING BOUNDARY SCAN

ITEMS KEY:

15 CHIP 1 SCAN LATCH -+ DETECTION POINTS 2. CHIP FOR FAULTS 3. MCUx5 HD SC 4. MCUy HDSC D PLANAR 5. -D OR CL -+ COMBINATIONAL LOGIC

Figure 2 Fa n/N with Boundm�)' Scan

'T'h<.: interface supports flexibility for testing the SCEPTEf{ provides isolation maps. which an: lists scmnahle logic in the system control unit and one of the components that may he responsible for the to fo ur Cl'l·s. The scan p:mcrndata files contain the fault detected by a given scan latch. There is one dat::t required to test the hardware as discussed ear­ isolation map for each scan latch involved in the lier in the Pattern GenerationProcess section. testing. When a scan larch detects a failure, the The diagnostic packages the test data into sun scan patlern diagnostic uses its physical location operations that ::tre submitted to the scan control to access the isolation map provided by SCEPTER. modu 1<: through the service processor unit sys­ The contents of the maps arc used in the isolation tem calls. The scan pattern diagnostic checks the callout. returned status; and if fa ults are detected, saves the physical location of each fa ull in an internal Proper Niche fo r Structural Testing database. We did nor design the structural test process for the

I'J'JO Vo l. 1 No. -i Fa ll Digital Te chnicaljournal Hierarchical Fa ult Detection and lsolat ion Strategy fo r the VA X 9000 System

MCUx MCUy

CHIP 1 CHIP4

-� FAULT

-

CHIP 2 CHIP S  r-- '------'------D-

CHIP 3 CHIP 6 

1.---

CALLOUT WITHOUT BOUNDARY SCAN

11 ITEMS KEY 1. CHIP 1 2. CHIP 2 SCAN LATCH-+DETECTION POINT 3. CHIP 3 FOR FAULTS 4. CHIP 5 5. CHIP 1 CL 6. CHIP 2 CL CHIP D OR CL -+ COMBINATIONAL LOGIC 7. 3 CL 8. CH IP 5 CL -D 9. MCUx HDSC 10. PLANAR 11 MCUy HDSC

F(f!,ure3 Fa n!N ll'ithout Boundary Scan

VA X 9000 system to cover every test problem. cannot be used to find problems with the design. Instead, we designed the process to ensure that the Architectural verification tests perform that fu nc­ hardware can physically operate as described by tion. Finally, structural testing cannot be used to the design data. Structural testing cannot be used to detect intermittent fa ults. Because this type of fa ult determine if the VAX 9000 system is operating as a requires the presence of special conditions which VA X system. that is the job of the fu nctional diag­ the test data may not provide, on-line error detec­ nostics. Further, structural testing cannot be used tors and symptom-directed diagnosis are more to determine if the system is robust enough to sup­ effe ctive alternatives. port multiuser traffic; the User Environment Te st Structural testing must be performed at a low Package (liFT!') exercises the system hardware and level. It should be done \vhen power fa ilures or operating system. The structural test process also power surges occur. when multichip units on the

Digital Te dmicaljournal V!JI. .! Nn. 1 f-'a /1 /')90 VAX 9000 Series

CI'U or system conrrol unit planar are swarped, or error t.lerector states, multiplexer selccr vJlues, when signal-carrying cables arc installed . The scan memory addresses, and register values. patterns also should be run if the system crashes The VA X 9000 symprom-dirccred diagnosis or applications begin to behave erratically for no strategy is composed of fou r major components. apparent reason. Initially, when a system problem First, on-line hardware error detectors are used to occurs, the cause must be isolated ro either the soft­ achieve mJximum coverage and an error-reporting ware or hardware to initiate the correct remedial process logs the necessary symptom information action. If the hardware appears to be the cause. then when errors are detected . Second , hardware error the hardware diagnostic strategy must be followed detecrors and secondary syndromes are used to to obtain optimal fault isolation in a minimal build symptom-directed diagnosis fault isolation amount of time. rules that achieve the minimum possible callout of faulty field replaceable units. Third, symptom­ Structural Te st Process directed diagnosis CAD tools calcu!Jre the coverage The VAX 9000 system's structural test process provided by on-line error detection and evaluate shows that, given the proper pattern data files, the quality of fault isolation provided by these fa ults can be detected and isolated faster than with detectors. Fourth, on-line, symptom-directed diag­ traditional methods that usc the symptoms from nosis software performs fault isolation for both the functional diagnostics. Structural resting not single-error and multiple-error events. only fills a gap in Digital's rest hierarchy, bur also preserves the benefits derived from the functional Fa ult Detection Coverage and diagnostics. As a result, logic designers and mJnu­ Error Logging facruring engineers can concentrate on higher level On-line hardware error detection is essential for problems, and field engineers can repair and bring a detecting intermittent fa ults. On high availability broken machine bJck on-line faster. mainframes such as the VAX 9000 system, it is essen­ The structural test process has also produced an tial ro detect or "cover" a high percentage of inter­ automatic rest data generator. This generator is tl ex­ mittent fJ ults. This section discusses the coverage ible enough to support testing for Digital's future measurement of on-line error detection for the processor designs, which include a scan system. VAX 9000 system. The VAX 9000 system error­ This tool will prove ro be essential in bringing the logging process is also discussed. next innovative complex design to marker on rime. It makes design testing, prototype debugging, and On-line Error Detection repair more thorough and efficient. Structural testing cannot address the unique Hardware components are subject ro temporary problems presented by intermittent fa ults. These fJ ilures because of signal noise, environmental devi­ faults require constant monitoring and a mecha­ ations, marginal devices, and other factors. To com­ nism to log the symptoms and isolate over time. pensate for these inevitabilities, the VAX 9000 kernel includes over 450 error latches, which store Sy mptom-directed Diagnosis the results from hardware error-detection circuits, As computer systems have become more complex. such JS parity and error-correcting code checkers. the occurrence of intermittent fa ults hJs increased The detection of fa ults is critical to an orderly and dramatically. This phenomenon results mJinly from predictable error-handling process. Error-detection the increasing densities of chips and interconnec­ circuits nor only ensure the data integrity of the sys­ tions. Traditional test-directed diagnostics are inef­ tem, but also provide information that can be used fective in isolating intermittent fa ults because they for symptom-directed diagnosis fa ult isolation. rely on the ability to re-create the fa ilure condition, The placement of error-detection hardware is which is seldom possible to do. Inrermirrent fa ults critical to the effectiveness of the process. The are usually as a result of marginal components and goals for error detector placement on the VA X 9000 may on ly occur when certain conditions are mer, system included : such as the specific workload on the systcm. In con­ • Maximizing coverage of higher failure trast to rest-directed diagnostics, symptom-directed components diagnosis uses symptom information saycJ at the • time of the fa ilure ro isolate the bult. Symptom Minimizing the callout of fa ulty field replaceable information includes useful machine stares, such as units

Fa ll 136 Vol. 1 No . -1 J<)

• Minimizing pin use and cell count Computation of the error domains for each on­ line error detector in rhe design results in a signal • Minimizing effects on system performance probability of detection for each covered signal. Uncovered signals are assigned a zero signal proba­ Co verage Calculation bility of detection. One of the purposes of hardware error detection is to ensure that the VAX 9000 system behaves in a Coverage Fo rmula The sign

Digital Te chnicaljournal Vo l. 2 No. ·i Fa /1 /'J')fi 1:)7 VAX9000 Series

Thl're are two basic types of fa ult isolation rules. replaceable unit callout results in more than one single event and multiple event. Single-event rules field replaceable unit having a significant possibility arc used fo r analyzing single error events (i.e. , one of fa ilure, then secondary syndromes must be used error log entry). Multiple-even£ rules arc used for to reduce the callout. Secondary syndromes are key

analyzing multiple error events that occur over a machine states, other than error latches. that are specified inrerval of rime. stored in rhe error log entry. Examples of secondary syndromes include mulriplexer select lines, mem­ Single Event Fa ult Is olation Rules ory :tddress values, and orher p�nh-sensitive control There arc several categories of single evenr fa ult iso­ signals. These signal stares are used to derermine lation ru les. These rules are derived from rhe on-line rhe specific parh rhar was S(;nsitized when an error error derection designed into the VA X 9000 system. occurred . The nonsensitized path(s) can then be removed from the callour. An example of how sec­ Prinuuy5>) mdrome Fa ult Isolation Rules Primary ondary syndromes are used for fa ult isolarion is syndromes are the error larches rhat detect and shown in Figure 4. report error events. Each error latch stores the result of an on-line error derector. Each error detec­ FaultPropaga tion Rules Sometimes a single-error ror covers a secrion of logic in rhe system. By map­ event can trigger multiple error detectors because ping this logic to the physical parrition (i.e. , field of fau lt propagation or domain intersection.

replaceable units), the values of set error latches can Fault propagation occurs when a fa ult in :.1 given be ust"d as a first-pass fault isolation. In many error domain (i.e. , the propagation source) propa­ instances, this analysis alone is sufficient to deter­ gates into other error domains (i.e., the propagation mine rhe fa ulty field replaceable unit. destinations). 'J() identify the real source of the error, the possible fa ult propagation paths must be Secondary .'l)•ndrome Fa ult Isolation Rules In some found and the precedence of the error detectors in instances, the fa ult isolation provided by the pri­ each propagation path must be identified. When mary syndromes may nor localize the fault suffi­ multiple error latches are set, the propagation rules ciently. For example, if the primary syndrome field can then be applied to eliminate all propagation

FRU 1 FRU 3 00 -07 / PA RITY �8 / GENERATOR /

DO - 08 , / / �MUX PARITY ERROR PARITY _ERROR r- CHECKER - LATCH FRU 2 00 -�8 01 // DO D7 l,....--. ; /_08 / GENERPARITYATO R /

MUX_SELECT

PARITY_ER ROR MUX_SELECT CALLOUT UNKNOWN FRU 1 FRU 2 FRU 3 0 FRU 1 FRU 3 FRU 2 FRU 3

Figure 4 Secondary 5) 1ndrome Exa mple: MUX Select Usedfor rc iUltIso lation Reji'nement

1:)8 Vo l. .! No. Fa /1 1')')11 Digital Te cbnicaljournal Hierarchical FaultDetect ion and Isolation Strategy fo r the VAX <)000Sy stem destinations for each propagation source in the call­ ing occurred. Such an analysis would result in a ull­ out. An example of fault propagation is shown in out of the fa ulty field replaceable unit. However, Figure 5. multiple-event rules include checking for certain environmental deviations in close proximity to a Domain Intersection Rules Domain intersection logic fault. In this case, multiple-event analysis results when two or more error detectors cover a would attempt to correlate the logic fa ult with the common piece of logic. This information is used to environmental deviations to determine if the fault refine the callout when multiple error latches are set is transient in nature. If this were the case, a callout in the VAX 9000 system as shown in Figure 6. would not be required. Multiple-event rules can also be used to enforce MultipleEvent Fa ult Isolation Rules the callout refinement provided by secondary Multiple-event rules attempt to correlate separate syndromes, fault propagation, and domain inter­ error events to fine! a common problem. This type section. For example, in a VA X 9000 system that of analysis is beneficial when an intermittent or repeatedly generates identical or similar error log transient problem is not diagnosed sufficiently by entries, multiple event analysis can correlate these single-event symptom-directed diagnosis rules. entries to a single intermittent fa ult. It can provide For example, if a logic fault were analyzed with a scenario of which is the most likely secondary single-event, symptom-directed diagnosis rules, an syndrome path to be sensitized and the most likely intermittent logic fault could be concluded as hav- error domain to detect the error first. In this case,

FRU 1

OG-07/ PA RITY 08L OG-08/ PA RITY ERROR PARITY _ERROR_ 1 GENERATOR CHECKER / / L -- LAT CH / --

/ OG-08 v-/

FRU 2

LOGIC

/ OG-08

PA RITY ERROR PARITY _ERROR_2 CHECKER ,- LATCH

PARITY_ERR OR_1 PARITY_ERROR--2 CALLOUT

NO PROPAGATION FRU 1 INFORMATION FRU 2

WITH PROPAGATION FRU 1 INFORMATION

Figure 5 Fault Propagation Example

Digital Te cbnicaljo urnal Vo l. 2 Nu. 4 Fa l/ 1<)90 139 VA X 9000 Series

FRU 1 FRU 2

00-07 PARITY 0� 00-08/ PAR ITY ERROR PARITY_ERR OR_1 / GENERATOR / CHECKER --- LATCH n FRU 3

PARITY ERROR PARITY_ERR OR_2 CHECKER r---- LATCH

PA RITY _ERROR_1 PARITY_ERR OR_2 CALLOUT

FRU 1 0 FRU 2 0 FRU 1 FRU 3 FRU 1

Figure 6 Domain Intersection EwmljJle

multiple-event analysis can view these events as a error detection in designs. Ea rly fe ed back gave single problem rather than seeing each error log dc::signers rime to make design changes if cm·erage entry in isolation. or isolation goals were not achieved . Further. the

information provided by HIOE helps designers CAD To ols andProcesses select locations for error detectors and gave design­ ers quick feedback on the implications of detector 'J(> ensure that the VAX 9000 symptom-d irected diagnosis fa ult coverage and isolation goals were placement and design changes. achien:d , CAD tools were needed to measure the quality of the on-line error detection in the design. Sy mp tom Diagnosis Information To ols also were needed ro help develop symptom­ Language directed diagnosis fa ult isolation rules and to fac ili­ - The symptom d irected diagnosis fa ult isolation tate the conversion of these rules into a format that rules for the VA X 9000 system were coded into a set could he used by the fa ult isolation software. of system fa ult isolation matrix fi les, called symp­ Some of the significant sy mptom- di rected diag­ tom diagnosis information files. Symptom diagnosis nosis CAD tools that were devdoped and used for information is a language that is designed to express 9000 the VAX system are discussed below. hoth single-event and mulriple-c:: vent , symptom­ direcrc::d diagnosis fault isolation rules in an objec­ Hardware Is olation Domain Evaluator tive and consistent manner. The hardware isolation domain ev:tluaror (HII)E) In c::arlier VAX systems, new fault isolation tools CAD roo! was developed to provide symptom­ were needed fo r each new computer system. In the directed diagnosis fault coverage and isolation VAX 9000 system, the symptom diagnosis informa­ information to the VA X 9000 logic designers. Hll)E tion language provides a general-purpose means to also can generate simp!<: symptom-direc ted diag­ specify symptom-directed diagnosis fa ult isolation nosis fa ult isolation rules for usc in the system fa ult rules. The fi les arc used as the rule base for the isolation matrices. sy mptom-directed diagnosis fa ult isolation tools. One of the:: goals for HIDE was to provide ea rly which means that the tools can be used for furure fec::dback to logic designers on the quality of on-line computer system designs .

I')')() 1-i() Vo l. J No. -i Fa ll Digital Te chnicaljo urnal Hierarchical Fa ult Detection and Isolation St rategy for the VAX f)OOO S)'slern

On-line Fa ult Is olation Software vides a hierarchical mechanism for testing each

The VA X 9000 system contains on-line symrrom­ component before it is assembled into the next direcred diagnosis sofrwan: rhar auromatically diag­ level. In off-line resting, the usc of the scan system noses faults as they occur. The software produces rrovidcs high coverage and accurate fa ult isolation . an isolation callout of the possible faulty fi eld Scan test ing also has proven effe ctive during all replaceable units that is automatically received by phases of the VA X 9000 system product develop­ Digital customer s<.:rvice centers through a symp­ ment: design, manufacturing, prototype debug , rom-directed diagnosis reporting process. This and customer support. rrocess is design<.:d (() minimize the repair time Sy mptom-directed diagnosis is a sophisticated for VA X 9000 systems. It automatically notifies tool that provides detection and isolation of inter­ Digital of problems and provides a rerair plan to mittent faults. Intermittent fa ults have heen a signif­ Customer Services before personnel arc sent to the icant problem in the past because of the difficu lty to customer's sire. re-create the conditions that lead to such fa ults. Symptom-directed diagnosis solves the problem of Service Processor Diag nostic intermittent fa ults by analyzing symptom informa­ The VA X 9000 sen·ice rrocessor unit contains a tion generated by on-line error handlers rather than symptom-directed diagnosis fa ult isolation process by attempting to re-create the fa ult. Thus. the use that rerforms single-event analysis. This pro­ of symptom-dir<.:cted diagnosis p rov i des greater cess runs in the background waiting for error log machine availability for the VA X 9000 system . entries. \V hen an error log entry is generated. the Acknowledgments process analyzes the error log entry and rroduces an encoded ca llout of possible fa ulty field rerlace­ The implementation of the VA X 9000 fa ult detec­ ahle units. tion and isolation strategy wou ld have been impos­ The symptom-d irected diagnosis fa ult isolation sible if not fo r the perseverance and dedication to algorithm is rerformed by a general-purpose diag­ high quality shown by the fo llowing peorle: Jeff nostic engine. This engine uses a binary version Barry, Dom inic Carr, Steve Conway, Ed Crowley, of the symprom diagnosis information fi le, i.e. , Betty Daley, To ny [)ancona, Dave D'Antonio, Chris binary-coded matrix, as a rule base fo r its analysis. Demos, Sue DesMarais, Pa ul Dormirzer, Rick The diagnostic engine can analyze any error log Dusek, Mike Evans. Skip Gaede. Mike Gavronsky, entry that has a va lid corresponding binary-coded Philipre Girard, Matt Goldman, Francis (; ravel, matrix file. Chris joseph, Dale Kec k, Tom Krehel, Charlie In addition to the encoded callout, the single­ Kretz, Burch Leitz, Helen Lenane. Paul Leveille. event fa ult isolation rrou:ss produces status infor­ Keith Mayhue, Chris McCabe, Robert Nobrega. mation from each error event that is used for Mike Newman, Paul Paternoster, Brian Rosr, Dan multiple-event analysis. Schullman, Scott Si tterly, Norm Sozio, Ta mar We xler. To m Winter, Te d Wo jcik, Richard Wo od . VAX simPL US Eugene Xia. and the members of thc MCU rester and VAX The Vr\Xsim l'Ll 'S tool runs on the 9000 CPU and tvi CA:)rest develormenr t<.:ams. performs symptom-direcred diagnosis mulriple­ e,·cnt analysis. The tool analyzes information gen­ General References erated by the single-event, symptom-d irected A. Miczo, Digital Logic Te sting (New Yo rk: Harper diagnosis process using multiple-event. hinary­ and Row Publishers, Inc., 1986). coded matrix files. The VA XsimPLLIS tool uses the same general-purpose diagnostic engine as the N. Tendoikar and R. Swann, "Automated Diag­ single-event. symptom-directed diagnosis process. nostic Met hodology for the IBM 3081 Processor The ou trur of the VA Xsim PLliS tool is a syndrome Complex ,'' IRM jo urnal of Research and Det,el­ entry that collapses several error events into a single op ment. \'OI.26. no. 1 (.January 1982): 78-HH. error analysis theory. H. Ta naka er al.. "System Level Fault Dicrionary Summary Generation ." IEEh. International Te st Conference Proceedinp,s(N ew Yor k, 198H): H84 -HH7. A complete rest and diagnosis strategy for a large computn system, such as the VAX 9000 system, M. Coldman et al., "The VAX 9000 Sen·ice Pro­

1-equires off-line resting and its on-line counterpa rt. cessor Unit," Digital Technical .Journal. vol. 2. s�·mptom-direcred diagnosis. Off-line testing rro- no. -+ (Fall 1990, this issue): 90- 101.

4 /'J')Ii Di�ilal Tt.•cbuicatjounwt \'ul. 1 N". Foil l.:i l I Further Readings

The Digital Technical .Journal CVAX-based Systems publishes papers that explore the li> l. 1, No . 7, August 1988 technologicalfo undations of Digital s CVAX chip set design and multiprocessing architec­ major products. Eachjo urnalfo cuses ture of the midrange VA X 6200 family of systems on at least one product area and and the MicroVAX 3500/3600 systems presents a compilation of papers Software Productivity To ols written by the engineers who devel­ W>l. 1, No . 6, February 1988 op ed the product. The contentfo r Tools that assist programmers in the development thejo urnalis selected by thejo urnal of high-quality, reliable software Adl'isory Board, which includesfo ur VAXcluster Systems vice presidents andfi ue senior engi­ W>l. 1, No . 5, September 1987 neering managers. System communication architecture, design and implementation of a distributed lock manager, and performance measurements

To pics covered in previous issues of the Digital VAX 8800 Family Te ch nicaljo urnal are as follows: Vo l. 1, No . 4, February 1987 The microarchitecture, internal boxes, VA XBI bus, DECwindows Program and VMS support for the VA X 8800 high-end multi­ Vo l. 2, No . 3 Summer I990 processor, simulation, and CAD methodology An overview and descriptions of the enhancements NetworkingProducts Digital's engineers have made to MIT's X Window Vo l. I, No . 3, September 1986 System in such areas as the server, toolkit, interface The Digital Network Architecture (DNA), network language, and graphics, as well as contributions performance, LANbridge 100), DECnet-UL'TRIX and made to related industry standards DECnet-DOS, monitor design VAX 6000 Model 400 System MicroVAX II System W>l. 2, No . 2, Sp ring 1990 \1)/. /, No . 2, March 1986 The highly expandable and configurahle midrange The implementation of the and family of VA X systems that includes a vector proces­ floatingpoint chips, CAD suite, MicroVAX work­ sor, a high-performance scalar processor, and station, disk controllers, and TK 50 tape drive advances in chip design and physical technology VAX 8600 Processor Compound Document Architecture Vo l. I, No . 1, August 1985 W>l. 2, No . I, Win ter 1990 The system design with pipelined architecture, The CDA fa mily of architectures and services that the !-box, F-box, packaging considerations, signal support the creation, interchange, ancl processing integrity, and design for reliability of compound documents in a heterogeneous net­ work environment Suhscriptions to the Digital Te chnicaljo urnalare Distributed Systems available on a yearly, prepaid basis. The subscrip- W>l. 1, No . 9, ]une 1989 tion rate is S40.00 per year (four issues). Requests Products that allow system resource sharing should be sent to Cathy Phillips, Digital Equipment throughout a network, the methods and tools Corporation, MLO I-3/B6R, 146 Main Street, Maynard, to evaluate product and system performance MA 01754, US.A. Subscriptions must be paid in u.s. dollars, and checks should be made payable to Digital Storage Te chnology Equipment Corporation. W>l. 1, No . 8, February 1989 Engineering technologies used in the design, manu­ Single copies and past issues of the Digital Te chnical facture, and maintenance of Digital's storage and jo urnalcan be ordered from Digital Press at a cost information management products of $16.00 per copy.

142 lk>/. l No. 4 Fa ll 1?90 Digital Te chnicaljo urnal Technical Papers and Books by Digital Authors Digital Press Digital Press is the book publishing group of Digital Dileep Bh:mdarkar and Richard Brunner, "VAX Equipment Corporation. Digital Press publishes Vector Architecture," Prnceedings of17th Annual books internationally for computer professionals, InternationalSy mposium on Computer Architec­ specializing in the areas of networking and data ture (IEEE, May 1990): 204-215 communication, artificial intelligence, computer David G. Shurtleff and Colin Strutt, "Extensibil­ integrated manufacturing, windowing systems, and ity of an Enterprise Management Director," in the VMS operating system. Digital Press welcomes Ne twork Ma nagement and Control, edited by proposals and ideas in these and related areas. A. Kershenbaum, M. Malek, and M. Wall (New York: Plenum Press, 1990): 129- 14 1. VAX /VMS: Writing Real Programs in DCL Paul C. Anagnostopoulos, 1989, softbound , Xi-Reo Cao, "System Representations and Per­ 409 pages ($29.95) formance Sensitivity Estimates of Discrete Event Systems," Ma thematics and Computers in X WINDOW SYSTEM TOOLKIT: Simulation, vol. 31 (1989): 113- 122. The Complete Programmer's Guide and Specification Xi-Ren Cao, "On a Sample Performance Function of Paul ). Asente and Ralph R. Swick, 1990, softbound, jackson Queueing Networks," Op erationsResearch, 1,000 pages ($44.95) voL 36, no. I ( 19S8): 128- 136. UNIXFOR VMS USERS Xi-Ren Cao and Y. C. Ho, "Estimating Sojourn Time Philip E. Bourne, 1990, softbound, )68 pages Sensitivity in Queueing Networks Using Perturbation ($28.95) Analysis,"journa/of Op timization Th eoryand Applications, voL 53, no. 3 ( 1987): 353-375 INFORMATION TECHNOLOGY STANDARDIZATION: Theory, Practice, Xi-Reo Cao, "First-Order Perturbation Analysis of a and Organizations Single Multi-Class Finite Source Queue," Performance Carl F. Cargill, 19R9, softbound, 252 pages ($24 .95) Eualuation, vol. 7 ( 1987): .1 1-41. THE DIGITAL GUIDE TO SOFTWARE D. Lomet and B. Salzberg, "Access Method for Multi­ DEVELOPMENT version Data," Proceedings of ACM SIGMOD Confe r­ Corporate User Publication Group of Digital ence (May 1989): 315-324. Equipment Corporation, 1990, softbound, D. Lomet and B. Salzberg, "A Robust Multi-attribute 239 pages ($27.95) Search Structure," Proceedings of International VMS INTERNALS AND DATA STRUCTURES: Conference on Data Engineering (February 1989): Ve rsion 5 Update Xpress, Volumes 1,2,3,4,5 296-304 . Ruth E. Goldenberg and Lawrence J. Kenah, 1989, D. Lomet, "A Simple Bounded Disorder File 1990, 1991, all softbound (S27.95) Organization with Good Performance," ACM VA X/VMS INTERNALS AND DATA Tra nsactions on Database Sys tems 13, vol . 4 STRUCTURES:Ve rsion 4.4 (December 1988): 525-551. Lawrence J Kenah, Ruth E. Goldenberg, ami P. Bernstein and 0. Lomet, "CASE Requirements Simon F. Bate, I9SR, softbound, 979 pages ($75.00) for Extensible Database Systems," Data Engineer­ THE USER'S DIRECTORY OF COMPUTER ing, voL 10, no. 2 (June 1987): 2-9. NETWORKS W. Lin win and D. Lomet, "A New Method for Fast Tracy L. LaQuey, 1990, softbound, 630 pages Data Search with Keys," IEEE Software, vol. 4, no. 2 ($34.95) (March 1987): 16-24. AND D. Lomet, "Partial Expansions for File Organizations ARCHITECTURE: The VA X, Second Edition with an Index," ACM Tra nsactions on Database Henry M. Levy and Richard H. Eckhouse, Jr., 1989, Sy sten1s, voL 12, no. l (March 19R7): 65-84 . hardbound, 444 pages (S38.00)

Fa ll Digital 'f'e cbnicaljo urnnl l·'tJ/. .! ,\b. - 1 JIJ'Jii 143 Further Readings

USING MS-DOS KERMIT: Connecting A BEGINNER'S GUIDE TO VA XNMS

Yo ur PC to the Electronic Wo rld UTILITIES AND APPLICATIONS Christin� M. Gianon�. 1990, softbound, 244 pages, Ronald M. Sawey and Troy T. Stokes, 1989, with Kermit Diskerte (529.95) softbound, 278 pages (S26.95)

SOLVING BUSINESS PROBLEMS WITH MRP II X WINDOW SYSTEM, Second Edition Alan D. Luber, 1991, hardbound, )00 pages (534.95) Robert Scheifler and James Geuys, 1990, softbound , 851 rages (S49.95) VMSFILE SYSTEM INTERNALS Kirby McCoy, 1990, softcovcr, 460 rages (S·i9.95) : The Language, Second Edition TECHNICALASPE CTS OF DATA Guy L. Steele jr., 1990, 1,029 pages COMMUNICATION, Third Edition ($38.95 in softbound , £46.95 in hardbound) John E. McNamara, 1988, hardbound, 38:1 pages (542.00) WORKINGWITH WPS-PLUS Charlotte Te mple and Dolores Cordeiro, 1990, LISP STYLE and DESIGN softbound, 235 rages (£24.95) Molly ivl. Miller and Eric Benson, 1990, softbound, 214 rages ($26.95) ABCs OF MUMPS: An Introduction for Novice and Intermediate Programmers THE VMS USER'S GUIDE Richard F. Walters, 1989, softbound, 303 pages James F. Peters III and Patrick J. Holmay, 1990, (525.95) softbound, .1 04 pages (528.95)

THE MAT RIX: Computer Networks and To receive information on these or other publica­ Conferencing Systems Wo rldwide tions from Digital Press. write: johnS. Quarterman. 1990, softbound, 719 pages ($49.95) Digital Press Department 01:1 X AND MOTIF QUICK REFERENCE GUIDE 12 Crosby Drive Randi J Rost, 1990, softbound, 369 pages (524.95) Bedford , MA 01730 FlFTH GENERATION MANAGEMENT: 617/276-1536 Integrating Enterprises Through Human Networking Charles M. Savage, 1990, hardbound, 267 pages (528.95)

14-l Vo l. 2 No. 4 Fa ll/')')II Digital Te cbnicaljournal ISSN 0898-901X

Printed inUSA EY -E762E-DPI90 0902 26.0 BUO Copyright © 1990Digital Equipment Corporation AU Rights Reserved