VAX 6000 Model 400 System

Digital Technical Journal Digital Equipment Corporation

Volume 2 Number 2

Spring 1990 Editorial Jane C. Blake. Editor Rarhara Lindmark, Associate Ediwr Richard W. Hcane, Managing Editor Circulation Catherine M. Phillips, Administrator Suzanne ,I. Babineau. Secretary Production Hden L. l'allcrson. Production Editor Nancr Jones. Typographer Rebecca A. Barker, Typographer l'ct<:r Woodhury, Illustrator �nd Designer Advisory Board S:unud H. Fuller. Chairman Robert M. Glorioso John W. M cCredie Mahendra R. Patel F (;rant Saviers Robt:rt K. Spitz William D. Strecker Victor A. Vyssotsky

The Ui!iila/Tecbnica/joumal is puhlished quarterly by Digi tal Equipment Corporation, J46 M:tin Street MLOI-5JB68. Maymrd. Massachusetts 01754-2571. Subsc riptions 10 the Journal are S·iO.OO for four issues and must he prepaid in li.S. funds. Univer­ si ty and college professors and Ph. 0. students in the electrical engint:ering and computer science fields receive complimen­ tarr suhscriptions upon request. Orders. inquiries. and address changes should be sent to The Di!!ita/'lecbnical}oumal at the published-by address. Inquiries on also be sent electronically on NI:ARNET to [email protected]. Single copies and back issues are 3vaibhk for S 16.00 each from Digit:! I Press of Digital Equipmc:nt Corporation, 12 Crosby Drive, Hcdlord. MA 01-_�0- 149.'1.

Digital employees may send subscriptio n orders on the ENET 10 RDVAX JOliRNAL or by interoffice mail toma ilstop MLO 1- 5/ll68. Orders should include badge number, cost center, site location code and address, and group name. ll s. engineers in Enginen­ ing and Manufacturing receive complimt:mary subscriptions; enginc:er�in tht:se organizations in countries outside the: lJ.S. should com act the journal office to receive: their complimemary subscriptions. All employees must advise of changes of address.

Commt·nts on the content of any paper are wdcomed and may be sem 10 the editor at the published- b\' or nc:t work address.

Copyright' 1990 Digital Equipment Corpora tion. Copying without fcc is permitJed provided that such copies are made for use in educational institutions by faculty members and are not disJributed for comml'rcial ad, ·antage . Abs tracting "'ith credit of Digital Equipment Corporation's authorship is permitted . All rights reserved.

The information in this journal is suhject 10 change without no tice and should not be construed as a commitment by Digital Eq uipment Corporation. Digital Equipmem Corporation assumc:s no responsibility for any errors that may appear in this journal.

ISSN 0898-90 I X

Documentation Number EY-<:1971:-0P

The following are trademarks of Digital Equipment Corpora­ CoverDesign tion: CJ, OECnet, DECst:uion .'1100, OH:wimlows, Digital, the Our cover depicts some of the common equations and terminology Digital logo, HSC. MicroVAX , llZ'i'), ThinWire, VAX, VAX-11/780. used in vector processing, which is one of the featured topics in VAX 6000, VAX 8700, VAX HHOO, VAX 9000, VAXBJ, VAXcluster, this issue. The v.-IXl'ector processor extends the VAX 6000 family to VAX FORTRAN, VAXveclOr(J000, V.\1S, !ILTRtX, XMI. address the compul ing need� (ifnumerically intensiue applications. !IN IX is a registe red trademark of American Telt:phone & Telegraph Company. The VAX 6000 Model •iOO systern isa brightstar in Digital's family dbx is a registered trademark of dbx, Inc of midrange multiprocessors and this issue's main product thenU!. ,\Ill'S is a trademark of MIPS Computer Systems. I nc.

The couer was des(rt,ned by David Comberg and Randy Zieglerof the Book production was done by Digital 's Educational Services C01porate Design Group. Media Communications Group in Hedfortl, MA. I Contents

9 Foreword Pauline A. Nist

VAX 6000 Model400 System

11 Vector Processing on the VAXvector 6000 Model400 Debra L. Slater, David M. Fenwick, D. John Shakshober, and Douglas D. Williams

27 The VAX6000 Model400 Scalar Processor Module Patrick Sullivan, Michael A. Callander, Sr. ,James R. Lundberg, Rebecca L. Stamm, and William). Bowhill

36 An Overview of the VAX6000 Model400 Chip Set W. Hugh Durdan, William). Bowhiii,John F Brown, William V. Herrick, Richard C. Marcello, Sridhar Samudrala, G. Michael Uhler, and Nicholas Wade

52 VAX6000 Model400 Physical Technology john T. Bartoszek, Robert). Hannemann, Stephen P. Hansen, Robert]. McCarty, and john C. Sweeney

64 VAX6000 Model 400 CPU Chip Set Functional Design Verification Richard E. Calcagni and Will Sherwood

73 Test and Qualification ofthe VAX6000 Model400 System John W Croll, Larry T. Camilli, and Anthony]. Vaccaro

84 Development of the DECstation 3100 Thomas C. Furlong, Michael). K. Nielsen, and Neil C. Wilhelm

89 Compiler Optimization in RISCSyst ems Larry B. Weber I Editor's Introduction

Bill Bowhill discuss the module design and give particulars on how difficult signal integrity prob­ lems were resolved. The five system chips that reside on the module are the topic of our next paper by Hugh Durdan, Bill Bowhill , John Brown, Bill Hl:rrick, Rich Marcello, Sri Samudrala, Mike Uhler, and Nick Wade. From their discussions of the chip designs, we learn how the best features of the VAX 8700 ECL-based, pipelined system and of previous VLSJ designs were incorporatedin the chip set, which achieves a cycle tinle of 28 nanoseconds. jane C. Blake This fast cycle time was one of several require­ Editor ments that drove a significant design effort for the physical technology. John Bartoszek, Rob Hannemann, Steve Hansen, Bob McCarty, and John Sweeney describe the technological advances This Spring 1990 issue marks the second issue to achieved in a number of areas, including tape­ be published on the new quarterly schedule of the automated bonding, semicustomized ceramic Digital Te chnicaljournal. This is also the ft rst year single-chip package design, and testability. that the Journal is available by subscription- a The two papers that close this collection of papers service our readers have asked fo r and which we are on the VAX 6000 Model 400 address chip design glad to be able to offer. verification and system test. Rick Calcagni and The journal will continue to fo cus each issue on a Will Sherwood explain the engineers' multipronged product theme. In fact, two products are featured in approach to design verification, an approach neces­ this issue. The main theme is the latest addition to sitated by the complexity of the chip set. Then, the V�'l: 6000 fam ily, the Model400. With its multi­ John Croll, larry Camilli, and To ny Va ccaro present processing capabilities, this midrange fa mily of a paper on the methods and tools designed to com­ systems provides a highly conf igurable and expand­ pletely test the interaction of VAX 6000 Model 400 able computing environment. Because the same system's hardware and software. cabinet, buses, and power systems are used by aU In the last rwo papers, the topic turns to work­ family members, systems can easily be upgraded to stations. To m Furlong, Mike Nielsen, and Neil achieve higher levels of performance. Papers in this Wi.lhelm provide an overview of the successful issue describe VAX 6000 Model 400 innovations and project undertaken to bui.ld a fast, competitively additions, including a new vector processor and a priced, RJSC-based, , called the higher performance scalar processor module, chip DECstation 3100. The Journal is fo rtunate also to set design and verification, physical technology have a related paper on compiler optimization in advances, and system test. The second theme com­ RISC systems by Larry Weber, vice president, MIPS prises two papers related to Digital's workstation Systems, Inc. MIPS Systems built the RJSC chip set development, specifically the DECstation 3100, and incorporated in the DECstation 3100 workstation. compiler optin1ization in RISC systems. I thank Steve Holmes of the Midrange Systems Opening this issue is a paper on one of Digital's Business Group for his help in selecting topics on first vector processors. Dave Fenwick, john the Model 400, and Gillian Scholes of Digital and Shakshober, Debra Slater, and Doug Williams joanne Hasegawa of MIPS Systems, Inc., fo r their review the design alternatives for the VAXvector help in obtaining the workstation and RISC papers in 6000 Model400 processor and describe its function this issue. units. They then give examples of how the units combine to deliver high performance fo r computa­ tionally intensive applications. The Model 400 also has a new scalar processor, with nearly twice the performance of its prede­ cessor, the Model 300. In their paper, Pat Sullivan, Mike Callander, Jim Lundberg, Rebecca Stamm, and

2 Biographies I

John T. Bartoszek john Bartoszek currently manages the Physical Technology Group within the SDE that is responsible for physical technology applications and product designs. John previously managed the PTG physical technology pro­ gram that spawned the physical technology used on the VAX 6000 Model 400 CPU module. He joined Digital in 1981 and has a B.S. in nuclear engineering from Lowell Technological Institute. He holds two patents for thermal control devices. john has authored several papers on spacecraft thermal control, solar energy, and electronics cooling and interconnect technologies.

William). Bowhill As a principal engineer with the Semiconductor Engineer­ ing Group, William Bowhill is project leader for an engineering team that is designing an execution unit section of a large CMOS-based microprocessor. He has applied for two patents for his design work on the vector interface and backup control chip on the VAX 6000 Model 400 system. Bill also holds a patent for his work in relation to the Model 400's floating poinr accelerator chip. He joined Digital in 1985. Bill was educated in Great Britain and received a B.Eng. (honors) in electronic engineering from Liverpool University.

John F. Brown After receiving an M. S.E.E. from Cornell University in 1980, John Brown joined Digital's engineering staff. At present, he is a principal engi­ neer and design team manager for the instruction decode section of the next generation CMOS-based VAX microprocessor. John's previous responsibilities include technical contributions to both the VAX 6000 Model 400 and the Model 200 chip sets. He was also the hardware engineer for the extended floating point data type enhancement to the VAX-11/780 system. John currently holds one patent, has two patent applications pending, and has authored technical papers for several publications.

Richard E. Calcagni A member of the Semiconductor Engineering Group's VLSI microprocessor verification group, Richard Calcagni has contributed to the design of several VLSI microprocessors in the areas of microcode, verification, and prototype system debugging. He has published several papers on CPU design, modeling, and verification. Prior to his work with microprocessors, he worked on module test process development for Digital's Customer Services Manufacturing organization. Before joining Digital in 1979, Rick worked for Burroughs Corporation. He received a B.S. (1976) in electrical engineering from the University of Rhode Island.

Michael A. Callander, Sr. Michael Callander is a principal engineer in Digital's Semiconductor Engineering Group. At present, he is responsible for the archi­ tecture of a future VAX CPU module. Mike led the VAX 6000 Model 400 system's REXMI chip set design project. His previous experience with Digital includes the design and verification of the CPU module for the VAX 8200 system and the VAX 8300 system. Mike received his B.S.E.E. degree from the University of Massachusetts in 1982. He joined Digital upon graduation.

3 Biographies

Larry T. Camilli Since joining Digital in 1979, Larry Camilli has been a member of several prod uct development projects. He is currently a software cnginL"ering supervisor and project leader fo r the architecture verification software development project, which fo cuses on the development and maintenance of test software for current and future architectures. Larry's previous responsi­ bilities include the development of components for a microcode compiler, and a software trace and analysis package. He holds a B.S.E.E. from Clarkson University and is a student in the M.S.C.S. degree program at Wo rcester Poly­ technical Institute.

john Croll John Croll is a principal software engineer in the Midrange Systems Engineering Group. In this pos ition, he is responsible for developing systems test tools for a future hardware product. John was a team leader in tht: Systems Integration Group for the VAX 6000 Model 400 system. He was also a project leader for the development of the VAX 6000 Model 400 console software. John joined Digital in 1978, and his previous responsibilities include the development of device drivers and other system software. He received a B.S. E. E. (1978) from Drexel University and is a member of IEEE and ACM .

W. Hugh Durdan A grad uate of the Rensselaer Polytechnical Institute, Hugh Durdan is a consulting engineer in the Semiconductor Engineering Group. In this position, his major responsibility is management of chip development. Hugh became involved in the VAX 6000 Model 400's chip design at its earliest begin­ nings in 1984 . He led the behaviora l modeling effort and the cache controller chip design project. He also managed the development of the custom chip set for the VAX vector procL"ssor and aFChitected and specified the vector interface bus. Hugh joined Digital in 19HO and worked on the chip design of the VAX 8200 processor.

David M. Fenwick The architect of the VAX vector 6000 Model 400 processor, David Fenwick joined Digital's United Kingdom office in 1980 and transferred to the Cnited StatL"s in 1985. He is a principal engineer for the Low End Midrange Systems Group. His previous experience with Digital includes field service regional support in the 'nited Kingdom and project leadership fo r the DMB32 communications controller for the European engineering offi ce. He also worked on the VAXBI and XMI programs in the United States. Dave received a B.Sc. (Honors) from Loughborough University of Te chnology in England.

Thomas C. Furlong The development of Digital's RISC-based is the responsibility of Engineering Manager Thomas Furlong. It was To m's Palo Alto-based design group that brought the MIPS RISC technology into Digital and developed the OECstation 3100 workstation . Tom has been with Digital for ten years. In addit·ionto his work with RISC-based workstations, he was a member of the start-up te:tm for the VAXstation group. Tom is originally from Detroit, Michigan, and earned a B.S.E.E. from Michigan State University. He holds two patents related to graphics workstation design and has three patents pending on the next-generation products.

4 I

Robert J. Hannemann Robert Hannemann is group manager of the SCIT Design and Engineering Physical Technology Group. His group is responsible for the delivery of IC packaging, module, physical design and test technology for Digital's microprocessor-based systems. A senior consultant engineer, Rob joined Digital in 1978. His prior experience includes positions at Bell Telephone Laboratories and the University of Maryland, where he was a member of the faculty. Rob holds the Sc.D. degree in mechanical engineering from MIT. He also holds two patents and has published several papers on heat transfer engineering and electronics packaging.

Stephen P. Hansen Senior Manager Stephen Hansen manages the SDE/PTG Technical Office. In this position, he coordinates programs related to cost­ effective packaging for future CMOS-based products, defines technical solutions for the next generation of semiconductor packaging, and provides technical coordination for external packaging and interconnect-related activities. In his twelve years with Digital, Steve has developed several packages and assembly processes for internally manufactured CMOS products, including the tape­ automated bonding process. He holds two patents in the areas of tape-automated bonding and packaging.

William V. Herrick Senior Consultant Engineer William Herrick is currently managing a new generation VA X chip design project in the Semiconductor Engineering Group. He joined Digital in 1977 and has been a member of many ZMOS- and CMOS-based product development projects, including the PDP-11123 system and the VA X 8200 system. Before coming to Digital, Bill worked for Raytheon and GTE Sylvania. He has coauthored several papers on solid-state physics and MOS chip design. Bill received a B.S.E.E. (1969, magna cum laude) from Tufts University, and an S.M.E.E. (1971) and E.E. (1971) from MIT and is a member of Tau Beta Pi, Eta Kappa Nu, and Sigma Xi.

James R. Lundberg James Lundberg joined Digital in 1985. Initially, he was a product engineer with the MOS Product Engineering Group and worked on many projects, including BIIC and CQBIC. Jim is currently a senior engineer in Digital's Semiconductor Engineering Group. He was responsible for the signal integrity on the VA X 6000 Model 400 system and is now working on the signal integrity for an advanced CMOS CPU chip set and module. Before coming to Digital, Jim operated his own business. He is a member of Tau Beta Pi and Phi Kappa Phi. He received a B.S.E.E. (1985, honors) from the University of Illinois.

RichardC. Marcello A contributor to the design of the VAX 6000 Model 400 system, Richard Marcello is currently an engineering manager for a new­ generation VAX chip design project. Rich was involved with reliability analysis of semiconductor devices before moving into design five years ago. He worked for Fairchild Semiconductor before joining Digital in 1981 . Rich coauthored the paper "System, Process and Design Implications of a Reduced Supply Voltage Microprocessor," which he recently presented at the ISSCC. He received B.B.A. and B.S.E.E. (1980) degrees from the University of Notre Dame and a M.S.C.S. (1985) degree from Boston University.

5 Biographies

Robert j. McCarty Since joining Digital in 1982 , Robert McCarty has hn:n involved in several major product development efforts. He managed the VAX 6000 Model 400 system's physical design project and M31 console and instrumentation module development, and led the PDP-11184 system engineer­ ing project. Before coming to Digital, Bob worked for AM International as the project leader for a laser-based document printer that was part of a document communication system. He holds a B.S. in electrical engineering from the University of Michigan and an M.B.A. in marketing from the University of Chicago. He is a member of Tau Beta Pi and Eta Kappa Nu.

Michael j. K. Nielsen A consultant engineer in the Wo rkstation Systems Engineering Group, Michael Nielsen is presently completing his responsibilities as architect and chief designer of the DECstation 5000 Model 200 workstation base platform. Mike joined Digital in 1984. Among the many projects he has worked on since that time are the DECstation 3100 workstation, for which he was the architect and chief designer, and the VAXstation 3520/3540, for which he was a member of the architecture and design team. Mike holds B.S.E.E., M.S.E.E., and Ph.D.E.E. degrees from Stanford University. He is a member of Tau Beta Pi and Phi Beta Kappa.

Sridhar Samudrala Currently acting as project leader for a floating point unit, Sridhar Samudrala is a principal hardware engineer in the Semiconductor Engineering Group. Sri joined Digital in 1977. Since that time, he has worked on testing and diagnostics, as well as the VAX 8200 system microcode, and floating point architecture and design. He holds two patents for his work in floating point design. Sri has an M.Sc. (Technology) from Andhra University in India and an M.S. E. E. from the University of Wisconsin.

D. john Shakshober John Shakshober is a senior hardware engineer in the Low End Midrange Systems Group. Previously involved in the hardware design of M31, a VAX parallel processor, John is now a member of the VAX 6000 Model 400 hardware group, where his particular focus is vector processor design. He joined Digital in 1984 after receiving a B.S. in computer engineering from the Rochester Institute of Technology. John received a M.S.E.E. from Cornell University in 1988. He is a member of IEEE and Tau Beta Pi. John's latest published technical paper, "Parallel Algorithms for Super Performance," was presented at SuperComputing 89.

Will Sherwood As a software consulting engineer, Will Sherwood manages the Semiconductor Engineering Group's VLSI microprocessor verification group. Previously, he managed the DECSlM logic simulation group. Will joined Digital in 1975 after receiving B.S.E.E. and M.S.E.E. degrees from Carnegie-Mellon University. In addition to his contribution to the Digital Technicaljoumal, he has published 15 technical papers and is a contributing author to three books. Will is a member of the IFlPS 10.2 working group on computer hardware descrip­ tion languages and has served as publicity chairman and program committee member for several international conferences.

6 I

Debra L. Slater In her position as principal software engineer, Debra Slater leads a group that provides performance modeling and analysis support to hard­ ware development teams. She and her group were integral members of the VAX vector 6000 Model400 vector processor design team. Prior to joining Digital in 1987, Debra worked for the Montreal Engineering Company, initiallyas a pro­ grammer/analyst and later as an independent consultant. She received a B.Sc. in mathematics and computer science in 1980 from Bishop's University in Quebec. In 1981, Debra received a master's degree in applied mathematics from the University of Waterloo in Ontario.

Rebecca L. Stanun Rebecca Stamm is a principal hardware engineer in the Semiconductor Engineering Group. She is currently leading the design of the backup cache, bus interface, and pin bus for a new VAX CPU chip. Rebecca was the architect of the backup cache controller chip for the VAX 6000 Model 400 system. She has also worked on design and verification of a RISC microprocessor. Rebecca joined Digital in 1983. She is a member of Eta Kappa Nu and IEEE, holds one patent, and has coauthored several technical papers for the ISSCC. She received a B.A. (1979) in history from Swarthmore College and a B.S.E.E. (1983) from MIT.

Patrick Sullivan The project leader for the VAX 6000 Model400 CPU module development, Patrick Sullivan is a hardware consultant engineer in Digital's Semiconductor Engineering Group. Pat is now managing a new CPU module development project. Before his work on the VAX 6000 Model 400, he led the group effort that brought MCA emitter-coupled logic (ECL) into the corporation. He is also responsible for the development of a number of Digital's main memory products and participated in the development of several 36-bit CPUs. Pat holds a patent for a memory controller interface. He received his B.S. and M.S. degrees from Northeastern University.

john C. Sweeney Currently a principal engineer, John Sweeney is working on the testability and test process development for a future Digital product. His previous experience includes being part of the test process development for the VAX 8600 and VAX 8800 systems. .John was an application engineer for GENRAD before joining Digital in 1981. He has authored several papers on boundary scan and fault isolation, and has one patent pending in relation to the VA X 6000 Model 400 system's test structures. john received a B.S. E. E. (1980) from Rensselaer Poly­ technical Institute and has taken graduate courses at Northeastern University.

G. Michael Uhler G. Michael Uhler is a consulting engineer in the Semi­ conductor Engineering Group, where he is currently leading the architectural definition for a new CPU. As the CPU architect for the VAX 6000 Model 400 system, Mike was responsible for the CPU architecture, performance evaluation, and CPU microcode. Since joining Digital in 1978, he has also worked on the development of symmetricmultipro cessing in the TOPS-10 , and on the microcode and hardware development for PDP-10 CPUs. Mike received a B.S.E.E. (1975) and M.S.C.S. (1977) from the University of Arizona and is a member of IEEE, ACM, Tau Beta Pi, and Phi Kappa Phi.

7 Biographies

Anthony J. Vaccaro A senior engineer in the Midrange Systems Evaluation Group, Anthony Vaccaro is at present responsible for several evaluation projects. These projects include the FV64A VAX 6000 Model 400 vector processor and KDM70 mass storage controller. Tony joined Digital in 1976 as a field engineer. Some of his earlier project responsibilities include product evaluations for the CIBCA-B VAXcluster Cl adapter and KA825 VAX processor. He was also a member of the team that developed a certification process for new VAXBI adapters. Tony holds a B.S. (1975, cum laude) from Suffolk University and is studying for an M.S.C.S. at Rivier College.

Nicholas Wade The implementation of the backup cache control for the next­ generation VAX CPU chip is being led by Nicholas Wade. Nick is a senior engineer in the Semiconductor Engineering Group. He joined Digital in 1986 and has worked on several projects, including the VAX 6000 Model 400 chip set. He performed the engineering evaluation and debugging for the system support chip on the VAX 3500 and VAX 6000 Model 200 systems. Nick was also a member of the behavioral design and implementation feasibility project for a CPU-XMI interface. He holds B.S. (1985) and M.S. (1986) degrees from Cornell University and is a member of IEEE.

Larry B. �ber As vice president of software development for MIPS Computer Systems, Inc., Larry Weber is responsible for the development, quality assurance, and integration of all systems software products. Larry is one of MIPS first employees, having joined the company in 1984. Prior to joining MIPS, he worked for Dialogic Systems and for IBM. Larry helped develop a PASCAL compiler for both the IBM mainframe and IBM RISC project. He has authored and coauthored a number of articles on compilers and languages. Larry holds a B.S. in mathematics from the State University of New York and an M.S. in computer science from the University of Colorado.

Neil C. Wilhelm Neil Wilhelm, a senior consultant engineer, is responsible for the development and maintenance of the Workstation Systems Engineering Group's CAD system and the design engineering of a low-cost workstation. Neil also designed Digital's first RISC-based system. Neil brought an extensive tech­ nical background to Digital when he joined the company in 1982. He has worked for Hewlett-Packard and Xerox Corporation, founded Ridge Computers, and taught at the University of Rochester. Neil holds a B.S. (1970) in engineering from Harvey Mudd College, and an M.S. (1971) and a Ph.D. (1973) in electrical engi­ neering from Stanford University.

Douglas D. Williams MIT graduate Douglas Williams is a principal engineer in the Midrange Systems Engineering Group. He worked on the architectural definition of the VAXvector 6000 Model 400 processor. He also supervised performance modeling and vector control chip development efforts for that pro­ cessor. Among the many other projects on which Doug has worked since joining Digital in 1981 are the RISC processor development, memory interconnect design, and VLSI design. He holds an S.B. and S.M. in electrical engineering and is a member of Eta Kappa Nu and Tau Beta Pi. Doug holds a number of patents and has several patent applications pending.

8 I Foreword Model 300 was introduced. The Model 400 utilized this newly architected chip set to provide single­ processor performance of 7 times that of the VAX-11/780 system and up to 36 times the VAX-11/780 system for six-processor systems. The performance of the Model 400 was over twice the single-processor performance and more than three times the multiprocessor performance of the Model 200 series, which had been announced only a short 15 months earlier. To support such aggressive product introduction Pauline A. Nist cycles, advanced development work on the new Group Engineering Manager generation of CMOS-2 chips began in mid-1984, Midrange Sy stems Business within months of the start of the CMOS-I designs. Actual design work began approximately a year Because microprocessor-based computer systems later. During this period, Digital made a major deci­ are complex, the work to design and architect cus­ sion to formally extend the VAX architecture to tom chips must be initiated long before module and incorporate full support for vector processing into systems work begins. the base instruction set for aU future VAX proces­ Looking back at recent history, Digital intro· sors. To provide this support, the chip designs duced the VAX 6000 family of computers in April already under way had to be modified to incorpo­ 1988 with the Model 200 series, which utilized the rate the new instructions. ftrst generation of the CMOS-based VAX micropro­ The scalar chip set developed consists of ftve cus­ cessor. The Model 200 was fabricated in Digital's tom VLSI parts. They are the CPU chip, the floating CMOS-1 (complementary metal oxide semiconduc­ point accelerator chip, the vector/cache controller tor) process. A single-processor Model 210 pro­ chip, the system support chip, and the clock chip. vided 2.8 times the performance of a VAX-11/780 The development of the custom chips required a system. One to four processor configurations pro­ team of over 40 people, including logic, circuit, and vided up to a total of 11 times the performance of a layout designers, and verification engineers. The VAX-11/780 system. (Papers discuss these chips and scalar CPU module design, standard cell interface systems in the August 1988 issue of this journal.) design, and associated verification team comprised In January 1989, Digital introduced the second an additional eight engineers. Additionally, the new generation of the VAX 6000 family, the Model 300 vector coprocessor module required three new cus­ series. The Model 300 increased single-processor tom parts, a new gate array, and a separate module performance from 2.8 to 3.8 times the VAX-11/780 design effort. system and total performance for a six-processor Since the direct shrink of die from the CMOS-1 to system to 22 times the VAX-11/780 system. The CMOS-2 process would account for only a 30 per­ 30 percent increase in single-processor perfor­ cent performance increase, the processor architec­ mance was made possible by a direct shrink of the ture had to be substantially changed to achieve die from Digital's 2.0 micron CMOS-1 process to more aggressive performance. Early in the project Digital's 1.5 micron CMOS-2 process. The new pro­ the chip design team established a clear goal to meet cess supported a 25 percent reduction in lateral and or exceed the performance of the VAX 8700 proces­ key vertical dimensions and a 78 percent improve­ sor, which has a performance of five times that of ment in circuit density. To gether, these changes the VAX-11/780 system. Some of the architectural improved chip performance by approximately changes included the following: 30 percent.

However, a simple shrink of the existing die did • A more pipelined architecture, specifically, a six­ not permit full exploitation of the new circuit den­ level pipelined engine built around three auto­ sity. Newly architected and designed parts had to nomous pipes be tailored to take full advantage of the density and speed available with the CMOS-2 process. • A 64-bit wide data bus with 27 separate address The VAX 6000 Model 400 series was formally lines versus a 32-bit multiplexed data/address introduced in July 1989, a mere SLX months after the bus used for the CMOS-I chip

9 I

• Support for decode of the new VAX vector volume manufacturing. It is often difficult to under­ instructions and transfer of instruction operand stand the time lag between the availability of the information to the vector interface bus and onto first prototype unit running the operating system the vector coprocessor module and a product that can be shipped to the customer. However, a substantial amount of work must be • A 2 kilobyte (K B) primary on-chip cache with done between these two events. As the formal qual­ single-cycle access supported by a 128KB off­ ification process for new semiconductor devices chip secondary cache begins, a parallel effort is undertaken to build a large • A 16-byte instruction prefetch queue number of early systems. These systems are used to identify any problems that may occur when the • Two quadword write buffers in the bus interface pieces of the system are assembled into configura­ unit tions typical of those used by actual customers. Experience with the CMOS-I chip showed that Testing is divided across several aspects, including the fabrication line was capable of producing a dis­ actual beta test of prototype un.its at customer sites, tribution of die across a performance range of 80 to formal testing by any required government agencies 100 nanoseconds (ns). As a result, whereas all (e.g., FCC, Ul, VDE), systems design verification CMOS-2 new designs supported the target of a 40 ns tests, and architectural testing that ensures that the CPU cycle time, it was an explicit goal to support new system complies with the formal VAX architec­ devices as fast as 28 ns, should chip yields produce ture standards. Once the majority of testing is in sufficient quantities of faster parts. The yields at process and the required interim milestones have 28 ns actually exceeded predictions and permitted been met, manufacturing begins turning the assem­ faster parts to be used in all products produced. bled inventory into fin.ished products to support First passes of all scalar CPU chips were available volume availability of the system. in April 1988. These chips could successfully boot When a series of systems such as the VAX 6000 both the VMS and ULTRIX operating systems. This family has established a history in the market, it success was due in large part to the aggressive use of becomes increasingly important to ensure that the computer-aided design (CAD) techniques. Func­ announcement of the latest family member coin­ tional design verification efforts alone represented cides with manufacturing's ability to quickly deliver 25 person-years of work on the scalar chip set. An a high volume of product on a worldwide basis. If additional 39 person-years were necessary to com­ manufacturing cannot do so, a demand will have plete the scalar CPU module and the vector copro­ been created that cannot be filled. Revenue and cessor verification efforts. sales are lost. The full payback from the many per­ The power-on of first-pass parts represented a son-years of design, simulation, design verification, significant accomplishment to those who worked and systems test is only finally realized when vol­ on the chips and the module. However, much ume manufacturing has begun. "behind the scenes" work was necessary to achieve The papers in this issue of the journal will this milestone. The success in this area represents provide insight not only into how microprocessors the culmination of work across a number of disci­ and systems are designed and architected, but also plines. Besides the semiconductor devices, a new into the multidisciplinary efforts necessary to bring 224-lead multilayer ceramic package was devel­ a successful product to market. The design of one of oped. New techniques, including tape-automated the first VAX vector coprocessors is also reviewed. bonding (TAB), were explored to attach the die to This review offers a summary of how new architec­ the package, and new specifications were necessary tural issues are resolved and how design trade-offs for the actual printed wire board material and board are made. layup. Finally, new manufacturing processes were Moving a product from advanced development necessary to permit surface-mount assembly and to engineering, through manufacturing, and into test of these devices on both sides of the module. the customer site, over a five-year period, requires Although initializing the operating system on the the efforts of many people around the world. ftrst CPU modules marks a key deliverable for the Although only the direct work of a small percentage chip and board designers, it is only a starting point of those people are represented in these papers, the for the systems activity that is necessary to fully test credit for the success of the products goes equally and qualify a new product prior to the start of high- to all members of the team.

10 Debra L. Slater David M. Fenwick D. john Shakshober Douglas D. Williams

Vector Processing on the VAXvector6000 Model400

The VAXvector 6000 Model400 processor extends the VAX 6000 family of midrange CMOS-based multiprocessors to address the computing needs of numerically inten­ sive applications. The three function unitsof the vector processorcombine to forman overall vector pipeline that operates at speedsof up to 90 MFLOPs for single-precision calculations and 45 MFLOPs for double-precision calculations. The processor's per­ formance can also be enhanced by taking advantage of overlapping and out-of order instruction execution, as well as chaining. Further, applications can be tuned to the VAXvector 6000 hardware through algorithm optimizations in areas such as equation solvers and signal processing routines to achieve optimal performance. Using the VAXvector 6000 Model400 system, performance increases ranging from 3 to 35 timesthat of the VAX6000 Model400 scalarsystem have been realized.

Vect or processing has significantly evolved over the VAX Ve ctor Processing Overview past two decades. In the late 1960s and early 1970s, The extension of the VAX architecture to include it was pioneered as a way to increase scientific vector processing features was done in a manner application computer performance ov er that that permitted a wide range of possible implemen­ 2 achieved by more traditional scalar computers. tations The extension also allowed existing VAX However, the technology was limited to an elite processors to execute code utilizing the new vector few who could afford multimillion-dollar super­ instructions under software emulation. computer sy stems and who were willing to sig­ The vector extensions to the VAX architecture nificantly re-engineer software applications. include: In the early 1980s, more sophisticated vectorizing compiler technology was developed. • The addition of 16 vector registers, each contain­ This technology allowed users to effect ively pro­ ing 64 64-bit elements gram in high-level languages, such as FORTRAN, • A set of load/st ore instructions used to move up rather than to manually vectorize using low-level to 64 elements of a vector register to and from assembly language. During this period, there memory were also significant developments in computer al gorithms that were better matched to the paral­ • A set of vector register-to-register arithmetic and lelism available in vector hardware. logical instructions, operating on up to 64 ele­ Over the past few years, a new breed of vector ments at a time processor, the mini-supercomputer, has emerged. • A set of instructions for sy nchronization This class of machine includes many of the perfor­ between scalar and vector processing mance features of traditional supercomputers, but subsystems at costs more commonJy associated with super­ . Because vector processing is now a Conceptually, the implementation of vector mainstream sty le of computing that is applicable to instructions within the VAX family of processors is a wide range of uses, the VAX architecture was sim ilar to that of floating point inst ructions. To recently extended to include vector operations. implement float ing point arithmetic, some sy stems Further, the VAX product line has been expanded to use dedicated floating point hardware, some sys­ include vector processing in both the VAX 6000 tems use microcode, and others emulate fl oating midrange sy st ems family and the VAX 9000 main­ point in macrocode. In vector processing, vector l frame family of systems. instructions dif fer from floating point instructions

Digital Tecbnicaljournal Vo l. 2 No. 2, Spri ng 1990 11 VAX6000 Model 400 System

in that they are designed to be executed in a semi­ store unit controls a cache memory and contains a autonomous manner with scalar instructions. Thus, virtual-to-physical address translation buffer (TB). vecror instmctions can be executed in parallel with Depending on the design of the scalar and vector scalar instructions or in parallel with other vector units, then: are two ways to implement the design instructions. Although the scalar and vector units memory interface for the scalar and vector operate somewhat independently, tht: units art: processors: closely coupled to ensure that memory manage­ • A combined scal.ar and vector processor that ment exceptions are precisely reported. Special shares a common cache, address translation operations ensure floating point exceptions and logic, and path to memory subsystem coherence between vector and scalar memory ref­ erences are synchronized. • Separate scalar and vector units with separate From a vector perspt:ctive, a typical VAX vector caches and address translation buffers implementation can bt: reduced to five hasic units. Both of these approaches have their relative The latter four units arc collectively rderred to as a merits and disadvantages. When significant data­ vector processor or vector unit. The hasic units are: sharing between scalar and vector units exists, the

• A scalar processor that executes scalar instruc­ combined approach provides more favorable cache tions, decodes vector ·instructions, which may performance because the common cache is updated contain multiple internal function units on both scalar and vector references. Separate caches may result in additional cache misses as • A vector instruction-sequencing control and reg­ data is "sloshed" between scalar and vector caches. ister scoreboard For limited data-sharing instances, the separate approach may offer more favorable cache perfor­ • A vector register file mance. In a combined cache, vector references can • An arithmetic pipeline or pipelines that consist displace needed scalar data and vice versa. This of one or more arithmetic/logic units problem does not arise with separate caches because the scalar and vector data each has a • A load/store unit for memory references dedicated cache. The separate cache approach also The vector control and scoreboard logic accepts allows scalar and vector cache operations to occur instructions and operands from the scalar processor in parallel. ami dispatches them to the individual function units In implementing a vector processor, the selection within the processor. It also reports exceptions and between the above alternatives is often driven interrupts to the scalar processor. Since multiple more by technology constraints than issues of archi­ vector instructions can be executt:d in parallel , the tectural elegance. The VAX 9000 system, which is unit may contain scoreboard logic ro identify and implemented in emitter-coupled logic (ECL), chose manage resource conflictsbetween instructions. the combined approach. This approach supported The vector register file contains the 16 vector reg­ sharing costly cache RAMS and a common path isters, each of which consists of 64 64-bit elements. to memory. The VAXvector 6000, which is imple­ The register file has multiple ports that permit loads mented in complementary metal oxide semi­ or stores to operate while operands are sent to the conductor (CMOS) technology, chose the separate arithmetic pipes and results are received. approach for two reasons. First, module space and The vector arithmetic/logic pipelines implement package pin count constraints made it difficult to all the integer, logical, and floating point instruc­ implement both scalar and vector functions on a tions. These pipelines may be composed of separate single module. Second, the cost penalties for sepa­ pipelined add, multiply, and logical units. Or, they rate scalar and vector cache RAMS and separate may be composed of multiple pipes that operate in paths to memory were not prohibitive. parallel, with each pipe consisting of a pipelined add/multiply/logic unit. VAX 6000 Ve ctor Processor The load/store unit is responsibk for memory Description references. It generates the required virtual addresses (VA), performs translation from virtual Sy stem Block Diagram to physical addresses, and loads or stores the data The system block diagram for a vector-capable to or from the register files to memory . The load/ VAX 6000 Model 400 machine is shown in Figure 1.

12 Vi>/. 2 Nu. 2, Sp ring 1990 Digital Technicaljournal Ve ctor Processing on the VAX vector 6000 Mo de/ 400

SCALAR VECTOR SCALAR VECTOR PROCESSOR I- PROCESSOR PROCESSOR I-- PROCESSOR

I I I ... XMI SYSTEM BUS I 1 I 32MB 32MB Bl DISK ... MEMORY l··· MEMORY PORT CONTROL (UP TO 256MB OR MODULES) 8 CABLE I f-- RAI 70 STRIPESET VAXBI DEBNI BI PORT ESE20 L..SOLID- STATE DISK

RAXX KDB DISKS 1--

Figure 1 VAX vector 6000 Mode/ 400 Sys tem Block Diagram with Dual Scalarand Ve ctor Processors

The vector processor occupies a slot adjacent to the • A load/store unit, implemented by one chip, scalar processor, and both are interconnected by a which also controls a 1 megabyte (MB) cache short interface cable. The vectOr processor receives all instructions from and returns status tO the scalar Ve ctor ControlCh ip processor across this cable. For memory references, When the scalar processor encounters a vector the vectOr processor has its own independent path opcode, it parses and fetches the operands. to main memory . The VAX 6000 Model 400 system The opcode and all its operands are dispatched supports configurations of up to six scalar proces­ through the instruction bus to the vector processor. sors. However, vectOr systems have additional For arithmetic instructions, the scalar processor configuration constraints because of the increased will proceed to decode the next opcode in the memory bandwidth and XMI slot requirements of instruction stream. However, fo r load and store the scalar/vectOr processor. The VAXvector 6000 instructions, the scalar processor is stalled until all Model 4 00 system supports configurations of single address translations are completed. Stalling guaran­ or dual scalar/vector processors. or configurations tees that any memory management violations are of one scalar/vectOr processor and up tO three addi­ synchronous and that the scalar processor can tional scalar processors. To satisfy memory band­ restart the fa ulting instruction correctly. Within the width requirements, VAXvectar 6000 systems with vector unit the vector control chip is responsible a single scalar/vector processor require at least two for all scalar vector communication. When instruc­ memory controllers. Dual scalar/vector systems or tions are received by the vector controller chip, single scalar/vector systems with additional scalar the vector controller buffe rs the instructions and processors require at least fo ur memory controllers. controls instructions issuing to the other function units within the vector processor. VAX 6000 Mode/ 400 Ve ctor Processor An important aspect of the vector control chip The block diagram for the vector processor is is the register scoreboard logic, which identifies shown in Figure 2. The machine is divided into potential register conflicts when vector instructions three separate function units that can operate in are executed in parallel. By maintaining accurate combination or independently: register usage data, the vector control chip can optimize parallelism with the vector processor. • A vector controller, implemented as a single chip Optimal performance is achieved by executing • Arithmetic pipelines implemented by fo ur pairs arithmetic operations in parallel with load and store of chips, i.e., register fi le and vector data path operations, chaining the results of arithmetic opera-

Digital Te cbnicaljournal Vol. 2 No . 2, !>/JI "ing 1990 13 VAX 6000 Model 400 System

SCALAR PROCESSOR VECTOR PROCESSOR

VECTOR CONTROL CHIP

64

INTERNAL CACHE BUS

VECTOR ARITHMETIC FPU CHIP REGISTER PIPELINES FILES

LOAD/STORE SCALAR XMI DUPLICATE AND INTERFACE TAG STORE XMI INTERFACE

XMI SYSTEM BUS

Figure 2 VA Xvector 6000 Mode/ 400 Scalar/Vector Processor Block Diagram tions into store operations, and even dynamically double-precision calculation every two cycles. The re-ordering the execution of arithmetic instructions four pipes collectively retire fou r single-precision relative to load and store instructions to improve calculations every cycle, or two double-precision parallelism. calculations every cycle. Thus, a much higher float­ The vector control chip sends aU error status ing point performance is achieved than with only and machine checks to the scalar processor. When one individual pipeline. an error is encountered, the control chip attempts The register file chips receive instructions from to retry the failing transaction. If the retry is suc­ the vector controLler and data from the cache or cessful, a soft error interrupt is sent to the scalar load/store unit. The register file chip provides read processor. If the retry fails, either a hard error operands to the arithmetic pipeline and stores write interrupt or a machine check is sent to the scalar results and mask information. To maximize the use processor. Read operations that fa il result in of cache bus bandwidth, two 32-bit operands can machine checks. Write operations that fail result be combined into a single 64-bit transfer that is in hard error interrupts. simultaneously read or written to two separate reg­ Although not part of the overall control function, ister file chips. The register file internally has four the vector control chip also contains logic to imple­ 64-bit ports. (One is a read/write port for memory ment the IOTA instruction. The IOTA instruction data; two are read ports for operands; and one is builds a set of offsets in a vector register. This func­ a write port for results. While one instruction is tion did not fit conveniently in any other vector writing its results, a second can start reading its function. The control chip was selected because it operands. Thus, the instruction pipeline delay is had the space available to contain the fu nction. hidden. Variations in pipeline length between instructions are smoothly handled to ensure that Ve ctor Register File and Arithmetic no gaps exist in the flow of write data. Pipeline The register file can hold two outstanding arith­ The VA.Xvector 6000 processor's arithmetic pipe­ metic instructions in its internal queue. Therefore, line is organized as four pipes. Each pipe consists of the vector controller can preload the arithmetic a quarter of the register file (every fourth element of instruction queue with a second instruction, i.e., the vector registers), and an associated arithmetic/ deferred instruction. Preloading allows the vector logic unit. Each individual pipeline can retire controller to free the cache data bus, which is one single-precision calculation every cycle or one also used for instruction issuance, for use by a sub-

14 Vo l. 2 No. 2, Sp ring 1990 Digital Te cbnicaljournal Ve ctor Processing on the VAX vector 6000 Mode/ 400

sequent load or store instruction. This feature tions. These instructions involve virtual-to-physical improves performance because the arithmetic address translation, cache management, and inter­ pipeline can execute two arithmetic instructions action with the memory bus. If a load or store in the time it takes to execute one load or store instruction requires an offset register, such as instruction. scatter or gather, the offset register is first read into The register file's operand and result ports are a buffer and then added to the instruction's base used by the vector arithmetic pipeline chip. address. This process eliminates turning around the Operand data is sent over a 32-bit bus that is driven internal bus for each offset read, which would add twice per cycle. Results are returned on a separate more overhead. For strided load or store instruc­ 32-bit bus that is driven once per cycle. The two tions, the address is generated by adding the stride operands for single-precision instructions can be to the instruction's base address. passed in one cycle, while double-precision Load or store instructions can operate on either operands take two cycles to transfer. Each arith­ 32-bit (i.e., long word, single-precision) or 64-bit metic pipeline chip has a throughput of one single­ (i.e., quadword , double-precision) data types. precision operation per cycle, one double-precision When executing unity-stride 32-bit load or store operation per 2 cycles, and one single-precision or instructions, the load/store chip operates on two double-precision divide per 10 or 22 cycles. The elements at a time. Two 32-bit elements are com­ arithmetic chip has a pipeline delay of six cycles for bined into a single 64-bit cache reference. This double-precision multiplications, and five for all combination significantly enhances performance others (except divides), including the data transfer enhancement in unity stride single-precision data cycles. Integer operations are recoded internally as operations. double-precision floating point data types. The Virtual-to-physical address translation is per­ vector arithmetic pipeline chip is a full custom formed using an on-chip, 136-entry, 68-way-asso­ implementation largely based on the design of the ciative translation buffe r (TB). This configuration scalar processor's floating point unit:' maximizes address translation efficiency, which is very important because onlyli mited chip space was Load/Store Un it available. To optimally service TB miss conditions, The control chip uses the vector processor's inter­ the load/store chip contains dedicated logic that nal bus to issue instructions to the function units. directly references page table entries upon a TB However, once a load or store instruction is issued, miss. A simpler alternative would have been to use the load/store chip becomes bus master and con­ microcode in the scalar processor to fetch new trols the internal bus. Either the load/store chip, page table entries upon a TB miss. However, the vector register files, or the cache can drive the bus. dedicated logic approach was chosen to enhance Once a load or store instruction starts execution, performance for applications that exceed the size no further instructions can be issued until it com­ of the TB. Under certain TB miss conditions, the pletes. This rule simplifies the control chip score­ vector processor may be unable to compute a new boarding because once a load or store instruction is virtual-to-physical address translation. This situa­ started, no further instructions can start. Therefore, tion can occur when the addressed page is invalid or scoreboarding of these instructions against the out­ has been paged-out to disk. When such a miss standing load or store instruction is not necessary. occurs, the vector unit reports an exception back Because scoreboarding of outstanding instructions to the scalar processor. Once the scalar processor requires considerable logic complexity in the corrects the situation, the instruction is retried from vector control chip, it was important to keep the the beginning. complexity of this operation minimal. An addi­ Since the scalar processor must be able to restart tional benefit was the simplification of the internal the fa ulting vector instruction, it is important to bus protocol. It was excessively complex to imple­ precisely identify any vector memory management ment the capability to stop load or store instruc­ exception with the associated vector load or store tions in progress. This alternative was not pursued instruction. This identification is achieved by block­ because the resulting performance benefit was ing issuance of further instructions until the vector minor in comparison to the amount of work unit notifies the scalar processor that the vector involved. instruction is free of memory management faults. The load/store chip executes the vector load, The vector unit contains memory management store, scatter and gather memory reference instruc- prediction logic, called MMOK logic. MMOK logic

Digital Tecbnicaljournal Vol. 2 No. 2, Sp ring 1990 1 5 VAX 6000 Model 400 System

allows the scalar processor to issue additional is being written to memory. If the load instruction instructions in parallel with the execution of the takes a cache miss, the load stalls until the store vector load or store instruction. During execution completes. This simple scheme improves instruc­ of a strided vector load or store instruction, once tion overlap when load instructions follow store it is established that the current vector element instructions without adding undue complexity to references the same TB entry as the last element the load/store unit design. of the load or store instruction, and that the associ­ ated TB entry is free of memory-management Performance Characteristics exception conditions, the load/store unit can safely The interaction between the different functional report "address translation successful," i.e., MMOK, units of the vector processor creates a number of sit­ to the scalar processor. Early prediction of success­ uations that affe ct the performance and execution ful address translation permits the scalar processor of vector instructions. These include: to be released and allows it to operate asyn­ • Overlapping instructions chronously with the remainder of the vector load or store instruction. • Out-of-order instructions Once a physical address is obtained, the load/ • Chaining store chip references its 32K entry tag store. The address is delayed and passed to the 1MB cache data Overlap of Instructions store. This delay permits cache tag lookup and com­ Arithmetic and load/store instruction execution pare to complete before data is written to the cache may overlap because the functional units are inde­ on store operations. In parallel, the corresponding pendent. In order to achieve this overlap, the register file address is presented to the fou r register following conditions must be met. file chips. The data and addresses are automatically aligned fo r load and store operations to permit • The arithmeticinstruction must be issued before correct reading and writing of the register file and the load or store instruction. cache data RAMs. Upon cache miss, the load/store • There must be no register conflict between the unit queues the associated 32-byte block read oper­ arithmetic and load/store instructions. ation with the memory interface logic and contin­ ues processing other elements. Up to four cache In the fo llowing examples of arithmetic and load/ misses can be outstanding before the read data for store instruction interactions, an "1" represents the first miss returns. Hits continue to be processed instruction issue time, and an "E" represents while the misses are outstanding. On vector proces­ instruction execution time. The "VR" represent sors, the most important fa ctor is the time required vector registers. The expression "std" is used to to complete the entire load or store operation, represent the stride. A series of periods " ..." repre­ rather than the time needed to fe tch an individual sents wait time in the arithmetic unit for deferred element. The cache miss handling feature permits instructions. (Note: These examples are not the vector processor to m:vdmizeits use of available intended as timing diagrams.) XMI bandwidth. VVADDx VR1 ,VR2 ,VR3 IEEEEEEEE The vector cache tag and data are parity-pro­ VLDx A,std,VR1 IEEEEEEEEEEEEEE tected. Should a cache parity error occur, the cache is disabled and the instruction retried from the As can be seen in the example above, the execu­ beginning. This method was the simplest option for tion of the vector load instruction (VLDx) can soft recovery of cache parity errors. The operating overlap the vector add instruction (VVADDx) system receives a soft error interrupt and can, at because there are no register conflicts between the its option, re-enable the cache. two instructions. In the next example, instruction The load/store chip contains a 32-element write overlap is inhibited because the VVADDx instruc­ buffer to enhance performance of store operations. tion is writing to the register to be loaded, VR3. Since the vector cache operates at higher band­ VVADDx VR 1 ,VR2 ,VR3 IEEEEEEEE widths than the system bus, the buffe r isolates the VLDx A,std ,VR3 IEEEEEEEEEEEEEE store performance from the slower XMI memory bus. Furthermore, a subsequent load instruction In comparing these two examples, it is clear that that hits in cache can execute while the write buffe r the overlap of the execution of the VVADDx and the

16 Vu l 2 No. 2, Sp ring 1990 Digital Te chnical]ournal Ve ctor Processing on the VAX vector 6000 Model 400

VLDx greatly reduces the total execution time of in the deferred instruction case. Once the VLDx the instruction sequence. By raking advantage of instruction is issued, no other instructions may be this hardware feature, application codes can show issued. The instruction overlap execution made greatly improved performance. possible by the deferred instruction queue signi­ ficantly reduces total execution time. Out-oforder Instruction Execution Comparing the last two examples, in the case The arithmetic unit includes a deferred instruction where a deferred instruction queue wa s used, the queue of length 1. This queue allows the vector VLDx instruction can begin executing before the control and scoreboard logic to queue one instruc­ VVMULx. It also could complete before the VVMULx tion to the arithmetic unit while that unit is still instruction completes simply because the VVMULx executing a previous instruction. The vector con­ instruction is sent to the deferred arithmetic troller checks the queue's status for an instruction instruction queue. This out-of-order execution of when it checks the function unit's availability. Both instructions allows increased overlap of instruc­ the deferred and currently executing instructions tions, which again reduces the total execution time are checked for register availability. This queue of the instruction sequence. frees the issue unit to process another instruction rather than waiting for the arithmetic unit to com­ Chaining plete its current instruction. Ve ctor operands are generally read from and writ­ For the following instruction sequence, ten to the vector register file. An exception to this VVADDx VR1 ,VR2 ,VR3 process occurs when a store instruction is waiting

VVMULx VR3 ,VR1 ,VR4 for the results of a currently executing arithmetic VLDx A,std ,VR2 instruction. (Divide instructions are nor included in this exception because they do not have the same execution without a deferred instruction queue degree of pipelining as the other instructions.) As would resemble this example: results are generated by the arithmetic instruction Issue VVADDx I EEEEEEEE and are ready to be written to the register fi le, they Issue VVMU Lx IEEEEEEEE are also inunediately available for input to the wait­ Issue VLDx IEEEEEEEEEEEEEE ing store instruction. Therefore, the store instruc­ tion can begin processing the data before the Execution with a deferred instruction queue arithmetic instruction has completed. This process would look like the following: is called "chain into store." The srore instruction Issue VVADDx IEEEEEEEE will not overrun the arithmetic instruction because Issue deferred VVMU Lx I ...... EEEEEEEE the srore instruction can process data fa ster than the Issue VLDx IEEEEEEEEEEEEEE arithmetic unit can generate results. These examples illustrate the use of a deferred The following instruction sequence arithmetic instruction. If a deferred instruction VVADDx VR 1 ,VR2 ,VR3 queue was not implemented, the VVMULx instruc­ VVMULx VR 1 ,VR2 ,VR4 tion could not be issued until the VVADDx was VSTx VR3,A,std either completed or nearly completed. The VLDx instruction would not issue until after the VVMULx would resemble the example in Figure 3 ifexecuted was issued and would complete much later than without the chain into store process.

Issue VVADDx IEEEEEEEE

Issue deferred VVMULx I. . EEEEEEEE Issue VSTx IEEEEEEEEEEEEEE

Figure 3 Sample Instruction Sequence without Chain into Store

Digital Teclmical]ournal Vo l. 2 No. 2. Sp ring 1990 17 VAX 6000 Model 400 System

If the instruction sequence were executed with In the inst ruct ion sequence shown in Figure 4, the chain into store process, however, it would the main loop of a SAXPY or DAXPY BLAS 1 routine, 1 follow this example: there is very little instruction overlap. In the reordered instruction sequence shown in Issue VVADDx IEEEEEEEE Figure 5, the VVMULx and second VLDx instruc­ Issue deferred VVMULx I ...... EEEEEEEE tions overlap, and less total execution time is Issue VSTx IEEEEEEEEEEEEEE required than in the first example. In these examples, the VSTx instruction requires The only real difference between the instruction the result of the VVADDx instruction. Without the sequences in Figures 4 and 5 is the order in which chain into store operation, the VSTx instruction they are issued. By recognizing that the VVMULx must wait for the VYADDx to complete before does not require the result ofth e second VLDx and beginning. The chain into st ore operation allows can precede that instruct ion, a significant reduct ion the VSTx instruction to begin while the VYADDx in execution time is achieved. is st ill executing, increasing the amount of instruc­ The overlap of load and st ore inst ructions can tion execution overlap. As a result , the instruction also be effect ively maximized by preceding, sequence requires a shorter period of time to com­ wherever possible, all load and store instructions by plete execution. at least two arithmet ic instructions. In this way, both the load and st ore pipeline and the arithmet ic pipeline are in use. Load and Store Un it Performance Memory access instructions are typically slower Minimize Register Conflict Wa its A load instruc­ than arithmetic instructions and will frequent ly tion cannot begin execution until the register to dominate the performance of vectorized appli­ which it will write is free. A register conflict may cations. Certain coding techniques help minimize occur ifth e dest ination register of a load inst ruction the time spent waiting for load and store instruc­ is the same as the register for a preceding arithmetic tions to complete and reduce the resulting impact instruction. If using a different register for the load on performance. instruction would permit instruction execution overlap to occur, the destination register should, if Maximize In struction Execution Overlap Three possible, be changed. important hardware features help maximize instruction execution overlap in the load/store unit . Locality of Reference of Data The locality of ref­ First , a load or store instruction can execute in erence of data is important in determining the per­ parallel with up to two arithmetic instructions, pro­ formance of load and store operations. Because vided the arithmetic instructions are issued first. unity stride load and store instructions are the most Second, the chain into store sequence can reduce efficient memory access instructions. whenever the perceived execution time of a store instruction. possible, data should be stored in the sequential Finally, early detection of no memory fa ults allows order in which it is usuallyreferenced. scalar-to-vector unit communications to overlap Non-unity st ride load and st ore operat ions can with load or store instruction execution. have a significant ly higher impact than unit y stride

VLDx X,st d,VR1 IEEEEEEEEE

VLDx Y,std ,VR2 IEEEEEEEEE

VVMULx VR3,VR1 ,VR1 IEEEEE

VVADDx VR1 ,VR2 ,VR2 I. .. .EEEEE VSTx VR2,Y,std IEEEEEEEEE

Figure 4 Sample Instruction Sequence of Ma in Loop of a SAXP Y or DAXPY BIAS 1 Routine

18 Vo l. 2 No. 2, Sp ring /')')() Digital Technicaljournal Ve ct01·Pmcessing on the VA Xvector 6000 Mo del 400

VLDx X,std,VR 1 IEEEEEEEEE VVMU Lx VR3 ,VR1 ,VR1 IEEEEE VLDx Y,VR2 IEEEEEEEEE VVADDx VR 1 ,VR2 ,VR2 IEEEEE VSTx VR2 ,Y,std IEEEEEEEEE

Figure 5 Sample Instruction Overlap Sequence of a Main Loop of a SAXPY or DA XP Y BLAS 1 Routine

operations on the performance level of the XMI code. Several examples that illustrate effective opti­ memory bus. More memory references are required mization methods are discussed in the Algorithm for non-unity stride operations of the same vector Optimization section of this paper. length. If the ratio of cache miss load and store instructions to arithmetic instructions is sufficiently Arithmetic Un it Performance high and non-unity stride is used, bus speed and Once an instruction begins execution in the vector bandwidth can limit performance. arithmetic unit, it continues executing until all Load and store operations that hit in cache are results are completed. A deferred arithmetic less costly than those that miss cache. Any piece of instruction may only begin execution after the data must be loaded from memory to cache the first instruction in the pipeline completes, or if the first time it is referenced. If data that is referenced more results of the deferred instruction will not complete than once remains in the cache, i.e., is not displaced before the last results from the current instruction by subsequent data accesses, other references to are completed. The instruction overlap will be that data will not incur memory access costs, and particularly significant for shorter vectors because better performance results. the startup time, i.e., the time that can be On any cache fill, a 32-byte data block is read into overlapped , for an arithmetic instruction is fi xed the cache. (Note: This is equivalent to 8 long words overhead that represents an increasing portion of or 4 quadwords. Single-precision data is a long the execution time as vector length decreases. word; double-precision data requires a quadword .) When non-unity stride loads are used, performance Peak Performance can be improved by using the additional data read The Model 400 vector processor has a cycle time in to cache. This improvement can be achieved by of 44.44 nanoseconds. For single-precision opera­ fol lowing cache miss non-unity stride loads with tions, this cycle time translates to a theoretical peak non-unity stride loads that reference the additional performance of 90 MFLOPs. For double-precision data and will, therefore, hit in cache. operations, the theoretical peak performance is 45 Large data arrays and strides can also have an MFLOPs. Theoretical peak performance is calculated impact on the efficiency of the translation buffer. from the number of results per cycle and the cycle For large strides, i.e., greater than 256 single-preci­ time as follows: sion elements or 128 double-precision elements, 1 a translation buffer miss can occur for each vector Single-precisionpeak =4 /(44 .44 x 10 ) element. Even with unity stride, the translation 3 buffe r miss rate will be higher for large data arrays. Double-precisionpeak =2 I (44.44 x 10 )

Algorithms It is sometimes necessary to consider Crossover Po int the algorithm that is represented by the code to be For any given instruction or sequence of instruc­ optimized because some algorithms are not as well tions, there is a particular vector length where both suited to vector processing as others. It may be more the scalar and vector processing of equivalent oper­ effective to change the algorithm used or the way ations yield the same performance. This vector it is implemented than to optimize the existing length is the crossover point between scalar and

Digital Te cbnical]ournal Vu l. 2 No. 2, Sp ring 1990 19 VAX 6000 Model 400 System

vecror processing and is unique to the particular To rake advantage of the processing power of the instruction or sequence. Scalar operations are fa ster VAXvector 6000 Model 400 system, we concen­ for vector lengths below the crossover point. Ve ctor trated on four basic optimization mer hods: operations are more efficient for vector lengths • Rearrange code for maximum vectorization of above the crossover point. A low crossover point is the inner loop and remove data dependencies considered a benefit because it indicates that it is within the loop easier to take advantage of the power of the vector processor. A low crossover point means that more • Ve crorize across contiguous memory locations to code can benefit from the vector processor. produce unity stride vectors for increased cache For any single, isolated vector instruction, the hit rates and optimized cache miss handling crossover point on the Model 400 system is quite • Reuse the data already loaded inro the vector reg­ low, generally about 3 or 4. However, an instruction isters as freq uently as possible to reduce rhe is not performed in isolation. Jn a routine or number of vector load and store operations application, other factors affect the performance of • the operations on short vectors. These effects are Maximize instruction execution overlap by pair­ particularly seen when the short vector's data is ing arithmetic instructions between load and used in several vector operations. store instructions wherever possible On the Model 400 system, performing as much (Note: Further information on optimization code as possible on the vector processor, including techniques in FORTRAN can be found in the VA X short vector length sections, can mean higher FORTRA N Performance Guide available with the " system performance. Performance is improved FORTRAN-High Performance Option. Additional because the cache is used more optimally. Speci­ information on macrocoding for the VAXvector fi cally, once vector instructions have referenced a 6000 Model 400 vector processor can be fo und in piece of data, that data is included in the vector the VA X 6000 Ve ctor Processor Programmer's unit's cache. Subsequent scalar operations on that Guide. 6) data will require moving the data from the vector By analyzing the groups of applications that have cache into the scalar cache. Continued code sec­ high vector processing potential, we identified two tions of vector references fo llowed by scalar basic areas where optimization techniques can be references tend to invalidate the two caches too most useful: equation solvers and signal process­ frequently. Therefore, a vector operation is usually ing routines. For example, computational fluid more efficient than a scalar operation. The cross­ dynamics, finite element analysis, molecular over point on the Model 400 system is low enough dynamics, circuit simulation, quantum chromody­ that scalar processing is the faster alternative only namics, and economic modeling applications use for isolated operations on short vectors. various types of simultaneous or differential equa­ tion solvers. Applications such as air pollution modeling, seismic analysis, weather forecasting, Algorithm optimiza tion Examples radar imaging, speech and image processing, and The previous section of this paper discussed how many other scientific and engineering applications the characteristics of the VAX vector 6000 Model 400 use signal processing routines, such as fa st Fourier system's vector processor can affect performance. transforms, to obtain solutions. The fo llowing examples illustrate how that perfor­ mance information can be used to build optimized Equation Solvers routines. The examples also show how an algorithm Equation solvers generally fa ll into fo ur categories: and its implemenration can change the performance general rectangle, symmetric, hermitian, and tri­ of an application on the VAXvector 6000 processor. diagonal. The most common benchmark used to Algorithm changes can alter the data access pat­ measure a computer system's ability to solve a terns to use the memory subsystem more efficiently, general rectangular system of linear equations is � can increase the average vector length, and can min­ Linpack. The Linpack benchmarks, developed at imize the number of vector operations required. Argonne National Laboratory, measure the perfor­ By applying Amdahl's Law of vectorization, we can mance across diffe rent computer systems while improve performance by increasing rhe percentage solving dense systems of 100, 300, and 1000 linear of code that is vectorized. equations.

20 Vol. 2 No. 2. Sprin!!, 1990 Digital Te chnicaljo urnal Ve ctor Processing on the VAX uector 6000 Mode/ 400

These benchmarks are currently written to call The performance of the Linpack IOO by IOO subroutines from the Linpack library. The subrou­ benchmark, which calls the Figure 3 routine, shows tines , in turn, call the basic linear algebra subrou­ how an algorithm with approximately 80 percent tines (BLAS) at the lowest level . For each benchmark vectorization can be limited by the scalar portion. size, there are different optimization rules which One form of Amdahl's Law relates the percentage of govern the type of changes permitted in the Linpack vectorizecl code compared to the percentage of report. Optimizations to the BLAS routines are scalar code to define an overall vector speedup. This always allowed . Modifications can be made to the ratio between scalar runtime and vector runtime is FORTRAN source or by supplying the routine in described by the following formula: macrocode. Algorithm changes are only allowed for the largest problem size, the solution to a system of Ve ctor sp eedup = time scalarI (%[scalarI x 1000 linear equations. time scalar) + (I% vector I x time vector) The smallest problem size uses a two-dimensional Under Amdahl's Law, the ma.:ximum vector array that is 100 by 100. The benchmarks are writ­ speedup possible, assuming an infinitely fast vector ten to use Gaussian elimination fo r solving 100 processor, is: simultaneous equations. This two-step method fea­ tures a factorization routine, xGEFA, and a solver, Ve ctor sp eedup = 1.0 I (0. 2) X 1.0 + (0.8) X 0 = xGESL. Both are column-oriented algorithms and 1.010.2 = 5.0 use vector-vector level 1 BLAS routines. Column orientation increases program efficiency because As shown in Figure 7, the Model 400 processor it improves locality of data based on the way achieves a vector speedup of approximately 3 for FORTRAN stores arrays. the 100 by 100 Linpack benchmark when using the As shown in Figure 6, the BLAS level I routines BLAS 1 subroutines. It follows Amdahl's Law closely allow the user to schedule the instructions opti­ because it is small enough to fit the vector proces­ mally in vector macrococle. Deficiencies in BLAS I sor's 1 Mbyte cache and, therefore, incurs very little routines include frequent synchronization, a large overhead clue to memory hierarchy. calling overhead, and more vector load and store For the Linpack 300 by 300 benchmark, opti­ operations in comparison to other vector arithmetic mizations include the use of routines that are operations. equivalent to matrix-vector level 2 BLAS routines.

xAXPY - computes Y( I) • Y< I l • aX ( I)

where x.prec ision F, D, G

MSYNC ;synchron ize wi th scalar LOOP :

VLDx X( Il,std,VRO ; X( I) is loaded into VRO VSMULx a,VRO ,VRO ;VRO ge ts the product of VRO

;and the scalar va lue "a" VLDx Y < I l , s t d , VR 1 ; Y< I) get loaded into VR 1

VVADDx VRO , VR1 , VR1 ;VR1 gets VRO summed with VR 1 VSTx VR 1 , Y < I ) , s t d ;VR1 is stored bac k into Y< I)

INC ; incremen t I by vec tor length

IF < I < S I Zl GOTO LOOP ; Loop for al l va l ues of I MSYNC ; synchron ize wi th scalar

Figure 6 Core Loop of a BLAS 1 Routine Us ing Ve ctor- Ve ctor Op erations

Digital Te chnicaljournal Vo l. 2 No. 2, .vJrin� /'J'JO 2I VAX 6000 Model 400 System

35 store operations than BLAS I routines. Although the

0.. 30 300 by .300 array fits into the vector processor's ::J 0 25 I MB cache, not all the cache can be mapped by its w w 0.. 20 translation buffer. By changing the sequence in (f) a: 15 which this routine is called in the program, the data 12 access patterns can be altered to better use the u 10 w > vector unit's translation buffer. Thus, higher per­ 5 formance is obtained. 0 200 400 600 800 1000 1200 The percent of vectorization increases primarily DIMENSION OF PROBLEM SIZE because of the increase in the matrix size from 100 by 100 to 300 by 300. With a vector fraction of approximately 95 percent, Figure 7 shows the Figure 7 Linp ack Performance Graph, speedup impron.:ment in the 300 by 300 bench­ Double-precision BLASAl gorithms mark when using methods based on BLAS 2 rou­ tines. With a matrix vector algorithm, the 300 by Figure 8 details the core loop of a BLAS 2 routine. 300 benchmark yields speedups of between 10 and BLAS 2 routines make better use of cache and trans­ 12 over its scalar counterpart. lation buffers than the Bl.AS I routines do. Also, There are no set rules to follow when solving the BLAS 2 routines have a better ratio between vector largest problem size, a set of 1000 simultaneous arithmetics and vector load and stores. The larger equations. One potential tool fo r optimizing this matrix size increases the average vector length. Per­ benchmark is the LAPACK library developed by formance is improved by amortizing the time to Argonne National Laboratory, in conjunction with decode instructions across :t larger work load. the University of Ill inois Center for Supercomputing By removing one vector load and one vector store Research and Development (CSRD). The LAPACK from the innermost loop, the BLAS 2 routine has a library features equation solving algorithms that better ratio of arithmetic operations to load and will block the data array into sections that fit into

xGEMV - computes YC I) = YC I) + X(J)•M( I ,J)

where x = precision = F, D, G

MSYNC ;synchron ize wi th scalar !LOOP :

VLDx Y ( I) , s t d,VRO ; Y( I) is loaded as VRO JLOOP .

VLDx M ( I , J) , s t d,VR 1 ;VR1 gets columns of M( I , J ) VSMULx XCJ) ,VR1 ,VR2 ;VR2 gets the produc t of VR 1 ;and X(J) as a scalar

VVADDx VRO ,VR2 ,VRO ;VRO gets VRO summed wi th VR2 INC j IF (J < S!Z) GOTO JLOOP ; Loop for al l va lues of J

VSTx VRO,Y(!) ,std ;VRO gets stored into YC I)

INC IF ( I < S I Z) GOTO I LOOP ;Loop for al l va lues of I

MSYNC ; synchronize wi th scalar

Fig ure 8 Core Loop of a BLAS2 Routine Us ing Ma trix- Ve ctor Op erations

22 Vo l. 2 No. 2, Sp ring 1?90 Digital Te cbnicaljounlal Ve ctor Processing on the VAX vector 6000 Mo de/ 400

a given cache size. The LAPACK library calls not exhibits vector speedups greater than 35 for the 64 only the BLAS 1 and BLAS 2 routines but also a third by 64 matrix multiplication described above. level of BLAS, called matrix-matrix BLAS or the BLAS Although the overall performance of the 1000 by " level 3. 1000 size benchmark is less than a single 64 by 64 Figure 9 shows that a matrix-matrix multiply is at matrix multiplication, it does indicate the potential the heart of one BLAS 3 routine. The matrix multi­ performance when blocking is used. Improving the plication computation can be blocked for modern performance of this benchmark is most challenging architectures with cache memories. Highly efficient because the 1000 by 1000 matrix requires about vectorized matrix multipl ication routines have eight times the vector cache size of ll\-18. Further been written for the VA X vector architecture. For analysis is being conducted to determine the most example, a double precision 64 by 64 matrix efficient block size that would maximize the use of multiplication achieves over 85 percent of the peak BLAS 3 and remain within the size of the cache for a MFLOPs on the Model 400 system. given block of code. Performance can be further improved with other The vectorized fraction increases to approxi­ methods that increase the reuse of data while it is mately 98 percent for the 1000 by 1000 benchmark. contained in the vector registers. For example, loop The proportion of vector arithmetics relative to unrolling can be done until all the vector registers vector loads and stores is much improved for the have been fu lly utilized. Partial results can be BLAS 3s. Although the cache is exceeded , perfor­ formed within the innermost loop to minimize the mance more than doubles when using a method that loads and stores required. Because both rows and can block data based on the BLAS 3 algorithms. columns are traversed, the algorithm can be blocked Therefore, the VAXvector 6000 Model 400 pro­ for cache size. The VA Xvector Model 400 system cessor's performance for Linpack 1000 by 1000

xGEMM - computes Y< I ,J) = Y< I ,J) • X( I ,Kl*M(K,J)

where x = precision = F, D, G

MSYNC ; synchron ize wi th scalar IJLOOP :

VLDx Y( I ,J) ,std,VRO ;Y(1 :N,J) gets loaded into VRO KLOOP :

VLDx M(K,Jl ,std,VR1 ;K(1 :N,K) get loaded into VR 1 VSMULx X ( I , K) , VR 1 , VR 1 ;VR1 gets VR1 summed with

;X( I ,K) as a scalar

VVADDx VRO ,VR2 ,VRO ;VRO gets VRO summed with VR2

INC K ; increment K by vector length IF (K < SIZl GOTO KLDDP

RESET K ; reset I to SIZ VSTx VRO ,Y

INC J ; increment J by vector length RESET ;reset I to SIZ IF (J < SIZl GOTO IJLOOP

MSYNC ; synchronize wi th scalar

Figure 9 Core Loop of a BLAS3 Routine Us ing Ma trix-Matrix Op erations

Digital Technicaljo urnal Vo l. 2 No. 2, Spri ng 1990 23 VAX 6000 Model 400 System

obtained a vector speedup of approximately 25, as BIT shown in Figure 7. REVERSAL CORRECT REORDER RESULT 0 0 0 Signal Processing- Fas t Fourier 1 8 1 2 4 2 Tr ansfo rms 3 12 3 The Fourier transform decomposes a \vaveform, 4 2 4 or more general ly. a collection of data, into com­ 5 10 5 ponent sine and cosine representation. The discrete 6 6 6 Fourier transform (DFr) of a data set of length .v 7 14 7 performs the transformation fo llowing the strict 8 1 8 2 mathematical definition which requires U(N ) 9 9 9 10 5 10 floating point operations. Jn 1965, the fast Fourier 11 13 11 transform (FFT) was developed by Cooley and 12 3 Tukey. FFT reduced the number of operations 12 13 11 13 to O(N x LOG[Nl), which is a significant improve­ 14 7 14 ment for computational speed .H 15 15 15

As shown in Figure 10, the complex data in the LOG (N) STAGE 1 STAGE 2 STAGE 3 STAGE 4 bottom butterfly is multiplied in each stage by the STAGES appropriate weight. The result is then added to the top burterfly and subtracted from the bottom Figure 10 The Cooley-Tukey Butterfly Graph, butterfly. If the algorithm is left in this configura­ One-dimensional Fa st Fo un·er tion, it must use non-unity stride vectors, very short Transfo rm fo r N = Hi) vectors, or masked arithmetic operJtions to per­ form the very small buttertlies.

shown in Figure I I, the hit-reversal permutation is Op timized One-dimensional Fa st Fo urier performed as dose to the cenrer as rossible, when

Transforms the stage number = LOG (N)/2. To complete the The bit-reversal process that permutes the data to a algorithm, the butterfly distances now increase form that enables the Coolcy-Tukey algorithm to again. Second, this process entirely eliminates the work is also show n in Figure 10. When using need for short butterflies. vectors, a common approach to performing the bit­ Another optimization made to the FFT algorithm reversal reordering is to use vector gather or vector is the use of a table lookup method to access the sine <) scatter instructions. These instructions allow and cosine factors, which reduces repetitive calls to vector loads and stores to he performed using an the computatiorully inrensive trigonometric func­ index register. Vector loads and srorcs require a con­ tions. The initialization of this trigonometric table stant stride. However, vector gather and scarter has been fu lly vectorized hut shows only a modest operations allow the user to build a vector of offsets factor of 2 performance gain. ·ro build the table, a to support indirect addressing in vector mode. Both first-order linear recurrence loop is fo rmed that gather and scatter instructions are available with severely limits vector speedup. Because this calcula­ VAX vectors. tion is only done once, it becomes negligible for A vector implemenration of the FITalgorithm has multiple ca lls to the one-dimensional FI·Ts and fo r been developed that is well suited for the VA X vector all higher dimensional FFTs. The benchmark shown architecture. One optimization made to the algo­ in Figure 12 was looped and includes the calculation rithm involves moving the bit-reversal section of the of the trigonometric table performed once fo r each code to a place where the data permutation will FFT data length. benefit vector processing. By doing so, two goals Reusing data in the vector registers also saves are accomplished . First, the slower vector gather vector processing time. The VAX vector architecture operations are moved to the center of the algorithm provides 16 vector registers. If all 16 registers are such that the data will already be in the vector used careful ly, data can be reused by two successive cache. In Figure 10, the first FFT stage starts out with butterfly stages without storing and reloading the large butterfly distances. After each stage the butter­ data. With half the number of loads and stores, the fly distance is halved. For the optimized ,·ersion vector performance almost doubles.

24 Vo l. 2 No . l. .\jJrinJ< I'J'JO Digital Te chnicaljournal Ve ctor Processing on the VAX vector 6000 Model 400

BIT REVERSAL CORRECT 16 REORDER RESULT 14 (l. 0 0 0 0 5 12 1 1 8 1 tlj 10 2 2 4 2 (l. Cf) 8 3 3 12 3 a: � 4 4 2 4 &l 4 5 5 10 5 > 6 6 6 6 14 7 7 7 0 1000 2000 3000 4000 5000 LENGTH OF DATA SIZE 8 8 1 8 9 9 9 9 KEY: 1 0 10 5 10 & WITH REG-REUSE 11 11 1 3 11 0 NO REG-REUSE 12 12 3 12 13 1 3 11 13 14 14 14 7 Figure 12 One-dimensional Fast Fourier 15 15 15 15 Tra nsfo rm Performance Graph, STAGE 1 STAGE 2 STAGE 3 STAGE 4 (HAS Op timized Single-precision REG-REUSE) Complex Transfonns

Figure II Op timized Cooley-Tu keyButter fly Crapb, One-dimensional Fast The small two-dimensional FFT can achieve per­

Fo urier Tra nsfo rmfo r N = 16 fo rmance equal to that of the aggregate size one­ dimensional FFT by linearizing the data array. Figure 15 shows the trade-off between using the lin­ Op timized Tw o-dimensional Fast Fourier earized two-dimensional routine (for smaiJ N) and Transforms the transposed method (for large N) to maintain high performance across all data sizes. The optimized one-dimensional FFT can be used to The optimization of an algorithm that vectorizes compute multidimensional FFTs. Figure 13 shows poorly in its original form has been shown. The how an N by N two-dimensional FFT can be com­ resulting algorithm yields much higher perfor­ puted by performing None-dimensional column mance on the VAX vector 6000 Model 400 processor. FfTs and then N one-dimensional row FFTs. The High performance is due to the unique way the same routine can be called for column or row access algorithm touches contiguous memory locations FFTs by simply varying the stride parameter that is and its effort to maximize the vector length. The passed to the routine. (Note: In FORTRAN, the implementation described above always uses unity column access is unity stride and the row access has stride vecrors and always results in a vector length of a stride of the dimension of the array.) 64 fo r FFT lengths greater than 2K (2 x 1024). For improved performance on VAX vector systems, the use of a matrix transpose can dra­ matically increase the vector processing perfor­ N 1-0 FFTS N 1-D FFTS mance of two-dimensional FFTs for large values of N, i.e. , N > 256. The difference between unity stride and non-unity stride is the key performance issue. Figure 14 shows that a vecrorized matrix transpose can be performed after each set of N one-dimen­ sional FFTs. The computation will be equivalent to Figure 10 but with a matri.,x transpose: each one­ dimensional FFT will be column access which is COLUMN ROW unity stride. The overhead of transposing the matrix becomes negligible for large values of N Figure 13 Tw o-dimensional Fast Fourier When the value of N is relatively small, i.e., Tra nsfonns Us ing N Co lumn and N < 256, the two-dimensional FFT can be com­ N Row One-dimensional Fast 2 puted by calling a one-dimensional FFT of length N Fo urier Tra nsfonns

Digital Te chnicaljo urnal Vo l. 2 No. 2, Sp ring /'J

N 1-D FFTS TRANSPOSE Linear algebra and signal processing applications that utilize the various hardware features have demonstrated vector speedups between 3 and 35 over the scalar VAX 6000 Model 400 CPU times. With the integrated vector processing available on the VAXvector 6000 Model 400, the performance of computationally intensive applications may now approach that of supercomputers. COLUMN

N 1-D FFTS TRANSPOSE Acknowledgments The authors would like to acknowledge the tech­ nical contribution of the following people: Paul Brodeur, Doug Burns, Giao Dau, Bob Dickson, Darrel Donaldson, Hugh Durdan, Bill Gist, Ani! Jain, Chandrika Kamath, Dwight Manley, Mike Pline, John Redford, Sean Reilly, Tim Stanley, and Mike Uhler. COLUMN References

1. ]. Croll, " VA X 6000 Model 400 Multiprocessor Figure 14 Tw o-dimensional Fast Fo urier System Overview," Proceedings of COMPCON Tra nsforms Us ing a Ma trix '90 (IEEE, forthcoming 1990). Tra nspose between Each Set of N Co lumn One-dimensional Fa st 2. D. Bhandarkar and R. Brunner, "Vector Exten­ Fo urier Tra nsfo rms sions to the VAX Architecture," Proceedings of COMPCON '90 (IEEE, forthcoming 1990). 3. M. Gavrielov et al. "A 50 MHZ Uniformly Pipe­ 14 lined Floating-Point Arithmetic Processor," (l_ 12 ::::l Proceedings of ISSCC '89 (IEEE, February 1989): 5J10 w 50-5 1. 8 BJ 4 .]. Dongarra, Performance of Va rious Computers � 6 f- Us ing Standard Linear Equations Software in a � 4 FOR TRAN Environment, (Argonne, IL: Argonne > 2 National Laboratory,June 1989). 5. VA X FORTRA N Perfonnance Guide (Maynard: 0 200 400 600 800 1 000 1200 DIMENSION OF PROBLEM SIZE Digital Equipment Corporation, Order No.

KEY: AA-PB75A-TE, forthcoming 1990).

AI. TRANSPOSED 6. VA X 6000 Ve ctor Processor Programmer'sGu ide (Maynard: Digital Equipment Corporation, 0 LINEARIZED Order No. EK60VAA-PG, forthcoming 1990). 7. C. Bischof et al., LAPA CK, (Argonne, IL: Argonne Figure 15 Tw o-dimensional Fas t Fourier National Laboratory, April 1989). Tra nsfo rmPer formance Graph, 8. ]. Cooley and J. Tukey, "An Algorithm for Op timized Single-precision Machine Calculation of Complex Fourier Series," Complex Tra nsfo rms Ma thematical Computing, no. 19 (1965): 297-301. Summary 9. P. Swartztrauber, "Vectorizing the FFT's," Pa rallel Computing (Academic Press, 1982): The VAXvector 6000 Model 400 processor delivers 51-85. high performance for computationally intensive applications. The CMOS-based VAXvector 6000 is capable of operating at peak speeds of 90 :\!FLOPs single precision and 45 MFLOPs double precision.

26 Vo l. 2 No. 2, Spring 1990 Digital Tecbnicaljournal Patrick Sullivan Michael A. Callander, Sr. james R. Lundberg Rebecca L. Stamm lJ!illiam� ll�bill

The VAX 6000Mo del400 Scalar Processor Module

The VAX 6000 klodel 400 CPU module is the latest generation of the compatible VAX 6000fa mily of computers. The Model 400 isa single-board, OvtOS-basedCPU that significantly extends the performance of the VAX 6000 series. The system provides nearly 7 VAX unitsof performance {VUPs) in single-processor applications and up to 36 VUPs in six-processorsys tems. The Model 400 module isa plug-in replacementfo r the Mo del 200 a1ul Model 300 processors. Chip set and module designers of this new system cooperated closely to meet aggressive timing and perfor mance goals. Several enhancements were made to the cache and bus interface units to improve multipro­ cessor petformance. A vector interfa ce was includedfo r connection to a companion vector processor module. Signal integrity was an important consideration fo r both chip and module design.

1 When the Midrange Systems Business (MSB) Group instructions: The vector processor module, which began to develop the VAX 6000 series, the Semi­ was developed by MSB, can provide a significant conductor Engineering Group (SEG) had started performance boost for certain classes of vector development of a new CMOS-based CPU chip set. 1 problems. The project's goals were the fo llowing: The first Model 400 systems were shipped in July 1989. This delivery was made just 15 months • Achieve CPU performance at least equal to the after the introduction of the initial 6000 Model 200 '5.'5 YUPs of the VAX 8700 system series.

• Support a 28-nanosecond (ns) module cycle time

At that time, the VAX 8700 system was the fastest Design Challenges VAX available and used emiuer-coupled logic (ECL) The aggressive 28-ns cycle time goal for the module to achieve a 45-ns cycle time. SEG designers design required a tight coupling of the chip set and believed a system im plemented in CMOS technology module design efforts. With this short 28-ns cycle could meet or exceed that performance level and at time goal, the module interconnect had to be a much lower manufacwring cost . With reference treated as transmission lines. Consequently, signal to the second goal. a 28-ns cycle time would take integrity considerations were critical to design suc­ advantage of chip sets that could run faster than the cess and would impact all areas of the design ­ projected 40 ns cycle time. chip, package, and etch board . The approach taken In discussions with MSB, we realized our module to address signal integrity is described in more project could be modified to include an XMI inter­ detail in the Signal Integrity section. fa ce and, therefore, become another member of the The performance goals also dictated a change in 2 VAX 6000 series We then agreed to undertake a the VAX 6000 Model 400 data and address line (DAL) joint development effort between MSB and SEG , pin bus. The older, less complex multiplexed 32-bit which resulted in the VA){ 6000 Model 400 scalar bus would have to be separated into a separate CPU. Development of this scalar processor module 27-bit address bus (A-bus) and a 64-bit data bus is the fo cus of this paper. (D-bus). That decision in turn resulted in the need Halfway through the project, support for a vector for high pin count packages (224 pins) and the asso­ processor module was included because the VAX ciated signal integrity challenges of dealing with as architecture was extended to include vector many as 90 output drivers switching simulta-

Digital Technicaljournal Vo l. 2 No. 2, Sp ring 1990 27 VA X 6000 Model 400 System

neously. while driving transmission lines with a low controller chip (VC), a clock distribution chip effective impedance of 60 ohms. (CLK), and a system support chip (RSSC). The REXMI Two new technologies were needed to meet interface consists of three chips: two copies of the these challenges. First, new chip packages had to be data path chip (XDP), and a controller/interface developed to supply the increased number of signal chip (XCA). pins. The packages use multilayer ceramic sub­ A block diagram of the XRP module is shown strates to provide signal planes, as well as separate in Figure 1. The module consists of four major power and ground planes for both internal logic sections: and pad ring power. The packages have 224 pins on • CPU and F-chip floating point accelerator a 25-mil pitch and are surface mounted for better module routability. Second. the 25-mil package pin • VC chip and backup cache RAMarray pitch required the use of a finer geometry etch • The RSSC system support chip board: 13-mil module vias and 10-mil routing pitch. This specification required the physical design team • The XMI interface, including REXMI to initiate a very close development and qual­ ' The REX520 is the first VAX microprocessor chip ification effort with the etch board vendors. to implement a fully pipelined microarchitecture. The F-chip has (-,4-bit-wide data paths and a pipe­ Major Module Subsections lined execution unit. These chips cooperate to The Model 400 module, or XR P, is a single-board implement the base instruction group of the VAX VAX CPU implemented with the Model 400 chip set architecture. The two chips represent the CPU sec­ and a REXMI interface to the XMI bus.2 The chip set tion of the XRP module, and both chips connect includes five chips: a CPU chip (RFX520), a floating directly to the DAL. Both chips also have a private point accelerator chip (F-chip), a backup cache 8-bit bus for control and status information.

SYSTEM AND AUXILIARY CONSOLE I REX520 CONTROL DATA ROM CPU WITH F-CHIP RSSC AND CACHE �EEPROM ADORE

DAL � I CLOCK I DATA DACSS l I BACKUP CONTROL I-BUS CACHE VC CHIP REXMI RAMS

VECTOR INTERFACE XCII BUS (CABLE) ,--+------t ---, I I XMI I X CLOCK XLATCHES (7) : CORN ER I I L--� ------� -- -_j XMI

Figure 1 XRP Mo dule Block Diagram

2H Vo l. 2 No. 2, .\j!rinp, /1.)')0 Digital Technicaljournal The VA X 6000 Model 400 Scalar Processor Module

The REX520 provides the hardware and there are eight chips that require all four phases, microcode necessary to parse specifiers, execute and each set of clock outputs drives fou r chip loads. instructions, handle exceptions, and otherwise Discrete Schottky diodes are used at the receiving implement the VAX architecture. The REX520 ends of all the clock lines for termination. This also provides fu ll VAX memory management, a design achieved a clock skew of less than 0.5 ns at 4-gigabyte (GB) virtual address space, and support the receiving chips. for 512 megabytes (MB) of physical memory. The The DAL pin bus is a fully synchronous, hand­ chip contains a (">4-eorry, fully associative transla­ shake protocol bus. The DAL is nonpended in that it tion buffer. Both process and system-space map­ has only one transaction outstanding at a time, and pings are stored in the buffer. The chip also includes it is also nonmultiplexed with separate address and a 2-kilobyte (KB), direct-mapped instruction and data lines. The design consists of a 27-bit address data cache (primary cache) with an 8-byte block bus (A-bus), a 64-bit data bus (D-bus), and associated and fill size. control signals. The OAL runs at a 28-ns cycle time, The F-chip enhances the computation phase of synchronous with the Model 400 chip set. The tim­ floating point and certain integer instructions in ing is controlled by 4 overlapping 1 4-ns-wide clock conjunction with the REX 520 chip. The F-chip exe­ phases, which are separated from each other by cutes the F-, D-, and <�-format floating point instruc­ 7 ns. The P-chip, F-chip, vc chip, backup cache tions, as well as the long word variants of integer RAMs, and the REXMI communicate through the multiply. The chip receives operands from the DA L. The P-chip is the default bus master and REX520, computes the result, and passes the result contains the arbiter for the bus. The P-chip uses the and status back to the HEX520. The REX520 chip OAL to initiate reads and writes to the backup cache completes the instruction. and main memory through the REXMI. The backup The V\. chip implements a 2K tag store and neces­ cache and the REXMJ send read data to the P-chip. sary control for a 128KB backup cache. The cache The bus master notifies other bus nodes of the uses 15-ns, 16K-by-4 static random-access mem­ start of a transaction. The receiving node can termi­ ories (SIV\Ms) on the module. The vc chip also nate a transaction in one of three ways. It can includes a copy of the primary cache tag store, the indicate a successful completion, indicate an error, invalidate bus (1-bus), and an interface to the vector or request that the transaction be retried. Once the interface bus (VI B). The 1-bus connects to the REXMI receiver indicates one of the transaction termina­ and boosts performance by doing invalidate filter­ tion signals, the bus master deactivates. ing. The cache design is discussed in more detail in One DAL transaction takes a minimum of three the Model 400 Caches section. cycles. The maximum transaction time depends The RSSC chip is a modification of the sse chip upon the system response to read and write used in a number of previous \.PUs. It incorporates requests. The RSSC includes a bus timeout mecha­ the common core of functions that support the chip nism that prevents the system from hanging. set in the XMI system environment. The RSSC chip is Most DAL signals are transferred in three phases, discussed in detail in the Model 400 System Support although some control signals are transferred Chip section. in two. The XMl interface consists of the standard XML­ As noted earlier, vector interface capability was corner components and the REXMI chip set. The added after the VAX architecture was extended to HEXMI chip set interfaces the DA L to the XCI, which support vector operations. The vector functionality is the user side of the XMI corner. The XMI interface was added to the Y\. chip because that chip could is discussed fu rther in the XM I interface and REXM I accommodate the extra pins required for the inter­ section. fa ce, i.e., 224 pins versus 164. DAL operations are synchronous on the XRP The major units of the XRP module are described module. Therefore, a very low skew clock distri­ in the following sections. bution system was required . The clock chip and controlled, equal-length clock lines to each of the Mode/ 400 Caches chips provide this distribution. The clock chip The XRP module incorporates a two-level cache receives a 1 43-megahertz (,\1Hz) oscillator input and hierarchy that maximizes CPU performance. The provides two sets of synchronous four-phase first-level cache is the primary cache, called the clocks. Each clock phase driver output can drive P-cache, which is contained entirely in the P-chip. four 50-ohm lines in parallel. In the XRP design, The second-level is the backup cache or B-cache,

Digital Tec bnicaljournal Vo l. .! No. .!, Sp rin!( 1<)<)0 29 VAX 6000 Model 400 System

and it consists of rhe vc chip and 24 15-ns, 16K-by-4 static HA:-.-ts. The vc chip contains the rag store and F-CHIP the control logic for the B-cache. The SRAMs store the cache data. The P-cache and the B-cachc contain 24 16K-BY-4 SRAMs BC_CS_L<7:0> both instructions and data. The P-cache is a 2KB cache and is direct-mapped, BC_WE_L with an 8-byre block size. The cache is read­

r{]·.. allocate, no-write-allocate, and write-through . In write-through, writes that hit in the cache are A-BUS simultaneously written to the cache and to main D-By US I memory. The P-cache can perform a new access BC_HIT_L I I once every cycle. P-C HIP VC CHIP Each P-cache rag entry includes an 18-bit tag, one MEM_RO_L valid bit, and one parity bit. There arc 256 tags cor­ responding to 256 data blocks. Each data block con­ � BC_MISS_L tains 8 data bytes and 8 parity bits, with parity was 1-- implemented on each byte. Parity permits a byte REXMI 1----- write to be done without the need to reca lculate XMI parity across the other bytes. This process avoids the performance penalty that occurs when all byres are not written at once, as in the read-modify-write Figure 2 Backup Cache Access Diaf!,ram process. The B-cacl1e is a 128KB cache and is direct­ mapped, with an 8-byte access size. It has a 16-byte REXM! to send the read request to memory . At the sub-block size and a 64-byre block size. The cache is same time, the VC chip asserts the cache RAM wrire also read-allocate, no-write-allocate, and write­ enable (BC_ WLL) and waits for the data to return through. There arc 2048 entries in the B-cache tag from memory . When the REX1vl ! returns the data, it store. Each entry comains a 12-bit tag, 4 valid bits, as serts a transaction termination signal. This signal and a parity bit. informs the P-chip, the VC chip, and the F-chip that data is ready on the bus. All three chips receive the Backup Cache Hit data simultaneously, and the data is written into the Designers optimized access time to the backup cache RAMs. The transaction ends when the P-chip cache by connecting the cache RAMs directly ro the accepts the data. DA L. The chips, the bus, and the specialized hit sig­ nals are shown in Figure 2. XMI In terfa ce and REXMI When the P-chip issues a read on the DA L, it The XRP module accesses all memory and 1/0 drives the address on the A-bus to the vc chip and devices over the XMI bus. The REXM! interfaces the cache RA Ms. The P-chip asserts memory read chip set to the XM! bus by means of two data path (ME.\LR D_L). The VC chip uses MEivLRO_L to chips (XDPs) and one control/address chip (XCA). enable the assertion of the cache RAM chip select Each XDP is responsible for 32 bits of the HEXMI 's lines (BC_cs_L< 7:0> ). This process accomplishes 64-bit data pa th. The XCA is responsible for the two things. The chip select lines are valid by the address data path, DAL control logic, Xl\1! control time the P-chip has driven the A-bus to a valid state. logic, and control of the two XDPs. Both chips are Further, the total time to perform the read from the implemented in CMOS-2 standard cells. RAMs is minimized . The primary tasks of the XMI interface are to The RAM access begins in parallel with the tag • Forward REX520 references to the Xl\11 store access. If the vc chip fi nds a match in the tag store, it asserts BC_H IT _L to notify the P- chip. By • Implement a write buffe r that reduces traffic to the time BC_H!T_L is asserted, the data from the main memory cache RAMs is valid on the 0-bus , and the P-chip • Support control of cache fills and cache then accepts the data. invalidates If the vc chip does not find a match in the tag store, it asserts BC-MISS_L. This signal notifies the • Support Xl\1! interrupt logic

30 Vu l. 2 No. 2. 5jJ ring 1<)<)1! Digital Te chnicaljo urnal The VAX 6000 Model 400 Scalar Processor Module

CMOS-2 Standard Cell Te chnology • XMI l/0 space read or write

Custom chips were not used to implement the XCA • Interlock read or unlock write and XDP chips. Instead, two alternatives were investigated . These alternatives were to use either • lnterprocessor interlock gate arrays or standard cells from external vendors, • XMI read to an octaword location that includes or internally built CMOS-2 standard cells. data contained in the write buffe r A number of external vendors offered products with the density and internal gate speeds needed to • In response to a clear write buffer conunand implement the REXMI chips. However, none offe red Combining P-chip writes reduces the number of the performance and flexibility needed to interface write transactions needed by over 40 percent. This DAL. to the Therefore, we chose the CMOS-2 stan­ reduction certainly improves single-processor per­ DA L. dard cells, which could interface to the fo rmance. However, the greatest improvement is in We added a number of new cells to the standard multiprocessor performance where the XMI band­ cell library : several versions of a high-performance width required by each CPU is reduced. 3.3-volt output driver, input latch with the required 0-ns set-up time, and a low skew internal clock Cache Coherence buffer capable of driving over 30 picofarads of capacitance. These new cells were largely based on The XRP module allows data to be shared among circuits designed for the custom CMOS-2 CPU chips, mu ltiple processors. The XRP design assures that which met the high performance goals set by the the most recently written copy of any data is pro­ Model 400 program. vided to a running process. This process is called cache coherence. Performance Considerations In multiprocessing systems, coherence can be Performance bottlenecks in high-speed computer ensured in two ways. Cached copies of data that systems most often occur in the path ro and from have been written can be inva lidated, or each cache main memory . The XRP's two cache levels greatly can be updated with more recent data. Since it is reduce the number of reads required. However, simpler to invalidate than to update, an invalidation since both caches are write-through, all writes must scheme was implementedon the XRP module. be fo rwarded to main memory by the REX1'vll. To Every XRP processor write is sent to XM I mem­ improve performance. the REXMI implements a ory . When an XRP module broadcasts a write on the write buffer with fo ur octaword (16-byte) entries. XMI, the command, address, and data are captured The write buffer improves performance in two in memory . Other XRP modules capture the com­ ways. First, it decouples the REX520 write rate from mand and address to invalidate any valid B-cache or the slower XMI write rate. The REX520 can transfer P-cache entries that correspond to the address. 8 bytes of write data to the REXMI every three We could have opted to broadcast all XMI writes cycles. This data is loaded into the write buffer and as invalidates on the DA L. However, this method later transferred to main memory at XMI speeds. would have greatly increased the DAL traffic and Second, the write buffer combines multiple would have reduced the processor's performance. REX520 writes into a single XMI write. The most To increase the performance of multiprocessor efficient write transaction is a fu ll octaword write. systems, the vc chip provides a low-overhead The REXMI always tries to combine multiple invalidate mechanism through the J-bus. The REXMI REX520 writes into full octaword writes. The can determine through the !-bus if data is currently REX.MI loads write data into a write buffer entry. cached. The REXMI sends an invalidate on the DAL The write data is held umil either a new octaword only if the data is cached. address is received or a purge write buffe r condi­ The vc chip maintains a duplicate copy of the tion occurs. When the REXMI receives a write to a primary cache tag store. The chip accesses the copy new octaword address, the current write buffer in parallel with the backup cache tag store entry is marked fu ll. A new write buffer entry is whenever invalidate addresses are placed on the then opened with the new octaword address. Write 1-bus. When an !-bus address match is detected in buffer entries are transmitted as they are marked either tag store, the chip notifies the REXMl of the fu ll. To guarantee that the write buffer data is writ­ hit. The REXM! broadcasts the invalidate address ten to main memory in a timely manner, the write onto the DA L. The invalidate address notification is buffer is flushedbefore the following conditions: recognized by both the VC chip and the REX520.

Digital Te chnicaljo urnal Vo l. 2 No. 2, Sp ring 19<)0 31 VAX 6000 Model 400 System

The !'-chip inv:.tlid;.tt<:s th<: !'-cache entry. and the vc vector module through the vector intt:rface. Ratht:r, . chip invalidates the entries in both the B-cache and the vector module rt:ft:rt:nccs the , t:ctor data in the copy of the I'-cache tag store. directly over the X:\11 bus. To perform this [XOC<:�s. If the 1'-ctChe were a subset of the B-cache, we the vector module implemt:nts rh<:fu ll \'A X memory would not han: had to implement the P-cache tag management ;�. rchitectur<:. stor<: copy on the VC chip. In this design. any The opcodc and operand information is trans­ address that hit in th<: B-each<: would ha\'e been sent ferred to the vector module through tht: \'ector to the DA L. hut this process wou ld h:t\T caused an interface The RF.X')20uses internal processor rt:gis­ in\'alidate request to be sent to the P-each<: as well. ters (II'Rs) for read and write operations to transfer However. \V ith this approach, the vc chip would the information ro the vector interface. IPRs ar<: also have ro send im·alidares ro the P-cache whenever a used to read and write register information stored R-cache block was displaced . The transmission to both in the interface hardware and on the \'ector the !'-cache wou ld have been required because the module. 1'-cach<.: and B-cachc block sizes ar<: diffc rt:nt. It was The vector unit <:x<:cutt:s most vector instructions simpler to impl<.:mem the copy of the P-cache tags in parallel with the sc:�lar Cl'l l execlltion of subs<:­ and irs control . quent instructions. For som<: v<:cror instructions, Anorhn alternative would have lx:en ro use a particularly memory transfer instruct ions, th<: duplicate copy of the B-cache and I'- cache rag stores REX')20 microcode reads a vector unit register at impll'mented in cxtt:rnal logic on the module. The the end of the instruction. At this point, the REX ')20 RI'X:VII then wou ld havt: interrogated the tag stores stalls until the vector unit r<:sponds and eff<:ctively directly. This alternative was rejected because it fo rces synchronous <:xecution of instructions. would have used too much space on the module. The hardware implementation of the vector interface consists of two pieces. The first is an int<:r­ The Mode/ 400 Sy stem Support Ch ip face to the Model 400 DA L. This interfac<: allows A \'ariety of support logic is required to complete microcode-generated orcode, orcrand, and regis­ the fu nctionality of a VAX C l'l i module. The Model ter data to be received from ami driven to the 400 S\ Stt:m support chip (RSSC) integrates the com­ f{ f.X')20. The second pice<.: is the vector interface mon core of funnions ncn:ssary to support the VAX bus (VJB) that connects the vector and scalar mod­ system em·ironmt:m onto a single chip. It providt:s uks together over :1 cable. This interface and th<: th<: operat ing system with the hardware primitives connection to the VIB arc impkmented in the VC nt:nlu.J to implement the hoot and console rou­ chip on the scalar side and in the VECTL chip on the tines. Tht: chip also pro\'idt:s st:n:ral nect:ssary tim­ vector side. ing mechanisms. Th<: RSSC is designed to interface Thc vector module clock sysr<:m is asynchronous 1 direct ly with the Modcl ·iOO chips. It is based on the to the scalar module. The V!B runs synchronously systt:m support chip (SSe) that v.·as design<:d for use with respect to the \'ector module clock system for ' with tht: earl ier C.'

Vo l. 1. No .!. .\firing I')<)(I Digital Te clmicaljo urnal The VAX 6000 Mode/ 400 Scalar Processor Module

The XRP signal integrity methodology was rers that rend to suppress ringing generally increase developed early in the design process. All simula­ settling times and vice versa. ( tions were made in the SPICE program ' Ideally, the The overall design had ro satisfy the criteria for entire module wou ld have been modeled. However, achieving specified settling times and simulta­ the complexity of the module environment and neously reduce ringing to an acceptable level. computer resource constraints precluded that Several modeling parameters were simulated to approach. Instead, only one driver was used in the worst-case starus. These parameters are listed in simulations. The effects of intersignal coupling and Table 1. The impact of these parameters can be fur­ parallel switching were included by appropriately ther understood by referring to Figure 3. This figure scaling interconnect and package impedances. Even represents the same simulation under worst-case with this simplified method, an estimated 1500 CPU slow and fast conditions. hours on a VAX H700 system were required to per­ form rhe XRP signal integrity simulations. Ta ble 1 Worst-case Modeling Parameters

Simulations and Models Simulation A typical simu lation included all circuits and any Worst-case Worst-case accompanying parasi!ICS from the external Parameter Settling Noise enabling clock edge through to the receivers. All CMOS process corner Slow Fast simulations modeled a three-stage clock receiver, CMOS junction Maximum M in imum on-chip resistance capacitance delay, the output temperature driver, package impedance models for signals and Internal power supply Minimum Maximum internal/external power, receivers, and module Output series resistor Maximum Minimum interconnect. Number of drivers Maximum Maximum The clock receiver model generated the on-chip switching clock phases from rhe low-skew double phases Module interconnect Maximum Maximum received from the clock chip. The model also con­ length tained an internal switching model. It switched Interconnect Minimum Maximum several capacitances that were similar to the chip effective ZO internal loading. Over a cycle, this modeling met the chip's maximum internal power. The switching model, together with the chip package impedance 6.0 model for internal power, produced worst-case noise on rhe chip internal power rails and substrate. 5.0 The module interconnect was modeled as trans­ mission lines. The impedance of a transmission line 4.0 accounted for the coupling of adjacent signals switching. For example, if adjacent lines were (f) 3.0 VIH switching in the same direction, the effective f- _J 0 impedance would be increased; whereas switching > in the opposite direction would lower the effective 2.0 impedance. WORST-CASE SETTLING TIME 1.0 Finally, the signal and external power impedance (SLOW) models were scaled to reflect worst-case coupling, and size and number of drivers switching in paral­ lel. Collectively, these models accounted for the - 1 . 0 effects of an entire hus switching. 45":---: 50'-::- -�- 55 �60 --!:- 65 -::'::-----'-70 75- --::':-----:80 85'-__J 90 NANOSECONDS

Wo rst-case Conditions KEY There are two distinct subsets of the simulation VIH - INPUT VOLTAGE HIGH VIL - INPUT VOLTAGE LOW models. One set simulates worst-case signal noise or ringing. The other simulates worst-case, i.e., slow­ est, settling times. These subsets are, for the most Figure 3 Simulation under Worst-case parr, mutually exclusive. In other words, parame- Slow and Fast Co nditions

Digital Te cbnica/journal Vo l .! No . .!, .�fJrin[!, f

Signa/ In tegrity Constraints 7.0 6.0 The signal integrity analysis had a direct impact on 5.0 the design of the scalar CPU at the chip. package, 4.0 and module levels. The impact at the chip kvel was � � 3 01---+----1-��----�--��--� seen in three areas. The first was the design of the > 2.0 I/O buffers. The second was the determination of 1.0 the optimal series output resiswrs (on-chip) for different drivers. The third was the segregation of 0.01--+--+-- the external power buses to eliminate noise at 43 200 quiescent drivers.

At the package level, two design decisions were KEY:

made. The external power reference planes were VIH - INPUT VOLTAGE HIGH split to prevent coupling of asynchronous buses. VIL - INPUT VOLTAG E LOW Power was delivered to the lower bonding tier to reduce power supply loop inductances. At the mod­ Figure 4 Bench Results of the REX520 ule level, several specifications were developed, Driving the D-bus which included the fo llowing:

• All discrete terminations and their placement 6.0

• The maximum allowable etch length per signal 5.0 and order of connection 4.0 • The maximum allowable package dispersion � 3.0 etch length t:::; VIH � 20 • The module etch technology tO best reduce coupling 1.0 0.0 Cross-organizational cooperation was essential to the successful production of these design levels. -1.0 L_�'------'------'-----'------'----'----'----'---' 0 0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 For example, a joint review of the chip packaging NANOSECONDS technology in the design stage ensured that the KEY design met stringent signal integrity requirements. VIH - INPUT VOLTAGE HIGH Wo rking closely with the module PC designer VIL - INPUT VOLTAGE LOW ensured optimal component placement and inter­ connect routing. Cooperative efforts such as these Figure 5 Simulation Results of the REX520 helped ensure the reliability and performance of the Driving the D-bus design .

Results Signal integrity had a major impact on the design, Two correlations of bench versus simulation results performance, and reliability of the Model 400 scalar were used for verification purposes. The first corre­ CPU. All critical signals were carefully simulated lation was run on a test module early in the design and analyzed prior to chip and module implementa­ phase. The diffe rence between simulated and bench tion. The Model 400 is currently the fastest VA X results averaged 4.4 percent. The correlation on the system in production . The Model 400 also has the final XRP implementation presented an average dis­ distinction of needing no revisions, from prototype crepancy of 2 percent, which is less than 200 pico­ tO final product, for signal integrity purposes. seconds (ps). These results strongly validated the modeling methodology. Figures 4 and 5 show the Performance results of the bench and simulation of the REX520 The VAX 6000 Model 400 system, based on the XRP driving the 0-bus. The waveforms exemplify the module, represents the highest performance VA X excellent correlation obtained on the fi nal module. system yet released. Performance ranges from (Note: The results shown in these figures represent nearly 7 VUPs in a single-processor system to nominal rather than worst-case conditions.) 36 VUPs in a six-processor system. The VAX 6000

34 Vo l. 2 No. 2, Spring 1990 Digital Te chnicaljountal The VAX 6000 Model 400 Scalar ProcessorMo dule

Model 410 system provides roughly twice the ule development effort: Dick Bagley, Jean Basmaji, performance of the previous generation CMOS­ Rick Calcagni, Glenn Garvey, John Guildoo, Andy based system, the VAX 6000 Model 310. It also Ladd, Bill LaPrade, Bob McCarty, Dina McKinney, represents a 28 percent performance gain over the Curt Miller, Roland Ouellette, Dave Rouille, John previous generation EC L system, the VAX 8700 sys­ Sweeney, Mike Uhler, Mike Warren, and John tem. (Note: One VUP is equal to the performance of Wilson. the VAX- 11/780 system.) Table 2 compares the performance of the References VAX 6000 Model 410 system to other VAX systems. 1. H. Durdan et al., "An Overview of the VAX 6000 Model 400 Chip Set," Digital Te chnical Ta ble 2 Single-processor Performance jo urnal, vol. 2, no. 2 (Spring 1990, this issue): Comparison 36-51.

System VUPs 2. B. Allison, "An Overview of the VAX 6200 Family of Systems," Digital Te chnical jo urnal, vol. 1, 1.00 VAX 11/780 no. 7 (August 1988): 10- 18. VA X 6000 Model 210 2.53 3. D. Slater et al., "Vector Processing on the VA X 6000 Model 310 3.38 VAXvector 6000 Model 400," Digital Te chnical VAX 8700 5.20 jo urnal, vol. 2, no. 2 (Spring 1990, this issue): VAX 6000 Model 410 6.64 11-26. 4. ]. Bartoszek et a\., "VAX6000 Model 400 Physical Technology," Digital Te cbnicaljournal, vol. 2, Up to six XRP modules may be configured in a no. 2 (Spring 1990, this issue): 52-63 . single VAX 6000 Model 400 system. These modules 5. ]. Winston, "The System Support Chip, a Multi­ deliver up to 36 VUPs. Ideally, a linear performance function Chip for CVAX Systems," Digital increase is expected as more processors are added Te chnical jo urnal, vol. 1, no. 7 (August 1988): to a multiprocessor system. However, a number of 121-128. factors limit the overall system performance, such 6. SPICE is a general purpose circuit simulator as contention for bus bandwidth, increased mem­ program developed by Lawrence Nagel and Ellis ory latency, and additional software overhead. Cohen of the Department of Electrical Engi­ A great deal of effort was expended in the neering and Computer Sciences, University of Model 400 design to limit the amount of perfor­ California, Berkeley. mance lost in a multiprocessor system. A number of multistream benchmarks were assembled. These benchmarks were run on the VAX 6000 Model 400 system to efficiently measure performance by simulating real work environments across a number of areas. The results for each system are shown in Ta ble 3.

Ta ble 3 Multiprocessor Performance Comparison

Work Area 1 CPU 2CPUs 4CPUs 6CPUs

Engineering 1.00 1.92 3.71 5.31 Scientific 1.00 1.92 3.74 4.78 Commercial 1.00 1.98 3.80 5.25

Acknowledgments The authors would like to acknowledge the fol low­ ing people for their contributions to the XRP mod-

Digital Te cbnical]Otlrnal Vo l. 2 No. 2, Sp ring 1990 35 W. Hugh Durdan Wi lliam]. Bowhill johnF. Brown William V. He rrick Richard C. Marcello Sridhar Samudrala G. Michael Uhler Nicholas Wa de An Overview of the VAX 6000 Model 400 Chip Set

The VA X 6000 Mo del 400 processor is a CMOS implementation of Digital's V,4X architecture, offe ring an average of seven times the performance of the VAX -111780 processor at a cy cle time of 28 ns. The processor comprisesfi ve custom chips imp le­ mented in Digital'spro prietary CMOS-I and CMOS-2 semiconductor processes. The ch ip set design incorporates the best fe atures of the previous VA X 8700 and VLSI VAX designs and in addition implements new performancefe atures. Among these are a larger translation buffer and primary cache, a de-multiplexe d 27-bit address and 64-bit data bus, and a tigh t�y coupled 128KBbackup cache. The fi ve chips, which are designed fo r multiprocessing environments, are the REX520 CPU, the floating point accelerator, the vc vector and cache controller chip, the RSSC systen1 supp ort ch tjJ, and the CLKclock chip.

In troduction The F-chip is a companion chip to the REX520. The VAX 6000 Model 400 chip set consists of five It accelerates the computation phase of the VAX F-, custom VLSI (very large scale integration) chips D-, and G-format floating point instructions, and implemented in Digital's CMOS- I and CMOS-2 pro­ the longword-length integer multiply instruction. cesses. The five chips are the CPU chip {REX520), The F-chip receives control information from the the floating point accelerator chip {F-chip), the vec­ REX520, operands from either the REX520, the tor and cache controller chip (VC chip), the system backup cache, or memory, and returns status and support chip (RSSC), and the clock chip (CLK chip). results to the REX520. These chips are designed to be used in multiple­ The vc chip provides the tag store and necessary system environments, of which the VAX 6000 control for a 128KB backup cache that is imple­ ' Model 400 series is one such example. mented in external random access memory (RAM) The REX520 chip is a pipelined VAX CPU that on the CPU module. The chip implements a dupli­ implements the VAX base instruction group and cate tag store for the REX520 primary cache and an 2 controls the operation of all other chips The interface through which the system environment design for the REX520 is an evolution of both the can determine if data is cached at a particular previous generation CMOS processor chip and the address. Through a vector interface bus (V I B), the VAX 8800 processor:H The REX520 is logically vc chip also provides the control and status inter­ divided into four sections: the 1-box, E-box, M-box, face between the CPU module and an optional and bus interface unit (BtU). The 1-box fetches and vector module. decodes VAX instructions, and provides this infor­ The RSSC chip incorporates the common core mation to the E-box. The microcode-controlled of fu nctions required to support the VAX 6000 E-box parses instruction specifiers, executes VAX Model 400 chip set in a system environment. RSSC instructions, and processes interrupts and excep­ supports read-only memory (ROM) and electri­ tions. The M-box contains a 64-entry, fully associa­ cally erasable programmable read-only memory tive translation buffer, and a 2-kilobyte (KB)on-chip (EEPROM), and contains 1 KB of battery hacked-up primary cache. The BIU acts as the interface RAM, the console terminal universal asynchronous between the REX520 and the interchip environ­ receiver/transmitters (UARTs), interval and pro­ ment described below. grammable timers, and a time-of-year clock.

36 Vo l. 2 No 2, Sp ring 1990 Digital Te cbnicaljournal An Overview of the VAX 6000 Mode/ 400 Ch ip Set

The CLK chip receives a 143-megahertz (MHz) transfers into and out of the vector register file are osciUator input. The chip provides four low-skew performed by the vector processor through a direct clock phases to the chip set and the module port to the memory subsystem. environment. Interrupt requests are received by the REX520 from the module environment on nine dedicated Interchip and Module Environment interrupt request lines. Five of these Jines are for The chips in the Model 400 chip set are connected requests for special purpose interrupts, such as together and to the rest of the module environment interval timer requests. a5 shown in Figure 1. Unlike the previous generation CMOS processor PerformanceGoals and Design design," the Model 400 chip set implements sepa­ Considerations rate 27 -bit address and 64-bit data buses. Backup The goal of the design was to meet or exceed the cache reads, writes, and fills are done with the performance of the vAX 8700 processor. To meet address on the A-bus. The data is driven to or this goal, a 40-nanosecond (ns) cycle time was received from the parity protected D-bus. Control required under worst-case conditions. As the design for the backup cache RAMs is provided by the vc progressed, it became clear that the CMOS-2 pro­ chip on dedicated control lines. cess, in which most of the chip set is implemented, Memory, 110 space, and external processor regis­ offered enough performance to allow the target ters are accessed by driving the address to the A-bus cycle time to be decreased. and the data to the D-bus. ROM and EEPROM control As a result, the cycle time was reduced from 40 ns is provided by the RSSC on dedicated control Jines. to 28 ns. At this cycle time, the VAX 6000 Model 4 10 F-chip operands are driven to the D-bus from the system runs at nearly 7 VUPs in most applica­ REX520, backup cache, or memory to the F-chip. tions, or roughly 1.3 times the performance of the Results are driven back to the REX520 on the 0-bus. VAX 8700 processor. (The acronym vurs stands for Control and status information for these transfers VAX units of performance; 1 VUP equals the perfor­ is performed on a private bus between the REX520 mance of a VAX-11/780 system.) The performance and F-chip. of the system may be further expanded by adding Vector instructions are decoded by the REX520. processors to the system, to a maximum of 35 VUPs The opcode and instruction operand information is in the VAX 6000 Model 460. transferred from the REX520 to the vc chip. From To achieve the performance goals, a number of there, the information is transferred to the V!B cable microarchitectural trade-offs were made relative to and then to the optional vector unit. Note that only the VAX 8700 and previous VLSI VAX designs. In status and scalar operands (contained within the essence, the best features ofeach were incorporated instruction stream) are transferred on the VIB. Data into the design of the VAX 6000 Model 400 chip set.

A-BUS

D-B US

INTERRUPT REQUESTS

CLOCKS

143 MHZ OSCILLATOR

I-BUS

VIB CABLE

Figure I VAX 6000 Mode/ 400 lnterchipEnvironment

Digital Te chnical journal Vo l. 2 No . 2, Sp ring 1990 37 VAX6000 Model 400 System

For example, the REX520 chip and the F-chip are By developing the technology in-house, Digital fu lly pipelined designs that make shorter cycle has gained competitive performance and time-to­ times possible and improve performance. The market advantages for its low-end and midrange REX520 design also includes read-and-run and products. A summary of the transistOr and pin­ write-and-run features. These features decouple count statistics for the chip set is shown in Table I . subsequent execution in the CPU pipeline from the CMOS-2 is an N-well, P-epitaxial, double-metal, completion of references to cache and memory. As 5-volt process with 1.5 micron minimum feature opposed to previous VLSI designs, a larger transla­ sizes. With respect to CMOS- I, the ftrst generation tion buffer and primary cache and a de-multiplexed process, CMOS-2 offers a 25 percent reduction in 27-bit address and 64-bit data bus in the REX520 lateral and key vertical dimensions, a 78 percent also improve performance. The tightly coupled improvement in circuit density, and nominally a 128KB backup cachesig nificantly reduces the read 33 percent improvement in chip performance. The latency seen by the CPU when a read misses in the VAX 6000 Model 400 chip set cycle time and cir­ primary cache. It also reduces the read traffic seen cuit density requirements helped drive the develop­ by the memory subsystem. ment of the process that has been optimized for The chip set was designed from the beginning microprocessor chip performance. Details of the with multiprocessing in mind. Because caches must process features and capabilities are shown in remain coherent across all CPUs in a system, a Table 2. method must be provided to invalidate cached loca­ Metal oxide semiconductor field effect tran­ tions in all other caches when one CPL writes to sistors (MOSFETs) are built in a P-epitaxial layer that location. One option would have been tO mir­ (30 ohm-em) grown on a low-resistance P+ silicon ror all writes done by any CPU in the system onto substrate (0.02 ohm-em). The high resistance of the the A-bus of all Cl'l·s. However, mirroring is a rela­ epitaxial layer keeps parasitic junction capacitance tively expensive operation, especially when most of low and allows better transistors to be fabricated. these addresses are not cached in any other CPU. The low-impedance substrate dramatically reduces Instead of the mirroring method, we chose to latchup, a phenomenon in which parasitic bipolar implement a duplicate copy of the primary cache transistors are triggered into a sustained high cur­ tag store in the vc chip and to implement a low­ rent mode. Latchup disrupts normal circuit opera­ overhead port, the 1-bus. The module environment tion and often destroys the chip. The P-channel MOSFETs are made in the N-doped uses the 1-bus to determine if the address is actually cached by either the primary or the backup cache. well regions of the epitaxial layer. The N-channel With this method, only those invalid addresses that MOSFETs are made in the as-grown P-regions of correspond to cached locations must be mirrored the epitaxial layer. This process optimizes the onto the A-bus. mobility of the N-channel MOSFETs and overall circuit speed. CMOS and Packaging Te chnologies An N+ polysilicon (polycrystalline silicon) and The chip set was implemented in Digital's proprie­ tungsten-disilicide sandwich material, polycide, tary second-generation complementary metal forms the MOSFET gates. Polycide resistance is in oxide semiconductor technology, CMOS-2. We order of magnitude lower than that of the poly­ selected CMOS as the chip processing technology silicon used in CMOS-1. The resulting smaller para­ because it offers high density, high reliability, low sitic delays across the MOSFET gates and local power, and low-cost performance. interconnections help improve circuit speeds.

Table 1 Summary of Chip Statistics

Transistor Count Signal Power Control & Chip Pins Die Size Dissipation Memory Data Path

REX520 CPU 157 12mm x 12mm 6W 180K 140K

VC cache controller 178 10.7mm x 10.8mm 2.5W 184K 34K

F-chip floating point 103 12.7mm x 11mm 4W 134K

CLK clock 20 5mm x 5mm 2W 5K Total: 364K 313K

38 Vo l 2 No. 2, Spring 1990 Digital Te cbnicaljournal An Overoiew of the VAX 6000 Mo de/ 400 Chip Set

Table 2 CMOS-2 Process Features and Capabilities

Effective channel length 0.9 microns

Metal 1 3.0 micron width , 1 .5 micron space

Metal 1 contact 1.5 micron x 1.5 micron Metal 2 3.75 micron width, 1.5 micron space

Metal 2 contact 1.5 microns x 1.5 microns Gate oxide thickness 225 angstroms Metal 1 field oxide 1.1 microns Metal 2 field oxide 1.9 microns Polycide resistance 2-4 ohms/square Typical gate delay 300-500 picoseconds Polycide equivalent gate delay length 400 microns Metal 1 equivalent gate delay length 5000 microns Metal 2 equivalent gate delay length 8000 microns

Other interconnections are accomplished Although the REX520 hardware organization through two layers of aluminum. The first layer, and placement resemble that of the previous gener­ metal 1, can connect to either polycide or the ation microprocessor, the REX520 performance N+/P+ source/drain regions via metal 1 contacts. goals were met by tailoring the microarchitecture ' . The upper layer, metal 2, can connect to metal I more closely tO the VAX 8700 processor.··" Any through metal 2 contacts. No other connections are deviations from the microarchitecture of the ECL­ allowed . Based upon its parasitic delay charac­ based V�'< 8700 system were made in the CMOS­ teristics, polycide is used only for local interconnec­ based REX520 to compensate for technology tions. Metal 1 is used for signals communicating differences and to exploit the beneficial aspects of across distances less than half of the chip dimen­ VLSI design. A photomicrograph of the REX520 is sions. Metal 2 is used for global signals, clock dis­ shown in Figure 2. tribution, and power and ground distribution. The chip employs a si..'c-level pipelined engine Because of the speeds and complexities of the built around three autonomous pipes. These pipes VAX 6000 Model 400 chips, noise is a particularly provide simultaneous instruction prefetch and difficult problem for the chip designer. CMOS-2 decode, instruction formatting, operand reference, contains a deep P+ implant that can be used to execution, address translation and result stare, and provide a very low resistance connection between 110 access. the tap surface of the chip and the P+ substrate. The As shown in Figure 3, the major hardware chip designer uses this deep P+ implant to reduce functions of the REX520 are partitioned into the substrate noise that can upset the operation of following: dynamic circuits. • An instruction box (!-box) that contains the The VAX 6000 Model 400 chips were packaged in custom-designed, rigid perimeter-leaded, single­ instruction decoder and a 16-byte prefetch queue (PFQ) chip ceramic packages. The packages included four

power and ground planes to help maintain inter­ • A microcode-controlled execution box (E-box) chip signal integrity and allow full-speed operation that provides the capability for data manipu­ of the chip set. 'i lation in a 32-bit data path

TheREX 520 CPU Chip • A memory box (M-box) that implements VAX The REX520 CPU chip is a third-generation, single­ memory management by utilizing a 64-entry, chip VAX microprocessor. The REX520 provides the fu lly associative translation buffer hardware and microcode sufficient to parse • A 2KB write-through, direct-mapped primary operand specifiers, execute instructions, and han­ cache (P-cache) with a quadword (8 byte) fiU size dle interrupts and exceptions. It cooperates with the F-chip to implement the base instruction group • A bus interface unit (BIU) that controls a fully of the VAX architecture. handshaked, synchronous chip bus

Digital Technicaljournal Vo l. 2 No . 2, Sp ring 1990 39 VAX 6000 Model400 System

• • • • •

.

. : . ::

Figure 2 Photomicrograph of the REX520 Chip

REX520 Pipeline and the instruction data length for the E-box. The REX520 pipeline contains si,x functional seg­ Collectively, this information is called the context ments that cooperate in the execution of instruc­ for the instruction. tions. As shown in Figure 4, there are two segments The !-box divides each VAX instruction into a in the !-box, three microcode-controlled segments microcode subroutine, or microflow, for each in the E-box and M-box, and a single segment for the specifier, and a microflow for the execution of the BIU and P-cache control. instruction. The control programmed logic arrays The !-box decodes the VAX instruction stream. (PLAs) in the !-box cause it tO sequence through the A 16-byte prefetch queue (PFQ) is fi lled with the specifiers of each instruction, sending a microflow instruction stream asynchronously to the pipeline address to the microsequencer, and immediate data control during otherwise unused bus cycles. !-box (if needed) to the E-box for each specifier. When the segment 1 updates the PFQ and parses the next last specifier is parsed , the !-box sends a microflow piece of the instruction stream. This segment sends address, the opcode, and the data length of the microcode addresses to the E-box microsequencer. instruction ro the E-box for the execution of the Segment 2 formats immediate data, the opcode, instruction.

40 Vo l. 2 No. 2, Spring 1990 Digital Te chnicaljo urnal An Overview of the VAX 6000 Model 400 Ch ip Set

MICROFLOW fetched microinstruction to E-box segment 1. The ADDRESS microinstruction is pipelined forward to E-box seg­ ments 2 and 3 in consecutive cycles. The microsequencer fetches one or more microinstructions of a microflow starting at the initial microflow address supplied by the !-box. If a microflow contains more than one micro­ instruction, the microsequencer computes subse­ MICROINSTRUCTION quent intrafl.ow microaddresses and fetches the corresponding microinstructions. A field within INSTRUCTION the microinstruction indicates the end of each CONTEXT 1---1- microflow. The E-box performs all the address and data manipulation required for the REX520 to adhere to the VAX architecture. The three E-box segments INTERCHIP BUS operate under microinstruction control; the operand fe tch segment reads operands from gen­ eral-purpose registers (GPRs) or from the memory data (MD) fi le and presents them to the fu nctional units; the execution segment performs data manip­ ulation on the operands; and the result store seg­ Fig ure 3 REX520 Block Diagram ment writes results to registers or memory. The E-box pipeline segments are fu lly folded, i.e., each The !-box pipeline runs autonomously to the segment simultaneously operates on a diffe rent E-box pipeline. That is, the 1-box segments con­ microinstruction. tinuously parse instruction stream data, making The performance of memory read accesses is microt1ow addresses available. The I-box pipeline improved using a read-and-run technique in the advances whenever the microsequencer accepts a E-box. The destination for data stream memory microtlow address. reads is the E-box MD file. Microinstructions that The microsequencer in the E-box performs a read initiate a memory read simply queue the request to of the 1696 word control store each cycle, overlap­ the BIU. The microfl.ow may then continue, without ping in time with 1-box segment 2. It presents the waiting for the request to complete. In this manner,

I-BOX I-BOX E-BOX/M-BOX E-BOX/M-BOX E-BOX/M-BOX SEGMENT 1 SEGMENT 2 SEGMENT 1 SEGMENT 2 SEGMENT 3

INSTRUCTION INSTRUCTION OPERAND EXECUTION RESULT STORE PARSE AND DATA AND FETCH DECODE f.-- CONTEXT f- ,._ ,. FORMATTING VIRTUAL MICROFLOW ADDRESS ADDRESS CONTROL STORE SELECTION AND ACCESS GENERATION ACCESS TRANSLATION CHECKING

BIU/P-CACHE SEGMENT 1

BUS ARBITRATION

CACHE ACCESS

Figure 4 REX520 PipelineSe gments

Digital Tecbnicaljournal Vo l. 2 No. 2, Sp ring 1990 4 1 VAX6000 Model 400 System

some or all of the time required for memory greatly, as demonstrated by the P-cache and control accesses is hidden during productive microflows. store designs. The request completes when the BIU and memory The VAX 8700 system uses cache memory to subsystem return data to the MD file. decrease the effective memory access time by Because subsequent microinstructions may refer­ employing a 64 KB cache in ECL RAM with a single­ ence an MD file location for which a memory read cycle access. It was not possible to implement a has not yet completed, MD file accesses are syn­ RANI of this size on the REX520, and signal integrity chronized through a valid bit mechanism. Each lvi D constraints dictated a minimum of three cycles to file location has a valid bit that is reset when a mem­ access off-chip RAM. The REX520 compensates by ory read is started. The bit is then set when data is using a two-level cache. The REX520 has a 2KB pri­ written to the file. If a microinstruction attempts to mary cache on the chip with a single-cycle access, reference an MD file location whose valid bit is not and there is a 128KB off-chip secondary cache. The set, the pipeline stops advancing (stalls) and waits cache hierarchy reduces the effective memory for the data to be returned. access time sufficiently to meet the performance The three M-box segments are microcode­ goals, and it keeps the off-chip bandwidth within controlled and typically run synchronously with reasonable rates. the E-box segments. M-box segment 1 decodes the Similar signal integrity constraints are imposed current microinstruction. Segment 2 then selects on the REX520 control store design. The VAX: 8700 an appropriate address source and performs a vir­ system uses a 16K entry control store implemented tual to physical address translation if required. The in ECL RAM. The VAX 8700 system designers use the final segment issues the physical memory request to large control store address space to improve per­ the BIU. formance and simplify hardware. The control store During normal pipeline flow, the BIU and P-cache access of the VLSI REX520 cannot cross a chip operate in a single pipeline segment which is over­ boundary and keep pace with the fast internal cycle lapped with segment 3 of the E-box/M-box pipeline. time. Because of these limitations, an on-chip con­ The BIU acts as an arbiter for the external and some trol store ROM was used. Chip area limitations con­ internal buses, and supplies control to the P-cache. strained the size to 1696 entries. The designers of The BIU receives address and memory request the REX520 compensated for the lack of control information from the M-box. The BIU then decides store address space by altering the coding style to whether the M-box or an external reference should make microflows more serial than the vAX 8700 receive service from the P-cache and sends the system microflows. Additional hardware features appropriate information to the P-cache to process also helped to absorb some of the work previously the request. The BIU sends read and write requests done in microcode. that cannot be serviced by the P-cache to the off­ The operand specifier microflows offer an exam­ chip memory subsystem. ple of microcode space compression facilitated Write buffers in the BIU improve the perfor­ through hardware enhancements. The REX520 mance of memory write operations. Microinstruc­ microflows group similar specifier types together tions that initiate memory writes queue the request and provide a general routine for each. The general to the BIU, and the microflow continues without routines are parameterized in hardware to resolve waiting for the request to complete. In this manner, the differences in specifier types within each group. the time required for memory writes can be hidden In this way, the size of the microflow is minimized, under productive microflows. The BIU can store up and the performance characteristics of a dedicated to two quadword writes (32 bytes) in its buffers microflow are preserved. while waiting for the memory subsystem to become Each general specifier microflow performs the available. function appropriate to the specifier mode by using a parameterized access type and data length. For Adaptationsfo r VL SI example, a microinstruction in the specifier flows VLSI designs have relatively good signal integrity on may make an access-type dependent memory the major bus structures within a chip. However, request. If the 1-box supplied a read access type, the signal integrity diminishes when crossing a chip memory request hardware initiates a read. If the boundary because off-chip bandwidth is incapable !-box parsed a result store specifier, the write access of sustaining the internal data rates. This character­ type parameter causes hardware to initiate a write. istic of VLSI design influences microarchitecture Operand data length is supplied in a similar fashion.

42 Vo l. 2 No. 2, Spring 1990 Digital Technicaljournal An Oven;iew of the VAX 6000 Mode/ 400 Ch ipSet

The general microflows leave operands in fi xed microflow also needs to know nothing about the locations in the E-box MD file. The exact location in kind of specifier that supplied the operand. the MD file is also supplied by the !-box. Parameterizing the microflows with the access Tbe F-cbip type, data length, and MD file location allows the The floating point accelerator chip, F-chip, is generalized specifier microflows to be shared. the companion processor chip to the REX520 for The MD file parameter creates an independence floating point operations in VAX 6000 Model 400 between the specifier and execution microflows. systems. A photomicrograph of the chip is shown The same specifier microflow can then be used for in Figure 5. each specifier mode, independent of where it The F-chip implements all VAX base instruction actually appears in the instruction. The execution group floating point instructions and the longword

Figure 5 F-chip Photomicrograph

Digital Tecbnicaljournal Vo l. 2 No. 2, Sp ring 1990 43 VAX 6000 Model 400 System

integer multiply instruction. The data types imple­ Microarchitecture and Implementation mented by the F-chip are F-floating (1-bit sign, 8-bit The F-chip is composed of an interface section and a exponent, 24-bit fra ction), D-floating (I-bit sign, five-stage execution core. A block diagram of the 8-bit exponent, 56-bit fraction), G-floating (!-bit F-chip is shown in Figure 6. The F-chip interfaces to sign, 11-bit exponent, 53-bit fraction), and integers the rest of the VAX 6000 Model 400 system through (8-bits, 16-bits, and 32-bits). the D-bus , eight status and control lines, and four Unlike the VAX 6000 Model 200 floating point bus control signals. The imerface section receives chip, the F-chip employs a uniformly pipelined the opcode from the REX520, the operands from architecture that has performance which is inde­ the CPU chip, and cache and memory data on the pendent of the operand values. Early in the design it D-bus. The interface section decodes the opcode was decided to use a pipe lined microarchitecture in and receives the required number of operands. It order to use the same execution core in future assembles the diffe rent pieces of the operands and designs, such as the VAX 6000 Model 400 vector supplies fo rmatted operands to the execution core. processor floating point unit. Consequently, the After the execution of the floating point operation, execution core of the F-chip was designed to the output interface transfers the fo rmatted result include low latc.:ncy for scJiar applications and high back to the CPU chip. ( throughput for vector pron:-ssing ' The core exe­ The execution core consists of a divider, which is cutes most instructions in four cycles. Double­ bypassed in all operations except division, and four precision and integer multiply instructions take pipelim:-d stages that are uniformly utilized in the five cycles, and divide operations take 13 to 24 execution of all instructions. Each stage has a frac­ cycles, depending upon the datJ type . tion data path, control , and sign and exponent data

INPUT INTERFACE J t t t

EXPONENT DIVIDER CONTROL DIVIDER *---- 1--- LOGIC t t t +

FRACTION MULTIPLIER EXPONENT ADDER RECODE CONTROL STAGE 1 DETECTION - 1--- PROCESSOR LOGIC [3II LOGIC I t I t I t t

MULTIPLIER 1 OR EXPONENT CONTROL STAGE 2 ARRAY STICKY BIT - - PROCESSOR ,I � ':LEADING 1 :t I t t t

EXPONENT ADDER NORMALIZER CONTROL - STAGE 3 I I 1- PROCESSOR I + I + t t

EXPONENT CONTROL STAGE 4 1- f-4- PROCESSOR �+ � l OUTPUT INTERFACE I

Figure 6 F-chzp Execution Core

44 Vn l. 2 Nn. 2, Sp n·ng /C)')() Digital Techllicaljournal An Overoiew of the VAX 6000 Mode/ 400 Ch ip Set

paths. The fraction data path, the most complex (See Figure 7.) For the effective subtract operation, portion of the chip, is 60 bits wide. stage 1 uses an exponent difference prediction The divider fraction data path consists of the scheme in which the least significant two bits of the hardware divider array and quotient logic. The two exponents are examined to determine whether divider array implements an iterative radi.x-2 SRT 0 or 1 bits of alignment are required. nonrestoring division scheme, and generates three If the actual exponent difference is 0 or 1, stage I signed quotient bits per cycle.7 The quotient and selects the adder result. Otherwise, stage 1 passes remainder are driven on consecutive cycles to the original operands to the next stage for align­ stage 1 of the pipeline. ment. In multiplication flows, the stage 1 adder is The stage 1 fraction data path consists of a 60-bit used to compute three-times the multiplicand, and adder, the multiplier recode logic, and the fraction the recode logic generates the recoded multiplier detection logic. Stage 1 receives its inputs from the bits for stage 2. In divide operations, stage 1 is used interface section, or from the divider for divide to assimilate the redundant quotient and remainder operations. In add and subtract operations, stage 1 vectors into a two's complement form. is used primarily to compute the difference The stage 2 fraction data path consists of a between the exponents of the two operands. In dynamic 57-bit right shifter, unified leading one or parallel, stage I is also used to determine the frac­ sticky-bit detection logic, and a multiplier array. tion diffe rence for exponent differences of 0 and I. (The sticky-bit indicates whether a "1" was shifted out of the data path during alignment operations.) The shifter is used in add-like operations for align­ ing the fractions. The leading one or sticky-bit logic SIGN 1 EX PONE NT SIGN 2 1 determines the normalization amount during sub­ FRACTION 1 FRACTION 2 EX O NENT 2 tract operations and integer-to-floating-point con­ I I versions. This logic also determines the sticky-bit ADDER during alignment in effective subtract flows. The SIGN AND multiplier array implements a radix-8 modified I I CONTROL � EXPONENT Booth algorithm and consists of nine carry save SELECT I adder (CSA) rows. It is traversed once for single­ I :I precision fo rmats and twice for integer and double­ I precision formats. The hardware organization for SHIFTER-R the multiply operation is shown in Figure 8. II r SIGN AND Stage 3 of the pipeline consists of a 60-bit adder EXPONENT and a 57-bit dynamic left shifter for normalizing the L 1 D/STICKY BIT I B intermediate results. The fraction adder is used in multiply operations to assimilate the product sum >II SHIFT AMOUNT • • I and carry vectors, and in add-like flows to add or I NORMALIZER subtract the aligned operands. I Stage 4 of the pipeline consists of a 60-bit adder

SIGN AND used for rounding and negation of the final result, ADDER CONTROL EXPONENT and the exception detection logic. The exception I I handling is done with a PLA which detects and sig­ � nals overflow, underflow, zero results, and invalid SELECT operands. I J The control is hardwired (no microcode is used) + I and distributed among the pipeline stages. Each SIGN AND ADDER CONTROL � I EXPONENT stage has autonomous control implemented with comrol decoder PLAs. Each stage supplies control information to the next stage one cycle ahead of the FRACTION� RESULT SIGN� EXPON � ENT data. This enables the data path control signals to RESULT RESULT set up before the data is propagated to the input of the stage. The distributed control scheme permits Figure 7 F-chip Hardware fo r Addition simultaneous execution of a divide and four other and Subtraction instructions in the pipeline.

Digital Te chnicaljournal Vo l. 2 No. 2, Sp ring 1990 45 VAX 6000 Model 400 System

FRACTION DATA PATH pendent of the data, as opposed to five steps without the overlap. MULTIPLIER MULTIPLICAND For multiplication, this organization enabled the ADDER use of a radi.x-8 algorithm, which requires the w I I computation of three-times the multiplicand. The (') 3X MULTIPLICAND

(f)� implementation of allother operations was realized I RECODER/IPPR I with the addition of minimal logic. RECODE

TC IPP 1beVC Ch ip

SF CF The vc chip, or the backup cache controller and vector interface chip, implements the second level N SELECT w of a two-level cache structure and the interface to [ I an optional vector processor. A photomicrograph � s c (f) of the vc chip is shown in Figure 9. MULTIPLIER 1-9 ROWS I I Cache Control Fu nctions The REX520 contains a 2KB primary cache data and M w tag store. The vc chip contains the tag store and � ADDER control logic to perform reads, writes, and invali­ (f) I dates to a 128KB secondary, or backup, cache. The PR " backup cache is direct-mapped and write-through. w (') {1 ADDER (RND/NEG) The vc chip backup cache tag store is organized � such that one tag and four valid bits correspond to (f) I �FP every four-octaword (64 -byte) block of the cache. Each valid bit corresponds to a one-octaword sub­ KEY: block, as illustrated in Figure 10. When a cache tag SF, CF SUM AND CARRY OUTPUTS miss occurs on a read, a block is allocated, a sub­ TC TWO'S COM PLEMENT VECTOR IPP INITIAL PARTIAL PRODUCT block is fi lled, and the corresponding valid bit is set. S, C SUM AND CARRY INPUTS When a cache tag compare is successful but the PR PRODUCT FP FRACTION PRODUCT valid bit is not set, a sub-block is filled from memory and the corresponding valid bit is set.

Figure 8 F-chipMulti plication Fraction The Backup Cache Data Store Data Pa th The bach.CJp cache data store is built with off-the­ shelf CMOS static RAMs, which are located on the The exponent data path in each stage is 13 bits module. The discrete cache data RAMs are orga­ wide. Each stage has a 13-bit adder and detection nized as 8 bytes wide (the width of the D-bus) by logic for detecting zero operands, exponent differ­ 16,384 locations deep. In addition, there are eight ences, and exception conditions. To compute the bits of parity, with one bit corresponding to each absolute value of the exponent difference in add, data byte. Fourteen bits of the VAXphysical address subtract, and convert operations, the stage I expo­ are needed to access the cache, as shown in Fig­ nent data path has an additional 13-bit adder and ure 11. When the backup cache returns data to the selection logic. REX520 primary cache, it returns one quadword, The microarchitecture and the hardware orga­ the fill size of the primary cache. nization of the F-chip were chosen to efficiently implement the three basic operations add/subtract, TheInvalidate Filter Bus, /-bus multiply, and divide. As an example of this effi­ As mentioned, the chip set implements a special­ ciency, the adder in stage 1 is used in both the purpose bus, the !-bus, that allows the memory effective subtract flow and the multiplication and subsystem filter invalidates. In addition to this bus, division flows. This flow overlap permits the execu­ the vc chip also implements a copy of the REX520 tion of effective subtraction in only four steps, inde- primary cache tag store for invalidate flltering.

46 Vo l. 2 No. 2, Sp ring 1990 Digital Te chnicaljo urnal An Overview of the VAX 6000 Mod.e/ 400 Chip Set

Figure 9 VC Chip Photomicrograph

VALID BITS BACKUP CACHE } FIRST VA LID ONE SUBBLOCK OCTAWORD BIT 0 (FILL SIZE)

SECOND VALID OCTAWORD 1 BIT ONE BLOCK VC CHIP (ALLOCATION TAG STORE SIZE) THIRD VALID OCTAWORD BIT 2

FOURTH VALID OCTAWORD 3 BIT

Figure 10 Ta g, Va lid Bits, andBackup Cache Store

Digital Te ch-nicaljo urnal Vol. 2 No. 2, Sp ring l

ADDRESS BIT The RSSC provides the operating system W•ith the 29 28 11 1 0 4 3 0 hardware primitives needed to implement the boot

CACHE ENTRY TAG and console routines, and with several necessary I I timing mechanisms. The RSSC is designed to inter­ face directly with the VAX 6000 Model 400 chips, 1/0 SPACE. NOT CACHED and is based on the system support chip, or sse s BAC KUP CACHE INDEX The decision to base the RSSC chip on the previ­ LJ ous sse design was made because we could take UNUSED advantage of an existing core of logic which already implemented the required functions. The sse was Figure 11 VA X PhysicalAd£ lress and Backup fabricated in the CMOS-I process and had been thor­ Cach e RAMs oughly debugged and qualified for the MicroVAX 9 3500/3600 system. The challenge fa ced by the When an address is sent to the vc chip on the RSSC team was to design a new pad ring and bus !-bus, the VC chip notifies the memory interface as interface unit (Bll ) that would interface the existing to whether or not the address produced a hit in sse core to a much different and fa ster interchip either cache. The memory subsystem broadcasts environment. the invalidate onto the A-bus only if a match was detected. This mechanism significantly improves RSSC Fu nctions performance in multiprocessor systems by reduc­ The functions that the RSSC provides for the VAX ing the overall traffi c on the processor buses. 6000 Model 400 module can be grouped into two categories : boot and console code support, and TheVe ctor Interfa ce timer functions to support the operating system. The vc chip implements the interface logic that Some of the important console and boot code allows an optional vector unit to be connected to support fu nctions contained in the RSSC are the the scalar processor module. The vector unit has ROM/EEPROM interface, CPU halt-request protec­ the potential to signific:mtly increase the perfor­ tion, ARTs with programmable baud rates, I KB of mance of many analytical applications. standby RAM, and input and output ports which If the vector unit is present in the system, the interface to several console and module-level REX520 scalar CP chip decodes vector instruc­ switches and LEOs. The RSSC also implements three tions and passes operand and control information timer functions: the bus timeout counter, a I 0-milli­ to the vector module through the VC chip vector second intervaltimer, and a time-of-year clock. interface. The vector unit, using its own memory interface, accesses vector data from memory. 1be CLK Chip The vector and scalar modules are connected Clock generation for the VAX 6000 Model 400 chip through the vector interface bus (VIB). The VC chip set was placed in a separate chip, the CLK chip, to is the sole master of the VIB. The VIB is asyn­ better control interchip skew. By separating the chronous to the vc chip and synchronous to the clock, we eliminated power, pin count, area, and vectOr module. The vc chip implements the asyn­ noise problems from the other possible home of the chronous control logic required for transferring clocks, the REX520. Three main factors were taken data between the two modules. The decision to into account in designing the CLK chip. These fa c­ operate the VIB asynchronously with respect to the tors were the number of clock edges, clock skew, VAX 6000 Model 400 module was made in order to and signal integrity. A photomicrograph of the CLK simplify the design of the vector module. chip is shown in Figure 13. In determining the optimal configuration for TheRSSC Ch ip the number of clock edges, the CMOS micro­ A variety of support logic is required to com­ architecture was examined closely. The number of plete the functionality of a VAX CPU module. The clock edges per microcycle determines the granu­ VA X 6000 Model 400 system support chip (RSSC) larity available to the designer. Increased granular­ integrates on a single chip the common core of ity can simplify the design process by reducing the functions necessary to support the VAX system need for self-timed design techniques. However, environment. A photomicrograph of the RSSC is the usefulness of partitioning the design cycle is shown in Figure 12. limited. Granularity on the order of a gate delay is of

48 Vo l. 2 No. 2, Sp ring 1990 Digital Te chnicaljo urnal An Overviewof the VAX 6000 Model 400 Chip Set

Figure 12 RSSC Photomicrograph

little use, whereas granularity on the order of a four 0-type flip-flops, is shown in Figure 14. CLK is microcycle imposes the use of self-timed circuits. A the clock input, 0 is the data input, Q is the latched decision was made to distribute four 14-ns pulses true data output, and QB is the latched complemem phased in time by 7 ns to facilitate the design of data output. regular arrays, data path structures, and IIO func­ CLK is the 143-MHz crystal oscillator clock. The tions. This distribution made master-slave clock circuit consists of a 0-type flip-flop with one mas­ generation and single-phase generation easy, and ter, 01, and three slaves 02, 03, and 04 . 02 pro­ complementary clocks for latches more available. vides local feedback for the divide-by-two function Clock skew was minimized through logic and in order for 03 and 04 to match identically. 03 is circuit techniques. The clocks are generated from a the Q slave and 04 is the QB slave. 143-MHz crystal oscillator. The frequency is imme­ 03 allows the inversion in the feedback path to diately divided by two to generate an even duty be absorbed and the Q and QB to see the same num­ cycle 71.5-MHz signal and its complemem (PHI and ber of gate delays. This logic technique works for PHLBAR). The divide circuitry, which comprises all dividers used in the design. However, the

Digital Technicaljournal Vol. 2 No. 2, Sp ring 1990 49 VAX 6000 Model 400 System

Figure 13 CLK Chip Ph otomicrograph

dividers feed a series of large ourpur drivers in package, three power pins, three ground pins, and which skew could not be controlled through logic two signal pins were used for each driver. techlliques. In these cases, skew was minimized This pin design reduced the inductance seen by by matching the capacitance on each node. The end each driver by approximately a factor of two. In result was a design in which the skew between any two edges generated by the CLK chip was not greater than 0.3 ns. Signal integrity was carefully considered in the design. The clocks had to be distributed to eight different loads, and the total skew between any two edges could not exceed 0. 5 ns. This design implied CLK that the CLK chip package and the clock module interconnect could not contribute more than 0.2 ns of skew. To minimize the electrical impact of the Figure 14 CLK Ch ip Divide Circuitry

50 Vo l. 2 No . .!, 5/J rin[?, 1()()0 Digital Te chnicaljounwl An Overview of the mX'6000 Model 400 ChipSet

addition, the clocks were radially distributed, and 6. B. Benschneider et a!., "A Pipelined 50-MHz diode termination was used on each clock leg. The CMOS 64b Floating Point Arittunetic Processor," length of each leg was identical. IEEE jo urnal of Solid State Circuits, vol. 24, no. 5 (October 1989): 1317- 1323. Conclusions 7. D. Atkins, "Higher Radi,-x Division Using Esti­ The VAX 6000 Model 400 chip set represents a suc­ mates of the Divisor and Partial Remainders," cessful mapping of the ECL-based VAX 8700 system IEEE Transactions on Computers, vol. C-17 microarchitecture into a CMOS-based VLSI chip set. (October 1968): 925-934. The fully folded pipeline, on-chip cache and control 8. ]. Winston, "The System Support Chip, a Multi­ store, and read and run/write and run strategies of function Chip for CVAX Systems," Digital the processor chip, combined with a high perfor­ Te chnical jo urnal, vol. 1, no. 7 (August 1988): mance floating point processor and a second-level 121-128. 6 4 cache, enabled the VAX 000 Model 00 to exceed 9 G. Lidington, "Overview of the MicroVAX 3500/ the original performance goals for the system. 3600 Processor Module," Digital Te chnical journal, vol. 1, no. 7 (August 1988): 79-86. Acknowledgments The authors would like to acknowledge the follow­ ing people for their contributions to the VAX 6000 Model 400 chip set development effort: Randy Allmon, Brad Benschneider, Mary Jo Butler, Sandy Carroll, Gerry Cheney, Larry Commodore, Beth Cooper, Amnon Fisher, Nannette Fitzgerald, Annette Flohr, Moshe Gavrielov, Ed Gomes, Paul Gronowski, Bill Grundmann, Ellen Kagan, David Kravitz, Liam Madden, Vijay Maheshwari, Steve Martin, Karen McFadden, Roger Meeks, George Mills, Mike Minardi, Millind Mittal, Dave Morgan, Victor Peng, Jeff Pickholtz, JimRei nschmidt, Doug Sanders, Katie Siegel, Rebecca Stamm, Pete Starvaski, Bob Supnik, Bill Wheeler, Jeff Winston, and Bill Upham.

References

1. P. Sullivan et al., "The VAX 6000 Model 400 Scalar Processor Module," Digital Te chnical journal, vol. 2, no. 2 (Spring 1990, this issue): 27-35. 2. T. Leonard, ed., VAX Architecture Reference Ma nual (Bedford: Digital Press, Order No. EY-34 59E-DP, 1987). 3. T. Fox et al., "The CVAX 78034 Chip, a 32- bit Second-generation VA X Microprocessor," Digital Te chnical jo urnal, vol . 1, no. 7 (August 1988): 95-108. 4. S. Mishra, "The VAX 8800 Microarchitecture," Digital Te chnicaljo urnal, vol. 1, no. 4 (February 1987): 20-33. 5. ]. Bartoszek et al., "VAX 6000 Model 400 Physical Technology," Digital Te chnicaljoumal, vol. 2, no. 2 (Spring 1990, this issue): 52-63.

Digital Tecl.micaljoumal Vo l. 2 No. 2, Sp ring 1990 51 jo hn T. Bartoszek Robert]. Hannemann Stephen P. Hansen Robert]. McCarty john C. Sweeney

VAX 6000 Model 400 Phys ical Te chnology

Thephy sical realization of the VAX 6000 Model 400 microprocessor design offe red a number of significant challenges at both the chip package and the module Levels. In meeting the requirements for a robust and manufacturable midrange imp le­ mentation, the VAX MOO Mo del 400 physical technologyapp roach broke new ground for Digital, and, in some cases, fo r the industry. New developments included the fi rst tape-automated bonding (JAB} interconnected semiconductors, extensive board-level physical simulation, and the use of advanced testability fe atures on a microprocessor-based midrange product. Thispa perprov ides details of thephysi­ cal technology used in the �'lL-Y 6000 Model 400 proj ect to achieve system-level product goals.

In troduction • The design and development of an advanced (PWB) As shipped beginning in July 1989 , the VAX 6000 printed wiring board technology that Model 400 microprocessor chip set is one of the allowed over 5000 inches of interconnecting fa stest and highest performance complex instruc­ wiring on only four routing layers of a 9-inch by tion set computer (CISC) CPLJs offered by the indus­ 11-inch board, with two different controlled try. Its microcycle time is 28 nanoseconds (ns). impedance levels Such performance makes the chip set one of the • The extensive use of electrical and thermal simu­ most demanding in terms of physical technology lation at the chip package and mod ule levels design at the chip package and single-board­ computer (SBC) level. This paper details the req uire­ • The employment of an advanced surface-moum ments that drove the VAX 6000 Model 400 physical technology (SMT) that allows the use of mixed technology and describes the resulting technology component styles (50-mil SMT devices and fine­ solutions. The range of solutions included the pitch, 25-mil devices, and a limited number of design process, module assembly, and advanced through-hole mounted devices); and attachment test features. of both active and passive components on both The VAX 6000 Model 400 project was the first sides of the board Digital microprocessor-based system effort that • The use of advanced rest techniques, most nota­ re4u ired developers to use large-computer bly an innovative continuity transistor strucwre design tools and pro cesses. During the project, a (CTS) for assembly verification and boundary number of firsts, for either Digital or the industry, scan design for the core CPU chips were achieved. AU of the project's technology developments will • The use of advanced rape-automated bond­ be employed on fol low-on products, in both mid­ ing (TAB) technology in a manufacturing range and entry-level systems. Ensuring this kind of environment technology extensibility was our explicit goal. • The development of innovative surface-mount­ In the remainder of this paper, physical tech­ able chip packages th:H provide high lead count nology refers to the chip packaging and inter­ and a controlled electrical environment for the connection technologies at the individual device custom CMOS CPU and bus interface chips and module levels.

52 Vo l. 2 No. 1, Sp ring !'.NO Digital Te chn icaljo urnal VAX 6000 Model 400 Physical Te chnology

Te chnology Requirements The core chip set was expected to be used in The physical technology requirements that drove low-end products with limited air-flow capab il­ the VAX 6000 Model 400 technology effort arose ities. However, a junction temperature limit of85 from two primary sources: architectural and per­ degrees Celsius at the DEC standard 102 Class B formance requirements, and manufacturing and environmental limits had to be maintained . reliability related needs. How we addressed these problems is described in the later sections, Chip Packaging and Module Architectural and Perfo rmance Te chnology. Requirements The CMOS-2 semiconductor process technology Ta ble 1 VA X 6000 Model 400 CMOS-2 and the microarchitecture were chosen to satisfy Features Affecting Interconnect the system perfo rmance goal of at least six VAX units Technology

of performance (VUPs). That choice established the Driver rise time 1 ns baseline physical technology performance and elec­ Clock cycle 28 ns trical requirements. - 4 phase cycle One processor design goal in particular had a Data bus width 64 bits significant effect on the physical interconnect tech­ Cache access time 7 clock nology. The processor was to operate at a CPU phases SBC microcyclc time of 28 ns at the level. Although (49 ns) the nominal target of the chip set design was opera­ Max. chip pin count 224 tion at 40 ns, early indications were that the CMOS-2 Max. chip power 6 watts process would allow significantly faster operation. Therefore, the rest of the physical technology needed to support the 28-ns goal. Ta ble 2 lists the key CMOS-2 chips developed for Ta ble 1 sununarizes the major features of CMOS-2 the VAX 6000 Model 400 processor and the key and the VAX 6000 Model 400 architecture that characteristics of these chips that affected the drove the physical technology requirements. Chip physical architecture of the design. packaging and module technology requirements were affected in two ways: Ta ble 2 VAX 6000 Model 400 Chip Set Characteristics • The wide data bus could result in many drivers switching simultaneously. Coupled with the rel­ Lead Maximum Lines atively fast driver rise times, simultaneous Count Switching Power Chip (Actual) (Approximate) (W) switching significantly increased package elec­ trical performance. Our major concern was the P-chip/DC520 224 81 6.1 high current draw from local (within the pack­ F-chip/DC523 224 80 4.5 age) power planes. The electrical performance VC-chip/DC592 224 60 4.2 required of the chip packages clearly indicated XDP/DC550 224 100 3.6 that simple cerquad packages would not work. XCA/DC551 224 70 3.6 The primary shortcoming of the cerquad pack­ ages was the high inductance of the signal and power connections. Performance levels required A more direct measure of the electrical perfor­ multilayer ceramic packages with features that mance required from the VAX 6000 Model 400 were at the limit of available ceramic technology. physical interconnect system can be derived by examini ng the timing budget allocated to some of A wireability estimation technique used early in the signal transfers that occur within the CPU. the program indicated that fou r to sLx routing Three of the most critical are the clock distribution layers would be needed in the PWB, using 5-mil system, the cache access loop, and the data bus line widths and 5-mil spacing between lines, to between the core CPU chips. The clock signals need route the core CPU chips with associated cache. to be valid and synchronized within 0.5 ns at the • The relatively high power dissipation require­ end points of all clock lines. The clock electrical ments contained in the initial design specifica­ performance relied heavily on the uniformity of the tion posed considerable development difficulty. distribution system.

Digital Technicaljo urnal Vo l. 2 No. 2. .vJring 1990 53 VAX 6000 Model 400 System

The cache access loop timing was slightly more the chip boundaries as opposed to the original complicated. The cache access loop timing budget target of fu ll scan at all logic boundaries. that was established for the 28-ns version of the CPU is shown in Ta ble 3. The interconnect system pri­ • Test transistor structures for interconnect verifi­ marily affects the address bus (A-bus) settling time cation at module assembly were implemented (20.5 ns). for the custom chips.

• Separate, system-specific micro heat sink designs Table 3 Cache Access Loop Timing Budget allow all chips to operate at a maximum of Address bus settled 20.5 ns 85 degrees Celsius, with one exception: the Buffer 7.0 ns 6. 1 watr P-chip reaches 89 degrees Celsius under RAM access 21 .0 ns absolute worst-case conditions. Although not specifically meeting the defined goal, overall To tal 48.5 ns system reliability and operations were not fe lt to be impacted to any significant degree by aUow­ ing this exception. Ma nufacturing and Reliability Requirements Some goals were not met .

Certain manufacturing and reliability goals estab­ • The interchip routing included over 1000 sig­ lished for the VAX 6000 Model 400 product also nal nets and over 1800 routed nets within the influenced the physical technology selection and PWB. This large number of nets required the use design. The major goals were as follows: of the clock routing layer to complete the inter­ connect of the CPU module. Although the PWB • The use of TAB as the off-chip interconnect tech­ technology supported 5-mil lines and spaces, the nology, which was viewed as an appropriate signal integrity constraints of the system entry point for TAB required more than 5-mil spaces between signal • An all surface-mount technology (SMT) module lines. In addition, the ability to do etch cuts on a assembly approach 5-mil line was determined to be too difficult and risky to pursue. Engineering changes were per­ • A robust test strategy, including significant use of formed by doing cuts in the surface etch, which fault diagnosis using boundary scan and other connects each surface-mount pad to its associ­ test fe atures ated via, which in turn connects to an internal • Maintenance of the ability to perform engineer­ signal layer. ing changes using etch cuts on the PWB by restricting signal routing to the second and ninth • TA B tape and packages were designed, proven, layers, with clock lines only on the remaining and sourced for six of the 224-pin CMOS-2 chips two signal layers in the VAX 6000 Model 400 processor. However, for manufacturing logistics and line-loading rea­ • Achievement of the required 85 degree Celsius sons, TAB is currently used in only a subset of junction temperatures these devices. A series of wire-bonded backup Most of these goals were met : packages was designed and is now in use for the remainder of the 224-pin devices. • The worst-case A-bus and data bus (D-bus) settling times were 19.6 ns and 20.2 ns, respec­ Nonethekss, the physical interconnect technology tively. These settling times met the timing used did result in package designs and a module requirements as defined in Ta ble 3. design that supports 7 VUPs, 224 1/0, fine-pitch devices, with associated cache, operating at 28 ns. • A nearly total surface mount assembly was The details of the chip packages, the PWB tech­ achieved, with only connectors, oscillators, and nology and module design are presented in the erasable progranunable read-only memories fo llowing sections. Also described are the test struc­ (EPROMs) in through-hole configurations. tures required to satisfy the performance needs and • Boundary scan was designed into the core CPU other established goals of the VAX 6000 Model 400 chips. Observe-only scan latches were used at CPU product.

54 Vo l. 2 No. 2, Sp ring /'J'JO Digital Te cbnicaljournal VAX 6000 Model 400 Physical Te chnology

Ch ip Packaging with a wide variety of packaging formats (both sin­ In March 1989, Digital's Semiconductor Intercon­ gle chip and multichip ), and improved pin density. nect Technology Group (SCIT) completed quali­ The technology also had the potential for electrical fication of the Model 400 processor advanced chip enhancements with multiconductor TA B tape, and packaging technology. indications were that the industry as a whole would This qualification consisted of two primary ele­ embrace TA B technology. ments, which are discussed in the sections below. Unlike the wire-bonding process, TA B utilizes The first element is the initial implementation of photolithographically defined copper conductors our internallydeveloped perimeter tape-automated stabilized by a dielectric film made of polyamide. bonding (PTAB- 1) process. The process serves as the As shown in Figure I, the TA B tape consists of an interconnection medium from the semiconductor inner set of connections for the chip-to-tape or device to the chip package. The second element is inner lead bonds (ILB). External to the dielectric the chip package itself, which is an advanced 224- material is the outer lead bond (OLB) region where Jead multilayer ceramic (MLC) surface-mount com­ the tape-to-package connection is made. The leads ponent. The TA B/MLC combination provides an then fan out to a set of test pads. These pads permit effective chip package solution for the VAX 6000 full electrical testing of the semiconductor device at Model 400 products. Moreover, TA B/MLC serves as a this level of assembly. Sprocket holes serve as han­ technology springboard for further TAB-based dling and alignment features. packaging solutions currently in development for Unlike wire bonding, in which the wires may be future CMOS systems. bonded directly to the aluminized bonding pad on A fundamental decision was made early in the the chip, the planar TA B tape requires a raised pillar, TA B technology program. To reduce risk, we or bump, on each bond pad. The bump, usually wanted to introduce TA B in an environment that made of gold, is typically 25 microns high. In addi­ allowed a convenient wire-bonded backup to be tion to acting as a standoff, the bump provides a developed in parallel. The TA B/MLC solution proved surface that is appropriate for the bonding of the ideal for this approach, and a set of wire-bonded gold-plated tape. Figure 2 shows a bumped and packages was designed and qualified. Subsequently, bonded device. The bump process may be thought several of the VAX 6000 Model 400 chips were of as an extension to the wafer fabrication process moved to the backup wire-bonded packages for in which the construction of the bond pads is con­ manufacturing logistics and line-loading reasons. tinued vertically. Wire-bonded technology is not described in this paper because the package designs are very similar to the TA B/MLC packages detailed below.

Tape-automated Bonding Te chnology The connections from a semiconductor chip to the package have traditionally been made with wire­ bonding technology. In this process, free-floating wires are placed individually from the aluminum chip termination pads to the package internal con­ nections. Although this technology is versatile, it has some limitations in high pin count, high-density applications. Bonding becomes very time-consum­ ing at high pin counts; package and manufacturing tolerances become critical; and package choices with wire bonding have increasingly narrowed to newer and more expensive technologies, such as precision thin films. From 1980 to 1981, exploratory work was per­ formed within Digital on alternative interconnec­ tion technologies. In 1982, the decision was made to pursue TA B technology. The rationale for this deci­ sion included TA B's easy testability, compatibility Figure I TAB Ta pe Configuration

Digital Technicaljournal Vo l. 2 No. 2, Sp ring 1990 55 VAX6000 Model 400 System

Once the TAB tape has been aligned with the bumps, a tool supplies pressure and ultrasonic energy to the interface and creates a strong gold­ gold metallurgical bond. In 1985, the VA X 6000 Model 400 processor wa� identified a-; the first product that would incorpo­ rate TAB technology. Consequently, the TAB pro­ gram's goals were set to meet the product's needs:

• CMOS-2 (I.5-mic ron technology) compatibility

• 150-micron pad pitch

• 224 pins

• 35-mm format TA B tape

• Assembly in ceramic package

The program was designated PTAB 1, the fi rst implementation of TAB with chip pads in a peri­ meter or single row format. The PTAB 1 process was formally qualified in March 1989

Fig ure 2 TAB Bumped and Bonded Device VAX 6000 Model 400 Multilayer Ceramic Pa ckaging As shown in Figure 3, after the semiconductor The custom semiconductor packages implemented wafer has completed the standard diffusion, metal­ for SCIT's ZMOS- and CMOS-1-based systems have lization, and passivation steps, the wafer is metal­ been either cofired multilayer ceramic (MLC) lized again over the entire surface with a sputtered through-hole pin grid array (PGA) packages, or the barrier metallization of titanium/tungsten. In the simpler lead-frame-based cerquad surface-mount same process step, a top seed layer of gold is packages. The maximum pin counts in use up applied . The total thickness of these films i' s 10,000 through CMOS- I are 132 pins for PGA style packages angstroms. The barrier prevents diffusion between and 164 pins for fine-pitch cerquads. the aluminum pad and the gold bump. Such diffu­ sion would, with time and temperature, seriously degrade the mechanical strength of the interface. The seed layer forms the ba'\eupon which the bump will be plated during a subsequent step. Metallization is fol lowed by the deposition of a thick layer, typically _) 0 microns of photo resist. This layer is then photolithographically patterned with a bump mask. Subsequent etching opens a hole over the bond pads down to the barrier and seed layer. At this point, the wafer is electroplated with pure gold. The gold is deposited only in the resist openings. Finally, the resist is stripped and a series of etches are performed to remove the seed and barrier metallizations from the areas between the bumps. After the wafer processing steps are complctccl, the devices may be electrically tested by probing the bumps. Assembly of the functional die consists of sawing the wafer, selecting the good devices, and bonding the bumped chips to the TA B tape. The bonding process is very similar to wire bonding. Figure 3 Semiconductor Wafe r

56 Vo l. 2 No. 2, Sp ring I'J'JO Digital Tecbnicaljounwl VAX 6000 Model 400 Physical Te chnology

The VA X 6000 Model 400 P, F, VC, XDP, and XCA chips require a packaging solution consistent with their electrical and thermal performance needs as well as compatibility with a lead count of 224 pins. The solution chosen for these components was a blend of the PGA and cerquad technologies: a multi­ layer ceramic body with 25-mil pitch perimeter leads for surface mounting. The resulting 224-pin MLC is depicted in Figure 4. KEY: There are several significant features in the 224 MPl - SEAL RING AND LOGO MP2 - VSS EXTERNAL (GROUND) MLC family of packages. In wire-bond packages, MP3 - SIGNAL LAYER AND BOND SHELF MP4 - VDD EXTERNAL (POWER) maximum wire span constraints require a different MPS - SPACER LAYER layout for each chip configuration. The TAB format, MP6 - VSS INTERNAL (GROUND), DIE ATTACH AREA MP7 - VDD INTERNAL (POWER) however, has more routing flexibility and permits MPB - CHIP CAPACITOR SOLDER LANDS. LEAD one basic package layout. The package has five BRAZE PADS. HEAT SINK ATTACH AREA internal interconnect layers with assignments, as shown in Figure 5. In cofired MLC technology, the conductor traces Figure 5 Pa ckage Itselfwith Internal are made with screen-printed, tungsten-filled vias. Interconnect Layers with To specialize the package interconnect for each Assign ment chip, a programmable approach was developed in which only one via layer would have to be changed for each package design. The combination of TAB shown in Figure 6. The thermal performance of the interconnect and the programmable package con­ package is shown in Figure 7. cept considerably reduces design time, design com­ The 224 MLC package is assembled by mounting plexity, tooling costs, and lead times. the TAB chip into the package cavity with silver­ Another feature of the 224 MLC is the provision ftlled die attach epoxy. Outer lead bonding of the for eight chip capacitors for power decoupling. The TAB film to the gold-plated package pads is accom­ heat-sink design fo r the 224-pin MLC package is plished with the same pressure or ultrasonic pro­ cess used for inner lead bonding. Final steps consist of lid seal, lead plating, chip cap and heat-sink attach, and lead trim. The part is then ready for electrical testing.

Module Te chnology

Printed Wiring Board Te chnology The VA X 6000 Model 400 processor's printed wir­ ing board (PWB) requirements offered a significant challenge to hoard fabricators. The primary differ­ ence between the Model 400 processor's boards and previous boards is 10-mil finished vias, resulting in a 7 to 1 aspect ratio. We were initially very con­ cerned whether our vendors could produce boards

Figure 4 The 224-pin MLC Figure 6 ML C Pa ckage Heat Sink

Digital Te chnical]ounral Vo l. 2 No. 2, Sp ring 199() 57 VAX6000 Model 400 System

20 an average glass transltton temperature (tg) of 120 degrees Celsius, was suspected to be incapable of surviving the assembly process. The alternative to FR4 was polyamide, which has a tg of 240 degrees Celsius. In response to the con­ >=' 15 1- cern over FR4, a PWB material selection task force w � u was convened, which was composed of board za: <( W 1- (L experts from throughout the company. '!2m (f) :::J The task force discovered that board material w- a:rn 10 _j actually is not the primary consideration. Either FR4 _jw <( U or polyamide is acceptable (both were eventually ::::; (/) a:w w w used in production). However, other board parame­ I a: 1- (.9 ters become critical when FR4 is used. Primarily, w 9. minimum barrel copper plating thickness should be 5 one milfor FR4 boards. Variation in barrel copper should not exceed 50 percent. In addition, there can be no smearing or

Ta ble 4 VAX 6000 Model 400 Printed Wiring 0 200 400 600 800 1000 Board Specifications AIR VELOCITY (LINEAR FEET PER MINUTE) Parameter Va lue

General:

Figure 7 Th ermal Performance of Pa ckage Board size 9.2 inch x 11.0 inch Layer count 10 layers: 4 signal, 4 power/ground that met our specification in the volumes we Material options FR4, polyamide needed. Ta ble 4 lists some of the primary features Copper foil type Class Ill of the Model 400 PWB specification. Te sting 100% data-driven The most severe environment the boards would see was the assembly plant. Because the module Electrical: used a variety of component types, it was put Characteristic 50 ohms ± 10% for 10-mil lines through several process steps, each step requiring impedance general or localized heating to solder reflow tem­ DC resistance 4 ohms maximum peratures. We wanted to ensure that the boards Physical: would survive multiple assembly and repair cycles and still be reliable. We established a "fit for use" Via type Through plan that required the boards to undergo a series of Via size 10 mil finished 13.5 mil drilled thermal cycles. In these cycles, temperatures and Maximum 7:1 times were set to match intended assembly process aspect ratio steps. The boards were then cross-sectioned and SMT pads 50-mil pitch: 0.030 inch x examined for defects. Once manufacturing began 0.076 inch building modules, a fe w of these were also cross­ ± 0.001 inch sectioned. With this strategy, we could quickly 25-mil pitch: 0.016 inch x determine the quality of each lot of boards. More­ 0.076 inch over, we could begin to correlate board structure ± 0.001 inch with board quality in each lot. This process allowed Etch widths Signal: 5 mils Clock: 10 mils us to assess each vendor's capability to provide boards that would meet our specifications. Proximity to 5 mils next feature Because module assembly process temperatures Solder 0.15 mil minimum typically exceed 200 degrees Celsius, questions requirement arose over what was the appropriate board mate­ Tin/lead alloy 63/37 ± 10% rial. The more commonly used FR4 material, with

58 Vo l. 2 No . 2, Spri ng [<)!)() Digital Tecbnicaljournal VAX 6000 Mo del400 Physical Te chnology

other defects in the vias from the drilling process. cross-sectioning. This testing process uncovered To maintain tight control over these parameters, we inadequate or nonuniform plating, pad-to-copper­ included statistical process control as a vendor plating separations and misregistration. The infor­ requirement. Ta ble 5 summarizes the results of the mation gained from this procedure was given to the task force findings. vendor as a basis for corrective action on future Because of recent board quality problems and the board production. aggressive nature of the Model 400 specification, Assembly verification was final proof that the we ran a product-specific board qualification. The board would make a reliable module. Each board goal was to verify that each vendor could consis­ went through the full assembly and test process, tently produce the Model 400 boards in volume including burn-in, to ensure it could survive the before the vendor was placed on the qualified process and pass all functional tests. vendor list. The four key components of the plan were as follows: Surface-mount Assembly Te chnology • Incoming inspection The surface-mount assembly technology (SMT3) • Electrical test used for the VAX 6000 Model 400 processor is the latest in a series of electronic assembly technologies • Cross-section analysis developed by Digital since 1985. The SMT3 tech­ • Assembly verification nology allows double-sided mounting of high lead­ count and fine-pitch devices on a printed wiring Incoming inspection testing was nondestructive board, with surface-mounted passives and mixed and covered plating thickness, plating composition, component styles. The VAX 6000 Model 400 pro­ and characteristic impedance. cessor uses essentially all of the SMT3 features. an internally developed tester, we could Using High pin count and fine-pitch devices presented verify 100 percent connectivity between pads and, new problems to the surface-mount attach process thus, detect both shorts and opens. We also used team. The small, tightly spaced leads require a this test to verify prototype boards, ensuring that smaller pad and less solder than their 50-mil pitch the boards were good before valuable prototype predecessors. The smaller, more fragile pins can parts were committed to them. become misaligned and no longer coplanar where Digital's Component Evaluation Laboratory they meet the board surface. performed bare board and assembled module Our primary goal was to find the correct pad size and solder volume for attaching 25-mil pitch com­ Table 5 VA X 6000 Model 400 PWB Materials ponents. Once the fine-pitch pad size and solder Task Force Summary volume were determined, it became apparent that Laminate: the correct solder amount could not be delivered to Polyamide should be used for prototypes. both ftne-pitch and standard pads in a single oper­ Polyamide and FR4 are acceptable for volume. ation using the standard solder-paste screen Long-term, vendor capability and cost may favor approach. The smaller, fine-pitch screen openings polyamide. could not consistently pass the correct amount of solder. The solution was to use a laminated stepped Copper: stencil that places a thinner solder deposit on the Class 3 foil should be used . fine-pitch pads than on the rest of the pads. For polyamide, standard plated-through-hole (PTH) The initial approach fo r attaching fine-pitch copper is acceptable. devices was to use the existing vapor-phase mass­ For FR4, minimum barrel plating thickness is 1 mil. reflow process. However, if pin noncoplanarity Barrel plating thickness variation maximum is 50 exceeded 2 mils, some pins would not be soldered percent. Hole quality must be good . adequately. The best that could be guaranteed was 4 mils. Process Control: The development team turned to solder-in-place, Ve ndor should provide test coupon cross-sections with each lot. which uses a thermode fi xture to place the compo­ nent, push the pins into the paste, and reflow the Vendor should institute statistical process control. solder, forming a good solder connection for each Vendor site should be monitored regularly. pin. An advantage of this process is that it does not

Digital Te chnicaljo urnal Vo l. 2 No. 2, Sp ring 1990 59 VAX6000 Model 400 System

hear the entire board, and the pn.:viously attached Module Design components, ro solder-reflow temperatures. Several board design requirements combined to The surface-mount module process was by make the design task challenging. These require­ necessity developed concurrently with the Model ments included signal integrity constraints, finer 400 product design. As the design progressed, a layout and routing grids, and short dispersion etch. problem with the primary assembly equipment Figures 8 and 9 show the complete module. Most developed. Suppliers for high-volume thermode of the module area is composed of the six large pick, place, and solder equipment did nor keep pace 224-pin devices that form the processor core and with our schedule requirements. We were then interface to the XMI corner. These devices had faced with the choice to develop equipment inter­ critical placement requirements. Their 25-mil pin nally or switch the process to a more developed pitch forced very dense etch runs. technology. To attain maximum use of available routing area, The Midrange System Manufacturing Group's the design used tenth-mil grid instead of the previ­ most readily available backup process was vapor­ ously used one-mil grid. The new grid allowed opti­ phase mass-reflow. This pron:ss guards against mum etch channel placement. Similarly, vias were coplanarity problems by including careful inspec­ placed on a 25-mil grid with 50-mil spacing. This tion of all fine-pitch components before commit­ allowed the designer more flexibilityin placing vias ring them to a board. The pins that do nor solder so the available space was used most effectively. are manually repaired. Since it was nor possible to The VAX 6000 Model 400 processor introduced develop solder-in-place equipment internally in new routing parameters and stringent signal time to meet our schedule, the program decided to integrity constraints. The density of the design use vapor-phase mass-ref!ow. required 10-mil vias. Signal integrity considerations

Figure 8 Processor Module -Side One

60 Vo l 2 No. 2, Sp ring 1990 Digital Tecbnicaljournal VA X 6000 Mode/ 400 Physical Te chnolog y

imposed component spacing and electrical connec­ less than a specified maximum to meet perfor­ tion requirements that resulted in components mance specifications. To avoid skew problems, being closer than suggested by current standards. clocks were routed equal lengths to within a tight Surface etch was needed to keep connections short. tolerance. Table 6 details the resulting design Components on side two were placed underneath parameters. fine-pitch devices on side one for electrical prox­ imity. New manufacturability rules were generated Tes t Technology tO cover these situations as the design progressed. Because surface layers are not used to route sig­ nals, these layers were not designed to have good Assembly and Te st ProcessDevelo pment signal integrity characteristics. However, a signal Assembly and test process issues were tracked has to travel a short distance from its pin p�1d to its throughout the development and selection of the dispersion via which connects to an inner routing Model 400 physical technology. The manufacturing layer. Therefore, it is essential to keep surface etch impact of each physical technology choice was as short as possible to minimize the distance the sig­ quantified in a spreadsheet analysis of cost and qual­ nal travels outside a controlled impedance environ­ ity metrics. ment. To meet this distance requirement, each These metrics were estimated by using integrated critical component was placed very precisely and assembly and test process models for each possible its dispersion pattern was individually designed. physical technology implementation. Two issues Signal integrity considerations placed other con­ became evident during the physical technology straints on the design. Critical signals had to be kept selection process.

Figure 9 Processor Mo dule -Side Two

Digital Te chnicaljournal Vn /. 2 No. 2, Sp ring 1')90 61 VAX 6000 Model 400 System

Table 6 VAX 6000 Model 400 Processor the internal module and verifies correct contact by Board Design Statistics reading current flow. CTS 1 Parameter Va lue The is shown in Figure 10. Pins through N represent aU signal pins on the chip. The design uses General: minimum-sized transistors. The CTS design does Board size (inches) 9.2 X 11.0 not require any dedicated pins because the test pin Board thickness (inches) .093 is a normal device signal pin. There is no perfor­ Layers 10 mance penalty because the transistors are placed in Initial route area (square inches) 99 paraUel to the normal system logic, which results in Routing vias 1699 a negligible load on those signals. The use of the CTS in the processor module Dispersion vias 3813 manufacturing process has been very successful. Total components 623 The ICT very quickly isolates open connections to To tal component pins 4976 the device pin and differentiates them from device Total used component pins 4218 test pattern fa ilures. This process allows the open Total networks 1005 connection to be repaired rather than replacing To tal etch length (inches) 5002 the device. Further, CTS testing allows prototype modules to be fu lly tested for assembly defects, even if the VLSI First, the high number of signals and the fine in-circuit test patterns are not available. This pitch of those signals in all of the possible product advantage is possible because designers can fully CTS implementations significantly increased the risk develop tests without any knowledge of the VLSI of manufacturing defects, such as shorts between device internal structure or function. signals or open faults along a signal. A continuity transistor structure (CTS) is designed into each of Observe BoundaryScan the VLSI (very large scale integration) devices to Modules that pass ICT testing are then tested at a help test and diagnose these open faults. "system like" test station. Self-tests and boatable Second, high-speed operation and reduced phys­ diagnostics are run, and the VMS system is booted. If ical access would make diagnosis of processor fail­ a module fails any of these tests, skilled technicians ures difficult. To alleviate this problem, test features diagnose the failures by attaching logic analyzer and a test system were developed. The test feature probes to the mo dule. Because of the fine-pitch was a fo rm of boundary scan, called observe surface-mount devices and the high-speed opera­ boundary scan (OBS). The test system, the VAX 6000 tion, it is very difficult to attach logic analyzer Model 400 scan monitor, utilized the OBS in a probes to many nodes on the processor module. system test environment. OBS allows the CPU module to be observed as it executes VA X macrocode on board self- test stimu­ Continuity TransistorStructure lus at the module's full clock rate. This additional Because the devices are surface-mounted to the observation is used by the scan monitor to help test module, a large percentage of the manufacturing and diagnose module-level faults in a system test defects were expected to be open faults between environmentin stage-one manufacturing. the module and the chips. Typically, these open fa ults are difficult to detect and diagnose because they usuaUy require the development of a set of TEST PIN (ALSO A NORMAL complex test vectors that will be applied to the DEVICE SIGNAL PIN) chip. To simplify the test for open faults, a continu­ ity transistor structure (CTS) is designed into each N-CHANNEL MOSFET chip. The CTS tests for open faults by using simple instruments such as voltage sources and current ::: N-CHANNEL MOSFET meters on an in-circuit tester (ICT). GND The Module 400 CPU module is placed on a bed­ ··· r3PIN N of-nails fe<.ture chat gives the tester electrical access to at least one point on each internal module signal network. The tester then applies digital stimulus to Figure 10 Continuity Transistor Structure

62 Vo l. 2 No. 2, Sp ring J<)

Designing OBS into a custom VLSI device is more • Achievement of a technology set capable of complex than adding CTS, but is still relatively sim­ 28-ns clock cycles through the use of full elec­ ple. OBS is simply a parallel-load, serial-shift register trical simulation at the device, package, and with one bit of the register on each device pin. module levels Although it is not negligible, the OBS uses a rela­ • Development and implementation in manufac­ tively small amount of silicon area and also does not turing of the SMT3 module assembly technology, affect product performance. The total area used by which allows double-sided mounting, high both the CTS and OBS was estimated at about one to lead-count fi ne-pitch surface mounting, surface­ two percent of the chip area. Unlike CTS, which mounted passive components, and mixed com­ does not require any dedicated device pins, OBS ponent types uses two dedicated pins on each device in which it is implemented. To fully utilize the OBS test fea­ • Introduction of innovative testability features, ture, the VAX 6000 Model 400 scan monitor was including the continuity transistor structure for designed and built. assembly verification and observe boundary scan The scan monitor controls and reads the OBS on for diagnosis in engineering debug and module the Model 400 module. A host computer system manufacturing interfaces with the monitor. The scan moni.tor As intended at the outset of the project, these control program (SMCP) operates the scan moni.tor, technologies will be employed on a significant num­ makes pass/fail decisions on the data received, ber of follow-on midrange and low-end products. and diagnoses failures. SMCP also includes many features that allow it to perform as a virtual logic Acknowledgments analyzer, including waveform displays that high­ A large number of people contributed to the success light fa ulty behavior, as we.ll as full triggering of the VAX 6000 Model 400 physical technology functionality. program. It is not possible to name all these impor­ Conclusion tant contributors in this space. Therefore, the authors want to acknowledge that the VAX 6000 The aggressive performance goals and advanced Model 400 project could not have been undertaken semiconductor technology used for the VAX 6000 nor have succeeded without significant efforts by Model 400 processor meant a signi.ficant develop­ members of the SCIT Physical Technology Group, ment effort for packaging and interconnect techno­ Semiconductor Assembly Group, Midrange Systems logy. The technology requirements included high Engineering Group, Corporate PWB Group, Mid­ lead count, electrically tailored single-chip pack­ range Systems Integration Group, and the SCIT ages, very dense controlled impedance printed Semiconductor Engineering Group. wiring boards, a state-of-the-art surface-mount assembly process, and advanced test features. The physical technology achievements in the VAX 6000 Model 400 project represent an effort in the packaging and interconnect disciplines more akin to mainframe and supercomputer develop­ ments than to traditional microprocessor-based system approaches. The accomplishments of the efforts include:

• Development and implementation of an advanced TAB technology for the high lead­ count custom chips

• Design of an innovative semi customized ceramic single-chip package that combines the best fea­ tures of surface-mount devices and traditional pin grid arrays

• Development, sourcing, and qualification of very dense printed wiring boards with multiple controlled impedances

Digital Te cbnicaljournal Vo l. 2 No . 2, Sp ring 1990 63 Richard E. Calcagni Will Sherwood I

VAX 6000 Mo del 400 CP U Chip SetFu nctional Design Ve rification

Tbe VAX 6000 Model 400 sys tem is Digital'sfi rst VLSI CPU to employ a fu l�y micro­ pipelined architecture. The CPU chip set fo r this systemposed verification challenges Ja r beyond those of previous des ig ns. The major problem was the large number of complex control sequences and combinations that could exhibit design errors. A single verification strategy wouldnot suff'icient(vhandle thiscomplexi�y. Therefo re, t:et�(ication engineers developed a multipronged app roach fo r simulation modeling and fu nctional clesign verij'ication. They also employed CPU diagnostic programs, hand-generated tests, and directed pseudo-random techniques to verify that the design conformed to the VAX architecture. These techniques helped them fi nd bugs prior to committing the de sign to masks. As a result, the first-pass versions of the CPU chip set successfu lly booted em operating system. Simulation also minimized chip rework and delays in bringing the product to market.

TheDesign Ve rification Project neers used the Jist to create tests fo r the functional The VAX 6000 Model 400 chip set verification pro­ units of each chip. These tests were implemented in ject had two goals: find implementation bugs in the either microcode, macrocode, pin-stimulus, or design and verify that the design performed as a VAX some combination of the three. Tes ts were imple­ system. The design verification tasks involved about memnl in the priority determined by the design 25 person-years of effort in the areas of system ream's identification of the most complex areas of microcode, custom VLSI (very large scale integr::t­ the design and those most susceptible to bugs. tion) chips, and the VAX 6000 Model 400 sc::tlar pro­ cessor module. Simulation Models The chip set verification team coordinated a set Functional design verification using software simu­ of simulation models origin::tlly written by the chip lation is inherenrly slow in a design as large and logic designers. At various stages of the project, complex as a VAX CPU. 1 To use resources most models were available at the gate/transistor, behav­ efficiently, the verification team specified and coor­ ioral, and architectural performance levels. The dinated a project-wide modeling methodology that verification team used these models to run a wide incorporated a number of different modeling levels, assortment of both basic and sophisticated tests. trading off detail versus other factOrs such as speed. The use of simulation models is described in more These trade-offs allowed us to match the testing of detail in the next section. each phase of the design to a model that met its The CPU design was partitioned into functional units, and one or more units were assigned to each specific needs and characteristics. Thus, simula­ member of the verification team. A list of specific tions were only as detailed as necessary for a parti­ tests or testing activities was produced for each sec­ cular test situation, and the overall efficiency of the tion of the CPU chips' specifications. This list was verification effort was increased. Most levels of ;wgmented by project-wide brainstorming sessions. modeling could contain different levels of descrip­ These sessions were used to analyze obscure or sub­ tion detail for diffe rent areas of the design, which tle combinations of events in the design . Often, the further optimized simulation performance. thinking process would identify that a bug existed There were three major phases of the design: before any testing had been done. Ve rification engi- architecture, detailed block diagram, and logic/

64 Vo l. 2 No. 2, Sp n·ng 1')<)0 Digital Te chnicaljo urnal VAX 6000 Mode/ 400 CPU Chip SetFu nctional Design Verification

transistor schematic. To match these design levels, Behavioral Model architectural, behavioral, and structural models The behavioral level model describes each chip sec­ were written. Each model section was written by its tion's logic in detail. It is written in the DECSIM 'u 7 designer and then integrated into a whole-chip behavioral mod eling Janguage. ' DECSIM pro­ model, a system model, or both models for chip vides a high-level computer-hardware description and system testing. In addition the design method language that executes procedurally rather than ensured that the overall organization of each using event-driven algorithms. Models written in modeling level was fa ithful to the hardware design such a language generally simplify both model and level it represented. design debugging. The modeling methodology req uires that, in the Architectural Mo del behavioral model, every signal in the design be The architectural-level model is the highest level of explicitly modeled, with its timing accurate to the modeling for this project.!.I 1 This model describes clock-phase boundary. This methodology maxi­ only the control algorithms and abstract data paths mizes the probability that timing problems will be found at this level of simulation . (Note: No addi­ in the microarchitecture. The VAX 6000 Model 400 tional logic timing verification was done on this CPU architects wrote this model in PASCAL for exe­ project.) Each designer writes the section's model in cution performance reasons. The PASCAL program parallel with writing the chip's specification. The avoids most of the simulation-oriented overhead writing must be done before detailed schematics are because it is a standalone program. started. This method ensures that the model accu­ In the architectural model, actual microcode is rately represents the real hardware behavior. used. However, because much of the detail of the The model executes hierarchically. The system microarchitecture is abstracted in this model, clocks are advanced at the beginning of each phase, cmtches (additional simulation aids) are required then the top-level routines for each section are to execute the microcode flows. Model simulations called. These routines, in turn, call the routines for are driven from instruction traces. Opcode, each subsection. The subsection routines do the operand, and address information are extracted actual work. from user programs and system software running The DECSIM behavioral model is the basis for on actual VAX systems. Special fields in each many model variations. These variations range from microword make use of the information from these the simplest single-CPU mod ule ro elaborate multi­ traces and direct the flow of microcode execution processor versions that include peripherals and I/O accordingly. These special fields are on ly used for adapters. simu lation and are not included in the actual The basic version of the behavioral model is non­ microcode implemented in hardware. The use of ported; i.e. , the model is implemented as a single. trace data extracted from real VAX systems permits self-contained hierarchy of procedures that does actual machine loads to be reproduced, and the not use or connect to any other models. The non­ architect can evaluate implementation trade-offs ported model contains detailed descriptions of the from these reproductions. three custom VLSI chips: CPU processor, secondary Signals in the architectural model are correct to cache controller, and floating point accelerator. within a machine cycle, which allows execution This model also includes a representation for times to be accurately measured. Microarchitec­ back'Up cache RAMs connectedto the DAL (data and tural parameters, such as cache or translation buffe r address lines). Also modeled are a system support size, can be easily adjusted to analyze their effects chip and a simple memory that can return data as on system performance. More complex design fea­ fast as the protocol allows. tures, such as bus protocols and pipeline control This nonported model was used for extensive algorithms, can also be modified relatively easily. testing of the microcode, microarchitecture, and The model describes most of the hardware sec­ logic design of the VLSI chips and their interfaces. tions that will be in the fi nal design. Therefore, it A majority of the verification tests did not need serves as a prototype system debugging tool to detailed memory timing or I/0. Te sts could be run predict and tune system performance. The model faster on the nonported model than on one that also uncovers design flaws before implementation simulated the actual memory access delays. These begins. About a hundred bugs were found at this delays would have increased the testing time with­ preliminary stage by using the architectural model. out adding value to the testing process.

Digital Te cbn ica/]ournal Vol. 2 No. 2, Sp ring /')')() 65 VAX6000 Model 400 System

Another version of the behavioral model is behavior, which in turn allows chip- and module­ ported. This model was constructed to test the level interactions to be tested. DAL and the interaction of the core chips with the Table I lists more usages of the ported core chip system support chip (RSSC) and bus interface stan­ set behavior model as compared to the nonported H dard cell (REXMI) chips. Ported models have ports architectural performance model. at the boundaries of procedural model compo­ nents. Ports are used where chip pins or similar Structural Models boundaries appear in the design. Ported model The structural models were derived automatically components are integrated for simulation through a from the designers' transistor-level schematics. The structural wire list derived from module (printed wirelists, or network descriptions, were translated wiring board) schematics. The main ported model for the two simulator systems: DECSIM MOS and 9 10 11 consists of the core CPU chips with a ported repre­ the ZYCAD simulation engine ' DECSIM MOS is a sentation of the DAL. This model connects to ported transistor-level simulator based on RSIM and ESIM models of the other chips on the VAX 6000 Model that models R-C delays, undefined-state initiali­ 12 1 400 module, which in turn connect to memory and zation, and charge sharing. ' u� The hardware­ IIO models. Several different combinations of these accelerated ZYCAD system abstracts transistors into ported models were used for various specific a three-state, gate-level model. The DECSIM MOS verification test applications as shown in Ta ble I. model was used for standalone chip sections and The combined ported models run several times whole chip simulations to find initialization and slower than the less complex nonported model. charge-sharing bugs. Both DECSIM MOS and ZYCAD The slower time is offset by increased testing granu­ were used to fi nd logic and schematic bugs. Both larity. The ported models also allow asynchronous systems used pin-level pattern stimulus that was

Table 1 Combinations of Behavioral Model Configurations

Performance Number of Module Number and Level (Microcycles/ Ported Modules Abstraction of Memory Peripherals Applications Second)

No Architectural level; 1 Abstract memory None Architecture 600 per-cycle detail module debugging and performance tuning No Behavioral level for 1 Abstract memory None CPU verification 7 support chips; per- module and generating phase detail chip test patterns Yes Behavioral level for 1 Detailed memory None Self-test code 2.5 support chips module debugging and bus interface verification

Yes 2 Behavioral level for 2 Detailed memory None Multiprocessor support chips modules verification Yes Behavioral support 1 Detailed memory 1 RL02 disk Booting VMS 2 chips module with gate- (high level) system level bus interface Yes Gate level for 1 Detailed memory None Bus interface bus interface chips module with gate- verification and level bus interface generating chip test patterns

Yes Gate level for 1 Detailed memory None Module bus interface chips modu le with gate- verification level bus interface Yes 2 Gate level for 2 Detailed memory Bus adapter System 0.5 bus interface chips modules with gate- verification in level bus interface multiprocessor mode

66 Vu l. 2 No . 2, Sp n·ng 1990 Digital Te cbnical]oumal VAX 6000 Mo de/ 400 CPU Chip Set Functional Design Ve rification

generated from the nonported behavioral model. ExistingDesign Ve rification Te sts Signals traced in the behavioral model matched The VAX architecture has undergone a number of the boundary of the section of logic or chip being implementations since the first VAX- 11/780 system simulated at the gate level. Test results were com­ was designed in the mid- 1970s. Over the years, a pared on a cycle-by-cycle basis. The tests uncovered substantial body of knowledge regarding the key many bugs in the logic design implementation. areas and problems associated with designing a VAX processor has been accumulated from various VAX Gate-level Fa ult Simulation implementations. We put these past lessons to use In addition to ZYCAD true-value simulation, single in the VA X 6000 Model 400 verification effort. We stuck-at fault simulation was done. Fault simulation actively sought out bug lists, test plans, and actual measured verification and manufacturing test test code used by previous VAX system design coverage, and provided guidance for verification teams. One key example of this is HCORE, a self­ engineers to enhance tests. The fa ult simulation checking VAX macrocoded diagnostic program. effort for the CPU processor chip alone was almost HCORE specifically fo cuses on the high-risk areas six months long. As a result of this effort, five new that are common across VAX designs. The HCORE tests were written, and manufacturing fa ult test program was originally developed from coverage was subsequently increased from 83 per­ another basic field-diagnostic program. It was later cent to 94 percent. modified many times, throughout several projects, to focus on testing potential high-risk instructions Ve rificationStrate gies and functional areas that had been identified in past CPU chip set verification engineers had two designs. Existing design verification tests (DVT) explicitly stated and somewhat overlapping goals. such as this are almost always VAX macrocoded We had to prove that the hardware design intent tests. Macrocoded tests transport more easily across adhered to the VAX architecture standard in every implementations than microcoded tests because respect, and that the logic implementation adhered the microword formats are usually different from to the intent. '5 We strongly believed that any bugs implementation to implementation. in prototype hardware (first-pass silicon for the We derived three benefits from using existing custom VLSI chips) would negatively impact our DVTs. They provided a strong level of confidence ability to meet time-to-market for the product. Bugs in the basic functional operation of the design. Sec­ found at a later stage of the design process are ond, they found any functional bugs that might be more expensive to fix for custom VLSI chips. It is hiding in obscure or seldom used areas of the VAX expensive because we are severely restricted in our instruction set. Third, when used with demons ability to isolate and work around bugs in the hard­ (explained later), they were useful in finding bugs in ware. Therefore, for custom VLSI chips, verification very implementation-specific areas, such as error explicitly meant proving the design and finding the recovery logic. bugs in simulation. No single verification strategy or technique can find all of the bugs in something as Custom Design Ve rification Te sts complex as a VA X CPU. Therefore, a breadth of veri­ Although all VAX CPU designs implement the same fication strategies were flexibly applied. architecture and run the same software, the hard­ In addition to technical strategies, the verifica­ ware and firmware implementation details of each tion team cultivated a "bugs are good " philosophy are unique. Therefore, generic VAX diagnostic tests throughout the project. 16 Past experience has did not necessarily cover the specific critical paths shown us that bugs will always creep into the and functions in the VAX 6000 Model 400 design. design of something as complex as a VAX CPU. Existing DVTs often could not provide a clear Instead of being viewed as mistakes or fa ilures, bugs picture of what had and had not been covered . To were celebrated because a bug fo und in simulation solve these problems, we used custom DVTs to test was a bug that didn't make it into prototype hard­ specific, obscure, and hard-to-get-at areas of the ware. This subtle shift in how the finding of a bug design. There were several techniques for imple­ was regarded had, we believe, a strong motivational menting custom DVTs: impact on members of the design and verification • Handwritten macrocode teams and increased the probability of finding bugs during verification. • Handwritten microcode

Digital Te clmtcaljournal Vo l. 2 No. 2, Sp ring 1990 67 VAX 6000 Model 400 System

• Manipulation of pins and imernal signals under testing implies simulating many cycles and trad­ simulation comrol ing off test efficiency to address the problem of unanticipated bugs. These techniques could be used individually and It is absolutely necessary to automate the pseudo­ in combination, within a single custom DVT. For random test process, both test generation and test example, although custom DVTs for the instruction scoring, as much as possible since pseudo-random fe tch and parse logic (!-box) were written primarily tests are much longer than focused tests. in microcode, a custom macrocoded instruction A powerful tool already available for pseudo­ stream was written to give the 1-box something to random testing VAX designs is the VAX architectural parse. Explicit manipulation of pins in simulation exerciser tool suite (AXE and MAX).17 Originally was used to generate asynchronous events, such as intended as hardware prototype verification tools, interrupts, when necessary. AXE and MAX have proven to be even more effec­ Custom DVTs provided confidence in areas of the tive as design verification tools in a simulation design that could not easily be tested with existing environment. They provide a virtually inex­ DVTs. We made the tests as focused and efficient as haustible source of unique, interesting macrocode possible. However, in doing so, the generation of test cases, and require a minimum of intervention such tests required large amounts of development and effort by the user. time and people resources. Although these tests Although AXE and MAX provide some control uncovered several bugs in all areas of the design, we over test case parameters, they still aspire to be gen­ now believe that many of these same bugs could eral, architecturaUy focused exercisers. We also have been found with less labor-intensive methods, wanted pseudo-random test case generation that such as pseudo-random tests. At the time, the pri­ could be targeted at specific, risky areas of the mary advantage of custom DVTs was the clear implememation. These areas were the most likely indication they gave that specific fu nctional areas locations for unanticipated bugs. Custom pseudo­ of the design had been tested and were working random exercisers were developed fo r these areas. as specified. These exercisers provided very detailed control over test case parameters, yet retained many of the Pseudo-random Design Ve rification Te sts fe atures and advamages of AXE and MAX. Each new VAX CPU design aspires to improve on A powerful techni�ue for pseudo-random test is ' the price or performance of the previous design. the use of demons. ' A demon is any automated Improvemems are sought by pushing the limits of imervention of a simulation model's normal execu­ available technology to package hardware into tion behavior. For example, a bus demon can imer­ smaller and, if possible, less expensive spaces. At ject one or more bus commands, error conditions, the same time, a decrease in the cycle time or an or interrupts at random intervals in order to aggra­ increase in the work done per cycle in the func­ vate normal system operation. By doing this, a tional design is also sought. In particular, this last dense environment of unusual or uncommon event item has substantially increased design complexity combinations can be created to stimulate the design by imroducing techniques such as pipelining and with worst-case situations. Demons typically special-case hardware. As a result of this complex­ slowed model execution by a fa ctor of ten, but they ity, we often encoumer very obscure bugs when often found bugs that had not been considered by debugging new VAX implememations. These bugs the designers or architects. Without demons, it involve unamicipated interactions in the logic, could have taken many months of field testing to between seemingly unrelated functional areas, and fi nd and characterize these bugs, if they would have interactions dependem on intricate combinations been found at all. and sequences of evems. We were concerned about Pseudo-random testing with A.,'\E, MAX, custom these types of bugs because it is extremely difficult exercisers, and demons was used throughout the to write tests for unanticipated problems. The development cycle of the CPU chip set. All four method we chose to address these problems was uncovered obscure interaction bugs, as expected. pseudo-random testing. Unfortunately, they did not find them all. Some The intent of pseudo-random testing is to exer­ unanticipated bugs slipped through verification and cise the design in ways that are likely to find bugs into the first-pass silicon stage. These bugs were without necessarily knowing in advance what those eventually found after running many more cycles in bugs are or where they might be. Pseudo-random real prototype hardware. The failure to find all

68 Vo l. 2 No. 2, Sp ring 1990 Digital Te cbnicaljournal VAX 6000 Model 400 CPU Chip Set FunctionalDesign Verification

unanticipated bugs in the simulation stage illus­ ficially processes model requests fo r block data trates a fu ndamental problem in the use of pseudo­ transfers and performs rhe transfers instantly. This random testing. The effectiveness of testing is techniqueelimin ated wasting simulated CPU cycles closely coupled to the number of cycles run, and while waiting for transfers to complete. simulation speed severely restricts this number. With model and code optimizations in place, Our application of pseudo-random testing to the booting the VMS system to the point of printing the problem of unanticipated bugs was largely success­ VMS banner and starting the process scheduler was fu l fo r this project. However, we learned that we actually achieved in simulation. It took approxi­ must do more in the future to increase the mately seven CPU days on a host V�'< 8800 system. efficiency and scope of these tests. To provide this One model bug and one real design bug were fo und increase, we are looking toward more directed during this effort. Booting the VMS system in simu­ pseudo-random testing. lation required a large amount of verification resources. The process was worthwhile in terms of Booting the VMS Op erating Sy stem the bugs that were found and the confidence it gave us that we could boot VMS on first-pass hardware. A major milestone in the development of any new VAX CPU is booting the VMS operating system on Analysis of the Functional Bugs Found prototype hardware. Not only does this demon­ strate significant functionality in the design, but it To demonstrate where the verification project was also provides a platform from which further testing successful and where it needed improvement, we can proceed. As previously stated, increased com­ discuss here the bugs that were fou nd both before plexity in these designs can produce very subtle and after the chip set first-pass design was commit­ bugs. Often, such bugs do not even appear until the ted to masks. (Note: The milestone of this commit­ PG , hardware is run under a heavy system load in a large ment is called for mask data preparation's multiprocessing or I/O-intensive environment. Suc­ pattern generation .) Only the bugs pertaining to cessfully booting VMS on prototype hardware is architecture, microcode, functional design, and necessary before any such system load testing can logic design are discussed. Layout and circuit prob­ begin. The sooner such resting begins, the better the lems, as well as modeling and tool bugs, are outside chance of finding subtle bugs. For this reason, boot­ the scope of this paper. ing the VMS system in simulation was an important During the design period, several hundred bugs, goal of the VAX 6000 Model 400 chip development with a variety of complexity levels, were found in and verification effort. all sections of the design. These bugs were found At first glance, it would seem impossible to boot through the verification techniques described in the VMS system on a simulation model in a reason­ previous sections and are detailed in Ta ble 2. able amount of time. Simulation speeds on the The prototype chips were first tested on a Ta keda fas test of our models were many orders of magni­ 3381 chip tester. This tester allowed prototype tude slower than actual hardware. However, careful chips to be tested in a standalone environment. The analysis of the macrocode modules involved in the chips were then inserted into a prototype CPU boot process revealed that by optimizing or remov­ module, which was part of a custom-designed engi­ ing large, iterative sections of code, we could sub­ neering tester. The prototype module provided a stantially reduce the number of cycles in the boor path without losing significant coverage for veri­ Table 2 Methods Used to Find Design Bugs fication. For example, the primary bootstrap mod­ before First-pass Silicon ule contains code that creates a bit map of all Number of physical memory in the system. The code tests each Verification Process Bugs Found memory location fo r errors. By reducing this code to test fe wer memory locations, the number of Custom DVTs, DVT reviews, 273 microcode assertions cycles executed is vastly reduced. Another key optimization involved speeding up Existing macrocode test programs 171 simulated transfers from clisk. At several points dur­ Pseudo-random macrocode tests 90 ing the boot process, code and data are pulled into Boot VMS operating system on the memory from a mass storage device (usually a disk). behavioral model For simulation, a special "turbo" disk model was To tal of bugs found prior to 535 first-pass si licon written by the verification team. This model arti-

Digital Te cbnicaljournal Vo l. 2 No. 2, Sp ring 1?90 69 VAX6000 Model 400 System

1H true system environment forthe chip set During on the hardware at-speed as opposed to using prototype debugging, 11 bugs were found. The simulation. One of the new features stimulated characteristics of these bugs are shown in Table 3. the conditions for bug 3. Had this lV1AX feature None of these bugs wasa "show-stopper" in been available prior to PG, we would have terms of prototype debugging or field testing. In encountered the bug in simulation within the fact, most of them were so obscure that the proba­ first few tests. Typically, 90 percent of the bugs bility they would appear during normal use was uncovered by using the 1V1AXtool are found with very low. Several generalizations and conclusions the first 1 00 tests generated. can be drawn from the types of bugs found and how • The verification methodology for bug 10 was they were found, particularly whether they could correct, but had not been followed. Appropriate have been found earlier. gate-level comparison testing would have found • Bugs 1, 2, 6, and 9 were simple, but were missed this bug prior to PG. because of overlooked test coverage. The sim­ plest test, if identified and written, would have • Bugs 7 and 8 were extremely obscure and diffi­ found them. We learned that we needed more cult to find. The first symptom of these bugs was discipline and thoroughness in generating lists of a series of unexplained system crashes over a tests as guided by test coverage indicators. period of several weeks. The situation was finally resolved by attaching a logic analyzer to a system • Bugs 4, 5, and 11 were found on the simulator and waiting for days for the right combination after prototypes were available. It is questionable of events to trigger the failure. Unfortunately, it whether these bugs would ever have been is highly unlikely that we would have found noticed, much less isolated, on shipped systems either bug in simulation, even with a broader because the conditions triggering them were scope of directed-random verification, given the obscure. For example, six conditions, including limited amount of simulation that wasdone. The a parity error interrupt, had to happen simul­ architectural, microcode, and pipeline combi­ taneously to reveal bug 4. It was acceptable to nations required to trigger these bugs were just find these bugs after PG because the impact from too complex. such bugs on prototype debugging was minimal.

• Bug 3 was found using prototype hardware Conclusions rather than simulation. A new version of MAX Although there were 11 bugs in the ft rst-pass chip was released during the prototype debugging set, the verification efforts were considered quite time frame. This version had some new test successful. Prototype debugging was never stopped coverage features, and we decided to run them because of a bug, and system field testing was able

Table 3 Bugs Found after Pattern Generation

Bug Number Chip Type Bug Type Complexity How Found

1 CPU Microcode First order Inspection 2 CPU Microcode First order Inspection 3 CPU Logic Sequential MAX on prototype 4 CPU Microcode Second order Pseudo-random on simulator 5 CPU Microcode Second order Pseudo-random on simulator 6 CPU Microcode First order Self-test on prototype 7 CPU Logic/microcode Second order Field test system crash 8 CPU Microcode Second order Field test system crash 9 CPU Logic First order Prototype debugging 10 Floating point Logic Sequential Chip tester comparison accelerator 11 Cache Logic Second order System verification test on simulator controller

70 Vo l. 2 No. 2, Sp ring 1990 Digital Te chnicaljournal VA X 6000 Model 400 CPU Chip SetFu nctional Design Verification to complete on schedule. The modeling and veri­ References fication methodologies contributed to this success. 1. ] . Basmaji et al., "The Role of Computer-aided

• All logic, functional, and microcode bugs could Engineering in the Design of the VAX 6200 be reproduced in the behavioral model. There­ System," Digital Te chnicaljo urnal, vol. 1, no. 7 fore, the bugs found after PG could have been (August 1988): 47-56. found with simulation had the appropriate com­ 2. C. Wiecek, "The Simulation of Processor bination of events been tested. Performance for the VAX 8800 Family," Digital

• Complex pipeline activity, an area of concern Te chnical journal, vol. 1, no. 4 (February from the beginning, was the primary problem 1987): 100-110. area in the design. Compromising design com­ 3. C. Wiecek and S. Steely, "Performance Simu­ plexity to make thorough verification more lation as a Tool in Central Processing Unit achievable should be considered. Design," PerformanceEv aluation Review, vol. 11, no. 1 (August 1979): 41-47 . • In general, the tool or test that first exercised a chip section or fu nctional area with a bug found 4. B. Moses and K. DeGregory, "Performance the bug. We believe this indicates that it is more Evaluation of the VAX 6200 Systems," Digital productive to first debug using existing Te chnical jo urnal, vol. 1, no. 7 (August 1988): macrocode tests or automatically generated tests 64-78. before writing new tests. 5. M. Kearney, "DECSIM: A Multilevel Simulation

• Although the effort to boot the VMS operating System for Digital Design," Proceedings of system on the simulator was comparatively ICCAD (October 1984). large, one bug was found that would not have 6. R. Gries and ]. Woodward, "Software Tools been found through other means. We had not Used in the Development of a VLSI VAX considered the combination of events that Microcomputer," Proceedings of the MICR0-17 caused the bug. Finding this bug clearly showed Conference (October 1984). the benefit of running this application on the 7. W. Sherwood, "An Interactive Simulation simulator. Further, it gave us confidence that the Debugging Interface," Computer Ha rdware VMS operating system would boot on the fi rst­ Description Languages and Their Appli­ pass prototype hardware to ensure that proto­ cations, Breuer and Hartenstein, editors, type debugging could proceed unimpeded. (Amsterdam: North-Holland Publishing, 198 1):

• The flexible and wide-ranging modeling meth­ 137-144. odology served the design team well. The source 8. P. Sullivan et a!., "The VAX 6000 Model 400 code of the CPU chip set model has been reused Scalar Processor Module," Digital Te chnical in system verification and chip set application jo urnal, vol. 2, no. 2 (Spring 1990, this issue): development projects at least ten times. The cor­ 27-35. porate standard DECSIM logic simulator made 9. B. Milne, "Put the Pedal to the Metal with Simu­ this modeling effort savings possible. lation Accelerators," ElectronicDesign, vol. 35, Acknowledgments no. 21 (September 1987): 39-52. The verification of the CPU chip set design was 10. M. McMahon, "Accelerators for Faster Logic a team effort performed by the Semiconductor Simulation: The ZYCAD Approach," Proceed­ Engineering Group's microprocessor verification ings of VLSI and Computers, First Interna­ team. Members of this team included David Asher, tional Conference on Computer Te chnology, Rick Calcagni, Liza Hudepohl, George Mills, Rick Sys tems and Applications (COMPEURO '87 Cat. Pekkala, Ed Rocha, Will Sherwood, and Chris No. 87CH2417-4, 1987): 98 1. Spear. Part of our success is attributed to our CAD 11. D. Jenkins and S. Morton, "Transistor-Level tool development groups, including the SEGCAD Logic Simulation Using the ZYCAD Logic DECSIM team and the VAX Architecture Group AXE/ Evaluator," Proceedings of the Conference MAX team. We would also like to thank Mike Uhler on Automated Design and Engineering fo r for his technical guidance throughout the project. Electronics(19 85): 154-163.

Digital Technicaljournal Vo l. 2 /Vo . 2, Spri ng 1990 71 VAX6000 Model 400 System

12. C. Terman, "Timing Simulation for Large Digital MOS Circuits," Computer-aickd Design of VLSI Circuits and Sy stems, (Grc<:nwich: JAI Press, 1986). 13. C. Terman, "RSIM-A Logic-Level Timing Simulator," International Conference on Computers and Design ( 1983). 14 . C. Terman, "Simulation Tools for Digital LSI Design," Te chn ical Report TR-304, MIT/LCS (Cambridge: MIT Laboratory for Computer Science, 1983). 15. T. Leonard, ed ., \1,4X Architecture Reference Ma nual (Bedford: Digital Press, Order No. EY-3459E-DP, 1987). 16. D. Clark, "Bugs are Good: A Probkm-oriented Approach to the Management of Design Engi­ neering," Research - Te chnolof!Y Management (forthcoming 1990). 17. D. Bhandarkar, "Architecture Management for Ensuring Software Compatibility in the VAX Family of Computers," IEEE Computer (February 1982): 87-93. 18. W. Sherwood, "The VLSI VAX Chip Set Micro­ architecture," Microarchitecture of VLSI Com­ puters, P. Antogneui, ed . (1985): 103-126.

72 ViJ I. 2 No. 2, Spring 1990 Digital Te cbntcaljournal john W. Croll Larry T. Camilli Anthony]. Va ccaro

Te st and Qualification of the VAX 6000 Model400 Sy stem

Computer-aided design simulation, which is used in the design of the VAX 6000 fa mily, finds most problems during the hardware design phase. Simulation, however, cannot test a complex system running undersystem software control. For the VAX 6000 Model 400 sys tem, a qualification process was designed to completely test the interaction of the system s hardware and software components. The benefi t of such a process is clearly shown in the results. Nearly all the problems fo und in the qualification stage could not have been fo und in the simulationproce ss. The testing and qualification of the Model400 was a multigroup effo rt. Th ispa per describes the methods and tools of three Midrange Sys tems Engineering groups who were involved in the project.

The VAX 6000 Model 400 system is the third in the test hardware executing under operating system VA X 6000 series. The Model 400 is designed to control. A companion system test and qualification enhance CPU performance using the same platform, process, executed on hardware prototypes, is that is cabinetry, buses, and power systems, as all required to thoroughly test the complete system models in the VA X 6000 fam ily. The basic architec­ and ensure its reliability. ture of the VAX 6000 Model 400 is unchanged from A key goal of the VAX 6000 common-platform 1 that of the earlier Model 200 and 300 systems 12' design strategy was to allow new processor tech­ The Model 400 is distinguished from earlier models nology to be brought to the market quickly. To primarily by two additions: J. new processor design, achieve this goal, we had to minimize the time which offers over twice the performance of the required for system test and qualification without original VAX 6000 Model 200 processor, and by the compromising the quality or reliability of the final addition of a vector coprocessor.'·' The interfaces in product. By reusing common platform compo­ the Model 400 to the common platform remain the nents, we could primarily focus on testing the new same. However, the processor is an entirely new components. design, and is the first Digital system to use semi­ This paper addresses the system test and qual­ 6 conductors designed with the CMOS-2 process ification process used for the VAX 6000 Model 400 All VAX 6000 systems use a common design pro­ system. This process was designed to maximize test cess that relies heavily upon simulation ro detect effectiveness and minimize test time. The first cus­ and correct design errors. This simulation process is tomer shipments of the VAX 6000 Model 400 system designed to ensure that first-pass hardware, or the occurred only six months after the introduction of initial engineering prototypes, will run the operat­ its predecessor, the VAX 6000 Model 300 system. - ing system at speed. The VAX 6000 Model 400 processor is an excellent example of the effec­ Sy stem Te st and Qualification Process tiveness of simulation techniques. The elapsed time The VAX 6000 Model 400 System Integration Group from power-on of the first prototype to reliable is responsible for the overal.l design and manage­ login under both the VMS and LJLTRIX operating ment of the system test and qualification process. systems was less than six weeks, which is sub­ This group comprises engineers who reside within stantially less time than has been seen for previous the hardware design group, but have not directly VAX processors. participated in the design of the components being Current computer systems are very complex, tested. Therefore, any problems found during the especially when hardware and software interac­ test period can be rapid ly communicated and tions are considered. Simulation cannot adequately resolved, whereas the possibility of "testing to

Digital Tecbnicaljournal \11/. 1 No. 1, Sp ring J

implementation" versus "testing to specification" is limitations prohibit an in-depth discussion of each avoided. function. Therefore, we will examine in-depth the A distributed test process was developed for the roles of three of the Midrange Systems Engineering VAX 6000 Model 400 system that used the resources Groups-the VAX 6000 Model 400 System Integra­ and expertise of a variety of groups within Digital. tion Group; the VAX Architecture Ve rification The test process also allowed many tests to be exe­ Group; and the Midrange Systems Evaluation Engi­ cuted in parallel to minimize time. Additionally, neering Group. We will also describe the tests and some of these groups have specialized test facilities tools used in the qualification of the VAX 6000 that are required to satisfy standards imposed by Model 400 system. Digital and various government regulatory bodies, e.g., the FCC and UL. System In tegration Group The groups that participated in testing the Planning for the system test and qualification of the VAX6000 Model 400 system, together with a brief VAX 6000 Model 400 system began approximately description of each group's fu nction, are shown in one year prior to scheduled availability of the first Table 1. Each group played a valuable role in testing prototype systems. During this period, the overall a particular aspect of the system. Initially, a very qualification plan was developed, and individual aggressive test schedule of sixmonths for the entire test plans were solicited from each group that test and qualification process was planned. How­ would participate in the testing. Each plan was ever, clue to some delays in planned prototype reviewed by the System Integration Group for test availability, all testing had to be completed in about coverage, as well as for minimization of overlaps five months to allow systems to ship as scheduled in and duplication in the component plans. In parallel, July 1989. The expertise of all the groups involved the System Integration Group developed plans for was required to meet this schedule. However, space hardware-specific system design verification tests

Table 1 Organizations Involved in System Qualification

Organization Function

System Integration Manage qualification process; perform DVT, system test, reliability confidence test; manage field test VAX System Architecture Exercise new system to find discrepancies with the VAX architecture Midrange Systems Evaluation Test a variety of system configurations for proper operation, Engineering concentrating on 1/0 configurations VA Xcluster Validation Test complex VAXcluster environments with new products to find problems in new or existing VAXcluster components Mechanical Technology Demonstrate successful product operation while exposed to specific environmental conditions (e.g., vibration, humidity, altitude) Electromagnetic Compatibility Test new products for electromagnetic compliance to various Engineering government regulatory requirements (e.g., FCC, VDE) Product Safety Laboratory Te st various aspects of new product safety and ensure compliance to safety standards (e.g., UL, GSA) Diagnostic Quality Assurance Te st various diagnostic programs to ensure correct operation Man ufacturing Product Provide the necessary data to verify and improve the manufacturing Ve rification process to enable consistent production of a quality product Central Characterization Group Characterize performance of industry and application-specific workloads

System Performance Analysis Model and measure performance of midrange products Group Customer Services Systems Develop service delivery and training programs, test diagnostic and Engineering repair features of products Software Quality Maintenance Te st software-layered product operability with new hardware and operating system versions

Systems Reliability Engineering Predict, test, and analyze hardware and system reliability

74 Vo l. 2 No. 2, Spring 1990 Digital Tecbnicaljournal Te st and Qualification of the VAX 6000 Model 400 Sys tem

(DVTs) and for external fi eld test. Prowtype plans an operating system and a processor can operate in specified the number, distribution, and cost of the a nearly infinite number of states. It is impossible to prototyped systems that wou ld be required during design a series of tests to verifyeach of these states. the test period. Random tests exercise the system in more com­ The System Integration Group also acted as the prehensive ways than directed tests. They do not problem-reporting center. In implementing a dis­ seek specific results. Instead, random tests attempt tributed test process, two fu nctions are essential. to push the system into as many different states as There must be a central fo cus that disseminates possible, as quickly as possible. Greater test cover­ information regarding observed problems to all age results from these tests, but problem diagnosis test groups. Second, an established method for and isolation are more difficult. tracking status and resolution of these problems Because random testing does not look for specific must be maintained. An internally designed system results, it is effective only if done for extended maintained a complete audit hiswry of each prob­ periods of time. Even if identical test scripts are run lem. There were 121 problems reported during repeatedly, system activity becomes unpredictable VAX 6000 Model 400 testing. Each step in the reso­ over time, due to events such as network activity or iution process was tracked for each problem in the disk fragmentation. This unpredictability is impor­ problem database. This database was available to all tant because it means more system states are being test groups. The database was supplemented by exercised. weekly cross-functional meetings, at which repre­ The System Integration Group developed a ran­ sentatives from engineering, manufacturing, and dom test package, called the Systems Integration customer services reviewed and updated each Te st Package (SITP), to test the VAX 6000 Model 400 problem's status. system. This package consists of a comprehensive The complex VAX 6000 Model 400 project sched­ collection of test programs and a script-driven ule was developed and tracked by the group. Due to mechanism that controls their execution. SITP is the number of groups involved in testing and the diverse and flexible. The test programs were delays involved when problems were found, the obtained from many sources. The System Integra­ critical path tended to be very dynamic during the tion Group also wrote some test programs to exer­ test phase. A project management tool, which was cise specific aspects of the Model 400 system that developed within Digital and which used PERT, were not fu lly exercised by other tests. tracked status against milestones and modeled dif­ The test programs used with SITP are high level. fe rent scenarios to prevent overall schedule slip­ Each high-level test uses many lower level functions page as changes occurred. within the system. Many of these programs are run The System Integration Group performed four together, with varying test parameters and run­ major types of test. These were system test, reli­ times. The programs are self-checking. If an action ability confidence test, design verification tests, and does not complete properly, the program nores the field test. The fol lowing sections describe each of error immediately. The program does not attempt these tests. to identifythe cause of an error; rather, it gathers as much information about the error as possible. This Sy stem Te st information is later examined by a test engineer. There are two fo rms of system testing, directed and SITP is easy to use, restarts automaticaUy after random. Most testing groups use directed tests, system crashes or power fa ilures, and includes mon­ which test hardware or software features, or follow itoring tools. Periodic reports, with details about a strict test sequence. Directed tests seek specific system activity and error log data, are generated by results and are well defined. the test package. With this information, the test The System Integration Group performed many engineer can gauge the effectiveness of the tests and directed tests on the Model 400 system. Some of adjust them as necessary. The test engineer can these tests were done to satisfy the requirements of also control and monitor tests on many different external regulatory agencies, or internal Digital machines. Machines can be located locally and development standards. Other directed tests remotely. include system DVT tests, which are discussed in A number of SITP scripts were developed to more detail later in this paper. provide different workloads for testing the Model Many aspects of complex systems cannot be ade­ 400 system. Each set of scripts emphasized a dif­ quately tested in a directed fashion. For example, ferent type of system activity. Some were com-

Digital Te cbnicaljournal Vo l. 2 No. 2, Sp ring 1990 75 VAX6000 Model 400 System

pure intensive, some I/O intensive, and some protOtype machines for other testing. Two of these stressed parallel and multiprocessing activity. The bugs were fixed by modifications to the CPU mod­ scripts were modified to suit system configurations ule. The other three required changes to the proces­ as needed. sor chip. As corrected processor chips became SITP and the test scripts were installed on all the ava ilable, SITP was used tO ensure the fixes had not Model 400 system prototypes in the system integra­ introduced further bugs. tion lab. Tes ts under the control of SITP were run It is interesting to note that four of these five on the prototypes as prototypes were avail:lble. problems occurred in system areas not simulated Because the prototypes were heavily used during during hardware design. Of these four, two daytime hours for various debugging tasks, SITP occurred in the handling of external system events. tests were run overnight and on weekends. Tht:test One was in system reset handling. The other was in scripts were designed to run for a specific number handling "control/P" interrupts. Conrrol/P is the of hours and thm stop. The prototype was then standard method an operator uses to get the atten­ H available for the next user. This procedure allowed tion of the system console on VA X systems Two otherwise idle prototype hours to be used in system bugs were caused by interactions between the new testing and ensured a clean shutdown of the tests. In processor and other system components. These this way, test data could be retrieved without inter­ interactions were not simulated during hardware ference from other prototype users. design. The fifth bug was not found during simula­ SITP was used on the earliest Model 400 system tion because of a deficiency in a simulation test tool. prototypes and was continually used throughout the qualification period. as prototype time was Reliability Confiden ce Te st avai lable. Scripts wen: t:�ilored to cause test con­ To accumulate uninterrupted run-time on the centration in specific areas and were modified as Model 400 system, five identically configured sys­ necessary to suit various prototype configurations. tems were set up in an isolated area. The machines Typical SlTP runs would last for 16 or 19 hours were isolated to protect them from outside inter­ (overnight), or 58 or 60 hours (over weekends). fe rence while rhe confidence test was running. Processor, memory, and 1/0 configurations varied The purpose of this test was to derermine the from run to run, and depended on test needs and actual reliability parameters of the Model 400 equipment availability. system hardware and to compare the results to the The overall results from system testing were very system's actual reliability requirements. A second­ positive. Over 6700 CPU hours were accumulated ary goal of the test was to determine the long-term on various prototypes and configurations. Many system reliability, both for the hardware and oper­ errors were encountered during this period, but ating system software. most were due to SITP bugs (SITP was still under The duration of the test was planned for SL'X development for most of this period) or to errors in weeks, which was sufficient to show the hardware setting up test scripts. Hardware errors occurred in reliability. Once this six-week period was over, peripheral devices, principally disks and com­ we planned to continue to run the machines in munications devices, and were corrected as they the same environment with the same workload for occurred. as long as possible to accumulate further system Of the mort: serious problems fo und, one was a run-time. hardware problem that would cause a system hang. The test started at the beginning of May 1989, The problem was identified as a bug in a bus inter­ when enough CPU modules became available to face chip on the CPU module, which was operating populate the five machines. The formal test period in an untested mode. It was resolved by modifying ended two months later, in late June. Three of the system console to ensure that this mode was the machines continued to run for two and a half never used. An error was found in the VMS machine months, until mid-September. Of the other two check handler, which was corrected in a subse­ machines, one machine was needed for other pur­ quent release of the VMS operating system. poses, and another's CPU modules were removed to Five other serious bugs were found in the new change the configurations of the remaining three. CPU modules. Although none of these bugs were The systems ran identical SITP scripts that con­ fo und by the System Integration Group's testing, centrated on exercising the new CPUs. Tests each took time to invcstipte, resolve, and test the included compute-intensive programs, programs fixes. As a result, there was less time available on that explicitly tested various aspects of the new

76 Vo l. 2 No. 2, Sp ring /'J'.JO Digital Te cbn icaljoun1.al Te st and Qualifi cation of the VAX 6000 Mode/ 400 Sys tem

CPUs (e.g. , multiprocessor cache coherency), some or to how the new hardware fits into the existing decomposed parallel applications, and tests that system. These tests are called design verification generated many VMS processes. tests (DVTs). The overall results of these tests were very good. A complete test plan for the Model 400 system The systems demonstrated a hardware reliability of DVTs was written and reviewed by the group. The over a year between hardware failures. list of DVTs performed is shown in Ta ble 2. Only two module failures occurred. One had a The Model 400 DVTs complemented those tests bad cache tag store, which was discovered very performed on the new hardware components. The early in the test. As a result of this discovery, the test parts of the system tested were those in which other process for the CPU module's cache control chip testing was weak or nonexistent. These tests were was changed. The other module failure was a float­ conducted in a fo rmal manner, with a written ing point chip fa ilure. One specific test program sequence of events and formal reports of results. began generating wrong answers late in the fo rmal Any problems found were noted and reported to period of the confidence test. This module was the relevant development groups for analysis and removed for repair. eventual correction. Of the other fa ilures that occurred during this The DVTs executed for the Model 400 uncovered test, all were attributable either to test script or two system bugs and some minor documentation set-up errors, or to software or hardware prob­ problems. Both bugs were related to power failure lems which were corrected prior to shipment to recovery. One was in the console and one in the customers. VMS operating system . Both bugs were eventually Once the formal test period was completed, the fixed. The minor documentation problems in the three machines that continued to run until mid­ system installation guide were also fixed. All other September exhibited no new fa ilures and eventu­ design verification tests fou nd no problems with ally accumulated a year's run-time. The reliability of the system. the VAX 6000 platform and the Model 400 CPU was successfully demonstrated. Field Te st Design Ve rification Te sts Field tests are made on prototypes of new products Part of the System Integration Group's responsibil­ provided to customers. The purpose of field test ity is to ensure that parts of the system not covered is to gain experience with the new products before by tests from other groups are tested. In general, production; new products are actually used as these parts are specific either to the new hardware opposed to tested.

Ta ble 2 VAX 6000 Model 400 System Design Ve rification Te sts

Test Fu nction

Keyswitch Verifies front panel fu nctions with new hardware/software Voltage margin Verifies proper operation over allowable voltage range Thermal Verifies proper operation over allowable temperature range Power-fail/battery backup Te sts battery backup operation with hardware and software Interlocks Te sts memory interlock functions XMI saturation Tests system operation under very heavy bus loads Queue contention Tests proper operation under very heavy worst-case memory loads Multiprocessing Tests proper operation of all multiprocessing features Load test Runs system for extended period under heavy compute and 1/0 load System installation Verifies overall manufacturing process, and installation documentation Boot Verifies that system boots all operating systems from all devices System configuration Verifies that all allowable configurations work properly Remote services console Ve rifies that remote services console hardware and process works with new system Console input Verifies that console terminal hardware and software work properly

Digital Te cbnicaljournal Vo l. 2 No . 2, Sp ring 1990 77 VAX6000 Model 400 System

For VAX system products, field test historically rected. Two problems arose because the Model 400 has lasted a minimum of four months. This period system was different from the other VAX systems in was determined from tracking problem reports use at two of the sites. One site reported minor dif­ from field test sites. Generally, a month was ferences in results from a benchmark program, required for the new product to be installed and for which was due to differences between run-time usage to reach a level where meaningfu l testing libraries used for full VAX architecture implementa­ H occurred. The next three months provided useful tions and subset implementations. The other prob­ data about the new system. After three months, the lem resulted because a customer program was amount of useful data declined. referencing an internal processor register that does In field testing the VAX 6000 Model 400 system, not exist in subset implementations. we shortened this four-month period to three. The Field test progress was assessed two months into plan was to check on field test results two months the test, at the beginning ofJun e. The results of field into the test. If field test was not progressing well test were then examined, together with the data at this point, we were prepared to extend the test obtained from other qualification testing. Since no period. major problems had been found, we decided to pro­ The System Integration Group did two things to ceed with plans to ship the Model 400 system as eliminate the field test startup time. First, because scheduled in mid-July. the Model 400 system was an upgrade from earlier Some improvements could have been made in the VAX 6000 systems, Model 200 systems were field test process. First, site audits prior to instal­ shipped to each field test site in advance of the lation of the prototypes were not very thorough. Model 400 field test start. These systems were Many of the sites were not running the neces­ installed and running approximately a week before sary software revision levels. Therefore, the new the official fi eld test started. A considerable amount machines could not be immediately put into of time was saved in site preparation and system VA.Xcluster environments with existing machines. installation. Second, some pieces of hardware were missing and Second, the startup of the field test sites was had to be supplied later. Better communications performed by system integration engineers, who with the site prior to shipping prototypes would 4 brought Model 00 CPU modules and new VMS have reduced these problems. software to each site. These engineers supervised The System Integration Group wrote some moni­ the installation of the new CPl's in the previously toring software that was to be installed on each field installed systems, installed the VMS operating sys­ test machine. Because this was written at the last tem, and ensured that the systems were available to minute, it was not properly tested and did not work the customer before leaving the site. properly. The problems were fixed, but the moni­ A total of seven field test sites were started up in toring software was not run at all the sites. Finally, early April 1989. Six of these sites were located in how to get data from the monitoring software back the United States. Five of these sites were installed to the engineering groups was not well defined. and turned over to the customers within two days. Thus, usage data obtained from field test machines The sixth site was ready in four days. The seventh was spotty. These tools and methods are being site was located in Europe, and was started up by improved for use in the field test of fu ture products. mid-April. Once the site systems were running, the System VA X Architecture Verification Group Integration Group maintained regular contact with each site. Each site was assigned a "captain" (a sys­ The Architecture Ve rification Group ensured that the Model 400 CPU conformed to the VAX architec­ tem integration engineer), who polled the site H weekly, talked to the users, and received first-hand ture specification. information about machine usage. This method was used instead of the traditional dial-up problem Test Process Overview reporting method for two reasons. First, technical The architectural verification process consists of problems existed in making reliable connection to running two programs- AXE (architecture exer­ the dial-up system. Second, many people are reluc­ ciser) and MAX (multi-instruction architecture tant to report problems, unless the problem is so exerciser)- in various modes for a given number of large that work stops or is severely impaired. test cases. The tests are simple to run. However, the Overall, field test went smoothly. Most of the test programs are quite complex and required many problems that occurred were minor and easily cor- years to develop. The Architecture Ve rification

78 llri/.2 No. 2, Sp n'np, /l)'JO Digital Tecbnicaljournal Te st and Qualification of the VAX 6000 Mode/ 400 Sy stem

Group maintains and enhances these test programs When the instruction completes, AXE compares and the databases used to verify the architecture. the instruction stream and the re levant data to When one of the tests fa ils, the group identifies reference data. In this example, the relevant data the problem and helps resolve it. Once the problem includes the three general purpose registers. is fi xed, the group repeats the tests. Both AXE and MAX ignore a machine state defined to be unpredictable for a given condition. AXEandMAX Therefore, allowable differences between imple­ The VAX instruction set consists of over 360 instruc­ mentations do not cause erroneous fa ilure reports. tions and 21 addressing modes. Most modes are Limiting AXE's testing to single instructions pre­ valid for up to six operands per instruction. Con­ cludes meaningful testing of the pipelining that is CPU ceptually, both AXE and MAX divide an instruction's common in today's designs. MAX overcomes context into several components. These com­ this limitation. MAX currently acts as an adjunct to ponents include opcode, operand specifiers, oper­ AXE. However, it will eventually replace AXE. ands, page protection and validity, and processor status long word (PSL). For each component, valid MAX MAX is similar to a compiler in that it creates and invalid values are pseudo-randomly selected to complete instruction streams. However, MAX does create a test case. The exerciser continues to create not have source code to ensure that the resultant unique cases for as long as it is run. machine code is logically consistent. CPU The VAX architecture has a clearly defined excep­ To test how a handles inter-instruction data tion and instruction restart structure. Much of the dependencies, MAX must create test cases with VAX architecture's complexity is in those opera­ instructions that share registers and memory loca­ tions. Therefore, both AXE and MAX favor values tions. Creating sensible instruction streams can be that cause faults. Each program establishes a situa­ difficult. For instance, the result of one instruction tion with fa ults, starts the instruction or sequence, could be used as part of an address calculation for verifies that the fa ult occurs, ft xes the fa ult, restarts a subsequent instruction. However, the likelihood the instruction, and verifies that it completes cor­ is slim that the result of a randomly selected arith­ rectly. Upon completion, AXE or MA)( compares the metic calculation will be used within the test case's results from the unit under test to a known good virtual address space. reference, and reports any diffe rences. The known MAX first creates and executes sensible cases in good references contain the correct results of each logical steps. It then assembles and executes each test case. These references have been accumulated case as a whole. over the years of testing VAX systems and have After selecting the fi.rst instruction, MAX executes changed as the VAX architecture changed. the instruction, including fa ult restarts , and saves the final results. MAX next selects and places the AXE AXE is the older and simplerexerciser. It cre­ second instruction in memory fol lowing the first ates test cases that consist of a single instruction. A instruction. Where possible, it uses the results of typical instruction stream would be: the first instruction for operands and operand specifiers of the second instruction. MAX selects

ADDF3 R1 , R2 , 5"'1 R1 .00008 000 , R2=4624681 1 new values for operands and specifiers for which old values cannot be used. When this instruction is first executed, either a MAX includes the new values in the initial state of reserved operand fa ult (on Rl) or a reserved the entire case. It then executes only the second addressing mode fa ult (short literal destination) instruction. This process repeats until an instruc­ should be reported. AXE will fix whichever fa ult is tion stream of the desired length is created. At this reported and restart the instruction. point, the entire stream is executed. Once the Assuming the reserved addressing mode fa ult stream is run, the results of all of the instructions are was reported, the instruction might then look like compared. This comparison is made against the this: results of the single-step execution and the results of the known good reference. ADDF3 R1 , R2 , R3 R1 .00008000 , R2:4624681 1 Te st Process If the reserved operand fa ult is reported, AXE will The minimum testing requirements for shipment change R 1 to a valid floating point value and restart. of VAX systems were developed from experience

Digital Te cbntcal]ournal Vo l. 2 No. 2, Spri ng 1990 79 VAX6000 Model 400 System

gain<.:tl from testing all previous VAX systems. The review of a test plan. The rest plan is based on requirements are a compromise between the num­ product specifications and information from devel­ ber of bugs likely to be found and the time required opment groups. It identifies the test tools, configu­ for the tests. The testing program consists of over rations, and strategy the System Evaluation Group 200 million test cases. will use. The use of AXE and 1'

Te st Process Overview Configuration Selection and Te st System evaluation typically begins when the first The System Eval uation Group selects configurations hardware and software prototypes become avail­ according to many fa ctors, including able and continues through to product shipment. • VMSoper ating system restrictions Evaluation planning often begim six to nine months prior ro the actual pron:ss with the creation and • Bus architectures and slot placement limitations

80 Vo l. 2 No . 2, Sp ring 1990 Digital Te cbnical]ournal Test and Qualification of the VAX 6000Mo del 400 System

• Power and packaging restrictions figuration-dependent problems were corrected by modifying console and operating system software. • Marketing requirements As shown in Figure 1, VAXBI channels on the The maximum number of each supported option Model 400 were also configured with all currently is tested within laboratory resource limits. Because supported VAX 6000 system 1/0 adapters. Although the Model 400 CPU is a higher performance proces­ testing with large and diverse system configurations sor for the existing VAX 6000 platfo rm, it was tested poses logistical challenges, these envirorunents will within established VAX 6000 fa mily configuration often succeed in exposing device compatibility guidelines. The System Evaluation Group chose problems. nine Model 400 system configurations to test from The device compatibility problems found during one up to six processors, 32 to 256 megabytes of the Model 400 evaluation occurred either during memory, and from two to six VAXBI channels. operating system initialization or when interactive An important part of configuration testing for a workloads concurrently exercised all system new processor involves verification of proper sys­ devices. One such problem resulted from the tem initialization and operating system boot using increased processing speed of the Model 400 pro­ various load paths. With the Model 400 system, this cessor and would occur only when an adapter was testing meant loading the operating system through tested in a specific configuration under a certain diffe rent VAX BI channels, disk or CI adapters, and workload. load devices. Several problems were noted with More specifically, this problem was due to a race bus adapter initialization and self-test while testing condition between a VMS application level program certain VAXBI channel configurations. These con- issuing I/O requests and the adapter hardware pro-

MEMORY ARRAYS MODEL PROCESSORS 400 1 TO 8 PER SYSTEM 1 TO 6 PER SYSTEM (32 TO 256MB)

XMI I

2 TO 6 VAXBI CHANNELS

I VAXBI I I I I PARALLEL ASYNCHRONOUS CONSOLE UNIBUS Cl PORT COMMUNICATION LOAD ADAPTER ADAPTERS ADAPTERS ADAPTERS DEVICE 1 PER VAXBI 1 PER SYSTEM 4 PER VAXBI 2 PER VAXBI 1 PER SYSTEM 1 PER SYSTEM 8 PER SYSTEM 3 PER SYSTEM

DEC net DISK TA PE SYNCHRONOUS COMMUNICATION ADAPTERS ADAPTERS COM MUNICATION ADAPTERS 2 PER VAXBI 2 PER VAXBI ADAPTERS 2 PER VAXBI 12 PER SYSTEM 4 PER SYSTEM 2 PER VAXBI 4 PER SYSTEM 2 PER SYSTEM � LAVC/LAN VA Xcluster

Figure I Adapters/Op tions Te sted during VAX 6000 Model 400 Evaluation

Digital Te cbnical]ournal Vo l. 2 No . 2, Sp t"ing 1990 81 VAX6000 Model 400 System

cessing and returning a response. With slower Memory arrays were also configured in a non­ machines, the adapter hardware won the race; but interleaved mode to degrade memory access time with the faster Model 400 processor, the host won and aggravate potential bus timeout conditions. hy issuing commands faster than the adapter hard­ The System Evaluation Group ran interactive test ware could process them. The problem was cor­ experiments for a minimum of four hours and a rected with a modification to the device driver maximum of three days. A typical experiment software. lasted 18 hours. Most long duration experiments Complete configuration test coverage for the were performed using the combined CPU/memory Model 400 processor also required VA Xcluster/local and l/0 adapter workload. The longer run-times area VAXcluster (LAVC) and DECnet/local area net­ provided adequate time for adapter exercisers work (LA�) evaluation using each of the supported to step through a preprogrammed range of trans­ CI and DEC net communication adapters. To accom­ fe r sizes. plish thest: tests, a 13-node VAXcluster was estab­ Following each test, VMS error counters, test soft­ lished composed of t:ight VAX host systems, four ware status reports, and VMS error log entries were HSC mass storage servers, and the VAX 6000 Model examined for system or device errors. The group 400 system under test. then characterized problems in terms of their fre­ Primarily, the VA Xcluster/LAVC and OECnet/LAN quency, repeatability, and the system environment testing verified the functional compatibility of the in which they occurred. The system environment Model 400 processor with the VAX 6000 series CI included detai led information regarding hardware and NI adapters. Cluster and LAN activity were used configuration, the software test tools and parame­ simply as adapter loads. Interactive experiments ters used, and module, operating system, device were designed to emphasize stress at the local sys­ driver, and firmware versions. tem level . Cluster-level verification was deferred to Although some of the problems noted during the another group. Model 400 evaluation were easy to reproduce and Evaluation of the VAX 6000 Model 400 system occurred frequently, others were intermittent in also verified the Model 400 system-level power-fail/ nature and not so easily induced. For example, the warm-restart capability in large configurations with configuration-dependent problems occurred each high compute and 1/0 loads. These tests ensured time a specific configuration was tested, whereas a that battery backup units would maintain supply particular device compatibility problem was inter­ voltages for the guaranteed duration. Further, these mittent and required long duration runs before tests ensured that error reporting and recovery pro­ appearing. cedures operated properly. The detailed problem descriptions, along with Interactive Test Me thod pertinent error log or crash dump data, were pro­ The interactive test method first selects test soft­ vided to the appropriate development groups fo r ware. It then modifies parameters such as VMS analysis. Also, descriptions were logged in the pro­ queue 1/0 (QIO) request size, number of outstand­ ject-specific problem reporting databases to ensure ing conunands, and device mode of operation. that problems were properly tracked and resolved. Using this method, three system workloads were The System Evaluation Group then worked with the generated. development group to fix the problem. The first workload used small VMS QIO request From a total of six problems noted during the sizes and maximum QIO request queue lengths to Model 400 evaluation, four were configuration­ achieve high l/0 rates. This workload minimized dependent and corrected through modifications to idle CPU time and maximized time on the inter­ the VMS operating system or console microcode. rupt stack. Two problems were device compatibility bugs that The second workload used large VMS QIO request occurred during operating system initialization and sizes with sequential disk accesses and device loop­ interactive testing. These problems were fi Xed back to generate high bus utilization rates (bytes through modifications to the VMS operating system per second). This workload saturates 1/0 buses and or device driver software. interconnects by generating large amounts of direct When the VA X 6000 Model 400 system evaluation memory access activity. was completed, a final report was distributed. The The third workload combined CPU/memory and report summarized the configurations and test tools I/O adapter test software with a distribution of used, the specific experiments performed, and the small, moderate, and large VMS QJO request sizes. current status of all problems that were identified.

82 Vol.2 No. 2, Spring /<)<)0 Digital Te cbnical]ounwl Te st and Qualification of the VAX 6000 Model 400 Sys tem

Summary References System qualification is the last stage in the system 1. B. Allison, "An Overview of the VAX 6200 Family development process. The qualification process for of Sys terns," Digital Te chnical journal, vol. 1 , the Model 400 system was designed specifically to no. 7 (August 1988): 10-18. take into account computer-aided design and simu­ 2. B. Allison, "The Architectural Definition Process lation used by the hardware design process. of the VAX 6200 Family," Digital Te chnical As our experience with the Model 400 system has jo urnal, vol . 1, no. 7 (August 1988): 19-27. shown, nearly all of the problems found during system qualification were in areas of the system that 3. R. Gillett, "Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus," could not be simulated. As a result, the qualification Digital vol . 1, no. 7 (August 1988): process is most effective when focusing on testing Te chnical jo urnal, those parts of the system that cannot be simulated 28-46. because of their complexity, which presents both a 4. P. SuUivan et al., "The VAX 6000 Model 400 chaUenge and an opportunity. Scalar Processor Module," Digital Te chnical The challenge is to design test processes and jo urnal, vol. 2, no. 2 (Spring 1990, this issue): tools that can adequately test a complex system in a 27-35. reasonable time. With SITP and other test tools, 5. D. Slater et al., "Vector Processing on the VAX such as those used by Midrange System Evaluation 6000 Model 400 System," Digital Te chnical Engineering, we have made a significant start in jo urnal, vol. 2, no. 2 (Spring 1990, this issue): developing these tests. However, there is still much 11-26. room for improvement, and work is continuing in 6. H. Durdan et al., "An Overview of the VAX 6000 this area. Model 400 Chip Set," Digital Te chnicaljo urnal, The opportunity is to shorten the qualifica­ vol. 2, no. 2 (Spring 1990, this issue): 36-51. tion process. Because first-pass hardware is more 7. ]. Basmaji et al., "The Role of Computer-aided robust, more system testing can occur earlier. Also, Engineering in the Design of the VAX 6200 better test tools enable us to provide more test System," Digital Te chnicaljo urnal, vol. 1, no. 7 coverage in less time and with fewer resources, (August 1988): 47-56. both in prototypes and number of people. We took 8. T. Leonard, ed., VAX Architecture Reference advantage of this opportunity in the Model 400 Ma nual (Bedford : Digital Press, Order No. system qualification to cut the length of field test EY-3459E-DP, 1987). by 25 percenr, thereby bringing the Model 400 system's new technology to market faster.

Digital Tecbnicaljournal Vo l. 2 No . 2, Spri ng 1990 83 Thomas C. Furlong Michael]. K. Ni elsen Ne il C. Wi lhelm

Development of the DECs tation 3100

The DECstation 3100 is the jlrst member of Digital'sfa mily of high-perfonnance ULTRIX workstations. Built with the chip set fr om .HIPS Computer Sy ste�m� Inc. , and highly integrated 110 and graphicssubs_ystems, the Dtestation 3100 imple­ ments 12 mips of RISC-based computing, workstation 110, and e.y cellent bit-map graphics on a single module. The DECstation 3100 workstation runs Digital 's ULTRIX op i-�ating sy stem (compatible with software) as well asDEC windows software, TCP!IP, DECnet software, and Ne twork File Services (NF!:J). The workstation can be config ured with 8MB to 24;HB of parity-protected memory, monochrome or 8-plane colorgraphics, 15-ilu'h or 19-inch monitors, and SCSI disk and tap e devices. This paper describes the DECstation 3100 product, the design effo rt, details of the s_ystem, and measured benchmark performance.

Sy stem Overview caused many workstations to be used only as rather Packaged in a desktop system box, the DECstation expensive tt'rminal emulators for the larger 3100 workstation is implemented as a single mod­ machines. The slow workstations also meant that ule that contains CPU/I'PU,separate instruction and window applications were often bogged down in data caches, memory control logic, Ethernet and screen update activity, and that NFS performance small computer systems int<.:rconnect (SCSI) con­ was limited not by device or network speeds but by trollers. fou r serial lines, and video display logic. the workstation's processing power. Connectors on the module accept as many as 12 The primary goal of the DECstation ) 100 project memory modules of 2 megabytes (MB) apiece as team was £O produce a fast RISC lJLTRIX workstation well as a single monochrome or color frame buffer that would bring processing and windowing perfor­ module. The box optionally contains 3.5-inch, nunce to the user's desk at a competitive price. The 104 MB SCSI disk driv<.:s. An SCSI connector on the product would run the ULTRIX operating system, back of the system box supports the attachment of DECwindows products, and network software for additional SCSI devices, such as the 332MB disk both TCP/IP and DEC net networks. drive, the 95 MB tape drive, and the 600MB CDROM In early May of 1988, we received approval disk drive. tO build an ULTRIX workstation that featured increased processing power. The remaining design Background and Project Goals goals were time to market, packaging, reuse of Digital's research and development groups in existing designs, and system cost and price. The Palo Alto had used a UNIX operating system in a aggressive schedule called for the fi rst workstation client/server computing environment for more to ship in mid-January 1989. We were asked to than three years. The clients were various VAX use the new desktop system hox designed for workstations. The servers were VA X systems and product to be later announced as the VA Xstation RISC-technology research machines, called Titans, 3100. We were also asked to reuse any hardware which were developed by Digital's We stern or software elements of the VA Xstation 3100 that Research Laboratory. we possibly could. Our own view of the market­ The major frustration in this environment was place caused us to choose an entry-level price of the lack of proct'ssing power at the workstation. approximately $10K. Computer room servers dt'livered up w 12 million At the end of the project, we achieved all of our instructions per second (mips), but most office goa ls. We shipped the first system in the desired box workstations delivered only I mips. This disparity on the very day we had promised. We reused most

84 Vo l. 2 No. 2, Sp rin[i fl)')O Digital Te chnicaljournal Developmentof the DECstation 3100

of the existing ULTRIX, windowing, and network software, and documentation. Many designers, software, and supported the same internal disk particularly software engineers, temporarily relo­ drives as the VA Xstation :1 100. We held the entry­ cated from New Hampshire to part icipate in the level price within 20 percent of the SlOK goa L project. Researchers from Digital's two Palo Alto research laboratories gave generously of their time Sy stem Cost Issues in reviewing the design, developing new software, We needed to control the high-cost items of proces­ and testing prototype systems. sor, caches. and memory . We also had to resist the The product fo cus ream built extended support tendency to add things, either because other groups groups for functions such as manufacturing, mar­ requested them or because we had ourselves keting, sales, and application development . wanted them. The processor choice was based more on cost per Basic Design Rules mips than the absolute cost of the processor chip set itself. The cost per mips consideration led us to From the beginning we agreed upon some basic select a RISC processor instead of a VAX processor in design guidelines. We strictly adhered to these deci­ order to obtain at least a two-to-one performance sions throughout the design phase of the project. advantage. Since we were building a product for We would develop all functions on one system users of the LJLTRIX operating system, the lack of module. Anything that did nor fit would not be part VA X instruction compatibility was not an issue. We of the product. chose the R2000 chip set from MIPS Computer We would do no ASIC or other IC design; the Systems as the best CMOS HISC technology on the schedule did not allow for it. So we would use lots market. of random, low-cost, standard logic functions, PALs Even though caches contribute significantly to in particular. Opportunities for lower cost integra­ system performance, we still considered the use of tion wou ld be saved for fo llow-on products. small caches to reduce cost. Simulations of system We would build dumb J/0 controllers. This deci­ designs indicated that the higher expense of large sion eliminated the use of secondary processors, caches was necessary to achieve fa st desktop microcode, and hardware coordination of ­ performance. ligent devices. Various-sized buffe rs on every con­ Our most diffi cult challenge was determining troller would allow devices to run at their own how to implement memory . We did not want to speeds independent of general processor activity. burden the entry-level systems with more memory Software drivers running on the genera l CPU would than was necessary, but we did want the opportu­ interact with industry- or Digital-standard con­ nity to add memory to systems that could use it. We trol.ler chips for the I/O subsystems. considered memory on the system module, mem­ We would build dumb graphics - no pipeline, no ory on daughter cards, memory on commodity graphics processors, no rendering chips. bur simply single in-line memory modules (SIMMs), and fi nally a frame buffer configured as part of the main mem­ memory on custom-designed SIMMs. In the end, we ory address space. The onl y exception was to add a chose the custom-designed SIMMs because of their color plane mask for use by the color software. density, cost, and configurability. We wou ld aim for ease of manufacturing by keeping the option choices low. The only choices Basic Project Rules to be made in manufacturing the system box were To succeed at this project we kept the size of the how many memory modules to insert. which frame team to the minimum and isolated the team from buffer module to insert, and whether to add one outside influences. We settled on a minimal product or two internal disks. This decision kept the total fo cus team and a minimal design team. The five­ manufacturing permutations down to 30. (Earlier person product focus team would manage the pro­ Digital workstations were up around 1000.) ject while the three-person hardware design team We would try to let experienced computer users would build the machine. Everyone would work upgrade and service their own workstations. The from the Palo Alto base. choice of system box limited this capability since The first machine was running two months the VAXstation 3100 box was not originally after the project start dare. 13y that rime the design designed for user access. We simplified the original team had expanded to about twenty people box design and left much of the box empty to make developing the electronics package, diagnostics, the electronics more accessible.

Digital Te cbnicaljoumal Vol. 1 No . .2, Sp ring 1990 85 VA X 6000 Model 400 System

Conflicts and Resolutions the systems, power-cycled them, temperature­ At project starr, the design team resisted the idea of cycled them, even submitted them to rainfall due offering both monochrome and color graphics to an environmental failure. We recorded every options. We believed in the value of provid ing color fa ilure and traced it back to its source. Many of wherever and whenever possible on workstations: these fa ilures led to changes in diagnostics, compo­ customers prefer color, and it provides additional nents, placement, and mechanical solutions. This fu nctionality. However, we were not confident early stress resting did not uncover any problems we could develop both graphics options and still nor seen elsewhere, but with a small number of remain on schedule. We decided to usc a simple machines, it validated all problems seen in the other frame buffer along with a single, configurable video test situations. output design. We would bear the cost of an unnec­ To test the total system with its software, we essary VDAC in the monochrome system, but we invented a qualification team nicknamed the on ly had to design one system. "wrecking crew," a group of about twenty-five Although the hardware design team was confi­ senior engineering researchers and developers. dent that cost/performance trade-offs were the They agreed to accept early protOtypes and to sub­ right ones, other project members voiced their con­ ject them w heavy use for a period of three months. cern . Some issues were memory size and I/0 and Their goal was to break the systems, as often and in graphics performance. Would the maximum mem­ as many ways as they could. During the wrecking ory size of 24 MB he large enough for applications period , we constantly installed the latest soft­ such as computer-aided design {CAD) and model­ ware changes, replaced diagnostic ROMs, added ing� How would Ethernet and disk performance hardware, and moved systems from person to per­ compare to VA Xstation and competitors' work­ son to allow everyone to try different configura­ stations� Wou ld the choice of frame buffe r matched tions. Each crew member was responsible fur an with a fa st RISC processor deliver adequate graphics exhaustive test of a subset of the ULTRIX commands performance, particularly in color systems� and utilities. These questions were answered once the soft­ The wrecking crew was a huge success. Te am ware development was complete and performance members reported 785 problems, complained measurements could be made on late prototype mightily and usefully, ported CAD tools, window systems. Sec the Product Qualification section of applications, and compilers in their spare time. this paper. Th<.:ir constant demand for the performance they Another topic we debated was whether to anow expected exposed many bugs that were artificially peripheral devices in the box, which was clearly limiting performance. Best of all, the wreckers dt�signed for such devices. Various combinations of almost doubled the number of software experts disks and tapes in the box presented three prob­ knowledgeable about the workstation from an lems: more complicated options in manufacturing, early stage and thus contributed significantly to a heavy drive plate and complex cabling for the user system quality. to remove during system upgrade, and a power problem if an internal rape drive was present. Processor Subsystem Details Eventually we permitted only the internal 3.5-inch The DECstation 3100 CPU consists of the MIPS disk drives and simplified the drive mounting plate R2000 integer processor, the R2010 floating point to be easier to remove. coprocessor, and four R2020 write buffers. The chip set operates at 16.67 megahertz. Product Qualification In the DECstation 3100, the R2000 chip set runs Since the entire project time was only eight in "Little Endian" mode. In other words, bits within months, we needed to maximize the test time of bytes and words are counted from right to left , and the DECstarion 3100. The hardware was an entirely the low-order bit is the rightmost bit in a word. new design. The software was a port from a VAX "Little Endian" mode means that the integer data base to a RISC base, and much of the lower level fo rmat of the OECstation 3100 is identical to the graphics software was completely new code. integer data format of any VA X processor. The To test modules and the hardware system, we fl oating point data format is compliant with IEEE sent many early prototypes to a local testing lab­ standards. oratory for stress testing. While running both diag­ The R2000 CPU implements the instruction set, nostics and the ULTRIX operating system, we shook processor registers, virtual memory, and interrupt

H6 Vo l. 2 No. 2. Sp ring i'J'JO Digital Technicaljournal Development of the DECstation 3100

system as defined by the R2000 architecture. The characteristics as memory and may be cached if CPU maintains the direct-mapped, write-through desired. The memory system supports byte, half­ data cache. Each cache is 64 kilobytes (KB) in capac­ word, word writes, and word reads. ity with a 4-byte line size. The tag and data stares of The memory system control logic is optimized each cache are byte-parity protected, and cache for minimum memory read latency, at a sl ight cost parity errors transparently generate cache misses ro in memory write latency. On a memory read, the reload the cache from memory. CPU incurs a fi ve-cycle stall in the absence of mem­ The R20 10 floating point coprocessor imple­ ory refresh contention. The memory system can ments the IEEE arithmetic functions and coproces­ sustain fi ve-cycle reads, which results in a peak read sor registers defined by the R2000 architecture. bandwidth of 13.3MB per second. The R2020 write buffer implements a four-stage Memory writes to an empty write buffer com­ write buffer for the CPU. This write buffe r allows plete in eight cycles, but do not stall the CPU. Suc­ the CPU to write to its write-through cache without cessive memory writes complete at the rate of six stalling the CPU as long as the write buffer is not fu ll. cycles, and the CPU stalls whenever the write buffer is full. The memory system can sustain six-cycle Graphics writes, which results in a peak write bandwidth of Graphics on the DECstation 3100 is implemented in 11.1MB per second. a tightly integrated subsystem. Frame buffer mem­ The DRAM and VRAM arrays are implemented ory is a region of memory in the processor address with SIMMs. Each DRAM array contains 2MB of space - 256KB in a monochrome system and 1MB memory on a double-sided module. The VRAM in a color system. Less than half of the monochrome arrays contain either 1 megabit (Mb) (monochrome) frame bu ffer and three quarters of the color frame or 8Mb (color) of frame buffer memory on single­ buffer are displayed on the workstation monitor. sided modules. The remaining frame buffer memory may be used for storage of graphics data structures such as fonts. Ethernet The frame bu ffer memory is not parity protected. The Ethernetinterf ace on the DECstation 3100 con­ At boot time, the lJLTRIX operating system sists of a CMOS controller chip and a 64 KB buffer. detects the size of frame buffe r memory and The controller chip manages transmission and whether the system is monochrome or color. reception of packets through ring descriptors and Because frame buffer memory is cacheable and packet buffers located in the Ethernet buffer. The addressable in the same way as the dynamic ran­ buffer is time-multiplexed between the controller dom access memory (DRAM), the software is able chip and the workstation CPU. to achieve extremely high performance without Connection to the Ethernet is by a thick-wire or any special-purpose graphics hardware. ThinWire cable. A push-button switch on the rear of A color plane mask allows processor writes to the the system box selects the appropriate connector. color frame buffer to affect only specific bits of a pixel. This design allows modification of a given SCSI plane of the color frame buffer using only write The DECstation 3100 supplies a small computer cycles, which increases performance significantly. system interconnect (SCSI) as the interconnect for The graphics programmable cursor supports a storage peripherals. The workstation's SCSI inter­ 16-hy- 16 pixel, two-plane cursor. The cursor can face consists of a gate array controller chip and a take two forms: a 16-hy- 16 bit pattern or a crosshair 128KB buffer. The controller chip manages the SCSI whose lines may extend to the edges of the visible bus through selection, DMA data transfer, and dis­ raster or may be clipped to a programmed region. connect commands. The interface supports com­ The cursor in a color system may have up to three mand disconnect/reconnect and synchronous data colors, and the cursor in a monochrome system transfers at 4MB per second on the SCSI bus. The may have up to three gray-scale values. buffer is time-multiplexed between the controller chip and the workstation CPU. Memory An SCSI connector on the rear of the DECstation The DECstation 3100 supports 8MB to 24MB of byte­ 3100 system box allows connection of external SCSI parity protected memory in 4MB increments. The peripheral devices. Digital offers a 332MB disk, a memory system includes both the DRAM array and a 95 MB tape, and a 600 MB CDROM reader. Each of video random access memory (YRAM) frame buffer. these devices is pac kaged with power in its own The video frame buffer has the same memory access sidecar box.

Dtgitul Te clmicaljourrUll Vo l. J No. 2. Sp ring 199() 87 VA X 6000 Model 400 System

Ta ble 1 Comparison of RISC System Performance

DECstation Sun Sun MIPS 3100 4/1 10 4/260 M/1 20-5

Dhrystones/second 22.7K 12.8K 16.7K 24.7K Linpack single precision 3.7 0.95 1.3 4.0 (MFLOPs)

Unpack double precision 1.6 0.57 0.89 2.1 (MFLOPs) Stanford small integer benchmark 0.1 15 0.220 0.150 0.118 (seconds) Digital Review's CPU 2 benchmark 6.91 18.99 13.71 NA suites (seconds) XLIB graphics performance rate 4.9K NA 0.7K NA

Serial Lines Table 2 DECstation 3100 1/0 Performance Four serial line interfaces are present and are pro­ Subsystem grammable from SO to 9600 bits per second. The Peripheral 110 Performance serial transmittt:rs art: double buffered , and the receivers shart: a 64-entry FIFO. The workstation RZ55 file reads 930KB per second uses one serial line for the kt:yboard and another for (with read-ahead the mouse. One serial line, designed for modem microcode)· use, supports data-terminal-ready/data-set-ready (DTR!DSR) control signals. RZ55 file reads 330KB per second RZ55 writes 320KB per second

Software NFS file reads 450KB per second The DEC:station 3100 runs the standard software (with read-ahead expected by w;ers of NIX operating systems as RZ55 microcode) well as software that allows easy networking and NFS file reads 330KB per second windowing of VAX and 3100 systems. Digital's NFS file writes 320KB per second ULTRIX operating system is compatible with Bt:rkeley v4 .3, AT&T System V, and is compliant IP/UDP 1500-byte 800KB per second with POSIX standards. OECwindows software runs packets on the DEC:station 3100 in both the monochrome IP/UDP 64-byte 1 000 packets per second and color configurations and integrates searnlessly packets VA X with DEC:windows running on systems with IPfTCP end-to-e nd 350KB per second VMS operating systems. ULTRIX supports DE\.net,

TCP/IP, and NrS. Compilers include the C, • RZ55 read-ahead microcode has not yet been FORTRA N, and PASCAL compilers adapted from the released by Digital Equipment Corporation. compilers from MIPS Computer Systems. Conclusion Performance One of the delights of the DECstation 3100 project Table 1 lists key performance measures of the was that we were building the machine that we OECstation 3100 workstation. For comparison pur­ ourselves had wanted for a very long time. By using poses, the table also lists the performance of other a small, fo cused core design team and resisting RISC systems, namely the Sun 4/110 and 4/260 and incremental additions, we achieved the aggressive tht: MIPS M/120-S. time-to-market goal. l:tble 2 lists the SCSI and network subsystem periphera ls 1/0 performance of the DE< :station 3100.

88 Vo l. 2 No. 2, Spring 1990 DigikllTe cbnicaljournal LarryB. We ber I

Compiler Op timization in RISC Sy stems

Compiler op timization detennines the level of RJSC system performance. Thearchi­ tectural design of compilersfr om MIPS Computer Systems, Inc., combined with sup­ port tools fa cilitates compiler op timization and overall system throughp ut. The compiler design takes advantage of small and high-speed cache memory to enhance performance. Thecord toolpositions theprogram in memory to ensure that the most frequently used memory locations never compete fo r the same cache locations. Portability is crucial to compiler effectiveness. MIPS compilers plemen im t many industry-wide extensions to the standard languages to make them compatible with other implementations.

RISC (reduced instruction set computer) system not directly correspond to the machine's features. performance embodies many components. In addi­ Compilers must also deal with programs that do not tion to the performance of individual instructions, take best advantage of these features. The optimizer the processor architect must consider how the is the portion of the compiler that deals with perfor­ compiler combines the instructions, how system mance issues. vendors construct the memory system, and how the user writes programs. Of particular importance is Op timization BoostsPer formance how well the compiler optimizes programs for a Optimization occurs at many stages within the given hardware architecture. In addition, program compiler. Some optimizations are best done at the portability is essential to ease the burden of moving front end of the compilation process when first applications to new systems. By considering all processing the source program. These optimi­ such aspects of system performance, the processor zations are called language-dependent optimiza­ architect can use the fu ll potential of a RISC system. tions because they rely on features unique to a Traditional CISC (complex instruction set com­ specific programming language. Other optimiza­ puter) processors were developed without signifi­ tions, called machine-dependent optimizations, are cant information on how high-level programming performed late in the process because they require languages would use them. In contrast, RISC archi­ detailed information about the target machine and tects make trade-offs between microprocessor how the program actually uses that machine. Still, structures and compiler complexity, with the goal other optimizations are independent of both the being overall system performance. source language and the target machine. The compiler is the key link between the archi­ Compilers from MIPS Computer Systems consist tect and the system user. Therefore, it is essential of several independent front ends that convert that the processor architect completely understand individual languages into a common intermediate the compiler and its capabilities. code called ucode. (See Figure 1.) MIPS currently Compilation is the process of converting high­ supports six programming languages: ADA, c, level source code written in a programming COBOL, FORTRAN,PAS CAL, and PLI; ANSI C and C++ language into machine code for a target machine. will be available in 1990. The common back end This process must consider translating the instruc­ performs the bulk of the optimizationand generates tions correctly into machine language, as well as machine code. into optimal machine language. Often, the high­ The common back end of the compiler uses a level language masks the primitive level of the target variety of optimization techniques that require machine by providing programming tools that do varying amounts of information. The compiler

Reprinted with the permission of f:.SD Magazine, December must gather the information from the source code 1989, Digital Design Publishing, Westborough, MA 01581. and analyze it. Peephole optimizations require the

Digital Tecbnical]ournal Vo l. 2 No. 2, Spr ing 1990 89 VAX6000 Model 400 System

1986

1986 UCODE LINKER INTERCOMPILATION UNIT OPTIMIZATION FUNCTION IN-LINE }

GLOBAL OPTIMIZER GLOBAL OPTIMIZATION

CODE GENERATOR LOCAL OPTIMIZATION

PIPELINE SCHEDULER PEEPHOLE OPTIMIZATION

Figure 1 CompilerStructure

least amount of information-usually only an The -00 option disables the optimization nor­ instruction or two. Global optimization requires mally performed by the code generator and the most information; it must take into account assembler. The -01 option (the default) designates control tl ow (the branching and looping stmcture minimal and fast op timization. Under this option, of a program) and data tlow (the data usage within the code generator and assembler perform local each section of the program). lntercompilation unit optimizations within basic blocks. Apart from tradi­ optimizations represent an extreme fo rm of global tional local optimizations such as local common optimization that occurs between independent subexpression, expression simplification, constant source files. Local optimization requires an interme­ fo lding, dead code elimination, and peephole opti­ diate amount of information - usually data usage mizations, the code generator performs branch and within a group of consecutive statements. Ta ble I label optimization, and the assembler performs lists optimiz:.�tionsshared in the MIPS compilers. architecture-dependent pipeline scheduling. Com­ One of four optimization levels (-00, -01 , -02, or pilation speed does not lengthen noticeably -0:)) can invoke the MIPS compiler. The levels between -00 and -01 . Thus, -00 is sddom needed indicate the relative compilation speed - not the and is used mostly for comparison studies. importance - of the various optimization classes Option -02 adds the uopt phase to the compi­ that the compiler can implement. for example, a lation stream to perform global optimization and program compiled at the -00 level, which specifics register allocation within the fu ll range of individ­ no optimization, would compile faster than a pro­ ual procedures. Compilation time might lengthen gram compiled at the -03 level, which offers the full substantially because global data-tlow analysis and range of optimizations. These optimization kvels coloring register-allocation algorithms are invoked. also correspond closely to rhe components invoked Supporting optimizations across multiple source during the compilation process. files is the -03 option. This option adds the uld

90 Vo l. 2 No. 2. Spring 1990 Digital Te cbnicaljournal Comp iler Op timization in RISC Sys tems

Ta ble 1 Optimization Methods pie contiguous words into the cache on a miss , and the sequential nature of instruction execution takes Peephole Instruction scheduling optimization Instruction selection advantage of this locality. Instruction substitution Register optimization makes the most significant Local Call/return selection architecture/compiler performance enhancement . optimization Branch-to-branch optimization Because the optimizer knows that data allocated in Local subexpression elimination registers can be accessed without delay, it places the Constant folding most frequently used variables into registers. The Expression simplification Dead code elimination optimizer computes the lifetime of individual data Local register assignment items and replaces the memory usage with a register Short-circuit evaluation usage. The relatively large number of registers Global Invariant code removal makes it likely that the optimizer can successfu lly optimization Strength reduction promote the most frequently referenced variables Global register assignment into a register. (There are thirty two 32-bit integer Global su bexpression elimination Shrink-wrapping register saving registers and sixteen 64-bit floating-point registers.) Linear test replacement One of the most important architectural features Loop unro lling used to improve performance is the instruction Ta il recursion pipeline. Although several cycles arc required to Copy propagation actually complete the instruction, the processor can Redundant store elimination be viewed as if each instruction takes only one cycle lntercompilation lnterprocedural register allocation unit optimization In-line expansion of procedures because a new instruction is started each cycle. The compiler is aware of several exceptions; for exam­ ple, load instructions require one instruction between the load and the use of the data loaded. phase, which comhines separate compilation units This load-delay slot is used by the compiler for into a single fi le at the ucode level. Thus the option another instruction - effectively, the cache mem­ enables multimodule programs to achieve the same ory access of the load executes in parallel with the degree of optimization as single-module programs. other instruction. This technique is called instruc­ The umerge phase selectively expands procedure tion scheduling. Other examples of parallel instruc­ calls by in-line substitution. The resulting ucode tion execution include object is then sent into normal back-end optimi­ zation and compilation stream starting with uopt. • Overlapping a branch instruction with another The -03 option causes uopr to perform interproce­ operation dural register allocation; uopt also benefits from the • Testing for division by zero while the divide is in complete information in the linked ucode file to progress perform the other global optimizations normally associated with this phase. • Executing several different fl oating point oper­ MIPS Local Ha rdware Architecture ations simultaneously (The processor has separate fl oating point ad d, mu ltiply, and divide MIPS' architectural design fa cilitates compiler opti­ units.) mization and overall system throughput. Important to system performance is the memory hierarchy. A The MIPS architecmre does not have condition split cache provides independent access to both codes. Although this seems unusual compared with instructions and data in a single cycle. A single com­ many other machines, this design actually improves bined cache would limit the processor ro obtaining performance. The architecture provides branch only a single instruction or data item each cycle. instructions that both rest a condition and then The compiler design takes into consideration the branch. Thus, the compiler must generate only one effect of cache memory. On average, instruction instruction for conditional branches rather than the references cache miss less frequently than data ref­ two instructions usually required (one to test for the erences ; this observation allows the compiler to condition, then another to perform the branch). prefer slightly longer instruction sequences if they Only comparisons that compare larger (or smaller) avoid extra data references. The instruction cache values between two registers, or a register and miss rate is lower because the hardware loads multi- inuned iate value, cannot be handled this way. The

Digital Te cbnicaljournal Vol. 2 No. 2, Sp ring 1990 91 VAX 6000 Model 400 System

compiler restrucrures most comparisons ro avoid sent hardw:.�re 110 structures when being refer­ this case, rhus decreasing average test time. enced by an optimizing com piler. Consider an I/0 Instructions in a CISC processor often have device register that is used to define the status of widely varying execution times. This difference the I/0 device. The optimizing compiler would see makes it hard to determine which of several alterna­ multiple references to the address without inter­ tive sequences is actually fastest. Because almost all vening assignments. The compiler would cleverly RISC instructions take the same time. the optimizer (b ur incorrectly) optimize all references to a proces­ can select the fastest sequence with relative ease. sor register. MIPS has added the key word volatile to indicate to the compiler that this variable changes in Program Portability ways that the compiler cannot detect. This exten­ To make a system useful, programs must be ported sion was recently incorporated into the current onto the processor. The usc of UNIX operat­ ANSI standard for c, bur it was added to the MIPS c ing systems and standardized languages such as compiler four years ago . FORTRAN has tremendously improved program portability. However, incorrect programs may have Toolsto Development latent bugs that are masked by a nain: compiler. Fundamental to system development is a tool set Thus, an optimizing compiler tends to expose more that aids in compiling, debugging, performance tun­ of these problems. A naive FORTRAN co mpiler may ing, and system construction (bring-up). MIPS' tool assign all variables to memory locations, giving set includes those tools traditionally found in UNIX variables predictable initial values. An incorrect operating systems, as well as tools unique to MIPS. program that relics on these initial values will fa il The multiple front-end, common back-end con­ when an optimizing compiler assigns a variable to a struction of the MIPS compilers provides a consis­ fast register that tenetsto have unpredictable values. tent set of languages to the developer. Options and An optimizer converts the program (as written) flags :tre the same across all languages. In fact, cc into ont> that is kkntical except that it executes can usually be used to compile programs in the faster. To do this, tht> optimizer must make assump­ other languages. All languages share a common tions about what the programmer intended . Often linkage convention that makes it easy ro write or the programmer tlepcnds on experience with previ­ port programs written in two or more languages. ous compiler implementations rather than the rules The l"\IIX tool make provides a convenient of the language. method of program development. This tool iden­ The :viii'S compiler suite provides a number of tifies which source modules of a program changed options to allow a program to run without modifi­ and recompiles onl y those modules. In recom­ cation in the presence of such common errors. This piling, make provides the correct co mpilation permits a program to be ported quickly, giving the options. It also provides the complete set of compi­ programmer a choice as to when to correct the lation options; for example, the debugging option problem. can be enabled through a make target to trou­ Traditionally, system vendors have added unique bleshoot a program. Later, the debugging options extensions to their language implementations. can be disabled, and higher levels of optimization While these may be a boon to a programmer when can be specified in compiling the production writing the program, they can be a bane when it version of the program. The make rules file, which comes time to port the program. The MIPS com­ accompanies the source program modules, deter­ pilers implement many industry-wide extensions to mines how the modules are to be combined into the the stanclarcl languages to make them compatible final run program. This prior determination elimi­ with other implementations. An important set of nates the need for detailed written documentation extensions is the support of Digital's VAX FORTRAN or programmer support , making it simpler for extensions. Another is the inclusion of I BM 's PL l developers to exchange source programs. extensions to permit an independent software Crucial to efficient program development is vendor to port a 1.8-million-line PLl progr:tm to the a source-oriented debugger. MIPS provides an MIPS architecture. extended version of dbx which supports all pro­ Wherever extensions arc required, Mil'S chooses gramming languages. (ADA has a special debugger.) proposed extensions for similar functions that are The debugger provides features such as print­ being considered by the standards committees. An ing variables as they change, or replaying a debug­ example of this last situation is the need to repre- ging session to a certain point before continuing

92 Vol 2 Nn. 1. .\jJring /'J'JO Digital Te cbnica/journal Comp iler Op timization in RISC Sy stems

the debug session. The user can view and ed it the instrumented program, the counters are dumped source, as well as see each statement as it is to a file for analysis by several programs. Prof dis­ executed. plays tables of interesting information such as the The debugger also lets the user view the program following: in both the high-level language in which it was • Number of CPU cycles for each source line written and in the generated machine language. Thus, the user can set a breakpoint on a specified • Number of times each function is called statement or instruction. Full debugging facilities • Average number of cycles in each call to a function are available only when the debugging option is Figure 2 shows examples of prof output Listings. specified. However, the compiler (by default) main­ A second program, pixstats , takes the same counts tains the line number tables in a compressed format and displays information about the program in in the load module even when the option is not architectural terms, such as : speci fied . Retention of the line number tables per­ mits panial debugging without the need to recom­ • Number of cycles used for each instruction pile the program using the debugging option. Line • Number of unused delay slots number information is kept for each instruction, permitting the instruction scheduler to move the • Number of FLOPS instruction while keeping track of the line that • A general indication of cache locality originated it. These two programs assume a perfect memory Specification of both debugging and optimization system, that is, no effect due to cache misses. options can create conflicts. For example, it is possi­ For a complete analysis, a cache simulator is also ble fo r a bug to appear only when a higher level of available that uses pixie to provide a memory optimization is specified . Moreover, optimizations address trace to the simulator to model the mem­ such as register allocation can confuse the debug­ ory system. MIPS uses this technique to plan new ger, because variables are not at the locations that machine designs. Each proposed change to the it assumes. To avoid this situation, the MIPS compil­ system requires a detailed simulation to exhibit its ers disable any optimization that interferes with effect. When a MIPS designer is convinced that a debugging when both debugging and optimization balanced and optimal point has been fo und, imple­ options are specified. To debug the problem that mentation begins. Experience with this technique shows up only in optimized code, a special option has shown an accuracy of better than 4 percent permits both the debugger and optimization to be when comparing predicted performance ro actual enabled. This technique requires the developer to performance. use caution when inquiring about the contents of a variable. To tune a program for optimal performance, the An Op timiza tion to Improve Cache developer must learn where in a program the time Perfo rmance is spent and why. Traditional UNIX systems provide An area seldom add ressed in compilers is the opti­ a method called pc sampling. With this technique, mization of programs in memory to improve cache the system must interrupt a program at regular hit rates. Modern microprocessor performance has intervals (usually 60 or 100 times per second) and been increasing faster than the supporting memory increment a per-location counter. After enough systems. Ta king advantage of this higher perfor­ sample points are taken, a pattern of execution time mance without introducing costly memories has emerges. This method has a severe flaw because required the use of small and high-speed cache modern processors execute 10 to 20 million instruc­ memory. Cache memory contains recently used tions per second; this means the number of samples instructions and data. Hardware substitutes cache is less than 1 in 100,000. The method requires a very memory if the desired word is in the cache. This long execution time ro collect a statistically mean­ substitution is invisible (other than performance ingful sample. improvements) to the program. Each main memory Although MIPS provides the pc sampling tech­ location must share a cache location with other nique, it also furnishes a more exacting method. memory locations because the cache is smaller than Pixie is a tool that takes an executable load module main memory. For a cache to be effective, it must and prepares it for measurement by inserting a contain enough of the program or data to ensure counrer in every basic block. After running the multiple reuse of the instructions or data.

Digital Te clmical]ournal Vo l. 2 No. 2, Spring 1990 93 VAX 6000 Mode1 400 System

eycl

TOTAL NUMBER OF cyc le >-.. l'L09 19.09 �1751 �ot----- 1 PROGRAM CYCLES

cycles \cyc les cum \ cycles bytes procedu re (file) /ll/'\e ______/c, lil _ ___

THE MOST HEAVILY USED LINE JO H eQln L . /t.cxt. "iput• • C) IS 120. LOCATED IN PROCEDURE 23 2"1 nt.-d ehdt" I ext input. e) .. . write_char, COMPILED FROM 8 writ.; ch.Hs I . /texto t.puL.cl - SOURCE FILE lexloutpul.c. l � writ. e i n teger ( ../t�xtoUlJJ ut .c) write d�.1r ( , ./text.output .c) r:..::..._-=..:.:.______:_o:c:_ _:..:..,:rltt:-st. r i l'lg . ,/telOI�O'-ltpUt. .c) - o:�odl;;- 1 ../t.e .lnput .c:l wrItut .cl read ch write 'tri:-:q (. ./text.output. .cJ J>O .. 571050 O.J'J � 6 1> 0.00 lOC.OO - 95. 100 • write .. 561855 0.38 95.84 lJ 0.00 .00 -ch.- rs { . . /t.ext.out.put. . c) ll 0.00 1 00 , 00 wrJte eh,us ( ../t.ttxt.output. .c) " . 561855 0.38 96..21 6 0.00 100.00 write=ch.:tr:� ( •./t.e"toutp ut..c) " 28 481387 0.33 96..55 > 0.00 100.00 wr!te chars 1 ../te "t.out.pot.c) J8 10 348150 0.24 96.1'} 0 . 00 100.00 m• in (f!Jdont. .pl )J 1 00 348000 0.23 97 .02

Figure 2 Examples of prof OutputLi stings

A valuable optimization is to position the pro­ ond, the MIPS operating system places virtual pages gram in memory so that the most frequently used in physical memory so that adjacent virtual pages memory locations never compete for the same map to adjacent cache pages. As a result, the place­ cache locations. MIPS has built a tool called cord ment of functions by cord has very predictable that rearranges the program to improve instruction effects on the cache. cache utilization. This tool is made possible through Figure 3 also shows the arrangement in an the existence of precise profilingtools. unexpected way. Rather than placing the densest To use cord, the programmer compiles the pro­ function (A) at the beginning of memory, it is placed gram in the usual way. Pixie is used to add counters farthest from the start of the cache. This arrange- to the program for each basic block. After exe­ cuting the instrumented program, prof is run with an option that creates a file containing dynamic 80 execution information. That file is given to cord, 70 along with the original executable module. 60 >- Cord computes the density for each function f- 50 (j, B z (procedure). The density is defined as the average UJ 40 c Cl D number of cycles executed by each instruction in 30 the function. Figure 3 is an example of eight 20 fu nctions, their sizes, cycle counts, and density. 10 OK 64K Cord then creates a new executable module after sorting the functions according to density. Figure 3 MEMORY ADDRESSES also shows the order of the functions in the rear­ KEY ranged program. NAME SIZE CYCLES DENSITY This sort improves cache hit rate because it A 12K 960K 80 24K 1680K 70 places the fu nctions that use the most cycles in B c 20K 1200K 60 memory so they do not compete for the same cache D 8K 400K 50 E 16K 640K 40 location as other frequently executing functions. F 12K 360K 30 The effectiveness of sort is helped by two other G 16K 320K 20 H 20K 200K 10 fe atures in the MIPS architecture. First, the caches are direct-mapped to memory so that each memory location corresponds to a single cache location. Sec- Figure 3 Cache PerformanceIm proved by cord

94 Vo l. 2 No. 2, Sp ring 1990 Digital Technicaljournal Comp iler Op timiza tion in RISC Systems

ment has the effect of making the densest func­ Trends in Co mpilers tion share cache locations with the function Compilers are taking better advantage of the paral­ 128 kilobytes (twice the cache size) away (H). Cord lelism in today's RISC processors. Evidence of this improves performance by as much as 20 percent to can be seen in the scheduling of instructions to 30 percent on programs exceeding the size cache. lt capitalize on load and branch delays and multiple works solely by improving the efficiency of the floating point units. This trend will continue as it instruction cache. Methods to improve data cache becomes fe asible to build effective multiprocessor accesses are not available yet because of the more systems. In this area, compilers that will partition a random nature of data accesses and difficulty in get­ problem across multiple processors - each per­ ting accurate data reference information. forming a portion of the iterations of a loop - will An advanced architectural simulator, Sable mod­ be seen. A major challenge will be to find ways to els (in C) the processor, including TLB, pipeline, use this kind of parallelism in nonengineering prob­ register set, and system design, incorporating the lems. These problems tend not to be loop intensive cache subsystem, main memory, and 110 interface. and will require a breakthrough in compiler tech­ Developers can customize Sable for a unique system nology for automatic parallelism. MIPS design. Sable has been used routinely at to Over the past five to ten years, the programming bring up the UNIX operating system before hard­ language C has come to the forefront as a major ware is available. Simulation with Sable assures that systems language. While C offers many advantages, the software is reliable and performs optimally it requires a user to deal with fairly primitive struc­ when the hardware is actually available. Sable can tures rather than abstractions. C++ will offer much be used with the debugger to provide full symbolic of the flexibility of C with the added capability of debugging in the simulation environment. Also, data abstraction. Sable can provide the same address traces that pixie It is expected that future compilers will take provides to analyze operating system performance. advantage of optimizations that reduce cache After a system has been brought up using Sable, misses. These optimizations include loop inter­ other tools can assist in constructing the system on change, which reorders the accesses to an array to the real hardware. A simple debug monitor is improve locality of data references, and software available to work with the symbolic debugger to pipelining, which takes better advantage of over­ provide a symbolic debugging environment on the lapping memory accesses and computation. real hardware.

Digital Te cbnicaljow71al Vo l. 2 !Vo. 2, Sp t·ing 1990 95 I FurtherRea dings

The Digital Tec hnictiJouroal Digital Press is Digital Equipment Corporation's publishes papers that e.\jJ/ore imcrnat.ional publisher of books for computer pro­ the technologiculjimndatitms fessionals. Copies of t he new titles now available of Digitals maj( >rpn Jducts. from Digital Press that are listed below can he Each jo urnalfo cuses on at least ordered by writing ro Digital Press, Department one product area andpresents DTJ, 12 Crosby Drive, Bedford, MA 01730, U.S.A. a compilation oj j; apers written COMMON LISP: The Language by !be eng ineers who de t,eloped Guy Steele Jr., Second Edition. 1990 the product. The contentfo r tbe (538.95 in soft cover. 544 .95 in clothcover) journalis selected by thejo urnal Advis01y Board. The Matrix: Computer Networks and Conferencing Systems Wo rldwide John Quarterman, 1990 (549.95)

Topics covered in previous issues of the IJ(�ital UNIX for VMS Users Te cbnicaljournal are as follows: Philip Bourne, 1990 (S2B.95)

The VMSUser's Guide VA X 8600 Processor James Peters and Patrick Holmay, 1990 ($23.00) Vo l. I, No . 1, August NH5 A Beginner's Guide to VA X VMS Utilities MicroVA X II System and Applications vbl. 1, No . 2, J'vlarch 1')86 Ronald Sawey anu Troy Stokes, 1989 ($23.00) Networking Products VMS Internals and Data Structures: HJ/. I ..Yo . 3, 5>eptenzber Jl}86 Ve rsion 5 Update X press VA X 8800 Family Ruth Goldenberg and Lawrence Kenah, Vo l. I, No . 4, Februcny N8' Vo lumes I, 2, and 3, 1989 (535.00)

VA Xcluster Systems VA XNMS Internals and Data Structures: Vo l. I, No. 5, September 1')87 Ve rsion 4.4 Lawrence Kenah, Ruth Goldenberg, and Software Productivity To ols Simon Bate, 1988 ($75.00) Vo l. I, No . 6, Fe brumy 1988 Digital Guide to Software Development CVAX-based Systems Corporate l ser Publications Group of Digital HJ I. 1, No. 7, August /988 Equipment Corporation, 1990 ($27.95) Storage Te chnology Te chnical Aspects of Data Conununication Vo l. I, No. 8, Fe bruary 1989 John McNamarJ, Third Edition, 1988 (S42.00) Distributed Systems Infonnation Te chnology Standardization: Vo l. I, No . 9, ju ne 1989 Theory, Practice, and Organizations Compound Document Architecture Carl Cargill, 1989 (S24 .95) Vo l. 2, No . 1, Winter 1990 Computer Programming and Architecture: The VA X See the inside front cover of this issue for subscrip­ Henry Levy and Richard Eckhouse, tion info rmation. Second Ed ition, 1989 ($24 .95)

Single copies and past issues of the Digital Te chnical ABCs of MUMPS: An Introduction for Novice jo urnalcan be ordered from Digital Press at a cost and Intennediate Programmers of S 16.00 per copy. Richard Walters, 1989 ($24 .95)

96 \lu i. 2 Nn. 2. Sp ring /'J'JIJ Digital Te cbnicaljournal ISSN 0898-90 1X

Printed in U.S.A. EY-C 197E-DP/')(} Oi 02 30.0 RUO Copyright 1990 Digital Equipment Corpora tion All Rights Reserved