• AlphaServer Multiprocessing Systems • DEC OSF/1 Symmetric Multiprocessing • Scientific Computing Optimizations for Alpha

Digital Technical Journal

Digital Equipment Corporation

Volume 6 Number 3

Summer 1994 Editorial Jane C. Blake, Managing Editor Helen L. Patterson, Editor Kathleen M. Stetson, Editor

Circulation Catherine M. Phillips, Administrator Dorothea B. Cassady, Secretary

Production Terri Autieri, Production Editor Anne S. Katzeff, Typographer Peter R. Woodbury, Illustrator

Advisory Board Samuel H. Fuller, Chairman Richard W. Beane Donald Z. Harbert Richard J. Hollingsworth Alan G. Nemeth Jean A. Proulx Jeffrey H. Rudy Stan Smits Robert M. Supnik Gayn B. Winters

The Digital Technical Journal is a refereed journal published quarterly by Digital Equipment Corporation, 30 Porter Road LJO2/D10, Littleton, Massachusetts 01460. Subscriptions to the Journal are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Technical Journal at the published-by address. Inquiries can also be sent electronically to [email protected]. Single copies and back issues are available for $16.00 each by calling DECdirect at 1-800-DIGITAL (1-800-344-4825). Recent back issues of the Journal are also available on the Internet at http://www.digital.com/info/DTJ/home.html. Complete Digital Internet listings can be obtained by sending an electronic mail message to [email protected].

Digital employees may order subscriptions through Readers Choice by entering VTX PROFILE at the system prompt.

Comments on the content of any paper are welcomed and may be sent to the managing editor at the published-by or network address.

Copyright © 1994 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.

The information in the Journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation or by the companies herein represented. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the Journal. ISSN 0898-901X

Documentation Number EY-S799E-TJ

The following are trademarks of Digital Equipment Corporation: Alpha, AlphaServer, DEC, DEC Fortran, DEC OSF/1, DECpc, DECthreads, Digital, the DIGITAL logo, MicroVAX, OpenVMS, StorageWorks, and VAX.

CRAY-1 is a registered trademark of Cray Research, Inc.
Intel is a trademark of Intel Corporation.
KAP is a trademark of Kuck & Associates, Inc.
Microsoft and MS-DOS are registered trademarks and Windows NT is a trademark of Microsoft Corporation.
MIPS is a trademark of MIPS Computer Systems, Inc.
Multimax is a trademark of Encore Computer Corporation.
Open Software Foundation is a trademark and OSF/1 is a registered trademark of Open Software Foundation, Inc.
PAL is a registered trademark of Advanced Micro Devices, Inc.
SPECfp, SPECint, and SPECmark are registered trademarks of the Standard Performance Evaluation Council.
SPICE is a trademark of the University of California at Berkeley.
TPC-A and TPC-C are trademarks of the Transaction Processing Performance Council.
UNIX is a registered trademark licensed exclusively by X/Open Company, Ltd.

Book production was done by Quantic Communications, Inc.

Cover Design
The cover design captures two major concepts in this issue: symmetry and parallelism. At the hardware level, the AlphaServer multiprocessing systems provide symmetrical access to hardware system resources. As processors are added to the multiprocessing system, the DEC OSF/1 operating system provides the parallelism that allows applications to take advantage of the added processor power. The KAP preprocessor also provides parallelism, specifically for programs running on symmetric multiprocessing systems. In each case, symmetry and parallelism are among the keys to achieving designs that offer the highest levels of performance.

The cover was designed by Joe Pozerycki, Jr., of Digital's Design Group.

Contents

6 Foreword Steve Holmes

AlphaServer Multiprocessing Systems

8 Design of the AlphaServer Multiprocessor Systems Fidelma M. Hayes

20 The AlphaServer 2100 I/O Subsystem Andrew P. Russo

DEC OSF/1 Symmetric Multiprocessing

29 DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation Jeffrey M. Denham, Paula Long, and James A. Woodward

Scientific Computing Optimizations for Alpha

44 DXML: A High-performance Scientific Subroutine Library Chandrika Kamath, Roy Ho, and Dwight P. Manley

57 The KAP Parallelizer for DEC Fortran and DEC C Programs Robert H. Kuhn, Bruce Leasure, and Sanjiv M. Shah

Editor's Introduction

Designs that capitalize on Digital's 64-bit Alpha RISC processors or that enhance the performance of scientific applications are the subjects of papers in this issue. Featured topics include the well-received AlphaServer multiprocessing systems, the DEC OSF/1 symmetric multiprocessing operating system, a high-performance math library, and a preprocessor program developed by Kuck & Associates, Inc.

To develop a price/performance leader for the server market, designers of the AlphaServer 2100 and 2000 multiprocessing systems had to make decisions that were at once creative, pragmatic, and timely. Fidelma Hayes, an engineering manager for the Server Group, presents an overview of these high-performance servers that incorporate Alpha RISC technology and PC-style I/O subsystems, and support three operating systems: Microsoft's Windows NT, DEC OSF/1, and OpenVMS. Because of the engineering team's persistent focus on performance, cost, and time-to-market, all these goals for the AlphaServer systems were surpassed.

Introducing two PC buses in the AlphaServer multiprocessing system was an important factor in market success and an interesting engineering challenge. Andy Russo discusses the benefits of a dual-level I/O structure that contains both the widely used EISA bus and the newer high-performance PCI bus that connects to a 128-bit multiprocessing system bus. He describes several innovative techniques that promote efficiency in the hierarchical bus structure, the advantages offered by the selection of bus bridges (one custom ASIC and one standard chip set), and the I/O interrupt scheme that combines familiar technology with custom support logic.

The next paper presents the significant software work done to ensure high performance and reliability as CPUs are added to the 2100 and 2000 multiprocessing systems. Jeff Denham, Paula Long, and Jim Woodward first review the foundations of DEC OSF/1 version 3.0, Digital's implementation of UNIX for the AlphaServer multiprocessing systems. They then examine issues that arise when moving an operating system from a uniprocessor to a shared-memory SMP platform, in particular, the design team's efforts in lock-based synchronization and algorithm modifications aimed at parallelism within the operating system kernel.

The total impact of 64-bit RISC systems and operating system support for shared-memory SMP platforms is demonstrated by meeting the demands of scientific and technical applications. A tool for accelerating application performance on all Alpha systems is the DXML Extended Math Library. Chandrika Kamath, Roy Ho, and Dwight Manley briefly discuss the role of mathematical libraries and then present an overview of DXML components, which include both public domain BLAS and LAPACK libraries and Digital proprietary software. Using example routines, they explain optimization techniques that effectively exploit the memory hierarchy and provide substantial performance improvements.

Another tool for optimizing scientific application performance is KAP, a preprocessor to parallelize DEC Fortran and DEC C programs. As authors Bob Kuhn, Bruce Leasure, and Sanjiv Shah from Kuck & Associates describe it, the KAP product is a super-optimizer, performing optimizations at the source code level that go beyond those performed by the compilers. Their paper reviews adaptations to KAP for SMP systems and the key design aspects, such as data dependence analysis and the selection of loops to parallelize from among many in a program.

The editors thank Andrei Shishov, Mid-range Program Manager, for his help in developing this issue of the Journal.

Jane C. Blake
Managing Editor

Biographies

Jeffrey M. Denham A principal software engineer in the UNIX Software Group, Jeffrey Denham is a contributor to the DEC OSF/1 version 3.0 symmetric multiprocessing effort. Prior to this, he helped add POSIX.1b features to the DEC OSF/1 operating system and worked on the VAXELN real-time kernel. Jeff came to Digital in 1986 from Raytheon Corporation. He holds a B.A. (1979) from Hiram College, an M.A. (1980) from Tufts University, both in English, and an M.S. (1985) in Technical Communication from Rensselaer Polytechnic Institute.

Fidelma M. Hayes As an engineering manager in the Server Group, Fidelma Hayes led the development of the AlphaServer 2100 and AlphaServer 2000 systems. Prior to this work, she led the design of the DECsystem 5100. She has contributed as a member of the development team for several projects, including the DECsystem 5800 CPU, the PRISM system design, and the MicroVAX 3100. Fidelma joined Digital in 1984 after receiving a bachelor's degree in electrical engineering from University College Cork, Ireland. She is currently working toward a master's degree in computer science at Boston University.

Roy Ho As a principal software engineer in Digital's High Performance Computing Group, Roy Ho developed the signal-processing routines used in DXML. Prior to this work, he was a member of the High Performance Computing Technology Group. There he designed the clock distribution system for the VAX fault-tolerant system and the delay estimation software package for the VAX 9000 system boards. Roy has B.S. (1985) and M.S. (1987) degrees in electrical engineering from Rensselaer Polytechnic Institute. He joined Digital in 1987.

Chandrika Kamath Chandrika Kamath is a member of the Applied Computational Mathematics Group. She has designed and implemented the sparse linear solver packages that are included in DXML. She has also optimized customer benchmarks for Alpha systems. Chandrika holds a Bachelor of Technology in electrical engineering (1981) from the Indian Institute of Technology, an M.S. in computer science (1984) and a Ph.D. in computer science (1986), both from the University of Illinois at Urbana-Champaign. She has published several papers on numerical algorithms for parallel computers.


Robert H. Kuhn Robert Kuhn joined Kuck & Associates as the Director of Products in 1992. His functions are to formulate technical business strategy and to manage product deliveries. From 1987 to 1992, he worked at Alliant Computer Systems, where he managed compiler development and application software for parallel processing. Bob received his Ph.D. in computer science from the University of Illinois at Champaign-Urbana in 1980. He is the author of several technical publications and has participated in organizing various technical conferences.

Bruce Leasure Bruce Leasure, one of three founders of Kuck & Associates in 1979, serves as Vice President of Technology and is the chief scientist for the company. As a charter member and executive director of the Parallel Computing Forum (PCF), a standards-setting consortium, he was a leader in efforts to standardize basic forms of parallelism. The PCF subsequently became the ANSI X3H5 standards committee.

Paula Long Since joining Digital in 1986, Paula Long has contributed to various operating system projects. Presently a principal software engineer with the UNIX Software Group, she leads the development of symmetric multiprocessing (SMP) capabilities for the DEC OSF/1 operating system. In previous positions, she led the DEC OSF/1 real-time and DECwindows on VAXELN projects. Paula received a B.S.C.S. from Westfield State College in 1983.

Dwight P. Manley Dwight Manley is a consulting software engineer in the Applied Computational Mathematics Group. He joined the DXML...

Andrew P. Russo Andy Russo is a principal hardware engineer in the AlphaServer Group. Since joining Digital in 1983, Andy has been a project leader for several internal organizations, including the Mid-range I/O Options Group, the Fault Tolerant Group, and the AlphaServer Group. While at Digital, Andy has contributed to the architecture and design of high-performance ASICs and modules to provide a variety of end-product requirements. Andy holds several patents and has authored two papers. He received a B.S. in computer engineering from Boston University.


Sanjiv M. Shah Sanjiv Shah received a B.S. in computer science and mathematics (1986) and an M.S. in computer science and engineering (1988) from the University of Michigan. In 1988, he joined Kuck & Associates' KAP development group as a research programmer. He has since been involved in researching and developing the KAP Fortran and C products and managing the KAP development group. Currently, Sanjiv leads the research and development for parallel KAP performance.

James A. Woodward Principal software engineer James Woodward is a member of the UNIX Software Group. He is responsible for DEC OSF/1 symmetric multiprocessing (SMP) processor scheduling and base kernel support. In previous work, Jim led the ULTRIX SMP project and the VAX 8200, VAX 8800, and VAX 6000 ULTRIX operating system ports. He also wrote microcode for VAX systems as a member of the Semiconductor Engineering Group. Jim joined Digital in 1981 after receiving a B.S.E.E. from the University of Michigan.

Foreword

Steve Holmes
Engineering Group Manager, Server Platform Development, and Director, Office Server Product Line

The engineering developments described in this issue represent the second of many planned generations of products that will be designed to fulfill Digital's Alpha vision. That vision is (a) to make Alpha systems open, and (b) to deliver a rich set of Alpha system products that lead the market both in performance and price/performance. It is heartening to see the vision being realized. It is yet more heartening to see it unfolding simultaneously with appreciable improvements in Digital's business practices. These combined events have already resulted in substantial market acceptance of Digital's AlphaServer products.

The particular set of papers in this issue is fortuitous in that it demonstrates the large number of individuals and range of engineering skills required to bring about an industry phenomenon such as Alpha. Included are papers focused on the AlphaServer multiprocessing systems, on the symmetric multiprocessing implementation of the DEC OSF/1 operating system, on the optimization of mathematical subroutine libraries for the Alpha architecture, and on the KAP preprocessor. If one can imagine these technical efforts multiplied manyfold, the scope of the Alpha undertaking will emerge.

The first generation of products based on the Alpha architecture was introduced in 1992. The AlphaServer 2100 system and DEC OSF/1 SMP operating system, introduced in mid-1994, together represent the beginning of the second-generation Alpha server products. The overarching development goal was to give our present and future customers a compelling reason to buy. The resultant direction was to provide very low cost multiprocessing system capability with industry-standard open I/O buses, in this case PCI and EISA. To capitalize on these attributes and to ensure that a complete solution was delivered, the engineering teams maintained a customer-focused perspective. It is this perspective that has enabled the AlphaServer 2100 to achieve rapid market acceptance.

Truly, though, the most significant achievement for the present round of Alpha server products is this: a whole new standard of price/performance for the industry has been reached. Computing that in the past could have been performed only with very expensive high-end machines or extensive distributed networks is now performed by affordable AlphaServer systems. This price/performance breakthrough augments Digital's strong capabilities:

• A truly open environment that supports UNIX and Windows NT operating systems on Alpha systems

• The ongoing strength of the world's best full-featured commercial operating system, the OpenVMS system

• A world-class and worldwide service and delivery organization

• An extensive and growing network of channels

• Overall, Digital's renewed and meaningful commitment to be responsive to the demands and needs of the markets

All these developments are of direct and measurable benefit to our customers. All are guided by what the markets are telling us they want. The trend and pace of these enhancements will allow Digital to continue to deliver on the promise of the Alpha vision. Performance measurements, for example, SPECmark data and transaction-per-second tests, and competitive comparisons support the statements above. However, the case is made most convincingly by the early acceptance and rapid ramp up of AlphaServer 2100 system purchases by our customers. In the highly competitive server arena, success is being demonstrated daily.

If this were the end of the story, there would be much of which to be proud. In fact, there is more to come across the range of AlphaGeneration products, including PCs, clustering, operating systems, and networking. In the server area specifically, the recently announced AlphaServer 2000 increases the price/performance lead of the 2100 system. Processor and cache upgrades have increased the absolute performance of the family. Just around the corner are similar advances for other members of Digital's server products. A little further away are significant enhancements in our clustering capabilities and in our server management tools.

This is a very exciting and productive time in Digital's history. I would like to take this opportunity to offer a very enthusiastic thank-you to all whose work is represented in the accompanying technical papers, most especially to the AlphaServer 2100 development team, whose work I have had the privilege to observe since the team's formation. The hard work and dedication of everyone is recognized, appreciated, and needed for the future.

This foreword will conclude in favor of the substantive papers that detail the technical contributions made by the authors and their colleagues. It is my expectation that readers of this issue of the Digital Technical Journal will gain useful technical insights. It is my hope that they will also see, as I do, that the future of Digital computing is bright.

Fidelma M. Hayes

Design of the AlphaServer Multiprocessor Server Systems

Digital's AlphaServer multiprocessor systems are high-performance servers that combine multiprocessing technology with PC-style I/O subsystems. The system architecture allows four processing nodes, four memory nodes (up to a maximum of 2 GB), and two I/O nodes. All nodes communicate through a system bus. The system bus was designed to support multiple generations of Alpha processor technology. The architecture can be implemented in different ways, depending on the size of the system packaging.

The AlphaServer 2100 (large pedestal) and the AlphaServer 2000 (small pedestal) servers from Digital combine multiprocessing Alpha technology with an I/O subsystem traditionally associated with personal computers (PCs). The I/O subsystem in the AlphaServer systems is based on the Peripheral Component Interconnect (PCI) and the Extended Industry Standard Architecture (EISA) buses. All AlphaServer products, including the AlphaServer 2100 cabinet version, share common technology to support at least three generations of the Alpha processor. In addition, the servers support three operating systems: Microsoft's Windows NT version 3.5, and Digital's DEC OSF/1 version 3.0 (and higher) and OpenVMS version 6.1 (and higher).

The AlphaServer systems are designed to be general-purpose servers for PC local area network (LAN) and database applications. All models of the system use a common multiprocessing bus interconnect that supports different numbers of nodes, depending on the system configuration. The systems share a common CPU, memory, and I/O architecture. The number of CPUs, the amount of memory, the number of I/O slots, and the amount of internal storage vary depending on the mechanical packaging. The flexibility of the architecture allows the quick development of new and enhanced systems.

This paper discusses the transformation of a set of requirements into high-performance, cost-effective product implementations. The following section describes the evolution of the AlphaServer design from an advanced development project into a design project. The paper then describes the CPU module, the multiprocessor system bus, and the memory module. Subsequent sections discuss module and silicon technology and the high-availability features incorporated into the design. The paper ends with a performance summary and conclusions about the project.

Concept Development

The engineering investigations of a client-server system originated from a business need that Digital perceived when it introduced the first systems to incorporate the Alpha technology in late 1992. Among Digital's first products in the server market were the DEC 4000 high-performance departmental system, the DEC 3000 deskside workstation/server, and the EISA-based Alpha PC. The lack of an explicitly identified, general-purpose system for the mid-range system market generated many requests from Digital's MicroVAX II system customers. Requests from these customers propelled the AlphaServer product development effort.

From the beginning of the project, two major constraints were evident: The schedule required a product by mid-1994, and the budget was limited. Accordingly, the product team was required to leverage other developments or to find newer, less costly ways of achieving the product goals. Work on the AlphaServer systems started as a joint effort between an advanced development team and a business planning team. The business team developed market profiles and a list of features without which the system would not be competitive. The business team followed a market-driven pricing model. The profit expected from the system dictated the product cost for the system. This cost is referred to as "transfer cost." The business team's cost requirement was critical; if it could not be met, the project would be canceled.

Furthermore, the entry-level system was required to:

1. Support at least two CPUs, with performance for a single CPU to yield 120 SPECmarks and 100+ transactions per second (TPS) on the TPC-A benchmark.

2. Support at least 1 gigabyte (GB) of memory.

3. Support multiple I/O buses with at least six option slots supported on the base system.

4. Provide high-availability features such as redundant power supplies, redundant array of inexpensive disks (RAID), "warm swap" of drives, and clustering.

5. Provide a number of critical system connectivity options, including Ethernet, fiber distributed data interface (FDDI), and synchronous controllers.

6. Support the Windows NT, the DEC OSF/1, and the OpenVMS operating systems.

Given these criteria, the engineering team decided to base the development of the new server on concepts taken from two Digital products and combine them with the enclosures, power supplies, and options commonly associated with PCs. The DEC 4000 server is a multiprocessor system with a Futurebus+ I/O subsystem; it provided the basis for the multiprocessor bus design.1 The DECpc 150 PC is a uniprocessor system with an EISA I/O subsystem; it provided a model for designing an I/O subsystem capable of running the Windows NT operating system. The engineering team chose PC-style peripherals because of their low cost.

A strategic decision was made to incorporate the emerging PCI bus into the product in addition to the EISA bus. Major PC vendors had expressed high interest in its development, and they believed the bus would gain acceptance by the PC community. The PCI bus provides a high-performance, low-cost I/O channel that allows connections to many options such as small computer systems interface (SCSI) adapters and other common PC peripherals.

After the initial design had been completed, changing market and competitive environments imposed additional requirements on the design team:

1. The initial transfer cost goal was reduced by approximately 13 percent.

2. Support for a maximum of four processor modules was necessary.

To meet these new requirements, the design team had to modify the system design during the product development phase.

System Overview

The base architecture developed for Digital's AlphaServer multiprocessor systems allows four processing nodes, four memory nodes (up to a maximum of 2 GB), and two I/O nodes. All nodes communicate through a system bus. The system bus was designed to support multiple generations of Alpha processor technology. The architecture can be implemented in different ways, depending on the size of the system packaging. It is flexible enough to meet a variety of market needs. Two implementations of the architecture are the AlphaServer 2100 and the AlphaServer 2000 products. Figure 1 is a block diagram of the AlphaServer 2100 implementation of the architecture.

In the AlphaServer 2100 large pedestal server, the system bus supports eight nodes. It is implemented on a backplane that has seven slots. The seven slots can be configured to support up to four processors. Due to the number of slots available, the server supports only 1 GB of memory when four processors are installed. It supports the full 2 GB of memory with three processors or less. The eighth node, which is the system bus-to-PCI bridge, is resident on the backplane. This provides a 32-bit PCI bus that operates at 33 megahertz (MHz). It is referred to as the primary PCI bus on the system.

A second I/O bridge can be installed in one of the system bus slots. This option, which will be available in 1995, will provide a 64-bit PCI bus for the system. A 64-bit PCI is an extension of a 32-bit PCI bus with a wider data bus. It operates at 33 MHz and is completely interoperable with the 32-bit PCI specification.2 Options designed for the 32-bit PCI bus will also work in a 64-bit PCI slot.

EISA slots are supported through a bridge on the primary PCI bus on the system. Only one EISA bus can be supported in the system since many of the addresses used by EISA options are fixed.3 Support of a single EISA bus is not perceived as an issue given the migration from the EISA bus to the much higher performing PCI bus. The maximum supported bandwidth on an EISA bus is 33 megabytes per second (MB/s) versus the maximum bandwidth on a 32-bit PCI bus of 132 MB/s. The EISA bus is used in the system for support of older adapters that have not migrated to PCI.
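The peak rates quoted above follow directly from data-path width and bus clock. The short calculation below is an illustrative sketch, not from the original paper; it reproduces the 132-MB/s and 33-MB/s figures, assuming the standard 8.33-MHz EISA bus clock.

#include <stdio.h>

/* Peak bus bandwidth = data-path width x transfer clock.
 * Sustained rates on real systems are lower than these peaks. */
int main(void)
{
    double pci  = 4.0 * 33.0;  /* 32-bit PCI, one transfer per 33-MHz clock: 132 MB/s */
    double eisa = 4.0 * 8.33;  /* 32-bit EISA burst at the 8.33-MHz bus clock: ~33 MB/s */

    printf("PCI (32-bit, 33 MHz): %.0f MB/s\n", pci);
    printf("EISA (32-bit burst):  %.0f MB/s\n", eisa);
    return 0;
}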

[Block diagram: the system bus connects the CPU and memory modules to a system bus-to-PCI bridge; the 32-bit PCI bus carries the Ethernet controller and a PCI-to-EISA bridge; the EISA side includes 8259A-2 interrupt controllers, an 8242 keyboard and mouse controller, a floppy controller, a parallel port, and two serial ports; a serial control bus links the modules.]

Figure 1 Block Diagram of the AlphaServer 2100 System Architecture

The AlphaServer 2000 small pedestal system supports five nodes on the system bus. The backplane provides four system bus slots, allowing a maximum configuration of two processor modules and two memory modules. The system bus-to-PCI bridge resides on the backplane and is the fifth node. A system bus slot can also be used to support the optional second I/O bridge.

The AlphaServer 2100 cabinet system is a rackmountable version of the large pedestal AlphaServer 2100 system. The rackmountable unit provides a highly available configuration of the pedestal system. It incorporates two separate backplanes. One backplane supports eight system bus nodes that are implemented as seven system bus slots. The eighth node (the system bus-to-PCI bridge) resides on the backplane. The second backplane provides the I/O slots. The number and configuration of I/O slots are identical to the AlphaServer 2100 pedestal system. The rackmount unit provides minimal storage capacity. Additional storage is supported in the cabinet version through StorageWorks shelves. These storage shelves can be powered independently of the base system unit, providing a highly available configuration.

Table 1 gives the specifications for the AlphaServer 2100 and the AlphaServer 2000 pedestal systems. Information on the cabinet version is not included because its characteristics are similar to the AlphaServer 2100 large pedestal version. All multiprocessing members of the AlphaServer family use the same processor and memory modules and differ only in system packaging and backplane implementations. This illustrates the flexibility of the architecture developed for the system and decreases the development time for new models.

CPU Module

The CPU module contains an Alpha processor, a secondary cache, and bus interface application specific integrated circuits (ASICs). As previously mentioned, the system architecture allows multiple processor generations. Multiple variations of the processor module are available for the system, but different variations cannot be used in the same system. Software has timing loops that depend on the speed of the processor and cannot guarantee synchronization between processors of different speeds.

Table 1 AlphaServer System Specifications

Specifications                        Large Pedestal      Small Pedestal      Comments
                                      AlphaServer 2100    AlphaServer 2000
                                      System              System

Height, inches                        27.6                23.8
Width, inches                         16.9                16.9
Depth, inches                         31.9                25.6
Maximum DC power output,              600                 400                 Two possible per system in either
watts per supply                                                              redundant or current-shared mode
Number of system slots                7                   4
Number of processors supported        4                   2
Minimum memory                        32 MB               32 MB
Maximum memory                        2 GB                640 MB
Embedded I/O controllers supported    1                   1
Optional I/O controllers supported    1                   1
32-bit PCI slots                      3                   3
64-bit PCI slots (on separate         2                   2
I/O controller module)*
EISA slots                            8                   7
Serial ports                          2                   2
Parallel port                         1                   1
Ethernet ports (AUI and 10Base-T)     1                   Not integral        Up to 18 total network ports
                                                          to system           supported on system via PCI
                                                                              and EISA options
SCSI-2 controller                     1                   1
Removable media bays                  3                   2
Internal warm-swap drive slots        16                  8

* Future option


The CPU modules provide a range of performance and cost options for the system owner.

The cost-focused processor module uses the Alpha 21064 processor operating at 190 MHz. This chip was designed with Digital's fourth-generation complementary metal-oxide semiconductor (CMOS) process. With 15-ns static RAMs (SRAMs), the module allocates a read-and-write cycle time of 26.3 ns to the second-level cache. This is a five-times multiple of the CPU cycle time. The additional 11.3 ns is needed for round-trip etch delay and address buffer delay. The use of 12-ns SRAMs was considered, but the read-and-write cycle time would have to decrease to 21 ns (a four-times multiple of the CPU cycle time) to improve performance. The reduction of 3 ns was not sufficient to meet the timing requirements of the module; therefore, the less costly 15-ns SRAMs were used.

Higher performance processor modules are also available for the system. These modules are based on the Alpha 21064A processor, which was designed using fifth-generation CMOS technology. The Alpha 21064A processor module operates at 275 MHz. The processor has separate on-chip instruction and data caches. The 16-KB instruction cache is direct mapped, and the 16-KB data cache is a 2-way, set-associative cache. The backup cache holds 4 MB of memory. The combination of higher processor speed, larger internal on-chip caches, and a large second-level cache reduces the number of accesses to main memory and processes data at a higher rate. As a result, the performance of the system is increased by approximately 20 percent.

Multiprocessor System Bus

The technology developed for the system bus in the DEC 4000 departmental server provided the basis for the multiprocessor bus designed for the AlphaServer system.1 The system bus in the DEC 4000 product has the following features:

1. The 128-bit multiplexed address and data bus operates at a 24-ns cycle time. The bus runs synchronously.

2. The bus supports two CPU nodes, four memory nodes, and a single I/O node.

3. The bus supports addressing for block transfer only. A block is 32 bytes of data.

4. I/O is treated as either primary or secondary. Primary I/O refers to devices that could respond without stalling the system bus. This designation is restricted mainly to control and status registers (CSRs). Secondary I/O refers to a device that cannot respond immediately, which means it would stall the bus during a slow access. When a bus stalls, all accesses to CPUs and memories have to wait until the CSR access is complete. This could cause data to back up and potentially overflow. To avoid this state, either the system bus or the software device driver has to be pended.

A mailbox is a software mechanism that accomplishes "device driver pending." The processor builds a structure in main memory called the mailbox data structure. It describes the operation to be performed, e.g., CSR read of a byte. The processor then writes a pointer to this structure into a mailbox pointer register. The I/O node on the system bus reads the mailbox data structure, performs the operation specified, and returns status and any data to the structure in memory. The processor then retrieves the data from this structure and the transaction is complete. In this way, the mailbox protocol allows software pending of CSR reads; it also allows the software to pass byte information that is not available from the Alpha 21064A processor.5
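The sketch below makes the mailbox sequence concrete: build the structure in memory, post its address to the pointer register, and poll memory for completion. It is a minimal illustration; the structure layout, the MBX_* constants, and the mailbox_pointer_csr register are hypothetical stand-ins, not the actual DEC 4000 definitions.

#include <stdint.h>

/* Hypothetical mailbox layout; the real DEC 4000 structure differs. */
struct mailbox {
    uint32_t command;            /* operation, e.g., CSR read of a byte */
    uint32_t mask;               /* byte enables within the target word */
    uint64_t rbadr;              /* remote bus address of the target CSR */
    volatile uint64_t status;    /* I/O node sets MBX_DONE here */
    volatile uint64_t data;      /* read data returned by the I/O node */
};

#define MBX_CMD_READ_BYTE 1u
#define MBX_DONE          1u

/* Hypothetical pointer register through which the I/O node learns
 * where the mailbox structure lives in main memory. */
extern volatile uint64_t *mailbox_pointer_csr;

uint8_t mailbox_read_byte(struct mailbox *mbx, uint64_t csr_addr)
{
    mbx->command = MBX_CMD_READ_BYTE;
    mbx->mask    = 1u;                 /* low byte only */
    mbx->rbadr   = csr_addr;
    mbx->status  = 0;

    /* Posting the pointer starts the transaction.  The I/O node reads
     * the structure, performs the access, and writes status and data
     * back to memory; the CPU polls memory instead of stalling the
     * system bus ("device driver pending"). */
    *mailbox_pointer_csr = (uintptr_t)mbx;
    while ((mbx->status & MBX_DONE) == 0)
        ;

    return (uint8_t)mbx->data;
}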


Changes to the System Bus

Although the DEC 4000 system bus provided many features desirable in a multiprocessor interconnect, it did not meet the system requirements defined during the concept phase of the AlphaServer project. Two major hurdles existed. One was the lack of support for four CPUs and multiple I/O nodes. A second, more important issue was the incompatibility of the mailbox I/O structure with the Windows NT operating system.

The initial port of the Windows NT operating system to the DECpc 150 PC assumed direct-mapped I/O. With direct mapping the I/O is physically mapped into the processor's memory map, and all reads/writes to I/O space are handled as uncached memory accesses. Clearly, this was incompatible with the nonpended bus, which assumes the use of mailboxes. Consequently, the designers studied the advantages and disadvantages of using mailboxes to determine if they should be supported in the Windows NT operating system. They found that the software overhead of manipulating the mailbox structure made CSR accesses approximately three times slower than direct accesses by the hardware. Thus the CPU performing the I/O access waits longer to complete. For this reason, the designers chose not to use mailboxes.

The designers also had to ensure that the system bus would be available for use by other processors while the I/O transaction was completing. To satisfy this requirement, they added a retry mechanism to the system bus. The retry support was very simple and was layered on top of existing bus signals. A retry condition exists when the CPU initiates a cycle to the I/O that cannot be completed in one system bus transaction by the I/O bridge. The CPU involved in the transaction is notified of the retry condition. The CPU then "backs off" the multiprocessor bus and generates that transaction some period of time later. Other processor modules can access memory during the slow I/O transaction. The retry procedure continues until the I/O bridge has the requested data. At that stage, the data is returned to the requesting CPU.

Byte Addressing

Byte granularity had been handled in the mailbox data structure. After the direct-mapped I/O scheme was adopted, the designers had to overcome the lack of byte addressability in the Alpha architecture. Therefore, the designers participated in a collaborative effort across Digital to define a mechanism for adding byte addressability in the Alpha architecture. The new scheme required the use of the four lower available Alpha Ad[08:05] address bits to encode byte masks and lower order address bits for the PCI and EISA buses. For more details, see the paper on the AlphaServer 2100 I/O subsystem in this issue.6

The designers required a redefinition of the address map. All I/O devices are now memory mapped. The Alpha 21064A processor has a 34-bit address field that yields an address space of 16 GB. This 16-GB address region may be subdivided into 4-GB quadrants. Each quadrant can be individually marked as cacheable or noncacheable memory. The DEC 4000 system architecture split the 16-GB region in half: 8 GB was allocated as cacheable memory space and the remaining 8 GB as noncacheable space. Memory-mapped I/O devices are mapped into noncacheable space. The decision to support multiple I/O buses in the new systems together with the decision to memory map all I/O (i.e., no mailbox accesses) yielded a noncacheable memory requirement in excess of the 8 GB allocated in the DEC 4000 system. Therefore the designers of the AlphaServer systems changed the address map and allocated a single quadrant (4 GB) of memory as cacheable space and the remaining 12 GB as noncacheable. These 12 GB are used to memory map the I/O.

Arbitration

The bus used in the DEC 4000 system supports two CPU nodes and a single I/O node. To achieve the AlphaServer product goals of multiple I/O bridges and multiple CPU nodes, the designers changed the address map to accommodate CSR space for these extra nodes and designed a new arbiter for the system. The arbiter includes enhanced functionality to increase the performance of future generations of processors. Some key features of the arbiter are listed below.

1. The arbiter is implemented as a separate chip on all processor modules. The logic was partitioned into a separate chip to accommodate a flexible architecture and to allow additional arbitrating nodes in the future. As many as four arbiters can exist in the system. Only one arbiter is enabled in the system. It is on the processor installed in slot 2 of the system bus.

2. I/O node arbitration is interleaved with CPU node arbitration. The arbitration is round robin and leads to an ordering scheme of CPU 0, I/O, CPU 1, I/O, CPU 2, I/O, CPU 3, I/O (a software model of this rotation is sketched below). This scheme attempts to minimize I/O latency by ensuring many arbitration slots for I/O devices. Processors still have more than adequate access to the system bus due to the nature of I/O traffic (generally bursts of data in short periods of time). On an idle bus, the arbiter reverts to a first-come, first-served scheme.

3. The arbiter implements an exclusive access cycle. This allows an arbitrating node to retain the use of the system bus for consecutive cycles. This cycle is used by the I/O bridge in response to a PCI lock cycle. A PCI lock cycle may be generated by a device that requires an atomic operation, which may take multiple transactions to complete. For example, the AlphaServer 2100 and AlphaServer 2000 systems use a PCI-to-EISA bridge chip set (Intel 82430 chip set).7 This chip set requests a lock cycle on PCI when an EISA device requires an atomic read-modify-write operation.


The use of atomic read-modify-write operations is common in older I/O adapter designs. The I/O bridge on the system bus requests an exclusive access cycle from the arbiter. When it is granted, all buffers in the path to memory are flushed and the device has exclusive use of the PCI and the system bus until its transaction is completed. The use of this mode is not recommended for new adapter designs due to the unfair nature of its tenure on the system bus. It was implemented in the AlphaServer product design to support older EISA devices.

Memory Module

Main memory is accessed over the system bus either by processors (after missing in their on-board caches) or by I/O nodes performing direct memory access (DMA) transactions. They are called commanders. The memory controller incorporates a number of performance-enhancing features that reduce latency in accessing the dynamic RAM (DRAM) array. One concept used is called a stream buffer. Stream buffers reduce the read latency to main memory.

Reads to main memory normally require 9 to 10 cycles on the system bus, depending on the speed of DRAMs in the array. The use of stream buffers reduces this time to 7 cycles. The stream buffers provide a facility to load data fetched from the DRAM array prior to the receipt of a read request for that data. A stream is detected by monitoring the read addresses from each commander on the system bus. The logic simply keeps a record of the memory addresses of the previous eight read transactions from each commander and compares each subsequent read address to see if the new address is contiguous to any of the recorded addresses. If a new address is determined to be contiguous to any of the previous eight addresses, a new stream is declared. As a result, one of the stream buffers is allocated to a new stream.

A stream buffer is implemented as a four-deep, first-in, first-out (FIFO) buffer. Each entry in the FIFO buffer is 32 bytes, which is equivalent to the system bus line size. Each memory module contains four stream buffers that can be allocated to different commanders. A least recently used (LRU) algorithm is used to allocate stream buffers. When a new stream is detected, or an existing stream is empty, the stream buffer fills from the DRAM array by using successive addresses from the head of the stream. After a buffer has been allocated and some amount of data has been placed in the FIFO buffer, "hit" logic compares incoming read addresses from the system bus to the stream address. If a comparison of these two addresses is successful, read data is delivered from the memory module without incurring the latency of accessing the DRAM array. An invalidation scheme is used to ensure that the stream buffers stay coherent. Write cycle addresses are checked to see if they coincide with a stream buffer address. If the write address is equal to any address currently in the stream buffer, that entire stream buffer is declared invalid. Once it is invalidated, it can be reallocated to the next detected stream. (A software model of this allocate, hit, and invalidate behavior is sketched after Figure 2.)

Writes to main memory complete on the system bus in six cycles, which is achieved using write buffers in the memory controller. The write transactions are essentially "dump and run." The total write buffering available in each memory module is 64 bytes, which is large enough to ensure that the system bus never has to stall during a write transaction.

The implementation of the memory module differs from the AlphaServer 2100 to the AlphaServer 2000 system. Both memory modules contain the same memory controller ASICs, but the implementation of the DRAM array is different. Due to space constraints on the AlphaServer 2100, the DRAM array was implemented as a flat, two-sided surface-mount module. On the AlphaServer 2000, single in-line memory modules (SIMMs) were used for the DRAM array. Memory module capacities vary from 32 MB to 512 MB. The AlphaServer 2100 system provides four system bus slots that can be populated with memory modules. The maximum supported configuration is 2 GB with four memory modules. This limits the maximum system configuration to three processors since one of the processor slots must be used as a memory slot. The AlphaServer 2000 system provides two system bus slots that can be populated with memory. The maximum memory supported in this system is 640 MB. This configuration consists of one 512-MB module and one 128-MB module. The maximum memory constraint is dictated by the power and cooling available within this system package. The AlphaServer 2000 still supports two processor modules when configured with maximum memory. Figure 2 shows a block diagram of the AlphaServer 2000 memory module.


[Block diagram: four banks of eight x36 SIMMs feed even- and odd-slice memory controller ASICs through a 128-bit data path with 12 EDC bits; the ASICs interface to the system bus, and a serial control bus EEPROM resides on the module.]

Figure 2 Block Diagram of the AlphaServer 2000 Memory Module
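The C model below summarizes the stream-buffer policy just described: a read that misses is compared against the commander's previous eight read addresses; a contiguous match declares a stream and claims the least recently used of the four buffers; a read that matches a buffer's head block is a hit and advances the prefetch; a write to a buffered block invalidates the buffer. This is a simplified software model (it tracks only block addresses, not the buffered data), and all names are hypothetical.

#include <stdint.h>
#include <string.h>

#define BLOCK    32u      /* system bus line size, bytes */
#define HISTORY  8        /* read addresses remembered per commander */
#define NBUF     4        /* stream buffers per memory module */

struct stream_buf {
    int      valid;
    uint64_t head;        /* block address at the FIFO head */
    unsigned lru;         /* larger value = more recently used */
};

static uint64_t history[6][HISTORY];   /* up to 4 CPU + 2 I/O commanders */
static struct stream_buf buf[NBUF];
static unsigned use_count;

static uint64_t block_of(uint64_t addr) { return addr & ~(uint64_t)(BLOCK - 1); }

static struct stream_buf *lookup(uint64_t addr)
{
    for (int i = 0; i < NBUF; i++)
        if (buf[i].valid && buf[i].head == block_of(addr))
            return &buf[i];
    return 0;
}

void memory_read(int commander, uint64_t addr)
{
    struct stream_buf *b = lookup(addr);
    if (b) {
        /* Hit: data comes from the FIFO (about 7 bus cycles instead
         * of 9 to 10) and the buffer prefetches the next block. */
        b->head += BLOCK;
        b->lru = ++use_count;
    } else {
        /* Miss: contiguous with a recent read from this commander? */
        for (int i = 0; i < HISTORY; i++) {
            if (block_of(addr) == block_of(history[commander][i]) + BLOCK) {
                struct stream_buf *lru = &buf[0];     /* take the LRU buffer */
                for (int j = 1; j < NBUF; j++)
                    if (buf[j].lru < lru->lru)
                        lru = &buf[j];
                lru->valid = 1;
                lru->head  = block_of(addr) + BLOCK;  /* start prefetching */
                lru->lru   = ++use_count;
                break;
            }
        }
    }
    /* Record this read in the commander's eight-entry history. */
    memmove(&history[commander][1], &history[commander][0],
            (HISTORY - 1) * sizeof(uint64_t));
    history[commander][0] = addr;
}

/* Writes invalidate any buffer streaming the written block. */
void memory_write(uint64_t addr)
{
    struct stream_buf *b = lookup(addr);
    if (b)
        b->valid = 0;
}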

Technology Choices

This section briefly discusses some of the decisions and trade-offs made concerning module and silicon technology used in the systems.

Module Technology

The designers partitioned the logic into modules for two reasons: (1) Removable processor and memory modules allow for installation of additional memory and processors, and (2) they also allow for easy upgrade to faster processor speeds. Since modularity adds cost to a system, the designers decided that the I/O subsystem (EISA and PCI logic) should reside on the backplane. They deviated from this strategy for the AlphaServer 2100 system design because the PCI-to-EISA bridge was a new, unfamiliar design. Fixing any problems with this chip set or any of the supporting logic would have required a backplane upgrade, which is a time-consuming effort. For this reason, the engineers chose to build an I/O module for the AlphaServer 2100 system that contained the PCI-to-EISA bridge; associated control logic; controllers for mouse, keyboard, printer, and floppy drive; and the integral Ethernet and SCSI controllers. This module was eliminated in the AlphaServer 2000 system due to the design stability of the I/O module.

The Metral connector specified by the Futurebus+ specification was chosen for the system bus implementation on the DEC 4000 product. This choice was consistent with the design of the DEC 4000 server, which is a Futurebus+ system. Cost studies undertaken during the initial design of the AlphaServer 2100 system showed that the cost per pin of the Metral connector was high and added a significant cost to the system.

The team decided to investigate the use of either the PCI or the EISA connector for the system bus, since both connectors are used widely in the system. The PCI connector is actually a variant of the MicroChannel Architecture (MCA) connector used in microchannel systems. SPICE simulations showed that it performed better than the Metral connector on the Futurebus+.8 The team chose a 240-pin version of the connector for implementation because it met the system requirements and had a low cost.

Due to the choice of the MCA connector, the board thickness was limited to a maximum of 0.062 inches. An 8-layer layup was chosen for the module technology. The processor modules had a requirement for both a 5.0-V supply and a 3.0-V supply. The designers chose a split plane to distribute the power rather than two separate power planes for each voltage. Routing high-speed signals across the split was minimized to reduce any emissions that might arise from using a split plane. Testing later validated this approach as emissions from this area were minimal.

Silicon Technology

The system partitioning required the design of four ASICs. These were the CPU bus interface ASIC, the memory bus interface ASIC, the system arbiter, and the system bus-to-PCI bridge. The DEC 4000 implementation of the Futurebus+ used an externally supplied gate-array process that was customized to meet the performance needs of the bus and the performance goals of the first Alpha systems. Gate-array costs are determined by the number of chips that are produced on the chosen gate-array process. The volume of chips produced by the gate-array process for the DEC 4000 system was low because the process was specially adjusted for that system application. As a result, the volume of chips was directly proportional to the volume of the DEC 4000 systems built. Therefore, the cost per component produced by this process was relatively high.

If they had used this customized gate-array process, the designers of the AlphaServer product could not have met their cost goals. They needed a more generic process that could produce chips that many system vendors could use. This would ensure that the line utilization was high and that the cost per component was low. Therefore, they changed the technology to one that is standard in the industry. Gate-array process technology had evolved since the DEC 4000 design, and a standard technology that was capable of meeting the system timing requirements was available. Extensive SPICE simulations verified the process capability. ASICs that were implemented with this process had no difficulty meeting the bus timing.8

Another interesting feature of the analog design on the AlphaServer 2100 system involves the support of 11 loads on the PCI. The PCI specification recommends 10 loads as the "cookbook" design.2 The system requirement on the AlphaServer 2100 was to support three PCI slots, the integral PCI Ethernet chip, the NCR 810 (PCI-to-fast-SCSI controller), and the PCI-to-EISA bridge. Each PCI connector has been modeled to be equivalent to two electrical loads. Taking account of the system bus-to-PCI bridge and the additional load contributed by the I/O module connector yielded a PCI bus with 11 electrical loads. Extensive SPICE simulations of the bus and careful routing to ensure a short bus guaranteed that the new design would meet the electrical specifications of the PCI bus.8

System Start-up

The design team incorporated many availability features into the AlphaServer 2100 and AlphaServer 2000 servers. These included support of "hot-swap" storage devices that can be removed or installed while the system is operating, error correction code (ECC)-protected memory, redundant power supplies, and CPU recovery. Perhaps the most interesting part of the design for availability was the emphasis on ensuring that the system had enough built-in recovery and redundancy to allow it to remain in a usable or diagnosable state. Large systems sometimes have complicated paths in which to access the initial start-up code, and a system failure in that path can leave the owner with no visible failure indication. Moreover, in a multiprocessor system with more than one CPU installed, it is highly desirable to initialize the resident operating system even if all CPUs are not in working order. The AlphaServer 2100 and 2000 systems employ two schemes to help achieve this goal.

The start-up code for the AlphaServer 2100 and AlphaServer 2000 systems is located in flash read-only memory (ROM), which resides on a peripheral bus behind the PCI-to-EISA bridge. In starting up a multiprocessing operating system, only one processor is designated to access the start-up code and initialize the operating system. This is referred to as the primary processor. Accessing the start-up code requires the processor, system bus, memory, and most of the I/O subsystem to be functional.

The AlphaServer systems have a number of features that help make the start-up process more robust. Each processor module contains a separate maintenance processor implemented as a simple microcontroller that connects to a serial bus on the system. The serial bus is a two-wire bus that has a data line and a clock line. On power-up the processor module performs a number of diagnostic tests and logs the results in an electrically erasable programmable read-only memory (EEPROM) on the module. This EEPROM resides on the serial bus. If a CPU fails one of its power-up tests or if it has an error logged in its EEPROM, then it is not allowed to be the primary processor. Assume that four CPUs are installed in the system; if only CPU 0 fails, then CPU 1 is the primary processor. If CPU 0 and CPU 1 fail, then CPU 2 is the primary processor. If CPU 0, CPU 1, and CPU 2 fail, then CPU 3 is the primary processor. If all four CPUs fail, then CPU 0 is the primary processor. If any one of the CPUs fails, a message is displayed on the operator control panel to inform the user that there is a problem. Any secondary CPU that has failed is disabled and will not be seen by the firmware console or the operating system. The primary processor then uses the system bus to access the start-up code in the flash ROM.

The flash ROM may contain incorrect data. The flash ROMs on many systems have a program update, and errors from a power spike or surge can be introduced into the ROM code during the update procedure. User error is another common way to introduce data error; for example, a user can accidentally press a key while the update program is running. Flash ROMs can also fail from intrinsic manufacturing faults such as current leakage, which will eventually convert a stored "1" into a stored "0," thus corrupting the program stored in the flash ROMs. Many techniques in the industry partially solve the problem of corrupted flash ROM data. One well-known technique uses a checksum and reports an error to the user if the data is not correct. Another technique provides a second set of flash ROMs and a switch that the user manipulates to transfer control to the new set in the event of a failure. The designers studied many previously used methods, but rejected them since they required intervention by the user.

In the AlphaServer 2100 and the AlphaServer 2000 system design, the design team implemented a scheme that did not require user intervention in the event of flash ROM corruption. The system has 1 MB of flash ROM of which the first 512 KB contain the system initialization code. This code is loaded into main memory, and many data integrity tests are performed. These include single and multiple bit parity checks, various data correction code checking, and a checksum calculation. The processor detects an error if the checksum calculation fails, i.e., if the calculated value is not equal to the stored value. The processor then writes a value to a register on the I/O module, which automatically changes the address pointing to the flash ROM to a second bank of flash ROM. This combination of hardware and software support provides a way for the AlphaServer 2100 system user to overcome any flash ROM corruption.

A clear understanding of how the operating systems initialized the AlphaServer 2100 system was critical to understanding what changes could be made. A joint team of hardware and software engineers examined various pieces of the code to identify the areas of the system design that could be changed. Investigations revealed that the system bus configuration code for the AlphaServer 2100 is somewhat generic. It assumes a maximum of eight nodes, which is the AlphaServer 2100 implementation. The I/O node to the primary PCI bus is expected to be present. The presence of additional processors and memories is detected by reading the CSR space of each module. A module that is present gives a positive acknowledgment. The design team could therefore reduce the number of system bus slots from seven to four. This had no effect on the software since nonexistent slots would merely be recognized as modules not installed in the system.

The physical packaging of the AlphaServer 2000 also dictated that the number of I/O slots be reduced from 11 (8 EISA and 3 PCI) to 10. Given the industry trend toward PCI, the desirable mix would have been 6 EISA slots and 4 PCI slots. The PCI bus configuration code searched for as many as 32 PCI slots, which is the number allowed by the PCI specification.2 After careful consideration, the designers determined that the addition of another PCI slot would involve a change in interrupt tables to accommodate the additional interrupts and vectors required by the additional slot. Therefore, the team decided to implement 3 PCI and 7 EISA slots.

2. The other component to building a DSR machine is to provide the system model number to the operating system so that licensing information can be determined. The system resident code that runs at start-up is referred to as the console. The console and the operating systems communicate via a data structure known as the hardware parameter block (HWRPB). The HWRPB is used to communicate the model number to the operating system, which uses this number to provide the correct licensing information.

The AlphaServer 2000 system was completed in approximately nine months. Qualification was not dependent on the operating system schedules. By building a DSR machine, the design team met the project's time-to-market requirements.

Performance Summary

Table 2 summarizes the performance of the systems described in this paper. The numbers are heavily influenced by the processor speed, cache, memory, and I/O subsystems. The systems exceeded the performance goals specified at the beginning of the project. In some cases the important benchmarks that had been relevant in the industry changed during the course of system development. In the transaction processing measurement, for example, the TPC-A benchmark was superseded by the TPC-C benchmark.

The AlphaServer 2100 server was the price-performance leader in the industry at the time of its introduction in April 1994. Successive improvements in processor and I/O subsystems should help the AlphaServer 2100 and 2000 products maintain that position in the industry.

Table 2 System Performance

                          AlphaServer    AlphaServer
                          2100 4/275     2000 4/200
    SPECint92*            200.1          131.8
    SPECfp92*             291.1          161.0
    AIM III†
      Number of AIMs      227.5          177.5
      User loads          1941.2         1516.0
    Estimated TPS‡        850            660

    Notes:
    * Single-processor system only
    † Dual-processor system only
    ‡ TPS is an abbreviation for transactions per second. These numbers
      are estimated for a quad-processor system running Rdb on OpenVMS
      version 6.1.

Conclusions

The design team exceeded all the product requirements set at the beginning of the AlphaServer project. The transfer cost of the final product was 10 percent better than the goal. The reduced cost was achieved despite the erratic price levels for DRAMs, which were much higher in 1994 than predicted in late 1992.

Separate cost targets were established for each portion of the system, and each design engineer was responsible for meeting a particular goal.


Constant cost reviews ensured that variances could be quickly addressed. The requirement to run three operating systems quickly expanded the size and scope of the project. The operating system developers became an integral part of the design team. Multiple reviews and open communication between the hardware development team and the software groups were essential to managing this work. The hardware team performed system-level testing on all three operating systems. This proved invaluable in tracking down bugs quickly and resolving them in either hardware or software.

The project team delivered the expected performance and functionality on schedule. Development time was allocated for new power and packaging subsystems (using third-party design companies), new modules, new ASICs, new system firmware, and porting of three operating systems. To attain the schedule, development tasks were frozen at the beginning of the project. The tasks were also categorized into three classes: mandatory, nonessential, and disposable. Consequently, engineers were able to make trade-offs when required and maintain the integrity of the product. Another key factor in meeting the schedule was the use of knowledge and technology developed for previous products. This yielded many benefits: less design time, fewer resources required, a known simulation environment, and less time to a working prototype.

Acknowledgments

Many people contributed to the success of this project. They spent long weekends and nights working to a schedule and a set of requirements that many thought were unachievable. The inspired dedication of team members made this project a reality. Although not complete, the following list credits those who were the most deeply involved: Vince Asbridge, Kate Baumgartner, Rachael Berman, Jack Boucher, John Bridge, Bob Brower, Harold Buckingham, Don Caley, Steve Campbell, Dave Carlson, Mike Chautin, Marco Ciaffi, Doug Field, Rich Freiss, Nitin Godiwala, Judy Gold, Paul Goodwin, Tom Hunt, Paul Jacobi, Steve Jenness, Jeff Kerrigan, Will Kerchner, Jeff Metzger, John Nerl, Mike O'Neill, Kevin Peterson, Bryan Porter, Ali Rafiemeyer, Lee Ridlon, Andy Russo, Stephen Shirron, Andrei Shishov, Gene Smith, Kurt Thaller, Frank Touserkani, Vicky Triolo, and Ralph Ware. Special thanks also to our manufacturing team in Massachusetts, Canada, and Scotland.

References and Note

1. B. Maskas, S. Shirron, and N. Warchol, "Design and Performance of the DEC 4000 AXP Departmental Server Computing Systems," Digital Technical Journal, vol. 4, no. 4 (Special Issue, 1992): 82-99.

2. PCI Local Bus Specification, Revision 2.0 (Hillsboro, OR: PCI Special Interest Group, Order No. 281446-001, April 1993).

3. E. Solari, ISA and EISA, Theory and Operation (San Diego, CA: Annabooks, 1992).

4. R. Sites, ed., Alpha Architecture Reference Manual (Burlington, MA: Digital Press, Order No. EY-L520E-DP, 1992).

5. DECchip 21064 Hardware Reference Manual (Maynard, MA: Digital Equipment Corporation, Order No. EC-N0079-72, 1992).

6. A. Russo, "The AlphaServer 2100 I/O Subsystem," Digital Technical Journal, vol. 6, no. 3 (Summer 1994, this issue): 20-28.

7. 82420/82430 PCIset ISA and EISA Bridges (Santa Clara, CA: Intel Corporation, 1993).

8. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

Andrew P. Russo

The AlphaServer 2100 I/O Subsystem

The AlphaServer 2100 I/O subsystem contains a dual-level I/O structure that includes the high-powered PCI local bus and the widely used EISA bus. The PCI bus is connected to the server's multiprocessing system bus through the custom-designed bridge chip. The EISA bus supports eight general-purpose EISA/ISA connectors, providing connections to plug-in, industry-standard options. Data rate isolation, disconnected transaction, and data buffer management techniques were used to ensure bus efficiency in the I/O subsystem. Innovative engineering designs accomplished the task of combining Alpha CPUs and standard-system I/O devices.

Digital's AlphaServer 2100 server combines Alpha multiprocessing technology with an I/O subsystem typically associated with personal computers (PCs).1 The I/O subsystem on the AlphaServer 2100 system contains a two-level hierarchical bus structure consisting of a high-performance primary I/O bus connected to a secondary, lower performance I/O bus. The primary I/O bus is a 32-bit peripheral component interconnect (PCI) local bus (or simply, PCI bus).2 The PCI bus is connected to the AlphaServer 2100 system's multiprocessing system bus through a custom application-specific integrated circuit (ASIC) bridge chip (referred to as the T2 bridge chip). The secondary I/O bus is a 32-bit Extended Industry Standard Architecture (EISA) bus connected to the PCI bus through a bridge chip set provided by Intel Corporation.3 Figure 1 shows the I/O subsystem designed for the AlphaServer 2100 product. The I/O subsystem demonstrated sufficient flexibility to become the I/O interface for the small pedestal AlphaServer 2000 product and the rackmountable version of the AlphaServer 2100 server.

This paper discusses the dual-level bus hierarchy and the several I/O advantages it provides. The design considerations of the I/O subsystem for the AlphaServer 2100 server are examined in the sections that follow.

I/O Support for EISA and PCI Buses

The EISA bus enables the AlphaServer 2100 system to support a wide range of existing EISA or Industry Standard Architecture (ISA) I/O peripherals.4 The EISA bus can sustain data rates up to a theoretical limit of 33 megabytes per second (MB/s) at a clock rate of 8.25 megahertz (MHz). In the current configuration for the AlphaServer 2100 product, the EISA bus supports eight general-purpose EISA/ISA connectors, and the EISA bridge chip set provides connections to various low-speed, system-standard I/O devices such as keyboard, mouse, and time-of-year (TOY) clock. For most system configurations, the AlphaServer 2100 system's EISA bus provides enough data bandwidth to meet all data throughput requirements. In light of the new requirements for faster data rates, however, the EISA bus will soon begin to run out of bus bandwidth.

To provide for more bandwidth, the AlphaServer 2100 system also contains a PCI bus as its primary bus. With data rates four times that of the EISA bus, the PCI bus provides a direct migration path from the EISA bus. The 32-bit PCI bus can sustain data rates up to a theoretical limit of 132 MB/s at a clock rate of 33 MHz. In the AlphaServer 2100 system configuration, the PCI bus provides connections to three general-purpose 32-bit PCI connectors, an Ethernet device, a SCSI device, the PCI-to-EISA bridge chip, and the T2 bridge chip.

A close examination of the bus structure reveals that the AlphaServer 2100 system actually contains a three-level, hierarchical bus structure. In addition to the PCI and EISA buses, the AlphaServer 2100 system includes a 128-bit multiprocessing system bus, as shown in Figure 1. Each bus is designed to adhere to its own bus interface protocols at different data rates. The system bus is 128 bits per 24 nanoseconds (ns); the PCI bus is 32 bits per 30 ns; and the EISA bus is 32 bits per 120 ns.
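As a rough check, these cycle figures imply the peak rates quoted above; the system-bus number below is our arithmetic, not a figure given in the original text:

    System bus: 16 bytes every 24 ns, about 667 MB/s
    PCI bus:     4 bytes every 30 ns, about 133 MB/s (the 132-MB/s theoretical limit cited above)
    EISA bus:    4 bytes every 120 ns, about 33 MB/s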

20 Vo l. 6 No. 3 Slimmer 1')94 Digital Te cbnical journal Th e AlphaServer 2100 l/0 Subsystem

[Figure 1: I/O Subsystem for the AlphaServer 2100 System. Block diagram: CPUs and memory reside on the 128-bit system bus; the T2 ASIC bridges the system bus to the 32-bit PCI bus, which carries three 32-bit PCI slots, a SCSI chip, and an Ethernet chip; the Intel bridge connects the PCI bus to the 32-bit EISA bus, which carries eight 32-bit EISA slots and the system-standard I/O devices.]

Each bus is required to provide a particular function to the system and is positioned in the bus hierarchy to maximize data efficiency. For example, the system bus is positioned close to the CPUs and memory to maximize CPU memory access time, and the lower performance I/O devices are placed on the EISA bus because their timing requirements are less critical. To maintain maximum bus efficiency on all three buses, it is critical that each bus be able to perform its various functions autonomously of the others. In other words, a slower performing bus should not affect the efficiency of a high-performance bus. The section below discusses a few techniques that we designed into the I/O subsystem to enable the buses to work together efficiently.

Using the Bus Hierarchy Efficiently

This section discusses the data rate isolation, disconnected transaction, data buffer management, and data bursting techniques used to ensure bus efficiency in the I/O subsystem.

Data Rate Isolation

The three-level bus hierarchy promotes data rate isolation and concurrency for simultaneous operations on all three buses. The design of the bus bridges helps to enable each bus to work independently: it provides bus interfaces with extensive data buffering that function at the same data rates as the interfacing bus. For example, the T2 bridge chip contains both a system bus interface and a PCI bus interface that run synchronously to their respective buses but are totally asynchronous to each other. The data buffers inside the T2 bridge chip act as a domain connector from one bus time zone to the other and help to isolate the data rates of the two buses.

Disconnected Transactions

Whenever possible, the bridges promote the use of disconnected (or pended) protocols to move data across the buses. Disconnected protocols decrease the interdependencies between the different buses. For example, when a CPU residing on the system bus needs to move data to the PCI bus, the CPU does so by sending its data onto the system bus. Here the T2 bridge chip (see Figure 2) stores the data into its internal data buffers at the system bus data rate. The T2 bridge chip provides enough buffering to store an entire CPU transaction. From the CPU's perspective, the transaction is completed as soon as the T2 bridge chip accepts its data. At that point, the T2 bridge chip must forward the data to the PCI bus, independent of the CPU.


[Figure 2: Block Diagram of the T2 Bridge Chip. A 128-bit system bus corner (commander/responder) and a 32-bit PCI bus corner (master/target) are joined by buffer controller logic and the DMA read, DMA write, and programmed I/O buffers.]

In this way, the CPU is not required to waste bus bandwidth by waiting for the transfer to complete to its final destination on the PCI bus. The T2 bridge chip implements disconnected transactions for all CPU-to-PCI transactions and most PCI-to-memory transactions. In a similar fashion, the PCI-to-EISA bridge implements disconnected transactions between the PCI bus and the EISA bus.
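The disconnected-write idea can be shown in a few lines of C. This is a minimal sketch of a posted-write queue under invented names and an invented queue depth; the T2's actual buffer organization is described in the next section.

    /* Sketch of a disconnected (pended) write through a bridge.
     * Queue depth and all names are illustrative, not the T2 design. */
    #include <stdbool.h>
    #include <stdint.h>

    struct posted_write { uint64_t addr; uint8_t data[32]; };

    extern void pci_burst_write(uint64_t addr, const uint8_t *data, unsigned len);

    #define QDEPTH 4
    static struct posted_write q[QDEPTH];
    static unsigned head, tail;

    /* System bus side: the CPU transaction completes as soon as the
     * bridge buffers the data. */
    bool bridge_accept_write(const struct posted_write *w)
    {
        if ((tail + 1) % QDEPTH == head)
            return false;              /* buffers full; retry on the bus */
        q[tail] = *w;
        tail = (tail + 1) % QDEPTH;
        return true;                   /* CPU is done at this point */
    }

    /* PCI side: drains the buffers at the PCI data rate, independent
     * of the CPU. */
    void bridge_drain_to_pci(void)
    {
        while (head != tail) {
            pci_burst_write(q[head].addr, q[head].data, 32);
            head = (head + 1) % QDEPTH;
        }
    }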

Data Buffer Management

In addition to containing temporary data buffering to store data on its journey from bus to bus, each bridge chip utilizes buffer management to allocate and deallocate its internal data buffers from one incoming data stream to another. In this way, a single ASIC bridge design can efficiently service multiple data streams with a relatively small amount of data buffering and without impacting bus performance.

The T2 bridge chip contains 160 bytes of temporary data buffering divided across the three specific bus transactions it performs. These three transactions are (1) direct memory access (DMA) writes from PCI to memory (system bus), (2) DMA reads from memory (system bus) to PCI, and (3) programmed I/O (system bus) reads/writes by a CPU from/to the PCI. The T2 bridge chip's data buffering is organized into five 32-byte buffers. Two 32-byte buffers each are allocated to the DMA write and DMA read functions, and one 32-byte buffer is allocated to the programmed I/O function. Each of the three transaction functions contains its own buffer management logic to determine the best use of its available data buffering. Buffer management is especially valuable in situations in which a PCI device is reading data from memory on the system bus. To maintain an even flow of data from bus to bus, the buffer management inside the T2 bridge chip attempts to prefetch more read data from memory while it is moving data onto the PCI. Buffer management helps the bridges service bus transactions in a way that promotes continuous data flow that, in turn, promotes bus efficiency.

Burst Transactions

Using a bus efficiently also means utilizing as much of the bus bandwidth as possible for "useful" data movement. Useful data movement is defined as that section of time when only the actual data is moving on the bus, devoid of address or protocol cycles. Maximizing useful data movement can be accomplished by sending many data beats (data per cycle) per single transfer time. Sending multiple data beats per single transfer is referred to as a "burst transaction."

All three buses have the ability to perform burst transactions. The system bus can burst as much as 32 bytes of data per transaction, and the PCI and EISA buses can burst continuously as required. Data bursting promotes bus efficiency and very high data rates. Each bus bridge in the server is required to support data bursting.


The Bus Bridges

In the previous section, we discussed certain design techniques used to promote efficiency within the server's hierarchical bus structure. The section that follows describes the bus bridges in more detail, emphasizing a few interesting features.

The T2 Bridge Chip

The T2 bridge chip is a specially designed ASIC that provides bridge functionality between the server's multiprocessing system bus and the primary PCI bus. (See Figures 1 and 2.) The T2 ASIC is a 5.0-volt chip designed in complementary metal-oxide semiconductor (CMOS) technology. It is packaged in a 299-pin ceramic pin grid array (CPGA).

As stated earlier, the T2 bridge chip contains a 128-bit system bus interface running at 24 ns and a 32-bit PCI interface running at 30 ns. By using these two interfaces and data buffering, the T2 bridge chip translates bus protocols in both directions and moves data on both buses, thereby providing the logical system bus-to-PCI interface (bridge). In addition to the previously mentioned bridge features, the T2 bridge chip integrates system functions such as parity protection, error reporting, and CPU-to-PCI address and data mapping, which is discussed later in the section Connecting the Alpha CPU to the PCI and EISA Buses.

The T2 bridge chip contains a sophisticated DMA controller capable of servicing three separate PCI masters simultaneously. The DMA controller supports different-size data bursting (e.g., single, multiple, or continuous) and two kinds of DMA transfers, direct mapped and scatter/gather mapped. Both DMA mappings allow the T2 bridge chip to transfer large amounts of data between the PCI bus and the system bus, independent of the CPU.

Direct-mapped DMAs use the address generated by the PCI to access the system bus memory directly. Scatter/gather-mapped DMAs use the address generated by the PCI to access a table of page frame numbers (PFNs) in the system bus memory. By using the PFNs from the table, the T2 bridge chip generates a new address to access the data. To enhance the performance of scatter/gather-mapped DMAs, the T2 bridge chip contains a translation look-aside buffer (TLB) that contains eight of the most recently used PFNs from the table. By storing the PFNs in the TLB, the T2 bridge chip does not have to access the table in system bus memory every time it requires a new PFN. The TLB improves scatter/gather-mapped DMA performance and conserves bus bandwidth. Each entry in the TLB can be individually invalidated as required by software.
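A minimal sketch of the scatter/gather translation path with its eight-entry TLB follows. All names and the replacement policy are illustrative, not the T2 design; the 8-KB page size is that of Alpha systems of this generation.

    /* Sketch of scatter/gather DMA translation with an 8-entry TLB. */
    #include <stdint.h>

    #define SG_TLB_ENTRIES 8
    #define PAGE_SHIFT     13                 /* 8-KB pages */

    struct sg_tlb_entry { uint64_t va_page, pfn; int valid; };
    static struct sg_tlb_entry sg_tlb[SG_TLB_ENTRIES];
    static int victim;                        /* simple rotating victim */

    extern uint64_t sg_table_fetch_pfn(uint64_t va_page); /* PFN table read
                                                             on the system bus */

    uint64_t sg_translate(uint64_t pci_addr)
    {
        uint64_t va_page = pci_addr >> PAGE_SHIFT;
        uint64_t offset  = pci_addr & ((1ul << PAGE_SHIFT) - 1);
        int i;

        /* Hit: no system bus access is needed for the PFN. */
        for (i = 0; i < SG_TLB_ENTRIES; i++)
            if (sg_tlb[i].valid && sg_tlb[i].va_page == va_page)
                return (sg_tlb[i].pfn << PAGE_SHIFT) | offset;

        /* Miss: fetch the PFN from the table in system bus memory and
         * cache it.  (Rotating replacement is a simplification; the text
         * says the TLB holds the eight most recently used PFNs.) */
        i = victim;
        victim = (victim + 1) % SG_TLB_ENTRIES;
        sg_tlb[i].va_page = va_page;
        sg_tlb[i].pfn     = sg_table_fetch_pfn(va_page);
        sg_tlb[i].valid   = 1;
        return (sg_tlb[i].pfn << PAGE_SHIFT) | offset;
    }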
The T2 bridge chip also contains a single I/O data mover that enables a CPU on the system bus to initiate data transfers with a device on the PCI bus. The I/O data mover supports accesses to all the valid PCI address spaces, including PCI I/O space, PCI memory space, and PCI configuration space. The T2 bridge chip supports two I/O transaction types when accessing PCI memory space: sparse-type data transfers and dense-type data transfers. Sparse-type transfers are low-performance operations consisting of 8-, 16-, 24-, 32-, and 64-bit data transactions. Dense-type transfers are high-performance operations consisting of 32-bit through 32-byte data transactions. Dense-type transfers are especially useful when accessing I/O devices with large data buffers, such as video graphics adapter (VGA) controllers. A single PCI device mapped into PCI memory space can be accessed with either sparse-type operations, dense-type operations, or both.

In addition to accessing the PCI, a CPU can access various T2 bridge chip internal control/status registers (CSRs) for setup and status purposes. For maximum flexibility, all the T2 bridge chip's functions are CSR programmable, allowing for a variety of optional features. All CPU I/O transfers, other than those to T2 bridge chip CSRs, are forwarded to the PCI bus.

Intel PCI-to-EISA Bridge Chip Set

The Intel PCI-to-EISA bridge chip set provides the bridge between the PCI bus and the EISA bus.3 It integrates many of the common I/O functions found in today's EISA-based PCs. The chip set incorporates the logic for a PCI interface running at a clock rate of 30 ns and an EISA interface running at a clock rate of 120 ns. The chip set contains a DMA controller that supports direct- and scatter/gather-mapped data transfers, with a sufficient amount of data buffering to isolate the PCI bus from the EISA bus. The chip set also includes PCI and EISA arbiters and various other support control logic that provide decode for peripheral devices such as the flash read-only memories (ROMs) containing the basic I/O system (BIOS) code, real-time clock, keyboard/mouse controller, floppy controller, two serial ports, one parallel port, and hard disk drive. In the AlphaServer 2100 system, the PCI-to-EISA bridge chip set resides on the standard I/O module, which is discussed later in this paper.

Connecting the Alpha CPU to the PCI and EISA Buses

In the next section, we discuss several interesting design challenges that we encountered as we attempted to connect PC-oriented bus structures to a high-powered multiprocessing Alpha chassis.


Address and Data Mapping

When a CPU initiates a data transfer to a device on the PCI bus, the T2 bridge chip must first determine the location (address) and amount of data (mask) information for the requested transaction and then generate the appropriate PCI bus cycle. This issue is not straightforward because the PCI and EISA buses both support data transfers down to byte granularity, but the Alpha CPU and the system bus provide masking granularity only down to 32 bits of data. To generate less than 32-bit addresses and byte-masked data transactions on the PCI bus, the T2 bridge chip needed to implement a special decoding scheme that converts an Alpha CPU-to-I/O transaction, as it appears on the system bus, to a correctly sized PCI transaction. Tables 1 and 2 give the low-order Alpha address bits and Alpha 32-bit mask fields and show how they are encoded to generate the appropriate PCI address and data masks. By using this encoding scheme, the Alpha CPU can perform read and write transactions to a PCI device mapped in either PCI I/O, PCI memory, or PCI configuration space with sparse-type transfers. (Sparse-type transfer sizes have 8-, 16-, 24-, 32-, or 64-bit data granularity.)

Table 1 CPU-to-PCI Read Size Encoding

    Transaction  EV_Addr  EV_Addr  Instructions  PCI Byte     PCI_AD[1:0]  Data Returned to Processor,
    Size         [6:5]    [4:3]                  Enables (L)               EV_Data[127:0]

    8 bits       00       00       LDL           1110         00           OW_0:[D7:D0]
                 01       00       LDL           1101         01           OW_0:[D15:D8]
                 10       00       LDL           1011         10           OW_0:[D23:D16]
                 11       00       LDL           0111         11           OW_0:[D31:D24]
    16 bits      00       01       LDL           1100         00           OW_0:[D79:D64]
                 01       01       LDL           1001         01           OW_0:[D87:D72]
                 10       01       LDL           0011         10           OW_0:[D95:D80]
    24 bits      00       10       LDL           1000         00           OW_1:[D23:D0]
                 01       10       LDL           0001         01           OW_1:[D31:D8]
    32 bits      00       11       LDL           0000         00           OW_1:[D95:D64]
    64 bits      11       11       LDQ           0000         00           OW_1:[D95:D64]
                                                 0000                      OW_1:[D127:D96]

Table 2 CPU-to-PCI Write Size Encoding

    Transaction  EV_Addr  EV_Addr  EV_Mask[7:0]  Instructions  PCI Byte     PCI_AD[1:0]  Data from Processor,
    Size         [6:5]    [4:3]    (H)                         Enables (L)               EV_Data[127:0]

    8 bits       00       00       00000001      STL           1110         00           OW_0:[D7:D0]
                 01       00       00000001      STL           1101         01           OW_0:[D15:D8]
                 10       00       00000001      STL           1011         10           OW_0:[D23:D16]
                 11       00       00000001      STL           0111         11           OW_0:[D31:D24]
    16 bits      00       01       00000100      STL           1100         00           OW_0:[D79:D64]
                 01       01       00000100      STL           1001         01           OW_0:[D87:D72]
                 10       01       00000100      STL           0011         10           OW_0:[D95:D80]
    24 bits      00       10       00010000      STL           1000         00           OW_1:[D23:D0]
                 01       10       00010000      STL           0001         01           OW_1:[D31:D8]
    32 bits      00       11       01000000      STL           0000         00           OW_1:[D95:D64]
    64 bits      11       11       11000000      STQ           0000         00           OW_1:[D95:D64]
                                                               0000                      OW_1:[D127:D96]
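The longword (LDL/STL) rows of Tables 1 and 2 follow a simple pattern: EV_Addr<6:5> selects the starting byte lane (and is driven as PCI_AD[1:0]), EV_Addr<4:3> encodes the transfer length minus one, and the active-low byte enables cover exactly those lanes. The C sketch below restates that pattern; it is a reading of the tables, not the actual T2 decode logic, and the 64-bit LDQ/STQ case (two data phases with all lanes enabled) is omitted.

    /* Decode for the sparse-space LDL/STL rows of Tables 1 and 2.
     * Derived from the table contents; not the actual T2 logic. */
    struct pci_cycle {
        unsigned ad10;          /* PCI_AD[1:0] driven during the address phase */
        unsigned byte_enables;  /* active low: a 0 bit enables that byte lane  */
    };

    struct pci_cycle sparse_decode(unsigned ev_addr_6_5, unsigned ev_addr_4_3)
    {
        unsigned nbytes = ev_addr_4_3 + 1;                 /* 1 to 4 bytes */
        unsigned lanes  = ((1u << nbytes) - 1) << ev_addr_6_5;
        struct pci_cycle c;

        c.ad10         = ev_addr_6_5;   /* starting byte offset        */
        c.byte_enables = ~lanes & 0xFu; /* complement: enables are low */
        return c;
    }

    /* Example: ev_addr_6_5 = 01 and ev_addr_4_3 = 01 (a 16-bit transfer
     * at byte offset 1) yield byte enables 1001 and PCI_AD[1:0] = 01,
     * matching the corresponding row of Table 1. */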


Another mapping problem exists when a PCI device wants to move a byte of data (or anything smaller than 32 bytes of data) into the system bus memory. Neither the system bus nor its memory supports byte granularity data transfers. Therefore, the T2 bridge chip must perform a read-modify-write operation to move less than 32 bytes of data into the system bus memory. During the read-modify-write operation, the T2 bridge chip first reads a full 32 bytes of data from memory at the address range specified by the PCI device.2 It then merges the old data (read data) with the new data (PCI write data) and writes the full 32 bytes back into memory.
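A minimal sketch of that read-merge-write step follows, with invented helper names; the real merge happens inside the T2 hardware.

    /* Sketch of the read-modify-write merge described above. */
    #include <stdint.h>

    extern void sysbus_read32b(uint64_t addr, uint8_t block[32]);
    extern void sysbus_write32b(uint64_t addr, const uint8_t block[32]);

    /* 'valid' has one bit per byte: bit i set means the PCI device
     * supplied byte i of the 32-byte block. */
    void dma_write_merge(uint64_t addr, const uint8_t newdata[32],
                         uint32_t valid)
    {
        uint8_t block[32];
        int i;

        sysbus_read32b(addr, block);         /* read the full 32 bytes */
        for (i = 0; i < 32; i++)             /* merge old with new     */
            if (valid & (1u << i))
                block[i] = newdata[i];
        sysbus_write32b(addr, block);        /* write the full 32 back */
    }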
ISA Fixed-address Mapping

We encountered a third interesting mapping problem when we decided to support certain ISA devices with fixed I/O addresses in the AlphaServer 2100 system. These ISA devices (e.g., an ISA local area network [LAN] card or an ISA frame buffer) have fixed (hardwired) memory-mapped I/O addresses in the 1-MB to 16-MB address range.

The ISA devices being discussed were designed for use in the first PCs, which contained less than 1 MB of main memory. In these PCs, the I/O devices had fixed access addresses above main memory in the 1-MB to 16-MB address range. Today's PCs have significantly more physical memory and use the 1-MB to 16-MB region as a part of main memory. Unfortunately, these ISA devices were never redesigned to accommodate this change. Therefore, to support these ISA options, the PC designers created I/O access gaps in main memory in the 1-MB to 16-MB address range. With this technology, an access by a CPU in that address range is automatically forwarded to the ISA device.

To remain compatible with the ISA community, the T2 bridge chip also had to allow for a gap in main memory at the 1-MB to 16-MB address range so that these addresses could be forwarded to the appropriate ISA device.

BIOS Caching Compatibility

Today's Microsoft-compatible PCs provide another performance-enhancing mechanism. We decided to implement this function inside the T2 bridge chip as well.

During system initialization, MS-DOS-based PCs read several BIOS ROMs from their I/O space. Once the ROMs are read, their contents are placed in fixed locations in main memory in the 512-kilobyte (KB) to 1-MB address range. The software then has the ability to mark certain addresses within this range as read cacheable, write cacheable, read noncacheable, or write noncacheable. The basic intention is to mark frequently accessed sections of code as read cacheable but write noncacheable. In this way, read accesses "hit" in main memory (or in cache), and writes update the ROMs directly.

Interrupt Mechanism

No computer system would be complete without providing a mechanism for an I/O device to send interrupts to a CPU. The I/O interrupt scheme on the AlphaServer 2100 system combines familiar technology with custom support logic to provide a new mechanism.

Electrical and architectural restrictions prohibited the interrupt control logic from being directly accessed by either the system bus or the PCI bus. As a result, the interrupt control logic is physically located on a utility bus called the XBUS. The XBUS is an 8-bit slave ISA bus placed nearby the PCI-to-EISA bridge chips.

The base technology of the I/O interrupt logic is a cascaded sequence of Intel 8259 interrupt controllers. The 8259 chip was chosen because it is a standard, accepted, and well-known controller used by the PC industry today. The use of the 8259 interrupt controller translated to low design risk as well. Although the 8259 interrupt controller is not new, its integration into a high-performance multiprocessing server, without incurring undue performance degradation, required some novel thinking.

The integration of the 8259 interrupt controller into the AlphaServer 2100 system presented two considerable problems. First, the designers had to satisfy the 8259 interface requirements in a way that would have a minimal impact on the performance of the interrupt-servicing CPU. The 8259 requires two consecutive special-acknowledge cycles before it will present the interrupt vector. To resolve this problem, we designed a set of handshaking IACK programmable array logic (PAL) devices. These PALs enhance the functions of the 8259 controllers as XBUS slaves. The interrupt-servicing CPU performs only a single read to a designated address that is decoded to the XBUS. The IACK-control PALs decode this read and then generate the special, double-acknowledge cycles required to access the vector. The PAL logic also deasserts CHRDY, a ready signal to the ISA bus, so that the cycle has ample time to proceed without causing a conformance error for a standard ISA slave cycle. When the double acknowledge is complete and the vector is guaranteed to be driven on the bus, the PALs assert the CHRDY ready signal.


The second problem involved the location of the interrupt controller. As mentioned earlier, because of electrical and architectural restrictions, the interrupt controller was located on the XBUS near the PCI-to-EISA bridge chips. With the interrupt controller located on the XBUS, an interrupt-servicing CPU is required to perform a vector read that spans two I/O bus structures. For this reason and its potential effect on system performance, vector reads had to be kept to a minimum, which is not easy in a system that allows more than one CPU to service a pending interrupt request.

Since the AlphaServer 2100 system can have as many as four CPUs, all four CPUs can attempt to service the same pending interrupt request at the same time. Without special provisions, each CPU would perform a vector read of the interrupt controller only to find that the interrupt has already been serviced by another CPU. Requiring each CPU to perform a vector read of the interrupt controller on the XBUS wastes system resources, especially when each vector read spans two bus structures. Of course, this problem could be resolved by assigning only one CPU to service pending interrupts, but this would negate the advantage of having multiple CPUs in a system. To solve this problem, the T2 bridge chip on the system bus implements special "passive-release" logic that informs a CPU at the earliest possible time that the pending interrupt is being serviced by another CPU. This allows the "released" CPU to resume other, more important tasks.

The term passive release typically refers to a vector code given to an interrupt-servicing CPU during a vector read operation. The passive-release code informs the CPU that no more interrupts are pending. The special passive-release logic allows the T2 bridge chip to return the passive-release code to a servicing CPU on behalf of the interrupt controller. The T2 bridge chip performs this function to save time and bus bandwidth.
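From the servicing CPU's point of view, passive release turns the multi-CPU race into a cheap early exit. A minimal C sketch follows, with assumed names for the vector read and the passive-release code; the actual values and interfaces are T2-specific.

    /* Sketch of an interrupt-servicing CPU using passive release.
     * PASSIVE_RELEASE and both helper functions are invented names. */
    #include <stdint.h>

    #define PASSIVE_RELEASE 0u   /* assumed vector code: nothing pending */

    extern uint32_t t2_vector_read(void);  /* single read; the IACK PALs run
                                              the 8259 double-acknowledge
                                              sequence behind it */
    extern void dispatch_interrupt(uint32_t vector);

    void service_io_interrupt(void)
    {
        uint32_t vec = t2_vector_read();

        /* If another CPU already serviced the request, the T2 answers
         * on behalf of the 8259s, sparing this CPU a wasted trip
         * across two I/O buses. */
        if (vec == PASSIVE_RELEASE)
            return;              /* resume other, more important tasks */

        dispatch_interrupt(vec);
    }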
After the designers implemented all the features described above, they needed to address the problem of how to deal with all the slow, highly volatile, "off-the-shelf" parts. To integrate these components into the I/O subsystem, they invented the standard I/O module.

The Standard I/O Module

As part of the development effort of the I/O subsystem, the engineering team faced the challenge of integrating several inexpensive, low-performance, off-the-shelf, PC-oriented I/O functions (e.g., TOY clock, keyboard, mouse, speaker) into a high-performance Alpha multiprocessing system, without affecting the higher performing architectural resources. The multilevel I/O bus structure served to alleviate the performance issues, but the development of a PC-style I/O subsystem with off-the-shelf components involved inherent risk and challenge.

To reduce the risks inherent with using new and unfamiliar devices, such as the PCI-to-EISA bridge chip set, we chose to build an I/O module (called the standard I/O module) that plugs into the AlphaServer 2100 system backplane and contains the PCI-to-EISA bridge, associated control logic, controllers for mouse, keyboard, printer, and floppy drive as well as the integral Ethernet and SCSI controllers. Without this plug-in module, fixing any problems with the PCI-to-EISA bridge chip set or any of the supporting logic would have required a backplane upgrade, which is a costly and time-consuming effort.

The standard I/O module is relatively small, inexpensive both to manufacture and to modify, and easily accessible as a field replaceable unit (FRU). As shown in Figure 3, the standard I/O module contains the following logic:

• PCI-to-Ethernet controller chip
• PCI-to-SCSI controller chip
• PCI-to-EISA bridge chips
• Real-time clock and speaker control
• 8-KB, nonvolatile, EISA-configuration, random-access memory (RAM)
• 1-MB BIOS flash ROM
• Keyboard and mouse control
• Parallel port
• FDC floppy controller
• Two serial ports
• I2C support: controller, expander, and ROM
• Intel 8259 interrupt controllers
• Ethernet station address ROM
• Reset and sysevent logic
• Fan speed monitor
• Remote fault management connector
• External PCI subarbiter
• 3.3-volt and -5.0-volt generation


[Figure 3: The Standard I/O Module. Block diagram: on the PCI bus, the Ethernet controller (with station address ROM), the SCSI controller, and the PCI-to-EISA bridge; on the XBUS, the 8259A-2 interrupt controllers, the IACK controller, I2C support, the 8242 keyboard and mouse controller, the fan rotation monitor, and system reset generation; plus the parallel port, two serial ports, the floppy controller, and the -5-V and 3.3-V generation logic.]

For the most part, all these functions were generated by using integrated, off-the-shelf components at commodity pricing. Solutions known to work on other products were used as often as possible. The flash memory resides on the EISA memory bus and is controlled by the PCI-to-EISA bridge chip. A simple multiplexing scheme with minimal hardware enabled the server to address more locations than the bridge chip allowed, as much as a full 1 MB of BIOS ROM. The National PC87312, which provides the serial and parallel port control logic, and the floppy disk controller reside directly on the ISA bus. The rest of the devices are located on the XBUS (an 8-bit buffered slave ISA bus), with control managed by the PCI-to-EISA bridge chips.

In addition, the common PC functions are located at typical PC addresses to ease their integration and access by software. As expected, hardware changes were required to the standard I/O module during its hardware development cycle. However, the standard I/O module, which takes only minutes to replace, provided an easy and efficient method of integrating hardware changes into the AlphaServer 2100 system. We expect the usefulness of the standard I/O module to continue and hope that it will provide an easy and inexpensive repair process.


Summary

The I/O subsystem on the AlphaServer 2100 system contains a two-level hierarchical bus structure consisting of a high-performance PCI bus connected to a secondary EISA bus. The PCI bus is connected to the AlphaServer 2100 system's multiprocessing system bus through the T2 bridge chip. The secondary I/O bus is connected to the PCI bus through a standard bridge chip set. The I/O subsystem demonstrated sufficient flexibility to become the I/O interface for the small pedestal AlphaServer 2000 and the rackmountable version of the AlphaServer 2100 products.

Acknowledgments

The AlphaServer 2100 I/O would not be what it is today without the dedicated, focused efforts of several people. Although not complete, the following list gives credit to those who were the most deeply involved. Thanks to Fidelma Hayes for leading the Sable effort; to Vicky Triolo for the Sable motherboard and her support of the T2 bridge chip effort; to Rachael Berman for her unflagging support of the standard I/O module; to Lee Ridlon for his much needed early conceptual contributions; to Stephen Shirron for driving Sable software I/O issues; to John Bridge for cleaning up the second-pass T2; and to Tom Hunt and Paul Rotker for their contributions to the first-pass T2.

References

1. F. Hayes, "Design of the AlphaServer Multiprocessor Server Systems," Digital Technical Journal, vol. 6, no. 3 (Summer 1994, this issue): 8-19.

2. PCI Local Bus Specification, Revision 2.0 (Hillsboro, OR: PCI Special Interest Group, Order No. 281446-001, April 1993).

3. 82420/82430 PCIset ISA and EISA Bridges (Santa Clara, CA: Intel Corporation, 1993).

4. E. Solari, ISA and EISA, Theory and Operation (San Diego, CA: Annabooks, 1992).

Jeffrey M. Denham    Paula Long    James A. Woodward

DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation

The primary goal for an operating system in a symmetric multiprocessing (SMP) implementation is to convert the additional computing power provided to the system, as processors are added, into improved system performance without compromising system quality. The DEC OSF/1 version 3.0 operating system uses a number of techniques to achieve this goal. The techniques include algorithmic enhancements to improve parallelism within the kernel and additional lock-based synchronization to protect global system state. Synchronization primitives include spin locks and blocking locks. An optional locking hierarchy was imposed to detect latent symmetric multiprocessor synchronization issues. Enhancements to the kernel scheduler improve cache usage by enabling soft affinity of threads to the processor on which the thread last ran; a load-balancing algorithm keeps the number of runnable threads spread evenly across the available processors. A highly scalable and stable SMP implementation resulted from the project.

The DEC OSF/1 operating system is a Digital product based in part on the Open Software Foundation's OSF/1 operating system.1 One major goal of the DEC OSF/1 version 3.0 project was to provide a leadership multiprocessing implementation of the UNIX operating system for Alpha server systems, such as the Digital AlphaServer 2100 product. This paper describes the goals and development of this operating system feature for the version 3.0 release.

The DEC OSF/1 Version 3.0 Multiprocessing Project

Multiprocessing platforms like the AlphaServer 2100 product provide a cost-effective means of increasing the computing power of a server. Additional computing capacity can be obtained at a potentially significant cost advantage by simply adding CPU modules to the system rather than by adding a new system to a more loosely coupled network-server arrangement. An effective execution of this server-scaling strategy requires significant cooperation between the hardware and software components of the system. The hardware must provide symmetrical (i.e., equal) access to system resources, such as memory and I/O, for all processors; the operating system software must provide for enough parallelism in its major subsystems to allow applications to take advantage of the additional CPUs in the system. That is, the operating system cost of multiprocessing must be kept low enough to enable most of an additional CPU's computing power to be used by applications rather than by the operating system's efforts to synchronize simultaneous access to shared memory by multiple processors.

Regarding hardware, the AlphaServer 2100 product and the other Alpha multiprocessing platforms provide the shared memory and symmetric access to the system and I/O buses desired by the operating system designers.2 The design allows all CPUs to share a single copy of the operating system in memory. The hardware also has a load-locked/store-conditional instruction sequence, which provides both a mechanism for atomic updates to shared memory by a single processor and an interprocessor interrupt mechanism.

Given these hardware features, operating system software developers have a great deal of freedom in developing a multiprocessing strategy. The approach used in DEC OSF/1 version 3.0 is called symmetric multiprocessing (SMP), in which all processors can participate fully in the execution of operating system code. This symmetric design contrasts with asymmetric multiprocessing (ASMP), in which all operating system code must be executed on a single designated "master" processor. Such an


a valuable product feature and was a preview of the effort that would be required to adapt the OSF/1 code for the DEC 2000, 4000, and 7000 multiprocessing platforms. Supporting separate preemptive kernels for three versions prior to DEC OSF/1 version 3.0, combined with the developers' experience on other multiprocessing systems (including ULTRIX version 4 and an advanced development project using MIPS multiprocessing platforms), uncovered the following challenges and problems that the team had to overcome to produce a competitive multiprocessing product:

• Supporting two complete sets of kernel binary objects, base and real-time, was burdensome for the operating system engineers and awkward for third-party developers. Therefore, the DEC OSF/1 multiprocessing product team had to strive to ship a single, unified set of kernel binaries. This set should encompass the full range of real-time features, including preemption and POSIX fixed-priority scheduling. For that to be practical, the resulting multiprocessing kernel would have to perform as well on a uniprocessor as the non-SMP kernel.

• Diagnosing locking problems in the preemptive kernel was expensive in developer time. The process required painstaking inspection of the simple-locking source code, which is often disguised in subsystem-specific macros. Locking or unlocking a spin lock multiple times or not unlocking it at all (usually in code loops) would disable preemption well beyond the end of a critical section or enable it before the end. A coherent locking architecture with automated debugging facilities was needed to ship a reliable product on time. The lock-debugging facility present in the original OSF/1 code was probably inadequate for the task.

• Experiments with the real-time kernel revealed unacceptable preemption latencies, especially in funneled code paths. This deficiency indicated that, when moved to a multiprocessing platform, the existing kernel would fail to use additional processors effectively. That is, the kernel would not exhibit adequate parallelism to scale effectively. Clearly, major work was required to significantly increase parallelism in the kernel. This task would likely involve removing most uses of funneling, eliminating some spin locks, and adding other spin locks to create a finer granularity of locking.

Adapting the Base Operating System for Symmetric Multiprocessing

Making the leap from a preemptive uniprocessor kernel to an effective SMP implementation, built from a single set of kernel binaries, required contributions from the OSF/1 version 1.2 and the DEC OSF/1 version 3.0 projects. Fundamental changes were introduced into the system to support SMP.

The basic approach planned by the SMP project team was first to bootstrap the DEC OSF/1 version 1.3 kernel on the existing Alpha multiprocessing platforms. This task was accomplished by funneling all major subsystems to a single processor while stabilizing the underpinnings of the multiprocessing system (i.e., the scheduler, the virtual memory subsystem, the virtual file system, and the hardware support) in the new environment. This approach allowed the team to make progress in understanding the scope of the effort while analyzing the multiprocessing requirements of each kernel subsystem. The in-depth analysis was necessary to identify those subsystems in the kernel that required modifications to run safely and efficiently under SMP. As each subsystem was confirmed to exhibit parallelism or was made parallel, it was unfunneled and thus freed to run on any processor. This process was iterative. If incorrectly parallelized, a subsystem will reveal itself by (1) leaving data incorrectly unprotected and thus open for corruption and (2) developing a deadlock, i.e., a situation in which each of two threads holds a spin lock required by the other thread and thus neither thread can take the lock and proceed.

The efforts at parallelizing the kernel fell into two classes of modification: lock-based synchronization to ensure multiprocessing correctness and algorithmic changes to increase the level of parallelism achieved.

Lock-based Synchronization

The code base on which the DEC OSF/1 product is built, i.e., the Open Software Foundation's OSF/1 software, provides a strong foundation for SMP. The OSF further strengthened this foundation in OSF/1 versions 1.1 and 1.2, when it corrected multiple SMP problems in the code base and parallelized (and thus unfunneled) additional subsystems. As the multiprocessing bootstrap effort continued, the team analyzed and incorporated the OSF/1 version 1.2 SMP improvements into DEC OSF/1 version 3.0. As strong as this starting point was, however, some structures in the system did not receive the appropriate level of synchronization.


The team corrected these problems as they were uncovered through testing and code inspection.

The DEC OSF/1 operating system uses a combination of simple locks, complex locks, elevated SPL, and funneling to guarantee synchronized access to system resources and data structures. Simple locks, SPL, and funneling were described briefly in the earlier discussion of preemption. Complex locks, like elevated SPL, are used in both uniprocessor and multiprocessor environments. These locks are usually sleep locks (threads can block while they wait for the lock), which offer additional features, including multiple-reader/single-writer access and recursive acquisition. An example of the use of each synchronization technique follows:

• A simple lock is used to protect the kernel's callout (timer) queue. In an SMP environment, multiple threads can update the callout queue at the same time, as each of them adds a timer entry to the queue. Each thread must obtain the callout lock before adding an entry and release the lock when done. The callout simple lock is also a good example of synchronization under SPL in multiprocessing because the callout queue is scanned by the system clock ISR. Therefore, before locking the callout lock, a thread must raise the SPL to the clock's IPL. Otherwise, the thread holding the callout lock at an SPL of zero can be interrupted by the clock ISR, which will in turn attempt to take the callout lock. The result is a permanent deadlock.

• A complex lock protects the file system directory structure. A blocking lock is required because the directory lock holder must perform I/O to update the directory, which itself can block. Whenever blocking can occur while a lock is held, a complex lock is required.

• Funneling is used to synchronize access to the ISO 9660 CD-ROM file system.7 The decision to funnel this file system was largely due to limitations in the DEC OSF/1 version 3.0 schedule; however, the file system is a good choice for funneling because of its generally slow operation and light usage.
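The callout example above follows the classic raise-SPL-then-lock discipline: block the clock interrupt locally first, then take the spin lock. The sketch below restates it in C with representative primitive names (splclock, simple_lock); the DEC OSF/1 kernel's actual interfaces differ in detail.

    /* Sketch of the callout-queue locking discipline described above.
     * Primitive and structure names are representative, not the exact
     * kernel API. */
    struct slock { volatile int locked; };
    struct callout;

    extern int  splclock(void);               /* raise SPL to the clock's IPL */
    extern void splx(int s);                  /* restore the previous SPL     */
    extern void simple_lock(struct slock *);
    extern void simple_unlock(struct slock *);
    extern void callout_enqueue(struct callout *);

    struct slock callout_lock;

    void timeout_insert(struct callout *c)
    {
        /* Raise SPL first: if the clock ISR fired while this CPU held
         * callout_lock at an SPL of zero, the ISR would spin on a lock
         * its own CPU holds, a permanent deadlock. */
        int s = splclock();
        simple_lock(&callout_lock);

        callout_enqueue(c);                   /* add the timer entry */

        simple_unlock(&callout_lock);
        splx(s);
    }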

To ensure adequate performance and scaling as processors are added to the system, an SMP implementation must provide for as much parallelism through the kernel as possible. The granularity of locks placed in the system has a major impact on the amount of parallelism obtained. During multiprocessing development, locking strategies were designed to

• Reduce the total number of locks per subsystem
• Reduce the number of locks taken per subsystem operation
• Improve the level of parallelism throughout the kernel

At times, these goals clashed: enhancing parallelism usually involves adding a lock to some structure or code path. This outcome conflicts with the goal of reducing lock counts. Consequently, in practice, the process of successfully parallelizing a subsystem involves striking a balance between lock reduction and the resulting increase in lock granularity. Often, benchmarking different approaches is required to fine-tune this balance.

Several general trends were uncovered during lock analysis and tuning. In some cases locks were removed because they were not needed; they were the products of overzealous synchronization. For example, a structure that is private to a thread may require no locking at all. Moreover, a data element that is read atomically needs no locking. An example of lock removal is the gettimeofday() system call, which is used frequently by DBMS servers. The system call simply reads the system time, a 64-bit quantity, and copies it to a buffer provided by the caller. The original OSF/1 system call, running on a 32-bit architecture, had to take a simple lock before reading the time to guarantee a consistent value. On the Alpha architecture, the system call can read the entire 64-bit time value atomically. Removing the lock resulted in a 40 percent speedup.

In other cases, analyzing how structures are used revealed that no locking was needed. For example, an I/O control block called the buf structure was being locked in several device drivers while the block was in a state that allowed only the device driver to access it. Removing these unnecessary locks saved one complex and one simple locking sequence per I/O operation in these drivers.

Another effective optimization involved postponing locking until a thread determined that it had actual work to do. This technique was used successfully in a routine frequently called in a transaction processing benchmark. The routine, which was locking structures in anticipation of following a rarely used code path, was modified to lock only when the uncommon code path was needed. This optimization significantly reduced lock overhead.


To improve parallelism across the system, the DEC OSF/1 SMP development team modified the lock strategies in numerous other cases.

Algorithm Changes

In some instances, the effective migration of a subsystem to the multiprocessing environment required significant reworking of its fundamental algorithms. This section presents three examples of this work. The first example involves the rework of the process management subsystem; the second example is a new technique for a thread to refer to its own state; and the third example deals with enhancements in translation buffer coherency or "shootdown."

Managing Processes and Process State

Early versions of the DEC OSF/1 software maintained a set of systemwide process lists, most notably proc (static proc structure array), allproc (active process list), and zomproc (zombie process list). These lists tend to be fairly long and are normally traversed sequentially. Operations involving access to these lists include process creation (fork()), signal posting, and process termination. The original OSF/1 code protected these process lists and the individual proc structures themselves by means of funneling. This meant that virtually every system call that involved process state, such as exit(), wait(), ptrace(), and sigaction(), was also forced into a single funnel. Experience with real-time preemption indicated that this approach would exact excessive multiprocessing costs. Although it is possible to protect these lists with locks, the development team decided that this basic portion of the kernel must be optimized for maximum multiprocessing performance. The OSF also recognized the need for optimization; they addressed the problem in OSF/1 version 1.2 by adopting a redesign of the process management developed for their Multimax systems by Encore Computer Corporation. The DEC OSF/1 team adopted and enhanced this design for handling process lists, process management system calls, and signal processing.

The redesign replaces the statically sized array of proc structures with an array of smaller process identification (PID) entry structures. Each PID entry structure potentially points to a dynamically allocated proc structure. Under this new scheme, finding the proc structure associated with a user PID has been reduced to hashing the PID value to an index into the PID entry array. The process state associated with that PID (active, zombie, or nonexistent) is maintained in the PID entry structure. This allows process structures to be allocated dynamically, as needed, rather than statically at boot time, as before. Simple locks are also added to the process structure to allow multiple threads in the process to perform process management system calls and signal handling concurrently. These changes allowed process management funneling to be removed entirely, which significantly improved the degree of parallelism in the process management subsystem.
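The PID-entry scheme reduces process lookup to an array index. A sketch under assumed sizes and field names (not the DEC OSF/1 structure layout):

    /* Sketch of the PID entry array described above.  The table size,
     * hash, and field names are illustrative. */
    #define PIDTAB_SIZE 1024                  /* assumed table size */

    enum pid_state { PID_NONEXISTENT, PID_ACTIVE, PID_ZOMBIE };

    struct proc;                              /* allocated dynamically */

    struct pid_entry {
        int            pid;                   /* user-visible PID */
        enum pid_state state;
        struct proc   *procp;                 /* NULL until allocated */
    };

    static struct pid_entry pid_table[PIDTAB_SIZE];

    /* Finding the proc structure for a user PID is reduced to hashing
     * the PID value to an index into the PID entry array. */
    struct proc *pid_to_proc(int pid)
    {
        struct pid_entry *pe = &pid_table[pid % PIDTAB_SIZE];

        if (pe->state == PID_ACTIVE && pe->pid == pid)
            return pe->procp;
        return (struct proc *)0;              /* zombie or nonexistent */
    }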
kernel must be optimized fo r maximum multi­ Ad ding multiprocessing support meant changing OSF processing performance. The also recognized the CPU number from a constant ro the resu lt of the need for optimization; they addressed the prob­ the WHAMI ("Who am P") PA Lcode call to get the lem in OSF/1 version 1.2 by adopting a redesign current CPU number. (PAL.code is the operating­ of the process management developed fo r their system-specific privileged architecture library Multimax systems by Encore Computer Corpora­ that provides control over interrupts, exceptions. tion. The DEC OSF/ I team adopted and enh:mced context switching, etc H) this design for hand ling process lists, process man­ Using such global arrays fo r accessing the current agement system calls, and signal processing. thread's state presented three shortcomings: The redesign replaces the statically sized array of proc structures with an array of smaller process 1. The \\ fJ-IAMIPA Lcode call added a minimum over­ identification (PID) entry structures. Each P[J) entry head of 21 machine cycles on rhe AlphaServer structure potentially points ro a dynamically allo­ 2100 server, nor including fu rther overhead c.lue cated proc structure Under this new scheme, find­ to cache misses or instruction stream stalls. The ing the proc structure associated with a user PID multiprocessing team feltthat this was too large has been reduced to hashing the PID value to an a performance price to pay.


2. Allowing multiple CPUs to write sequential pointers caused cache thrashing and extra overhead during context switching.

3. Indexing by CPU number was not a safe practice when kernel-mode preemption is enabled. A thread could switch processors in the middle of an array access, and the wrong pointer would be fetched. Providing additional locking to prevent this had unacceptable performance implications because the operation is so common.

These problems convinced the team that a new algorithm was required for accessing the current thread's state.

The solution selected was modeled on the way the OpenVMS VAX system uses the processor interrupt stack pointer to derive the pointer to per-CPU state.9 In the OSF/1 system, each thread has its own kernel stack. By aligning this stack on a power-of-two boundary, a simple masking of the stack pointer yields a pointer to the per-thread data, such as the process control block (PCB) and uthread structure. Any data item in the per-thread area can be accessed with the following code sequence:

    lda  r16, MASK        # Get mask value
    bic  sp, r16, r0      # Mask stack pointer to point to stack base
    ldq  rx, OFFSET(r0)   # Add offset to base and fetch item

Accessing thread state using the kernel stack pointer solves all three problems with CPU-number-based indexing. First, this technique has very low overhead; accessing the current thread's data involves only a simple masking operation and a read operation. Second, using the kernel stack pointer incurs no extra overhead during context switching because the pointer has to be loaded for other uses. Third, because thread stack areas are distinct pages, no cache conflicts exist between threads running on different processors. Finally, the data access can be preempted at any point, and the correct pointer is still fetched. No processor-dependent state is used to access the current thread state.
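In C, the same masking operation can be expressed roughly as follows. This is a sketch, not the actual DEC OSF/1 definition; the stack size and the layout of the per-thread area are assumptions, and the address of a local variable stands in for the current stack pointer.

    /* Assumed power-of-two kernel stack size; the per-thread data
     * (PCB and uthread structure) is kept at the base of each stack. */
    #define KSTACK_SIZE  (2 * 8192)
    #define KSTACK_MASK  ((unsigned long)(KSTACK_SIZE - 1))

    struct thread_area {
        struct pcb     *ta_pcb;       /* process control block */
        struct uthread *ta_uthread;   /* per-thread UNIX state */
    };

    /* Mask the stack pointer down to the stack base to locate the
     * per-thread area. */
    static struct thread_area *current_thread_area(void)
    {
        unsigned long sp = (unsigned long)&sp;   /* approximate SP */
        return (struct thread_area *)(sp & ~KSTACK_MASK);
    }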
Interprocessor Translation Lookaside Buffer Shootdown

Alpha processors employ translation lookaside buffers (TLBs) to speed up the translation of virtual-to-physical mappings. The TLB caches entries (PTEs) that contain virtual-to-physical address mappings and access control information. Unlike data cache coherency, which the hardware maintains, TLB coherency is a task of the software. The DEC OSF/1 system uses an enhanced version of the TLB shootdown algorithm developed for the Mach kernel to maintain TLB coherency.10 First, a modification to the original shootdown algorithm was needed to implement the Alpha architecture's address space numbers (ASNs). Second, a synchronization feature of the original algorithm was removed entirely to enhance shootdown performance. This feature provided synchronization for architectures in which the hardware can modify PTEs, such as the VAX platform; the added protection is unnecessary for the Alpha architecture.

The final shootdown algorithm is as follows. The physical map (PMAP) is the software structure that holds the virtual-to-physical mapping information. Each task within the system has a PMAP; operating system mappings have a special kernel PMAP. Each PMAP contains a list of processors currently using the associated address space. To initiate a virtual-to-physical translation change, a processor (the initiator) first locks the PMAP to prevent any other threads from modifying it. Next, the initiator updates the PTE mapping in memory and flushes the local TLB. The processor then sends an interprocessor interrupt to all other processors (the responders) that are currently active in the same address space. Each responder needs to acknowledge the initiator and invalidate its own mapping. Once all responders are accounted for, the initiator is free to continue with the knowledge that all TLBs are coherent on the system. The initiator marks nonactive processors' ASNs inactive, spins while it waits for other processors to check in, and then unlocks the PMAP. Figure 1 shows this final TLB shootdown algorithm as it progresses from the initiating processor to the potential responding processors.


Initiator:
1. Lock the PMAP.
2. Update the translation map (PTE).
3. Invalidate the processor TLB entry.
4. Send an interprocessor interrupt to all processors that are using the PMAP.
5. Mark the nonactive processors' ASNs inactive.
6. Spin while waiting for other processors to check in.
7. Unlock the PMAP.

Responders:
1. Acknowledge the shootdown.
2. Invalidate the processor TLB entry.
3. Return from the interrupt.

Figure 1  Translation Lookaside Buffer Shootdown Algorithm
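Rendered as C-like pseudocode, the two sides of Figure 1 look roughly as follows. Every type and helper name here is an illustrative stand-in, not a DEC OSF/1 kernel interface.

    typedef unsigned long pte_t;
    typedef unsigned long cpumask_t;

    struct pmap {
        int                 lock;        /* simple lock guarding the map */
        cpumask_t           cpus_using;  /* CPUs active in this space    */
        volatile cpumask_t  pending;     /* responders yet to check in   */
    };

    extern void simple_lock(int *), simple_unlock(int *);
    extern void tbi_local(pte_t *);      /* flush one local TLB entry    */
    extern void send_shootdown_ipi(cpumask_t, pte_t *);
    extern void mark_inactive_asns(struct pmap *);
    extern void atomic_clear_mask(volatile cpumask_t *, cpumask_t);
    extern int  this_cpu(void);
    #define CPU_BIT(c) (1UL << (c))

    /* Initiator: change one translation, then make all TLBs coherent. */
    void pmap_change_mapping(struct pmap *pmap, pte_t *pte, pte_t new_pte)
    {
        simple_lock(&pmap->lock);               /* 1. lock the PMAP        */
        *pte = new_pte;                         /* 2. update PTE in memory */
        tbi_local(pte);                         /* 3. flush the local TLB  */
        pmap->pending = pmap->cpus_using & ~CPU_BIT(this_cpu());
        send_shootdown_ipi(pmap->pending, pte); /* 4. interrupt responders */
        mark_inactive_asns(pmap);               /* 5. idle ASNs handled lazily */
        while (pmap->pending != 0)
            ;                                   /* 6. wait for check-ins   */
        simple_unlock(&pmap->lock);             /* 7. unlock the PMAP      */
    }

    /* Responder: runs at interrupt level on each targeted processor. */
    void shootdown_intr(struct pmap *pmap, pte_t *pte)
    {
        atomic_clear_mask(&pmap->pending, CPU_BIT(this_cpu())); /* acknowledge */
        tbi_local(pte);                         /* invalidate own TLB entry,   */
    }                                           /* then return from interrupt  */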

Developing the Lock Package

Key to meeting the performance and reliability goals for the multiprocessing portion of the DEC OSF/1 version 3.0 release was the development of a lock package with the following characteristics:

• Low execution and memory overhead

• Flexible support for both uniprocessor and multiprocessor platforms, with and without real-time preemption

• Automated debugging facilities to detect incorrect locking practices at run time

• Statistical facilities to track the number of locks used, how many times a lock is taken, and how long threads wait to obtain locks

Of course, the overall role of the lock package is to provide a set of synchronization primitives, that is, the simple and complex locks described in earlier sections. To support kernel-mode thread preemption, DEC OSF/1 version 1.0 had extended the lock package originally delivered with OSF/1 version 1.0. Early in the DEC OSF/1 version 3.0 project, the development team extended the package again to optimize its performance and to add the desired debugging and statistical features.

As previously noted, a major goal for DEC OSF/1 version 3.0 was to ship a single version of its kernel objects, instead of the base and real-time sets of previous releases. Therefore, simple locks would have to be compiled into the kernel, even for kernels that would run only on uniprocessor systems. Achieving this goal required minimizing the size of the lock structure; it would be unacceptable to have hundreds of kilobytes (KB) of memory dedicated to lock structures in systems that did not use such structures. Further, the simple lock and unlock invocations required by the multiprocessing code would have to be present for all platforms, which would raise serious performance issues for uniprocessor systems. In fact, in the original OSF/1 lock package, the CPU overhead cost of compiling in the lock code was between 1 and 20 percent. Compute-intensive benchmarks showed the cost to be less than 5 percent, but the cost for multiuser benchmarks was greater than 10 percent, which represents an unacceptable performance degradation. To meet the goal of a single set of binaries, the development team had to enhance the lock package to be configurable at boot time. That is, the package needed to be able to tailor itself to fit the configuration and real-time requirements of the platform on which it would run.

The lock package supplied by the OSF/1 system was further deficient in that it did not support error checking when locks were asserted. This deficiency left developers open to the most common tormentor of concurrent programmers, i.e., deadlocks. Without error checking, potential system hangs caused by locks being asserted in the wrong order could go undetected for years and be difficult to debug. A formal locking order, or hierarchy, for all locks in the system had to be established, and the lock package needed the ability to check the hierarchy on each lock taken.

These needs were met by introducing the notion of lock mode to the lock package. Developers defined the following five modes and associated roles:

• Mode 0: No lock operations; for production uniprocessor systems

• Mode 1: Lock counting only to manage kernel preemption; for production real-time uniprocessor systems

• Mode 2: Locking without kernel preemption; for production multiprocessing systems

• Mode 3: Locking with kernel preemption; for production real-time multiprocessing systems

• Mode 4: Full lock debugging with or without preemption; for any development system


The default uniprocessor lock mode is 0; the multiprocessing default is lock mode 2. Both selections favor non-real-time production systems. The system's lock mode, however, can be selected at boot time by a number of mechanisms. Lock modes are implemented through a dynamic lock configuration scheme that essentially installs the appropriate set of lock primitives for the selected lock mode. Installation is realized by patching the compiled-in function calls, such as simple_lock(), to dispatch to the corresponding lock primitive for the selected lock mode. This technique avoids the overhead of dispatching indirectly to different sets of lock primitives for each call, based on the lock mode.

The compiled-in lock function calls to the lock package are all entry points that branch to a call-patching routine called simple_lock_patch(). This routine changes the calling machine instruction either to be patched out (for lock mode 0) or to branch to the corresponding primitive in the appropriate set of actual primitives, and then branches there (for lock modes 1 through 4). Thus, the overhead for dynamically switching between the versions of simple lock primitives occurs only once for each code path. In the case of lock mode 0, calls to simple lock primitives are "back patched" out. Under this model, uniprocessor systems pay a one-time cost to invoke the simple lock primitives, after which the expense of executing a lock primitive is reduced to executing a few no-op instructions where the code for the lock call once resided.

To address memory consumption issues and to provide better system debug capabilities, the developers reorganized the lock data structures around the concept of the lockinfo structure. This structure is an encapsulation of the lock's ordering (hierarchical relationship) with surrounding locks and its minimum SPL requirement. Lock debugging information and the lock statistics were decoupled from the lock structures themselves. To facilitate the expression of a lock hierarchy, the developers introduced the concept of classes and instances. A lock class is a grouping of locks of the same type. For example, the process structure lock constitutes a lock class. A lock instance is a particular lock of a given class. For example, one process structure simple lock is an instance of the class process structure lock. Error checking and statistics-gathering are performed on a lock-class basis and only in lock mode 4.

Decoupling the lock debugging information from the lock itself significantly reduced the sizes of the simple and complex lock structures to 8 and 32 bytes, respectively. Embedded in both structures is a 16-bit index into the lockinfo structure table for that particular lock class. The lockinfo structure is dynamically created at system startup in lock mode 4. All classes in the system are assigned a relative position in a single unified lock hierarchy. A lock class's position in the lockinfo table is also its position in the lock hierarchy; that is, locks must be taken in the order in which they appear in the table. Lock statistics are also maintained on a per-class basis with separate entries for each processor. Keeping lock statistics per processor and separating this information by cache blocks eliminates the need to synchronize lock-primitive access to the statistics. This design, which is illustrated in Figure 2, prevents negative cache effects that could result from sharing this data.

[Figure 2: Lock Structure. Lock instances carry an index to their lock class entry in the lockinfo table, which in turn references per-CPU lock statistics.]
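The layout that this description implies might look like the following C sketch. Field names, and any widths beyond those stated in the text (8-byte simple lock, 32-byte complex lock, 16-bit lockinfo index), are illustrative.

    struct lockinfo;                      /* one entry per lock class      */

    struct simple_lock {                  /* 8 bytes total                 */
        unsigned int   sl_state;          /* lock bit, owning CPU, etc.    */
        unsigned short sl_class;          /* 16-bit lockinfo table index   */
        unsigned short sl_unused;
    };

    struct lockinfo {
        unsigned short li_position;       /* place in the unified lock     */
                                          /* hierarchy (== table index)    */
        unsigned short li_minspl;         /* minimum SPL for this class    */
        struct lockstats *li_stats;       /* per-CPU, cache-block-separated */
                                          /* statistics (lock mode 4 only) */
    };

    extern struct lockinfo lockinfo_table[];  /* built at startup, mode 4  */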
Once this powerful lock package was operational, developers analyzed the lock design of their kernel subsystems and attempted to place the locks used into classes in the overall system lock hierarchy. The position of a class depends on the order in which its locks are taken and released in relation to other locks in the same code path and in the system. At times, this static lock analysis revealed problems in existing lock protocols, in which locks were taken in varying orders at different points in the code. Clearly, the lock protocol needed to be reworked to produce a consistent order that could be codified in the hierarchy. Thus, the exercise of producing an overall lock hierarchy resulted in a significant cleanup of the original multiprocessing code base.


To add a new lock to the system, a developer would have to determine the hierarchical position for the new lock class and the minimum SPL at which the lock must be taken.

Running the system in lock mode 4 and exercising code paths of interest provided developers with immediate feedback on their lock protocols. Using the hierarchy and SPL information stored in the run-time lockinfo table, the lock primitives aggressively check for a variety of locking errors, which include the following:

• Locking a lock out of hierarchical order

• Locking a simple lock at an SPL below the required minimum

• Locking a simple lock already held by the caller

• Unlocking an unlocked simple lock

• Unlocking a simple lock owned by another CPU

• Locking a complex lock with a simple lock held

• Locking a complex lock at interrupt level

• Sleeping with a simple lock held

• Locking or unlocking an uninitialized lock

Encountering any of these types of violation results in a lock fault, i.e., a system bug check that records the information required by the developer to quickly track down the lock error.
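Using the structures sketched earlier, the first few of these mode-4 checks could take roughly the following form; the code and names are illustrative, not the shipped primitives.

    /* Illustrative per-CPU state: cs_highest_class tracks the hierarchy
     * position of the highest-ordered lock this CPU currently holds. */
    struct cpu_lockstate { unsigned short cs_highest_class; };

    extern struct cpu_lockstate cpu_lockstate[];
    extern struct lockinfo      lockinfo_table[];
    extern int  this_cpu(void);
    extern int  current_spl(void);
    extern int  lock_held_by(struct simple_lock *, int cpu);
    extern void lock_fault(struct simple_lock *, const char *);

    void simple_lock_mode4(struct simple_lock *lp)
    {
        struct lockinfo *li = &lockinfo_table[lp->sl_class];
        struct cpu_lockstate *cs = &cpu_lockstate[this_cpu()];

        if (cs->cs_highest_class >= lp->sl_class)    /* hierarchy order */
            lock_fault(lp, "lock taken out of hierarchical order");
        if (current_spl() < li->li_minspl)           /* SPL minimum     */
            lock_fault(lp, "simple lock taken below required SPL");
        if (lock_held_by(lp, this_cpu()))            /* recursive take  */
            lock_fault(lp, "simple lock already held by caller");

        /* ...acquire the lock, record the class, update statistics... */
    }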

The reduction in lock sizes and the major enhancement of the lock package enabled the team to realize its goal of a single set of kernel binaries. Benchmarks that compare a pure uniprocessor kernel and a kernel in lock mode 0 that are both running on the same hardware show a less than 3 percent difference in performance, a cost considered by the team to be well worth the many advantages of returning to a unified kernel. Moreover, the debugging capabilities of the lock package with its hierarchical scheme streamlined the process of lock analysis and provided precise and immediate feedback as developers adapted their subsystems to the multiprocessing environment.

Adapting the Scheduler for Multiprocessing

The normal scheduling behavior, i.e., policy, of the OSF/1 system is traditional time-sharing. UNIX time-slices processes based on a time quantum and adjusts process priorities to favor interactive jobs over compute-intensive jobs. To support the POSIX real-time standard, the DEC OSF/1 system incorporates two additional fixed-priority scheduling policies: first in, first out (POLICY_FIFO) and round robin (POLICY_RR).

A time-share thread's priority degrades with CPU usage; the more recent the thread's CPU usage, the more its priority degrades. (Note that OSF/1 scheduling entities are threads rather than processes.) In contrast, a fixed-priority thread never suffers priority degradation. Instead, a POLICY_RR thread runs until it blocks voluntarily, is preempted by a higher-priority thread, or exhausts a quantum (and even then, the round robin scheduling applies only to threads of equal priority). A POLICY_FIFO thread has no scheduling quantum; it runs until it blocks or is preempted. These specialized policies are used by real-time applications and by threads created and managed by the kernel. Examples of these kernel threads include the swapper and paging threads, device driver threads, and network protocol handlers. A feature called thread binding, or hard affinity, was added to DEC OSF/1 version 3.0. Binding allows a user or the kernel to force a thread to run only on a specified processor. Binding supports the funneling feature used by unparallelized code and the bind_to_cpu() system call.

The goal of a multiprocessing operating system in scheduling threads is to run the top N priority threads on N processors at any given time. A simple way to accomplish this would be to schedule threads that are not bound to a CPU in a single, global run queue and schedule bound threads in a run queue local to their bound processors. When a processor reschedules, it would select the highest-priority thread available in the local or the global run queue. Scheduling threads out of a global run queue is highly effective at keeping the N highest-priority threads running; however, two problems arise with this approach:

1. A single run queue leads to contention between processors that are attempting to reschedule, as they race to lock the run queue and remove the highest-priority thread.

2. Scheduling with a global run queue does not take advantage of the cache state that a thread builds on the CPU where it last ran. A thread that migrates to a different processor must reload its state into the new processor's cache. This can substantially degrade performance.

To help preserve cache state and reduce wasteful global run queue contention, the developers enhanced the multiprocessing scheduler by adding two new scheduling models: a soft-affinity scheduling model for time-share threads and a last-processor-preference model for fixed-priority threads.


Under these models, each processor schedules time-share and bound threads in its local run queue, and it schedules unbound fixed-priority threads out of a global run queue.

Fixed-priority threads scheduled from a global run queue are able to run as soon as possible. This behavior is required for high-priority tasks like kernel threads and real-time applications. The last-processor-preference model gives a fixed-priority thread a preference for running on the processor where it last ran; if that processor is busy, the thread runs on the next available processor. Each time-share thread is softly bound to the processor on which it last ran; that is, the thread shows an affinity for that processor. Unlike funneling or user binding, which support hard (mandatory) affinity, soft affinity allows a thread to run elsewhere if it is advantageous, i.e., if such activity balances the load. Otherwise, the softly bound thread tries to return to the processor where it last ran and where its recent cache state may still reside.

Under load, however, a soft affinity model used alone can degenerate to a state where one processor builds up a large queue of threads, leaving the other processors with little to do and thus diminishing the performance of the multiprocessing system. To mitigate these side effects of soft affinity, developers paired the soft affinity feature with the ability to load-balance the runnable threads in the system. To keep the load of time-share jobs spread evenly across processors, the scheduler must periodically load-balance the system. In addition to distributing threads evenly across the local run queues in the system, this load-balancing activity must

• Cost no more in processing time than it saves

• Prevent excessive thread movement among processors

• Recognize and effectively accommodate changes in the job mix

To implement load balancing, each processor maintains a time-share load average, i.e., the average local run queue depth over the last five seconds. Each processor updates its own load average on each system clock tick. Processors also keep track of the time they spend handling interrupts and running fixed-priority threads, which are not accounted for in the local run queue depth. Taking a processor's total potential execution time for a scheduling period and subtracting from this time the interrupt-processing and fixed-priority run times yields the total time available on a processor (processor ticks available) to run time-share threads. In the worst case, a processor could be completely consumed by fixed-priority threads and/or interrupt processing and have no time to run time-share threads. In this extreme case, the scheduler should give no time-share load to that processor.

Adding the time-share load averages of all processors determines the aggregate time-share load for the system. Similarly, summing the processor ticks available yields the total time available on the system for handling time-share tasks. Using this data, the scheduler calculates the desired load for each processor once per second, as follows:

    Desired load = (Processor ticks available / System ticks available) x System time-share load

Load balancing is called for when at least one processor is above and one is below its desired load by a minimal amount. If this condition arises, then those processors under their desired loads are declared to be "out of balance." The next time an out-of-balance processor reschedules, it will try to take a thread from the local run queue of a processor that is above its desired load ("thread stealing"). A processor can declare itself back in balance when its current load is above its desired load or when there are no eligible threads to steal. Figure 3 shows a simplified load-balancing scenario, in which a processor below its desired load steals a thread from a processor above its desired load.

To help preserve the cache benefits of soft affinity, a thread is eligible for stealing only when it has not run on its current processor for some configurable number of clock ticks. After this time has elapsed without a thread running, the chance of it having significant cache state remaining has diminished sufficiently that the thread is more likely to benefit from migrating to another processor and running immediately than from waiting longer to run on its current processor.
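The once-per-second desired-load calculation described above can be expressed directly in C. The structure and function names below are illustrative, not the scheduler's own.

    struct cpu_sched {
        unsigned long ts_load;       /* time-share load average          */
        unsigned long ticks_avail;   /* ticks left after interrupts and  */
                                     /* fixed-priority thread run time   */
        unsigned long desired_load;
    };

    void compute_desired_loads(struct cpu_sched cpu[], int ncpus)
    {
        unsigned long sys_load = 0, sys_ticks = 0;
        int i;

        for (i = 0; i < ncpus; i++) {     /* aggregate system totals     */
            sys_load  += cpu[i].ts_load;
            sys_ticks += cpu[i].ticks_avail;
        }
        for (i = 0; i < ncpus; i++)       /* apportion the time-share    */
            cpu[i].desired_load =         /* load by available ticks     */
                sys_ticks ? (cpu[i].ticks_avail * sys_load) / sys_ticks : 0;
    }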


[Figure 3: Load Balancing. CPU 1 has a current load of 5 threads against a desired load of 4; CPU 2, with a current load of 3, is below its desired load of 4 and steals one thread from CPU 1's local run queue. CPUs 1 through N each have a local run queue in addition to the shared global run queue; on each processor, the highest-priority thread between the local run queues and the global run queue wins the processor.]

Table 1 shows the results of running multiple versions of this program with and without soft affinity and load balancing in operation. Performance benefits appear only when multiple copies of the program begin piling up in the run queues at the 16-job level. Prior to this point, each job keeps running on the same processor, i.e., the cache it had just filled still had its data cached when the program read it back (the ideal case). At the 16-job level, the four processors must be time-shared. The jobs that are running with soft affinity now benefit significantly because they continue to run on the same processor and thus find some of their cache state preserved from when they last ran. The systems that schedule from a global run queue provide no such benefit. Jobs take longer to complete, since they are likely to run on a different processor for each time slice and find no cache state that they can reuse.

The soft affinity and load-balancing features improved many other multiuser benchmarks. For example, a transaction processing benchmark showed a 17 percent performance improvement.

Focusing on Quality

The error-checking focus of the lock package is just one example of how the DEC OSF/1 version 3.0 project focused on the quality and stability of the product. Most members of the multiprocessing team had been involved in an SMP development effort prior to their DEC OSF/1 effort. This past experience, coupled with the difficulties other vendors had experienced with their own multiprocessing implementations, reinforced the need to have a strong quality focus in the SMP project.

Developers took multiple steps to ensure that the SMP solution delivered in DEC OSF/1 version 3.0 would be production quality, including

Table 1  Benefits of Soft Affinity with Load Balancing (SA/LB)

Number of Jobs   Time with SA/LB (Seconds)   Time without SA/LB (Seconds)   Benefit from SA/LB (Percent)
1                25.9                        26.0                           0
4                25.9                        26.0                           0
16               106.5                       141.9                          25


• Code reviews

• Lock debugging

• In-line assertion checking

• Multithreaded test suite development for SMP qualification

The base kernel code was reviewed for multiprocessing correctness. During this review phase, checks were made to ensure that the proper level of synchronization was employed to protect global data structures. Numerous defects were uncovered during this process and corrected. Running code with lock checking enabled provided empirical evidence of the incremental improvements of the multiprocessing implementation.

Beyond code reviews and lock debugging, internal consistency checks (assertions) were coded into the kernel to verify correctness of operations at key points. Assertion checking was enabled during the development process to ensure that the kernel was functioning correctly; it was then compiled out for the production version of the kernel.

In parallel with the operating system development effort, new component tests were designed to force as much concurrency as possible through particular code paths. The core of the test suite is a thread-race library, which consists of a set of routines that can be used to construct multithreaded system-call exercisers. The library provides the ability to commence multiple test instances simultaneously. The individual tests are then combined to form focused subsystem tests and systemwide tests. These tests have been used to uncover multiple race conditions in the operating system.

The UNIX development organization had a four-processor DEC 7000 system deployed in its development environment for more than 7 months prior to releasing the SMP product. This system has been extremely stable, with few complaints from the user community. Extensive internal and external field testing produced similar results.

Measuring Multiprocessing Performance Outcomes

The major functionality delivered with SMP is improved performance through application concurrency. The goal of the SMP project was to provide leadership performance in the areas of compute and data servers. To gauge success in this effort, several industry-standard benchmarks were utilized. These benchmarks include SPECrate_INT92, SPECrate_FP92, and AIM Suite III.

SMP performance is measured in terms of the incremental performance gained as processors are added to the system. Adding processors by no means guarantees increased system performance. Systems that have I/O or memory limitations rarely benefit from introducing additional CPUs. Systems that are compute bound tend to have the largest potential for gain from additional processors. Note that large, monolithic applications tend to see little performance improvement as processors are added because such applications employ little concurrency in their designs.

Performance tuning for SMP was an iterative process that can be characterized as follows:

1. Collect and analyze performance data:
   • CPU utilization across the processors
   • Lock statistics
   • I/O rates
   • Context switch rates
   • Kernel profiling

2. Identify areas that require improvement.

3. Prototype changes.

4. Incorporate changes that demonstrate improvement.

5. Repeat steps 1 through 4.

In reality, the process has two stages for each benchmark. The initial phase was devoted to driving the system to become compute bound. The second phase improved code efficiencies.

Figures 4 and 5 show that, as expected, the SPECrate_INT92 and SPECrate_FP92 benchmarks scale almost linearly. Both of these benchmarks are compute intensive and make only nominal demands on the operating system.

AIM Suite III is a multiuser benchmark that stresses multiple components of an operating system, including the virtual memory system, the scheduler, UNIX pipes, and the I/O subsystem. Figure 6 shows AIM III results for one and four processors, with a resulting scaling factor of 3.27 out of 4. This equates to a greater than 80 percent scaling factor, a figure well within the goals for this benchmark at first multiprocessing release. Efforts to produce still better results are under way.

AIM Suite III scaling appears to be gated by a single test in the AIM Suite III benchmark, i.e.,

Chandrika Kamath
Roy Ho
Dwight P. Manley

DXML: A High-performance Scientific Subroutine Library

Mathematical subroutine libraries for science and engineering applications are an important tool in high-performance computing. By identifying and optimizing frequently used, numerically intensive operations, these libraries help in reducing the cost of computation, enhancing portability, and improving productivity. The Digital eXtended Math Library is a set of public domain and Digital proprietary software that has been optimized for high performance on Alpha systems. In this paper, DXML and the issues related to library software technology are described. Specific examples illustrate how algorithms can be optimized to take advantage of the architecture of Alpha systems. Modern algorithms that effectively exploit the memory hierarchy enable DXML routines to provide substantial improvements in performance.

The Digital eXtended Math Library (DXML) is a set of mathematical subroutines, optimized for high performance on Alpha systems. These subroutines perform numerically intensive subtasks that occur frequently in scientific computing. They can therefore be used as building blocks for the optimization of various science and engineering applications in industries such as chemical, aerospace, petroleum, automotive, electronics, finance, and transportation. In this paper, we discuss the role of mathematical software libraries, followed by an overview of the contents of the Digital eXtended Math Library. DXML includes optimized versions of both the standard BLAS and LAPACK libraries as well as libraries designed and developed by Digital for signal processing and the solution of sparse linear systems of equations. Next, we describe various aspects of library software technology, including the design and testing of DXML subroutines. Using key routines as examples, we illustrate the techniques used in the performance optimization of the library. Finally, we present data that demonstrates the performance improvement obtained through the use of DXML.

The Role of Math Libraries

Early mathematical libraries concentrated on supplementing the functionality provided by the Fortran compilers. In addition to routines such as sin and exp, which were included in the run-time math library, more complicated special functions, linear algebra algorithms, and Fourier transform algorithms were included in the software layer between the hardware and the user application. Then, in the early 1970s, there was a concerted effort to produce high-quality numerical software, with the aim of providing end users with implementations of numerical algorithms that were stable, robust, and accurate. This led to the development of several math libraries, with the public domain LINPACK and EISPACK libraries for the solution of linear and eigen systems setting the standards for future development of math software.1-4

The late 1970s and early 1980s saw the availability of advanced architectures, including vector and parallel computers, as well as high-performance workstations. This added another facet to the development of math libraries, namely, the implementation of algorithms for high efficiency on an underlying architecture.

The effort to produce mathematical software thus became a task of building bridges between numerical analysts, who devise algorithms, computer architects, who design high-performance computer systems, and computer users, who need efficient, reliable software for solving their problems. Consequently, these libraries embody expert knowledge in applied mathematics, numerical analysis, data structures, software engineering, compilers, operating systems, and computer architecture, and


are an important programming tool in the use of high-performance computers.

Modern superscalar RISC architectures with floating-point pipelines, such as the Alpha, have deep memory hierarchies. These include floating-point registers, multiple levels of caches, and virtual memory. The significant latency and bandwidth differences between these memory levels imply that numerical algorithms have to be restructured to make effective use of the data brought into any one level. The performance of an algorithm is also susceptible to the order in which computations are scheduled, as well as to the higher cost associated with some operations, such as floating-point square root and division.

The architecture of the Alpha systems and the technology of the Fortran and C compilers usually provide an efficient computing environment with adequate performance. However, there is often room for improvement, especially in engineering and science applications, where vast amounts of data are processed and repeated operations are performed on each data element. One way to achieve these improvements is through the use of optimized subroutine libraries.

The Digital eXtended Math Library is a collection of routines that performs frequently occurring, numerically intensive operations. By identifying such operations and optimizing them for high performance on Alpha systems, DXML provides several benefits to the computational scientist.

• It allows definition of functions at a sufficiently high level and therefore optimization beyond the capabilities of the compiler.

• It makes the architecture of the systems more transparent to the user.

• It improves productivity by providing easy access to highly optimized, efficient code.

• It enhances the portability of user software through the support of standard libraries and interfaces.

• It promotes good software engineering practice and avoids duplication of work by identifying and optimizing common functions across several application areas.

Overview of DXML

DXML contains almost 400 user-callable routines, optimized for Alpha systems.5 This includes both software developed by Digital as well as the BLAS and LAPACK libraries. Most routines are available in four versions: real single precision, real double precision, complex single precision, and complex double precision.

DXML is available on both OpenVMS and DEC OSF/1 operating systems. Its routines can be called from either Fortran or C, provided the difference in array storage between these languages is taken into account. DXML is available as a shareable library, with a simple interface, enabling easy access to the routines. On DEC OSF/1 systems, DXML supports the IEEE floating-point format. On OpenVMS systems, either the IEEE floating-point format or the VAX F-float/G-float format can be selected.

DXML routines can be broadly categorized into the following four areas:

• BLAS. The Basic Linear Algebra Subroutines include the standard BLAS and Digital enhancements.

• LAPACK. The Linear Algebra PACKage includes linear and eigen-system solvers.

• Signal processing. This includes fast Fourier transforms (FFTs), convolution, and correlation.

• Sparse linear system solvers. These include direct and iterative solvers.

Of these, the signal-processing library and the sparse linear system solvers are designed, developed, and optimized by Digital. The majority of the BLAS routines and the LAPACK library are versions of the public domain standard that were optimized for the Alpha architecture. By supporting the industry-standard interfaces of these libraries, DXML provides both portability of user code and high performance of the optimized software.

We next provide a brief description of the functionality provided by each subcomponent of DXML. Further details are available in the Digital eXtended Math Library Reference Manual.5

VLIB

The vector library consists of seven double-precision routines that perform operations such as sine, cosine, and natural logarithm on data stored as vectors.

BLAS 1

The Basic Linear Algebra level 1 subprograms perform low-granularity operations on vectors that involve one or two vectors as input and return either a vector or a scalar as output.6 Examples of BLAS 1 routines include dot product, index of the maximum element in a vector, and so on.
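As an illustration of the standard interface, here is a call to the BLAS 1 routine DDOT from C. The binding shown (lowercase name with a trailing underscore, all arguments passed by reference) is the usual Fortran convention on DEC OSF/1, but the exact symbol name should be verified against the DXML documentation; for BLAS 2 and BLAS 3 routines, the C caller must also remember that Fortran stores matrices in column-major order.

    /* Fortran BLAS 1 dot product, as typically bound on DEC OSF/1. */
    extern double ddot_(const int *n, const double *x, const int *incx,
                        const double *y, const int *incy);

    double dot(const double *x, const double *y, int n)
    {
        int one = 1;                  /* unit stride through both vectors */
        return ddot_(&n, x, &one, y, &one);
    }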

Digital Te chnical journal I-1JI. 6 No. .) Sutwl/el' 1')94 45 ScientificComputi ng Optimizations fo r Alpha

BLAS 1 Extensions (BLAS 1E)

Digital has extended the functionality of the BLAS 1 routines by including 13 similar operations. These include index of the minimum element of a vector, sum of the elements of a vector, and so on.

BLAS 1 Sparse (BLAS 1S)

DXML also includes nine routines that are sparse extensions of the BLAS 1 routines. Of these, six are from the sparse BLAS 1 standard and three are enhancements.7 These routines operate on two vectors, one of which is sparse and stored in a compressed form. As most of the elements in a sparse vector are zero, both computational time and memory are reduced by storing and operating on only the nonzeros. BLAS 1S routines include construction of a sparse vector from the specified elements of a dense vector, dot product, and so on.

BLAS 2

The BLAS level 2 routines perform operations of a higher granularity than the level 1 routines.8 These include matrix-vector operations such as matrix-vector product, rank-one and rank-two updates, and solutions of triangular systems of equations. Various storage schemes are supported, including general, symmetric, banded, and packed.

BLAS 3

The BLAS level 3 routines perform matrix-matrix operations, which are of a higher granularity than the BLAS 2 operations. These routines include matrix-matrix product, rank-k updates, solution of triangular systems with multiple right-hand sides, and multiplication of a matrix by a triangular matrix. Where appropriate, these operations are defined for matrices that may be general, symmetric, or triangular.9 The functionality of the public domain BLAS 3 library has been enhanced by three additional routines for matrix addition, subtraction, and transpose.

LAPACK

DXML includes the standard Linear Algebra PACKage, LAPACK, which supersedes the LINPACK and EISPACK packages by extending the functionality, using algorithms with higher accuracy, and improving the performance through the use of the optimized BLAS library.10 LAPACK can be used for solving many common linear algebra problems, including solution of linear systems, linear least-squares problems, eigenvalue problems, and singular value problems. Various storage schemes are supported, including general, band, tridiagonal, symmetric positive definite, and so on.

Signal Processing

The signal-processing subcomponent of DXML includes FFTs, convolutions, and correlations. A comprehensive set of Fourier transforms is provided, including

• FFTs in one, two, and three dimensions

• FFTs in forward and inverse directions

• Multiple one-dimensional transforms

There is no limit on the number of elements being transformed, though the performance is best when the data length is a power of 2. Popular storage formats for the input and output data are supported, allowing for possible symmetry in the output data and consequent reduction in the storage required. Further efficiency is provided through the use of the three-step FFT, which separates the process of setting up and deallocating the internal data structures from the actual application of the FFT. This results in significant performance gain when repeated application of FFTs is required.

The convolution and correlation routines in DXML support both periodic (circular) and nonperiodic (linear) definitions. A discrete summing technique is used for calculation. Special versions of the routines allow control of output options such as the range of coefficients computed, scaling of the output, and addition of the output to an array.

All FFT, convolution, and correlation routines are available in both single and double precision and support both real and complex data.
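The three-step structure implies a usage pattern like the sketch below. The identifiers here (fft_setup, fft_apply, fft_free) are placeholders, not the actual DXML entry points.

    typedef struct fft_state *fft_handle;      /* opaque internal tables  */

    fft_handle fft_setup(int n);               /* step 1: build tables    */
    void fft_apply(fft_handle h, const double *in, double *out);
    void fft_free(fft_handle h);               /* step 3: release tables  */

    /* Transform many equal-length frames, paying the setup cost once. */
    void transform_all(double **in, double **out, int n, int nframes)
    {
        fft_handle h = fft_setup(n);
        for (int i = 0; i < nframes; i++)
            fft_apply(h, in[i], out[i]);       /* step 2, repeated        */
        fft_free(h);
    }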

46 vb l. 6 No . .J Summer 1')')4 Digital Te ch11icaljo urnal DXML: A High-performance Scientific Subroutine Library

Sparse Iterative Solvers

DXML includes a set of routines for the iterative solution of sparse linear systems of equations using preconditioned, conjugate-gradient-like methods.11,12 A flexible user interface, based on a matrix-free formulation of the solver, allows a choice among various solvers, storage schemes, and preconditioners. This formulation permits the user to define his or her own preconditioner and/or storage scheme for the matrix. It also allows the user to store the matrix using one of the storage schemes defined by DXML and/or use the preconditioners provided. A driver routine provides a simple interface to the iterative solvers when the DXML storage schemes and preconditioners are used.

The different iterative methods provided are (1) conjugate gradient, (2) least-squares conjugate gradient, (3) biconjugate gradient, (4) conjugate-gradient squared, and (5) generalized minimum residual. Each method supports various applications of the preconditioner: left, right, split, and no preconditioning.

The matrix can be stored in the symmetric diagonal storage scheme, the unsymmetric diagonal storage scheme, or the general storage (by rows) scheme. Three preconditioners are provided for each storage scheme: diagonal, polynomial (Neumann), and incomplete LU with zero diagonals added. A choice of four stopping criteria is provided, in addition to a user-defined stopping criterion. The iteration process can be controlled by setting various input parameters such as the maximum number of iterations, the degree of polynomial preconditioning, the level of output provided, and the tolerance for convergence. These solvers are available in real double precision only.

Sparse Skyline Solvers

The sparse skyline solver library in DXML includes a set of routines for the direct solution of a sparse linear system of equations with the matrix stored using the skyline storage scheme.13,14 The following functions are provided:

• LDU factorization, which includes options for the evaluation of the determinant and inertia, partial factorization, statistics on the matrix, and options for handling small pivots.

• Solve, which includes multiple right-hand sides and solves systems involving either the matrix or its transpose.

• Norm evaluation, including 1-norm, infinity-norm, Frobenius norm, and the maximum absolute value of the matrix.

• Condition number estimation, which includes both the 1-norm and the infinity-norm.

• Iterative refinement, including the componentwise relative backward error and the estimated forward error bound for each solution vector.

• Simple and expert drivers.

This functionality is provided for each of the following storage schemes:

• For symmetric matrices:
   - Profile-in storage mode
   - Diagonal-out storage mode

• For unsymmetric matrices:
   - Profile-in storage mode
   - Diagonal-out storage mode
   - Structurally symmetric profile-in storage mode

These solvers are available in real double precision only.

Software Considerations

As with any software effort, many software engineering issues were encountered during the design and development of DXML. Some issues were specific to math libraries, such as the numerical accuracy and stability of the routines, while others were more general, such as the design of a user interface, testing of the software, error checking, ease of use, and portability. We next discuss some of these key design issues in further detail.

Designing the Interface

The first task in creating a library was to decide the functionality, followed by the design of the interface. This included both the naming of the subroutines as well as the design of the parameter list. For each subcomponent in DXML, the calling sequence was designed to be consistent across all routines in that subcomponent. In the case of the BLAS and LAPACK libraries, the public domain interface was maintained to enable portability of user code.

For the routines added by Digital, the routine names were chosen to indicate the function being performed as well as the precision of the data. Furthermore, the parameter lists were chosen to provide a simple interface, yet allow flexibility for the sophisticated user. For example, the sparse solvers require various real and integer parameters. By using arrays instead of scalar variables, a more concise interface that did not vary from routine to routine was obtained. In addition, all solver routines have arguments for real and integer work arrays, even if these are not used in the code. This not only provides a uniform interface but also acts as a placeholder for work arrays, should they be required in the future.

Accuracy

The numerical accuracy of the routines in DXML is dependent on the problem size as well as the algorithm used, which may vary within a routine. Since performance optimization often changes the order in which a computation is performed, identical results between the DXML routines and the public domain BLAS and LAPACK routines may not occur.


The accuracy of the results obtained is checked by ensuring that the optimized versions of the BLAS and LAPACK routines pass the public domain tests to within the specified tolerance.

Error Processing

Most of the routines in DXML trap usage errors and provide sufficient information so that the user can identify and fix the problem. The low-level, fine-grained computational routines, such as the BLAS level 1, do not provide this function because the overhead of testing and error trapping would seriously degrade the performance.

In the case of BLAS 2, BLAS 3, and LAPACK, the public domain error-reporting mechanism has been maintained. If an input argument is invalid, such as a negative value for the order of the matrix, the routine prints out an error message and stops. If a failure occurs in the course of the algorithm, such as a matrix being singular to working precision, an error flag is set and control is returned to the calling program.

The signal-processing routines report success or failure using a status function value. Further information on the error can be obtained by using a user-callable routine that prints out an error message and an error flag. The user documentation indicates the actions to be taken to recover from the error.

In the case of the sparse solvers, error is indicated by setting an error flag and printing an appropriate message if the printing option is enabled. Control is always returned to the calling program.

Testing

DXML routines are tested for correctness and accuracy using a regression test suite. This includes both test code developed by Digital, as well as the public domain test codes for BLAS and LAPACK. These codes are used not only during the implementation and performance optimization of the routines, but also during the building of the complete library from each of the subcomponents.

The test codes check each routine extensively, including checks for error exits, accuracy of the results obtained, invariance of read-only data, and correctness of all paths through the code. As the complete regression tests take over 20 hours to execute, two input data sets are used: a short one that tests each routine and can be used to make a quick check that all subcomponents compiled and built correctly, and a long data set that tests each path through a routine and is thus more exhaustive.

Many of the routines, such as the FFTs and BLAS 3, are tested using random input data. However, some routines, such as the sparse solvers, operate on specific data structures or matrices with specific properties. These have been tested using matrices generated from the finite difference discretization of partial differential equations or using the matrices in the Harwell-Boeing test suite.15

Another aspect of the DXML regression test package is the inclusion of a performance test gauge. This software tests the performance of key routines in each component of DXML and is used to ensure that the performance of DXML routines is not adversely affected by changes in compilers or the operating systems.

Performance Trade-offs

The design and optimization of the routines in DXML often prompted a trade-off between performance on one hand, and accuracy and generality on the other. Although every effort has been made not to sacrifice accuracy for performance, the reordering of computations during performance optimization may mean that results before optimization are not bit-for-bit identical to the results after optimization. In other cases, performance has been sacrificed to ensure generality of a routine. For example, although the matrix-free formulation of the iterative solvers permits the use of any sparse matrix storage scheme, it could result in a slight degradation in performance due to less efficient use of the instruction cache and the inability to reuse some of the data in the registers.

Performance Optimization

DXML routines have been designed to provide high performance on the Alpha systems.16 These routines are tailored to take advantage of system characteristics such as the number of floating-point registers, the size of the primary and secondary data caches, and the page size. This optimization involves changes to data structures and the use of new algorithms, as well as the restructuring of computation to effectively manage the memory hierarchy.

Several general techniques are used across all DXML subcomponents to improve the performance.17 These include the following:

• Unrolling loops to make better use of the floating-point pipelines

• Reusing data in registers and caches whenever possible


• Managing the data caches effectively so that the cache hit ratio is maximized

• Accessing data using stride-1 computation

• Using algorithms that exploit the memory hierarchy effectively

• Reordering computations to minimize cache and translation buffer thrashing

Although many of these optimizations are done by the compiler, occasionally, for example in the case of the skyline solver, the data structures or the implementation of the algorithm are such that they do not lend themselves to optimization by the compiler. In these cases, explicit reordering of the computations is required.

We next discuss these optimization techniques as used in specific examples. All performance data is for the DEC 3000 Model 900 system using the DEC OSF/1 version 3.0 operating system. This workstation uses the Alpha 21064A chip, running at 275 megahertz (MHz). The on-chip data and instruction caches are each 16 kilobytes (KB) in size, and the secondary cache is 2 megabytes (MB) in size.

In the next section, we compare the performance of DXML BLAS and LAPACK routines with the corresponding public domain routines. Both versions are written in standard Fortran and compiled using identical compiler options.

We next discuss these optimization techniques as ·········· BLAS DDOT used in specific examples. All performance data is - ·- - DXML DDOT fo r the DEC :)000 Model 900 system using the DEC OSF/1 version 3.0 operating system. This work­ Figure I Pe J.formance of BLAS I Routines station uses the Alpha 21064A chip, running at 275 DDOT and DAXP Y megahertz (J'viHz). The on-chip data and instruction caches are each 16 kilobytes (Kn) in size, and the secondary cache is 2 megabytes (MB) in size. Op tim.iza tion of BLAS 2 In the next section, we compare the perfor­ BLAS 2 routines operate on matrix, vector. and mance of DXM L BLAS and LAPACK routines with the scalar data. The data structures are larger and more corresponding publ ic domain rou tines. Both ver­ complex than the BLAS 1 data structures and the sions are written in standard Fortran and compiled operations more complicated. Accordingly, these using identical compiler options. routines lend themselves to more sophisticated optimization techniques. Op timiza tion of BLAS1 Optimized DX.,\1l BLAS 2 routines are typically 20 BLAS 1 rou tines operate on vector and scalar data percent to 100 percent fa ster than the public domain only. As the operations and data structures are sim­ routines. Figure 2 illustrates this performance ple, there is little opportunity to use advanced data improvement fo r the matrix-vector multiply routine. blocking and register re use techniques. Neverthe­ DGE.\1V.and the triangular solve routine. DTRSY.H less. as the plots in Figure I demonstrate, it is pos­ The DXML DGEMY uses a data-blocking technique sible to optimize the BLAS l routines by careful that asymptotically performs two floating-point coding that takes advantage of the data prefetch operations for each memory access, compared to fe atures of the Alpha 21064A chip and avoids data­ the public domain ve rsion, which performs two path-related stal Is. Jh.JH floating-point operations fo r every three memory General ly, the DX ML routines are 10 percent to 15 accesses. 19 This technique is designed to minimize percent fa ster than the corresponding public translation buffe r and data cache misses and maxi­ domain routines. Occasionally, as in the case of mize the use of floating-point registers u' IH 2o The DDOT fo r very short, cache-resident vectors, the same data prefetch considerations used on the BLAS benefits can be much greater. 1 routines are also used on the BLAS 2 routines. The shapes of the plots in Figure 1 rather dramat­ The DXJ\'IL version of the DTRSY routine partitions ically demonstrate the benefits of data caches. Each the problem such that a sma.l l triangular solve oper­ plot shows very high performance fo r short vectors ation is performed. The result of this solve opera­ that reside in the 16-KH, on-chip data cache, much tion is then used in a DGEMV operation to update the lower performance fo r data vectors that reside in remainder of the vector. The process is repeated the 2-MB, on-board secondary data cache, and even unt il the final triangular update completes the lower performance when the vectors reside com­ operation. Thus the DTRSV routine relies heavily on pletely in memory. the optimizations used in the DGEMY rou tine.


Optimization of BLAS 2

BLAS 2 routines operate on matrix, vector, and scalar data. The data structures are larger and more complex than the BLAS 1 data structures, and the operations are more complicated. Accordingly, these routines lend themselves to more sophisticated optimization techniques.

Optimized DXML BLAS 2 routines are typically 20 percent to 100 percent faster than the public domain routines. Figure 2 illustrates this performance improvement for the matrix-vector multiply routine, DGEMV, and the triangular solve routine, DTRSV.8

The DXML DGEMV uses a data-blocking technique that asymptotically performs two floating-point operations for each memory access, compared to the public domain version, which performs two floating-point operations for every three memory accesses.19 This technique is designed to minimize translation buffer and data cache misses and maximize the use of floating-point registers.16,18,20 The same data prefetch considerations used on the BLAS 1 routines are also used on the BLAS 2 routines.

The DXML version of the DTRSV routine partitions the problem such that a small triangular solve operation is performed. The result of this solve operation is then used in a DGEMV operation to update the remainder of the vector. The process is repeated until the final triangular update completes the operation. Thus the DTRSV routine relies heavily on the optimizations used in the DGEMV routine.

[Figure 2: Performance of BLAS 2 Routines DGEMV and DTRSV. The plot shows MFLOPS versus the order of the vectors/matrices for the BLAS and DXML versions of DGEMV and DTRSV.]

[Figure 3: Performance of BLAS 3 Routines DGEMM and DTRSM. The plot shows MFLOPS versus the order of the matrices for the BLAS and DXML versions of DGEMM and DTRSM.]

As with BLAS 1 routines, BLAS 2 routines benefit greatly from data cache. Although the effect is less dramatic for the BLAS 2 routines, Figure 2 clearly shows the three-step profile observed in Figure 1. Best performance is achieved when both matrix and vector fit in the primary cache. Performance is lower but flat over the region where the data fits in the secondary board-level cache. The final performance plateau is reached when data resides entirely in memory.

Optimization of BLAS 3

BLAS 3 routines operate primarily on matrices. The operations and data structures are more complicated than those of BLAS 1 and BLAS 2 routines. Typically, BLAS 3 routines perform many computations on each data element. These routines exhibit a great deal of data reuse and thus naturally lend themselves to sophisticated optimization techniques.

DXML BLAS 3 routines are generally two to ten times faster than their public domain counterparts. The plots in Figure 3 show these performance differences for the matrix-matrix multiply routine, DGEMM, and the triangular solve routine with multiple right-hand sides, DTRSM.9

All performance optimization techniques used for the DXML BLAS 1 and BLAS 2 routines are used on the DXML BLAS 3 routines. In particular, data-blocking techniques are used extensively. Portions of matrices are copied to page-aligned work areas where secondary cache and translation buffer misses are eliminated and primary cache misses are absolutely minimized.

As an example, within the primary compute loop of the DXML DGEMM routine, there are no translation buffer misses, no secondary cache misses, and, on average, only one primary cache miss for every 42 floating-point operations. Performance within this key loop is also enhanced by carefully using floating-point registers so that four floating-point operations are performed for each memory read access. Much of the DXML BLAS 3 performance advantage over the public domain routines is a direct consequence of a greatly improved ratio of floating-point operations per memory access.

The DXML DTRSM routine is optimized in a manner similar to its BLAS 2 counterpart, DTRSV. A small triangular system is solved. The resulting matrix is then used by DGEMM to update the remainder of the right-hand-side matrix. Consequently, most of the DXML DTRSM performance is directly attributable to the DXML DGEMM routine. In fact, the techniques used in DGEMM pervade DXML BLAS 3 routines.
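The blocking idea can be illustrated with a simple tiled matrix multiply in C. The block size, the copy-in to page-aligned work areas, and the register-level scheduling in the real DGEMM are considerably more elaborate than this sketch.

    #define NB 64   /* illustrative block size chosen to fit the caches */

    /* C := C + A*B for n-by-n column-major matrices, processed in
     * NB-wide tiles so each tile of B is reused while cache resident. */
    void dgemm_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int jj = 0; jj < n; jj += NB)
            for (int kk = 0; kk < n; kk += NB)
                for (int j = jj; j < jj + NB && j < n; j++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double bkj = b[k + j * n];   /* reused n times  */
                        for (int i = 0; i < n; i++)
                            c[i + j * n] += a[i + k * n] * bkj;
                    }
    }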


Optimization of LAPACK

The LAPACK subroutine library derives a large part of its high performance by using the optimized BLAS as building blocks.10 The DXML version of LAPACK is largely unmodified from the public domain version. However, in the case of the factorization routine for general matrices, DGETRF, we have introduced changes to the algorithm to improve the performance on Alpha systems.

For example, while the original public domain DGETRF routine uses Crout's method to factor a matrix, the DXML version uses a left-looking method.11 Left-looking methods make better use of the secondary cache and translation buffers than the Crout method. Furthermore, the public domain version of the DLASWP routine swaps a single matrix row across an entire matrix. This is a very bad technique for RISC machines; it causes severe cache and translation buffer thrashing. To avoid this, the DXML version of DLASWP performs all swaps within columns, which makes much better use of the caches and the translation buffer and results in a much improved performance of the DXML DGETRF routine.

The DGETRS routine was not modified. Its performance is solely attributable to use of optimized DXML routines.

Figure 4 shows the benefits of the optimizations made to DGETRF and the BLAS routines. DGETRF makes extensive use of the BLAS 3 DGEMM and DTRSM routines. The performance of DXML DGETRF improves with increasing problem size largely because DXML BLAS 3 routines do not degrade in the face of larger problems.

The plots of Figure 4 also show the performance of DGETRS when processing a single right-hand-side vector. In this case, DTRSV is the dominant BLAS routine, and the performance differences between the public domain and DXML DGETRS routines reflect the performance of the respective DTRSV routines. Finally, although not shown, we note that the performance of DXML DGETRS is much better than the public domain version when many right-hand sides are used and DTRSM becomes the dominant BLAS routine.

[Figure 4: Performance of LAPACK Routines DGETRF and DGETRS (LDA = N + 1). Mflops versus order of vectors/matrices (0 to 1000). Key: BLAS DGETRF, DXML DGETRF, BLAS DGETRS, DXML DGETRS.]
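To illustrate the access pattern behind the DLASWP change, the following C sketch applies a set of row interchanges one column at a time to a column-major array. It shows the idea only and is not the library routine; the array layout and the ipiv convention are assumptions made for the sketch.

    #include <stddef.h>

    /* a is column-major with leading dimension lda; ipiv[i] gives the
     * row to exchange with row i.  Applying all the interchanges within
     * one column before moving to the next keeps every access inside a
     * single contiguous column, instead of sweeping one row across
     * every column of the matrix. */
    void swap_within_columns(int m, int n, double *a, int lda,
                             const int *ipiv)
    {
        for (int j = 0; j < n; j++) {          /* one pass per column */
            double *col = a + (size_t)j * lda;
            for (int i = 0; i < m; i++) {      /* all swaps in that column */
                int p = ipiv[i];
                if (p != i) {
                    double t = col[i];
                    col[i] = col[p];
                    col[p] = t;
                }
            }
        }
    }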

Optimization of the Signal-processing Routines

We illustrate the techniques used in optimizing the signal-processing routines using the one-dimensional, power-of-2, complex FFT.21 The algorithm used is a version of Stockham's autosorting algorithm, which was originally designed for vector computers but works well, with a few modifications, on a RISC architecture such as Alpha.22,23

The main advantage in using an autosorting algorithm is that it avoids the initial bit-reversal permutation stage characteristic of the Cooley-Tukey algorithm or the Sande-Tukey algorithm. This stage is implemented by either precalculating and loading the permutation indices or calculating them on the fly. In either case, substantial amounts of integer multiplications are needed. By avoiding these multiplications, the autosorting algorithm provides better performance on Alpha systems.

This algorithm does have the disadvantage that it cannot be done in-place, resulting in the use of a temporary work space, which makes more demands on the cache than an algorithm that can be done in-place. However, this disadvantage is more than offset by the avoidance of the bit-reversal stage.


The implementation of the FFT on the Alpha makes effective use of the hierarchical memory of the system, specifically, the 31 usable floating-point registers, which are at the lowest, and therefore the fastest, level of this hierarchy. These registers are utilized as much as possible, and any data brought into these registers is reused to the extent possible. To accomplish this, the FFT routines implement the largest radices possible for all stages of the power-of-2 FFT calculation. Radix-8 was used for all stages except the first, utilizing 16 registers for the data and 14 for the twiddle factors.21 For the first stage, as all twiddle factors are 1, radix-16 was used.

Figure 5 illustrates the performance of this algorithm for various sizes. Although the performance is very good for small data sizes that fit into the primary, 16-KB data cache, it drops off quickly as the data exceeds the primary cache. To remedy this, a blocking algorithm was used to better utilize the primary cache.

The blocking algorithm, which was developed for computers with hierarchical memory systems, decomposes a large FFT into two sets of smaller FFTs.24 The algorithm is implemented using four steps:

1. Compute N1 sets of FFTs of size N2.

2. Apply twiddle factors.

3. Compute N2 sets of FFTs of size N1.

4. Transpose the N1 by N2 matrix into an N2 by N1 matrix.

In the above, N = N1 x N2. Steps (1) and (3) use the autosorting algorithm for small sizes. In step (2), instead of precomputing all N twiddle factors, a table of selected twiddle factors is computed and the rest calculated using trigonometric identities.

Figure 5 compares the performance of the blocking algorithm with the autosorting algorithm. Due to the added cost of steps (2) and (4), the maximum computation speed for the blocking algorithm (115 million floating-point operations per second [Mflops] at N = 2^12) is lower than the maximum computation speed of the autosorting algorithm (192 Mflops). The crossover point between the two algorithms is at a size of approximately 2K, with the autosorting algorithm performing better at smaller sizes. Based on the length of the FFT, the DXML routine automatically picks the faster algorithm. Note that the performance of the blocking algorithm drops off once the size of the data and workspace exceeds the 2-MB secondary cache.
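To make the structure of the algorithm concrete, the following C sketch walks through the four steps. It is a schematic under stated assumptions, not the DXML implementation: small_fft stands in for the autosorting transform on cache-resident sizes, and its strided interface, the column-major n1-by-n2 view of the data, and leaving the transposed result in the work array are all choices made for this sketch.

    #include <complex.h>
    #include <math.h>

    #define PI 3.14159265358979323846

    /* Assumed helper: an FFT of length n over elements v[0], v[stride],
     * v[2*stride], ...  (stands in for the autosorting algorithm). */
    extern void small_fft(double complex *v, long n, long stride);

    void four_step_fft(double complex *x, double complex *work,
                       long n1, long n2)
    {
        long n = n1 * n2;   /* x holds n values, viewed as n1-by-n2 */

        /* Step 1: n1 FFTs of size n2, one per row of the view. */
        for (long r = 0; r < n1; r++)
            small_fft(x + r, n2, n1);

        /* Step 2: apply twiddle factors w(r,c) = exp(-2*pi*i*r*c/n);
         * DXML instead tabulates selected factors and derives the rest
         * from trigonometric identities. */
        for (long r = 0; r < n1; r++)
            for (long c = 0; c < n2; c++)
                x[c * n1 + r] *= cexp(-2.0 * PI * I * r * c / (double)n);

        /* Step 3: n2 FFTs of size n1, one per contiguous column. */
        for (long c = 0; c < n2; c++)
            small_fft(x + c * n1, n1, 1);

        /* Step 4: transpose the n1-by-n2 result into n2-by-n1. */
        for (long r = 0; r < n1; r++)
            for (long c = 0; c < n2; c++)
                work[r * n2 + c] = x[c * n1 + r];
    }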

[Figure 5: Performance of 1-D Complex FFT. Mflops versus size of FFT (as a power of 2, from 6 to 20). Key: autosorting, blocking.]

Optimization of the Skyline Solvers

A skyline matrix (Figure 6) is one where only the elements within the envelope of the sparse matrix are stored. This storage scheme exploits the fact that zeros that occur before the first nonzero element in a row or column of the matrix remain zero during the factorization of the matrix, provided no row or column interchanges are made.14 Thus, by storing the envelope of the matrix, no additional storage is required for the fill-in that occurs during the factorization. Though the skyline storage scheme does not exploit the sparsity within the envelope, it allows for a static data structure and is therefore a reasonable compromise between organizational simplicity and computational efficiency.13

In the skyline solver, the system Ax = b, where A is an N by N matrix and b and x are N-vectors, is solved by first factorizing A as A = LDU, where L and U are unit lower and upper triangular matrices and D is a diagonal matrix. The solution x is then calculated by solving, in order, Ly = b, Dz = y, and Ux = z, where y and z are N-vectors.

In our discussion of performance optimization, we concentrate on the factorization routine as it is often the most time-consuming part of an application. The algorithm implemented in DXML uses a technique that generates a column (or row) of the


factorization using an inner product formulation. Specifically, for a symmetric matrix A, let

$$A = \begin{pmatrix} M & u \\ u^T & s \end{pmatrix}
   = \begin{pmatrix} U_M^T & 0 \\ w^T & 1 \end{pmatrix}
     \begin{pmatrix} D_M & 0 \\ 0 & d \end{pmatrix}
     \begin{pmatrix} U_M & w \\ 0 & 1 \end{pmatrix}$$

where the symmetric factorization of the (N - 1) by (N - 1) leading principal submatrix M has already been obtained as

$$M = U_M^T D_M U_M$$

Since the vector u, of length (N - 1), and the scalar s are known, the vector w, of length (N - 1), and the scalar d can be determined as

$$U_M^T D_M w = u$$

and

$$d = s - w^T D_M w$$

The definition of w indicates that a column of the factorization is obtained by taking the inner product of the appropriate segment of that column with one of the previous columns that has already been calculated. Referring to Figure 7, the value of the element in location (i,j) is calculated by taking the inner product of the elements in column j above the element in location (i,j) with the corresponding elements in column i. The entire column j is thus calculated starting with the first nonzero element in the column and moving down to the diagonal entry.

[Figure 6: Skyline Column Storage of a Symmetric Matrix.]

[Figure 7: Unoptimized Skyline Computational Kernel. Columns i and j of the skyline, with the length of the inner product for the evaluation of element (i,j) marked along row i.]

The optimization of the skyline factorization is based on the following two observations:25,26

• The elements of column j, used in the evaluation of the element in location (i,j), are also used in the evaluation of the element in location (i+1,j).

• The elements of column i, used in the evaluation of the element in location (i,j), are also used in the evaluation of the element in location (i,j+1).

Therefore, by unrolling both the inner loop on i and the outer loop on j twice, we can generate the entries in locations (i,j), (i+1,j), (i,j+1), and (i+1,j+1) at the same time, as shown in Figure 8. These four elements are generated using only half the memory references made by the standard algorithm. The memory references can be reduced further by increasing the level of unrolling. This is, however, limited by two factors:

• The number of floating-point registers required to store the elements being calculated and the elements in the columns.

• The length of consecutive columns in the matrix, which should be close to each other to derive full benefit from the unrolling.

Based on these factors, we have unrolled to a depth of 4, generating 16 elements at a time.
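The two-way unrolled kernel can be sketched in C as follows; this is an illustration of the technique, not the DXML source. The pointers are assumed to address the overlapping segments of the four columns involved.

    /* One pass produces the four partial inner products for elements
     * (i,j), (i+1,j), (i,j+1), and (i+1,j+1).  Each loaded operand is
     * used twice, so the loop performs 8 floating-point operations per
     * 4 loads, roughly halving the memory references of four separate
     * inner products. */
    void skyline_dot_2x2(const double *coli0, const double *coli1,
                         const double *colj0, const double *colj1,
                         long len, double s[4])
    {
        double s00 = 0.0, s10 = 0.0, s01 = 0.0, s11 = 0.0;
        for (long k = 0; k < len; k++) {
            double xi0 = coli0[k], xi1 = coli1[k];  /* columns i, i+1 */
            double xj0 = colj0[k], xj1 = colj1[k];  /* columns j, j+1 */
            s00 += xi0 * xj0;   /* toward element (i,  j)   */
            s10 += xi1 * xj0;   /* toward element (i+1,j)   */
            s01 += xi0 * xj1;   /* toward element (i,  j+1) */
            s11 += xi1 * xj1;   /* toward element (i+1,j+1) */
        }
        s[0] = s00; s[1] = s10; s[2] = s01; s[3] = s11;
    }

Unrolling to depth 4, as DXML does, extends the same pattern to 16 accumulators, at the cost of the extra floating-point registers noted above.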


A similar technique is used in optimizing the forward elimination and the backward substitution.

Table 1 gives the performance improvements obtained with the above techniques for a symmetric and an unsymmetric matrix from the Harwell-Boeing collection.15 The characteristics of the matrix are generated using DXML routines and were included because the performance is dependent on the profile of the skyline. The data presented is for a single right-hand side, which has been generated using a known random solution vector.

The results show that for the matrices under consideration, the technique of reducing memory references by unrolling loops at two levels leads to a factor of 2 improvement in performance.

[Figure 8: Optimized Skyline Computational Kernel. Rows i and i+1 and columns j and j+1 of the skyline, with the length of the inner product for the partial evaluation of elements (i,j), (i+1,j), (i,j+1), and (i+1,j+1) marked.]

Summary

In this paper, we have shown that optimized mathematical subroutine libraries can be a useful tool in improving the performance of science and engineering applications on Alpha systems.

Table 1  Performance Improvement in the Solution of Ax = b, Using the Skyline Solver on the DEC 3000 Model 900 System

                                        Example 1                     Example 2
  Harwell-Boeing matrix15               BCSSTK24                      ORSREG1
  Description                           Stiffness matrix of the       Jacobian from a model
                                        Calgary Olympic               of an oil reservoir
                                        Saddledome Arena
  Storage scheme                        Symmetric diagonal-out        Unsymmetric profile-in

  Matrix characteristics
    Order                               3562                          2205
    Type                                Symmetric                     Unsymmetric with
                                                                      structural symmetry
    Condition number estimate           6.37E+11                      1.54E+4
    Number of nonzeros                  81736                         14133
    Size of skyline                     2031722                       1575733
    Sparsity of skyline                 95.98%                        99.10%
    Maximum row (column) height         3334                          442 (442)
    Average row (column) height         570.39                        357.81 (357.81)
    RMS row (column) height             1135.69                       395.45 (395.45)

  Factorization time (in seconds)
    Before optimization                 66.80                         23.12
    After optimization                  35.02                         13.02

  Solution time (in seconds)
    Before optimization                 0.82                          0.32
    After optimization                  0.43                          0.17

  Maximum component-wise relative       0.16E-5                       0.50E-10
  error in solution (see below)

The maximum component-wise relative error is

$$\max_i \; \left| x(i) - \hat{x}(i) \right| / \left| x(i) \right|$$

where x(i) is the i-th component of the true solution and $\hat{x}(i)$ is the i-th component of the calculated solution.


We have described the functionality provided by DXML, discussed various software engineering issues, and illustrated techniques used in performance optimization.

Future enhancements to DXML include symmetric multiprocessing support for key routines, enhancements in the areas of signal processing and sparse solvers, as well as further optimization of routines as warranted by changes in hardware and system software.

Acknowledgments

DXML is the joint effort of a number of individuals over the past several years. We would like to acknowledge the contributions of our colleagues, both past and present. The engineers: Luca Broglio, Richard Chase, Claudio Deiro, Laura Farinetti, Leo Lavin, Ping-Charng Lue, Joe O'Connor, Mark Schure, Linda Tella, Sisira Weeratunga and John Wilson; the technical writers: Cheryl Barabani, Barbara Higgins, Marli McDonald, Barbara Schott and Richard Wolanske; and the management: Ned Anderson, Carlos Baradello, Gerald Haigh, Buren Hoffman, Tomas Lofgren, Vehbi Tasar and David Velten. We would also like to thank Roger Grimes at Boeing Computer Services for making the Harwell-Boeing matrices so readily available.

References

1. W. Cowell, ed., Sources and Development of Mathematical Software (Englewood Cliffs, NJ: Prentice-Hall, 1984).

2. D. Jacobs, ed., Numerical Software-Needs and Availability (New York: Academic Press, 1978).

3. J. Dongarra, J. Bunch, C. Moler, and G. Stewart, LINPACK Users' Guide (Philadelphia: Society for Industrial and Applied Mathematics [SIAM], 1979).

4. B. Smith et al., Matrix Eigensystem Routines-EISPACK Guide (Berlin: Springer-Verlag, 1976).

5. Digital eXtended Math Library Reference Manual (Maynard, MA: Digital Equipment Corporation, Order No. AA-QOMBB-TE for VMS and AA-QONHB-TE for OSF/1).

6. C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, "Basic Linear Algebra Subprograms for Fortran Usage," ACM Transactions on Mathematical Software, vol. 5, no. 3 (September 1979): 308-323.

7. D. Dodson, R. Grimes, and J. Lewis, "Sparse Extensions to the FORTRAN Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, vol. 17, no. 2 (June 1991): 253-263.

8. J. Dongarra, J. DuCroz, S. Hammarling, and R. Hanson, "An Extended Set of FORTRAN Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, vol. 14, no. 1 (March 1988): 1-17.

9. J. Dongarra, J. DuCroz, S. Hammarling, and I. Duff, "A Set of Level 3 Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, vol. 16, no. 1 (March 1990): 1-17.

10. E. Anderson et al., LAPACK Users' Guide (Philadelphia: Society for Industrial and Applied Mathematics [SIAM], 1992).

11. J. Dongarra, I. Duff, D. Sorensen, and H. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers (Philadelphia: Society for Industrial and Applied Mathematics [SIAM], 1991).

12. R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods (Philadelphia: Society for Industrial and Applied Mathematics [SIAM], 1993).

13. C. Felippa, "Solution of Linear Equations with Skyline Stored Symmetric Matrix," Computer and Structures, vol. 5, no. 1 (April 1975): 13-29.

14. I. Duff, A. Erisman, and J. Reid, Direct Methods for Sparse Matrices (New York: Oxford University Press, 1986).

15. I. Duff, R. Grimes, and J. Lewis, "Sparse Matrix Test Problems," ACM Transactions on Mathematical Software, vol. 15, no. 1 (March 1989): 1-14.


16. Alpha AXP Architecture and Systems, Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992).

17. K. Dowd, High Performance Computing (Sebastopol, CA: O'Reilly & Associates, Inc., 1993).

18. DECchip 21064-AA Microprocessor-Hardware Reference Manual (Maynard, MA: Digital Equipment Corporation, Order No. EC-N0079-72, October 1992).

19. J. Dongarra and S. Eisenstat, "Squeezing the Most Out of an Algorithm in FORTRAN," ACM Transactions on Mathematical Software, vol. 10, no. 3 (September 1984): 219-230.

20. R. Sites, ed., Alpha Architecture Reference Manual (Burlington, MA: Digital Press, 1992).

21. H. Nussbaumer, Fast Fourier Transforms and Convolution Algorithms, Second Edition (New York: Springer-Verlag, 1982).

22. D. Bailey, "A High-performance FFT Algorithm for Vector Supercomputers," The International Journal of Supercomputer Applications, vol. 2, no. 1 (Spring 1988): 82-87.

23. P. Swarztrauber, "FFT Algorithms for Vector Computers," Parallel Computing, vol. 1, no. 1 (August 1984): 45-63.

24. D. Bailey, "FFTs in External or Hierarchical Memory," The Journal of Supercomputing, vol. 4, no. 1 (March 1990): 23-35.

25. O. Storaasli, D. Nguyen, and T. Agarwal, "Parallel-Vector Solution of Large-Scale Structural Analysis Problems on Supercomputers," American Institute of Aeronautics and Astronautics (AIAA) Journal, vol. 28, no. 7 (July 1990): 1211-1216.

26. H. Samukawa, "A Proposal of Level 3 Interface for Band and Skyline Matrix Factorization Subroutine," Proceedings of the 1993 ACM International Conference on Supercomputing, Tokyo, Japan (July 1993): 397-406.

Robert H. Kuhn
Bruce Leasure

Sanjiv M. Shah

The KAP Parallelizer for DEC Fortran and DEC C Programs

The KAP preprocessor optimizes DEC Fortran and DEC C programs to achieve their best performance on Digital Alpha systems. One key optimization that KAP performs is the parallelization of programs for Alpha shared memory multiprocessors that use the new capabilities of the DEC OSF/1 version 3.0 operating system with DECthreads. The heart of the optimizer is a sophisticated decision process that selects the best loop to parallelize from the many loops in a program. The preprocessor implements a robust data dependence analysis to determine whether a loop is inherently serial or parallel. In engineering a high-quality optimizer, the designers specified the KAP software architecture as a sequence of modular optimization passes. These passes are designed to restructure the program to resolve many of the apparent serializations that are artifacts of coding in Fortran or C. End users can also annotate their DEC Fortran or DEC C programs with directives or pragmas to guide KAP's decision process. As an alternative to using KAP's automatic parallelization capability, end users can explicitly identify parallelism to KAP using the emerging industry-standard X3H5 directives.

The KAP preprocessor developed by Kuck & Associates, Inc. (KAI) is used on Digital Alpha systems to increase the performance of DEC Fortran and DEC C programs. KAP accomplishes this by restructuring fragments of code that are not efficient for the Alpha architecture. Essentially a super-optimizer, KAP performs optimizations at the source code level that augment those performed by the DEC Fortran or DEC C compilers.1

To enhance the performance of DEC Fortran and DEC C programs on Alpha systems, KAI engineers selected two challenging aspects of the Alpha architecture as KAP targets: symmetric multiprocessing (SMP) and cache memory. An additional design goal was to assist the compiler in optimizing source code for the reduced instruction set computer (RISC) instruction processing pipeline and multiple functional units.

This paper discusses how the KAP preprocessor design was adapted to parallelize programs for SMP systems running under the DEC OSF/1 version 3.0 operating system. This version of the DEC OSF/1 system contains the DECthreads product, Digital's POSIX-compliant multithreading library. The first part of the paper describes the process of mapping parallel programs to DECthreads. The paper then discusses the key techniques used in the KAP design. Finally, the paper presents examples of how KAP performs on actual code and mentions some remaining challenges. Readers with a compiler background may wish to explore Optimizing Supercompilers for Supercomputers for more details on KAP's techniques.2

In this paper, the term directive is used interchangeably to mean directive, when referring to DEC Fortran programs, and pragma, when referring to DEC C programs. The term processor generally represents the system component used in parallel processing. In discussions in which it is significant to distinguish the operating system component used for parallel processing, the term thread is used.

The Parallelism Mapping Process

Figure 1 shows the input modes and major phases of the compilation process. Parallelism is represented at three levels in programs using the KAP preprocessor on an Alpha SMP system. The first two are input to the KAP preprocessor; the third is the


[Figure 1: Parallelism Mapping Process. An ordinary DEC Fortran or DEC C program (implicit parallelism), optionally annotated with KAP guiding directives or KAP assertions (explicit high-level parallelism), enters the KAP preprocessor, which performs dependence analysis and parallelism detection and optimization. The KAP-optimized Fortran or C output file (explicit low-level parallelism) is then compiled by the DEC Fortran or DEC C compiler.]

representation of parallelism that KAP generates. The three levels of parallelism are

1. Implicit parallelism. Starting from DEC Fortran or DEC C programs, KAP automatically detects parallelism.

2. Explicit high-level parallelism. As an advanced feature, users can provide any of three forms: KAP guiding directives, KAP assertions, or X3H5 directives. KAP guiding directives give KAP hints on which program constructs to parallelize. KAP assertions are used to convey information about the program that cannot be described in the DEC Fortran or DEC C language. This information can sometimes be used by KAP to optimize the program. Using X3H5 directives, the user can force KAP to parallelize the program in a certain way.3


3. Explicit low-level parallelism. KAP translates either of the above forms to DECthreads with the help of an SMP support library. (The user could specify parallelism directly, using DECthreads; however, KAP does not perform any optimization of source code with DECthreads. Therefore, the user should not mix this form of parallelism with the others.)

Because the user can employ parallelism at any of the three levels, a discussion of the trade-offs involved with using each level follows.

From DEC Fortran or DEC C Programs

The KAP preprocessor accepts DEC Fortran and DEC C programs as input. Although starting with such programs requires the compilers to intelligently utilize a high-performance SMP system, there are several reasons why this is a natural point at which to start.

• Lots of software. Since DEC Fortran and DEC C are de facto standards, there exists a large base of applications that can be parallelized relatively easily and inexpensively.

• Ease of use. Given the high rate at which hardware costs are decreasing, every workstation may soon have multiple processors. At that point, it will be critical that programming a multiprocessor be as easy as programming a single processor.

• Portability. Many software developers with access to a multiprocessor already work in a heterogeneous networking environment. Some systems in such an environment do not support explicit forms of parallelism (either X3H5 or DECthreads). The developers would probably like to have one version of their code that runs well on all their systems, whether uniprocessor or multiprocessor, and using DECthreads would cause their uniprocessors to slow down.

• Maintainability. Using an intricate programming model of parallelism such as X3H5 or DECthreads makes it more difficult to maintain the software.

KAP produces KAP-optimized DEC Fortran or DEC C as output. This fact is important for the following reasons:

• Performance. Users can leverage optimizations from both Digital's compilers and KAP.

• Integration. Users can employ all of Digital's performance tools.

• Ease of use. Expert users like to "tweak" the output of KAP to fine-tune the optimizations performed.

With KAP Guiding Directives, KAP Assertions, or X3H5 Directives

Although the automatic detection of parallelism is frequently within the range of KAP capabilities on SMP systems, in some cases, as described below, users may wish to specify the parallelism.

• In the SMP environment, coarse-grained parallelism is sometimes important. The higher in the call tree of a program a preprocessor (or compiler, as well) operates, the more difficult it is for a preprocessor to parallelize automatically. Even though the KAP preprocessor performs both inlining and interprocedural analysis, the higher in the call tree KAP operates, the more likely it is that KAP will conservatively assume that the parallelization is invalid.

• Sometimes information that is available only at run time precludes the preprocessor from automatically finding parallelism.

• Occasionally, experts can fine-tune the parallelism to get the highest efficiency for programs that are run frequently.

• For software that is more portable between systems, it is sometimes important to get repeatable parallel performance or to indicate where parallelism has been applied. In such cases, explicit parallelism may be preferable.

Three mechanisms are available to the user for directing KAP to parallelism. The first mechanism uses KAP guiding directives to guide KAP to the preferred way to parallelize the program. The second mechanism uses KAP assertions. The third mechanism uses X3H5-compliant directives to directly describe the parallelism. The first two mechanisms differ significantly from the third. With the first two, KAP analyzes the program for the feasibility of parallelism. With the third, KAP assumes that parallelism is feasible and restricts itself to managing the details of implementing parallelism. In particular, the user does not have to be concerned with either the scoping of variables across processors, i.e., designating which are private and which are shared, or the synchronization of accesses to shared variables.4


KAP guiding directives will not be discussed in this paper. KAP assertions and how they are implemented are discussed later in the section Advanced Ways to Affect Dependences. A description of the X3H5 directives follows.

The X3H5 model of parallelism is well structured; all operations have a begin operation-end operation format. The parallel region construct identifies the fork and join points for parallel processing. Parallel loops identify units of work to be distributed to the available processors. The critical section and one processor section constructs are used to synchronize processors where necessary. Table 1 shows the X3H5 directives as implemented in KAP.

To the DEC OSF/1 Operating System with DECthreads

Although KAP does not optimize programs that use DECthreads directly, there may be some benefits to specifying parallelism explicitly using DECthreads.

• DECthreads allows a user to construct almost any model of parallel processing fairly efficiently. The high-level approaches described above are limited to loop-structured parallel processing. Some applications obtain more parallelism by using an unstructured model. It can even be argued that for some cases, unstructured parallelism is easier to understand and maintain.

• A user who invests the time to analyze exactly where parallelism exists in a program may wish to forego the benefits mentioned above and to capture the parallelism in detail with DECthreads. In that manner, no efficiency is lost because the preprocessor misses an optimization.

• The POSIX threads standard to which DECthreads conforms is available on several platforms. Because this standard is broadly adopted and language independent, it is only slightly less portable than implicit parallelism.

The KAP preprocessor translates a program in which KAP has detected implicit parallelism or a program in which the user explicitly directs parallelism to DECthreads. KAP performs this translation in two steps. First, it translates the internal representation into calls to a parallel SMP support library. Second, the support library makes calls to DECthreads. The SMP support library implements various aspects of X3H5 notation, as can be seen by comparing Tables 1 and 2.

In the parallelism translation phase, KAP significantly restructures a program by moving the code in a parallel region to a separate subroutine. A call to the SMP support library replaces the parallel region. This call references the new subroutine. KAP examines the scope of each variable used in the parallel region and, if possible, converts each variable to a local variable of the new subroutine. Otherwise, the variable becomes an argument to the subroutine so that it can be passed back out of the parallel region.

Converting variables to local variables makes accessing these variables more efficient. A variable that is referenced outside the parallel region cannot be made local and must be passed as an argument.

Shared Memory Multiprocessor Architecture Concerns

Given its parallelism model, the KAP preprocessor requires operating system and hardware support from the system for efficient parallel execution. There are three areas of concern: thread creation and scheduling, synchronization between threads, and data caching and system bus bandwidth.
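The outlining transformation just described can be sketched in C as follows. This is a schematic only: __kmp_fork is the fork entry point listed in Table 2, but its exact interface is not documented here, so the signature, the argument block, and the division of iterations among threads are assumptions made for the sketch.

    /* The body of a parallel region is moved into its own function;
     * locals of the region stay local to that function, and shared
     * data is passed through an argument block. */
    struct region_args { int n; double *a, *b; };

    /* Assumed signature for the support library's fork entry point. */
    extern void __kmp_fork(void (*body)(void *), void *args);

    static void region_body(void *p)     /* the outlined parallel region */
    {
        struct region_args *arg = p;
        /* i is a local of the new subroutine, so each thread gets its
         * own copy; a, b, and n are shared and arrive through the
         * argument block.  Iterations are assumed to be divided among
         * the threads by the support library. */
        for (int i = 0; i < arg->n; i++)
            arg->a[i] = arg->b[i] * 2.0;
    }

    void caller(int n, double *a, double *b)
    {
        struct region_args arg = { n, a, b };
        __kmp_fork(region_body, &arg);   /* replaces the parallel region */
    }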

Table 1  X3H5 Directives As Implemented in KAP

  Function                                         X3H5 Directives

  To specify regions of parallel execution         C*KAP* PARALLEL REGION
                                                   C*KAP* END PARALLEL REGION

  To specify parallel loops                        C*KAP* PARALLEL DO
                                                   C*KAP* END PARALLEL DO

  To specify synchronized sections of code         C*KAP* BARRIER
  such that all processors synchronize

  To specify that all processors execute           C*KAP* CRITICAL SECTION
  sequentially                                     C*KAP* END CRITICAL SECTION

  To specify that only the first processor         C*KAP* ONE PROCESSOR SECTION
  executes                                         C*KAP* END ONE PROCESSOR SECTION


Table 2  KAP SMP Support Library

  C Entry Point Name     Fortran Name   Function                 OSF/1 DECthreads
                                                                 Subroutines Used

  __kmp_enter_csec       mppecs         To enter a critical      pthread_mutex_lock
                                        section
  __kmp_exit_csec        mppxcs         To exit a critical       pthread_mutex_unlock
                                        section
  __kmp_fork             mppfrk         To fork to several       pthread_attr_create,
                                        threads                  pthread_create
  __kmp_fork_active      mppfkd         To inquire if already    (none)
                                        parallel
  __kmp_end              mppend         To join threads          pthread_join,
                                                                 pthread_detach
  __kmp_enter_onepsec    mppbop         To enter a single        pthread_mutex_lock,
                                        processor section        pthread_mutex_unlock
  __kmp_exit_onepsec     mppeop         To exit a single         pthread_mutex_lock,
                                        processor section        pthread_mutex_unlock
  __kmp_barrier          mppbar         To execute a barrier     pthread_mutex_lock,
                                        wait                     pthread_cond_wait,
                                                                 pthread_mutex_unlock
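As an illustration, a barrier of the kind listed in the last row of Table 2 can be built from these primitives roughly as follows. This sketch uses POSIX names and adds a broadcast to wake the waiting threads; it is not the actual support-library code.

    #include <pthread.h>

    /* Sense-reversing barrier built from a mutex and condition variable.
     * The structure is assumed to be initialized elsewhere with the
     * thread count and PTHREAD_MUTEX_INITIALIZER/PTHREAD_COND_INITIALIZER. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  all_here;
        int             count, nthreads, cycle;
    } barrier_t;

    void barrier_wait(barrier_t *b)
    {
        pthread_mutex_lock(&b->lock);
        int my_cycle = b->cycle;
        if (++b->count == b->nthreads) {   /* last arrival releases the rest */
            b->count = 0;
            b->cycle++;
            pthread_cond_broadcast(&b->all_here);
        } else {
            while (my_cycle == b->cycle)   /* guards against spurious wakeups */
                pthread_cond_wait(&b->all_here, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }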

Thread Creation and Scheduling

Thread creation is the most expensive operation. The X3H5 standard minimizes the need for creating threads through the use of parallel regions. The SMP support library goes further by reusing threads from one parallel region to the next. The SMP support library examines the value of an environment variable to determine how many threads to use. The appropriate scheduling of threads onto hardware processors is extremely important for efficient execution. The support library relies on the DECthreads implementation to achieve this. For the most efficient operation, the library should schedule at most one thread per processor.

Synchronization between Threads

In the KAP model of parallelism, threads can synchronize at

• A point where loop iterations are scheduled

• A point where data passes between iterations (for collection of local reduction variables only)

• A barrier point leaving a work-sharing construct

• Single processor sections

Two versions of the SMP support library have been developed: one with spin locks for a single-user environment and the second with mutex locks for a multiuser environment. Either library works in either environment; however, using the spin lock version in a single-user environment yields the most efficient parallelism.

Using spin locks in a multiuser environment may waste processor cycles when there are other users who could use them. Using mutex locks in a single-user environment creates unnecessary operating system overhead. In practice, however, a system may shift from single-user to multiuser and back again in the course of a single run of a large program. Therefore, KAP supports all lock-environment combinations.

Data Caching and System Bus Bandwidth

Multiprocessor Alpha systems support coherent caches between processors.5 To use these caches efficiently, as a policy, KAP localizes data as much as possible, keeping repeated references within the same processor. Localizing data reduces the load on the system bus and reduces the chances of cache thrashing.

When all the processors simultaneously request data from the memory, system bus bandwidth can limit SMP performance. If optimizations enhance cache locality, less system bus bandwidth is used, and therefore SMP performance is less likely to be limited.

KAP Technology

This section covers the issues of data dependence analysis, preprocessor architecture, and the selection of loops to parallelize.


Data Dependence Analysis-The Kernel of Parallelism Detection

DEC Fortran and DEC C have standard rules for the order of execution of statements and expressions. These rules are based on a serial model of program execution. Data dependence analysis allows a compiler to see where this serial order of execution can be modified without changing the meaning of the program.

Types of Dependence

KAP works with the four basic types of dependence:6

1. Flow dependence, i.e., when a program writes a variable before it reads the variable

2. Antidependence, i.e., when a program reads a variable before it writes the variable

3. Output dependence, i.e., when a program writes the same variable twice

4. Control dependence, i.e., when a program statement depends on a previous conditional

Because dependences involve two actions on the same variable, for example, a write and then a read, KAP uses the term dependence arc to represent information flow, in this example from the write to the read.

Loop-related Dependences

When dependences occur within a loop, the control flow relations are captured with direction vector symbols tagged to each dependence arc.2 The transformations that can be applied to a loop depend on what dependence direction vectors exist for that loop. The symbols used in KAP and their meanings are

  =   The dependence occurs within the same loop iteration.

  >   The dependence crosses one or several iterations.

  <   The dependence goes to a preceding iteration of the loop.

  *   The dependence relation between iterations is not clear.

or a combination of the above, for example,

  <>  The dependence is known not to be on the same iteration.

When a dependence occurs in a nested loop, KAP uses one symbol for each level in the loop nest. A dependence is said to be carried by a loop if the corresponding direction vector symbol for that loop includes <, >, or *. In the following program segment

    1   for (i = 1; i <= n; i++) {
    2       temp = a[i];
    3       a[i] = b[i];
    4       b[i] = temp;
        }

there is a flow dependence from statement 2 to statement 4. There is an antidependence from statement 2 to statement 3 and from statement 3 to statement 4. There are control dependences from statement 1 to statements 2, 3, and 4 because executing 2, 3, and 4 depends on the i <= n condition. All these dependences are on the same loop iteration; their direction vector is =.

Some dependences in this program cross loop iterations. Because temp is reused on each iteration, there is an output dependence from statement 2 to statement 2, and there is an antidependence from statement 4 to statement 2. These two dependences are carried by the loop in the program segment and have the direction vector >.

Since these dependences can prevent parallelization, KAP uses various optimizations to eliminate the different dependences. For example, an optimization called scalar renaming removes some but not all antidependences.
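As an illustration of scalar renaming (this fragment is ours, not the paper's), giving the scalar a fresh name in each iteration removes the loop-carried output and antidependences on temp while leaving the same-iteration dependences intact:

    /* temp is reused by every iteration above, creating the carried
     * dependences just described.  Declaring a fresh scalar inside the
     * loop body (equivalently, privatizing temp per iteration) removes
     * them, so only same-iteration dependences remain. */
    void swap_renamed(double *a, double *b, int n)
    {
        for (int i = 1; i <= n; i++) {
            double temp_i = a[i];   /* a fresh scalar per iteration */
            a[i] = b[i];
            b[i] = temp_i;
        }
    }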

Data Dependence Analysis

The purpose of dependence analysis is to build a dependence graph, i.e., the collection of all the dependence arcs in the program. KAP builds the dependence graph in two stages. First, it builds the best possible conservative dependence graph.7 Then, it applies filters that identify and remove dependences that are known to be conservative, based on special circumstances.

What does the phrase "best possible conservative dependence graph" mean? Because the values of a program's variables are not known at preprocessing time, in some situations it may not be clear whether a dependence actually exists. KAP reflects this situation in terms of assumed dependences based on imperfect information. Therefore, a dependence graph must be conservative so that KAP does not optimize a program incorrectly. On the other hand, a dependence graph that is too conservative results in insufficient optimization.
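To make the notion of an assumed dependence concrete, consider the following fragment (ours, not the paper's). Whether the iterations are independent depends on the value of m, which is unknown at preprocessing time, so the conservative graph must record a loop-carried dependence:

    /* Depending on the sign of m, this loop has a loop-carried
     * antidependence (m > 0), a loop-carried flow dependence (m < 0),
     * or no cross-iteration dependence at all (m == 0).  Since m is
     * unknown here, a conservative dependence graph must assume the
     * dependence; a filter or a user assertion can later remove it. */
    void shift(double *a, int n, int m)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i + m] + 1.0;
    }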


In building the best possible dependence graph, KAP uses the following optimizations: constant propagation, variable forward substitution, and scalar expansion. KAP does not, however, leave the program optimized in this manner unless the optimizations will improve performance.

Advanced Ways to Affect Dependences

When there are assumed dependences in the program, KAP may not have enough information to decide on parallelism opportunities. KAP implements two techniques to mitigate the effects of imperfect information at preprocessing time: assertions and alternate code sequences.

Assertions, which are similar to directives in syntax, are used to provide information not otherwise known at preprocessing time. KAP supports many assertions that have the effect of removing assumed dependences. Table 3 shows KAP assertions and their effects.8,9 When the user specifies an assertion, the information contained in the assertion is saved by a data abstraction called the oracle. When an optimization requests that a data dependence graph be built for a loop, the dependence analyzer inquires whether the oracle has any information about certain arcs that it wants to remove.

When accurate information is not known at compile time, a few KAP optimizations generate two versions of the source program loop: one assumes that the assumed dependence exists; the other assumes that it does not exist. In the latter case, KAP can apply subsequent optimizations, such as parallelizing the loop. KAP applies the two-version loop optimizations selectively to avoid dramatically increasing the size of the program. However, the payback of parallelizing a frequently executed loop warrants their use.

For example, the KAP C pointer disambiguation optimization is employed in cases in which C pointers are used as a base address and then incremented in a loop. Neither the base address of a pointer nor how many times the pointer will be incremented is usually known at compile time. At run time, however, they can be computed in terms of a loop index. KAP generates code that checks the range of the pointer references at the tail and at the head of a dependence. If the two ranges do not overlap, the dependence does not exist and the optimized code is executed.
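The shape of such two-version code can be sketched as follows; the overlap test computed from the loop bounds conveys the idea and is not KAP's exact output:

    /* Two-version loop sketch: the ranges touched through p and q over
     * the whole loop are compared at run time.  If they cannot overlap,
     * the version free of the assumed dependence runs (and could be
     * parallelized); otherwise the original serial version runs. */
    void scale_copy(double *p, double *q, int n)
    {
        if (p + n <= q || q + n <= p) {
            for (int i = 0; i < n; i++)   /* disjoint: optimized version */
                p[i] = q[i] * 2.0;
        } else {
            for (int i = 0; i < n; i++)   /* possible overlap: serial version */
                p[i] = q[i] * 2.0;
        }
    }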

KAP Preprocessor Architecture

A controversial control architecture decision in KAP is to organize the preprocessor as a sequence of passes, generally one for each optimization performed. This design decision was controversial because of the following concerns:

• Run-time inefficiency would occur in processing programs because each pass would sweep through the intermediate representation for the program being processed, causing some amount of virtual memory thrashing.

Table 3  KAP Assertions

  Assertion                      Specifiers                      Primary Effect

  [NO] ARGUMENT ALIASING                                         Removes assumed dependence arcs
  [NO] BOUNDS VIOLATIONS                                         Removes assumed dependence arcs
  CONCURRENT CALL                                                Removes assumed dependence arcs
  DO ()                          SERIAL, CONCURRENT              Guides selection of loop order strongly
  DO PREFER ()                   SERIAL, CONCURRENT              Guides selection of loop order loosely
  [NO] EQUIVALENCE HAZARD                                        Removes assumed dependence arcs
    (Fortran only)
  [NO] LAST VALUE NEEDED ()      Variable names for which        Tunes the parallel code and sometimes
                                 [no] last value is needed       removes assumed dependences
  PERMUTATION ()                 Names of permutation            Removes assumed dependence arcs
                                 variables
  NO RECURRENCE ()               Names of recurrence             Removes assumed dependence arcs
                                 variables
  RELATION ()                    Relation on loop index          Removes assumed dependence arcs
                                 known to be true
  NO SYNC                                                        Tunes the parallel code that is produced


• Added software development cost would be incurred because the KAP code that loops through the intermediate representation would be repeated in each pass.

The second concern has been dispelled. The added modularity of KAP, provided by its multipass structure, has saved development time as KAP has grown from a moderately complex piece of code to an extremely complex piece of code.

The KAP preprocessor uses more than 50 major optimizations. The pass structure has helped to organize them. In some cases, such as cache management, one optimization is broken into several passes. KAP performs some basic optimizations, e.g., dead-code elimination, more than once in different ways. In some cases, such as scalar expansion, KAP performs an optimization to uncover other optimizations and then performs the reverse optimization to tighten up the program again.

The run-time efficiency issue is still of interest. There is always some benefit to making the preprocessor smaller and faster.

Selecting Loops to Parallelize

Parallelizing a loop can greatly enhance the performance of the program. Testing whether a loop can be parallelized is actually quite simple, given the data dependence analysis that KAP performs. A loop can be parallelized if there are no dependence arcs carried by that loop. The situation, however, can be more complicated. If the program contains several nested loops, it is important to pick the best loop to parallelize. Additionally, it may be possible not only to parallelize the loop but also to optimize the loop to enhance its performance. Moreover, the loops in a program can be nested in very complex structures, so that there are many different ways to parallelize the same program. In fact, the best option may be to leave all the loops serial because the overhead of parallel execution may outweigh the performance improvement of using multiple processors.

The KAP preprocessor optimizes programs for parallelism by searching for the optimum program in a set of possible configurations, i.e., ways in which the original program can be transformed for parallel execution. (In this regard, KAP optimizes programs from a classical definition of numerical optimization.) There is an objective function for evaluating each configuration. Each member of the set of configurations is called a loop order. The optimum program is the loop order whose objective function has the highest performance score, as discussed later in this section.

Descriptions of loop orders, the role of dependence analysis, and the objective function, i.e., how each program is scored, follow.

Loop Orders

A loop order is a combination of loop transformations that the KAP preprocessor has performed on the program. The loop transformations that KAP performs while searching for the optimal parallel form are

• Loop distribution

• Loop fusion

• Loop interchange

Loop distribution splits a loop into two or more loops. Loop fusion merges two loops; it is used to combine loops to increase the size of the parallel tasks and to reduce loop overhead. Loop interchange occurs between a pair of loops. This transformation takes the inner loop outside the outer loop, reversing their relation. If a loop is triply nested, there are three factorial (3!), i.e., six, different ways to interchange the loops. Each order is arrived at by a sequence of pairwise interchanges.

To increase the opportunities to interchange loops, KAP tries to make a loop nest into one that is perfectly nested. This means that there are no executable statements between nested loop statements. Loop distribution is used to create perfectly nested loops.

KAP examines all possible loop orders for each loop nest. Each loop nest is treated independently because no transformations between loop nests occur at this phase of optimization.

For example, an LU factorization program consists of one loop nest that is three deep and not perfectly nested. Figure 2 shows the loop orders. Loop order (a) is the original LU program. The KAP preprocessor first distributes the outer loop in loop orders (b) and (c). Next, KAP performs a loop interchange on the second loop nest, which is two deep, as shown in loop order (d). Then, KAP interchanges the third loop nest in loop orders (e) through (i). Note that KAP eliminates some loop orders, (i) for example, when the loop-bound expressions cannot be interchanged. As explained above, there are six different loop orders because the nest is triply nested.
The KAP pre­ The KAP preprocessor optimizes programs fo r processor first distributes the outer loop in loop parallelism by searching fo r the optimum program orders (b) and (c). Next, KAP performs a loop inter­ in a set of possible configurations, i.e., ways in change on the second loop nest which is two deep, which the original program can be transformed for as shown in loop order (d ). Then, KAP interchanges parallel execution. (In this regard, KAP optimizes the third loop nest in loop orders (e) through (i). programs from a classical definition of numerical Note that KAP eliminates some loop orders, (i) fo r optimization.) There is an objective fu nction fo r example, when the loop-bound expressions cannot evaluating each configuration. Each member of be interchanged . As explained above, there are six the set of configurations is called a loop order. The different loop orders because the nest is triply

64 Vo l. 6 No. 3 Summer 1')94 Digital Te cbnical journal Th e KA P Pa ra llelizer fo r DEC Fo rtra n and DEC C Programs

(a) ORIGINAL LU (OUTLINED): do i=1,n /*Inve rt Eliminator*/ (b) DISTRIBUTED do i LOOP do i=1,n /*Invert Eliminator*/ do k=i +1 ,n enddo /*Compu te Mu ltipli ers*/ do i=1 ,n do k=i+1 ,n enddo /*Compute Mu ltipliers*/ do j=i+1 ,n enddo do k=i +1 ,n do j=i+1 ,n /*Update Ma trix*/ do k=i +1 ,n /*Update Matrix*/ enddo end do end do enddo enddo enddo

(d) FOR SECOND NEST INTERCHANGE (c) DISTRIBUTE do i LOOP AGAIN: SECOND do i LOOP: do i=1,n do k=1 ,n /*Invert Eliminator*/ do i=1,k-1 do i=1 ,n /*Compute Mu ltipliers*/ do k=i +1 ,n /*Compute Mu ltipliers*/ do i=1 ,n do j=i+1,n do k=i+1 ,n REEXAMINE LOOP ORDERS /*Update Ma tri x*/ (e) THROUGH (i)

(e) FOR THIRD NEST (g) FOR THIRD NEST INTERCHANGE do i AND do j INTERCHANGE do j AND do k: do j=1 ,n do i=1,n do i=1,j-1 do k=i+1,n do k=i +1,n do j=i+1,n /*Upda te Matrix*/ /*Update Matrix*/

l (f) FOR THIRD NEST INTERCHANGE (h) FOR THIRD NEST do i AND do k: INTERCHANGE do i AND do k: Loop Order Rejected -­ do k=1 ,n New bounds spl it Loop . do i=1 ,k-1 do j=1 ,n do j=i+1 ,n do k=2,j /*Update Matrix*/ do i=1,k-1 /*Upda te Matrix*/ do k=j,n do i=1,j-1 /*Upda te Matrix*/ (i) FOR THIRD NEST INTERCHANGE do i AND do j:

Loop Order Rej ected -­ New bounds spl it loop . do k=1 ,n do j=2,k do i=1,k-1 /*Update Matri x*/ do j=k,n do i=1 ,k-1 /*Update Ma trix*/

Figure 2 Loop Ordersjo1' LV Fa ctoriza tion

Digital Te cbnical]ow·naf Vo l. 6 No. 3 Summer 1994 65 Scientific Computing Optimizations fo r Alpha

nested. Since the loop nest in (d) was originally tion or subroutine are examples. (Users can code nested with the triply nested loop at the outermost KAP assertions to flag these statements as paralleliz­ do loop, KAP will reexamine these six loop orders able.) Second, data dependence conditions may after the interchange in (d). preclude parallelization. ln general, a loop that car­ ries a dependence is not parallelizable. (In some Dependence Analysis fo r Loop Orders Before a cases, the user may override the data dependence loop order can be evaluated fo r efficiency, KAP deter­ condition by allowing synchronization between mines the validity of the loop order. A loop order is loop iterations.) Finally, the user may give asser­ valid if the resulting program would produce equ iva­ tions that indicate a preference fo r making a loop lent behavior. KAP tests validity by examining the parallel or for keeping it serial. dependences in the dependence graph according to Barring data dependence conditions that would the transformation being applied. prevent parallelization, the amount of work that will For example, the test for loop interchange validity be performed in parallel determines the score of par­ involves searching for dependence direction vec­ al lelizing a loop. (The user can also specify with a tors of a certain type. The direction vector (<,>) directive that loops should not be parallelized unless indicates that a loop interchange is invalid. The they score greater than a specified value.) In this direction vectors ( <,*), (*,>), or (*,*), if present, also manner, KAP prefers to parallelize outer loops or indicate that the loop interchange may be invalid. loops that are interchanged to the outside because they contain the most work to amortize the over­ Evaluation of a Loop Order After the KAP prepro­ head of creating threads for parallelism. cessor determines that a loop order is valid, it The actual parallelization process is even more scores the loop order fo r performance. KAP consid­ complex than this discussion indicates. KAP applies ers two major factors: (1) the amount of work that a number of optimizations to improve the quality of will be performed in parallel and (2) the memory the parallel code. If there is a reduction operation reference efficiency across a loop, KAP parallelizes the loop. To o much The memory reference efficiency of a loop order loop distribution can decrease program efficiency, can degrade perfo rmance so much that it out­ so loop fusion is run to try to coalesce loops. weighs the performance gained by executing a loop in paral lel. On an SMP, if a processor refer­ Performance Analysis ences one word on a cache line, it should reference How does the KAP preprocessor perform on real all the words contiguously on that line. In Fortran, applications' The answer is as complex as the soft­ a two-dimensional array reference, A(iJ) , should be ware written fo r these applications. Consider the parallelized so that the j loop is parallel and each real-world example, DYNA3D, which demonstrates processor references contiguous columns of mem­ some KAP strengths and weaknesses. ory. If a loop order indicated that the i loop is paral­ DYNA3D is nonlinear structural dynamics code lel, this reference would score low. If a loop order that uses the finite element analysis method. The indicated that the j loop is parallel, it would score code was developed by the Lawrence Livermore high. 
The score fo r the loop order is the sum of National Laboratory Methods Development Group the scores for all the references, and the highest­ and has been used extensively for a broad range scoring loop order is preferred. of structural analysis problems. DYNA3D contains The score for a loop order depends on which about 70,000 lines of Fortran code in more than loops in the order can be parallelized. For a given 700 subroutines. loop nest, there may be several (or no) loops that When KAP is being used on a large program, it can be parallelized. The first step is to determine is sometimes preferable to concentrate on the if any loops can be parallelized. If multiple loops compute-intensive kernels. For example, KAP devel­ can be parallelized, KAP selects the best one. KAP opers ran six of the standard benchmarks fo r chooses at most one loop fo r parallel execution. DYNA3D through a performance profiling tool and KAP tests loops to determine whether they can isolated two groups of three subroutines that be executed in parallel by analyzing both the state­ account for approximately 75 percent of the run ments in the loop and the dependence graph. The time in these cases. This data is shown in Table 4. loop may contain certain statements that block KAP's performance on some of these key subrou­ concurrentization. 1/0 statements or a call to a func- tines appears in Table 5. KAP parallelized all the


Table 4  Performance Profiles of Six DYNA3D Problems

  Problem           Profile (First Two Initials of the Subroutine          Key Call
                    and Percent of Run Time)                               Sequences*

  NIKE2D Example    ST 19%, FO 15%, FE 12%, PR 10%, HG 7%, HR 5%           (a) and (b)
  Cylinder Drop     ST 20%, FO 15%, FE 11%, PR 10%, HG 7%, HR 5%           (a) and (b)
  Bar Impact        WR 17%, ST 7%, FE 6%                                   None of interest
  Impacted Plate    SH 22%, TN 16%, TA 16%, YH 14%, BL 7%                  (c)
  Single Contact    YH 24%, SH 21%, TN 7%, TA 7%, BL 6%                    (c)
  Clamped Beam      EL 12%, SH 12%, TN 8%, TA 8%, BL 6%                    (c)

  * Call Sequences:
  (a) ST is called; ST calls PR; and then FE is called.
  (b) HR is called; HR calls HG; and then FO is called.
  (c) BL calls SH, then TA, and then TN.

Table 5  KAP's Performance on Key Subroutines

  Subroutine   Number of   Number of Loops   Maximum      Number of Loops
               Loops       Parallelized      Nest Depth   after Fusion

  STRAIN       5           5                 1            3
  PRTAL        9           9                 -            1
  FELEN        6           6                 1            1
  FORCE        9           9                 2            2
  HRGMD        5           5                 -            3
  HGX          4           4                 -            -

  (Entries marked "-" were not recoverable from the source.)

KAP's performance on some of these key subroutines appears in Table 5. KAP parallelized all the loops in these subroutines. Since DYNA3D was designed for a CRAY-1 vector processor, it is perhaps to be expected that the KAP preprocessor would perform well. KAP, however, is intended for a shared memory multiprocessor rather than for a vector machine. For this reason, KAP does more than parallelize the loops. The entries in the column labeled "Number of Loops after Fusion" show how KAP reduced loop overhead by fusing as many loops together as it could. KAP fused the five loops in subroutine STRAIN into three loops and fused all nine loops in subroutine PRTAL.

Another example of KAP's optimization for an SMP system is that in the doubly nested loop cases, such as subroutine FORCE (see Figure 3), the KAP preprocessor automatically selects the outer loop for parallel execution. In contrast, a vector machine such as the CRAY-1 prefers the inner loop.

      subroutine FORCE
      ...
c     (outer loop parallelized by KAP)
      do 60 n = 1,nnc
         lcn = lczc + n + nh12 - 1
         i0 = ia(lcn)
         i1 = ia(lcn + 1) - 1
cdir$ ivdep
         do 50 i = i0, i1
            e(1,ix(i)) = e(1,ix1(i)) + ep11(i)
   50    continue
   60 continue

Figure 3  Parallel Loop Selection

Because the kernels of DYNA3D code span multiple subroutines, cross compilation optimization is suggested. There are three ways to do this: inlining, interprocedural analysis, and directives specifying that the inner subroutines can be concurrentized. Using KAP's inlining capability gives KAP the most freedom to optimize the program because in this manner KAP can restructure code across subroutines.

Figure 4 shows part of the call sequence of subroutine SOLDE.


(Subroutine SOLDE contains call sequence (b) of Table 4.) Subroutine SOLDE calls subroutine HRGMD, which calls subroutine HGX. Then subroutine SOLDE calls subroutine FORCE.

      subroutine SOLDE
         call HRGMD        |
         ...               |  whole call
         call FORCE        |  sequence
                           |  inlined
      subroutine HRGMD     |
         call HGX          |

Figure 4  Inlining a Kernel

KAP supports inlining to an arbitrary depth. Inlining in KAP can be automatic or controlled from the command line. In this case, we did not want to enable inlining automatically to depth two of subroutine SOLDE because it contains calls to many other subroutines that are not in the kernel. Here, the user specified the subroutines to inline on the command line. When the user specified inlining, KAP fused all the loops in subroutines HRGMD, HGX, and FORCE to minimize loop overhead, and then it parallelized the fused loop.

In some cases, the user can make simple restructuring changes that improve KAP's optimizations. Figure 5 shows a case in which fusion was blocked by two scalar statements between a pair of loops. The first loop does not assign any values to the variables used to create these scalars, so the user can move the assignments above the loop to enable KAP to fuse them.

Finally, the user can elect to specify the parallelism directly. Figure 6 shows subroutine STRAIN with X3H5 directives used to describe the parallelism.

Before restructuring (fusion blocked):

      subroutine STRAIN
      do 5 i = lft,llt
        ...
      enddo
      dt1d2 = .5 * dt1            <- move these two statements up
      crho = .0625 * rho(lft)
      do 6 i = lft,llt
        ...
      enddo

After restructuring:

      subroutine STRAIN
      dt1d2 = .5 * dt1
      crho = .0625 * rho(lft)
      do 5 i = lft,llt
        ...
      enddo
      do 6 i = lft,llt
        ...
      enddo

Figure 5  Assisted Loop Fusion
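Once the two scalar assignments are hoisted, no statements separate loops 5 and 6, and KAP can fuse them. The fused result would look roughly as follows; this is a sketch only, since the paper does not reproduce KAP's actual output for this case:

      subroutine STRAIN
      dt1d2 = .5 * dt1
      crho = .0625 * rho(lft)
c     former loops 5 and 6, fused into one parallelizable loop
      do 5 i = lft,llt
c       body of former loop 5
        ...
c       body of former loop 6
        ...
      enddo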

      subroutine STRAIN
c*kap* parallel region
c*kap*& shared (dxy,dyx,d1)
c*kap*& local (i,dt1d2)
c*kap* parallel do
      do 5 i = lft,llt
        dyx(i) = ...
    5 continue
c*kap* end parallel do
c*kap* barrier
      dt1d2 = ...
c*kap* parallel do
      do 6 i = lft,llt
        d1 = dt1d2 * (dxy(i) + dyx(i))
    6 continue
c*kap* end parallel do
c*kap* end parallel region

(All statements beginning with c*kap* are explicit X3H5 parallel directives.)

Figure 6  X3H5 Explicit Parallelism
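As the discussion below notes, a user can hand-optimize output like that of Figure 6 and feed it back through KAP; one such edit is removing the barrier between the two loops. The fragment below sketches that edit. The justification in the comment is an assumption made for illustration and would have to be verified for the real code:

c     ... directives and loop 5 exactly as in Figure 6 ...
c*kap* end parallel do
c     The "c*kap* barrier" directive that stood here has been deleted
c     by hand, on the assumption that each thread of loop 6 reads only
c     the dyx(i) elements it wrote itself in loop 5 under the same
c     iteration partitioning, so no cross-thread wait is required.
      dt1d2 = ...
c*kap* parallel do
c     ... loop 6 and the region close exactly as in Figure 6 ...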


In the case shown in Figure 6, the user elected to keep the same unfused loop structure as in the original code. This case is not dramatically less efficient than the fused version because the parallel region causes KAP to fork threads only once.

A very sophisticated example of KAP usage occurs when a user inputs a program to KAP that has already been optimized by KAP. This is an advantage of a preprocessor that does not apply to a compiler, because a preprocessor produces source code as output. In this case, the statements shown in Figure 6 were generated by KAP to illustrate X3H5 parallelism. A user may want to perform some hand optimization on this output, such as removing the barrier statement, and then optimize the modified program with KAP again.

Challenges That Remain

Although the KAP preprocessor is a robust tool that performs well in a production software development environment, several challenges remain. Among them are adding new languages, further enhancing the optimization technology, and improving KAP's everyday usability.

As the popular programming languages evolve, KAP evolves also. KAI will soon extend KAP support for DEC Fortran to Fortran 90 and is developing C++ optimization capabilities.

In optimization technology, KAI's goal is to make an SMP server as easy to use as a single-processor workstation is today. "Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing" contains a leading-edge analysis of parallelization technology.10 The research reported shows that further developing current techniques can improve optimization technology. These techniques frequently involve the grand challenge of compiler optimization: whole-program analysis.

In a much more pragmatic direction, the KAP preprocessor should be integrated with Digital's compiler technology at the intermediate representation level. Such integration would increase processing efficiency because the compiler would not have to reparse the source code. In addition, integration would increase the coordination between KAP and the compiler to improve performance for the end user.

Increasing the usability of the KAP preprocessor, however, benefits the end user directly. KAP engineers frequently talk to beta users and encourage feedback. The following are examples of user comments:

• Optimizing programs is difficult when no subroutine in the program takes more than a few percent of the run time. As its usability in this area improves, KAP will become a substantial productivity aid. If a program is generally slow, optimizing repeated usage patterns will allow the programmer to use a comfortable programming style and still expect peak system performance.

• Increasing feedback to the user would improve KAP's usability. When KAP cannot perform an optimization, often the user can help in several ways (e.g., by providing more information at compile time, by changing the options or directives, or by making small changes to the source code). KAP does not always make it clear to the user what needs to be done. Providing such feedback would improve KAP's usability.

• Integration with other performance tools would be useful. Alpha systems have a good set of performance monitoring tools that can provide clues about what to optimize in a program and how. The next release of the KAP preprocessor will provide some simple tools that a user can employ to integrate KAP with tools like prof and to track down performance differences.

On a final note, the fact that KAP does not speed up a program should not always be cause for disappointment. Some programs already run as fast as possible without the benefit of a KAP preprocessor.

Acknowledgments

We wish to acknowledge the Lawrence Livermore National Laboratory Methods Development Group and other users for providing applications that give us insight into how to improve the KAP preprocessor.
We would like to thank those at Digital who have been instrumental in helping us deliver KAP on the DEC OSF/1 platform, especially Karen DeGregory, John Shakshober, Dwight Manley, and Dave Velten. Everyone at Kuck & Associates participated in the making of this product, but of special note are Mark Byler, Debbie Carr, Ken Crawford, Steve Healey, David Nelson, and Sree Simhadri.

References

1. D. Blickstein et al., "The GEM Optimizing Compiler System," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 121-136.


2. M. Wolfe, Optimizing Supercompilers for Supercomputers (Cambridge, MA: MIT Press, 1989).

3. Parallel Processing Model for High Level Programming Languages, ANSI X3H5 Document Number X3H5/94-SD2, 1994.

4. P. Tu and D. Padua, "Automatic Array Privatization," Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, vol. 768 of Lecture Notes in Computer Science (New York: Springer-Verlag, 1993): 500-521.

5. B. Maskas et al., "Design and Performance of the DEC 4000 AXP Departmental Server Computing System," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 82-99.

6. R. Allen and K. Kennedy, "Automatic Translation of FORTRAN Programs to Vector Form," ACM Transactions on Programming Languages and Systems, vol. 9, no. 4 (October 1987): 491-542.

7. U. Banerjee, Dependence Analysis for Supercomputing (Norwell, MA: Kluwer Academic Publishers, 1988).

8. KAP for DEC Fortran for DEC OSF/1 AXP User Guide (Maynard, MA: Digital Equipment Corporation, 1994).

9. KAP for C for DEC OSF/1 AXP User Guide (Maynard, MA: Digital Equipment Corporation, 1994).

10. W. Blume et al., "Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing," CSRD Report No. 1348 (Urbana, IL: Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, 1994).

Further Readings

The Digital Technical Journal publishes papers that explore the technological foundations of Digital's major products. Each Journal focuses on at least one product area and presents a compilation of refereed papers written by the engineers who developed the products. The content for the Journal is selected by the Journal Advisory Board. Digital engineers who would like to contribute a paper to the Journal should contact the editor at RDVAX::BLAKE.

Topics covered in previous issues of the Digital Technical Journal are as follows:

Alpha AXP Partners-Cray, Raytheon, Kubota/DECchip 21071/21072 PCI Chip Sets/DLT2000 Tape Drive
Vol. 6, No. 2, Spring 1994, EY-F947E-TJ

High-performance Networking/OpenVMS AXP System Software/Alpha AXP PC Hardware
Vol. 6, No. 1, Winter 1994, EY-Q011E-TJ

Software Process and Quality
Vol. 5, No. 4, Fall 1993, EY-P920E-DP

Product Internationalization
Vol. 5, No. 3, Summer 1993, EY-P986E-DP

Multimedia/Application Control
Vol. 5, No. 2, Spring 1993, EY-P963E-DP

DECnet Open Networking
Vol. 5, No. 1, Winter 1993, EY-M770E-DP

Alpha AXP Architecture and Systems
Vol. 4, No. 4, Special Issue 1992, EY-J886E-DP

NVAX-microprocessor VAX Systems
Vol. 4, No. 3, Summer 1992, EY-J884E-DP

Semiconductor Technologies
Vol. 4, No. 2, Spring 1992, EY-L521E-DP

PATHWORKS: PC Integration Software
Vol. 4, No. 1, Winter 1992, EY-J825E-DP

Image Processing, Video Terminals, and Printer Technologies
Vol. 3, No. 4, Fall 1991, EY-H889E-DP

Availability in VAXcluster Systems/Network Performance and Adapters
Vol. 3, No. 3, Summer 1991, EY-H890E-DP

Fiber Distributed Data Interface
Vol. 3, No. 2, Spring 1991, EY-H876E-DP

Transaction Processing, Databases, and Fault-tolerant Systems
Vol. 3, No. 1, Winter 1991, EY-F588E-DP

VAX 9000 Series
Vol. 2, No. 4, Fall 1990, EY-E762E-DP

DECwindows Program
Vol. 2, No. 3, Summer 1990, EY-E756E-DP

VAX 6000 Model 400 System
Vol. 2, No. 2, Spring 1990, EY-C197E-DP

Compound Document Architecture
Vol. 2, No. 1, Winter 1990, EY-C196E-DP

Distributed Systems
Vol. 1, No. 9, June 1989, EY-C179E-DP

Storage Technology
Vol. 1, No. 8, February 1989, EY-C166E-DP

CVAX-based Systems
Vol. 1, No. 7, August 1988, EY-6742E-DP

Software Productivity Tools
Vol. 1, No. 6, February 1988, EY-8259E-DP

VAXcluster Systems
Vol. 1, No. 5, September 1987, EY-8258E-DP

VAX 8800 Family
Vol. 1, No. 4, February 1987, EY-6711E-DP

Networking Products
Vol. 1, No. 3, September 1986, EY-6715E-DP

MicroVAX II System
Vol. 1, No. 2, March 1986, EY-3474E-DP

VAX 8600 Processor
Vol. 1, No. 1, August 1985, EY-3435E-DP


Subscriptions and Back Issues

Subscriptions to the Digital Technical Journal are available on a prepaid basis. The subscription rate is $40.00 (non-U.S. $60.00) for four issues and $75.00 (non-U.S. $115.00) for eight issues. Orders should be sent to Cathy Phillips, Digital Equipment Corporation, 30 Porter Road LJ02/D10, Littleton, Massachusetts 01460, U.S.A., Telephone: (508) 486-2538, FAX: (508) 486-2444. Inquiries can be sent electronically to [email protected]. Subscriptions must be paid in U.S. dollars, and checks should be made payable to Digital Equipment Corporation.

Single copies and past issues of the Digital Technical Journal are available for $16.00 each by calling DECdirect at 1-800-DIGITAL (1-800-344-4825). Recent back issues of the Journal are available on the Internet at http://www.digital.com/info/DTJ/home.html. Complete Digital Internet listings can be obtained by sending an electronic mail message to [email protected].

Digital Research Laboratory Reports

Reports published by Digital's research laboratories can be accessed on the Internet through the World Wide Web or FTP. For access information on the electronic or hard-copy versions of the reports, see http://gatekeeper.dec.com/hypertext/info/era.reports.html.

Digital Product Information

Readers of the Journal can keep up-to-date on Digital's products and services by subscribing to the Digital Reference Service. To receive current information on all Digital's products and services on a regular basis, contact the Digital Reference Service, P.O. Box 6464, Holliston, MA 01746. Within the United States, call (800) 494-4377 or (508) 429-5515, extension 765. Outside the United States, call (508) 429-3015 or send a facsimile to (508) 429-6921.

Technical Books and Papers by Digital Authors

Applications of Petri Nets in Manufacturing Systems: Modeling, Control, and Performance Analysis, Alan A. Desrochers and Robert Y. Al-Jaar, IEEE Press, New York, 1994 (ISBN 0-87942-295-5).

This practical, highly specialized book presents theory and examples that clearly show how to use the Petri net approach to model, control, and then analyze the performance of these complex systems. This book also brings together newly documented applications of Petri nets in Japan and Europe and makes them available to practitioners worldwide.

From this book, the reader will learn how to model complex manufacturing systems using Petri nets; analyze the performance of the overall manufacturing system in terms of production rates, machine utilization, average in-process inventory, and other measures; generate control software from the Petri net model of an automated manufacturing system; and synthesize Petri net models for large automated manufacturing systems.

Applications of Petri Nets in Manufacturing Systems: Modeling, Control, and Performance Analysis will be of particular interest to researchers in manufacturing systems engineering and individuals involved in production planning and control, plant layout and design, and scheduling of manufacturing operations.

R. Abugov and K. Zinke, "Prioritization of Defect Reduction Activity by Yield Impact," 1994 SEMICON Ultraclean Manufacturing Conference (October 1994).

M. Ackerman and R. Buckland, "Multiple Matrices for a Marketing QFD," Fifth Symposium on Quality Function Deployment (June 1994).

M. Ackerman and R. Buckland, "Successful QFD Application at Digital: Unique Approaches and Applications of QFD to Address Business Needs," Fifth Symposium on Quality Function Deployment (June 1994).

H. Ali, J. Steele, J. Bosco, and G. Bartlett, "Electromechanical Study of 'No-Clean' Flux Corrosivity," Proceedings of the Eighth Electronic Materials and Processing Congress (August 1993).

R. Allmon, "Design of Portable Systems," Proceedings of the IEEE Custom Integrated Circuits Conference (May 1994).

P. Anick, "Adapting a Full-text Information Retrieval System to the Computer Troubleshooting Domain," Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (July 1994).


N. Arora and B. Doyle, "Modeling the I-V Characteristics of Fully Depleted SOI MOSFETs Including Self Heating," IEEE International Silicon-on-Insulator Conference Proceedings (October 1994).

N. Arora, B. Doyle, and D. Krakauer, "SPICE Model and Parameters for Fully Depleted SOI MOSFETs, Including Self-heating," IEEE Electron Device Letters (October 1994).

N. Arora, R. Rios, and C. Huang, "Impact of Polysilicon Depletion Effect on Circuit Performance for 0.35μ CMOS Technology," Proceedings of the Twenty-fourth European Solid State Device Research Conference (September 1994).

D. Bhavsar and J. Edmondson, "Testability Strategy of the Alpha AXP 21164 Microprocessor," Proceedings of the IEEE International Test Conference (October 1994).

S. Bilotta and D. Proctor, "Development of a Manufacturable Low Pressure ROX NOX Oxidation Process," Advanced Semiconductor Manufacturing Conference and Workshop Proceedings (November 1994).

C. Brench and B. Archambeault, "Proposed Standard EMI Modeling Problems," IEEE International Symposium on Electromagnetic Compatibility (August 1994).

C. Brench, "Heatsink Radiation as a Function of Geometry," IEEE International Symposium on Electromagnetic Compatibility (August 1994).

C. Brench, "Shield Degradation in the Presence of External Conductors," IEEE International Symposium on Electromagnetic Compatibility (August 1994).

J. Clement and A. Enver, "Modeling Electromigration-induced Stress Buildup Due to Nonuniform Temperature," Materials Reliability in Microelectronics IV Symposium Proceedings (April 1994).

W. Cronin, J. Hutchison, K. Ramakrishnan, and H. Yang, "A Comparison of High-speed LANs," Proceedings of the IEEE Nineteenth Conference on Local Computer Networks (October 1994).

W. Dubie, "Networds: The Impact of Electronic Text-Processing Utilities on Writing," Journal of Social and Evolutionary Systems (November 1994).

J. Edmondson and P. Rubinfeld, "An Overview of the 21164 Alpha AXP Microprocessor," Hot Chips VI Symposium (August 1994).

T. Fox, "The Design of High-Performance Microprocessors at Digital," Thirty-first Design Automation Conference Proceedings (June 1994).

J. Grodstein, E. Lehman, H. Harkness, and W. Grundmann, "Optimal Latch Mapping and Retiming within a Tree," IEEE/ACM International Conference on Computer-aided Design (November 1994).

C. Gross, "Method for Selecting Semiconductor Equipment Using Empowered Teams," Advanced Semiconductor Manufacturing Conference and Workshop Proceedings (November 1994).

T. Guay, "CASE-based Reasoning for Knowledge Acquisition Suggestions," International Journal of Artificial Intelligence Tools (July 1994).

S. Jong, "Exploring Paths toward Quality Information," Forty-first Annual Society for Technical Communication (May 1994).

C. Juszczak and D. Lebel, "NFS Version 3: Design and Implementation," Summer 1994 USENIX Technical Conference (June 1994).

D. Krakauer and K. Mistry, "Circuit Interactions During Electrostatic Discharge," IEEE Electrical Over Stress/Electrostatic Discharge Symposium Proceedings (September 1994).

J. Lloyd, "Electromigration Failure of Narrow Al Alloy Conductors Containing Stress Voids," Materials Reliability in Microelectronics IV Symposium Proceedings (April 1994).

J. Lloyd, "Electromigration Failure in Thin Film Conductors," Materials Reliability in Microelectronics IV Symposium Proceedings (April 1994).

P. Martino and G. Freedman, "Predicting Solder Joint Shape by Computer Modeling," Proceedings of the Forty-fourth Electronic Components and Technology Conference (May 1994).

T. Moore, "A Test Process Optimization and Cost Modeling Tool," Proceedings of the IEEE International Test Conference (October 1994).


D. Morin, T. Comard, M. Joshi, and K. Sprague, "Calculating Error of Measurement on High-speed Microprocessor Test," Proceedings of the IEEE International Test Conference (October 1994).

G. Papadeas and D. Gauthier, "An On-line Data Collection and Analysis System for VLSI Devices at Wafer Probe and Final Test," Proceedings of the IEEE International Test Conference (October 1994).

M. Piasecki, K. Orvek, R. Jones, and S. Dass, "Deep UV Technology for 0.35μm Lithography," 1994 IEEE Lithography Workshop (September 1994).

K. Ramakrishnan and P. Biswas, "Performance Benefits of Nonvolatile Caches in Distributed File Systems," Concurrency: Practice and Experience (July 1994).

K. Ramakrishnan and H. Yang, "The Ethernet Capture Effect: Analysis and Solution," Proceedings of the IEEE Nineteenth Conference on Local Computer Networks (October 1994).

K. Ramakrishnan and H. Yang, "FIFO Design for a High-speed Network Interface," Proceedings of the IEEE Nineteenth Conference on Local Computer Networks (October 1994).

R. Razdan and K. Brace, "PRISC Software Acceleration Techniques," IEEE International Conference on Computer Design: VLSI in Computers and Processors (October 1994).

A. Sathaye, "Application of Supervisor Synthesis for Controlled Time Petri Nets to Real-time Database Systems," 1994 American Control Conference (June 1994).

S. Sathaye, "Conventional and Early Token Release Scheduling Models for the IEEE 802.5 Token Ring," Journal of Real-Time Systems (May 1994).

S. Sathaye, "A Real-time Scheduling Framework for Packet-switched Networks," Fourteenth International Conference on Distributed Computing Systems (June 1994).

C. Schiebl, "Application of EDX Spectroscopy to Accurate Nondestructive Measurement of CoSi Film Thicknesses during Semiconductor Processing," Twenty-eighth Annual Microbeam Analysis Society Meeting (August 1994).

C. Schiebl, "Continuous Fluorescence Correction Factor for Layered Specimen," Twenty-eighth Annual Microbeam Analysis Society Meeting (August 1994).

C. Schiebl, "Secondary Depth Distribution Generated by Characteristic Fluorescence in Multilayer Samples for Use in Quantitative EPMA," Twenty-eighth Annual Microbeam Analysis Society Meeting (August 1994).

J. Seyyedi, "Soldered Joint Reliability for Interstitial Pin Grid Array Packages," Journal of Surface Mount and Related Technologies Group (October 1994).

H. Soleimani, "An Investigation of Phosphorous Transient Diffusion in Silicon below the Solid Solubility Limit and at a Low Implant Energy," Journal of the Electrochemical Society (August 1994).

K. Steeples and D. Chang Kau, "Multiply Charged, Channeled, Ion Implantation," Tenth International Conference on Ion Implantation Technology (June 1994).

K. Steeples, D. Chang Kau, M. Andreoli, and K. Mistry, "Rapid Implementation of a LATID Process," Tenth International Conference on Ion Implantation Technology (June 1994).

N. Sullivan and S. Arsenault, "SEM/EDS Analysis Method for Bare Silicon Particle Monitor Wafers," Advanced Semiconductor Manufacturing Conference and Workshop Proceedings (November 1994).

N. Sullivan and R. Newcomb, "Critical Dimension Measurement in the SEM: Comparison of Backscattered vs. Secondary Electron Detection," Proceedings of the International Society of Photo-Optical Instrumentation Engineers (SPIE): Integrated Circuit Metrology, Inspection, and Process Control VIII (February 1994).

B. Thomas, "OpenVMS I/O Concepts: Kernel Processes," Digital Systems Journal (July 1994).

B. Thomas, "OpenVMS I/O Concepts: Software," Digital Systems Journal (July 1994).

B. Thomas and K. Morse, "OpenVMS AXP I/O Concepts," Digital Systems Journal (June 1994).

A. Torabi, M. Mallary, and S. Marshall, "The Effect of Rise Time and Field Gradient on Nonlinear Bit Shift in Thin Film Heads," The Sixth Joint MMM-Intermag Conference (June 1994).

A. Torabi, M. Mallary, S. Marshall, S. Batra, and S. Ramaswamy, "Performance Evaluation of Different Pole Geometries in Thin Film Heads," The Sixth Joint MMM-Intermag Conference (June 1994).

M. Tsuk, "FASTHENRY: A Multipole-accelerated 3-D Inductance Extraction Program," IEEE Transactions on Microwave Theory and Techniques (September 1994).

M. Tsuk and R. Evans, "Modeling and Measurement of the Power Distribution System of a High-performance Computer System," IEEE Topical Meeting on Electrical Performance of Electronic Packaging (October 1993).

R. Ulichney, "Halftone Characterization in the Frequency Domain," The Society for Imaging Science and Technology's (IS&T's) Forty-seventh Annual Conference (May 1994).

R. Ulichney, "The Void-and-cluster Method for Dither Array Generation," Proceedings of the International Society of Photo-Optical Instrumentation Engineers (SPIE) (September 1993).

M. Utt, "A System for Discovering Relationships by Feature," Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (July 1994).

J. Vicente, "Network Capacity Planning," CMG '93 Conference (December 1993).

A. Villani, "Cohesive Mechanical Behavior of Adhesive Materials," Proceedings of the 1993 ASME International Electronics Packaging Conference: Advances in Electronic Packaging 1993 (October 1993).

J. Yang, "Reliability Performance of an R3000-Based MCM for Desktop Workstations," International Electronics Packaging Conference (September 1993).

W. Zahavi, "Modeling the Performance Budget," Computer Measurement Group Proceedings (CMG '93) (September 1993).

Recent Digital U.S. Patents

The following patents were recently issued to Digital Equipment Corporation. Titles and names supplied to us by the U.S. Patent and Trademark Office are reproduced exactly as they appear on the original published patent.

4,592,072  R.E. Stewart: Decoder for Self-Clocking Serial Data Communications

5,210,834  W. Beach and J. Zurawski: High Speed Transfer of Instructions from a Master to a Slave Process

5,210,874  P. Karger: Cross-Domain Call System in a Capability Based Digital Data Processing System

5,212,650  D. Hooper and S. Kundu: Procedure and Data Structure for Synthesis and Transformation of Logic Circuit Designs

5,212,783  S. Sherman: System Which Directionally Sums Signals for Identifying and Resolving Timing Inconsistencies

5,214,770  R. Ramanujan, P. Bannon, and S. Steely: System for Flushing Instruction-Cache Only When Instruction-Cache Address and Data-Cache Address Are Matched and the Execution of a Return-from-Execution-or-Interrupt Command

5,216,413  L. Seiler, J. Pappas, and R. Rose: Apparatus and Method for Specifying Windows with Priority Ordered Rectangles in a Computer Video Graphics System

5,218,684  D. Hayes and V. Triolo: Memory Configuration System

5,218,712  D. Bhandarkar, W. Cardoza, D. Cutler, D. Orbits, and R. Witek: Providing a Data Processor with a User-mode Accessible Mode of Operations in Which the Processor Performs Processing Operations without Interruption

5,220,468  M. Sidman: Disk Drive with Constant Bandwidth Automatic Gain Control

5,220,674  W. Morgan, D. Cobb, G. Bell, and A. Carlson: Local Area Print Server for Requesting and Storing Required Resource Data and Forwarding Printer Status Message to Selected Destination

5,221,422  S. Das and J. Khan: Lithographic Technique Using Laser Scanning for Fabrication of Electronic Components and the Like

5,222,029  D. Hooper and S. Kundu: Bitwise Implementation Mechanism for a Circuit Design Synthesis Procedure

5,222,223  R. Hetherington, D. Webb, T. Fossum, J. Murray, and D. Manley: Method and Apparatus for Ordering and Queueing Multiple Memory Requests

5,222,224  S. Arnold, S. Delahunt, M. Flynn, T. Fossum, R. Hetherington, and D. Webb: Scheme for Insuring Data Consistency between a Plurality of Cache Memories and the Main Memory in a Multiprocessor System

5,226,170  P. Rubinfeld: Interface between Processor and Special Instruction Processor in Digital Data Processing System

5,228,129  S. Bryant and M. Harwood: Synchronous Communication Interface for Reducing the Effect of Data Processor Latency

5,230,067  B. Buch: Bus Control Circuit for Latching and Maintaining Data Independently of Timing Event on the Bus Until New Data Is Driven Onto

5,230,071  B. Newman: Method of Controlling the Variable Baud Rate of Peripheral Devices

5,230,072  D. Smith and K. O'Rourke: System for Managing Hierarchical Information in a Digital Data Processing System

5,230,079  R. Grondalski: Massively Parallel Array Processing System with Processors Selectively Accessing Memory Module Locations Using Address in Microword or in Address Register

5,235,693  J. Lynch, K. Chinnaswamy, P. Goodwin, J. Tessari, and M. Gagliardo: Method and Apparatus for Reducing Buffer Storage in a Read-Modify-Write Operation

5,237,574  L. Weng: Error-resilient Information Encoding

5,239,635  R.E. Stewart, T.E. Leonard, and S.T. Lee: Virtual Address to Physical Address Translation Using Page Tables in Virtual Memory

5,247,398  M. Sidman: Automatic Correction of Position Demodulator Offsets

5,249,187  W. Bruckert and T. Bissett: Dual Rail Processors with Error Checking on I/O Reads

5,249,293  B. Schreiber, C. Cockcroft, M. Ozur, R. Bismouth, and D. Doherty: Computer Network Providing Transparent Operation on a Compute Server and Associated Method

5,251,322  P. Doyle, J. Ellenberger, E. Jones, D. Carver, S. DiPirro, B. Gerovac, W. Armstrong, E. Gibson, R. Shapiro, K. Rushforth, and W.C. Roach: Method of Operating a Computer Graphics System Including Asynchronously Traversing Its Nodes

5,255,367  W. Bruckert, T. Bissett, D. Mazur, and J. Munzer: Fault Tolerant, Synchronized Twin Computer System with Error Checking of I/O Communication

5,261,113  N.P. Jouppi: Apparatus and Method for a Single Operand Register Array for Vector and Scalar Data Processing Operations

5,266,409  P.H. Schmidt and J.C. Angus: Hydrogenated Carbon Compositions

5,267,175  D. Hooper: Database Access Mechanism for Rules Utilized by a Synthesis Procedure for Logic Circuit Design

5,276,892  A.S. Olesin and R.M. Supnik: Destination Control Logic for Arithmetic and Logic Unit for Digital Data Processor

5,278,840  D. Bhandarkar, W. Cardoza, D. Cutler, D. Orbits, and R. Witek: Apparatus and Method for Data Induced Condition Signalling

5,280,617  R.F. Brender and B.R. Brett: Automatic Program Code Generation in a Compiler System for an Instantiation of a Generic Program Structure and Based on Formal Parameters and Characteristics of Actual Parameters

5,287,463  R.C. Frame and F.A. Zayas: Method and Apparatus for Transferring Information over a Common Parallel Bus Using a Fixed Sequence of Bus Phase Transitions

5,291,581  D. Bhandarkar, W. Cardoza, D. Cutler, D. Orbits, and R. Witek: Apparatus and Method for Synchronization of Access to Main Memory Signal Groups in a Multiprocessor Data Processing System

5,297,283  D. Cutler, J. Kelly, and F. Perazzoli: Object Transferring System and Method in an Object Based Computer Operating System

5,301,329  R.C. Frame and F.A. Zayas: Double Unequal Bus Timeout

5,303,380  B. Foster, G. Brown, J. Piazza, J. Tenny, B. Nelson, W. Van Roggen, and P. Anagnostopoulos: System for Processing Data to Facilitate the Creation of Executable Images

5,305,462  R. Grondalski: Mechanism for Broadcasting Data in a Massively Parallel Array Processing System

5,313,467  G. Varghese, M. Fine, A. Smith, and R. Szmauz: Integrated Communication Link Having Dynamically Allocatable Bandwidth and Protocol for Transmission of Allocation Information over the Link

5,317,717  D. Bhandarkar, W. Cardoza, D. Cutler, D. Orbits, and R. Witek: Apparatus and Method for Main Memory Unit Protection Using Access and Fault Logic Signals

Call for Authors from Digital Press

Digital Press has become an imprint of Butterworth-Heinemann, a major international publisher of professional books and a member of the Reed Elsevier group. Digital Press remains the authorized publisher for Digital Equipment Corporation: the two companies are working in partnership to identify and publish new books under the Digital Press imprint and create opportunities for authors to publish their work.

Digital Press remains committed to publishing high-quality books on a wide variety of subjects. We would like to hear from you if you are writing or thinking about writing a book.

Contact: Frank Satlow
Publisher
Digital Press
313 Washington Street
Newton, MA 02158
Tel: (617) 928-2649
Fax: (617) 928-2640
[email protected]
