

Digital Technical Journal

Editorial
Jane C. Blake, Managing Editor
Helen L. Patterson, Editor
Kathleen M. Stetson, Editor

Circulation
Catherine M. Phillips, Administrator
Dorothea B. Cassady, Secretary

Production
Terri Autieri, Production Editor
Anne S. Katzeff, Typographer
Joanne Murphy, Typographer
Peter R. Woodbury, Illustrator

Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Donald Z. Harbert
William R. Hawe
Richard J. Hollingworth
Richard F. Lary
Alan G. Nemeth
Jean A. Proulx
Robert M. Supnik
Gayn B. Winters

The Digital Technical Journal is a refereed journal published quarterly by Digital Equipment Corporation, 30 Porter Road LJ02/D10, Littleton, Massachusetts 01460. Subscriptions to the Journal are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Technical Journal at the published-by address. Inquiries can also be sent electronically to [email protected]. Single copies and back issues are available for $16.00 each by calling DECdirect at 1-800-DIGITAL (1-800-344-4825). Recent back issues of the Journal are also available on the Internet at http://www.digital.com/info/DTJ/home.html. Complete Digital Internet listings can be obtained by sending an electronic mail message to [email protected].

Digital employees may order subscriptions through Readers Choice by entering VTX PROFILE at the system prompt.

Comments on the content of any paper are welcomed and may be sent to the managing editor at the published-by or network address.

Copyright © 1995 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.

The information in the Journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation or by the companies herein represented. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the Journal.

ISSN 0898-901X
Documentation Number EY-T135E-TJ

Book production was done by Quantic Communications, Inc.

The following are trademarks of Digital Equipment Corporation: ACMS, ACMS Desktop, ACMSxp, AlphaServer, AlphaStation, DEC, DEC OSF/1, DECchip, DECdtm, DECnet, Digital, the DIGITAL logo, ObjectBroker, OpenVMS, PATHWORKS, ULTRIX, VAX, and VMS.

ADABAS is a registered trademark of Software AG of North America, Inc. AIX, IBM, OS/2, and RISC System/6000 are registered trademarks and AS/400, CICS, and DB2 are trademarks of International Business Machines Corporation. AppleTalk and Macintosh are registered trademarks of Apple Computer, Inc. AT&T is a registered trademark of American Telephone and Telegraph Company. BT is a registered trademark of British Telecommunications plc. Challenge is a trademark of Silicon Graphics, Inc. Cyrix is a trademark of Cyrix Corporation. dBASE is a trademark and Paradox is a registered trademark of Borland International, Inc. EDA/SQL is a trademark of Information Builders, Inc. Encina is a registered trademark of Transarc Corporation. Excel and Microsoft are registered trademarks and Windows and Windows NT are trademarks of Microsoft Corporation. Hewlett-Packard and HP-UX are registered trademarks of Hewlett-Packard Company. INFORMIX is a registered trademark of Informix Software, Inc. INGRES is a registered trademark of Ingres Corporation. Intel486 and Intel are trademarks of Intel Corporation. Kerberos is a trademark of Massachusetts Institute of Technology. MIPS R4000 is a trademark of MIPS Computer Systems, Inc. Motif, OSF, and OSF/1 are registered trademarks of the Open Software Foundation, Inc. Novell is a registered trademark of Novell, Inc. ORACLE is a registered trademark of Oracle Corporation. POSIX is a registered trademark of the Institute of Electrical and Electronics Engineers, Inc. SCO is a registered trademark of The Santa Cruz Operation, Inc. SequeLink is a registered trademark of TechGnosis, Inc. Solaris and Sun are registered trademarks and SunOS is a trademark of Sun Microsystems, Inc. SPARC is a registered trademark of SPARC International, Inc. SPEC, SPECfp, SPECint, SPECfp92, SPECint92, SPECmark, SPECrate_fp92, and SPECrate_int92 are trademarks of the Standard Performance Evaluation Council. SPICE is a trademark of the University of California at Berkeley. SPX/IPX is a trademark of Novell, Inc. SYBASE is a registered trademark of Sybase, Inc. Tuxedo is a registered trademark of Unix Systems Laboratories, Inc. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Ltd. X/Open is a trademark of X/Open Company Ltd.

Cover Design
Key to the remarkable performance of the AlphaServer 8400/8200 systems is Digital's new 300-MHz, 64-bit Alpha 21164 microprocessor. Both subjects are featured in this tenth anniversary issue of the Journal and are represented on the cover in the forms of a photograph of the chip and an illustration of the AlphaServer system topology. The cover was designed by Mario Furtado of Furtado Communication Design.

Contents

Foreword
Richard L. Sites

DATABASE INTEGRATION

DB Integrator: Open Middleware for Data Access
Richard Pledereder, Vishu Krishnamurthy, Michael Gagnon, and Mayank Vadodaria

ACMSxp Open Distributed Transaction Processing
Robert K. Baafi, J. Ian Carrie, William B. Drury, and Oren L. Wiesler

An Open, Distributable, Three-tier Client-Server Architecture with Transaction Semantics
Norman G. Depledge, William A. Turner, and Alexandra Woog

ALPHA SERVERS & WORKSTATIONS

The AlphaServer 8000 Series: High-end Server Platform Development
David M. Fenwick, Denis J. Foley, William K. Gist, Stephen R. VanDoren, and Daniel Wissell

Digital's High-performance CMOS ASIC
Jean H. Basmaji, Kay R. Fisher, Frank W. Gatulis, Herbert R. Kolk, and James F. Rosencrans

The Second-generation Processor Module for AlphaServer 2100 Systems
Nitin D. Godiwala and Barry A. Maskas

The Design and Verification of the AlphaStation 600 5-series Workstation
John H. Zurawski, John E. Murray, and Paul J. Lemmon

ALPHA 21164 CPU

Circuit Implementation of a 300-MHz 64-bit Second-generation CMOS Alpha CPU
William J. Bowhill, Shane L. Bell, Bradley J. Benschneider, Andrew J. Black, Sharon M. Britton, Ruben W. Castelino, Dale R. Donchin, John H. Edmondson, Harry K. Fair III, Paul E. Gronowski, Anil K. Jain, Patricia L. Kroesen, Marc E. Lamere, Bruce J. Loughlin, Shekhar Mehra, et al.

Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor
John H. Edmondson, Paul I. Rubinfeld, Peter J. Bannon, Bradley J. Benschneider, Debra Bernstein, Ruben W. Castelino, Elizabeth M. Cooper, Daniel E. Dever, Dale R. Donchin, Timothy C. Fischer, Anil K. Jain, Shekhar Mehra, Jeanne E. Meyer, Ronald P. Preston, Vidya Rajagopalan, Chandrasekhara Somanathan, Scott A. Taylor, and Gilbert M. Wolrich

Functional Verification of a Multiple-issue, Pipelined, Superscalar Alpha Processor: the Alpha 21164 CPU Chip
Michael Kantrowitz and Lisa M. Noack

Digital Technical Journal Vol. 7 No. 1 1995

In Memoriam

This tenth anniversary issue of the Digital Technical Journal is dedicated to the memory of Peter F. Conklin, Corporate Consulting Engineer, who passed away in April 1995. Peter worked at Digital for 26 years. He was one of the pioneers of the DECsystem-10 software group, contributing its first batch and virtual memory subsystems. He was a key contributor in the design of the VAX architecture and its language and run-time environments. He worked in the PDP-11 and Terminals and Printers groups, helping to develop their technical and business strategies in rapidly changing environments. And he was a prime mover of the Alpha program, overseeing first the software development and then all of the engineering, as well as creating its unique cultural and managerial approach to cross-organizational development.

Beyond his outstanding technical and business contributions, Peter espoused and represented all that was good in Digital's culture: respect for and focus on people; doing the right thing for customers and colleagues; driving for concrete results and workable processes. He championed the role of women and minorities as managers and contributors. He helped set up the internal Notes system to further open communications among employees, and between employees and management. And he advocated and modeled the uses of training and development to improve personal and organizational performance at all levels.

Peter's work is woven through the fabric of Digital's history like a bright, unbreakable thread. He will be sorely missed.

Editor's Introduction

The Digital Technical Journal marks its tenth anniversary with the publication of this issue. Since 1985, the Journal has chronicled Digital's engineering achievements from silicon to software: record-breaking microprocessors, standards-setting network technologies, advanced storage architectures, and industry-leading implementations of clusters and distributed systems. More than simply a record, the Journal offers readers insights into the how and why of Digital's product designs, in papers written by the design engineers themselves. A look back over the last ten years, however, provides only a partial view to engineering's unique combination of vision and pragmatism, a combination that has spurred industry breakthroughs and established the foundation for the development of today's world-class hardware and software products. To celebrate Digital's outstanding engineering achievements, we have therefore included a special section of historic milestones as part of this anniversary issue. The milestones begin in 1957 with the development of the company's first product, a system module for scientific use that ran at 5 MHz. The milestones continue through the recent introduction of Digital's new high-performance server systems.

In addition to discussions of three Alpha hardware designs and the new microprocessor, this issue presents papers on database software technologies. These papers focus on the realities of integrating heterogeneous systems and data sources.

The Database Integrator (DBI) directly addresses the heterogeneity issue by providing a multidatabase management system for data access and integration of distributed data sources. Richard Pledereder, Vishu Krishnamurthy, Mike Gagnon, and Mayank Vadodaria outline the data access issues and compare the DBI approach with others. Their discussion addresses such topics as heterogeneous query optimization, location transparency, global consistency, resolution of semantic differences, and security checks.

Key to solving the problems posed by heterogeneous systems are openness and standards. Both are stressed in the ACMSxp transaction processing monitor design, described by Bob Baafi, Ian Carrie, Bill Drury, and Oren Wiesler. ACMSxp is layered on the OSF's Distributed Computing Environment and uses Transarc's Encina toolkit to support XA-compliant databases. The monitor's application development environment is based on the Structured Transaction Definition Language. Norm Depledge, Bill Turner, and Alexandra Woog then describe an open, distributable client-server architecture made up of three tiers: desktop, middleware (founded on the ACMSxp monitor and Digital's ObjectBroker software), and legacy interfaces.

The next set of papers features high-performance systems built on the 300-MHz Alpha 21164 microprocessor. Presented first is the AlphaServer 8000 platform, the basis for the highest performance systems yet developed by Digital. Dave Fenwick, Denis Foley, Bill Gist, Steve VanDoren, and Dan Wissell discuss the principal design issues relative to the aggressive goals set for system data bandwidth and memory read latency. They define their design approach with seven levels of abstraction and review the choices made in each level; the prevailing theme is achieving low memory read latency. As a result, the AlphaServer 8400 and 8200 systems feature a minimum memory read latency of 260 nanoseconds (ns). Moreover, in benchmark tests, the 12-processor AlphaServer 8400 system achieves LINPACK performance levels of 5 billion floating-point operations per second.

Essential for meeting AlphaServer speed requirements was a custom application-specific integrated circuit (ASIC) interface. Jean Basmaji, Kay Fisher, Frank Gatulis, Herb Kolk, and Jim Rosencrans describe the design of this high-performance CMOS ASIC.

Developers of a second-generation processor module for the AlphaServer 2100 multiprocessing system also took advantage of the Alpha 21164 microprocessor performance and at the same time ensured physical compatibility with the first generation. Nitin Godiwala and Barry Maskas highlight the designs that most efficiently used the system bus bandwidth and provided a 1.4 SPEC performance increase over the first-generation module, including a third-level cache, duplicate tag store, and a synchronous clocking scheme.

Also based on the Alpha 21164 microprocessor, the AlphaStation 600 5-series workstation incorporates the 64-bit PCI bus and supports three operating systems. In their paper, John Zurawski, John Murray, and Paul Lemmon focus on the chips that provide high-bandwidth interconnects between the CPU, the main memory, and the PCI bus. They also recount their experiences in the development of a hardware-based verification technique that improved test throughput by five orders of magnitude over the software-based techniques.

The Alpha 21164 microprocessor that is at the heart of the three systems described above delivers an outstanding microprocessor performance (peak) of 1.2 billion instructions per second. Three papers examine the circuit design, the logic functions, and the functional verification of this custom, 64-bit VLSI chip. First, Bill Bowhill et al. examine the circuit design contributions needed to achieve the performance goal of 300-MHz operation. The authors describe the floorplan choices for laying out the 9.3-million-transistor chip and the global single-wire clock distribution scheme. They then present a set of significant circuit design challenges: the speed requirement, the complicated microarchitecture, and the large physical size of the chip. They explain circuit implementation decisions for the instruction, execution, and memory units; the system clock; and the three caches.

The paper by John Edmondson et al. describes the functional units of the Alpha 21164 microprocessor: the quad-issue, superscalar instruction unit; the 64-bit integer execution pipelines; two 64-bit floating-point execution pipelines; a high-performance memory unit; and a cache control and bus interface unit. The authors note architectural improvements over the first-generation 21064 microprocessor and provide performance data.

The functional verification of this complex microprocessor is described in our concluding paper. Mike Kantrowitz and Lisa Noack review the many techniques employed to verify the logic design and the PALcode interface, including implementation-directed pseudorandom exercisers used in combination with focused hand-generated tests. The authors relate the lessons learned from the few bugs found in the first prototype of the microprocessor.

An anniversary is a time to look back and ahead. Looking back to the Journal's origins, I want to acknowledge the wisdom of Dick Beane, the Journal's first editor, and Sam Fuller, vice president of Corporate Research. They established the Journal's editorial focus and its structure: to publish technical papers, written by Digital's engineers, that describe the technological foundations of our products, under the guidance of an advisory board responsible for content and editorial philosophy. Because readers responded so well to the working engineer's perspective on product design, the Journal has grown from a biannual to a quarterly publication, and was one of the first industry journals to publish electronically on the World Wide Web. Further, since 1992, papers have been peer reviewed to ensure that readers receive substantive, accurate information on a widening number of topics covered in the Journal. Of course this growth would not have happened without the Journal's contributors, the engineers who analyze their unique and informative experiences and share them with their peers. As Digital's engineers add to the timeline of engineering milestones in computer systems, software, networking, storage, semiconductors, and peripherals, the Journal will continue to serve its readers by publishing this important work.

The editors thank Bob Supnik, Senior Corporate Consulting Engineer, for his help in bringing together this special issue of the Journal. Upcoming in the Journal: systems engineering, Sequoia 2000 research, software environments, scientific computing, and networking.

Jane C. Blake
Managing Editor

Foreword

From Race Cars to Express-Delivery Trucks

Like race cars, the initial Alpha hardware and software products were criticized for being temperamental: fast indeed for some applications, but not appropriate for others. Some observers assumed that the first generation was a fluke, or that it represented all that Alpha computers could ever be. The reaction to many architectural features, such as 64-bit addressing or relaxed read-write ordering, was "who needs it?" They wrote Alpha off as a niche design.

Delivering on Its Promise

Three years later, the future has arrived with a (muffled) roar. Bleeding-edge, brute-force chip technology has turned into practical engineering, with a balance of sophistication and everyday care: race cars to express-delivery trucks.

- The sophisticated second-generation Alpha 21164 chip, described in this issue, is 1.5 to 2 times as efficient as the first-generation 21064 chip in terms of work done per clock cycle on real programs.
- The chip has been boosted from 200 MHz to a stunning 300 MHz.
- Efficiency of compiled code has improved by 10 to 60 percent on many programs.
- Operating system code has been expanded and tuned.

The performance factors roughly multiply together, producing second-generation systems that are about 2.5 to 3.5 times faster than the equivalent first-generation systems. One example of the higher speeds of these new systems is the AlphaServer 8400, discussed in this issue; system performance approaches the level of supercomputers, with Linpack results of 5 GFLOPS.

The second-generation system platforms emphasize industry leadership for a broad range of commercial client-server applications, not just scientific applications. Like express-delivery trucks, much of the second-generation software is focused on enterprise-wide database access. Truly taking advantage of the 64-bit addressing for the first time, Oracle 7 database software can run huge in-memory database queries 200(!) times faster than traditional 32-bit database software. The three database papers in this issue emphasize Digital's focus on commercial applications.

Operating system support is substantially more robust and has been expanded to the fastest UNIX and Windows NT implementations in the industry. Full OpenVMS clustering, including mixed Alpha and VAX clusters, is available. UNIX and NT clustering is announced. All three operating systems now support SMP, symmetric multiprocessing. The 64-bit Digital UNIX implementation has led the rest of the industry in delivering 64-bit software by over 24 months.

Compiled-code improvements have been remarkable. In 1992, I could read the code generated by some of our compilers and redline three out of every four instructions as unneeded. A year ago, I could read compiled code and redline one instruction out of every two as unneeded. Today, I am hard-pressed to redline even 15 percent of the instructions as unneeded.

Moving beyond the installed base, migration efforts are now focused on bringing in new customers. In addition to VAX and MIPS binary translation, the SPARC-to-Alpha binary translation product is available. Code from PC platforms runs emulated on all Alpha operating systems. A technology demonstration of x86-to-Alpha binary translation has been given at trade shows.

The growing maturity and sophistication of the Alpha products have in turn led to accelerated sales growth. Over 100,000 Alpha systems worth over $3.5 billion (hardware, software, and service) have been shipped, and the ship rate has increased 66 percent in the past year alone. In its first three years, Alpha is off to a much faster start than other RISC architectures, such as HP-PA, in their first three years. Buying patterns have shifted from try-one-out to buy-a-fleet-to-run-the-business.

In three short years, Alpha computers have become established as the fastest in the industry: the yardstick by which others measure computer performance. Competitors have shortened their development cycles and aggressively increased their clock rates. Every single company that described Alpha features as unnecessary in 1992 is now rushing to bring its own 64-bit and relaxed read-write-order SMP implementations to market. Alpha has grown from "niche design" to "industry yardstick" in a single generation.

Digital has invested over $1 billion in the development of Alpha. Literally thousands of people have brought a paper design to life. Alpha is evolving much as the architects originally envisioned. I believe Peter Conklin, who led the Alpha Program Office and to whom this issue is dedicated in memoriam, would also want to dedicate this issue to all the brilliant and hardworking people who have made it a reality. My thanks and admiration to each of you.

So how are we doing? After you read this issue, I think you will agree, "Quite well, thank you."

Richard Pledereder
Vishu Krishnamurthy
Michael Gagnon
Mayank Vadodaria

DB Integrator: Open Middleware for Data Access

During the last few years, access to heterogeneous data sources and integration of the disparate data has emerged as one of the major areas for growth of database management software. Digital's DB Integrator provides robust data access by supporting heterogeneous query optimization, location transparency, global consistency, resolution of semantic differences, and security checks. A global catalog provides location transparency and operates as an autonomous metadata repository. Global transactions are coordinated through two-phase commit. Highly available horizontal partitioned views support continuous distributed processing in the presence of loss of connectivity. The DB Integrator enables security checks without interfering with the access controls specified in the underlying data sources.

A problem faced by organizations today is how to uniformly access data that is stored in a variety of databases managed by relational and nonrelational data systems and then transform it into an information resource that is manageable, functional, and readily accessible. Digital's DB Integrator (DBI) is a multidatabase management system designed to provide production-quality data access and integration for heterogeneous and distributed data sources.

This paper describes the data integration needs of the enterprise and how the DBI product fulfills those needs. It then presents the DBI approach to multidatabase systems and a technical overview of DBI concepts and terminology. The next section outlines the system architecture of the DBI. The paper concludes with highlights of some of the technologies incorporated in DBI.

Data Integration Needs

Companies often find themselves data rich, but information poor. Propelled by diverse application and end-user requirements, companies have made significant investments in incompatible, fragmented, and geographically distributed database systems that need to be integrated.
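The uniform-access problem stated above can be illustrated with a toy federation layer that presents one logical table over several autonomous sources. This is a minimal sketch under assumed table names and schemas, with in-memory SQLite standing in for the heterogeneous component databases; it is not the DBI implementation:

```python
import sqlite3

# Two autonomous "component databases"; in-memory SQLite stands in for
# the heterogeneous sources a multidatabase system would integrate.
payroll = sqlite3.connect(":memory:")
payroll.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
payroll.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                    [(1, "Ada", "ENG"), (2, "Grace", "OPS")])

sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
sales.execute("INSERT INTO employees VALUES (3, 'Alan', 'ENG')")

def federated_select(dept, sources):
    """Present one logical employees table: push the restriction down to
    each component database and merge only the qualifying rows."""
    rows = []
    for db in sources:
        rows += db.execute(
            "SELECT id, name FROM employees WHERE dept = ?", (dept,)).fetchall()
    return sorted(rows)

print(federated_select("ENG", [payroll, sales]))  # [(1, 'Ada'), (3, 'Alan')]
```

Pushing the restriction to each source before merging reflects the performance goal discussed later in the paper: minimize the amount of data transferred across the network.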
Companies with centralized information systems are seeking methods to distribute this data to inexpensive, departmental platforms, which would maximize performance, lower cost, and increase availability.

The DB Integrator product family is specifically designed and implemented to address the following data integration needs:

- Data access. The data integration product must provide uniform access to both relational and nonrelational data regardless of location or storage form. Data access must be extensible to allow the user to write special-purpose methods.
- Location and functional transparency. The location of the data and the functional differences of the various database systems must be hidden to provide end users with a single, logical view of the data and a uniformly functional data access system.
- Schema integration and translation. Users of data integration software must be presented with an environment that lets them easily determine what data is available. Such an environment is frequently referred to as a federated database. A data integration product must be flexible enough to help resolve semantic inconsistencies such as variances in field names, data types, and units of measurement.
- Data consistency. Maintaining data consistency is one of the most important aspects of any database system. This is also true for a federated database.
- Performance. Integrating data from multiple data sources can be an expensive operation. The two primary goals are to minimize the amount of data that is transferred across the network and to maximize the number of rows that are processed within a given unit of time.
- Security. Access to distributed data must not compromise the security of data in the target databases. The security model must provide authorized access to an integrated schema without violating the security of the autonomous data sources that have been integrated.
- Openness. Any data integration product must accommodate tools and applications with standard SQL (structured query language) interfaces, both at the call level (e.g., Open Database Connectivity [ODBC] for clients) and the language level (e.g., ANSI SQL).1,2 It must be able to provide and enable access to data over the most commonly deployed transports such as transmission control protocol/internet protocol (TCP/IP), DECnet, or Systems Network Architecture (SNA).3
- Administration. The integrated database must provide flexibility in configuration and be easy to set up, maintain, and use.

Figure 1 illustrates the current set of client-server data access supported by the DB Integrator product family.

Multidatabase Management Systems

A multidatabase management system (MDBMS) enables operations across multiple autonomous component databases. Based on the taxonomy for multidatabase systems presented in Reference 4, we can describe DBI as a loosely coupled, heterogeneous, and federated multidatabase system. DBI is loosely coupled compared to the component databases: the database administrator (DBA) that is responsible for DBI and the DBAs that are responsible for the component databases manage their environments independently of one another. DBI is heterogeneous because it supports different types of component database systems. DBI is federated because each component database exists as an independent entity.

Reference Architecture

The MDBMS provides users with a single system view of data distributed over a large number of heterogeneous databases and file systems. The MDBMS interoperates with the individual component databases similar to the way that the SQL query processing engine in a relational DBMS interoperates with the




Figure 1 Client-Server Data Access with the DB Integrator

record storage system. Thus, a relational MDBMS, such as DBI, is typically composed of the following processing units:

- Language application programming interface (API) and SQL parser
- Relational data system
  - Global catalog manager
  - Distributed query optimizer
  - Distributed execution system
  - Distributed transaction management
- Gateways to access data sources

Catalog Management

One of the key differentiators between MDBMS architectures is the way that the metadata catalog is organized. Metadata is defined as the attributes of the data that are accessible (e.g., naming, location, data types, or statistics). The metadata is stored in a catalog. Two common approaches for catalog management are described below:

- Autonomous catalog. The MDBMS maintains its own catalog in a separate database. This catalog describes the data available in the multidatabase. For data that resides in a relational database, the metadata definitions of table objects, index objects, and so forth, are imported (i.e., replicated) into the multidatabase catalog. For data that resides in some other data source such as a record file system (e.g., record management system [RMS]) or a spreadsheet, the MDBMS catalog contains a relational description of that data source.
- Integrated catalog. The MDBMS is integrated with a regular database system that is capable of accessing objects (both data and metadata) in remote and foreign databases. A gateway server is responsible for making a foreign database appear as a homogeneous, remote database instance. For data that resides in a relational database, the gateway server stores views of its system relations into that database. For data that resides in a record file system or spreadsheet, the gateway server stores the relational metadata description of the data in a separate data store.

DBI Approach

The DBI approach to multidatabase management very closely follows the reference architecture presented earlier. The DBI approach emphasizes the following design directions:

- Global, autonomous catalog for metadata management
- Three-tier integration model (described later in this section)
- Simple, mapped-in gateway drivers to access data sources
- Support of distributed database features for the Oracle Rdb relational database as well as support of existing Oracle Rdb applications in the multidatabase environment

Global Catalog. DBI is addressable as a single integration server. Integration clients such as tools and applications do not need to deal with the complexities of the distributed data. The DBI global catalog is a repository in which DBI maintains the description of the distributed data. It enables DBI to provide tools and applications with a single access point to the federated database environment. The global catalog enables DBI to tell users what data is available without requiring immediate connectivity to the data or its data source. It can be managed and maintained as an independent database. The maintenance of the DBI global catalog is not inherently tied to a specific data manager; currently, the DBI catalog may reside in ORACLE, SYBASE, or Oracle Rdb databases.

The use of a global catalog may result in a system with a single point of failure. To eliminate its potential failure within a node, a disk, or a network, standard high-availability mechanisms may be employed. These include shadowed disks with shared access (e.g., clustered nodes) and data replication of the DBI catalog tables with products such as the Digital Data Distributor.5

Three-tier versus Two-tier Architecture. With a two-tier data integration model, once the data has been retrieved from the server tier, the actual integration occurs on the client tier. This may result in massive integration operations at the client site. In contrast, the DBI is based on a three-tier architecture that performs most integration functions on a middle tier between the client and the various database servers. The three-tier approach avoids unnecessary transfer of data to the client and is essential to providing production-quality data integration. In another comparison, all clients in the two-tier approach need to be configured to access the various data sources; however, the three-tier approach significantly reduces such management complexities.

Gateway Driver Model. DBI deploys a set of gateway drivers to access specific data sources, including other DBI databases. These drivers share a single operating system process space with DBI to avoid unnecessary interprocess communications. When DBI performs parallel query processing, however, gateway drivers may reside in a separate process space. The core of DBI interacts with the actual gateway drivers (e.g., a SYBASE gateway driver) through the Strategic Data

Interface (SDI), an architected interface that is used within the DBI product family as a design center. A gateway driver is implemented as a relatively thin software layer that is SDI compliant and that is responsible for handling impedance mismatches in data models (e.g., RMS versus relational), query language (e.g., different dialects of SQL), and run-time capabilities (e.g., SQL statement atomicity).

Distributed Rdb
One of the design goals for DBI was to enable distributed database processing for DEC Rdb (now Oracle Rdb). From the perspective of an application, DBI therefore looks like a distributed Rdb database system.

DBI Concepts and Terminology

In this section, we present a brief overview of the concepts and terminology relevant to DBI.

DBI Database
A DBI database consists of (1) a set of tables that DBI creates to maintain the DBI metadata (also referred to as the catalog) and (2) the distributed data that is available to the user when connected to the DBI catalog. A DBA creates a DBI database using the DBI SQL CREATE DATABASE statement. This statement has been extended for DBI to allow the user to indicate the physical database (e.g., a SYBASE database) that will be used to hold the DBI metadata tables.
The creator of a DBI database automatically becomes the owner and system administrator of that database. A DBI system administrator may grant access privileges on the DBI database to other users. Depending on the level of privilege, a user may then perform system administration functions, execute data definition language (DDL) operations, and/or query the tables in the virtual database.

DBI Objects
In addition to regular SQL objects such as tables or columns, DBI uses link and proxy objects that are outside the scope of the SQL language standard.

Links and Proxies
The link object tells DBI how to connect to an underlying data source (referred to as the link database). A link object has three components: a link name, the access string used to attach to the link database, and, optionally, security information used by the DBI gateway driver to provide authentication information to the link database system. The proxy object is associated with a link object. It can be used to specify user-specific authentication information for individual links. When users do not want to use proxies for their links, they must specify the authentication information for a specific database at the time they connect to DBI.

Tables
With link and proxy objects in place, the user can import metadata definitions of underlying tables into the DBI catalog. The metadata imported for a table includes statistics, and constraint and index information, all of which are used by the DBI optimizer. The import step is performed with a CREATE TABLE statement that has been extended to allow for a link reference. For example:

-- Import "rdb_emp" table into DBI
-- database as "emp" from the link
-- database represented by the link
-- named "link_rdb".

CREATE TABLE emp LINK TO rdb_emp
USING link_rdb;

Views
View objects are useful for making multiple tables from different link databases appear as a single table. In DBI, views serve as powerful mechanisms to resolve semantic differences in tables from disparate databases. DBI supports two types of views: regular SQL views and horizontally partitioned views (HPVs). Regular views are compliant with ANSI SQL92 Level 1; they support full query expression capabilities and updatability. HPVs consist of a view name, a partitioning column, and partition specifications. Figure 2 is an example of an HPV definition.
HPVs provide a very powerful construct for defining a logical table composed of horizontal partitions

CREATE VIEW emp (emp_id, first_name, last_name, country)
  USING HORIZONTAL PARTITIONING ON (country)

  PARTITION us WHERE country = 'US'
    COMPOSE AS SELECT employee_id, first_name, last_name, 'US'
               FROM emp_us

  PARTITION europe WHERE OTHERWISE
    COMPOSE AS SELECT emp_id, first_name, last_name, country_code
               FROM emp_eur;

Figure 2 Example of an HPV Definition

that may span tables from disparate data sources. Both retrieval and update operations on HPVs are optimized such that unnecessary partition access is eliminated. In addition, HPVs may be used to implement a shared-nothing computing model on top of both homogeneous and heterogeneous databases.

Stored Procedures
DBI supports stored procedure objects. Stored procedures allow the user to embed application logic in the database. They make application code easily shareable and enable DBI to maintain dependencies between the application code and database objects. Furthermore, stored procedures reduce message traffic between the client and the server. Figure 3 is an example of a stored procedure.

DBI Database Administration
DBI supports statements that keep the imported metadata consistent with the link database. The extended ALTER TABLE statement may be used to regularly refresh the table metadata information or update the table's statistics. The ALTER LINK statement may be used to modify the link database specification or a proxy for a given link object.

DBI Configuration Capabilities
Figure 4 shows the power of configuration options supported by DBI. Following the three-tier model for data integration, the DBI server may access a very large number of databases, including other DBI databases. The DBI server is accessible through SQL APIs that are available on popular client platforms. DBI's client-server protocol is supported on all common transports such as TCP/IP, Novell's sequenced packet exchange/internetwork packet exchange (SPX/IPX), DECnet, or Windows Sockets. DBI itself may be deployed on Digital UNIX (formerly DEC OSF/1) and OpenVMS platforms today. Support for additional platforms is being added.

DBI System Architecture

In this section, we describe the system architecture of the DBI product family and present some of its specific designs.

Interfaces
As shown in Figure 5, the DBI system architecture is anchored by two external interfaces, SQL and metadata driver interfaces/data driver interfaces (MDI/DDI), and two internal interfaces, Digital Standard Relational Interface (DSRI) and SDI.
The SQL interface is used by DBI clients to issue requests to the integration server. The MDI/DDI interface is used by DBI to call gateway drivers that are provided by a user. The MDI/DDI interface specifies a simple, record-oriented data access interface provided by Digital to assist users in the access and integration of data sources for which no Digital-supplied gateway drivers are available.
DSRI is the interface between DBI's SQL parser and the DBI processing engine. The SDI interface specifies a canonical data interface that shields the DBI core from data-source-specific interfaces and facilitates modular development.6

procedure maintain_salaries (:state char(2) in,
                             :n_decreased integer out);
begin
  set :n_decreased = 0;
  for :empfor as each row of
      select * from employees emp
      where state = :state
  do
    set :last_salary = 0;
    history_loop:
    for :salfor as each row of
        select salary_amount from salary_history s
        where s.employee_id = :empfor.employee_id
    do
      if :salfor.salary_amount < :last_salary then
        set :n_decreased = :n_decreased + 1;
        leave history_loop;
      end if;
      set :last_salary = :salfor.salary_amount;
    end for;
  end for;
end;

Figure 3 Example of a Stored Procedure
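The procedure in Figure 3 can then be invoked in a single client request. The article does not show the invocation form; the following is a hypothetical sketch in generic SQL syntax, where 'NH' is an arbitrary state code and :n_decreased is a host variable supplied by the client:

```sql
-- Hypothetical invocation sketch (not from the article): the entire salary
-- scan runs inside the server, and only the single output parameter
-- :n_decreased travels back to the client.
CALL maintain_salaries('NH', :n_decreased);
```

Because the looping logic executes at the server, the client exchanges one request and one reply instead of one round-trip per employee row, which is the message-traffic reduction described above.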

[Figure 4 is a configuration diagram showing a client node and several server nodes (nodes A through E), including DBI servers and a SYBASE database.]

Figure 4 DBI Configuration Capabilities

[Figure 5 shows the DB Integrator architecture: SQL clients (ODBC, SQL/Services, SequeLink, DEC SQL) on top of the DB Integrator core, which reaches relational and nonrelational data sources through DBI gateway drivers, the MDI/DDI interface, and native DBMS APIs.]

Figure 5 DB Integrator Architecture

Components
The component architecture of DBI in Figure 6 closely resembles the multidatabase reference architecture presented earlier:

- The SQL and ODBC client-server environment provides language API and SQL parser functions.
- The API driver and context manager support distributed transaction management and part of the distributed execution system.
- The metadata manager provides global catalog management.
- The compiler supports the distributed query optimizer and compilation.
- The executor supports the remaining part of the distributed execution system.
- The SDI dispatcher and gateway drivers provide the access to data sources.

SQL Environment and Server Infrastructure
The SQL parser supports DEC SQL, an ANSI/National Institute of Standards and Technology (NIST)-compliant SQL implementation, by mapping DEC SQL syntax into an internal query graph representation. In a client-server environment, the DBI server infrastructure is used to manage, monitor, and maintain a DBI server configuration that supports workstation and desktop clients.

API Driver and Context Manager
The API driver is responsible for the top-level control flow of client requests within the DBI core. It currently accepts DSRI calls from applications such as DEC SQL and dispatches them within DBI. The context manager performs demand-driven propagation of execution context to the gateway drivers and maintains the distributed context of active sessions, transactions, and requests.

Metadata Manager
The metadata manager is responsible for the overall management of and access to metadata. The services provided fall into the categories of catalog management, data definition, metadata cache management, and query access to DBI system relations. The metadata catalog manager maintains the DBI catalog in the form of DBI-created tables in an underlying database (e.g., SYBASE or ORACLE). The DDL processor executes the data definition statements. The metadata cache manager is responsible for maintaining metadata in a volatile cache that provides high-speed access to metadata objects.
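Because the catalog is exposed as ordinary system relations, a tool can discover the federated schema with plain SQL. This excerpt does not name the system relations, so the relation and column names below are purely hypothetical illustrations of the idea:

```sql
-- Hypothetical sketch only: "dbi_system_tables" and its columns are
-- invented names standing in for the real DBI system relations, which
-- this article does not enumerate.
SELECT table_name, link_name
FROM dbi_system_tables
WHERE table_name LIKE 'EMP%';
```

This is the style of access a client such as DEC SQL uses when it queries the DBI system relations for metadata.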

[Figure 6 shows the DB Integrator components: SQL/ODBC clients above the API driver and context manager; the metadata manager, compiler, and executor; and the SDI dispatcher with the DBI gateway drivers.]

Figure 6 DB Integrator Components

Compiler
The compiler provides services for translating SQL statements and stored procedures into DBI execution plans. A rule-based query optimizer performs query rewrite operations, enumerates different execution strategies, and factors in functional capabilities of the underlying data sources. Each execution strategy is associated with a cost that is based on predicate selectivity estimates, table cardinalities, availability of indices, network bandwidth, and so forth. The lowest cost strategy is chosen as the final execution plan. Above a certain threshold of query complexity, the optimizer switches from an exhaustive search method to a greedy search method to limit the computational complexity of the optimization phase. The compiler generates code that can be processed by the executor component and the gateway drivers.

Executor
The executor component is responsible for processing the execution plan that the compiler produces. These activities include

- Exchanging data between DBI and the client
- Streaming data between the DBI core and the link databases
- Performing intermediate data manipulation steps such as joins or aggregates
- Managing workspace and buffer pools to efficiently handle large amounts of transient and intermediate data
- Supporting parallel processing

SDI Dispatcher and Gateway Drivers
The SDI dispatcher separates the core of DBI from the gateway driver space. It locates and loads shareable images that represent gateway drivers and routes SDI calls to the corresponding entries in the gateway driver image.

Technical Considerations

The DBI development team selected several designs and technologies that it believes to be crucial for distributed and heterogeneous data processing. This section summarizes those designs within the following functional units: distributed execution; distributed metadata handling; distributed, heterogeneous query processing; high availability; performance; and DBI server configuration.

Distributed Execution
To support transparent distributed query processing, DBI propagates execution context such as connection context or transaction context to the target data sources. Tools and applications see only the simple user session and transaction that they establish with the DBI integration server.
DBI uses a tree organization to track the distributed execution context. When a user connects to a DBI database, a DBI user session context is created. This session context is subsequently used to anchor active transactions, compiled SQL statements, as well as the metadata cache that is created for every user attaching to DBI. When DBI passes control to a gateway driver, both session and transaction context are established at the target data source.
Distributed transactions must support consistency and concurrency across autonomous database managers. Consistency requires that a distributed transaction manager with two-phase commit logic is available. DBI uses the Digital Distributed Transaction Manager (DDTM) for that purpose and is adding support for distributed transaction processing (DTP) XA standard integration.10,11
Concurrency requires that distributed deadlocks are detected. In a multidatabase system, distributed deadlock prevention is not feasible because no database manager exposes external interfaces to its lock management services, a procedure required to perform deadlock detection. DBI therefore relies on the simple technique of transaction time-out to detect deadlocks. In addition, a DBI application may choose to specify isolation levels lower than serializability or repeatable read. This is done with the SQL SET TRANSACTION statement. The DBI context manager records the transaction attributes specified and forwards them to the underlying data sources as part of propagating transaction context. Lower isolation levels will, in general, result in fewer lock requests and thus fewer deadlock situations.

Distributed Request Activation
DBI supports SQL statement atomicity. This requires either that a single SQL statement executes in its entirety or, in the case of a failure, that the database is reset to its state prior to

the execution of the statement. With DBI, the SQL statement may be executed as a series of database requests at multiple data sources. DBI internally uses the concept of markpoints to track SQL statement boundaries. Gateway drivers are informed of markpoint boundaries, and the driver attempts to map the markpoint SDI operations into semantically equivalent constructs (e.g., savepoints) at the target data source. Some databases support SQL3-style savepoints, which are atomic units of work within a transaction. When DBI decides to roll back a markpoint, the gateway driver may then inform such a data source to roll back to the last savepoint. In the absence of markpoint primitives in the target data source, the gateway driver may elect to roll back the entire transaction to meet the roll-back markpoint semantics.

Gateway Drivers
In contrast with other data integration architectures, the DBI gateway drivers are designed to be simple protocol and data translators. Their primary task is to report the capabilities of the data-source interface (API and SQL language) to the DBI core and subsequently map between the SDI interface protocol and the data-source interface. The gateway drivers typically share process context with the DBI server process, thus avoiding the need for an intermediate gateway server process that would otherwise reside between the DBI server and the data-source server (e.g., SYBASE SQL Server). This reduces the amount of context switching and interprocess message transfer.
The gateway drivers are responsible for mapping the SDI semantics to the interface primitives provided at the target data source. For relational databases such as Oracle Rdb, ORACLE, INFORMIX, SYBASE, or DB2, this requires primarily a mapping to the product-specific SQL dialect and the product-specific data types. For file systems such as RMS, the gateway driver maps the SDI semantics to calls to the RMS run-time library.

Distributed Metadata Handling

In this section, we discuss three areas of importance to the handling of metadata in DBI: catalog management, security, and metadata caching.

Catalog Management
The DBI requirement of database independence implies that DBI cannot require the presence of a particular DBMS for its persistent metadata storage. Rather than devising a private storage and retrieval system, DBI was designed to layer on top of common relational DBMSs.
Static, precompiled native applications are used to access metadata from a given catalog DBMS for two reasons: (1) the pattern of metadata access for the catalog database is known, and (2) the tables housing the DBI metadata in the catalog database are predetermined. Although this approach does not take advantage of the existing gateway drivers, it results in high-performance access to the metadata store.
To simplify the development of a catalog application, the set of primitive operations on the catalog database was isolated, and a catalog application interface (CI) was defined. Catalog applications are developed according to the CI specification and implemented as shareable images. DBI dynamically loads the appropriate catalog application image based on the catalog type specified by a user attaching to a DBI database.

Security
The security support in the currently released version 3.1 of DBI is simple but effective. It uses the security mechanisms of the underlying link database systems in the following areas:

- Authorization to connect to an underlying database through DBI and access data from it. Access to the data that is manipulated through DBI is controlled by the underlying DBMS. Typically, underlying database systems control access to data based on the identity of the user attached to its database. DBI supports objects called proxies that enable the client to specify its user identity (i.e., username/password), which is then used to attach to the underlying database.
- Authorization to perform various DBI operations. All privileges for a DBI database are for the database itself, rather than for tables or columns. The privileges are based on hierarchically organized categories of users:
  - The DBADM privilege is given to users responsible for setting up and maintaining a DBI database.
  - The CREATE, DROP privilege is granted to interactive users and application developers with database design responsibility who must perform data definition operations.
  - The SELECT privilege is reserved for interactive users and application developers who perform data manipulation operations but do not perform any data definition operations.

When a DBI administrator grants or revokes privileges for a DBI database, DBI, in turn, grants or revokes the appropriate set of privileges on the DBI tables in the database system that manages the DBI catalog. The enforcement of privileges is therefore carried out by that database system. For example, when the SELECT privilege is granted on the logical database, DBI grants the SELECT privilege on the tables that represent the DBI catalog. This ensures that the user has access to the metadata for processing queries. Similarly, when a user is granted the CREATE, DROP privilege on the DBI database, DBI grants SELECT, INSERT, UPDATE, and DELETE on the appropriate tables in the catalog database to the user. This ensures that any DDL actions executed by the user will enable DBI to modify the tables in the catalog database.

Metadata Manager Cache
The in-memory metadata cache serves a dual purpose. First, it facilitates rapid access to the metadata by the DBI compiler. Second, it serves as a data store for the DBI system relations that can be queried by tools and applications. For example, DEC SQL obtains metadata for semantic analysis of SQL statements by querying the DBI system relations.
The metadata cache is structured as a single hash table representing a flat namespace across all DBI objects. An open hashing scheme is employed in which the hash-table entries hang off the buckets in the hash table in a linked list.
To optimize the use of the cache as well as to accelerate the attach operation, the metadata manager initially obtains only minimal, high-level metadata information from the catalog database; for example, only names of tables are fetched into the cache during the DBI database attach operation. Subsequently, the metadata manager obtains further metadata information from the catalog database on a demand basis.
DBI allows the creation of new metadata objects. These operations are typically performed within markpoint and transaction boundaries to enforce proper statement and transaction demarcation. The metadata manager maintains a physical log in cache to denote transaction and markpoint boundaries. The log is an ordered list of structures, each representing a DDL action, a pointer to the cache structure that was changed, and either the previous values of fields that were updated or a pointer to a previous image of an entire structure. When a markpoint or transaction is committed, the corresponding log part is reset; when a markpoint or transaction is rolled back, the log is used to restore the cache to its state prior to the start of the markpoint or transaction.
An object in cache can become stale when another user attaches to the DBI database and causes an object's metadata to be changed in the catalog database. To ensure consistency of the cached version of an object's metadata with the actual version in the catalog database, the metadata manager uses a time stamp to check the currency of the cached object when performing incremental fetching of the object's metadata. If the object in cache is stale, the object is not accessible in the session, and an error message is issued to the user indicating that the object in cache is inconsistent with the catalog database. In a production environment, this would be a rare event, given the low frequency of data definition operations.
The metadata cache is also the data source for the DBI system relation queries. The metadata manager navigates the cache structures to obtain data for the system relations, making use of the hash table for efficient access and using DBI's execution component for evaluating search conditions and expressions.

Distributed, Heterogeneous Query Processing

Distributed query processing in a heterogeneous database environment poses certain unique problems. Data sources behave differently in terms of data transfer cost, and they support different language constructs. Many systems employ rudimentary techniques for decomposing a query, frequently pulling in all the data from underlying tables to the processing node, and then performing all the operations in the integration engine. Others simply use syntactic transformations, thereby providing the least common denominator in language functionality. DBI, on the other hand, provides a robust query optimizer that includes decomposition algorithms to reduce the data flow and provide high-performance query execution.

Cost-based Plan Generation
When a query has several equivalent means of producing the result, the plan that has the least estimated cost is chosen. Statistics for table, column, and index objects are used for estimating result size after various relational operations.12,13 Data transmission costs from the underlying link database to DBI are taken into account when estimating how much of the query is to be sent to the gateway database. The network transmission cost is measured dynamically for each user session, once per gateway connection. The cost associated with performing a relational operation is also aggregated into the overall cost. This crucial step ensures that the plan is not skewed toward one database engine, which would be the case if only the network transmission costs were taken into account.

Rule-based Transformations
A query result may be produced with different sequences of relational operations. These sequences are generated using rule-based transformations. The starting point is the original operation set in which the query was syntactically represented. From this, permutations are generated to form equivalence sets, which then lead to the various combinations of execution plans that need to be examined for cost. Finally, the least costly plan is chosen for the query. Heuristics are applied to limit the amount of search space.

Capability-based Decomposition
The critical characteristic of a heterogeneous environment is that the data sources are nonuniform in their ability to perform certain operations and in their support of various

language constructs. For example, most databases cannot support derived table expressions (i.e., select expressions in the FROM clause of another SELECT statement).
The plan generation and decomposition phases of the optimizer must recognize the underlying databases' capabilities. Consider the query example shown in Figure 7 and the indicated locations of the tables. First, with T1 and T3 located in the same database, the optimizer can generate a subplan in which the join between these two tables can be executed in the ORACLE database. An examination of the last (third) AND predicate indicates that all the tables involved in that predicate are located in the same ORACLE database. Due to the limitations in ORACLE's SQL language support, however, it cannot evaluate the combined expression between two subqueries in the WHERE clause, where the arithmetic result is to be compared to the column T1.c5.
The DBI optimizer employs a more sophisticated alternative. It evaluates the two subqueries separately and then substitutes them in the predicate in the subplan for ORACLE as run-time parameter values. This technique leads to the most efficient plan:

1. Retrieve the value for (select avg(T4.c5) from T4) from ORACLE.
2. Assign the value to variable X.
3. Retrieve the value for (select T5.c7 from T5 where T5.c8 = 'a') from ORACLE.
4. Assign the value to variable Y.
5. Assign param_1 := variable X.
6. Assign param_2 := variable Y.
7. Execute the SELECT statement below in ORACLE and fetch the result rows.

   select *
   from T1, T3
   where (T1.c3 = T3.c3)
     and (T1.c5 = param_1 + param_2);

8. Fetch the rows of T2 from DB2 into DBI.
9. Perform the join in DBI between the results of steps 7 and 8.

Query Unnesting
A nested SQL query, in its simplest form, is a SELECT query with the WHERE clause predicate containing a subquery (i.e., another SELECT query). The following are examples of nested SQL queries:

Example 1, Table Subquery

select *
from A
where A.c1 IN (select (B.c2 + 5)
               from B
               where B.c3 = A.c3);

Example 2

select *
from A
where A.c1 = (select max(B.c2)
              from B
              where B.c3 = A.c3);

Using strict SQL semantics, we can evaluate this nested query by computing the results of the inner subquery for every tuple in the outer (containing) query block. The value for the column A.c3 is substituted in the inner subquery, and the resulting value (or values) are computed for the select list and used to evaluate the Boolean condition on column A.c1; this is repeated for every tuple of A. This method of evaluating the results is very expensive, especially in a distributed environment.
Query unnesting algorithms provide other methods of evaluation that are semantically equivalent but much more efficient in both time and space. Unnesting deals with the transformation of nested SQL queries into an equivalent sequence of relational operations. These relational operations are performed as set operations, thereby avoiding the expensive tuple iteration operators during execution and providing large performance gains in most cases. The background and motivation behind the use of unnesting has been presented in several research papers.14,15

select *
from T1, T2, T3
where (T1.c1 = T2.c2)
  and (T1.c3 = T3.c3)
  and (T1.c5 = (select avg(T4.c5) from T4) +
               (select T5.c7 from T5 where T5.c8 = 'a'));

T1, T3, T4, and T5 are located in an ORACLE database.
Table T2 is located in a DB2 database.

Figure 7 Example of an SQL Query

Depending on the type of operations and constructs found in the nested select block and its parent select block, several different algorithms can be used. Some of these require no special operators over and above the regular join operator. Other transformations require a special semijoin operator. Consider the examples shown in Figure 8.
In the example shown in Figure 9, a special operator called semijoin is necessary. The semijoin of table R with S on condition J is defined as the subset of R-tuples for which there is at least one matching S-tuple satisfying J. Note that this makes the operator asymmetric, in that (R semijoin S) is not the same as (S semijoin R), whereas the regular join is symmetric. By implementing the special semantics required for this semijoin operator, we can transform the nested query into this join operator that can again make use of high-performance techniques like hash joins within the DBI execution engine.

Predicate Analysis
When a query against an HPV can be satisfied by simply accessing a single logical partition, then the rest of the partitions can be eliminated from the execution plan. Partition elimination algorithms in DBI are used both at compile time, when the predicates on the HPV query involve comparison of the partitioning column with literals, as well as at query execution time (run time), when the partitioning column is compared with run-time parameters.
During affinity analysis, predicates are situated as close to the inner table operation as feasible. For example, consider the following view definition, and the subsequent select statement on that view:

create view V1 (a, b) as
  select T1.c1, avg(T2.c2)
  from T1, T2
  where (T1.c4 = T2.c4)
  group by T1.c1;

select * from V1 where (a = 5 and b > 10);

The predicate a = 5 (upon further conjunctive normal form [CNF] analysis) can be applied on the base table scan itself as T1.c1 = 5.
Index join is one of the efficient join techniques used in DBI. This join technique minimizes the movement of data from the link databases by taking advantage of the indexing schemes in the link database to facilitate the join process. Consider the following query:

select *
from T1, T2
where T1.c1 = T2.c2 + 5
  and (...some restrict predicate(s) on T2...)
-- Q1 - query that will not require a special join after transformation

select snum, city, status from S
where status = (select avg(weight) + 5      -- nesting predicate
                from P
                where P.city = S.city);     -- correlation predicate

-- Q1-U - the unnested version

select snum, city, status
from S, (select city, avg(weight) + 5 from P group by city) as T1(c1, c2)
where T1.c1 = S.city and S.status = T1.c2;

-- Algorithm:
-- 1) Take the inner block's FROM table that has a correlation predicate.
-- 2) Add a Group-By to the inner block containing all attributes of this
--    table that appear in one or more correlation predicates. The order of
--    the attributes in the Group-By does not matter.
-- 3) Also, add these elements to the select list of the inner block; at the
--    beginning or at the end, whatever is convenient.
-- 4) Next, add this block to the FROM list of the outer block, effectively
--    doing a regular join with the tables in the outer FROM list.
-- 5) Lastly, rewrite the correlation and nesting predicates as shown.

Figure 8 Query Unnesting Algorithm
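The equivalence that the unnesting algorithm relies on can be checked directly with any SQL engine; below is a small sketch using SQLite, with invented sample tables (the S and P column values are hypothetical, not from the paper):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE S (snum INTEGER, city TEXT, status REAL);
CREATE TABLE P (pnum INTEGER, city TEXT, weight REAL);
INSERT INTO S VALUES (1,'NASHUA',15.0),(2,'PARIS',9.0),(3,'NASHUA',20.0);
INSERT INTO P VALUES (10,'NASHUA',10.0),(11,'NASHUA',20.0),(12,'PARIS',4.0);
""")

# Q1: the original nested query with a correlated subquery.
nested = con.execute("""
  SELECT snum, city, status FROM S
  WHERE status = (SELECT AVG(weight) + 5 FROM P WHERE P.city = S.city)
""").fetchall()

# Q1-U: the unnested version; the subquery becomes a grouped derived
# table that is joined to S on the former correlation column.
unnested = con.execute("""
  SELECT snum, S.city, status
  FROM S JOIN (SELECT city AS c1, AVG(weight) + 5 AS c2
               FROM P GROUP BY city) AS T1
    ON T1.c1 = S.city AND S.status = T1.c2
""").fetchall()

print(sorted(nested) == sorted(unnested))  # → True
```

Both formulations return the same rows, which is exactly what lets the optimizer trade the correlated subquery for a regular join.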

-- Q2 - query requiring a semi-join

select snum from S
where city IN (select city from P where P.weight = S.status);

-- Q2-U - the unnested version

select snum
from (S semi-join P on (P.weight = S.status AND S.city = P.city));

-- 1) Do a semi-join between S and P using the following (combined) condition:
--    "(P.weight = S.status) AND (S.city = P.city)"
--    In reality, this is actually specified as 2 separate semi-joins between
--    S and P, one with the correlation predicate and one with the form of
--    the nesting predicate. But these get combined using rules.
-- 2) Project out S.snum from the result.

Figure 9 Algorithm with Semijoin Operator
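The semijoin semantics defined above (keep each R-tuple that has at least one matching S-tuple) can be sketched over plain tuple lists; the rows and the combined condition from Q2 are illustrative, not DBI engine code:

```python
def semijoin(r_rows, s_rows, condition):
    """Return the subset of R-tuples for which there is at least one
    matching S-tuple satisfying the join condition."""
    return [r for r in r_rows if any(condition(r, s) for s in s_rows)]

# Hypothetical S (snum, city, status) and P (city, weight) rows.
S = [(1, "NASHUA", 15.0), (2, "PARIS", 9.0)]
P = [("NASHUA", 15.0), ("LYON", 9.0)]

# Combined condition from Q2: P.weight = S.status AND S.city = P.city
cond = lambda s, p: p[1] == s[2] and s[1] == p[0]

matched = semijoin(S, P, cond)
print([row[0] for row in matched])   # project out S.snum → [1]
```

Note the asymmetry the text points out: `semijoin(S, P, ...)` keeps S-tuples, while `semijoin(P, S, ...)` keeps P-tuples, so the two are not interchangeable the way a regular join's operands are.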

Given an index on column c1 of table T1, and with cardinality and cost estimates permitting, the query optimizer can generate an alternate plan. This plan allows the join to be performed by using efficient indexed access retrieval for table T1.

High Availability

High availability in DBI results from the use of horizontal partitioned views and catalog replication.

Horizontal Partitioned Views  An HPV is a special kind of view in which DBI is provided with information about how data is distributed among tables in link databases. HPVs offer many advantages over normal views, one of them being improved performance through partition elimination and use of parallelism. The other advantage is high availability. If a partitioned view has multiple partitions and if some partitions are unavailable when the view is queried, then DBI does not fail the query but returns data from the available partitions. An example is shown in Figure 10. The example creates a partitioned view named ALL_EMPLOYEES, with four columns and three partitions, each of which obtains rows from three different tables. The partitioning is based on a specific column, in this case the CITY column, as specified in the USING HORIZONTAL PARTITIONING ON clause. Suppose the following query is submitted:

SQL> SELECT * FROM ALL_EMPLOYEES
     WHERE (CITY = 'MUNICH')
     OR (CITY = 'NASHUA');

First, partition P2 is eliminated at compile time. Now suppose partition P3 is presently not available due to network connectivity problems (Figure 11). For each partition that is unavailable, a message is returned indicating that some rows are missing from the result table:

%DBI-W-HAHPV-UNAVAILABLE, Partition P3 is currently unavailable.

However, DBI still attempts to return as much data as is accessible.

Catalog Replication  To prevent the DBI global catalog from becoming a single point of failure, multiple copies of a catalog table can be maintained by using replication techniques. Catalog table copies can be created easily and maintained using replication tools such as the DEC Data Distributor.5

Performance

In addition to its distributed query optimizer, DBI uses a series of techniques to increase the speed of query processing, most notably in the areas of data transfer, memory management, join processing, parallelism, and stored procedures.

Data Transfer  The DBI execution engine performs bulk data transfer using the bulk fetch mechanisms provided by the SDI interface. With bulk data transfer, a single request message to a local or remote data source returns many tuples with a single response message. Bulk transfer techniques are mandatory in a distributed environment; they reduce both message traffic and stall waits due to message delays. The data

CREATE VIEW ALL_EMPLOYEES (ID, NAME, ADDRESS, CITY)
  USING HORIZONTAL PARTITIONING ON CITY
  PARTITION P1 WHERE CITY = 'MUNICH' COMPOSE AS
    SELECT ID, LAST_NAME, ADDRESS, 'MUNICH'
    FROM MUNICH_EMPLOYEES WHERE STATUS = 'Y'
  PARTITION P2 WHERE CITY = 'PARIS' COMPOSE AS
    SELECT ID, FULL_NAME, ADDRESS, 'PARIS'
    FROM PARIS_EMPLOYEES WHERE STATUS = 'Y'
  PARTITION P3 WHERE CITY = 'NASHUA' COMPOSE AS
    SELECT ID, FULL_NAME, ADDRESS, LOCATION
    FROM NH_EMPLOYEES WHERE STATUS = 'Y';

Figure 10 Example of a Partitioned View
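Compile-time partition elimination for the MUNICH/NASHUA query above can be sketched as a filter over the partitions' defining literals; the dictionary below is a hypothetical stand-in for DBI's catalog metadata, not its actual data structures:

```python
# Each partition is described by its partitioning-column literal, mirroring
# the WHERE clauses in the ALL_EMPLOYEES view definition.
partitions = {"P1": "MUNICH", "P2": "PARIS", "P3": "NASHUA"}

def eliminate(partitions, requested_cities):
    """Keep only partitions whose literal can satisfy the query predicate;
    the rest are dropped from the execution plan at compile time."""
    return {name: city for name, city in partitions.items()
            if city in requested_cities}

# WHERE (CITY = 'MUNICH') OR (CITY = 'NASHUA')
plan = eliminate(partitions, {"MUNICH", "NASHUA"})
print(sorted(plan))   # → ['P1', 'P3']: P2 is never scanned
```

Run-time elimination works the same way, except the requested set is not known until the run-time parameters are bound.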

transfer bandwidth between the DBI engine and the gateway drivers is further increased through the use of asynchronous SDI operations.

Figure 11 High Availability with Partitioned Views (partitions P1: MUNICH, P2: PARIS, P3: NASHUA over multiple physical databases)

Memory Management  An MDBMS needs to be able to process large amounts of data efficiently without exceeding platform- or user-specific operational quotas such as the file size or the working set limit. In addition, standard operating system paging techniques may easily result in heavy I/O thrashing for database-centric workloads.

The DBI executor places data streams, intermediate query results, or hash buckets into individual workspaces. A workspace is organized as a linear sequence of fixed-size pages. A standard page-table mechanism identifies the allocated pages and records status such as whether a page is present in memory or whether it is paged out to secondary storage. The workspace manager operates as an intelligent buffer manager and paging system that controls fair access to memory across all active workspaces of a given DBI user. A buffer pool manager holds the workspace pages that reside in memory.

The buffer pool manager supports multiple buffer replacement policies, which is important for database workloads that involve sequential access to data that is subsequently no longer needed. The two supported strategies are least recently used (LRU) and most recently used (MRU).16 Finally, the workspace manager supports write-behind for newly allocated pages. This allows newly allocated pages that have been filled to be written asynchronously.

Join Processing  Highly efficient processing of joins and unions is important in any commercial database; it is crucial for a multidatabase system. DBI supports nested loop join, index join, and hash join. In fact, DBI supports both a regular hash-join mechanism and a hybrid, hash-partitioned variant that is augmented with Bloom filtering.17,18,19

For both hash-join variants, the inner table rows are read asynchronously into a DBI workspace. This first pass is used to estimate whether or not to use the hash-partitioned variant. An exact estimate for the number of partitions to use is well worth the overhead of this initial pass.20 In addition, a Bloom filter of 64 kilobits is populated as part of this pass. The inner table cardinality, an estimate for the outer table cardinality, and an estimate of the presently available memory are used to determine whether the simple hash-join technique is sufficient, or whether the use of the hybrid hash-partitioned join technique is warranted.

In general, hash-partitioned join processing is indicated when the inner table and its hash-table buckets do not fit in memory. In this case, both the build phase for the inner-table hash buckets as well as the probe phase of outer-table tuples against the inner-table hash buckets may incur massive amounts of random I/O. When the hash-partitioned variant is selected, the following steps are performed.
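The flow just described, a sizing pass that also builds a filter on the inner join keys, followed by partitioned build and probe phases, can be sketched as follows. The partition count, the exact set used in place of a real Bloom filter, and the row shapes are illustrative assumptions, not DBI internals:

```python
from collections import defaultdict

def hybrid_hash_join(inner, outer, n_partitions=4):
    """Sketch of a hash-partitioned join: both inputs are split on the
    join key, then each partition pair is built and probed in turn.
    Rows are (key, payload) pairs; a real engine spills partitions to disk."""
    part = lambda key: hash(key) % n_partitions

    inner_parts = defaultdict(list)
    inner_keys = set()                       # exact-set stand-in for the Bloom filter
    for row in inner:
        inner_parts[part(row[0])].append(row)
        inner_keys.add(row[0])

    outer_parts = defaultdict(list)
    for row in outer:
        if row[0] in inner_keys:             # filter before partitioning I/O
            outer_parts[part(row[0])].append(row)

    result = []
    for p in range(n_partitions):            # join partition pair p
        table = defaultdict(list)            # build phase: inner hash table
        for key, payload in inner_parts[p]:
            table[key].append(payload)
        for key, payload in outer_parts[p]:  # probe phase
            for inner_payload in table[key]:
                result.append((key, inner_payload, payload))
    return result

rows = hybrid_hash_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")])
print(rows)   # → [(2, 'b', 'x')]
```

The filter step is where the Bloom filter pays off: outer rows with no possible inner match are dropped before they are ever written into an outer partition.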

Each partition receives a separate workspace. A Bloom filter is built on the join column of inner-table tuples and is applied when the outer table rows are partitioned. This results in a potentially massive reduction of the number of rows that are placed into the outer partitions, thus eliminating expensive I/O operations.

The workspaces that hold the inner-table partition 1 and the hash-table buckets for that partition are retained in memory so that the join operation on the first partition pair can proceed immediately.

The workspaces that hold the remaining inner-table partitions 2 through (n) are aged MRU; these pages become immediately available for buffer replacement selection once they have been filled and their frames unpinned.

Once the partitioning phase is complete, each pair of inner and outer partitions is joined starting with partition pair 1. The inner partitions are aged LRU, and the outer partitions are aged MRU to keep the inner partition tuples in memory.

The monitor process supports on-line system management of the server configuration. One or more dispatcher processes manage all client communications context. Dispatchers route client messages to an appropriate DBI executor process through high-speed shared memory queues. Figure 12 shows a typical DBI server configuration.

Server Infrastructure  In the DBI server, a client logically connects to a service object that provides access to a specific DBI database. A service is instantiated by a pool of DBI executor processes that contain the DBI image. The number of processes in the pool is configurable, both off-line and on-line. This allows the administrator to match the throughput requirements for a given DBI database with the appropriate number of executor processes.

Multithreading  DBI executor processes may presently be configured as session-reusable or transaction-reusable.
Session-reusable means that a client is bound to an executor process for the duration of the entire database session. Transaction-reusable means that multiple clients may share the same executor process; a client is scheduled to a DBI executor for one transaction at a time.

The use of flexible buffer replacement strategies is crucial for good buffer cache behavior.

Parallelism  DBI employs two types of parallelism: pipelined parallelism and independent parallelism.8 With hash-join processing, for instance, the outer table rows are read by separate DBI execution threads from the underlying database. This means that the outer table tuple stream is effectively generated in parallel with the probe phase processing of the hash-join operator on the inner table rows. The outer-table tuple stream is directed into the hash-join probe phase. For UNION processing on partitioned views, the individual input streams to the UNION operator are generated by separate DBI execution threads. The streams are provided in parallel and independently to the UNION operator.

Summary

The DB Integrator product contains many features that enable it to provide open, robust, and high-performance data access. DBI guarantees open data access by supporting de facto and de jure interface standards such as SQL92 and ODBC. Client-server connectivity is available over the DECnet, TCP/IP, and SPX/IPX transports. The MDI/DDI interface allows users to extend the use of DBI to gain access to any number of data sources.

Stored Procedures  Stored procedures provide a critical performance enhancement for client-server processing. They allow the DBA to encapsulate a set of SQL statements plus control logic. The client sends one message containing a stored procedure rather than several messages, each containing one SQL statement. This reduces processing delays that otherwise would be incurred due to network traffic.
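The message-count argument can be made concrete with a toy channel; the procedure name, the statements, and the channel itself are invented for illustration and only count messages:

```python
# Why a stored procedure cuts client-server message traffic: the client
# ships one request naming the procedure instead of one request per
# SQL statement.
class CountingChannel:
    def __init__(self):
        self.messages = 0

    def send(self, request):
        self.messages += 1          # one network round trip per message
        return "ok"

procedures = {"post_order": ["UPDATE stock ...", "INSERT journal ...",
                             "UPDATE account ..."]}

def run_statements(chan, statements):
    for stmt in statements:         # one message per statement
        chan.send(stmt)

def run_procedure(chan, name):
    chan.send(("CALL", name))       # one message; server runs all statements

ad_hoc, proc = CountingChannel(), CountingChannel()
run_statements(ad_hoc, procedures["post_order"])
run_procedure(proc, "post_order")
print(ad_hoc.messages, proc.messages)   # → 3 1
```

On a high-latency link the saved round trips, not server CPU, are usually the dominant win.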

DBI Server Configuration

In a standard DBI configuration, one execution process is created for each client.

DBI provides robust data access by supporting heterogeneous query optimization, location transparency, global consistency, resolution of semantic differences, and security checks. The DBI query optimizer takes cost factors and capabilities into account to determine the optimal plan. A global catalog provides location transparency and operates as an autonomous metadata repository. Global transactions are coordinated through two-phase commit. Highly available horizontal partitioned views support continuous distributed processing in the presence of loss of connectivity. Definitions of views and stored procedures allow the user to hide semantic differences among the underlying databases. Finally, DBI enables security checks without interfering with the access controls specified in the underlying data sources.

DBI offers high-performance data access through advanced query execution algorithms and efficient use of network resources. The query optimizer decomposes a distributed query by using as many features of the underlying database as possible and by employing state-of-the-art techniques such as query unnesting and partition elimination. The DBI query processor is capable of driving index joins and hybrid hash-partitioned joins. All intermediate data is cached and I/O optimized. Connections to remote data sources are established solely on demand. Finally, parallel query execution is supported.

References

2. Information Technology - Database Language SQL, ANSI X3H2-92-154/DBL CBR-002 (New York: American National Standards Institute, 1992).

3. "Middleware: Panacea or Boondoggle?," Strategic Analysis Report (Gartner Group, July 5, 1994).

4. A. Sheth and J. Larson, "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases," ACM Computing Surveys, vol. 22, no. 3 (1990).

5. Digital Data Distributor Handbook (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-HZ65F-TE, 1994).

6. Strategic Data Interface, Version 3.1 (Maynard, Mass.: Digital Equipment Corporation, 1994). This internal DB Integrator specification is not available to external readers.

7. …, Version 6.0 (Maynard, Mass.: Digital Equipment Corporation, 1994).

8. D. DeWitt and J. Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM, vol. 35, no. 6 (1992): 85-98.

9. Digital DSRI Handbook, Version 5.1 (Maynard, Mass.: Digital Equipment Corporation, 1994). This internal document is not available to external readers.

In the future, performance will continue to be an important factor for any data access product as will

support for object-oriented data models. By combining data-integration technologies such as DBI with application-integration standards such as Object Request Brokers, a merger of data integration and application integration will be feasible.

Acknowledgments

The authors would like to recognize everyone who contributed to the DB Integrator product. The DBI engineering team designed, implemented, and delivered the product on schedule. The DBI management team of Steve Serra, Rich Bourdeau, Arlene Lacharite, and Trish Pendleton contributed their commitment to delivering the vision. We would also like to thank the anonymous referees for their invaluable comments on the content and presentation of this paper.

11. Distributed Transaction Processing: The XA Specification, X/Open CAE Specification C193, ISBN 1-872630-24-3 (1992).

12. P. Selinger et al., "Access Path Selection in a Relational Database Management System," Proceedings of the ACM SIGMOD Conference (1979).

13. P. Selinger and M. Adiba, "Access Path Selection in Distributed Database Management Systems," IBM Research Report.

15. U. Dayal, "Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates and Quantifiers," Proceedings of the 13th Conference on Very Large Data Bases (VLDB), Brighton (1987).

16. M. Stonebraker, "Operating System Support for Database Management Systems," Communications of the ACM, vol. 24, no. 7 (1981): 412.

17. G. Graefe, "Query Evaluation Techniques for Large Databases," ACM Computing Surveys, vol. 25, no. 2 (1993).

18. B. H. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," Communications of the ACM, vol. 13, no. 7 (1970): 422-426.

19. M. Ramakrishna, "Practical Performance of Bloom Filters and Parallel Free-text Searching," Communications of the ACM, vol. 32, no. 10 (1989): 1237.

20. S. Christodoulakis, "Estimating Block Transfers and Join Sizes," Proceedings of the ACM SIGMOD Conference (1983).

Biographies

Michael Gagnon
Mike Gagnon joined Digital in 1981 and worked on the design and development of Digital's transaction processing and database systems. Mike contributed to the development of ACMS, Digital's transaction processing monitor for VMS systems, and more recently he contributed to the development of a distributed heterogeneous database system. When that system was refocused as the DB Integrator product, Mike led the team that produced the execution engine for all relational processing. Mike assumed project leadership responsibility for DBI version 1.0 and led the project through version 3.1. Mike is currently employed by Iris Associates, a subsidiary of Lotus Software.

Richard Pledereder
Formerly a consulting software engineer in Digital's Software Products Group, Richard Pledereder was the system architect on the DB Integrator product family and contributed to the architecture and implementation of common DBI and Rdb features such as SQL stored procedures. Richard also initiated the architecture, design, and development effort of a multithreaded database server environment, which is now part of the DBI/OSF and Rdb/OSF products. He is now a software architect in the Distributed Products Group at Sybase, Inc. He received a B.S. and an M.S. in computer science from the Technical University of Munich, Bavaria. Richard also collects tapes of operas by the Bavarian composer Richard Wagner.

Mayank Vadodaria
Formerly a principal software engineer in Digital's Database Integration and Interoperability Group, Mayank Vadodaria was the technical group leader for the query processing components of Digital's DB Integrator product family. He was also responsible for Digital's SQL development environment products. He has been instrumental in the design of many key features in the compilation and query optimization within DBI. Mayank holds a B.Tech. from the Indian Institute of Technology, Madras, and an M.S. in computer science from the University of Illinois at Urbana-Champaign. He is currently with Gupta Technologies.

Vishu Krishnamurthy
Vishu Krishnamurthy is a principal engineer in Digital's Database Integration and Interoperability Group, where he is currently the project leader for the DB Integrator product. Vishu was the technical leader for the metadata and catalog management components of DBI. Since joining Digital in 1988, he has held senior development positions in the Distributed Compiler Group, in the RdbStar project, and in the DEC Data Distributor project. Vishu holds a B.E. (honors) in mechanical engineering from the University of Madras and M.S. degrees in computer and information sciences and in mechanical engineering (robotics) from the University of Florida.

Robert K. Baafi
J. Ian Carrie
William B. Drury
Oren L. Wiesler

ACMSxp Open Distributed Transaction Processing

Digital's ACMSxp portable transaction processing (TP) monitor supports open TP standards and provides an environment for the development, execution, and administration of robust, distributed, client-server applications. The ACMSxp TP monitor supports the Structured Transaction Definition Language, a modular language that simplifies the development of transactional applications. ACMSxp software is layered on the Open Software Foundation's Distributed Computing Environment (DCE) and supports XA-compliant databases and other resource managers by using the Encina toolkit from Transarc Corporation or Digital's distributed transaction manager (DECdtm) software. As a framework for DCE-based applications, the ACMSxp TP monitor simplifies application development, integrates system administration, and provides the additional capabilities of high availability, high performance, fault tolerance, and data integrity.

Transaction processing (TP) is a style of computing that guarantees robustness and high availability for critical business applications. TP typically involves a large number of users using display devices to issue similar and repetitive requests. The requests result in the accessing and updating of one or more databases to reflect the current state of the business.

The basic building block in a TP system is a transaction. A transaction is an indivisible unit of work that represents the fundamental construct of recovery, consistency, and concurrency. Each transaction has the properties of atomicity, consistency, isolation, and durability (ACID). These properties are defined as follows:

Atomicity. Either all the actions of a transaction succeed or all fail. In case of failure, the actions are rolled back.

Consistency. After a transaction executes, it must either leave the system in a correct state or abort and return the system to its initial state.

Isolation. The actions carried out by a transaction against a shared database cannot become visible to other transactions until the transaction commits.

Durability. The effects of a committed transaction are permanent.

A TP monitor manages and coordinates the flow of transactions through the system. Transaction requests typically originate from clients, are processed by one or more servers, and end at the originating client. When a transaction ends, the TP monitor ensures that all systems involved in the transaction are left in a consistent state.

The development of powerful desktop systems and advances in communications technology have fueled the growth of distributed client-server computing. The systems in a distributed environment may run different operating systems, possibly from different vendors. Business-critical applications may run under the control of different TP monitors. To coordinate their activities, TP monitors on heterogeneous systems need to conform to standards for open transaction processing.

Open standards for transaction processing have been adopted by the International Organization for

Standardization/Open Systems Interconnection (ISO/OSI), the X/Open initiative, and the Service Providers' Integrated Requirements for Information Technology (SPIRIT) consortium. The X/Open initiative is a consortium of vendors whose purpose is to define standards for application portability. SPIRIT is a consortium of telecommunications service providers from the U.S., Europe, and Japan working under the general sponsorship of the Network Management Forum (NMF). The goal of the NMF's SPIRIT consortium is to define standards for portability and interoperability across heterogeneous systems to be used as the basis for procurement.

The main standards for open transaction processing are:

X/Open distributed transaction processing (DTP), which is an architecture that allows multiple programs to share resources (e.g., databases and files) provided by multiple resource managers and allows their work to be coordinated. The architecture defines application programming interfaces and interactions among transactional applications, transaction managers, resource managers, and communications resource managers. The transaction manager and the resource manager communicate by means of the XA interface.

X/Open transactional remote procedure call (TxRPC), which allows an application to invoke local and remote resource managers as if they were all local. TxRPC also allows an application to be decomposed into client and server components on different computers interconnected by means of remote procedure calls (RPCs).

SPIRIT Structured Transaction Definition Language (STDL), which is a block-structured language for transaction processing. STDL provides transactional features including demarcation of transaction boundaries, transaction recovery, exception handling, transactional communications, access to data queues, submission of queued work requests, and invocation of presentation services.

Digital's Application Control and Management System/Cross-platform (ACMSxp) software product is a portable TP monitor that supports the open TP standards. It provides an environment for the development, execution, and administration of STDL applications.

Application Development

ACMSxp applications are written using a combination of the STDL and traditional languages such as C and COBOL. STDL is a modular, block-structured language developed specifically for transaction processing. It is based on the ACMS Task Definition Language (TDL) and was developed as part of Nippon Telegraph and Telephone's (NTT's) Multivendor Integration Architecture (MIA) initiative. The NMF's SPIRIT consortium subsequently adopted STDL.

STDL Language Overview

STDL provides transactional features including transaction demarcation, transactional remote procedure call, transactional task and data record queuing, transactional display management, transactional exception handling, and transactional working storage called workspaces.

STDL divides an application into three parts: presentation, transaction flow control, and processing, as illustrated in Figure 1. The presentation part interfaces with display devices using a presentation manager, such as Motif, Windows, or forms manager software. The transaction flow control part is written in STDL and controls the flow of execution, including transaction demarcation, exception handling, and access to queues. The processing part is written in traditional languages, such as C, COBOL, and SQL, and provides computation and access to resource managers such as databases and files.

The application functions in the three parts of the STDL application model are referred to respectively as presentation procedures, tasks, and processing procedures. The application functions are packaged into groups for the purposes of compilation and execution. The groups are referred to as presentation groups, task groups, and processing groups.

A group specification describes the functions in the group and their interfaces. The interface specification includes the arguments that are passed to the function and an indication of whether an argument is input only, output only, or both input and output. For a task, the interface specification also indicates whether the task begins a new transaction (NONCOMPOSABLE) or joins the caller's transaction (COMPOSABLE). STDL variables are defined in constructs called workspaces.
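The COMPOSABLE/NONCOMPOSABLE distinction can be sketched as a rule for picking a transaction identifier; the counter-based ids below are illustrative, not ACMSxp's transaction manager:

```python
# A NONCOMPOSABLE task demarcates a new transaction; a COMPOSABLE task
# joins its caller's transaction instead of starting its own.
import itertools

_txn_counter = itertools.count(1)

def call_task(composable, caller_txn=None):
    """Return the transaction id the invoked task runs under."""
    if composable and caller_txn is not None:
        return caller_txn            # COMPOSABLE: join the caller's transaction
    return next(_txn_counter)        # NONCOMPOSABLE: begin a new transaction

root = call_task(composable=False)           # new transaction
child = call_task(composable=True, caller_txn=root)
print(root == child)   # → True: the child joined the caller's transaction
```

The practical consequence is that a failure inside a COMPOSABLE task aborts the caller's whole transaction, while a NONCOMPOSABLE task's outcome is independent.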

Figure 1 STDL Application Model

STDL supports two types of queues: record and task. Record queues provide a transactional, durable scratch pad facility for applications to store and retrieve intermediate results. Task queues provide a way of executing tasks independently of the currently executing task in both time and location. Storage of the task queue element on the task queue may or may not be conditional on the outcome of the currently executing task.

Sample STDL Application

Figure 2 shows a sample STDL application program. The sample program accepts an integer, increments it, and displays it. In addition, shared workspaces are defined in the task group to track the number of successful executions (successes) and the number of failed executions (failures). These operations all take place within the context of a transaction defined by task add1. If the transaction succeeds, the program increments number and the shared workspace successes. If the transaction fails, the program restores number to its initial state and invokes the exception handler. The exception handler then updates the shared workspace failures.

STDL Compiler

The STDL compiler simplifies the process of developing distributed client-server applications. It generates

RECORD arg1
    number INTEGER;
END RECORD;

TASK GROUP example1
    TASK add1 ARGUMENT IS arg1 PASSED AS INOUT;
END TASK GROUP;

TASK add1
    ARGUMENT IS arg1 PASSED AS INOUT;
    WORKSPACES ARE successes SHARED UPDATE RECOVERABLE,
                   failures SHARED UPDATE,
                   arg1 PRIVATE RECOVERABLE;
    BLOCK WITH TRANSACTION
        PROCESSING
            COMPUTE successes = successes + 1
        PROCESSING
            COMPUTE number = number + 1
        EXCHANGE
            SEND RECORD number TO inscreen
    END BLOCK
    EXCEPTION HANDLER IS
        PROCESSING
            COMPUTE failures = failures + 1
    END EXCEPTION HANDLER;
END TASK;

Figure 2 Sample STDL Application

all the code necessary for supporting the application in the distributed environment, including server initialization, namespace registration, namespace lookup, and application context propagation. This allows the application programmer to focus on the application problem at hand.

The ACMSxp STDL compiler translates STDL specifications into executable code. The compiler itself is written in the ANSI C programming language using POSIX 1003.1 library interfaces for platform portability; the generated code consists of only ACMSxp run-time service calls and DCE service calls.12 To the application programmer, the ACMSxp STDL compiler looks much like a classical compiler. The STDL compiler reads source code, converts it to object code, and then links it to create an executable program. Figure 3 shows the elements written by the application programmer and the transformations required to create an executable program.

Internally, the STDL compiler consists of a series of steps that run under the control of a driver program. This processing takes place in the steps shown inside the dashed-line box of Figure 3. The STDL driver first reads STDL specifications in one pass and constructs internal structures that represent each STDL entity in the source file. Once an entity has been completely parsed and the syntax has been checked for errors, the driver generates intermediate files by translating:

STDL groups into ACMSxp client and server stubs and a DCE RPC Interface Definition Language (IDL) file

STDL tasks into C code and ACMSxp run-time service calls

STDL record definitions into C structures contained in C header files or COBOL copy files

After the STDL driver has generated all the intermediate files, it invokes the appropriate language processor to convert the files into object files. The DCE IDL compiler processes the IDL files, and the C compiler processes the tasks and the ACMSxp stubs. To keep the number of pieces visible to the application programmer within reason, the ACMSxp client and server stubs are combined with the DCE client and server stubs. The result is a collection of object files similar to those found in a conventional DCE application. The platform linker then combines the resulting files into an executable program.

The ACMSxp client and server stubs are similar in concept to the DCE RPC client and server stubs. The client stub is linked with other applications that invoke

Figure 3: STDL Compiler Flows (diagram: STDL source files and C or COBOL source files pass through the STDL compiler phases and the C or COBOL compiler to produce object modules, which the linker combines with library modules into an executable program)

this group's tasks or procedures. The server stub is combined with application code to create the application server image. The ACMSxp stubs call ACMSxp run-time services to add to the base DCE RPC services features such as transactions, failover and failback, and time-out.

Execution Environment

The ACMSxp run-time system provides an environment for executing and invoking STDL applications. It also provides services that allow components in the execution environment to be managed. The execution environment provides many services typically needed in TP environments, such as resource scheduling, fault tolerance, and queuing.

Process Model

The ACMSxp environment consists of client and server components. A TPsystem comprises multiple server components on a node that are managed as a unit. A given TPsystem has a globally unique name and is associated with only one node, but a node can have multiple TPsystems associated with it. A TPsystem contains a central process called the TPcontroller, which controls the components within the TPsystem. The processes in the execution environment are illustrated in Figure 4.

As the central point of control for the components within a TPsystem, the TPcontroller performs many functions, including license checking, starting and stopping servers, and monitoring server processes and restarting them when they terminate abnormally. It also receives administration requests and performs the requested operations, maintains information in shared memory for communication with server processes, and maintains key files for servers.

A task server executes STDL task group code and uses a pool of threads to achieve concurrent execution (multithreaded). A processing server executes STDL processing group code and uses a pool of single-threaded processes to achieve concurrent execution (multiprocess).

System servers provide specific run-time services to the TPcontroller, task servers, and processing servers. The system servers include the event log server, the request queue server, and the record queue server. System servers are multithreaded.

Client processes invoke services provided by a TPsystem and its servers. An administration client (also referred to as the director) invokes administration services provided by the TPcontroller and system servers. An application client invokes application services provided by task servers. An application client can be a customer-written client or an ACMS Desktop client. A customer-written client can consist of code necessary to support a forms manager or device control such as an automatic teller machine or a gas pump. An ACMS Desktop client allows popular desktop systems such as the Macintosh, SCO's UNIX, Microsoft Windows, and Windows NT operating systems to be used to access services provided by ACMSxp application servers.

Run-time Services

The ACMSxp run-time system provides services required for the execution of client-server TP applications. The run-time services are highly modular and are layered on the services provided by the underlying transaction manager, DCE, operating system, network, and other services, as shown in Figure 5. The run-time services integrate the services of the underlying platform and provide additional functionality. They export an application programming interface (API) called the transaction processing service
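The TPcontroller's monitor-and-restart duty can be sketched with the toy policy below. This is an illustration only; the class, the process handles, and the method names are invented and are not the ACMSxp interfaces.

```python
# Invented sketch: a TPcontroller-style policy that keeps a specified number
# of server processes running, starting a replacement whenever a monitored
# process terminates abnormally.

class ServerPool:
    def __init__(self, required):
        self.required = required
        self.running = {"proc-%d" % i for i in range(required)}
        self.restarts = 0

    def report_exit(self, proc, abnormal):
        """Invoked by the monitor when a server process terminates."""
        self.running.discard(proc)
        if abnormal and len(self.running) < self.required:
            self.restarts += 1
            self.running.add("proc-r%d" % self.restarts)  # replacement process
```

A normal, administrator-requested stop would simply shrink the pool, whereas an abnormal termination triggers a restart to preserve the configured process count.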

Figure 4: Processes in the Execution Environment

(Figure 5 diagram: the STDL application program sits above the ACMSxp run-time library, which provides transaction demarcation, communication, queuing, and security services; beneath these are databases and other resource managers, and the transaction manager, DCE, operating system, network, and other services.)

Figure 5: Modular Run-time Architecture

The run-time services include the following:

- Communication between clients and servers using DCE RPC. The supported transports are transmission control protocol/internet protocol (TCP/IP), DECnet OSI, and Fast Local Transport.
- Event posting, which provides services for writing events into a log. Logged events include error, security, status, audit, and trace events.
- Performance monitoring, which provides services for capturing performance measurement data.
- Process management, which provides services for starting and stopping server processes, monitoring server processes for abnormal termination, and restarting new ones to maintain the specified number of processes.
- Context management, which provides services for creating, setting, and propagating thread context. Thread context includes request context, exception context, transaction context, and procedure context.
- Timer alert, which provides services for accumulating CPU time and transaction (elapsed) time.
- Transaction demarcation, which integrates with the Encina toolkit on the OSF/1 platform or the DECdtm software on the OpenVMS platform to provide distributed transaction support.
- Queuing, which provides services for request queuing and record queuing. Request queuing allows task requests to be queued for deferred invocation. Record queuing allows data records to be enqueued and dequeued.
- File management, which provides file management services for COBOL and C programs. It provides thread-based transaction semantics for STDL file access and handles opening and closing of files, file positioning, and file locking.
- Workspace management, which provides services for managing private and shared workspaces. A workspace is an STDL construct and represents an area of memory used for data storage and for arguments passed in a procedure call. A workspace can be recoverable or nonrecoverable.
- Security, which authenticates users and servers and provides access control, based on the DCE security service, for application invocation as well as management operations.

Client-Server Communication

The ACMSxp communications services use OSF's DCE services for locating servers, invoking servers, and ensuring secure communications. The communications services maximize the efficiency of DCE service usage, provide robustness in the event of failure, and add distribution of transaction semantics to DCE RPC communications.

Figure 6 shows the elements and steps involved in the communication between a client and a server. The numeric annotations in the following discussion refer to the numbers that appear in the figure.

The STDL client application calls the server (1). The ACMSxp client stub issues run-time service calls (2) to initialize context blocks and to obtain a binding handle (i.e., server addressing information), and calls the DCE RPC client stub, passing context blocks and application data (3). The DCE RPC client stub marshals data and calls the server (4).

The DCE RPC server stub receives the call, unmarshals data, and calls the ACMSxp server stub (5). The ACMSxp server stub issues run-time service calls (6) to establish local context and to check security authorization, and calls the server application (7). The server application executes and returns the results to the ACMSxp server stub, which propagates any error information.

Transaction Processing Characteristics

The run-time system provides the TP monitor with characteristics such as high availability, load balancing, and high performance. Some of the mechanisms used to achieve these characteristics are discussed below.

Availability. The run-time system provides failover and failback capabilities to enhance the availability of
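The numbered call path can be sketched as layered functions, one per stub. Everything below is invented for illustration (the names, the context fields, the binding string, and the JSON round-trip standing in for RPC marshaling); only the ordering of the seven steps comes from the text.

```python
# Sketch of the Figure 6 call path (all names invented): client application
# -> ACMSxp client stub -> DCE RPC stubs -> ACMSxp server stub -> server
# application, with marshaling simulated by a JSON round-trip.
import json

def server_application(data):                 # step 7: application work
    return data.upper()

def acmsxp_server_stub(ctx, data):            # step 6: local context, security
    if not ctx["authorized"]:
        raise PermissionError("security authorization failed")
    return server_application(data)

def dce_rpc(ctx, data):                       # steps 4-5: marshal/unmarshal
    wire = json.dumps([ctx, data])            # stand-in for real marshaling
    rctx, rdata = json.loads(wire)
    return acmsxp_server_stub(rctx, rdata)

def acmsxp_client_stub(data):                 # steps 2-3: context, binding
    ctx = {"transaction": None, "authorized": True}
    binding = "ncacn_ip_tcp:server[3770]"     # binding handle (invented)
    return dce_rpc(ctx, data)

def client_application(data):                 # step 1: call the server
    return acmsxp_client_stub(data)
```

Results and any error information propagate back up the same chain of returns.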

Figure 6: Client-Server Communication Flow (diagram: the client application and ACMSxp client stub on one side, the ACMSxp server stub and server application on the other, connected through the DCE RPC client and server stubs, with the run-time services available to both)
applications. Failover is the redirection of an RPC to an alternate server if the intended server is not reachable. The target server can be unreachable for many reasons, including loss of connectivity, application failures, and machine failures. Failback is the redirection of calls to the original server when it becomes available. Failover and failback capabilities are supported for task servers but not for processing servers.

The DCE cell directory service (CDS) namespace profile mechanism supports failover and failback. The system administrator configures the primary and alternate servers by placing them in the same namespace profile with different priorities. The server with the lower priority number is the primary server.

Run-time support for failover and failback is implemented in the client stub. Failover is attempted if an RPC fails and the returned error indicates that no work had been done by the called server in the current transaction. Failover is always attempted for a nontransactional RPC but is attempted for a transactional RPC only if this is the first call to the intended server in the transaction.

The failover mechanism is optimized in three ways: by reconnecting, by pinging, and by checking the failed servers table. When a failure is detected, the failover mechanism attempts to reconnect to the server in case the failure was caused by intermittent communications problems. If the reconnect fails, the failover mechanism attempts to find an alternate server. When an alternate server is selected, it is pinged to ensure that it is reachable before being called with application work. If a server cannot be reached, it is recorded in a "failed servers" table and skipped on subsequent failover attempts.

Failback is attempted if the binding found is for an alternate server. Failback to the primary server is attempted even if the binding for the alternate server is good, as long as the failback timer has expired. The failback timer defaults to 300 seconds and can be set by an environment variable.

Load Balancing. The ACMSxp run-time system can achieve load balancing for task servers through the DCE CDS. The DCE CDS group entry contains multiple server entries that provide the same interface. Locating a server by means of a group entry results in the random selection of one server in the entry. A combination of static load balancing and failover can also be implemented using DCE CDS functionality.

Performance. Many parts of the ACMSxp system contain mechanisms that are designed to improve performance. A discussion of some of these mechanisms follows.

The server stub caches server bindings to improve performance. Server bindings are the addressing information that allows a client process to call a server process. Binding caching is a means of retaining the server addressing information for reuse. Reading the binding from the namespace can be time-consuming. For example, a DCE CDS namespace lookup requires a network connection to fetch the data from another process, which may be on a separate node. The cache of server bindings is shared among all the threads in the client process. This sharing provides a second order of performance improvement in that work previously performed on behalf of other threads can improve the performance of all threads by preloading the cache.

The scheduler subcomponent of the communications services allocates and deallocates server processes. It maintains a local namespace (also referred to as the scheduler database) in shared memory to keep track of server process allocation. The use of the local namespace instead of DCE CDS improves the performance of RPC calls between task servers and processing servers, which are required to be in the same TPsystem.

The security service caches access control lists (ACLs) to improve performance. The TPcontroller maintains in shared memory the ACLs for managed objects that the ACMSxp TP monitor accesses at run time (e.g., procedures). The security service caches
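The failover and failback policy can be sketched as follows. The server names, the reachability probe, and the profile representation are invented; only the priority rule, the failed-servers table, the ping-before-use step, and the 300-second default come from the text.

```python
# Invented sketch of the failover/failback policy described above.
FAILBACK_TIMER = 300.0   # seconds; per the text, settable by an
                         # environment variable

def select_server(profile, failed, reachable):
    """profile: list of (priority, server); the lowest number is primary.
    Skips servers in the failed-servers table, pings each candidate, and
    records unreachable ones so later attempts skip them."""
    for _, server in sorted(profile):
        if server in failed:
            continue                 # skip known-failed servers
        if reachable(server):        # "ping" before sending application work
            return server
        failed.add(server)           # record in the failed-servers table
    return None

def should_fail_back(current, primary, now, alternate_since):
    """Fail back to the primary once the failback timer has expired."""
    return current != primary and (now - alternate_since) >= FAILBACK_TIMER
```

A reconnect attempt against the original server would precede `select_server` in a fuller model, matching the three optimizations the text lists.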

each object's ACL into the server process memory when the object is first accessed. The server process refreshes its cache if the entry in shared memory is updated.

System Administration

The distributed TP environment is inherently complex and requires effective system administration. The ACMSxp TP monitor provides the following system administration facilities for configuring, monitoring, and controlling components and resources within the ACMSxp run-time environment:

- Integrated user interface. The director (see the discussion of Figure 7, which follows) provides a consistent user interface for invoking management operations on all managed objects. The command line interface provides features such as command scripts, symbol substitution, session logging, default session parameters, and on-line help.
- Centralized distributed management. A single director can manage multiple TPsystems on local or remote nodes using DCE RPC for communication.
- Extensibility. The object-oriented approach allows the ACMSxp TP monitor to represent managed resources in a consistent manner and to add new objects gracefully.

Management Model

The ACMSxp management model is object oriented and is based on the ISO/OSI standard for network and system management. Figure 7 illustrates the elements of the model.

A director initiates management requests on behalf of the system administrator and serves as the interface between a system administrator and the objects being monitored and controlled.14 A director consists of two parts: the user interface and the management service interface. The user interface interacts with the user and is either command line or graphical. The management service interface interacts with management agents. This interface provides services for creating an association for communication between a director and management agents, for initiating management requests, for returning results to the director, for canceling an outstanding request without waiting for completion, and for terminating an association normally.

The management protocol specifies both the mechanism for communication between a director and management agents and the model of interaction between them. The model specifies how requests and responses are passed between the director and the management agents, the processing of requests that involve wild card object instances, and the buffering of multiple responses to optimize performance. The ACMSxp TP monitor uses DCE RPC for communication between a director and management agents.

A management agent performs operations for a managed object. Each object class has a management agent that performs management operations for instances of that object class. The management agent receives a management request from the director, performs the requested operation, and returns the results.

Management Functions

Management operations that can be performed on managed objects are grouped into the following functional categories, as defined by the OSI management framework:

- Configuration management. Managed objects are instantiated, observed, and controlled. Persistent information about managed objects is stored in a configuration database.
- Fault management. Events generated by system operation are recorded in a log file. The contents of the log can be examined using a variety of search criteria.
- Performance management. Performance metrics are collected as the run-time system executes. The performance data is captured as attributes of managed objects and can be examined using the director.
- Security management. Principals are authenticated using the DCE. Access to system administration operations and application procedures is controlled using an ACL mechanism based on the DCE model.
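The director-to-agent interaction can be sketched as a small dispatch model. The class names, instance table, and verbs below are invented; only the one-agent-per-object-class rule and the wildcard fan-out behavior come from the text.

```python
# Invented sketch: one management agent per object class; the director
# forwards (class, instance, operation) requests and fans a wildcard
# instance out to one response per instance.

class ServerAgent:
    def __init__(self):
        self.instances = {"pay_server": {"state": "enabled"}}

    def handle(self, instance, operation):
        attrs = self.instances[instance]
        if operation == "show":
            return dict(attrs)
        if operation == "disable":
            attrs["state"] = "disabled"
            return dict(attrs)
        raise ValueError("unsupported operation: " + operation)

class Director:
    def __init__(self):
        self.agents = {"server": ServerAgent()}   # one agent per object class

    def request(self, obj_class, instance, operation):
        agent = self.agents[obj_class]
        if instance == "*":                       # wildcard object instances
            return {i: agent.handle(i, operation) for i in list(agent.instances)}
        return agent.handle(instance, operation)
```

In the real system this dispatch would travel over DCE RPC rather than an in-process call, with responses buffered as the text describes.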

Figure 7: Management Model (diagram: a director, through its management interface and the management protocol, communicates with the management agents for the managed objects)

Managed Objects

The resources in the ACMSxp environment that need to be managed are represented as objects. A managed object encapsulates the functionality of a real resource and specifies as visible only those aspects that need to be accessed by the manager. A managed object has the following properties:

- Attributes. Attributes are pieces of information that describe an object and represent internal state variables. Each attribute has a name and a value, which can be examined or modified as a result of a management operation. Examples of attributes are executable file name and processing state (for a server).
- Operations. Operations are activities that the manager can perform on the managed object. Operations allow the manager to examine attributes, modify attributes, and perform actions specific to the object. Examples of operations are create, delete, enable, disable, set, and show.
- Events. Events indicate the occurrence of normal and abnormal conditions. Examples of events are the detection of an error and the arrival at a threshold.
- Behavior. Behavior defines how attributes, operations, and events work together and how they affect the managed resource.

For naming purposes, managed objects are organized into a containment hierarchy. This hierarchy specifies which managed object is contained within another and reflects the containment relationship of all their corresponding managed resources. The top-level object in the structure, referred to as a global object, has a globally unique name. Objects contained within the global object are referred to as local objects and have names that are unique only within the context of their level in the structure. Table 1 describes the managed objects in the ACMSxp system.

Conclusion

The ACMSxp transaction processing monitor employs modular design techniques and a proven transaction processing architecture to provide a truly open, distributed transaction processing system. The STDL application development language, which the ACMSxp TP monitor supports, has been endorsed by an international standards consortium and has been implemented on other vendors' platforms. The layering on both the Open Software Foundation's Distributed Computing Environment software and the Encina toolkit provides a foundation of open distributed processing that has been accepted by the world's largest computer systems providers. The ACMSxp TP monitor provides a comprehensive set of facilities for managing the run-time environment. The object-oriented management approach results in a consistent representation of managed objects, a consistent user interface, a modular implementation, and extensibility.

Table 1: ACMSxp Managed Objects

- TPsystem: A collection of system and application components and resources on a given node that is managed as a unit. A TPsystem is referred to as a global entity because it contains other managed objects and is not contained in any other managed object.
- Server: A managed object that executes procedures. It encapsulates a collection of one or more operating system processes that execute the same program image.
- Process: The basic unit scheduled by the operating system, providing the context within which a program image executes. It represents an operating system process.
- Interface: A set of procedures that is provided by a server. It represents a DCE RPC interface and has a universally unique identifier (UUID) that distinguishes it from other instances.
- Procedure: A structured sequence of instructions executed to achieve a particular result. It represents a DCE RPC operation.
- Queue: A repository for storing an ordered collection of elements. The supported queues include a request queue, which contains submit requests, and a record queue, which contains data records.
- Element: A single entry in a queue.
- Log: A named repository where event records are stored.
- Request session: The occurrence of a request at a particular TPsystem. A request is a series of operations invoked by a client program on behalf of a user and executed by one or more servers.
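The attribute/operation model and the containment naming rule described in the Managed Objects section and Table 1 can be sketched as follows; all object and attribute names here are invented examples.

```python
# Invented sketch of a managed object: attributes examined and modified by
# operations (show/set), and containment naming in which a local object's
# name is qualified by its parent, up to a globally unique global object.

class ManagedObject:
    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent          # None for the top-level (global) object
        self.attributes = attributes

    def full_name(self):
        """Name reflecting the containment hierarchy."""
        if self.parent is None:
            return self.name
        return self.parent.full_name() + "." + self.name

    def show(self, attr):
        return self.attributes[attr]

    def set(self, attr, value):
        self.attributes[attr] = value

tpsystem = ManagedObject("payroll_tps")                        # global object
server = ManagedObject("pay_server", tpsystem, state="enabled")
```

Two servers named `pay_server` in different TPsystems would not collide, because each full name is qualified by its containing global object.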

Acknowledgments

Throughout the course of this project, many people have participated in the design, implementation, and documentation of the product. The authors would like to thank all these people for their dedication and their contributions.

References

1. X/Open Distributed Transaction Processing Reference Model, ISBN 1-872630-16-2 (Reading, U.K.: X/Open Company Ltd., 1991).
2. Information Processing Systems - Open Systems Interconnection - Basic Reference Model - Part 4: Management Framework, ISO/IEC 7498-4:1989 (Geneva: International Organization for Standardization, 1989).
6. X/Open CAE Specification, December 1991, Distributed Transaction Processing: The XA Specification, ISBN 1-872630-24-3, Document No. C193 (Reading, U.K.: X/Open Company Ltd., 1994).
7. P. Bernstein, P. Gyllstrom, and T. Wimberg, "STDL - A Portable Language for Transaction Processing," Proceedings of the Nineteenth International Conference on Very Large Data Bases, Dublin, Ireland, 1993.
8. W. Laing, J. Johnson, and R. Landau, "Transaction Management Support in the VMS Operating System Kernel."
Overview, Technical Requirements, TR550001 (Tokyo, Japan: Nippon Telegraph and Telephone Corporation, 1991).
11. E. Newcomer, "Pioneering Distributed Transaction Management," Bulletin of the Technical Committee on Data Engineering, vol. 17, no. 1 (March 1994).
12. ... (New York: The Institute of Electrical and Electronics Engineers, 1990).
13. M. Sylor, F. Dolan, and D. Shurtleff, "Network Management," Digital Technical Journal, vol. 5, no. 1 (Winter 1993): 117-129.
14. C. Strutt and J. Swist, "Design of the DECmcc Management Director," Digital Technical Journal, vol. 5, no. 1 (Winter 1993): 130-142.

Biographies

... the ACMSxp transaction processing monitor. Bob received a B.S. in electrical engineering from the University of Connecticut in 1971 and an M.S. in information systems from Lehigh University in 1973. He is a member of ACM, Tau Beta Pi, and Eta Kappa Nu.

... product. He is a member of the Transaction Processing Engineering Group. Since joining Digital in 1989, Ian has also contributed to the ACMSxp transaction processing monitor. He worked on the STDL compiler code generator, language run-time support, file support, and Encina transaction manager integration. In earlier work, he was employed by Cullinet Software as a project leader in the IDMS/R database product's Communications Group. Ian holds a B.A. (1980) in computer science/managerial studies from Rice University. He is a member of A

Oren L. Wiesler

Presently, Oren Wiesler is a Factory Integration Manager at PRI Automation, a manufacturer of automation equipment used in semiconductor manufacturing. At Digital, he was a principal software engineer in the Transaction Processing Engineering Group. He led the ACMSxp for OSF/1 AXP version 2.0 effort and the support team for ACMSxp for OpenVMS VAX version 1.0, and contributed to the ACMSxp run-time system. Earlier, Oren worked in a processor hardware group. He received a B.S.E.E. from Worcester Polytechnic Institute in 1984 and holds two patents: one related to dynamic control of simultaneously switching outputs, the other on interleaved addressing.

Norman G. Depledge
William A. Turner
Alexandra Woog

An Open, Distributable, Three-tier Client-Server Architecture with Transaction Semantics

This paper describes a distributable, three-tier client-server architecture for heterogeneous, multivendor environments based on the integration of Digital's ObjectBroker and ACMSxp transaction processing monitor products. ObjectBroker integration software provides the flexibility to decouple the tight association between desktop devices and specific legacy systems. The ACMSxp transaction processing monitor provides the transaction semantics, system management, scalability, and high availability that mission-critical production systems require. Combining these technologies and products in a three-tier architecture provides a strategic direction for the development of new applications and allows for optimal integration of legacy systems. The architecture complies with industry standards, which facilitates vendor independence and ensures the longevity of the solution.

Almost all large global enterprises have developed separate systems to address specific business needs. Frequently, these systems are on disparate platforms from different vendors. Users may have to log in to several systems in order to process a single service request from a customer. To improve customer service and develop new products, new applications must integrate existing environments and must be capable of accessing and integrating data from existing platforms.

End users may be faced with an array of inconsistent and incompatible user interfaces that are difficult to learn to use. This source of inefficiency directly impacts the level and cost of service provided to customers and the time-to-market for new products and services.

An analysis of the above problems leads to some fundamental conclusions about existing business systems in large enterprises. Generally, the in-place applications are mission-critical legacy systems that record transactions performed by the businesses. These systems demand superior transactional integrity and operational reliability.
They serve hundreds to thousands of users and yet provide good response at high levels of performance. Systems designers will not introduce changes to them that would compromise these exacting requirements. Consequently, enterprises do not readily replace their legacy systems but, instead, look for other solutions that integrate them with new systems.

To improve the effectiveness of existing legacy systems, major enterprises are seeking to reengineer the user interface. The goal is to reface the applications with a modern, consistent, easy-to-use interface that directly reflects the users' and customers' needs. The new interface must be fully articulated; that is, it should allow any desktop to access any permitted application, regardless of its location or the platform on which it is running. The solution should allow the composition of new compound business functions by combining existing application transactions from multiple legacy systems and possibly new or downsized applications. The new user interface should accomplish this without disrupting the level of service provided to the users.

All these requirements indicate the need for an intermediate architectural layer that provides for isolation, switching, transaction semantics, composition of function, and location transparency. The resultant architecture has three tiers: the clients, the intermediate layer, and the existing legacy systems and new servers.

Such an architecture is expected to last a considerable number of years. It is, therefore, essential that the architecture be based on modern but stable technologies and be flexible enough to accommodate technology evolution.

The Three-tier Architecture

The three-tier architecture consists of the following separate layers of systems and software:

1. Clients
2. Transactional middleware
3. Systems of record (legacy systems and new systems)

The attributes of the proposed intermediate layer make this three-tier architecture more flexible than traditional two-tier client-server architectures. Tier 1 systems (clients) provide a desktop graphical user interface (GUI) to the end users. These systems have seamless access to a set of abstract transaction ... Object Management Group's (OMG's) Common Object Request Broker Architecture (CORBA) mature and become more widely available, and as tier 3 applications are modified or new ones added. Figure 1 shows the disposition of functions with intertier communications paradigms.

A Standards-based Architecture

Digital implemented the three-tier architecture using standards-based software to offer the highest level of interoperation with systems offered by other standards-compliant vendors. Standards compliance also facilitates the porting of applications across platforms. The standards organizations most relevant to this architecture are:

- International Organization for Standardization (ISO)
- American National Standards Institute (ANSI)
- Open Software Foundation
- Object Management Group
- X/Open Company Limited
- Nippon Telegraph and Telephone's (NTT's) Multivendor Integration Architecture (MIA) and the Network Management Forum's (NMF's) Service Providers' Integrated
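The role of the intermediate layer can be illustrated with a toy composition: tier 2 exposes one abstract operation to tier-1 clients and builds it from transactions on two tier-3 systems of record, applying both changes or neither. The system names, data shapes, and snapshot-based rollback below are all invented for illustration.

```python
# Invented illustration of tier-2 composition with transaction semantics:
# a 'transfer' service built from two legacy systems, rolled back as a unit
# if any step fails.

def transfer(accounts, billing, customer, amount):
    snapshot = (dict(accounts), dict(billing))       # saved for rollback
    try:
        accounts[customer] -= amount
        if accounts[customer] < 0:
            raise ValueError("insufficient funds")
        billing[customer] = billing.get(customer, 0) + amount
    except Exception:
        accounts.clear(); accounts.update(snapshot[0])   # undo both changes
        billing.clear(); billing.update(snapshot[1])
        raise
    return accounts[customer]
```

A tier-1 client sees only the abstract `transfer` operation and never learns which legacy systems, or how many, sit behind it.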

(Figure 1 diagram: tier 1 CORBA-compliant desktop GUI clients; tier 2 transactional middleware with a CORBA-compliant interface; tier 3 new and legacy systems reached through message-based protocols, DCE, or CORBA.)

Figure 1: Disposition of Functions with Intertier Communications Paradigms

Digital Technical Journal tion were de\~elopcdby tlie OMG, a consortium of APPLICATIONS ., ., I ,- inforn~ation systcnis vendors, i~icl~di~igDigital, DISKLESS SUPPORT Hewlctt-Packard, Hyperl>esk, Symbios Logic (for- merly NCR), Object l)csign, and Su~iSoft.~Digital's DISTRIBUTED FILE SERVICE COMA-compliant product, namely ObjectBrokcr integration soh+~-arc,has bccn ported to the industry's leading range of platforms.0 REMOTE PROCEDURE CALLS The 0bjcctR1-oltcrproduct reduces the time and THREADS costs associated ulitli p~.o\tidingaccess to critical busi- ness applications across mu1 tivendor plattkorms. I r OPERATING SYSTEM AND TRANSPORT SERVICES allows legacy applications to be integrated illto hctcro- geneous client-scr\/cr cn\lironments u~ithoutsource Figure 2 code changes. OSF's 1)istt.ibuted Computing En\~ironnicnt Microsoti <;orporation has developed a parallel approach ;IS cvidcnccd in its Object [.inking and Remote procedure calls (1U'Cs) Embedding (01,k) sofi\vare, \\~hicliis focclsed on intc- DCE threads, which is a standardized 11111ltl- grating objects in a desktop cnviron~iient.~Microsoti threading service and Digital arc working to integrate the COlUIA and OLE sofh+/arcinto n combined architecti~recallccl the Distributed time service, \vhicli synchronizes clock Common Objcct Model (COM), which allows tlie full time across globally distributed systems interopcrntion of applications dc\leloped under eithcr Cell directory service (CDS), \\lhicli pro\lidcs constituent architecture. autlicntication, access control, ~ndencryption, and uses a ICerberos-based private key security XIOpen Distributed Transaction Processing model The X/Opcn ciistrib~~tedtransaction processing Global directory service, \vhich provides directory (DTP) committee is dcf ning standards for DTP sys- services between cells using tlic X.500 standard tems that use Hat transactions. 
In Figure 3, the TS 1)istributed file ser\lice, which provides location- interface allo\vs applications to coordinate global tralisparcnt access to files across a nct\\lork transactions via the transaction manager (TIM); the >;A interface connects the TIV to resourcc malingers DC:E has been rapidly adopted as a tcchnolobql for dis- (lb~s),typically relational databases or file s!lstems; tributed systems and is no\\! a\jailablc on a large num- and the XA+ interface con~~ectsthe TiM to commi~ni- ber of vendor platforms, including Digital, IBM, cations resource man;lgcrs (CRMs). The interface Hcwlctt-Packard, Sun, and Microsof?. betwcen an application and a ClWI is specific to the CRM type, of\\~liichthree are defined. CORBA The Comnion Object Req~~estBrolter Architecture is 'I'ransactional remote proccdure call (TxlWC:), a standard spccitication for tlic central co~iimi~nic~~tion \\(hich js derived from the work led by Digital for and integration of sohvare objccts at the cntel.prise the MIA/Sl'IRIT rc~tiotctask in\rocatio~lprotocol Ic\,cl and across enterprises. C01UA and its spccifcn- (disc~~sscdin more dctnil later in this section).
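The division of labor between the TX and XA interfaces can be illustrated with a toy two-phase commit. The Python API below is invented; the real interfaces are C entry points (the TX specification defines calls such as tx_begin and tx_commit, and XA defines resource-manager entry points such as xa_prepare and xa_commit).

```python
# Toy sketch (invented API) of the coordination the X/Open model
# standardizes: the transaction manager drives each resource manager
# through a prepare phase and commits only if every participant votes yes.

class ResourceManager:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):                 # phase 1: vote (xa_prepare analogue)
        self.state = "prepared" if self.can_commit else "rollback-only"
        return self.can_commit

    def commit(self):                  # phase 2 (xa_commit analogue)
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"

def tx_commit(resource_managers):
    """TX-style commit as seen by the application; True on success."""
    if all(rm.prepare() for rm in resource_managers):
        for rm in resource_managers:
            rm.commit()
        return True
    for rm in resource_managers:
        rm.rollback()
    return False
```

The application only demarcates the transaction; which resource managers participate, and how their votes are collected, is entirely the transaction manager's concern.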

[Figure 3: A third-generation language application coordinates global transactions with the transaction manager through the TX interface; the TM connects to resource managers through XA and to communications resource managers through TxRPC, XATMI, or CPI-C.]
Figure 3
X/Open Distributed Transaction Processing Model

- XATMI, which is a non-RPC-based client-server interface that originated with UNIX System Laboratories' transaction processing monitor for the UNIX operating system, namely the Tuxedo product
- Peer-to-peer, by which messages are exchanged between applications. The messages are sent and received in an order based on prior agreement between the implementers of the applications. Peer-to-peer uses the Common Programming Interface for Communications (CPI-C), which is derived from IBM's System Network Architecture (SNA) message-based protocol of the same name.

MIA/SPIRIT

MIA is a software architecture developed by a consortium of five vendors under the sponsorship of NTT: Digital, IBM, Fujitsu, Hitachi, and NEC. MIA adopted existing industry standards and defined standards in areas where none were available. One of the areas most lacking in standards was DTP. NTT requested technology proposals and received responses from all the vendors in the consortium. Digital submitted its Application Control and Management System (ACMS) transaction processing monitor model and was selected to lead the development of the specifications because of ACMS' modern, highly structured model and transaction processing application programming interface (API).

MIA achieves application portability and interoperability across a variety of vendor operating systems and platforms by using standardized APIs as integrative constructs and by using standardized systems interconnection interfaces (SIIs) for communication. Two significant MIA standards that Digital contributed are

- Structured Transaction Definition Language (STDL), which is a high-level programming language suited to transactional client-server programming
- Remote task invocation (RTI), a service definition and protocol for RPCs that operate in a multivendor environment and that use the two-phase commit protocol

As a follow-on to NTT's MIA, the work in the field of transaction processing standards has passed to the SPIRIT consortium, which is managed by the Network Management Forum (NMF). NMF's list of members includes telecommunications service providers, such as AT&T, BT, Deutsche Telekom, ETIS (itself a consortium that represents 27 European postal, telegraph, and telephone administrations), Telecom Italia, and Telefonica; computer vendors, such as Digital, Hewlett-Packard, Fujitsu/ICL, Hitachi, IBM, NEC, Siemens Nixdorf, and Unisys; and software vendors, such as Microsoft and Oracle. The goal of the SPIRIT consortium is to produce a common, agreed-upon set of specifications for a general-purpose computing platform for the telecommunications industry by July 1995. The members' combined annual computing expenditures are estimated to exceed $20 billion.

MIA/SPIRIT standards are working their way into international standards bodies. X/Open and the NMF have extended their collaborative agreement to include the work of SPIRIT in acknowledgment of the difficulties that diverging standards would create. X/Open publishes the SPIRIT documentation alongside its own CAE specifications and guides. Furthermore, after conducting a survey of major transaction processing users, X/Open recently voted to use its fast-tracking process to accelerate progress in the adoption of STDL as an X/Open standard.

Digital delivered a platform that supports STDL in January 1993, IBM offered STDL on the CICS platform in the second quarter of 1993, and Hewlett-Packard has made STDL available on Transarc Corporation's Encina transaction processing monitor. NEC, Hitachi, and Fujitsu have already shipped STDL platforms. Unisys plans to demonstrate a SPIRIT platform with STDL in October 1995.

In July 1994, an interoperability demonstration using STDL was conducted successfully in Tokyo, Japan. The demonstration, which also included RTI, involved systems provided by Hewlett-Packard and Fujitsu on Transarc Corporation's Encina transaction processing monitor, Digital on its Application Control and Management System/Cross-platform (ACMSxp) transaction processing monitor, and IBM on both the MVS/CICS and OS/2 platforms.

Architecture Components

Figure 4 illustrates the overall three-tier client-server architecture. This section discusses the various components.

Tier 1 Desktop Environment

The architecture must provide for the connection of a wide variety of desktop platforms to the server layer, i.e., the tier 2 middleware services. This connection must be accomplished in a secure, extensible, reliable, and location-transparent manner. Standards-based solutions are always desirable and more effective over the multiyear life of an enterprise-wide solution. Digital therefore selected its CORBA-compliant ObjectBroker product. CORBA provides a flexible approach to developing a distributed application by decoupling the client and server portions of the application.

[Figure 4: Tier 1 desktop clients connect to tier 2 processing servers, which execute specific legacy application protocols and maintain user context for tier 3 legacy systems. Key: tier 2 components are multithreaded; tier 3 connections are single threaded.]
Figure 4
Overall Three-tier Client-Server Architecture

CORBA specifies a common set of interfaces that allows client programs to make requests to and receive responses from server programs without direct knowledge of the information source or its location. CORBA defines the ORB as an intermediary between clients and servers that delivers client requests to the appropriate server and returns the server responses to the requesting client. Figure 5 shows how the ORB allows a client application to request a service without knowing where the server is located or how it will fulfill the request.

In the CORBA model, client applications need to know only what requests they can make and how to make the requests; they do not need to be coded with any implementation details about the server. Server programs need to know how to fulfill the requests but not how to return information to the client program. Clients using objects to request a service do not need to know which server will fulfill that request. The server fulfilling the request does not need to know which client initiated the request. The GUI clients can be developed using any tool that provides a call-level interface or an object-oriented interface to CORBA-compliant client services on the specific platform.

[Figure 5: The client directs a request to the ORB; the ORB directs the request to the server and returns the server's response to the client.]
Figure 5
CORBA Client-Server Model

Communications are conducted through RPCs. The RPCs are carried over a network transport, e.g., a transmission control protocol/internet protocol (TCP/IP) or a DECnet transport. The RPC connects with ObjectBroker's ORB, which then reroutes the RPC directly to the selected service instance.

Digital expects future versions of its CORBA-compliant ObjectBroker product to support OSF's DCE and thus provide standards-based directory, security, and RPC services. DCE provides rigorous security services for authenticating users, granting privileges, and controlling access to important networked resources. These services are based on the highly secure Kerberos model, which is the standard security model for many financial institutions and a major reason why they have standardized on DCE. All interaction from tier 1 clients must go through the Kerberos-based DCE security perimeter. Desktop and mobile computer users log in to the DCE cell to gain their credentials for performing their business. DCE authenticates users and grants them the appropriate privileges and controlled access to the authorized business functions. No clear-text passwords are required, even for mobile users who access the middleware layer by means of dial-up lines. Remote or mobile users are able to perform DCE login over a serial line internet protocol (SLIP) connection. Confidentiality is ensured by data encryption.
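The brokering behavior described above can be sketched in miniature. In the following Python sketch, every name is hypothetical — ObjectBroker itself exposes call-level and object-oriented interfaces, not this API — but it shows the essential decoupling: the client names an interface and an operation, the broker locates an implementation, and neither party knows about the other.

```python
# Minimal object-request-broker sketch (illustrative names only,
# not the ObjectBroker API): clients name operations; the broker
# finds a server, relays the request, and returns the response.

class Broker:
    def __init__(self):
        self._servers = {}          # interface name -> implementation

    def register(self, interface, implementation):
        """A server advertises an interface; clients never see this step."""
        self._servers[interface] = implementation

    def request(self, interface, operation, *args):
        """Deliver a client request to whichever server implements it."""
        server = self._servers[interface]         # location transparency
        return getattr(server, operation)(*args)  # relay and return response

class AccountServer:
    """Server side: knows how to fulfill requests, not who asked."""
    def balance(self, account):
        return {"1001": 250.0}.get(account, 0.0)

orb = Broker()
orb.register("Accounts", AccountServer())

# Client side: knows only the interface and operation names.
print(orb.request("Accounts", "balance", "1001"))   # 250.0
```

The design point mirrors the text: moving `AccountServer` to another node would change only what `register` records, never the client's call.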

Tier 2 Middleware Services

The tier 2 middleware of this architecture is founded on the ACMSxp transaction processing monitor. The ACMSxp software product for transactional applications conforms to the X/Open DTP and MIA/SPIRIT standards.

[Figure 6: The ACMSxp three-part model separates client presentation functions, transaction flow control, and data access processing functions; processing functions are written in C, COBOL, and SQL against databases, files, etc.]
Figure 6
ACMSxp Three-part Model


[Figure 7: ACMSxp application components — multithreaded task clients invoke STDL tasks; single-threaded processing servers reach legacy applications through SQL, STDL, and legacy application interfaces. Key: multithreaded vs. single-threaded components.]
Figure 7
ACMSxp Application Components

Application code resides in processing servers. The business functions are written as STDL tasks and can be composed of multiple legacy application transactions. When tier 3 applications support standards-compliant TxRPC, transactions can be called directly as tasks in STDL from the tier 2 business functions.

Security into tier 2 is handled by the ObjectBroker software. Within tier 2, security is enforced according to the rules of OSF's DCE. Security between tier 2 and tier 3 is mandated by the rules of each specific legacy system.

To provide operational support for production applications, sophisticated system management features were built into the ACMSxp product. A system management interface is available to any authorized operator on any node in the DCE cell. Through a single director, all ACMSxp objects can be managed in multiple transaction processing systems on all nodes in the network. The managed objects include transaction processing systems, event logs, request sessions, servers, processes, interfaces, and procedures. For example, system managers can examine and change the properties and execution state of servers. The number of instances of a given server class can be set and changed dynamically without stopping the system. ACMSxp system management can be induced to adopt servers that are normally external to its domain, such as the ObjectBroker method servers that provide the connection between the desktop clients and the transactional task servers in the ACMSxp product.

Tier 3 Legacy Application Interfaces

Intercommunication issues related to the differences between hardware and software architectures on disparate platforms are addressed by technologies such as DCE. DCE supports remote procedure calls that allow programs on different platforms to interoperate by means of simple call statements with fully typed arguments. Data type differences between hardware architectures are bridged by the marshaling process that converts data to a canonical form and then to the target form as a normal process. Message-based protocols, such as LU 6.2, cannot adequately deal with mixed data types and place a burden on the application programmer in a multivendor environment.

The advent of reduced instruction set computer (RISC) architectures has exacerbated these problems. Gaps are frequently left in memory between variables in structures and records that contain mixed data types. These gaps in buffers, when processed by compilers on RISC machines, render the buffers unmappable unless redundant filler variables are added to the structure definitions.

Each legacy application method is encapsulated in an ACMSxp server class that is invoked transactionally by a simple STDL call. Thus, the developer of the STDL transactional business functions is shielded from the complexities of the native interface to the legacy data. This approach permits future update of the method without affecting the existing business functions.

The designer must select the most appropriate communications protocol for each tier 3 legacy system. Whenever possible, an application interface should be selected that avoids the so-called "screen scraping" techniques, in which the application emulates a user interacting with existing terminal screen forms.

For IBM mainframe systems, the SNA LU 6.2 protocol with Syncpoint Level 1 or 2 is often appropriate for interoperating with IBM transaction processing environments. This protocol may also be the appropriate choice for legacy systems from other vendors. If the application message protocol is designed in a manner that simulates a simple procedure call, future migration to an RPC model will be simplified.
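Both halves of the data-representation problem described above — canonical marshaling and compiler-inserted alignment gaps — are easy to demonstrate. The sketch below is illustrative only (DCE RPC uses its own network data representation, not Python's struct module): it packs a mixed-type record into a fixed big-endian wire form, then shows the filler bytes a native layout inserts between a character and an integer.

```python
import struct

# Canonical marshaling: pack a mixed-type record into a fixed
# big-endian wire form, then unpack it on the "target" side.
record = (7, 2.5)                       # (int, double) mixed-type record
wire = struct.pack(">id", *record)      # ">" = canonical big-endian, no gaps
assert struct.unpack(">id", wire) == record

# Alignment gaps: a char followed by an int in native layout leaves
# padding between the fields; a packed (standard) layout does not.
native = struct.calcsize("@bi")         # native sizes and alignment
packed = struct.calcsize("=bi")         # 5: 1-byte char + 4-byte int
print(native - packed)                  # bytes of filler the compiler added
```

The difference printed at the end is exactly the "redundant filler" the text says must otherwise be written into the structure definitions by hand.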

Recently, IBM has made DCE available on MVS OpenEdition and has provided application support for both the CICS and Information Management System (IMS) transaction processing environments. This feature allows DCE client programs to invoke transactions on the IBM mainframe by way of a DCE application server provided by IBM. An appropriate DCE client could be included in a data access processing procedure of an ACMSxp processing server as an alternative to SNA LU 6.2.

It should be noted that the desired throughput level for a given legacy system connection can be adjusted dynamically. An operator can use the system management of the ACMSxp transaction processing monitor to reset the number of active servers in the pool that implements that connection. Also, any number of tier 2 nodes can be configured to provide that service within the middleware layer. New nodes can be placed in service without interrupting currently running nodes.

A three-tier, object-oriented client-server architecture that includes an open systems transaction processing monitor can provide a basis for connecting users and customers to existing enterprise transaction processing systems by means of reengineered desktop systems that support GUIs. This approach provides

- A clear separation of function, i.e., client activities are separate from middleware control and management functions
- Data location transparency
- Location transparency for application interfaces and topological independence
- A means of defining new business functions by compounding existing transactions on different platforms, regardless of location
- Flexibility to support the continuous evolution of systems without disruption to end users
- Resilience to enhance overall availability
- Unrestricted scaling of the system (through replication of components) for performance adapted to the business growth
- A set of reusable objects available to the tier 1 client

References and Note

1. Open Software Foundation DCE Application Development Guide, Document No. ED-DCEAPDEV1-1092-2 (Cambridge, Mass.: Open Software Foundation, 1993).
2. Common Object Request Broker Architecture Specification, draft 29, revision 1.2 (Framingham, Mass.: Object Management Group, Document No. 93-12-43, December 1993).
3. Multivendor Integration Architecture Technical Requirements, Division 1, Overview (Tokyo: Nippon Telegraph and Telephone Corporation Technology Research Department, Order No. TR550001, 1991).
4. X/Open Consortium Specification, SPIRIT Platform Blueprint (SPIRIT Issue 2.0, Volume 1.0), ISBN 1-85912-059-8 (Reading, U.K.: X/Open Company Ltd., Network Management Forum, Document No. J401, 1994).
5. Symbios Logic is the former NCR Microelectronic Products Division of AT&T Global Information Solutions Company.
6. ObjectBroker Overview and Glossary (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-Q9KJA-TK, 1994).
7. Microsoft Windows & MS-DOS Version 6.0, Chapter 11 (Redmond, Wash.: Microsoft Press, 1993).
8. SPIRIT Platform Blueprint: SPIRIT STDL Language Definition (Reading, U.K.: X/Open Company Ltd., Network Management Forum, Document No. J403, 1994).
9. SPIRIT Platform Blueprint: SPIRIT STDL Environment, Execution and Protocol Mapping Specification (SPIRIT Issue 2.0, Volume 3.0), ISBN 1-85912-064-4 (Reading, U.K.: X/Open Company Ltd., Network Management Forum, Document No. J404, 1994).

Biographies

Norman G. Depledge
As manager of the Transaction Processing Design Consulting Group within Digital's layered software organization, Norman Depledge is responsible for managing the technology transfer interface between the TP engineering functions, the field organization, and strategic customers on a worldwide basis. His background in computers spans 33 years. He has held management positions in electrical engineering, software engineering, and marketing. For the past 15 years, he has specialized in on-line transaction processing at Honeywell, Bull, and Digital. Norman has an Honors Degree in electrical engineering from Manchester University, England, and holds three patents in electronic controls.

William A. Turner
William Turner is a consultant in the Transaction Processing Systems Group. He successfully constructed a working model and demonstration of an open, distributable, three-tier client-server architecture with transaction semantics. In previous work, William was a consultant in the Northeast Region and in the New York Production Systems Resource Center. Before joining Digital in 1987, he held positions as a systems manager for Electric Mutual and as a technical support manager for Honeywell Information Systems. William received a B.S. in mathematics from Villanova University in 1966.

Alexandra Woog
As a consultant with Digital's Transaction Processing Systems Group, Alexandra Woog consults with customers on the design and implementation of production systems. In prior work, she was the product manager for Digital's Remote Transaction Router product.

The AlphaServer 8000 Series: High-end Server Platform Development

David M. Fenwick
Denis J. Foley
William B. Gist
Stephen R. VanDoren
Daniel Wissell

The AlphaServer 8400 and the AlphaServer 8200 are Digital's newest high-end server products. Both servers are based on the 300-MHz Alpha 21164 microprocessor and on the AlphaServer 8000-series platform architecture. The AlphaServer 8000 platform development team set aggressive system data bandwidth and memory read latency targets in order to achieve high-performance goals. The low-latency criterion was factored into design decisions made at each of the seven layers of platform development. The combination of industry-leading microprocessor technology and a system platform focused on low latency has resulted in a 12-processor server implementation, the AlphaServer 8400, capable of supercomputer levels of performance.

The new AlphaServer 8000 platform is the foundation for a series of open, Alpha microprocessor-based, high-end server products, beginning with the AlphaServer 8400 and AlphaServer 8200 systems and continuing through at least three generations of products. When combined with the power of the industry-leading 300-megahertz (MHz) Alpha 21164 microprocessor, this innovative server platform offers the outstanding performance and price/performance required in technical and commercial markets. In uniprocessor performance benchmark tests, the AlphaServer 8400/8200 SPECfp92 rating of 512 is 1.4 times the rating of its nearest competitor, the SGI Power Challenge XL product. In multiprocessor benchmark tests of systems with 1 to 12 processors, the AlphaServer 8400 system posts SPECrate levels greater than 3.5 times those of the HP 9000-800 T500 system. In the category of cost for performance, NAS Parallel Class B SP benchmarks show that the AlphaServer 8400 system provides 1.7 times the performance per million dollars of the SGI Power Challenge XL system.

Perhaps most impressive is the AlphaServer 8400 performance on the Linpack n×n benchmark. With a Linpack n×n result of 5 billion floating-point operations per second (5 GFLOPS), a 12-processor AlphaServer 8400 achieves the performance levels of supercomputers such as the NEC SX-3/22 system and the massively parallel Thinking Machines CM-200 system.

There are two keys to the remarkable performance of the AlphaServer 8400 and AlphaServer 8200 systems: the Alpha 21164 microprocessor chip and the AlphaServer 8000 platform architecture. This paper is concerned with the second of these, the AlphaServer 8000 platform architecture. Specifically, it discusses the principal design issues encountered and resolved in the pursuit of the aggressive performance and product goals for the AlphaServer 8000 series. The paper concludes with an evaluation of the success of this platform development based on the performance results of the first AlphaServer 8000-series products, the AlphaServer 8400 and AlphaServer 8200 systems.

AlphaServer 8400 and AlphaServer 8200 Product Goals

The AlphaServer 8000 platform technical requirements were derived from a set of product goals. This set comprised minimum performance goals and a number of specific configuration and expandability requirements developed from Digital's server marketing profiles. The motivations that shaped the list of goals below were many. Support for legacy I/O subsystems and DEC 7000/10000 AXP compatibility, for example, was motivated by the need to provide Digital's customer installed base with a cost-effective upgrade path from 7000-series hardware to AlphaServer 8000-series hardware. The goals for low-cost I/O subsystem, peripheral component interconnect (PCI), and EISA support and for rackmount cabinet support were included to take advantage of emerging industry standards and open systems and their markets. The processor, I/O, and memory capacity goals were driven simply by the competitive state of the server market.

- Provide industry-leading enterprise and open-office server performance.
- Provide twice the overall performance of the DEC 7000/10000 AXP server products.
- Support up to 12 Alpha 21164 processors.
- Support at least 14 gigabytes (GB) of main memory.
- Support multiple I/O port controllers, up to 144 I/O slots.
- Provide a low-cost, integrated I/O subsystem.
- Support new, industry-standard PCI and EISA I/O subsystems.
- Continue to support legacy I/O subsystems, such as XMI and Futurebus+.
- Make centerplane hardware compatible with an industry-standard rackmount cabinet.
- Make centerplane hardware mechanically compatible with the DEC 7000/10000 AXP-series cabinet.

Performance Goals and Memory Read Latency Issues

Although "providing industry-leading performance" and "doubling the performance" of an existing industry-leading server present excellent goals for the development of a new server, it is difficult to design to such nebulous goals. To quantify the actual technical requirements for the new AlphaServer 8000 platform, the design team utilized a performance study of the DEC 7000/10000 AXP systems and conducted a detailed analysis of symmetric multiprocessing (SMP) system operation. As a result of the analyses, aggressive system data bandwidth and memory read latency goals were established, as well as a design philosophy that emphasized low memory read latency in all aspects of the platform development. This section addresses the read latency issues and goals considered by the design team. The 8000 platform development is the focus of the section following.

Read latency is the time it takes a microprocessor to read a piece of data into a register in response to a load instruction. If the data to be read is found in a processor's cache, the read latency will typically be very small. If, however, the data to be read resides in a computer system's main memory, the read latency is typically much larger. In either case, a processor may have to wait the duration of the read latency to make further progress. The smaller the read latency, the less time a processor waits for data and thus the better the processor performs.

Cache memories are typically used to minimize read latency. When caches do not work well, either because data sets are larger than the cache size or as the result of non-locality of reference, a computer system's processor-memory interconnect contributes significantly to the average read latency seen by a processor. The system characteristics that help determine a processor's average read latency are the system's minimum memory read latency and data bandwidth.

A system's minimum memory read latency is the time required for a processor to fetch data from a system's main memory, unencumbered by system traffic from other processors and I/O ports. As processors and I/O ports are added to a system, their competition for memory and interconnect resources tends to degrade the system's memory read latency from the minimum memory read latency baseline. A system's data bandwidth, i.e., the amount of data that a system can transfer between main memory and its processors and I/O ports in a given period of time, will determine the extent to which processors and I/O ports will degrade each other's read latency. As data bandwidth increases, so too does a system's ability to support concurrent data references from various processors and I/O ports. This increased bandwidth and concurrent data referencing serve to reduce competition for resources and, as a result, to reduce memory read latency. Thus we can conclude that the more available data bandwidth in a system, the closer the memory read latency will be to the minimum. This conclusion is supported by the results of a queuing model used to support the AlphaServer 8000 platform development. This queuing model, originally implemented to evaluate bus arbitration schemes, outputs the average read latencies experienced by each processor in a system as the number of processors and the number of memory resources are varied.
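The paper does not reproduce the queuing model itself, but the effect it quantifies is easy to illustrate. The toy simulation below uses invented parameters (a 168-ns unloaded latency and a 30-ns bank occupancy) and is not Digital's arbitration model: each processor issues its next read only when the previous one completes, each read occupies a randomly chosen memory bank, and queuing at busy banks stretches latency beyond the unloaded minimum.

```python
import random

def average_read_latency(processors, banks, min_latency=168, service=30,
                         reads_per_cpu=500, seed=1):
    """Toy contention model (invented parameters, not Digital's model):
    more processors -> more bank collisions -> higher average latency;
    more banks (bandwidth) -> latency falls back toward the minimum."""
    rng = random.Random(seed)
    cpu_free = [0.0] * processors   # when each processor can issue again
    bank_free = [0.0] * banks       # when each bank can accept a request
    total = 0.0
    for _ in range(reads_per_cpu):
        for cpu in range(processors):
            issue = cpu_free[cpu]
            b = rng.randrange(banks)
            start = max(issue, bank_free[b])     # queue behind the busy bank
            bank_free[b] = start + service       # bank occupied for one read
            latency = (start - issue) + min_latency
            total += latency
            cpu_free[cpu] = issue + latency      # issue next read on return
    return total / (reads_per_cpu * processors)

# More banks pull 8-processor latency back toward the 168-ns minimum.
for banks in (2, 4, 8):
    print(banks, round(average_read_latency(8, banks)))
```

The trend, not the absolute numbers, is the point: it reproduces both directions reported for Table 1 — latency degrades with processor count and improves with memory banks.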

It is important to note that in this model memory resources, or banks, determine the amount of system bandwidth: the more memory banks, the more bandwidth. It is also important to note that the minimum read latency in this model is 168 nanoseconds (ns). The results of the model are shown in Table 1. These results clearly show that latency degrades as the number of system processors is increased and that latency improves as the system's bandwidth, that is, the number of memory banks, is increased.

Many technical market benchmarks, such as the Linpack benchmarks and the McCalpin Streams benchmark, stress a computer system's data bandwidth capability. The regularity of data reference patterns in these benchmarks allows a high degree of data prefetching. Consequently, data can be streamed into a processor from main memory so that a piece of data has an unnaturally high probability of being resident in the processor's cache when it is needed for some calculation. Ironically, this amounts to using smart software to minimize read latency. By reading a piece of data into a processor's cache before it is actually needed, software presents the processor with a small cache read latency instead of a long memory latency when the data is needed. Thus the streaming techniques applied in these benchmarks allow processors in high-bandwidth systems to stall for a full memory read latency period only when starting up a stream of data. Therefore memory latency can be amortized over lengthy high-bandwidth data streams, minimizing its significance. It is important to note, however, that although bandwidth is the system attribute that dominates performance in these workloads, it dominates performance through its effect on read latency.

Commercial workloads like the Transaction Processing Performance Council's benchmark suite, on the other hand, typically have more complex data patterns that frequently defy attempts to prefetch data. When some of these codes parse data structures, in fact, the address of each data access may depend on the results of the last data access. In any case where a processor must wait for memory read data to make progress, a system's memory read latency will determine the period of time that the processor will be stalled. Such stall periods directly affect the performance of computer systems on commercial workloads. These assertions are supported by a study on the performance of commercial workloads on Digital's Alpha 21064-based DEC 7000/10000 AXP server. It is important to note here that the latency ills flagged in this study cannot be cured with raw system data bandwidth or software-enhanced latency reduction. Low memory latency alone can address the needs of these workloads.

Comparable industry systems from IBM and Hewlett-Packard (HP) do not stress low memory latency system development in their respective RISC System/6000 SMP or Hawks (PA-8000-based) SMP systems. In fact, neither directly acknowledges memory latency as a significant system attribute. This mindset is reflected in the results: Based on IBM's documentation, we estimate the RISC System/6000 SMP's minimum main memory read latency to be in the neighborhood of 600 to 800 ns.

IBM and HP do emphasize system bandwidth in their designs. HP provides a 960-megabyte-per-second (MB/s) "Runway" processor-memory bus in its Hawks system. The actual data bandwidth of this bus is slightly less than the quoted 960 MB/s, since the bus is shared between address and data traffic. IBM, on the other hand, goes to the extent of applying a data crossbar switch in conjunction with a serial address bus to reach an 800-MB/s rate in its RISC System/6000 SMP system. The actual attainable data bandwidth in IBM's system is determined by the bandwidth of its address bus.

In the past, Digital's systems have shown much the same balance of bandwidth and latency as have their competitors. The DEC 7000/10000 AXP system has a minimum main memory read latency of 560 ns and a system data bandwidth of 640 MB/s. The AlphaServer 8000 platform, however, was developed with a marked emphasis on low memory read latency. This emphasis can be seen through nearly all phases of system development, including the system topology, clocking strategy, and protocol. This latency-oriented mindset is reflected in the results: The AlphaServer 8000 platform boasts minimum memory read latencies of 200 ns.
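The amortization argument made above for streaming benchmarks reduces to simple arithmetic. In this sketch the numbers are illustrative only (the 560-ns figure echoes the DEC 7000/10000 AXP latency quoted in the text; the 30-ns transfer time is invented): a prefetched stream pays the full memory latency once, then hides subsequent accesses behind back-to-back transfers.

```python
def effective_latency(memory_latency_ns, transfer_ns, stream_length):
    """Average stall per element when a stream pays the full memory
    latency once and receives the remaining elements at the transfer
    rate (i.e., startup latency amortized over the stream)."""
    total = memory_latency_ns + (stream_length - 1) * transfer_ns
    return total / stream_length

# Illustrative: a 560-ns memory latency amortized over longer streams
# collapses toward the 30-ns transfer time.
for n in (1, 10, 100):
    print(n, effective_latency(560, 30, n))
```

This is why bandwidth dominates such workloads only through its effect on read latency: long streams make the startup latency nearly invisible, while pointer-chasing commercial codes pay it on every access.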

Table 1
Average Read Latency as a Function of the Number of Processors and Bandwidth (Number of Memory Banks)

Average Read Latency (Nanoseconds)
Number of Processors | 2 Memory Banks | 4 Memory Banks | 6 Memory Banks | 8 Memory Banks

Digital Technical Journ.~l 8400 and 8200 systems feature a rnini~n~~nimemory clock technology and methodology, and the signaling read latency of 260 ns. To back up these latencies, the protocols. For example, the TLEP, TMEM, and TIOP AlphaServer 8000 platform supports a tremendous modules all implement bus interfaces in the same inte- 2,100 MB/s of data bandwidth. The AlpliaSer\rer grated circuit (I(:) packages with tlie same silicon tech- 8400 and 8200 systems, although not capable of pro- nology and drive their cornnion interconnect bus with viding the fill1 2,100 MR/s, still provide 1,600 MB/s the same standard bus driver cell. Furthermore, all of bandwidth. This gives the systems less than half the these modules apply nearly identical clocking circuits memory latency of comparable industry systcnis while and communicate by means of a common bus proto- providing nearly twice the bandwidth. F~~rther~nore, col. The ephemeral architectural standards that consti- these attributes improve upon the L>EC 7000/10000 tllte the "pl;~tforrn"specitji exact physical requirements AXI' attributes by factors of 2 to 3. Although difficult for designing the Alphaserver processor-rnernory- to determine exactly how these attributes would trans- 1/0 port interconnect and the various niodules that late into overall system performance, they were will populate it. It is important to note that the kcy to accepted as sufficient to meet tlie AlphaServer 8000 AlpliaSer\~er8000 pel-forniancc is fou~idin these stan- plattbrm performance goals. A co~nparisonof the dards. As \vc esplore tlic design decisions and trade- maximum DEC 7000/10000 AYP SPECrates of offs that shaped the AlphaServer 8000 platform, it approximately 25,000 integer and 40,000 floating is this collection of ,lrcIiitectural standards that \\JC point with the rnaxi~n~~rnAlphaScrvcr 8400 actually probc. 
Throughout this analysis of the AlphaServer 8000 architecture, two themes frequently recur: low memory latency and practical engineering. As discussed in the context of the AlphaServer 8000 goals, low memory read latency was identified as the key to system performance. As such, low latency was factored into nearly every system design decision. Design decisions in general can be thought of as being resolved in one of two ways: by emphasizing Digital's superior silicon technology or by effecting architectural finesse. Use of superior technology is self-explanatory; it involves pushing leading-edge technology to simply overwhelm and eliminate a design issue. Architectural finesse, on the other hand, typically involves a shift in operating mode or configuration that allows a problem to be avoided altogether. Practical engineering is the art of finding a balance between leading-edge technology and architectural finesse that produces the best product.

AlphaServer 8000 Platform Development

Referring to the AlphaServer 8000 platform as a "foundation" for a series of server products does not give a clear picture of what constitutes a server platform. The AlphaServer 8000 platform has both physical and architectural components. The physical component consists of the basic physical structure from which 8000-series server products are built. This includes power systems, thermal management systems, system enclosures, and a centerplane card cage that implements the interconnect between processor, memory, and I/O port modules. The processor, memory, and I/O modules are printed circuit board (PCB) assemblies that can be implemented with varying combinations of processor, dynamic random-access memory (DRAM), and application-specific integrated circuit (ASIC) components.

Layered Platform Development
Platform development typically involves a simple three-layer process: (1) determine a basic system topology, (2) establish the electrical means by which various computer components will transmit signals across the system topology, and (3) apply a signaling protocol to the electrical transmissions to give them meaning and to allow the computer components to communicate. System topology determines how the processor, memory, and I/O components of a computer system are interconnected. Computer interconnects may involve simple buses, multiplexed buses, switches, and multitiered buses. The electrical means for transmitting signals across a computer interconnect may involve bus driver technology, connector technology, and clock technology. Signaling protocols apply names to system interconnect signals and define cycles in which the signals have valid values.

The assemblies are inserted into the platform centerplane card cage in varying configurations and in varying enclosures to create the various 8000-series products. The AlphaServer 8200 system, for example, comprises up to six Alpha 21164-based TLEP processor modules, TMEM memory modules, or ITIOP and TIOP I/O port modules in an industry-standard rack-mount system. The AlphaServer 8400 system comprises up to nine TLEP processor modules, TMEM memory modules, or ITIOP and TIOP I/O port modules in a DEC 7000 AXP-style data center cabinet.

The architectural component of the AlphaServer 8000 platform consists primarily of a collection of technological, topological, and protocol standards. This collection includes the processor-memory interconnect strategy, the bus interface technology, the clock technology and methodology, and the signaling protocols.

46 Digital Technical Journal Vol. 7 No. 1 1995

This naming and definition allows each computer component to understand the transmissions of other components.

As the AlphaServer 8000 platform development progressed, this simple three-layer platform development model was found to be insufficient. Efforts to achieve the low-latency performance goal and the simple product goals uncovered unexpected design issues. The resolution of these design issues led to the creation of a more robust seven-layer platform development model. When certain multi-driver bus signals threatened the cycle time of the AlphaServer 8000 system bus, for example, the system's latency goals were threatened as well. The practical solution to this multi-driver signal problem was the creation of specific signaling conventions for problematic classes of signals. This innovation led to the birth of the Signaling Layer of the development model.

Topological Layer

Server-class computers typically comprise processor, memory, and I/O port components. These components are usually found in the form of PCB modules. A computer system's topology defines how these computer components are interconnected. Computer topologies are many and varied. The IBM RISC System/6000 SMP, for example, links its modules by means of an address bus and a data switch. Its memory modules are grouped into a single memory subsystem with one connection to the address bus and one connection to the data switch. The HP Hawks SMP system, by comparison, links its modules by means of a single bus onto which address and data are multiplexed. The Hawks system also groups its memory into a single memory subsystem with one connection to the multiplexed bus.7 Digital's DEC 7000/10000 AXP also uses a single multiplexed address and data bus.
Unlike the IBM and HP systems, the DEC 7000/10000 AXP system allows its memory to be distributed, with multiple connections to its multiplexed bus.

None of the IBM, HP, or prior Digital systems meet the latency goals of the AlphaServer 8000 platform. Exactly how much system topology contributes to these systems' latencies is unclear. A multiplexed address and data bus certainly creates a system bottleneck and can contribute to latency. Likewise, unified memory subsystems can often have associated overhead that can translate into latency. In addition to performance issues, topologies such as the IBM switch-based system have significant cost issues. If, for example, a customer were to purchase a sparsely configured (two processors, perhaps) IBM system, such a customer would be required to pay for the switch support for up to eight processors. This creates a high system entry cost and a potentially lower incremental cost as functionality is added to the system. In a simple bused system, a customer pays only for what is needed to support the specific functionality required. This creates a more manageable entry cost and a smooth, if slightly steeper, incremental cost.

Similarly, when the integration of PCI I/O into the system was found to conflict with primary protocol elements that were key to low-latency processor-memory communication, the concept of a "superset protocol" was created. This led to the creation of the Superset Protocol Layer of the development model. The seven-layer platform development model is contrasted with the simple three-layer development model in Figure 1.

The analysis of the AlphaServer 8000 platform design presented here traces the key system design decisions through each of the seven layers of the development process. Each layer will be described in greater detail as this analysis proceeds.
From Digital's marketing perspective, this makes a bused system preferable, provided it can satisfy bandwidth and latency requirements.

Uniprocessor computer topologies, an example of which is shown in Figure 2, typically exhibit the lowest memory read latencies of any computer class. As such, this simple uniprocessor topology was chosen as the basis from which to develop the AlphaServer 8000 platform topology. In the uniprocessor model, processor chips communicate with DRAM arrays through separate address and data paths. These paths include address and data interfaces and buses. The AlphaServer 8000 topology was created by adding a second set of interfaces between the address and data buses and the DRAM array, and by connecting additional microprocessors, memory arrays, and I/O ports to the buses by means of similar interfaces.

Figure 1 Comparison of Conventional Three-layer Model with Seven-layer Platform Development Model [The three-layer model comprises the Topological, Electrical Transport, and Protocol Layers; the seven-layer model comprises the Topological, Operational, Electrical Transport, Signaling, Consistency Check, Primary Protocol, and Superset Protocol Layers.]

The most significant advantage is that memory read latency from any processor to any memory array is comparable to the latency of a uniprocessor system. The delay associated with two interfaces, one address interface and one data interface, is all that is added into the path. In addition, the platform's simple bus topology features a low entry cost, a simple growth path (just insert another module), and flexible configuration (just about any module can be placed in any slot).

Figure 2 Simple Uniprocessor System Topology [A microprocessor communicates with a DRAM array over separate address and data buses.]

The resultant topology is shown in Figure 3. This topology features separate address and data buses. These buses together are referred to as the AlphaServer 8000 system bus.

The topology presented in Figure 3 is an abstract. To flesh out this abstract and measure it against specific system goals, signal counts, cycle times, and bus connection (slot) counts must be added. It is in this effort that practical engineering must be applied.

Operational Layer

The Operational Layer is so named for lack of a better descriptor. The layer is actually a place to define a high-level system clocking strategy. This strategy has two key components: definition of target operating frequencies and definition of a design methodology to support operation across all the defined operating frequencies. The design methodology component of this strategy may seem better suited for a higher-order development layer, such as the Protocol Layer. However, because the methodology is logically associated with the system's operating frequency range, and the operating frequency range provides a foundation for the Electrical Transport Layer, it seemed appropriate to include both components of the strategy in the Operational Layer.
In personal computer (PC)-class microprocessor systems, clock rates are typically slow (33 MHz to 66 MHz). Complementary components capable of operating at these speeds are readily available, e.g., transceivers, static random-access memory (SRAM), ASIC, DRAM, and programmable array logic (PAL) components. Therefore entire PC systems are typically run synchronously, i.e., the system logic (typically a motherboard) and the microprocessor run at identical clock speeds. Alpha processors, on the other hand, run at clock rates exceeding 250 MHz. The current state of complementary components makes running system logic at Alpha processor rates impractical if not impossible. Many of these components cannot perform internal functions at a 250-MHz rate, let alone transfers between components.

Digital's DEC 7000/10000 AXP systems solved the problem of Alpha microprocessor and system clock disparity by running both the Alpha microprocessor and the DEC 7000/10000 AXP system hardware at their respective maximum clock rates and synchronizing address and data transfers between the microprocessor and the system.

To achieve the system's bandwidth goal, for example, the data bus could be implemented as a wide bus with a high clock frequency, or it could be replaced with a switch-based data interconnect, like that of the IBM RISC System/6000 SMP. The high-frequency bus presents a significant technological challenge in terms of drivers and clocking. This challenge grows as the number of bus slots grows. The growth of the technological challenge is a significant issue given the system's configuration goals. The switch interconnect, on the other hand, avoids the technological challenges by providing more data paths at lower clock frequencies. The lower clock frequencies, however, can translate directly into additional latency. Given the emphasis placed on memory latency and the advantages associated with simple bused systems, the practical design choice was to adopt a wide, high-frequency data interconnect.
The resultant AlphaServer 8000 system bus features 9 slots, an address bus that supports a 40-bit address space, and a 256-bit (plus error-correcting code [ECC]) data bus. To meet configuration goals, processor modules necessarily support at least two microprocessors per module, memory modules support up to 2 GB of DRAM storage, and I/O port modules support up to 48 PCI slots. To meet performance goals, both buses must operate at a frequency of 100 MHz (10-ns cycle). The AlphaServer 8000 platform topology has a number of advantages.

Each time a transfer was synchronized, however, a synchronization latency penalty was added to the latency of the transfer. In the DEC 7000/10000 AXP system, two synchronization penalties, one for an address transfer to the system and one for a data transfer to the processor, are added to each memory read latency. With multiple data transfers, the data transfer from the system to the processor can be particularly large.

Figure 3 AlphaServer 8000 Multiprocessor System Topology [Processor, memory, and I/O port modules each attach data and address interfaces to the separate data bus and address bus; each memory module contains a DRAM array.]

When combined, the two penalties added nearly 125 ns to the DEC 7000/10000 AXP read latency, or approximately 25 percent of the total 560-ns latency. The same 125 ns, however, could add another 60 percent to the AlphaServer 8000 platform's lower target latency of 200 ns.

Because the operating frequencies of future Alpha processors are difficult to predict, the AlphaServer 8000 platform must be capable of operating across a range of clock frequencies. Specifically, the AlphaServer 8000 platform is capable of operating at clock frequencies between 62.5 MHz (16-ns cycle) and 100 MHz (10-ns cycle).

Operating across a range of frequencies may seem a trivial requirement to meet; if logic were designed to operate at a 10-ns cycle time, it should certainly continue to function electrically at a 16-ns cycle time. The real issues that this frequency range creates, however, are much more subtle. DRAMs, for example, require a periodic refresh. The refresh period for a typical DRAM may be 50 milliseconds (ms). If a system were designed to a 10-ns clock rate, the system would be designed to initiate a DRAM refresh every 5,000,000 cycles. If the system were then slowed to a 16-ns clock rate, it would initiate a DRAM refresh only every 80 ms based on the same 5,000,000 cycles.

Given its latency goals, the AlphaServer 8000 platform implements a clocking methodology that minimizes synchronization penalties and thus minimizes read latency. This methodology involves clocking the entire AlphaServer system, up to the I/O channels, synchronous to the microprocessor in such a way that the Alpha microprocessor operates at a clock frequency that is a direct multiple of the system clock frequency. With a 100-MHz (10-ns cycle) system clock rate, for example, the AlphaServer 8000 could support a 200-MHz (5-ns cycle) Alpha processor using a 2X clock multiplier.
Since the processor must still synchronize with a system clock edge when transferring address and data to the system, synchronization penalties are not eliminated altogether. They can, however, be limited to less than 10 ns, or 5 percent of the AlphaServer 8000 platform's total read latency.

Synchronous clocking by means of clock multiples is not unique and innovative in and of itself. The uniqueness of the AlphaServer 8000 clocking strategy lies in its flexibility, since the AlphaServer 8000 platform must support at least three generations of Alpha processors to satisfy its product goals.

Such a slowed refresh could cause DRAMs to lose state and corrupt system operation. Similarly, DRAMs have a fixed read access time. The AlphaServer 8400/8200 TMEM module, for example, uses 60-ns DRAMs. If the DRAM controller is designed as a 7-cycle controller and clocked at a 10-ns clock rate, it accesses the 60-ns DRAM in 70 ns. If the system were slowed to a 16-ns clock rate, the system would, using the same controller, consume 112 ns in accessing the same 60-ns DRAM.
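The fixed-cycle-count arithmetic above is easy to verify with a short sketch. The 5,000,000-cycle refresh count and the 7-cycle access are the figures quoted in the text; the helper functions are illustrative, not the actual controller logic:

```python
# Illustrative check of the fixed-cycle-count problem described above.
# A controller that counts a fixed number of clock cycles stretches in
# real time when the clock slows from a 10-ns to a 16-ns cycle.

def refresh_interval_ms(cycle_ns, refresh_cycles=5_000_000):
    """Real-time interval between refreshes for a fixed cycle count."""
    return refresh_cycles * cycle_ns / 1_000_000  # convert ns to ms

def access_time_ns(cycle_ns, controller_cycles=7):
    """Real-time DRAM access for a fixed 7-cycle controller."""
    return controller_cycles * cycle_ns

assert refresh_interval_ms(10) == 50.0  # 50-ms refresh period, as designed
assert refresh_interval_ms(16) == 80.0  # slowed clock stretches refresh to 80 ms
assert access_time_ns(10) == 70         # 60-ns DRAM accessed in 70 ns
assert access_time_ns(16) == 112        # same controller consumes 112 ns
```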

This application of a single simple controller over a frequency range directly increases the DRAM's read latency and decreases the DRAM's bandwidth. This non-optimal DRAM performance in turn directly increases the system read latency and decreases the system bandwidth.

To put the AlphaServer 8000 100-MHz system bus goal in perspective, consider the operating frequencies of a number of today's highly competitive microprocessors. The NexGen Nx586 operates at 93 MHz. The Intel Pentium, Cyrix M1, and AMD K5 all operate at 100 MHz. The Intel P6 operates at 133 MHz. In all these microprocessors, this roughly 100-MHz operation takes place on a silicon die less than 1 inch square. To meet its goals, the AlphaServer 8000 system bus must transfer data from an interface on a module in any slot on the system bus to an interface on another module in any other slot on the system bus across a 13-inch-long wire etch, with nine etch stubs and nine connectors, in the same 10 ns in which these microprocessors transfer data across 1-inch dies. By any measure this is a daunting task.

The AlphaServer 8000 platform design addresses the DRAM timing issues by implementing controllers that can be reconfigured based on the system's specific operating frequency. The TMEM module, for example, implements a reconfigurable controller for sequencing the reads and writes of its DRAMs. This controller has three settings: one for cycle times between 10 ns and 11.2 ns, one for cycle times between 11.3 ns and 12.9 ns, and one for cycle times between 13 ns and 16 ns. Each setting accesses the DRAMs in differing numbers of system clock cycles, but all three modes access the DRAMs in approximately the same number of nanoseconds.
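The reconfigurable TMEM sequencing controller can be sketched as follows. The three cycle-time bands are those quoted above; the per-band cycle counts (7, 6, and 5) are assumed values chosen so that the access time in nanoseconds stays roughly constant, since the article does not publish the actual counts:

```python
# Sketch of a reconfigurable DRAM sequencing controller in the spirit of
# the TMEM design. The three cycle-time bands are from the text; the
# cycle counts (7, 6, 5) are assumptions chosen so that access time in
# nanoseconds stays roughly constant across the 10-ns to 16-ns range.

def access_cycles(cycle_ns):
    """Select a DRAM access cycle count for the current bus cycle time."""
    if 10.0 <= cycle_ns <= 11.2:
        return 7
    if 11.3 <= cycle_ns <= 12.9:
        return 6
    if 13.0 <= cycle_ns <= 16.0:
        return 5
    raise ValueError("cycle time outside supported 10-16 ns range")

# Across all three settings, access time in ns stays near the ideal
# 70 ns instead of growing to 112 ns with a fixed 7-cycle controller.
for t in (10.0, 11.2, 11.3, 12.9, 13.0, 16.0):
    assert 60.0 <= access_cycles(t) * t <= 81.0
```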
By allowing flexible reconfiguration, this controller allows the TMEM to keep the DRAM's read latency and bandwidth as close to ideal as possible. Other examples of reconfigurable controllers are the TMEM's refresh timer and the TLEP's cache controller.

It should be noted here that the AlphaServer 8000 operating frequency range and processor-based frequency selection account for the disparities between the AlphaServer 8000 platform's bandwidth capability and the AlphaServer 8400 and 8200 products' bandwidth capabilities. The Alpha 21164 processor is the basis for the 8400 and 8200 products. This 300-MHz (3.33-ns cycle) microprocessor, combined with a 4X clock frequency multiplier, sets the system clock frequency at 75 MHz (13.3-ns cycle). This 13.3-ns cycle time, when applied to the 256-bit data bus, produces the 1,600 MB/s of data bandwidth. The cycle time increases the read latency of the 8400 and the 8200 to some extent as well, but the reconfigurable DRAM controllers help to mitigate this effect.

A breakdown of the elements that determine minimum cycle time aptly demonstrates the significance of clock system design, bus driver design, and bus receiver design in the AlphaServer 8000 system bus development. Minimum bus cycle time is the minimum time required between clock edges during which data is driven from a bus driver cell on one clock edge and is received into a bus receiver cell on the next clock edge. An equation for determining the minimum cycle time is shown below. Tcyc,min is the minimum cycle time. Tdrive is the time, measured from a rising clock edge, that is required for a bus driver to drive a new bus signal level to all system bus receivers. Tsetup is the time a bus receiver needs to process a new bus signal level before the signal can be clocked into the receiver cell. Tskew is the variation between the clock used to clock the bus driver and the clock used to clock the bus receiver.

Tcyc,min = Tdrive + Tsetup + Tskew
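A minimal sketch of this cycle-time budget, using illustrative placeholder values for the three terms (the article does not give the individual figures at this point):

```python
# Minimum bus cycle time as the sum of the three terms defined above.
# The sample values below are illustrative placeholders, not published
# AlphaServer 8000 figures.

def min_cycle_ns(t_drive, t_setup, t_skew):
    """T_cyc,min = T_drive + T_setup + T_skew (all in nanoseconds)."""
    return t_drive + t_setup + t_skew

# For a 100-MHz (10-ns) bus, the three terms together must fit in 10 ns.
budget = min_cycle_ns(t_drive=7.4, t_setup=1.3, t_skew=1.1)
assert budget < 10.0
```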
Tdrive, Tsetup, and Tskew must all be minimized to achieve the lowest possible cycle time. The value of Tskew is determined by the system clock design. The values of Tdrive and Tsetup are determined by the bus driver/receiver cell design.

Electrical Transport Layer

When the bused system topology was selected in the Topological Layer of the AlphaServer 8000 platform development, a practical engineering decision was made to emphasize leading-edge technology as the means to accomplish our performance goals, as opposed to elegant architectural chicanery. It was observed in the topological discussion that, with the selected system topology, bus cycle time was critical to meeting the platform's performance goals. The Electrical Transport Layer of the platform development involved selecting or developing the centerplane, connector, clocking, and silicon interface technology that would allow the AlphaServer 8000 system bus to operate at a 100-MHz clock frequency.

AlphaServer 8000 System Bus Interface

To provide some context for the clock and bus driver/receiver discussions, it is necessary to briefly describe the standard AlphaServer 8000 system bus interface. Each AlphaServer 8000 module implements a standard system bus interface. This interface consists of five ASICs: one interfaces to the AlphaServer 8000 address bus, and four interface to the AlphaServer 8000 data bus. Each ASIC is implemented in Digital's 0.75-micrometer, 3.3-volt (V) complementary metal-oxide semiconductor (CMOS) technology and features up to 100,000 gates. Each ASIC is packaged in a 447-pin interstitial pin grid array (IPGA) and features up to 273 user I/Os.
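The five-ASIC split suggests the following sketch. The even 64-bit-per-ASIC slicing of the 256-bit data bus is an assumption (256/4); the article does not specify how the data path is divided among the four data-bus ASICs:

```python
# Sketch of the five-ASIC bus interface described above: one address-bus
# ASIC and four data-bus ASICs. The even 64-bit-per-ASIC split of the
# 256-bit data bus is an assumption (256 / 4), not a published detail.

DATA_ASICS = 4
DATA_BUS_BITS = 256
SLICE_BITS = DATA_BUS_BITS // DATA_ASICS  # 64 bits per data ASIC

def slice_word(word):
    """Split a 256-bit word into one 64-bit slice per data-bus ASIC."""
    mask = (1 << SLICE_BITS) - 1
    return [(word >> (SLICE_BITS * i)) & mask for i in range(DATA_ASICS)]

def merge_slices(slices):
    """Reassemble the 256-bit word from the four ASIC slices."""
    word = 0
    for i, s in enumerate(slices):
        word |= s << (SLICE_BITS * i)
    return word

w = (1 << 255) | 0xDEADBEEF
assert merge_slices(slice_word(w)) == w  # slices round-trip losslessly
```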
The most innovative of the technological developments that resulted from this effort were the platform's clocking system and its custom bus driver/receiver cell.

Essential to the AlphaServer 8000 development were the speed of the CMOS interface ASIC technology and the development team's ability to influence the ASIC design process.

"Influencing the design process" translated to the ability to develop a standard cell design library and process for and in concert with the development of the AlphaServer 8000 platform. The standard cell library, together with the CMOS silicon technology, provided the AlphaServer 8000 platform's required speed; complex logic functions (5 to 8 levels of complex logic gates) can be performed within a 10-ns cycle. "Influencing the design process" also translated to the ability to design a fully custom bus driver/receiver cell. Thus the development team could create a custom driver/receiver cell tailored to the specific needs of the AlphaServer 8000 system bus.

Clock Technology

The primary goal of the AlphaServer 8000 platform clock distribution system was to maintain a skew (Tskew) as small as possible between any two clocks in the system, while delivering clocks to all clocked system components.

The system clock distribution begins on the system clock module with a single-ended RF oscillator, a constant-impedance bandpass filter, and a nine-way power splitter. The power splitter, by way of the bandpass filter, produces nine spectrally clean, amplitude-reduced copies of the oscillator sine wave. These nine outputs are tightly matched in phase and amplitude. They are distributed to the nine system bus module connectors by means of matched-length, shrouded, controlled-impedance etch. This design provides the modules with low-skew (30 to 40 ps), high-quality (greater than 20-decibel signal-to-noise ratio) clocks.

The RF sine wave clock was an ideal selection for system clock distribution. By eliminating all high-order harmonics, the edge rates and propagation times of the clock wave are fixed and predictable across the distribution network. This predictability eliminates variation in the clock as perceived by the clock receiver on each module, thus minimizing skew. It also greatly reduces constraints on the design of connectors, etch, termination, etc.
The goal of minimum skew is consistent with attaining the lowest possible bus cycle time, the highest possible system data bandwidth, and the lowest possible memory read latency. It is important to note that in the AlphaServer 8000 platform, skew between clocks is not simply measured at the clock pins of the various clocked components. Skew is measured and, more important, managed at the actual "point of use" of the clock, for example, at the clock pins of ASIC flip-flops. This is an important point when dealing with ASICs. Since different copies of even the same ASIC design can have different clock insertion delays, additional skew can be injected between clocks after the clocks pass their ASIC pins.

The AlphaServer 8000 clock distribution system is implemented according to a two-tier scheme. The first tier, the system clock distribution, distributes a clean radio frequency (RF) sine wave clock to each system bus module. The second tier, the module clock distribution, converts the system RF sine wave clock to a digital clock and distributes it to each module's components.

The AlphaServer 8000 module clock distribution is a boilerplate design that is replicated on each AlphaServer 8000 module. On each module, the system sine wave clock is terminated by a single-ended-to-dual-differential output transformer. This transformer produces two phase- and amplitude-matched differential clocks that are fed into one or two AlphaServer 8000 clock repeater chips (DC285 chips). These chips convert the sine wave clocks into CMOS-compatible digital clocks; distribute multiple copies of the digital clocks to various module components, including the system bus interface ASICs; and perform remote delay clock regulation on each clock copy.

The remote delay clock regulation is performed by a custom, digital delay-locked loop (DLL) circuit. This DLL circuit was devised specifically to deskew clocks all the way to their point of use in the system bus interface ASICs. The principles of DLL-based remote delay clock regulation are simple.
The sum of the delays digital clock and distributes tlic digital clock to each associated with (1) the clock repeater chips, (2) the niodi~le'scomponents. TIic module clock distribution module clock distribution ctch, and (3) the ASIC tier also manages the skcw benveen the system RE' sine clock distribution l~envorkconstitutes the insertion wave clock and all copies of each module's digital delay of the ASIC; point-of-use clock \vith respect to clock by means of an innovative "remote delay coni- the systeni sinc wave clock. With no clock regulation, pensation" meclianisni. The system clock distribution this delay appears as skew benveen the system clock delivers clocks to the nine system bus niodule slots and the point-of-use ASIC clock. Between ASlCs on with a maximum of 40 picoseconds (ps) of skcw. different modules, a fixed portion of the clock inser- The module clock distribution delivers clocks to tlic tion delay will correlate and need not be fi~ctoredinto various module components, most notably system the overall system skew. Since the insertion delay can bus interface ASICs, with a maximum of 980 ps of easily approach 7 ns, however, the variation in the skew. The skew between any ASIC flip-flop on any insertion delays to different ASICs, which must be fac- Alphaserver 8000 modulc and any ASIC flip-flop on tored into the overall system skew, can also be signifi- any other AlphaServer 8000 module is guaranteed to cant. To reduce the skew between the system sine be less than 1,100 ps. wave clock and the point-of-use ASIC clock, tlie clock The AlphaServer 8000 system clock distribution repeater uses a digital delay line to add delay to the begins on the system clock module with a single- clock repeater output clock. Enougli delay is added so

This delay moves the point-of-use clock ahead to a point where it again lines up with the system clock. As the system operates, the system and point-of-use clocks may drift apart. In response, the clock repeater adjusts its delay line to pull the clocks back together. This process of delaying clocks and dynamically adjusting the delay is called remote delay clock regulation. When the clock separation, or drift, is measured by a clock "replica loop" and the clock delay is inserted by means of a digital delay line, the process is called DLL-based remote delay clock regulation.10 Using the clock repeater chips in this way, AlphaServer 8000 modules are able to achieve point-of-use to point-of-use skew of approximately 930 to 980 ps. Combined with the system module-to-module skew of 30 to 40 ps, this provides the quoted system-wide clock skew of no more than 1,100 ps.

In the effort to minimize the bus cycle time, the design of the AlphaServer 8000 bus driver/receiver cell focused on minimizing the propagation delay of the system bus driver circuit and minimizing the setup time (Tsetup) of the system bus receiver.

The AlphaServer 8000 system bus driver/receiver cell is a fully custom CMOS I/O cell, which incorporates a bus driver, a bus receiver, an output flip-flop, and an input flip-flop in a single cell. Consisting of nearly 200 metal-oxide semiconductor field-effect transistors (MOSFETs), the bus driver cell is powered by standard 3.3-V CMOS power but drives the bus at a much lower 1.5-V level (i.e., voltage swings between 0 and 1.5 V). This low-voltage output serves to reduce the bus driver's power consumption and permits compatibility with future CMOS technologies that are powered by voltages less than 3.3 V. Many of the bus driver cell's critical characteristics are "programmable."
driver cell's critic,il characteristics arc "programnia- It is \vorth noting that although the AlpIiaServer blc," such as the 1.5-Voutput, tl~creceiver s\vitcJiing clock repeater \\/as primarily developed for Llse wit11 point, the driver's drive current limit, and the driver's system bus interface ASICs, it is a generally vcrsatilc rise and fall times. These values are programmed and, part. It may, for instance, be used with non-ASIC parts most important, are held constant by means of ref- such as transceivers and synchronous Sl&Vs. In these erence voltages and resistances external to the bus cases, the clock pin of the 11011-ASICpart is treated as driver/reccivcr cell's ASIC package. They allo\v the the point of use of tlie clock. The clock repeater may cell to produce unifornm, predictable, high-performance also be used for precise positioning of clock edges. On waveforms and to transmit and receive data in a clock the TLEP module, for example, the Alpha 21 164 cycle of 10 ns. microprocessor's system clock is synchronized to a The bus drivcr/receiver's high performnnce begins clock repeater output by means of a digital pliase- with its outpl~tflip-flop and driver logic. The output locked loop (PLL) on the microproccssor. The Alpha tlip-flop is designed for ~i~inirnumdelay and is inte- 2 1 164's PLL operates in such a way that the 2 1 164's grdlly linked to tlie output driver. This configuration clock is always in phase with or always trailing the sys- produces clocl<-to-outpl~ttimes of 0.5 ns to 1 ns. The ten1 (reference) clock. It can trail by as IIILICII as 2 11s. output dri\'cr itself, with its programmable output Such a large clock disparity in this fixed orjentation call voltage and cdgc rates, allo\vs the shape of the output create setup timc problenis for transfers from the \va\/cform to be carefully controlled. 
The cell's pro- Alpha 21164 to the system and hold-tinw problems gramniablc values are set such that the Alphaserver for transfers from the systern to the Alpha 2 1 164. The system bus waveform balances tlie edge rate effects of TLEP design addressed this problem by lengthening increased crosstalk cvith increascd propagation delay. the replica loop associated with the Alpha 21 164 clock Furthermore, the bus waveforni is shaped in such a and thereby shifting the microprocessor clock 1 ns way that it allows incident wave transmission of sig- earlier than the balance of the clock repeater output nals. As such, a signal can be received on its initial clocks. Since the Alpha 21 164 clock was either in propagation across the bus centerplane, as opposcd to pliase or 2 ns later than its associated clock repeater waiting For signal reflections to scttlc. All the driver clock, which is 1 ns carlicr than the rest of the clock characteristics serve to reduce bus scttli~igtime. When repeater clocks, the 2 1164 clock now appears to be combined \vith the lo\\) clock-to-ot~tputtimc of the either 1 ns earlier or 1 ns later than the rest of the clock output flip-flop, this reduced settling time produces a repeater system clocks. This centering of the module very low driver circuit propagation delay (T,,.,,). clocks with respect to the 21164 clock halves the Tlic bus iiri\lcr/recei\ler cell's receiver and input required setup or hold rnargin.ll. l2. I4 flip-flop f~rthcr contribute to its high performance. Dcsigncd with a programmable rcfcrcncc voltagc, the Bus Driver Technology Like the Alphaserver 8000 receiver has a very precise switching point. Whereas clock system, the AlphaScrvcr 8000 system bus driver/ typical receivers may have a 200-~nilli\rolt(mV) to receiver cell was specifically designed to minimize bus 300-nlV s\vitching \vindow, the bus drivcr/rccciver cycle time. 
As with the clock logic, the goal of mini- cell's rcccivcr has a switching \\,i~ldo~~as small as mizing cycle time was a result ofthe effort to minimize 40 mV. This diminished switching uncertainty directly system read latency and maximize systcm data band- reduces thc receiver's maximum scti~ptimc. The input

flip-flop's master latch is a sense-amplifier-based latch as opposed to a simple inverter-based latch. The sense amplifier, with its ability to resolve small voltage differentials much faster than standard inverters, allows the master latch to determine its next state much more rapidly than a standard latch. This characteristic serves to reduce both the receiver's setup and hold time requirements.

In general, the setup and hold time requirements of a state element are interrelated. The setup time, for example, can be reduced at the expense of hold time. Since setup time contributes to cycle time and hold time may not, reducing setup time is desirable. The AlphaServer 8000 bus driver/receiver cell requires at most 300 ps of combined setup and hold time. However, since the edge rates of the cell driver are so well controlled, the minimum propagation time for a bus signal is always guaranteed to exceed 300 ps. As a result, the bus receiver circuit is designed with all 300 ps charged as hold time. This renders a minimized receiver setup time (Tsetup) of 0 ps.

The AlphaServer 8000 bus driver/receiver cells have a number of additional features that further reduce the propagation delay (Tprop) of the driver circuit. The cell, for example, features in-cell bus termination, which provides the system bus with full, distributed termination. Simulations have shown that such distributed termination can provide an advantage of 500 ps over common end termination. The bus driver/receiver cell's termination resistance, like other cell parameters, is programmable and made identical throughout all system ASICs by means of a reference resistor external to each ASIC.

The bus driver/receiver cell also features a special preconditioning function that improves the driver's propagation delay by as much as 1,500 ps. This feature causes all bus drivers to begin driving toward the opposite state each time they receive a new value from the bus. If the bus is changing state from one cycle to the next, the feature causes all drivers to begin driving the bus to a new state in the next cycle. In doing so, all bus driver cell drivers contribute current and accelerate the bus transition. If the bus is not changing from one cycle to the next, the drivers simply push the state of the bus toward the opposite state, but only to a benign voltage well short of the switching threshold.

All of the bus driver cell's programmable features, such as switching point, output voltage, edge rates, and termination resistance, make the bus driver cell a very stable and high-performance interface cell. The existence of these features, however, is an element of the bus driver cell's complementary process-voltage-temperature (PVT) compensation function. PVT compensation is meant to make a device's operating characteristics independent of variations in the semiconductor process, power supply voltage, and operating temperature. By applying PVT compensation in every AlphaServer system bus interface ASIC, bus driver cells in different ASICs, for example, can drive nearly identical system bus waveforms even if those ASICs come from manufacturing lots with varying speed characteristics. AlphaServer 8000 PVT compensation is based on reference voltages and resistances provided by very precise, low-cost, module-level components. The PVT compensation circuit measures these references and configures internal voltages and resistances so that all bus driver cells can operate uniformly and predictably. By creating predictability and thus reducing uncertainty and skew, bus cycle time is minimized.

Signaling Layer

Powerful though it may be, the AlphaServer 8000 bus driver/receiver cell is not without limitations. During its development, it was found that the bus driver cell could be developed to drive the AlphaServer 8000 system bus in 10 ns under a limited number of conditions. When the driver cell asserted a deasserted (near 0 V) bus line or deasserted a bus line that had been asserted (near 1.5 V) for only one cycle, for example, 10-ns timing could readily be met. When the driver attempted to deassert a bus line that had been asserted for more than one cycle by multiple drivers, however, 10-ns timing could not be met. These limitations have significant implications for protocol development. Protocols typically have a number of signals that can be driven by multiple drivers. These may include cache status signals and bus flow control signals. Protocols also typically include a number of signals that can be asserted for many cycles. These may include bank busy signals or arbitration request signals. Clearly the implications are that the limitations of the bus driver/receiver cell would cause the system either to fall short of its cycle time and performance goals or to be incapable of supporting a workable bus protocol.

With the bus driver/receiver cell pushing technology to its limits, the solutions to this problem were extremely limited. The system cycle time could be slowed down to accommodate all signal transitions within a single cycle, regardless of the charge state of the signal line; or a signaling protocol could be developed that would avoid charging a signal to the point where it could not transition in 10 ns; or the physical topology of the system could be reconsidered with the goal of finding a new topology that met the system goals at a slower clock rate. The first option of slowing the clock was clearly unacceptable; it could not satisfy the system's latency and bandwidth goals given the system's topology. The third option could potentially satisfy the system's latency and bandwidth goals, but came at the expense of the favorable qualities of the simple bus outlined in the Topological Layer and at the risk that the new topology would suffer similar, unforeseen pitfalls. The option of developing a
signaling protocol, on the other hand, could satisfy the system's performance goals with little or no risk. A signaling protocol was clearly the practical solution to the bus driver/receiver cell limitations.

The Signaling Layer of the platform development model introduces the AlphaServer 8000 signaling protocol. This protocol was developed by creating a list of signal classes, based on driver counts and assertion and deassertion characteristics, and by associating a specific signaling protocol with each class. The signal classes and their protocols are listed in Table 2. As the AlphaServer 8000 primary protocol was developed, each bus signal was assigned a signal class. As the AlphaServer 8400/8200 hardware was developed, each bus signal was designed to operate according to the signaling protocol associated with its signaling class. The system bus address and data signals, for example, fall into the second class of signals. As a result, the AlphaServer 8400/8200 modules are designed to leave tristate cycles between each address [...]

[...] demonstrate a noteworthy paradigm that results from the AlphaServer 8000 signaling protocol. All these signals are defined such that at times they must be asserted for multiple cycles. All these signals also fall into the fourth signal class, which expressly prohibits driving the signals for multiple cycles. When these two contradictory requirements exist, the result is a class of signals pulsed to indicate multiple cycles of constant assertion. Logic inside each AlphaServer 8000-based module must be designed to convert these pulsed signals to constantly asserted signals within its system bus interface. Note that when signals such as these are discussed in the protocol sections of this paper, the term "asserted" is used to imply constant assertion, with the understanding that the signals may in fact be pulsed.

Consistency Check Layer

The Consistency Check Layer defines a method for maintaining system integrity. Specifically, it defines methods for detecting errors and inconsistencies in the system and, more important, methods for logging errors in the presence of historically disabling errors. Although it does not contribute directly to the AlphaServer 8000 platform's performance goals or product goals, the Consistency Check Layer contributes an extremely useful feature to the AlphaServer 8000 products. It is included in the paper for the sake of completeness in the analysis of the seven-layer platform development model.

The AlphaServer 8000-based systems employ a number of error-checking mechanisms. These include transmit checks, sequence checks, [...]. This Fault signal has the effect of partially resetting all system bus interfaces and processors, and trapping the processors to "machine check" error-handling routines. The partial reset clears all system state, with the exception of error registers. This resynchronizes all system bus interfaces and eliminates all potentially unserviceable transactions left pending in the system. Thus the system can begin execution of the machine-check routines in a reset system. Although the routines are not guaranteed to be able to complete an error log in the presence of an error, it is believed that this mechanism will increase the probability of a successful error log.
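The pulse-to-level conversion that each module's system bus interface must perform can be sketched in software. The sketch below is our illustration only (the real conversion is hardware logic): because a class-4 signal is never driven in two consecutive cycles, each observed pulse is stretched over two cycles to recover the constantly asserted level it represents.

```python
def depulse(samples, stretch=2):
    """Reconstruct a constantly asserted level from a pulsed bus signal
    that is driven at most every other cycle (illustrative sketch only;
    names and structure are ours, not the hardware's)."""
    level, remaining = [], 0
    for sample in samples:
        if sample:
            remaining = stretch  # a pulse implies assertion for `stretch` cycles
        level.append(1 if remaining > 0 else 0)
        remaining = max(0, remaining - 1)
    return level

# Pulses in cycles 0 and 2 represent four cycles of constant assertion.
```

A driver signaling four cycles of assertion thus pulses only twice, never violating the class-4 rule against driving in two consecutive cycles.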
Table 2
AlphaServer 8000 Signal Classes

Class 1
  Driver count and assertion/deassertion characteristics: single driver with multiple receivers.
  Signaling protocol: never driven more than two consecutive cycles.

Class 2
  Driver count and assertion/deassertion characteristics: multiple drivers with multiple receivers; one driver at a time.
  Signaling protocol: tristate cycle on the bus when the driver changes; never driven more than two consecutive cycles.

Class 3
  Driver count and assertion/deassertion characteristics: multiple drivers with multiple receivers; many drivers at once possible; assertion time may differ from driver to driver; deassertion time is fixed.
  Signaling protocol: value received on signal deassertion is unpredictable and must be ignored; tristate cycle on the bus when the driver changes; never driven in two consecutive cycles.

Class 4
  Driver count and assertion/deassertion characteristics: multiple drivers with multiple receivers; many drivers at once possible; timing is fixed.
  Signaling protocol: value received on signal deassertion is unpredictable and must be ignored; tristate cycle on the bus when the driver changes; never driven in two consecutive cycles.

The AlphaServer 8000 platform's Fault error-handling feature is particularly useful in recovering error state from a computer in a "hung" state. A computer enters a hung state when an error occurs that stops all progress in the computer system. If a processor is waiting for a response to a read, for example, and the read response is not forthcoming due to an error, the system hangs while waiting for the response. The desktop model for error handling would require a system reset to recover from such an error. The process of the system reset, however, would purge error state. The purge, in turn, makes error diagnosis extremely difficult. This desktop model is not unique to desktop systems. It is also employed in server-class machines such as Digital's DEC 7000/10000 AXP systems. Although this model may be acceptable on the desktop, it is most undesirable in an enterprise server system. The AlphaServer 8000-based systems use a time-out to detect a hung system and the Fault error-handling technique to recover an error log in the event of a hung system. The result is a robust error-handling system that is appropriate in an enterprise server.

Primary Protocol Layer

The Primary Protocol Layer of the platform development assigns names and characteristics to the various system bus signals and uses these names and characteristics to define higher-order system bus transactions and functions. System bus transactions may include reads of data from memory or writes of data to memory. These transactions are the primary business of a computer system and its protocol. If a system efficiently executes read and write transactions, it will perform better than a system that does not. System bus functions may include mapping memory addresses to specific memory banks or arbitrating for access to system buses. These functions enable system bus transactions to operate in environments with multiple processors arbitrating for access to the system bus and multiple banks of memory.

AlphaServer 8000 system bus transactions relate directly to the platform's performance metrics. The system's memory read latency, for example, is equal to the time it takes for a processor to issue and complete a system bus read transaction. The number of system bus transactions and their associated data that the system bus can process in a given period of time define the system bus bandwidth.

The components of a typical memory read transaction are shown in a timeline in Figure 4. This timeline of components is based on a system that is an abstract of the DEC 7000/10000 AXP systems. To minimize a system's memory read latency, each component of the read transaction timeline must be minimized. Components 1, 3, 7, and 8 of the timeline are simply data and address transfers across buses and through interfaces. The delays associated with these components are largely determined by system cycle time; they cannot be affected by the protocol to any great extent. Component 5 is the DRAM access time. It is minimized by the reconfigurable controllers described in the Operational Layer. The remaining components, (2) address bus arbitration, (4) memory bank decode, and (6) data bus arbitration, fall into the domain of the primary protocol. These elements must be designed to contribute minimal delay to the overall latency.

The effects of protocol on a system's data bandwidth are a little more difficult to quantify than the effects of protocol on memory read latency. In general, the theoretical maximum system bandwidth is equal to either the sum of the bandwidths of the system's memory banks or the maximum system bus bandwidth, whichever is smaller. If the system bandwidth is limited by memory module bandwidth, it is essential to keep as many memory modules active as possible. If, for example, eight banks of memory are required to sustain 100 percent of the maximum system bandwidth, but the system can support only four outstanding commands, only four banks can be kept busy and only 50 percent of the maximum bandwidth can be rendered. In another example, if 10 percent of the time this system freezes all but one bank of memory to perform special atomic functions on special data blocks, the system's bandwidth will suffer nearly a 10 percent penalty (73/80 possible memory accesses versus 80/80 possible memory accesses). If the system bandwidth is limited by the bandwidth of the system bus, the maximum system bandwidth can be achieved only when the protocol allows system modules to drive data onto the system data bus in every available cycle on the data bus. When a processor reads a block
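The two bank-utilization examples above reduce to simple arithmetic; the restatement below is our own, with the numbers taken from the text.

```python
# Worked restatement of the bank-utilization examples in the text
# (our formulation; the article gives only the results).
MAX_ACCESSES = 80  # accesses possible in some interval with all banks busy
BANKS = 8

# Four outstanding commands can keep only four of eight banks busy.
utilization = 4 / BANKS  # 0.5, i.e., only 50 percent of peak bandwidth

# Freezing all but one bank 10 percent of the time: full rate 90 percent
# of the time, one-eighth rate for the remainder.
accesses = 0.9 * MAX_ACCESSES + 0.1 * MAX_ACCESSES / BANKS  # 72 + 1 = 73 of 80
```

The second case yields 73 of 80 possible accesses, the nearly 10 percent penalty the text cites.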

[Figure 4 timing diagram. Its eight components, in order: (1) processor issues read address, (2) interface arbitrates for system bus, (3) read driven on bus, (4) memory bank decodes read address, (5) memory accesses data from DRAMs, (6) memory arbitrates for data bus, (7) data driven on data bus, (8) interface forwards data back to processor.]

Figure 4
Components of Memory Read Latency

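The Figure 4 timeline can be totaled to see how the protocol-controlled components add to read latency. The per-component cycle counts below are hypothetical placeholders (the article does not give them); only the structure of the sum is the point.

```python
CYCLE_NS = 10  # system bus cycle time from the text

# Hypothetical cycle counts for the eight Figure 4 components; these are
# invented for illustration, not the AlphaServer 8000's actual values.
READ_COMPONENTS = {
    "processor issues read address": 1,
    "address bus arbitration": 1,
    "read driven on bus": 1,
    "memory bank decode": 1,
    "DRAM access": 6,
    "data bus arbitration": 1,
    "data driven on data bus": 2,
    "interface forwards data to processor": 1,
}

# Total memory read latency is the sum of every component in the timeline.
read_latency_ns = sum(READ_COMPONENTS.values()) * CYCLE_NS
```

With these placeholder counts the read would take 14 cycles (140 ns); shaving one cycle from arbitration or bank decode, as the following sections describe, removes 10 ns from every read.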
of data from a second processor's cache, for example, the second processor may have to stall the data bus to allow it to drive the read data onto the system's data bus as prescribed by the system protocol. A stall of the data bus translates into unused data bus cycles and degradation of real system bandwidth. Thus to maximize real system bandwidth, system bus and memory bank utilization must be maximized, and stalls in system bus activity and stalls in memory bank activity must be minimized.

The following sections begin with an overview of the basic AlphaServer 8000 platform protocol and how this basic protocol influences system performance. This section is followed by a discussion of how the various protocol components identified as elements of memory read latency (i.e., memory bank mapping, address bus arbitration, and data bus arbitration) affect the latency. These sections conclude with a discussion of subblock write transactions and their effects on system bandwidth.

AlphaServer 8000 Protocol Overview

The platform development Topological Layer defined the AlphaServer 8000 system bus as having separate address and data buses. The AlphaServer 8000 system bus protocol defines how system bus transactions are performed using these two buses. According to the protocol, processor and I/O port modules initiate read and write transactions by issuing read and write commands to the system address bus. These address bus commands are followed sometime later by an associated data transfer on the data bus. All data transfers are initiated in the order in which their associated address bus commands are issued. Cache coherency information for each system bus transaction is broadcast on the system bus as each transaction's data bus transfer is initiated. Each data transfer moves 64 bytes of data (only 32 bytes of which are valid for programmed I/O transfers).

Figure 5 shows an example of AlphaServer 8000 system bus traffic. In cycle 1 a read transaction, r0, is initiated on the system address bus. In cycle X, the data transfer for read r0 is initiated on the system data bus by means of the system bus Send-Data signal, the assertion of which is indicated with a value of i0. As this data transfer is initiated, the status, s0, is also driven on the system bus. In cycle X+2, all system bus modules have an opportunity to stall or to control the flow to the system data bus. In this example, the bus is not stalled, as indicated by a value of n. Finally, given that the bus is not stalled, the 64 bytes of read data associated with read r0 are transferred across the system bus during cycles X+5 and X+6. In addition to read r0, Figure 5 also illustrates the execution of a write, w1, and another read, r2. Note that data transfer initiation, data bus flow control, and data transfer are pipelined on the system data bus in the same order as their associated commands were issued to the address bus. Note further that this diagram represents 100 percent utilization of the system data bus (one data transfer every three cycles). With a 10-ns cycle time, this utilization would translate to 2.1 GB per second of bandwidth.

The AlphaServer system address bus uses two mechanisms to control the flow of system bus transactions. First, processor and I/O port modules are not allowed to issue commands to memory modules that are busy performing some DRAM access for a previously issued system bus transaction. The state of each memory bank is communicated to each processor by means of system bus Bank-Available signals. If a processor or I/O port seeks access to a given memory bank and that memory bank's Bank-Available signal indicates that the bank is free, the processor or I/O port may request access to the address bus and, if granted access by the system arbitration logic, issue its transaction to the address bus. If a processor or I/O port seeks access to a given memory bank and that memory bank's Bank-Available signal indicates that the bank is not free, the processor or I/O port will not request access to the system address bus. Thus, unless all memory banks are busy or unless the total of the busy memory banks includes all banks that are needed to service the system's processors and I/O ports, the address bus will continue to transmit commands. The second mechanism for controlling the flow through the address bus is the system bus Arb-Suppress signal. If any system bus module runs out of any command/address-related resource, such as command queue
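The quoted rates follow directly from the 10-ns cycle time: one address bus command every other cycle, and one 64-byte data transfer every three cycles. A quick check (our arithmetic, not from the article):

```python
# Deriving the bandwidth figures quoted in the text from the cycle time.
CYCLE_NS = 10
BYTES_PER_TRANSFER = 64

# Bytes per nanosecond is numerically GB per second (1e9 bytes / 1e9 ns).
data_bw_gb_per_s = BYTES_PER_TRANSFER / (3 * CYCLE_NS)  # about 2.1 GB/s

address_cmds_per_s = 1e9 / (2 * CYCLE_NS)   # one command every other cycle: 50 million/s
data_blocks_per_s = 1e9 / (3 * CYCLE_NS)    # one block every three cycles: about 33.3 million/s
```

The mismatch between the two rates is the "excess of address bus bandwidth" the paper later exploits for early arbitration.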

[Figure 5 timing diagram: address bus commands r0, w1, r2; Send-Data assertions i0, i1, i2; cache status s0, s1, s2; data bus flow control n (not stalled); paired data transfers r0, w1, r2 during cycles X+5 through X+10.]

Figure 5
AlphaServer 8000 System Bus Traffic

entries, it can assert this signal and prevent the system arbitration logic from granting any more transactions access to the bus. The Arb-Suppress signal is useful, for example, in a system configuration with 16 memory banks but only eight entries' worth of command queuing in a processor.

The AlphaServer 8000 system data bus has its own flow-control mechanism, the system bus Hold signal, which is independent of the address bus flow-control mechanisms. The Hold signal, shown as Data Bus Flow Control in Figure 5, is asserted in response to the initiation of a data bus transfer. Normally, data bus transfers are initiated on the data bus when an AlphaServer 8000 memory module asserts the Send-Data signal. Send-Data is asserted by a memory module based on the state of the module's DRAMs: when servicing a read transaction, the memory will assert Send-Data when its DRAM read is complete; when servicing a write transaction, the memory will assert Send-Data as soon as its turn on the data bus comes up. Five cycles after the assertion of Send-Data, some module drives data onto the data bus. If a module is required to drive data in response to an assertion of Send-Data and is unable to do so, it will assert the Hold signal two cycles after the assertion of Send-Data. This may occur if a processor module must source read data from its cache and cannot fetch the data from the cache as quickly [...]

It is important to note that the address bus and the data bus have independent means and criteria for initiating transactions and controlling the flow of transactions. The address bus initiates address bus commands based on processor and I/O port module requests and controls the flow based on the state of address-related resources. The data bus initiates data transfers in the same order as the address bus transmitted commands by means of the Send-Data signal. Send-Data is usually asserted by a memory module based on the state of the module's DRAMs. The data bus flow is controlled based on the state of various data-related resources. The differing means and criteria for initiation and flow control allow the two buses to operate almost independently of one another. This independence translates into performance because it allows the address bus to continue to initiate commands even as the data bus may be stalled because of a conflict. Continuous command initiation translates into more continuous system parallelism and thus more system bandwidth.

Figures 6 and 7 illustrate this point. Both figures illustrate systems that are issuing a series of processor reads to blocks that must be sourced from another processor's cache. In both cases, processors require two more cycles than main memory banks to source read data. As such, two cycles of Hold assertion must periodically occur on the data bus. Figure 6 illus[trates ...]

Figure 6
Read with One Cycle of Hold: Five Reads Sourced by a Processor

[Timing diagram: address bus commands r0 through r4; Send-Data assertions i0 through i4; cache status s0 through s4; Hold (H) assertions on data bus flow control; paired data transfers r0 through r3 with stall cycles, over cycles 1 through 24.]

Figure 7
Five Reads Sourced by a Processor in a Rigidly Slotted System

[...] address bus. This prevents the fourth and fifth reads from getting the headstart necessary to prevent subsequent stalls of the data bus. The result is a 20 percent degradation in performance for the five reads illustrated. If the series of reads is lengthened toward infinity, the percent of degradation settles to 18 percent. Clearly the AlphaServer 8000 approach produces superior data bandwidth characteristics.

It is also important to note that the AlphaServer 8000 address bus and data bus have different maximum bandwidths. Commands can be issued to the address bus every other cycle. With a 10-ns cycle time, this translates into 50 million commands per second. The data bus, on the other hand, can transfer one block of data every three cycles. With a 10-ns cycle time, this translates into 33.3 million data blocks per second. This excess of address bus bandwidth is useful in the development of low-latency arbitration schemes.

Memory Bank Mapping

Digital's previous server systems, like the VAX 6000 series and the DEC 7000/10000 AXP series, have employed a common approach to address-to-memory-bank mapping. In this approach, all memory modules implement address range registers. As commands and addresses are transmitted across the system bus, the memory banks compare the addresses against their address range registers to determine if they must respond to the command. An address range comparison can involve a significant number of address bits and, as a result, can become logically complex enough to consume two 10-ns cycles of time. These two cycles can be added directly to memory read latency.

The low-latency focus of the AlphaServer 8000 platform prompted a change in bank mapping schemes. In AlphaServer 8000 systems, the address range registers have been moved onto the processor and I/O port modules. The range registers output a 4-bit bank number that is shipped across the system bus with each command and address. Each memory bank compares each bank number transmitted across the system bus to 4 bits in a programmable bank number register to determine if it should respond to the system bus command.

This bank mapping logic configuration helps to reduce AlphaServer 8000 memory read latency. Because the bank mapping is done on the nodes that issue commands to the address bus, the lengthy address comparison can be done in parallel with address bus arbitration, eliminating its two-cycle delay from the memory read latency. The address comparison traditionally done in the memory bank logic is now replaced with a simple 4-bit comparison, which can easily be done in a single cycle. The overall effect is that the AlphaServer 8000 bank mapping protocol consumes at least one cycle less than Digital's traditional bank mapping protocol. This equates to one less cycle (10 ns minimum) of memory read latency.

Address Bus Arbitration

AlphaServer 8000 systems employ a distributed, rotating-priority arbitration scheme to grant access to their address buses. Processor and I/O port modules request access to the address bus based on requests from microprocessors and I/O devices, and on the state of the system's memory banks, as described in the section AlphaServer 8000 Protocol Overview. Each module evaluates the requests from all other modules and, based on a rotating list of module priorities, determines whether or not it is granted access to the bus. Each time a module is granted access to the bus, its priority is rotated to the lowest priority spot on the priority list.

The AlphaServer 8000 arbitration scheme operates in a pipelined fashion. This means that modules request access to the bus in one cycle, arbitrate for access to the bus in the next cycle, and finally drive a command and address onto the bus one cycle later. In terms of processor-generated read requests, this means that, at best, a system bus read command can be driven onto the system address bus two cycles after its
corresponding cache read miss is generated on the processor module. This adds two cycles of delay to the memory read latency.

To reduce memory read latency in components associated with address bus arbitration, the AlphaServer 8000 platform employs a technique called "early arbitration." Early arbitration allows a module to request access to the address bus before it has determined if it really needs the access. If the module is granted access to the address bus but determines that it does not need or cannot use the access, it will drive a No-Operation or NoOp command in the command slot that it is granted. This feature is particularly useful on processor modules. It allows a processor to request access to the bus for a read command in parallel with determining if the read command will hit or miss in the processor's cache. If the read results in a cache hit and the processor is granted access to the address bus, then the processor issues a NoOp command. If the read results in a cache hit and the processor is not granted access to the address bus, the processor discontinues requesting access to the bus. When applied in this manner, this feature can remove two cycles of delay from the memory read latency. This feature is also key to the AlphaServer 8000 memory bank decode feature that allows address-to-memory bank decode to proceed in parallel with system bus arbitration. This is to say, it allows a processor or I/O port module to request access to the address bus before it can determine which memory bank it is trying to access and before it can determine if that memory bank is available. If a module is granted access to the bus and the bank it is trying to access is not available, then the module issues a NoOp command. If a module is not granted access to the bus and the bank it is trying to access is not available, then the module discontinues requesting access to the bus until the bank becomes available. When applied this way, this feature eliminates at least one cycle from the memory read latency, as described in the section Memory Bank Mapping.

The excess address bus bandwidth noted in the protocol overview allows some amount of early arbitration to take place without affecting system performance. When system traffic increases, however, excessive early arbitration can steal useful address bus slots from nonspeculative transactions and as a result degrade bus bandwidth. In fact, in certain pathological cases, excessive early arbitration by modules with high arbitration priority can permanently lock out requests from lower priority modules. To eliminate the negative effect of early arbitration, the AlphaServer 8000 employs a technique called "look-back-two" arbitration. This technique relies on the fact that modules must resolve all cache miss or bank availability uncertainties for early arbitrations within the two cycles required for an early request and its arbitration. This fact implies that any module that has been requesting access to the address bus for more than two consecutive cycles is requesting in a nonspeculative manner. As such, the AlphaServer 8000 arbiter keeps a history of address bus requests and creates two prioritized groups of requests based on this history. It creates a high-priority group of requests from those requests that have been asserted for more than two cycles and a low-priority group of requests from those requests that have been asserted for two cycles or less. It applies the single set of rotating priorities, described above, to both sets of requests. If there are any requests in the high-priority group, the arbiter selects one of these based on the rotating priority set. If there are no high-priority requests, the arbiter selects a request from the lower priority group based on the rotating priority set. This functionality limits early arbitration to only those times when there are nonspeculative requests in the system. It allows the AlphaServer 8000 platform to take advantage of latency gains associated with early arbitration and processor and I/O port based bank decode, without degrading bandwidth in the process.

Data Bus Arbitration

The AlphaServer 8000 data bus transfers blocks of data in the same order that the commands corresponding to those blocks are issued on the address bus. This eliminates data bus arbitration per se. In-order data return is accomplished by a simple system of counters and sequence numbers. Each time a command is issued to the address bus, it is assigned a sequence number. Sequence numbers are assigned in ascending order. Each time a block of data is driven on the data bus, a data bus counter is incremented. Each module waiting to initiate a data transfer in response to some address bus command compares the sequence number associated with its command with the data bus counter. When a module's sequence number matches the data bus counter, it is that module's turn to initiate a data bus transfer.

It is arguable that in-order data return is not the optimum data scheduling algorithm. If the scenario shown in Figure 6 were reshaped such that only read r0 sourced data from another processor and the penalty for sourcing data from a processor were more severe (a longer data bus Hold requirement), the result would be more significant bandwidth degradation. This new scenario is illustrated in Figure 8. With more efficient data scheduling, it is conceivable that data bus utilization could be improved by using data slots abandoned under the sizable Hold window in Figure 8. The latter scenario is illustrated in Figure 9. Clearly the system in Figure 9 has improved upon the bandwidth of the system in Figure 8. What Figure 9 cannot show are all the implications of out-of-order data transfers. With as many as 16 outstanding transactions (8 in the AlphaServer
8000 employs a technique called "lool<-back-n\lo" Clearly tlie systcm in Figurc 9 has improved upon the arbitration. This technique relies on the fact that niod- bandwidth ofthc system in Figure 8. ules must resolve all cache niiss or bank availability What Figurc Y cannot sho\\l are all the implica- uncertainties for early arbitrations within the nvo tions of out-of-order data transfers. With as many as cycles reqi~iredfor an early request and its arbitration. 16 outstanding transactions (8 in the Alphaserver


Figure 8 Bandwidth Degradation as a Result of In-Order Data Transfers

Figure 9 Improved Bandwidth with Out-of-Order Data Transfers
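The counter-and-sequence-number mechanism behind the in-order transfers compared in these figures can be sketched as follows; the class and variable names here are illustrative assumptions, not the hardware's actual structures:

```python
class Module:
    """Sketch of a bus module waiting its turn on the data bus."""
    def __init__(self, name):
        self.name = name
        self.pending = []  # sequence numbers of its outstanding commands

class Bus:
    def __init__(self):
        self.next_seq = 0      # assigned as commands issue on the address bus
        self.data_counter = 0  # incremented per block driven on the data bus

    def issue_command(self, module):
        # Sequence numbers are assigned in ascending order.
        module.pending.append(self.next_seq)
        self.next_seq += 1

    def data_bus_turn(self, modules):
        # A module may initiate a transfer only when its oldest pending
        # sequence number matches the data bus counter: in-order return.
        for m in modules:
            if m.pending and m.pending[0] == self.data_counter:
                m.pending.pop(0)
                self.data_counter += 1
                return m.name
        return None

bus = Bus()
a, b = Module("cpu0"), Module("iop0")
bus.issue_command(a); bus.issue_command(b); bus.issue_command(a)
order = [bus.data_bus_turn([a, b]) for _ in range(3)]  # -> cpu0, iop0, cpu0
```

Because each module simply compares its own sequence number against the shared counter, no central data bus arbiter is needed, which is exactly the simplification the in-order design buys at the cost of the abandoned data slots shown in Figure 8.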

8400/8200) active in the system at any one time, the task of producing a logic structure capable of retiring the transactions in order is enormous. Furthermore, the retiring of transactions out of order complicates the business of maintaining coherent, ordered memory updates. Finally, it was felt that the parallelism made possible by the independent address and data buses would help to mitigate many of the negative effects associated with the in-order data transfers. For these reasons, a practical decision was taken to transfer data on the system data bus in the order that the associated commands were issued to the system address bus.

Subblock Writes

To support a range of I/O subsystems, AlphaServer 8000 I/O port modules must support writes of data as small as longwords (32 bits), words (16 bits), and bytes. Given the AlphaServer system bus block size of 64 bytes, these writes are referred to as subblock writes. The execution of a subblock write consists of reading a block of data from a system memory bank, overwriting just the portion of the block addressed by the subblock write, and writing the entire block back to memory. The difficulty with performing this operation arises when a "third-party" module, defined here as a module other than the one performing the subblock write, modifies the block between the read portion of the subblock write and the write portion of the subblock write. To correctly complete the subblock write, the I/O port module must merge the subblock write data into the block as it was after the third-party module modified it. This problem can be resolved in one of two ways: (1) by means of a small cache on the I/O port module that updates the I/O port's copy of the block based on the third-party write, or (2) by means of an atomic read-modify-write that disallows the third-party write altogether.

In an ideal world, I/O port modules would implement a small one-block cache for the purpose of subblock writes. This cache would allow the I/O module performing the subblock write to update its copy of the block targeted by the subblock write with modified data from third-party modules. Unfortunately, not all processors broadcast modified data to the system. Many processors, for example, use a read-invalidate protocol. In a read-invalidate protocol, when a processor wishes to modify a block, it issues a command that invalidates all other copies of that block in the system and then modifies the block of data in its cache. If such an invalidate command invalidated the block in an I/O port module's subblock write cache, the I/O port module would be forced to re-read the block. There is no guarantee, however, that another invalidate will not occur between the re-read of the block and the write of merged data back to memory.
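The read-merge-write sequence, and the window in which a third-party write can intervene, can be sketched as below. The 64-byte block size is from the text; the helper function and its names are hypothetical:

```python
BLOCK_SIZE = 64  # AlphaServer system bus block size, in bytes

def subblock_write(memory, addr, data):
    """Sketch of a subblock write: read the containing 64-byte block,
    overwrite only the addressed bytes, write the whole block back."""
    base = addr - (addr % BLOCK_SIZE)
    block = bytearray(memory[base:base + BLOCK_SIZE])   # read portion
    off = addr - base
    block[off:off + len(data)] = data                   # merge subblock data
    # Hazard: if a third-party module modifies the block at this point,
    # the write below silently discards that modification, which is why
    # either an I/O port cache or an atomic read-modify-write is needed.
    memory[base:base + BLOCK_SIZE] = block              # write portion

mem = bytearray(128)
subblock_write(mem, 68, b"\x01\x02\x03\x04")  # a longword (32-bit) write
```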

As such, the I/O port module may never be able to complete the subblock write. I/O port caching is therefore not a workable solution.

Atomic read-modify-write sequences disallow third-party writes to a given block between the read portion of a subblock write and the write portion of a subblock write. As such, the atomic read-modify-write sequence does guarantee the timely completion of a subblock write. Implementations of atomic read-modify-write sequences are designed to disallow accesses to some size portion of the memory region that contains the subblock address, between the read and write portions of the subblock write. The size of the memory region can vary from a single block of data to a single bank of memory to the entirety of memory. If the size of the memory region is small, such as a single data block, design complexity is significant; but the impact of locking out access to a single block of memory is insignificant to bandwidth. Conversely, if the size of the memory region is large, such as the entirety of memory, design complexity is insignificant; but the impact of locking out accesses to the entirety of memory for any period of time can be significant to system bandwidth.

The AlphaServer 8000 platform supports atomic read-modify-write sequences by locking out accesses within memory-bank-sized memory regions. This middle-ground memory-region size provides the AlphaServer 8000 with a practical balance between design complexity and system bandwidth. The AlphaServer 8000 platform implements memory-bank-granularity atomic read-modify-write accesses by means of special Read-Bank-Lock and Write-Bank-Unlock address bus commands, and by leveraging the existing memory bank flow control mechanisms. Specifically, Read-Bank-Lock commands function like normal read commands, except that their targeted memory banks are left busy after the read transaction is complete. Memory banks locked by Read-Bank-Lock commands remain busy until a Write-Bank-Unlock command is issued from the same module that issued the Read-Bank-Lock command. While a memory bank is busy, no module other than the module that locked the bank by means of a Read-Bank-Lock command will even request access to the bank, as required by standard arbitration protocol. This approach provides for atomic read-modify-write sequences and coherent subblock writes. This protocol works regardless of the number of I/O modules in the system and regardless of arbitration priorities.

Superset Protocol Layer

The AlphaServer 8000 primary protocol provides all the basic constructs required to perform basic system functions, such as memory reads and writes, local register reads and writes, and mailbox-based I/O register reads and writes. The protocol performs these basic functions with a high level of efficiency and performance. Some additional functionality, such as PCI direct-programmed I/O register accesses, can be functionally satisfied by the primary protocol but cannot be satisfied in a way that does not severely degrade the performance of the entire AlphaServer 8000 system. As such, the AlphaServer 8000 platform allows for Superset Protocols, i.e., protocols that are built upon the basic constructs (reads and writes) of the AlphaServer 8000 primary protocol.

PCI direct-programmed I/O register reads can take more than a microsecond to complete. If these reads were completed by means of the AlphaServer 8000 nonpended, strictly ordered primary protocol, the AlphaServer system data bus would be stalled for a full microsecond each time a PCI programmed I/O read was executed. Such stalls would have a disastrous effect on system bus bandwidth and system performance. The PCI programmed I/O problem is solved on the AlphaServer 8000 platform by implementing a PCI-specific pended read protocol using the simple read and write commands already included in the basic AlphaServer 8000 primary protocol. This special superset protocol works as follows:

When a microprocessor issues a PCI programmed I/O read, the read is issued to the AlphaServer 8000 system bus as a register read. This read is pended with a unique identification number that is associated with the issuing processor by driving the identification number on the system bank number lines when the register read command is issued to the system address bus. The bank number lines are otherwise unused during register accesses. The issuing processor also sets a flag, indicating that it has issued a PCI programmed I/O read command.

The I/O port module interfacing to the addressed PCI local bus responds to the register read by forwarding the read to the PCI, storing the processor identification number specified by the address bus bank number lines, and driving "dummy data" onto the data bus in the register read's associated data slot. The value of the dummy data is irrelevant; it is ignored by all system bus modules and is typically whatever was left in the I/O port's register read buffer as a result of the last read it serviced.

When the PCI local bus returns read data to the I/O port module, the I/O module issues a register write to a special PCI read-data-return register address on the system bus. This write is pended

with the issuing processor's identification number, which was stored by the I/O port module. This identification number is again pended by driving it onto the system bank number lines as the register write command is issued to the system address bus. The PCI read data is returned in the data cycle associated with this register write.

When a processor module identifies a register write that addresses the PCI read-data-return register address, it checks the state of its PCI read flag and compares the value driven on the system bank number lines with its unique identification number. If the PCI read flag is set and the value on the bank number lines matches the processor's identification number, then the processor completes the PCI programmed I/O read with the data supplied by the register write.

The AlphaServer 8000 PCI programmed I/O read superset protocol allows AlphaServer 8000 systems to complete PCI programmed I/O reads without stalling system buses. Furthermore, it allows AlphaServer systems to support PCI I/O in such a way that system bus modules not participating in the superset transaction need not be alerted to the presence of special bus transactions and therefore need not contain logic that recognizes and responds to these special cases. This approach demonstrates a practical way to simplify overall system design without affecting system performance.

AlphaServer 8400 and AlphaServer 8200 Systems

The AlphaServer 8400 and 8200 systems are the first products based on the AlphaServer 8000 platform. The AlphaServer 8200 system is an "open office"-class server (i.e., the AlphaServer 8200 can be located in any office area, for example, where photocopier machines are typically placed). It features up to six system bus modules in an industry-standard 47.5-centimeter (19-inch) rackmount cabinet. The 8200 system can support up to six 300-MHz Alpha 21164 microprocessors, 6 GB of main memory, and 108 PCI I/O slots. The AlphaServer 8400 system is an "enterprise"-class server (i.e., a machine on which a business can be run). It features up to nine system bus modules in a DEC 7000-style cabinet. It can support up to twelve 300-MHz Alpha 21164 microprocessors, 14 GB of main memory, and 144 PCI I/O slots.

The clock frequencies of both the AlphaServer 8400 system and the AlphaServer 8200 system are determined by the clock frequency of the 300-MHz (3.33-ns cycle time) Alpha 21164 microprocessor chip. Both systems use a 4X clock multiplier to arrive at a system clock frequency of 75 MHz (13.3-ns cycle time). At this speed, the systems feature 265-ns minimum read latencies and 1,600 MB/s of data bandwidth.

Both systems are based on the same set of AlphaServer 8000 architecturally compliant system bus modules. In addition, both systems support a new PCI I/O subsystem designed specifically for these classes of systems. The constituent modules and I/O subsystems that compose the AlphaServer 8400 and the AlphaServer 8200 systems are as follows.

TLEP Processor Module: Each TLEP processor module supports two 300-MHz Alpha 21164 microprocessors. Each Alpha 21164 processor is paired with a 4-MB external cache. This cache is constructed with 10-ns asynchronous SRAMs. The cache latency to first data is 20 ns, and with one 3.33-ns processor cycle of wave pipelining, its maximum bandwidth is 915 MB/s. The TLEP module operates with a 75-MHz (13.33-ns cycle time) clock frequency.

TMEM Memory Module: Each TMEM memory module is implemented with two equal-sized DRAM banks. TMEM modules are available in 128-MB, 256-MB, 512-MB, 1024-MB, and 2048-MB sizes. The TMEM module is designed to operate at a 100-MHz (10-ns cycle time) clock frequency.

TIOP I/O Port Module: The TIOP module interfaces the AlphaServer 8000 system bus to four I/O channels, called "hoses." Each hose can interface to one XMI, Futurebus+, or PCI/EISA I/O subsystem. Each TIOP can support up to 400 MB/s of I/O data bandwidth and is designed to operate at a 100-MHz (10-ns cycle time) clock frequency.

ITIOP Integrated I/O Port Module: The ITIOP module interfaces the AlphaServer 8000 system bus to one hose I/O channel and one semipreconfigured PCI local bus, which is integrated onto the ITIOP module. The integrated PCI bus features one single-ended small computer systems interface (SCSI) controller, three Fast Wide Differential SCSI controllers, one NI port, and optional FDDI and NVRAM controllers. Each ITIOP can support up to 200 MB/s of I/O data bandwidth and is designed to operate at a 100-MHz (10-ns cycle time) clock frequency.

PCIA PCI I/O Subsystem: The PCIA PCI I/O subsystem consists of hose-to-PCI adapter logic and a 12-slot PCI local bus. This 12-slot bus is created from 4-slot PCI buses interfaced such that they appear as a single bus. The high slot count provides the connectivity essential in an enterprise-class server. The PCIA optimizes direct memory access (DMA) reads by means of the PCI Read-Memory-Multiple command. The Read-Memory-Multiple command allows the PCIA to stream DMA read data from memory to the PCI bus. Consequently, the PCIA can increase DMA read bandwidth, offsetting any latency penalties that result from the AlphaServer 8000 platform's multilevel I/O architecture. The PCIA's adapter logic

includes a 32K-entry map RAM for converting PCI addresses (32 bits) to AlphaServer 8000 system bus addresses (40 bits). This map RAM features a five-entry, fully associative translation cache.

AlphaServer 8400 and AlphaServer 8200 Performance

A number of performance benchmarks have been run on the AlphaServer 8400 and AlphaServer 8200 systems. The results of some of these benchmarks are summarized in Table 3.

The AlphaServer SPECint92 and SPECfp92 ratings demonstrate outstanding performance. In both ratings, the AlphaServer 8400 system performance is over 3.5 times the ratings of the HP 9000-800 T500 system. The SPECfp92 rating of 512 is 1.4 times that of its nearest competitor, the SGI Power Challenge XL system. Similarly, a six-processor AlphaServer 8400 system achieves the same 1,900 million floating-point operations per second (MFLOPS) as an eight-processor SGI Power Challenge XL system. Finally, the AlphaServer 8400 system's 5-GFLOPS Linpack nxn result is beyond the performance of all other open systems servers, placing the AlphaServer at supercomputer performance levels with systems such as the NEC SX-3/22 system and the massively parallel Thinking Machines CM-200 system.

Acknowledgments

Several members of the AlphaServer 8000 Development Team in addition to the authors were key contributors to the generation of this technical article. These individuals are John Bloem, Elbert Bloom, Dick Doucette, Dave Hartwell, Rick Hetherington, Dale Keck, and Rich Watson.

Table 3 AlphaServer 8400 and 8200 System Performance Benchmark Results

(Columns: Benchmark Name, Processor Count, Units, AlphaServer 8200, AlphaServer 8400. Benchmarks reported: SPECint92 (341.4) and SPECfp92 (512.9); SPECrate results of 8551, 50788, 11981, and 71286; Linpack 100x100 (140.3 MFLOPS); Linpack 1000x1000 (410.5 and 1821 MFLOPS); Linpack nxn (428.3 and 2445 MFLOPS, plus a GFLOPS entry); AIM III performance rating, maximum user loads, and throughput; and McCalpin Copy, Scale, Sum, and Triad at 1 and 8 processors, listed as not available. Several cells are not applicable.)

References

1. W. Bowhill et al., "Circuit Implementation of a 300-MHz, 64-bit Second-generation CMOS Alpha CPU," Digital Technical Journal, vol. 7, no. 1 (1995, this issue): 100-118.

2. S. Saini and D. Bailey, "NAS Parallel Benchmarks Results 3-95," Report NAS-95-011 (Moffett Field, Calif.: Numerical Aerodynamic Simulation Facility, NASA Ames Research Center, saini@nas.nasa.gov, April 1995).

3. J. Dongarra, "Performance of Various Computers Using Standard Linear Equations Software," Document Number CS-89-85, available on the Internet from Oak Ridge National Laboratory, netlib@ornl.gov, April 13, 1995.

4. Z. Cvetanovic and D. Bhandarkar, "Characterization of Alpha AXP Performance Using TP and SPEC Workloads," Proceedings of the 1994 International Symposium on Computer Architecture: 60-70.

5. J. Nicholson, "The RISC System/6000 SMP System," COMPCON '95, March 1995: 102-109.

Biographies

David M. Fenwick
Dave Fenwick is the AlphaServer 8000-series system architect. As leader of the advanced development group and of the design team, he has been responsible for definition of the product and its characteristics, and for the system implementation. Dave moved from Digital's European Engineering organization in 1985 to join the U.S.-based VAXBI program and subsequently was processor architect for the VAX 6000 vector processor. A consulting engineer, he holds 3 major U.S. patents and has 13 patent applications pending. He received an Honours Degree in electrical and electronic engineering from Loughborough University of Technology, United Kingdom.

Stephen R. VanDoren
In 1988, Steve VanDoren came to Digital to work with the VAX 6000 vector team. He later joined an advanced development team responsible for evaluating system technology requirements for what would become the AlphaServer 8000 series of products. During the AlphaServer project, he led the design of the address interface on the TLEP processor module. He is listed as a coinventor on 10 patents filed on the AlphaServer 8000-series architectural features. Steve is currently working on new server processor designs. He is a member of Eta Kappa Nu and Tau Beta Pi and holds a B.S. degree in computer systems engineering from the University of Massachusetts.

Daniel Wissell
Consulting engineer Dan Wissell has more than 20 years of computer industry experience in analog and digital circuit design and test. While at Digital, he has worked on the VAXcluster and DEC 7000/10000 systems development teams, and more recently he contributed to the AlphaServer 8000-series design effort. He is recognized within Digital as an expert in the areas of distributed power systems, on-module energy management, and high-speed clock systems. Dan holds three patents and has filed several patent applications for his work on current and future Digital products. He has degrees in engineering from Kean College and the Milwaukee School of Engineering.

Jean H. Basmaji
Iby R. Fisher
Frank W. Gatulis
Herbert R. Kolk
James F. Rosencrans

Digital's High-performance CMOS ASIC

A high-performance ASIC has been developed to serve as the interface for the 10-ns bus in the new AlphaServer 8000 series server systems from Digital. The CMOS standard-cell alternative (CSALT) technology provides a timing-driven layout methodology together with a correct-by-construction approach for managing the complex device physics issues associated with state-of-the-art CMOS processes. The timing-driven layout is coupled with an automated standard-cell design approach to bring the complete design process directly to the logic designer.

Today, high-performance microprocessors designed with complementary metal-oxide semiconductor (CMOS) processes are much more demanding on the support logic used to interface them to the rest of the system. Microprocessors, like Digital Semiconductor's Alpha 21164 chip, are extending the external logic cycle times to the point where custom integrated circuits are necessary to realize the full performance potential. The CMOS standard-cell alternative (CSALT) technology developed at Digital satisfies these high-performance needs without resorting to a complex, custom design process.

CSALT technology provides a timing-driven layout methodology together with a correct-by-construction approach for managing the complex device physics issues associated with state-of-the-art CMOS processes. The timing-driven layout is coupled with an automated standard-cell design approach to bring the complete design process directly to the logic designer. Using CSALT, logic designers can take their application-specific integrated circuit (ASIC) designs from a concept on their desktops to a completed layout that is ready for fabrication. Other design approaches address portions of the process, but the CSALT tool suite is complete and automated.
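The concept-to-layout process described here can be caricatured as a fully batch-driven loop in which post-layout timing failures feed the next placement pass. Everything in this sketch (function names, the toy delay model, the pass limit) is an illustrative assumption rather than the actual CSALT tool interfaces:

```python
def place_and_route(netlist, constraints):
    # Toy placement model: constrained (critical) nets get shorter wires.
    return {net: 1 if net in constraints else 3 for net in netlist}

def verify_timing(layout, limit=2):
    # Post-layout check: any net slower than the limit fails timing.
    return [net for net, delay in layout.items() if delay > limit]

def csalt_flow(netlist, max_passes=3):
    """Caricature of an automated timing-driven flow: failing paths from
    timing verification become constraints for the next place-and-route
    pass, with no manual layout step in the loop."""
    constraints = set()
    for _ in range(max_passes):
        layout = place_and_route(netlist, constraints)
        failing = verify_timing(layout)
        if not failing:
            return layout  # a timing-correct layout, ready for fabrication
        constraints |= set(failing)
    raise RuntimeError("timing not met within pass limit")

layout = csalt_flow(["net_a", "net_b"])  # converges on the second pass
```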
Many ASIC vendors transfer the logic designs to a different set of engineers, using different tools and skills, to complete the physical implementation before post-layout timing analysis can take place. Any problems encountered after the layout tend to result in the design being returned to the logic designers. The artificial boundary erected between logic designers and layout implementers can result in delays. In complex designs, multiple iterations may be necessary before the design converges into an acceptable solution. This convergence process becomes more complicated with the introduction of synthesized logic, because the process is extended to include the synthesis tools.

CSALT's timing-driven methodology eliminates the need for the many chip layout specialists and ASIC vendor experts who normally complete a multichip project. In addition, the timing-driven methodology eliminates the need for the traditional chip floorplanning step in which the designer maps the logical

66 Digiral Tcchn~calJournal Vol. 7 No. 1 1995 design onto the physical chip arcliitecturc. The floor- Eliminate chip tloorplanning and let timing con planning step often becolnes a critical and time- straints drive the placement. consuming effbrr when the design is being optimized Eliminate manual interaction in the tools to reduce for performance. design time and defects. The automated and batch-driven CSALT nieth- Develop very conservative layout rules to eliminate odology can turn a logically complete design into the need for cross talk and electromigration analysis. a \\lorking, timing-correct chip layout within three computc-intensi\,c days. Previous platform deve.lop- Automate the de\iclopmcnt and characterization of ment projects used industry-standard ASICs, manual cell elements including thorough checking. layouts, and hundreds of manual cell placements to Deliver more robust and accurate prediction ofcliip meet the tight design timing requirements \vithin their performance through integrated SPICE simulation liigh-performance ASICs. These methods typically and expanded cell library perfor~nancetab1es.l added months to the layout phase of these projects. Use proven algorithms and software whenever CSALT's timing-driven layout was specifically devel- possible. oped to address thcsc I.iigli-pcrforrnance reqi~irements and to make tlie complete design process available to Overview and Description logic designers. This paper discusses some implementation pieces of The front-end logic design and verification process is CSALT techliology and elnphasizes the unique timing- based on the ASIC standard tools fbr design driven approach and results. It explains the goals that that include schematics capture, timing and logic veri- were establislicd fol- CSALT development as well as fication, pre-layout delay estimation, and post-layout several features of the physical technoloa. The paper delay feedback and analysis. 
The performance data for concludes with a discussion of the layout process oper- thc library elements is housed in lookup tables that ations and the process controller. have multiple slope/intercept data entries based on output drive loading as well as input edge rate delay The Need for CSALT correction factors. Unique delays are calculated for each cell instance. CSALT supports a low-skew bal- During the technology evaluation phase of the anced clock distribution net. Alphaserver 8000 series platform, various ASIC tech- The back-end layout tools for CSALT include sev- nology vendors wcrc evaluated and compared against eral internally generated tools as well as research tools the aggressive performance needs demanded by the from academia. The heart of the place-and-route platform's designs and thc customization that was process is TimberWolf from the University of necessary within these technologics to meet system IVashington.2 One of thc important features of the bits timing. Bascd on the experience of developing TimberWolf tools is their ability to be constraint- designs for the previous platform generation and due driven. These constraints are automatically generated to tlie anticipated months of iterative and interactive from the timing verification step and then passed to manual placc and route necessary to meet timing, it the TimberWolf tools. TimberWolf prioritizes these becatlie clear that teclinology was a high-risk item to critical path nets during the placement process in an the program. Requirenients for the AlpliaServer 8000 attempt to meet the timing requirenients. Constraints series systems exceeded the performance capabilities of can also be manually generated through a separate esisting ASIC technologies and the available CAD user-generated file that feeds into the process. Once tools. 
In addition, access to the internal silicon struc- parameter files and constraint files are established, the ture of the ASICs was required to customize bus inter- place-and-route process proceeds in a completely face drivers. The risk and cost of developing these automated and batch-driven mechanism all the way capabilities through worlting with ASIC vendors to a completely verified design layout file (DLF). The would have added months of valuable schedule time speed of the process execution is limited only by to the program. the batch queues available and the performance of the As a result, the decision was made to focus the effort underlying processor type. on CSALT technoloby and to move it from its advanced The silicol~fabrication process relies on Digital development stagc to a production-quality one. Given Semiconductor's CMOS line. A I the physical design the selection criteria that were emphasized, a set of and process fabrication rules are built into the layout goals was established for tlic (:SALT development: tools and driven through the parameter files specified at Incorporate ,In integrated timing-constrained startup. CSALT has built-in correct-by-constructio~~ driven placement. custom design rules that guarantee all aspects of the automated layout to be free from any design rule vio- Implement technology in a 3.3 volt (V) stable lation. The tools account for all aspects of the physical CMOS process.

Digital Technical Jo~~rnal Vol. 7 No. 1 1995 67 design, such as electromigration rules, coupling capac- 5.0-V Compatible UO Cells itance effects on timing, as well as analysis of any elec- CSALT arrays developed in Digital's hurth-generation trical hot spots resulting from excessive logic switching CMOS process are po\vercd by a 3.3-Vsupply for both in a dense localized arca. 1/0 and internal core. CSAL.T ASI<:s can receive but not send 5.0-V I/O. The input receivers for both the Physical Technology bidirectional and the inpnt-only cells lia\~transistor- transistor logic (TTL) input Ic\,els and car1 be used in The ASIC designs targeted for this technology needed either a 3.3-V or a 5.0-V signaling cn\!ironmcnt. The to meet the physical, electrical, and thermal require- CSALT PC1 interface cell meets the I'CI 5.0-V spccifi- ments of tlie PJphaSer\ler SO00 series platform. The cation, \vitliout requiring the csternal niod~~lctcrmi- system fi~nctionsthat the ASIC designs satisf) belong nation recornmended by niost ASIC vendors. to three classes: Performance-tuned Library Elements Class 1-Interface benvccn the system bus and The performance targets for tlie cell clcrncnts in the CI'U CSaT \\)ere determined from 3 number of sources. Class 2-Intcrf'lce benveen the CPU and the local First, previous ASIC designs, library pcrfor~nancc,and I/O heuristics were used to establish a baseline. The hcuris- Class 3-Interface bcnveen the local 1/0 and the tics of the number of cell logic Ic\~cIsbct\\~ec~i two state Peripheral Component I~itercon~iect(PCI) elements in the 1)EC 7000 platform designs \\Icrc ana- lyzed. Second, tlie fourth-generation ClMOS silicon An enhanced ASIC dcsign style was used to reduce process, electrical interconnect data, and transistor the time to market and to minimize dcsign and verifi- - properties were ~rsedto arrivc at nc\\l scaled cstilnatcs cation resources. The cnhancements to the convcn- based on unit load, cell timing, and intcrcol~nectdelay. 
tional ASIC design (such as timing- drive^^ layo~~tand Third, cycle tirncs and systeni sltc\vs of the target plnt- autoniated incorporation of SPICE delays) signifi- form were used to determine a new estimate of the cantly improved ease of design for high-performance, le\lels of logic that can be placed ben\lccn two state ele- 100-megahertz (MHz) very large-scale integration ments. The analyses rcsulted in thc generation of basc- (VLSI) chips.l line perfor~nancctargets that \vcrc 11sccl in the dcsign of Sc\leral fcaturcs of tlie CSAI-T physical teclinology an ASIC library tuned to cyclc 100K gates at 100 [MHz. and their advantages are discussed jn the follo\ving sections. Delay Calculation

CSALT post-layout timing analysis and net delay gen[...]

Low-skew Clock Distribution

There is one low-sk[...]

simulation. The results of SPICE simulation are then back-annotated into timing analyses, and the design is reanalyzed using SPICE accuracy for delays on critical path nets that had failed previously. This strategy allows us to time designs quickly with the accuracy of SPICE where needed.1

SPICE Library Characterization

The entire cell timing data set and cell performance tables are generated automatically through a suite of automated tools called SPICE Library Characterization (SLiC). SLiC's automated procedure will create SPICE input files to fully characterize a library of CSALT cells, execute SPICE on these files, and post-process the results into a format readable by the timing tools.1

For cell delay slopes and intercepts, the SLiC process produces delay tables for each input-to-output path combination through each library macrocell. This is done by simulating in SPICE with 11 discrete output capacitance values attached to the cell output. The total range of loads is broken into four windows, and a best-fit line through each window is determined. Each is then translated so that all discrete points within the window fall on or below this line (for worst-case parameters) or above this line (for best-case parameters). This translation is one mechanism for ensuring timing conservatism. Figure 1 shows the CSALT library performance approximation.

For edge-rate effect on delay, SLiC measures output edge rates for each of the 11 capacitance values attached to each output cell described above and stores them. In addition, SLiC creates ten simulations for each input-to-output path through each library macrocell to model the range of input edge rates that the macrocell is expected to see. These two sets of data are used to create (1) a table of delay additives to gate propagation delay as a function of input edge rate and (2) a table of output edge rates as a function of gate propagation delay. These tables are then used during the timing analysis step.

The last component of delay, wire transit delay, is the only one not determined by SLiC. During layout, the bounds on the transit delay through every net are calculated. These bounds are generated very quickly and are quite accurate for short and lightly loaded nets. For longer, more heavily loaded nets, the calculated bounds are more conservative.3 This conservatism contributes to inaccuracy in path timing and is the primary reason why another methodology was developed for determining more accurate delays with SPICE.1

This alternative methodology for calculating delay has been verified through comparisons of thousands of path delays with SPICE. In all cases, the timing was found to be conservative. Sixty-five percent of all calculated delays are within 10 percent of SPICE prediction, and virtually all delays are within 20 percent. This methodology complements the power of the fast turnaround time of static timing analysis tools by modeling the delays more accurately and closely to SPICE prediction. Large chips can be analyzed in less than one hour and be fully timed in a few hours if any SPICE simulation of large nets becomes necessary.3

Constraint Generation Overview

After each timing verification run, a report is generated listing all paths that fail and detailing all nets and primitives within each of these failing paths. This information is then iteratively processed through an algorithm to shorten each net in the path proportionately to its original length in the path, such that it satisfies the allowable time requirement. First, the allowable total wire delay in a path is calculated in picoseconds:

W = CycleTimeLimit - CellPrimitiveDelay - SetupTime

where W is the total allowable wire delay for an individual failing path, CycleTimeLimit is the cycle time that the failing path needs to meet, CellPrimitiveDelay is the intrinsic delay through all the primitives that exist in the failing path, and SetupTime is the setup time required by the state element that ends the failing path.

Then every net in the path is apportioned according to its contribution in the current (failing) total wire delay:

NetNewDelay_i = W * (NetFailingDelay_i / ActualPathWireDelay)

Figure 1
CSALT Library Performance Approximation
[Plot of delay versus LOAD. Key: solid line, best-fit maximum delay parameter; dashed line, best-fit minimum delay parameter; asterisk, SPICE simulation.]

where NetNewDelay_i is the allowable delay on a particular net in a failing path, NetFailingDelay_i is the actual delay on a net within a failing path, and ActualPathWireDelay is the total accumulated wire delay of all nets in the failing path.

Since wires can be shared by more than one failing path, a change in the length of a wire in one path will cause other paths that have the same wire as an element to be scheduled for recalculation. A wire length may change several times before it is stable. During recalculation, the smaller wire produced is the one that will be used. This iteration algorithm continues until no nets are scheduled for recalculation, and convergence is achieved. The number of iterations can be limited if convergence is not achieved in a timely manner. At completion, the NetNewDelay values are then converted into wire lengths:

NetLength = (NetNewDelay / SlopeOfDriver - GateLoad) / CapUL

Figure 2
CSALT Die Architecture

where NetLength is the calculated net constraint in unit length, SlopeOfDriver is the slope of the cell driving the failing net in unit time per capacitance, GateLoad is the sum capacitance of all cells tied to that net, and CapUL is the capacitance per unit length for interconnect metal.

NetLength is then compared to a quench value, and the larger of the two is used as the new net constraint fed back to layout. Quench values define the minimum wire length that a net can have, based on the number of pins (fan-out) in that net.

Physical Die Architecture

The CSALT die architecture, as shown in Figure 2, consists of the following sections:

I/O cells: The outermost region, where the I/O cells are located, is also called the pad ring. Bonding pads are built into the I/O cells.

High-power and decouple cells: This region of the array, also called the high-power ring, is filled predominantly with decouple cells. This region also allows for placement of a limited number of high-power driver cells designed to drive heavily loaded nets such as clock lines and reset lines.

Core: This region holds the majority of the logic in the array, implemented as standard cells. All these cells are the same height but vary in width according to functional complexity. Core cells are arranged in rows numbered from the bottom of the array. The number of rows in the core is a design-dependent variable. The space between the rows varies from row to row and is used for routing channels. Generally, power to the core is distributed by cell abutment on metal 3 over the cell rows. Horizontal signal routing in the core channels takes place on metal 2. Metal 1 is used for vertical core routing. To route in the vertical direction, the rows contain feeds. By design, many standard cells have vertical feeds to provide pass-through. In addition, a standard feed cell can be automatically inserted by the layout tools when the demand for feeds is high. I/O bristles for each of the core cells are made available on the top and bottom of the cells to enhance routability.

Trunk: The region splitting the core into left and right halves is referred to as the trunk. The trunk is a routing region used primarily to route clocks and power signals down the center of the core. These signals are then distributed to the left and right sides of the core on a row-by-row basis.

Ring: Although not indicated in Figure 2, the term ring refers to the I/O ring, the mini-moat, and the high-power ring regions as a group. Even though the physical size of the ring is fixed, the total dimensions are determined by the package size of the array. The size of the ring establishes the available area remaining in the center of the array for the moat and the core.

Mini-moat: The mini-moat is the region separating the I/O ring from the high-power ring. The layout process uses this region to route a small number of high fan-out nets that drive cells in the I/O ring. Layout parameters control the assignment of nets to the mini-moat.

Moat: The moat is a routing region used by the layout tools for attaching the ring to the core. The size of the moat is determined by the amount of space that is left over when the core and trunk are placed and routed. Small arrays (low gate counts) result in small cores and large moat areas, and large arrays (high gate counts) result in large cores and small moat areas. During the layout of large chips, it is possible for the core to become so large that not enough moat space remains to make all the necessary routing connections.

Figure 3 is a photomicrograph of a CSALT die for one of the CPU gate arrays used in the AlphaServer 8000 series server systems.

Placement and Routing

The function of the layout tool suite is to provide a fully placed and routed array that meets or exceeds all the design timing criteria and that satisfies all electrical and physical layout rules required to release the array for mask generation. Several place-and-route features are discussed in the following sections.

Constraint-driven Layout System

When an array is submitted to layout, it is accompanied by a set of timing constraints. Timing constraints can be thought of as estimated restrictions, on a per-net basis, on the amount of metal length allowed to interconnect the net in the layout. These constraints drive the Timberwolf placement tool and are ultimately responsible for the placement of core cells in the final layout. Because a working design may not be achieved on the first layout iteration, the overall CSALT methodology provides mechanisms for analyzing post-layout timing delays and for generating a refined set of constraints that can be fed back into the layout for another pass. The layout process is iterated in this manner until it converges on a layout solution that meets the timing constraints.

Routability

To ensure 100 percent routing, the routing process had to be kept simple, which required substantial planning during development of the chip architecture described above. As a result, the following elements of the architecture were defined: (1) Pins are available on both the top and the bottom of the cells; (2) Power and clock connections are defined by cell abutment; and (3) Total routing of the chip is divided into four areas (the core, the moat, the ring, and the megacell interface). This plan kept the routing problems similar from chip to chip, which allowed the routing tools to focus on particular solutions.

Quick Turnaround Time

One significant feature of the CSALT layout process is that it can complete a layout without manual intervention, saving time over manual processes. CSALT consistently demonstrated that the CAD suite can provide completed layouts in three to ten days from the time the wirelist enters the layout process. An array that has been in layout for ten days is likely to be one that is difficult to time and that has required four to six layout iterations to converge.
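The constraint-generation loop described above (compute the allowable wire delay W for each failing path, apportion it across the path's nets, convert each delay budget to a metal length, apply the quench minimum, and repeat until no net is rescheduled) can be sketched in Python. This is a simplified model, not the CSALT tools themselves; the dictionary-based path and net records and the names slope_of_driver, cap_per_unit_len, and quench are illustrative assumptions.

```python
# Sketch of CSALT-style iterative wire-length constraint generation.
# The path/net records and parameter names are illustrative; the real
# tools consumed timing-verifier reports and fed constraints back to layout.

def allowable_wire_delay(cycle_time_limit, cell_primitive_delay, setup_time):
    """W: total wire delay (ps) that a failing path may still contain."""
    return cycle_time_limit - cell_primitive_delay - setup_time

def constrain_paths(failing_paths, quench, max_iters=20):
    """Iterate until every net constraint is stable (convergence), or stop
    after max_iters if convergence is not achieved in a timely manner."""
    lengths = {}                                   # net name -> length bound
    for _ in range(max_iters):
        changed = False
        for path in failing_paths:
            w = allowable_wire_delay(path["cycle_time_limit"],
                                     path["cell_primitive_delay"],
                                     path["setup_time"])
            actual = sum(n["failing_delay"] for n in path["nets"])
            for n in path["nets"]:
                # Apportion W to each net by its share of the failing delay:
                # NetNewDelay_i = W * (NetFailingDelay_i / ActualPathWireDelay)
                new_delay = w * n["failing_delay"] / actual
                # Convert the delay budget to a metal length, assuming
                # delay = slope * (gate_load + cap_per_unit_len * length).
                length = (new_delay / n["slope_of_driver"]
                          - n["gate_load"]) / n["cap_per_unit_len"]
                # Never constrain below the quench value for this fan-out.
                length = max(length, quench[n["fanout"]])
                prev = lengths.get(n["name"])
                if prev is None or length < prev:   # shared nets keep the
                    lengths[n["name"]] = length     # smaller (tighter) bound
                    changed = True
        if not changed:                             # convergence achieved
            break
    return lengths
```

A wire shared by two failing paths keeps the smaller of the two computed lengths, mirroring the recalculation rule described for shared wires.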

Cross-talk Effect Inclusion

In recent generations of ASIC technologies, interconnect metal widths and pitches have been shrinking while the clock frequencies have been on the rise. This raised some concerns about on-chip cross-talk effects due to the ability of signals traveling on one wire to affect the speed of signals traveling on adjacent, or victim, wires. In extreme cases, this cross talk can cause signals to spike on the victim wires. CSALT methodology compensates for such effects on wire delay calculation, and the compensation is integrated into the layout process.

The integration is implemented by factoring in a coupling capacitance extracted from layout and by using a worst-case signal-switching scenario. Conservative factors were chosen after analysis of cross talk on a representative cross section of CSALT layouts, using different routing pitches on signal interconnect metal 2. The goal was to find the right balance between metal pitch, area used, and chip timing. The study resulted in an optimum pitch definition of 3.75 micrometers (um) for metal 2, and a coupling capacitance multiplier of 2. The core area increase, from that of using the minimum pitch for metal 2 at 2.625 um, was less than 10 percent for the largest design. However, the overall die area increase was negligible due to the designs being I/O-ring-limited in nature.

Figure 3
Photomicrograph of a CSALT Die

Free of Electromigration

When the current density in the aluminum interconnect used in today's high-density CMOS processes is too high, a detrimental physical phenomenon occurs. This phenomenon causes metal reliability problems [...] metal routes such as clock nets and other high fan-out nets. The bulk of the power distribution is achieved by cell abutment. Cell power rails are conservatively designed to handle the largest row's current demand.

As a final check on correctness, one of the layout process steps incorporates a hot row tool. This tool flags any row in the core whose cells collectively exceed a predetermined current threshold defined by the handling capability of the power cells in the design. [...] the tool assumes all logic is switching at maximum frequency.

Correct-by-construction Concept

As it applies to all the critical device issues (for example, electromigration, cross talk, hot carrier injection, and latchup), acceptance of the concept of a layout being correct by construction has dramatically reduced turnaround time in the layout process by eliminating the need to perform these analysis operations on each array. Why does it work for CSALT? It wor[...] floorplanning for the trunk size [...] and any random-access memory (RAM) devices takes place; timing constraints from several sources (pre-layout, user defined, current layout, and previous layouts) are merged into a worst-case set of composite constraints that are used by the Timberwolf tool to place and globally route the core. Also during this step, the balanced clock system and scan chain are synthesized and globally routed. The SCAR channel router is then used to route the standard-cell portion of the core. If the design contains RAMs, they are then placed in the rows. This inform[...]

Capacitance information from the layout file is extracted, and the process proceeds to calculate a conservative metal delay (including compensation for cross talk) for each net in the layout. These post-layout metal delays are fed back into the timing verification process. In addition, SPICE is run on clock nets and other critical nets predefined by the user, and the delays are made available for timing analyses.1
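The post-layout delay feedback just described can be sketched as follows, assuming a simple linear delay model. The coupling multiplier of 2 is the factor cited in the cross-talk discussion; the parameter names and the lookup of back-annotated SPICE delays for critical nets are illustrative assumptions, not the actual CSALT tool interfaces.

```python
# Sketch of a conservative, cross-talk-compensated net delay calculation
# with optional SPICE back-annotation for critical nets. The linear delay
# model and all field names are illustrative.

COUPLING_MULTIPLIER = 2.0   # worst-case switching on adjacent (victim) wires

def net_delay_ps(slope_ps_per_ff, gate_load_ff, ground_cap_ff, coupling_cap_ff):
    """Conservative metal delay: coupling capacitance extracted from layout
    is counted at twice its value to cover worst-case cross talk."""
    effective_cap_ff = ground_cap_ff + COUPLING_MULTIPLIER * coupling_cap_ff
    return slope_ps_per_ff * (gate_load_ff + effective_cap_ff)

def timing_delay(net, spice_delays_ps):
    """Prefer a back-annotated SPICE delay for predefined critical nets;
    otherwise fall back to the conservative model."""
    if net["name"] in spice_delays_ps:
        return spice_delays_ps[net["name"]]
    return net_delay_ps(net["slope"], net["gate_load"],
                        net["ground_cap"], net["coupling_cap"])
```

Doubling only the coupling term, rather than all capacitance, keeps lightly coupled nets from being over-penalized while still covering the worst-case switching scenario on heavily coupled ones.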

The entire process runs automatically and, when possible, steps are run concurrently. Manual intervention is unnecessary and is discouraged; it is used only during tool debugging or special customization.

Process Controller

CSALT's fully automated process controller (PC) ensures optimum use of system resources, orchestrates the entire layout process, provides all necessary data management functions, and provides the user with a very simple set of commands for operating an otherwise complex process.

The power of the PC is in its continuous dynamic decomposition of every layout into parallel batch streams. The PC runs the entire layout process in batch mode, taking full advantage of opportunities to use multiple processors and run independent parts of the layout process in parallel streams. Because it hides all the CAD and process complexity from a user, no previous CAD or layout skills are required to iterate layouts once initial layout parameters have been established for a given array.

Figure 4
CSALT CAD Layout Overview

Figure 5 illustrates the flow for the CSALT layout process. (A and B indicate connectivity points.) Each name in the process flow represents the name of a single process step. Execution of the layout process is controlled by a single command procedure called CSALT place and route (CPR). Although other CPR command line options exist, CPR is most often used in its simplest and most powerful form, CPR PC. This command causes CPR to invoke the dynamic PC that will analyze the current relationships of a layout database and begin automatic execution of the layout process from the next eligible process step.

[...] dimension information, relative coordinates of bristles, and bristle names. The ring and core cell outlines are now replaced with the actual cell layout information from the library, and a complete design layout file for the chip is produced. As the file is generated, alignment block and substrate ring data is added to make the physical representation file ready for mask set production.

5. The following procedures occur during the verification phase of the layout: Physical design rules are checked. A wirelist comparison is performed to ensure that the layout file is an exact match to the logical representation of the design.

Results and Conclusion

CSALT gate array technology was used extensively during the development of the AlphaServer 8000 server systems. This design methodology removed the product's critical dependency on the place-and-route portion of the design process. As a result, timing-correct ASIC layouts were produced in fewer than 72 hours. In addition, the CSALT ASIC logic designers had access to the proven 3.3-V silicon structures

Digiral Tcchnic.~lJournal \I1 No I I >o PREP GWLHACK CLOCKCHK UNBUNDLE

I ADS-POP I

SMALLPLOTBLOCK

I TOPAZMERGE NLS T I

MERGE-PATHS I I LEASH AUGPATHS I

I SPWMAX SPRMAX TR2 TOPAZ2 HLR (PG) I

Figure 5
CSALT CAD Process Flow Diagram
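The process controller's continuous decomposition of a layout into parallel batch streams amounts to dependency-driven scheduling: at any moment, every step whose prerequisites are complete may run concurrently. The sketch below is a minimal model of that idea; the dependency graph is invented for illustration, and the step names are borrowed from Figure 5 only as labels.

```python
# Minimal sketch of dependency-driven step scheduling in the spirit of the
# CSALT process controller. The step names and dependency graph are
# illustrative labels, not the actual CSALT flow.

def parallel_waves(steps, deps):
    """Group step names into waves: every step in a wave has all of its
    prerequisites satisfied by earlier waves, so the steps within one wave
    can run as concurrent batch streams."""
    done, waves = set(), []
    while len(done) < len(steps):
        wave = sorted(s for s in steps
                      if s not in done and deps.get(s, set()) <= done)
        if not wave:
            raise ValueError("cyclic dependency in flow")
        waves.append(wave)
        done.update(wave)
    return waves
```

Each wave can then be handed to a pool of workers (for example, Python's concurrent.futures.ThreadPoolExecutor), which is the analogue of the PC running independent process steps as parallel batch streams.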

developed by Digital Semiconductor. CSALT's timing-driven layout approach for designing and implementing a high-performance ASIC made the AlphaServer 8000 server systems' aggressive 10-nanosecond bus speed a reality, with minimum [...]

Although CSALT's implementation pieces may not be unique, the approach that was taken to link the set of front-end design tools with the back-end layout has proven to be unique, with unmatched results. No ASIC vendor today (January 1995) can provide logic designers with the ability to do their own automated, timing-correct layouts from their desktops.

In less aggressive designs, a large number of working layout solutions exist. The number of these solutions starts to shrink when the technology is more aggressively used. Iterative timing-driven layout efficiently searches through the matrix of possible solutions to find a work[...]

Biographies

Jean H. Basmaji
Jean Basmaji is a hardware consulting engineer in the Server Platform Development Group. As the CSALT development project leader and ASIC technology manager, he was responsible for the transition of the CSALT technology project from advanced development to production. Jean has been the technical director of computer-aided engineering and design verification testing

[...] science (magna cum laude, 1978) from Boston University Metropolitan College.

Acknowledgments

The authors would like to acknowledge the efforts of the following people, without whom the project would not have been successful: Meaghan Engel[...], Dave Caffo, and Paul Janson. We would also like to acknowledge other members of the CSALT team: Linda Greska, Kevin Gamache, Dennis Linvinetz, Bill Gist, Dave Vanderbeck, John Drasher, John [...].

References

1. [...] Department of Electrical Engineering and Computer Sciences, University of California at Berkeley.

2. Timberwolf Mixed Macro/Standard Cell Floorplanning and Routing Package is a public domain automatic layout package. Developed and directed by Professor Carl Sechen from July 1986 through June 1992 at Yale University and later at the University of Washington, it is currently maintained and supported by TimberWolf Systems, Inc., Dallas, Texas.

3. J. Rubinstein, P. Penfield, and M. Horowitz, "Signal Delay in RC Tree Networks," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, CAD-2 (3) (July 1983): 202-211.

4. D. Fenwick, D. Foley, W. Gist, S. VanDoren, and D. Wissell, "The AlphaServer 8000 Series: High-end Server Platform Development," Digital Technical Journal, vol. 7, no. 1 (1995, this issue): 43-65.

Frank W. Gatulis
As supervisor of the CAD segment of the CSALT technology team, consultant engineer Frank Gatulis is responsible for defining and implementing automated layout tools and processes. A graduate of Don Bosco Technical Institute, he also attended the B.S.E.E. program at Northeastern University. Frank has also supervised the I/O diagnostic teams for DECsystem-10 and DECsystem-20 products as well as the development of the console-based fault detection system and automated isolation tool suite used on the VAX 8600 and VAX 8650. His architecture segment strategies were used to fabricate and test multichip units in VAX 9000 systems. Before joining Digital in 1973, Frank worked at EG&G, where he developed real-time picture acquisition, processing, and compression software.

Herbert R. Kolk
A consulting software engineer who specializes in chip routing, Herb Kolk currently supports and enhances the CSALT layout process. Prior to that, he was the architect of the Chameleon router, which was used for routing gate arrays, multichip modules, and CPU boards for the VAX 9000. This router is still being used in the CSALT process. Before joining Digital in 1983, Herb designed software systems for the Communications Systems Division of GTE. Herb received a B.S.E.E. (with honors) from the Rochester Institute of Technology in 1973.

James F. Rosencrans
Jim Rosencrans is a principal hardware engineer in the AlphaServer Engineering Group. As the ASIC technologist for server system development, Jim supports the AlphaServer 8000 series design teams and acts as liaison for both the teams and test engineering. In addition to leading the CSALT micropackaging definition and design, he contributed to the development of CSALT technology. His previous work includes contributions to ASIC and ASIC/silicon technology selection, definition, and design support. Before joining Digital in 1988, Jim was with NCR Microelectronics, where he worked on custom and CMOS ASIC design and silicon process development. A member of IEEE, Jim received a B.S.E.E. from the University of Wisconsin in 1980.

Nitin D. Godiwala
Barry A. Maskas

The Second-generation Processor Module for AlphaServer 2100 Systems

The second-generation KN470 processor module for AlphaServer 2100 systems performs significantly better than the first-generation KN460 module and was designed to be swap-compatible as an upgrade. The KN470 processor module derives its performance improvements from the enhanced architecture of Digital's new Alpha 21164 microprocessor, the synchronous design of the third-level cache and system interface, the implementation of a duplicate tag of the third-level cache, and the implementation of a write-invalidate protocol for the multiprocessor system bus. Additional design features such as read-miss pipelining, system bus grant parking, hidden coherence transactions to the duplicate tag, and Alpha 21164 microprocessor write transactions to the system bus with back-off and replay were combined to produce a higher performance processor module. The scope of the project required implementing functionality in system components such as the memory, the backplane, the system bus arbiter, and the I/O bridge, which shipped one year ahead of the KN470 module.

The second-generation KN470 processor module for AlphaServer 2100 systems achieves a higher performance than the first-generation KN460 module while maintaining compatibility with the AlphaServer 2100 system environment. This paper describes the processor module project and the resulting design. Topics discussed are the elements that contribute to the compatibility and to the higher performance: coherence protocol, system bus protocol, system bus arbitration, system interface and shared data, and clocking. Some key design trade-offs are described. The paper concludes with a performance summary that presents measured attributes of the higher performance KN470 processor in the context of the AlphaServer 2100 family.

When the AlphaServer 2100 product family was being defined in late 1992, the processor module performance-over-time roadmap projected three performance variations based on increasing the clock rate of the microprocessor.1 These modules were to be compatible with Digital's mid-range multiprocessor system bus and would support enhanced functionality such as direct-mapped I/O, up to four microprocessors, an I/O bridge to 32-bit Peripheral Component Interconnect (PCI) and Extended Industry Standard Architecture (EISA) buses, and an I/O expansion option module with an I/O bridge to a 64-bit PCI bus.2 Two members of the DEC 4000 processor design team were assigned to deliver this first-generation processor module. At this time, there was no goal to develop a second-generation processor module. Therefore, the remainder of the team designed the arbiter chip and the enhancements required in the processor-module system interface chips and at the same time contributed to the Alpha 21164 microprocessor development effort.

Goals for contributions to the Alpha 21164 microprocessor development effort were partitioned into short- and longer-term goals. A short-term goal was to define a system for the new Alpha microprocessor.3 The related longer-term goal was to ensure that the Alpha 21164 microprocessor could operate in that defined system. An architectural study resulted in a proposal and a project plan to develop a second-generation processor module that extended the performance and longevity of the AlphaServer 2100 family. In addition, the remainder of the team made requests of the Alpha 21164 microprocessor team to incorporate specific legacy-related AlphaServer 2100 functions, such as support for 32-byte cache blocks, control of I/O address space read merging, and completion of memory barriers on the Alpha 21164 microprocessor. The business management team accepted the proposal and the project plan. The Alpha 21164 microprocessor team agreed to support the functionality requests. The design team staffing was completed by March 1993, and detailed design work began in May 1993. The design team's goal was to have a processor module ready to accept the Alpha 21164 microprocessor for installation when the microprocessor first became available. The team met this goal.

Since the first- and second-generation processor modules would operate in the same enclosure and with the same power supply, the size and shape (i.e., form factor), cooling demands, and power consumption of the new module had to be compatible with those of the first-generation module. Because of the presence of an on-chip, write-back second-level cache and an estimated longer access time to that cache from the system bus, the Alpha 21164 microprocessor architecture adopted an invalidate-on-write cache coherence protocol. The Alpha 21064 microprocessor supported an off-chip, write-back second-level cache that has a faster access time from the system bus. This faster access time enabled the implementation of a good-performing update-on-write cache coherence protocol. Support of these snooping, multiprocessor system bus coherence protocols required enhancements to the system bus transaction types. This resulted in minor logic changes to the memory interface chips and to the I/O bridge chip.4,5 These changes were defined and implemented in time for the first-generation system power-on. Hence, the system components, the I/O bridge chip, the memory modules, and the system bus and backplane are compatible with the first- and second-generation processors. This basic difference in the system bus coherence protocols prevented the system from supporting the coexistence of the first- and second-generation processor modules because such a configuration has asymmetric attributes. Alpha operating system software does not support asymmetric multiprocessing; symmetry is assumed.

Another project goal was to maintain the AlphaServer 2100 family's position among the industry's leading high-performance server systems. This goal was achieved by exploiting the Alpha 21164 microprocessor's performance through the design of the processor module's third-level cache, by implementing a full duplicate tag of this cache, and by implementing a synchronous clocking scheme. Combining the processor design attributes with a pipelined read transaction of a faster read-access system memory module enabled the team to achieve the project's goal of designing a higher performance processor module and multiprocessor system.

Overview of the Processor Module

The KN470 processor module provides an operational environment for the Alpha 21164 microprocessor. This environment, which is similar to that of the first-generation KN460 processor module environment, includes the following:

The Alpha 21164 microprocessor: a superscalar, superpipelined implementation of the Alpha architecture with low average cycles per instruction because of its four-instruction issue

B-cache: a module-level, or third-level, write-back cache

System interface: two application-specific integrated circuit (ASIC) chips that interface the Alpha 21164 microprocessor, B-cache, and duplicate tag store to the system bus

Duplicate tag store: a tag store of the third-level write-back cache

System bus clock repeater that provides system bus synchronous clocks to the module

System bus arbiter that determines which system bus node can access the system bus

Serial control bus subsystem that includes clock and reset control circuitry, a [...] with a serial interface, serial read-only memory with power-on firmware bits, and nonvolatile memory for processor configuration parameters

Figure 1 shows a block diagram of the KN470 processor module.

The Alpha 21164 microprocessor is organized with an on-chip 8-kilobyte (KB) primary instruction cache and an 8-KB write-through data cache, which are referred to as first-level caches. In addition, a 96-KB, second-level, three-way, set-associative write-back cache is implemented on the chip.

The module design includes a B-cache, or third-level cache, to minimize the miss penalty and to be configurable through the use of various densities of similarly packaged static random-access memory (RAM) chips. Such a design enabled final product definition late in the verification process based on static RAM costs and delivered performance from the B-cache. The size of the B-cache is either 1, 2, 4, 8, or 16 megabytes (MB). Each B-cache entry stores 32 bytes of data and the associated tag bits and is called a cache block. To facilitate read-fill data and victim-write data exchange with the system interface, the Alpha 21164 microprocessor and the system interface share the B-cache data port. The B-cache is controlled by the Alpha 21164 microprocessor, which has its second-level cache configured
78 Digital Technical Journal Vol. 7 No. 1 1995

[Figure 1 diagram: the Alpha 21164 microprocessor, the B-cache (1, 2, 4, 8, or 16 MB), the two bit-slice system interface ASICs, the system bus arbiter, and the clock repeater, connected to the system bus.]
Figure 1 Block Diagram of the KN470 Processor Module

to operate in 32-byte instead of 64-byte cache block mode. This 32-byte mode of operation for the second-level cache was the most complex request made of the Alpha 21164 microprocessor design team. However, this design element was required to achieve the AlphaServer 2100 compatibility goal.

The system interface is a common boundary between the system bus, the Alpha 21164 microprocessor, and the third-level cache. The system interface provides the protocol and circuitry for the Alpha 21164 microprocessor to read or write devices connected to the system bus. Conversely, the system interface provides the protocol and circuitry for the processor module to respond to read or write transactions from the system bus. The system interface comprises two identical bit slices of an ASIC. The ASICs operate as even and odd slices, based on a mode-select pin on the module. The system interface selects the operating mode of the arbiter chip. It also encodes the system bus transaction type as read or write and then supplies a control signal to the arbiter chip. The arbiter chip must know the present system bus transaction type to remain synchronized with the system bus events and to know when to sample new requests for the system bus.

The module maintains a duplicate copy of the tag control bits of each B-cache block in the duplicate tag store. The duplicate tag store is controlled by the system interface and is time multiplexed between system bus requests and Alpha 21164 microprocessor requests. This ability to pipeline transactions to the duplicate tag store from the Alpha 21164 microprocessor and the system bus allowed the Alpha 21164 microprocessor's requests to fill predictable time slots in parallel to the system bus transactions, hidden from the system bus. This is called cycle-stealing because the coherence transactions requested by the Alpha 21164 microprocessor do not require arbitration for or use of system bus cycles. Cycle-stealing provided more useful system bus bandwidth while at the same time reducing the Alpha 21164 microprocessor's latency for coherence transactions to the duplicate tag store.

The clock repeater chips generate complementary metal-oxide semiconductor (CMOS)-level clocks from positive emitter-coupled logic (PECL)-driven
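The time-multiplexed duplicate tag store described above can be pictured as a fixed slot schedule in which CPU coherence lookups occupy predictable slots and never arbitrate for the bus. A hedged sketch; the slot layout and names are invented for illustration, not the actual KN470 timing:

```python
# Hedged sketch of cycle-stealing: the duplicate tag store alternates
# between system bus lookups and CPU coherence lookups, so the CPU's
# lookups consume no system bus cycles. Slot layout is illustrative.

def schedule(cycles):
    """Alternate duplicate-tag-store slots: even = bus, odd = cpu."""
    return ["bus" if c % 2 == 0 else "cpu" for c in range(cycles)]

slots = schedule(6)
assert slots == ["bus", "cpu", "bus", "cpu", "bus", "cpu"]
# Three CPU lookups completed without using any system bus cycles:
assert slots.count("cpu") == 3
```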

backplane clocks. These CMOS-level clocks are skew regulated and distributed to the module's components. The Alpha 21164 microprocessor has digital-lock-loop circuitry, which aligns the Alpha 21164 microprocessor's interface clock to the reference clocks that run to all other module components. This scheme is basically the synchronous clocking scheme.

The module includes the system bus arbiter chip. The decision to locate this chip on the processor module instead of the backplane stemmed from concerns over compatibility between the first- and second-generation processor arbitration algorithms supported by this chip. The system bus arbiter chip was designed and fabricated for the first-generation processor and included the functionality of the second-generation processor. The chip design was completed prior to the design of the KN470 processor module's system interface. To minimize the chance of design error, the team performed extensive simulations to help the project realize a full-function, second-pass chip for use in the first- and second-generation processor modules.

The duplicate tag store and the caches track each block's coherence state, encoded as VSD (valid, shared, dirty) bits:

1. VSD = 000 A cache block is either empty or removed from the B-cache and hence invalid.

2. VSD = 100 A cache block is the only cached copy in the system.

3. VSD = 101 A cache block is valid, and this copy has been modified more recently than the copy in memory.

4. VSD = 110 A valid cache block may also be in another cache. This processor must broadcast write modifications of this block to the system bus.

5. VSD = 111 A valid cache block may also be in another cache, and the copy of this block has been modified more recently than the copy in memory.

Cache state transitions are synchronized to system bus transaction cycles because the system bus is the common point of coherence and coherence conflict resolution.

When the Alpha 21164 microprocessor requests a transaction for a cache block to be read or filled from system memory into the B-cache and on-chip caches, the cache block state is set to VSD = 100 in the duplicate tag store, the B-cache, and the on-chip caches. The first-level instruction and write-through data caches must maintain only a valid bit. The second-level write-back cache must maintain the VSD bits consistent with the B-cache and the duplicate tag store.

If an Alpha 21164 microprocessor's read transaction request is of the type intent-to-modify, then the system interface does not allow an update of write data from the system bus but instead invalidates the block. Invalidation requires the duplicate tag store, the B-cache, and the on-chip caches to clear their V state. Implementation of system bus write transactions that cause block invalidation is required to support the cache coherence protocol of the Alpha 21164 microprocessor.

By filtering out system bus transactions that do not alter the coherence states of the Alpha 21164 microprocessor
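The VSD encodings and the invalidate-on-write behavior above can be summarized in a few lines of code. This is an illustrative sketch, not the actual duplicate tag store logic; the function names are invented:

```python
# Hedged sketch of the VSD (valid, shared, dirty) block states and the
# fill/invalidate transitions described in the text; names and structure
# are invented for illustration, not the actual KN470 duplicate tag logic.

INVALID      = 0b000  # empty or removed from the B-cache
EXCLUSIVE    = 0b100  # the only cached copy in the system
DIRTY        = 0b101  # modified more recently than memory
SHARED       = 0b110  # may also be valid in another cache
SHARED_DIRTY = 0b111  # shared, and modified more recently than memory

def fill_from_memory():
    """A read or fill from system memory installs the block as VSD = 100."""
    return EXCLUSIVE

def snoop(state, bus_write_hits_block):
    """A system bus write transaction to a block this module holds
    clears the V bit: the block is invalidated, never updated."""
    return INVALID if bus_write_hits_block else state

assert fill_from_memory() == 0b100
assert snoop(SHARED, True) == INVALID
assert snoop(DIRTY, False) == DIRTY
```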

Enhancement of Transactions

For the first-generation AlphaServer 2100 processor, the system bus protocol is the same as the one implemented in the DEC 4000 system. This system bus protocol is a snooping bus protocol in which all bus participants are required to monitor system bus transactions and to keep their cached copy of memory coherent. For the second-generation processor, the designers enhanced the DEC 4000 system bus protocol to support the write-invalidate cache coherence protocol of the Alpha 21164 microprocessor.

The DEC 4000 system bus protocol supports four types of transactions: read, write, exchange, and no operation. The exchange transaction performs a victim-write transaction to one memory location and a read transaction of another memory location. The two transactions are separated by the common lower 18 bits of address. The exchange transaction combines the read transactions and victim-write transactions into one transaction, sharing the address cycle of the system bus. The exchange transaction is used to evict modified cache blocks from the caches back to system memory to allow a replacement block with a different tag to be allocated.

To support the second-generation processor's write-invalidate coherence protocol, transactions were added to the first-generation system bus protocol. These added transactions were needed to signal other processors and the I/O bridge chip to invalidate a block when a block was being read for the purpose of being modified. The exclusive-read and exclusive-exchange transactions were added to the four first-generation transaction types. The exclusive-read transaction is the read transaction that also causes cache invalidation by a bystander processor module and the I/O bridge chip of the block being read. The exclusive-exchange transaction is the exchange transaction that also causes cache invalidation by a bystander processor module and the I/O bridge chip of the block being read.

The KN470 module implemented the exclusive transaction types to establish private ownership of a block. Establishing private ownership to a previously shared block enables write transactions to complete without having to broadcast write transactions back to the system bus.

Minimization of Arbitration Latency

The system bus arbiter implemented a bus grant parking or pregrant signaling scheme that minimized the arbitration timing overhead. This scheme, combined with the pipelining of the read-miss commands from the Alpha 21164 microprocessor, enabled the system bus interface to use the available memory bandwidth.

The arbiter for the first-generation processor followed the protocol used in the DEC 4000 system. The arbiter samples the requests and then issues the grants according to round-robin arbitration rules. The arbitration rules allow processor modules to have fair access to the system bus. The elapsed time from when a processor makes a system bus request to the arbiter until it receives a grant is referred to as the arbitration cycle or arbitration overhead. The arbitration overhead increases the memory and direct-mapped I/O access latency, as well as the cache-miss penalty. Typically, arbitration overhead for each processor appears low in a multiprocessor system in which bus utilization is extremely high. The appearance of low arbitration overhead results from the time the system bus waits to finish a transaction before the arbiter can issue the next grant. However, the arbitration overhead may be as high as 20 percent of the transaction time in a system configuration in which one processor module is consuming the available grants from the arbiter.

The arbiter used by the second-generation processor pregrants or parks a grant to the processor module whenever the system bus goes idle. This feature eliminates arbitration overhead. The result is a lower miss penalty and an ability to sustain a continuous stream of read transactions when the bus is not utilized by other system bus nodes. This arbiter enhancement does not cost additional arbitration overhead for other requests because the cost of unparking a grant was eliminated through the signaling protocol. This signaling protocol enabled the pregranted signal to be negated at the same time a new grant signal is asserted.

The Alpha 21164 microprocessor is capable of pending read-miss requests to the system interface. These read transaction requests sometimes have an associated victim that must be displaced by the requested read data. By pipelining these requests in relation to the system bus grants, a continuous stream of back-to-back system bus read or exchange transaction requests can flow because of the parked grant. Since the Alpha 21164 microprocessor is capable of continued execution while miss requests are pended, the processor designers had to carefully schedule the use of the B-cache. The fill data coming from system memory and the Alpha 21164 microprocessor are in contention for use of the B-cache. The system interface minimizes the time that the B-cache is allocated to accept the fill data while maintaining the flow of commands into the read-miss transaction pipeline from the Alpha 21164 microprocessor. By allowing the microprocessor to have access to the B-cache before and after each fill, a continuous flow of transactions was realized. The continuous flow of transactions uses the available system bus bandwidth.

Handling of Shared Data

A shared-database environment in which write transactions are prominent uses the system bus exclusive transaction types to establish ownership of the cache blocks. These transaction types minimize the system bus bandwidth usage by avoiding write broadcast transactions of modified blocks.

In a multiprocessor environment, a block that is valid in more than one cache is called a shared block. The coherence state of a shared block is VSD = 110. The following example summarizes the problem associated with a write transaction to a shared cache block in a system bus protocol without the exclusive transaction types.

Processor A has a modified but unshared cache block with state VSD = 101. Processor B wants to write the cache block that Processor A has modified. Processor B issues a read transaction on the system bus and then must immediately follow the read transaction with a write broadcast transaction of the modified data. The write broadcast transaction must be issued by Processor B because the read transaction was shared. At the end of the two bus transactions that it issued, Processor B's cache block state will be VSD = 101. Processor A has invalidated its cache block. Thus, two bus transactions were required from Processor B to write the modified cache block. With fair arbitration, however, Processor B may not have access to the system bus after the read transaction. The write transaction may be blocked, thus creating other coherence situations. If two or more processors in a system are trying to write the same block, Processor B may not get access to the system bus to complete the write transaction. The system is potentially in deadlock.

The system bus protocol implemented by the KN470 enables the write transaction to complete but requires only one system bus exclusive-read transaction. In response to the Alpha 21164 microprocessor's request to modify a cache block, the processor initiates an exclusive-read transaction on the system bus. Other processor modules responding to this exclusive-read transaction provide the data if their block is dirty, but regardless of the dirty state, they also invalidate their cache block. The invalidation eliminates the shared state. If no other processor module has a dirty block, the data is returned from the system memory. The processor module that is issuing an exclusive-read transaction sets its cache block state to VSD = 101 as it fills. The write transaction that is pending in the processor can complete without broadcasting a write transaction to the system bus.

A system bus that does not support the exclusive transaction types requires a shared write transaction to a block to be decomposed into two system bus transactions. This can result in system bus bandwidth saturation. A system bus that supports the exclusive transaction types requires only one system bus transaction. In a shared-data environment in which write transactions to shared data are the prominent cause of cache misses, support for the exclusive transaction types helps preserve bus bandwidth. Also, the deadlock scenario presented above does not exist. The KN470 processor write transactions to a cached block consume only one system bus transaction and can always complete. The invalidate window does not exist during the time it takes for the write transaction to complete.

The system bus protocol implemented by the KN470 module allows forward progress during shared write transactions in the system. However, system software is expected to avoid repetitive write transactions to blocks that are shared without some higher-level ownership protocol. Write transactions, if issued to a shared block by several processors, consume bus bandwidth and trigger false invalidations for bystanders. This may hinder forward progress and affect system performance.

Support of an Interlock Mechanism

The system interface implements an address lock register as specified in the Alpha Architecture Reference Manual to support software synchronization operations.6 The address lock register in the system interface has a signal that reflects the state of a valid bit to the Alpha 21164 microprocessor. The microprocessor manages the lock address register in the system interface based on sampling this signal during fill transactions from the system bus.

The Alpha 21164 microprocessor has an internal lock register that is maintained consistent with the lock register in the system interface, which is referred to as the external lock register. The external lock register is a backup copy of the Alpha 21164 microprocessor's lock register and is used only when instruction stream prefetching causes the locked address to be evicted from the B-cache. The execution of a load with lock instruction by an Alpha 21164 microprocessor results in a transaction that sets both internal and external lock flags and lock address registers.

The external lock flag is cleared by the system interface if the lock address matches the system bus address of either a write transaction or an exclusive transaction. The internal lock flag is cleared by the Alpha 21164 microprocessor due to system bus probe transactions from the write or exclusive transaction to a valid cache block.

The lock address resolution is a single aligned 32-byte block and is consistent with the size of cache blocks in this system. The Alpha 21164 microprocessor has 64-byte internal lock register resolution. Since the address of a load to memory and the corresponding store to memory must both be within the same 16-byte aligned region, the difference in the resolution of the internal and the external lock registers was determined to be insignificant to performance.6

The KN470 Module and System Bus Clocking

The KN470 module implements a low-cost synchronous clocking scheme. The scheme exploits the system bus clocking to run the Alpha 21164 microprocessor synchronous to the system bus. This scheme compensates for the half-cycle correction phase of the Alpha 21164 microprocessor's digital lock loop (DLL).

The AlphaServer 2100 system interconnect has an edge-to-edge clock architecture, and it implements an edge-to-edge data transfer scheme. The microprocessor has an internal DLL that synchronizes to a reference clock supplied by the clock repeater chip. Instead of trying to precisely control the clock skew across four different chips, data valid windows are set around the edge-to-edge data transfer clock edges to avoid setup or hold-time issues. This simpler clocking scheme takes advantage of the four delivered clock edges per cycle from the clock repeater chips. It also enables a simpler synchronous boundary between the Alpha 21164 microprocessor and the system interface. The synchronous clocking improves data transfer rates, lowers the miss penalty, and improves the pipeline efficiency among the components of the system.

Figure 2 shows the clocking scheme that is implemented on the KN470 module. The Alpha 21164 microprocessor accepts a differential clock at twice the desired internal clock frequency. The oscillator for the processor runs at 6, 7, 8, or 9 times the 41.66-megahertz (MHz) system bus clock frequency. The DLL subtracts one half of an internal clock cycle to maintain phase alignment with the system bus reference clock.
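The oscillator multiples quoted above can be checked with simple arithmetic; the 41.66-MHz bus clock and the 6, 7, 8, and 9 multipliers come from the text, while the rounding to whole megahertz is illustrative:

```python
# Quick check of the oscillator arithmetic in the text: the system bus
# clock is 41.66 MHz, and the processor oscillator runs at 6, 7, 8, or
# 9 times that frequency. Rounding to whole MHz is illustrative.

BUS_MHZ = 41.66

osc_mhz = [round(m * BUS_MHZ) for m in (6, 7, 8, 9)]
assert osc_mhz == [250, 292, 333, 375]

# A 41.66-MHz bus clock corresponds to a 24-ns bus cycle.
assert round(1000.0 / BUS_MHZ) == 24
```

The 6x case (250 MHz) matches the Model 5/250 processor described later in the performance section.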

Figure 2 KN470 Clocking Scheme

This DLL scheme assumes that the internal clock frequency runs slightly faster than the system bus clock frequency. Given these scaling rates, the interfaces between the Alpha 21164 microprocessor and the system interface are locked at the system bus clock rate.

The AlphaServer 2100 backplane distributes PECL-level system bus clocks PHI1 L and PHI1 H, and PHI3 L and PHI3 H, differentially to each module on the system bus. Each module receives, terminates, and capacitively couples the clock signals into PECL-to-CMOS-level converters to provide four edges per system bus clock cycle. This level conversion is completed in the clock repeater chips. System bus handshake and data transfers occur from clock edge to clock edge and thus form a primary clock in the system. The remaining three edges in a clocking cycle are secondary clocks. The clock repeater chip, a custom CMOS clock chip, provides module-to-module clock skew of less than 1 nanosecond (ns).

Components on the module are clocked by outputs from the clock repeater chips. The clock repeater chips generate six copies of the primary clock TPHI1 H. TPHI1 H clocks are distributed as follows: one copy to the Alpha 21164 microprocessor, two copies to each of the two system interface ASICs, and one copy to the system bus arbiter chip. The Alpha 21164 microprocessor uses its copy of the primary clock as a reference clock for its DLL. The data transfers between the microprocessor and the system interface are edge-to-edge transfers and are referenced to the primary clock. The clock repeater chip generates three secondary clocks: TPHI1 L, TPHI3 H, and TPHI3 L. The clock-edge relationships among these four clocks are specified such that each clock edge is 90 degrees out of phase with the other two clock edges. The relationships among the different clock phases are shown in Figure 3 for the case of the Alpha 21164 oscillator with a frequency six times that of the system bus clock.
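The 90-degree phase relationship among the four clocks works out to one rising edge every quarter of the 24-ns system bus cycle; a quick illustrative check:

```python
# Sketch of the clock-edge spacing implied by the text: four clocks
# (TPHI1 H/L, TPHI3 H/L) whose edges are 90 degrees apart within a
# 24-ns system bus cycle, giving four evenly spaced edges per cycle.

BUS_CYCLE_NS = 24.0
edges_ns = [BUS_CYCLE_NS * deg / 360.0 for deg in (0, 90, 180, 270)]
assert edges_ns == [0.0, 6.0, 12.0, 18.0]  # one edge every 6 ns
```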

[Figure 3 is a timing diagram showing the significant rising edges of TPHI1 H, TPHI1 L, TPHI3 H, TPHI3 L, and SYSCLK1R H within the 24-ns system bus cycle; TPHI1 H is also shown routed to the arbiter.]

Figure 3 Relationships among Different Clock Phases

The microprocessor driver turn-on and turn-off times are fast, but the ASICs have slow turn-on and turn-off times. To compensate for the fast and slow driver characteristics, the edge-to-edge clocking scheme required a modification. The Alpha 21164 microprocessor uses its copy of TPHI1 H as the reference clock edge to align its SYSCLK1/2 H-generated interface output clocks. Though SYSCLK1/2 H does not physically connect to the system interface, the Alpha 21164 microprocessor uses the internal copy of the SYSCLK1/2 H edge to either drive data or receive data. The system interface uses its copy of the reference clock as the data receive edge for signaling from the Alpha 21164 microprocessor. To drive the data to both the microprocessor and the B-cache, the system interface uses the TPHI3 L secondary clock, which is phase-delayed 90 degrees from the primary clock TPHI1 H.

The above clocking scheme achieves single-clock, edge-to-edge data transfer rates without imposing overly strict constraints on clock routing and layout. The scheme can withstand larger than 1 ns of clock skew and compensates for the Alpha 21164 microprocessor's DLL half-cycle correction between the reference clock and SYSCLK1/2 H.

The design could have been implemented such that the system bus would not stall in the absence of acknowledgments of previously pended requests. This level of complexity avoided the more complex issues of managing a queue of block-invalidate, set-block-to-shared, and read-block transaction requests.

The KN470 module design implements a scheme of write transaction back-off or replay that exploits the transaction replay queue of the Alpha 21164 microprocessor. This replay functionality helps the system interface handle cache state changes when simultaneous requests to write to the system bus and to invalidate from the system bus are made to the same cache block. The designers simplified the cache coherence management and logic design by avoiding the use of a pended write transaction in the system interface, which would have required a one-block write cache.

A write transaction from the Alpha 21164 microprocessor to the system bus is not considered complete until the system bus is granted. This nonpended scheme for write transactions enables write transaction replay from the Alpha 21164 microprocessor and avoids the requirement for the system interface to preserve logic states if a system bus transaction takes precedence. When the system bus transaction takes precedence, the microprocessor has
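The write back-off/replay behavior described above can be sketched as a small event walk; the event names and function are invented for illustration, not the actual system interface logic:

```python
# Hedged sketch of write back-off/replay: a CPU write is not complete
# until the bus is granted; if a bus transaction invalidates the same
# block first, the write is replayed after the next grant. Event names
# are illustrative, not the actual KN470 signal names.

def resolve_write(events):
    """Walk bus events; report how the pending CPU write completes."""
    for ev in events:
        if ev == "bus_invalidate_same_block":
            return "replayed_after_next_grant"
        if ev == "bus_grant_to_cpu":
            return "completed"
    return "pending"

assert resolve_write(["bus_grant_to_cpu"]) == "completed"
assert resolve_write(["bus_invalidate_same_block",
                      "bus_grant_to_cpu"]) == "replayed_after_next_grant"
```

Because the write is never pended inside the system interface, no logic state has to be preserved across the conflicting bus transaction, which is the simplification the designers cite.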

to replay the write transaction in relation to the next system bus grant. The designers chose the simpler implementation to reduce logic design complexity and verification time.

Performance of AlphaServer 2100 Systems with KN470 Modules

To validate the improved performance goal of the KN470 processor module in AlphaServer 2100 systems running Digital UNIX (formerly DEC OSF/1) version 3.2B, project engineers measured several industry-standard benchmarks. A brief description of each benchmark follows. Table 1 lists the benchmarks that were run on an AlphaServer 2100 Model 5/250 system, the number of processor modules in a configuration for each benchmark, the measured estimates or unaudited results of the benchmark, and the performance gain. Performance gain is reported as a ratio of the KN470 result to the top-performing, first-generation KN460 result. The ratios demonstrate that the KN470 processor module achieves the primary project goal by providing more performance to AlphaServer 2100 systems than the first-generation KN460 processor.

The AlphaServer 2100 Model 5/250 system uses the KN470 processor module that incorporates the Alpha 21164 microprocessor operating at 250 MHz with a 4-MB B-cache. The AlphaServer 2100 Model 4/275 system uses the KN460 processor module with the Alpha 21064 microprocessor operating at 275 MHz with a 4-MB B-cache. The AlphaServer 2100 system remained fixed as the processor modules were swapped for these performance measurements.

The Standard Performance Evaluation Corporation (SPEC) was formed to identify and create objective sets of applications-oriented tests, which can serve as common reference points. SPEC CINT92 is a good base indicator of CPU performance in a commercial environment. The SPECint92 benchmark is the geometric mean of ratios by which the six benchmarks in this suite exceed the performance of the reference machine. SPEC CFP92 may be used to compare floating-point intensive environments, typically engineering and scientific applications. SPEC CFP92 is the geometric mean of ratios by which the 14 benchmarks in this suite exceed the performance of the reference machine. SPEC Homogeneous Capacity method benchmarks test multiprocessor efficiency. They provide a fair measure for the processing capacity of a system, namely, how much work the system can perform in a given amount of time. The SPECrate is a capacity measure. It is not a measure of how fast a system can perform any task but of how many of those tasks the system completes within an arbitrary time interval.

Developed by AIM Technology, the AIM Suite III Benchmark Suite was designed to measure, evaluate, and predict UNIX multiuser system performance. The benchmark suite uses 33 functional tests, and these tests can be grouped to reflect the computing activities of various types of applications. The AIM Performance Ratings identify the maximum performance of the system under optimum usage of CPU, floating-point, and disk caching. At a system's peak performance, an
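The SPEC metrics quoted here are geometric means of per-benchmark ratios against a reference machine; a minimal sketch (the ratios in the example are invented):

```python
# Minimal sketch of how a SPEC-style metric is formed: the geometric
# mean of each benchmark's speedup ratio over a reference machine.
# The ratios below are invented for illustration.

import math

def spec_metric(ratios):
    """Geometric mean of benchmark speedup ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

assert round(spec_metric([2.0, 8.0]), 6) == 4.0
assert round(spec_metric([3.0, 3.0, 3.0]), 6) == 3.0
```

The geometric mean keeps any single benchmark from dominating the composite figure, which is why SPEC adopted it over an arithmetic mean.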

Table 1
Performance Data for an AlphaServer 2100 System That Incorporates the KN470 Processor Module

Benchmark                          Number of Processor     AlphaServer 2100     Performance Gain (Ratio of
                                   Modules per             Model 5/250          Model 5/250 to Model 4/275
                                   Configuration           Performance          Performance)

SPEC CINT92
  SPECint92                        1                       277                  1.4
  SPECrate_int92                   4                       24,996               1.4
SPEC CFP92
  SPECfp92                         1                       410                  1.4
  SPECrate_fp92                    4                       37,926               1.4
AIM Suite III Benchmark Suite (Estimated)
  Performance Rating               2                       396
  Maximum User Loads               2                       2,400                1.4
  Performance Rating               4                       719
  Maximum User Loads               4                       3,100                1.3
LINPACK 1000 x 1000 (MFLOPS)       4                       1,022                1.6
McCalpin
  copy                             2                       171                  1.28
  scale                            2                       169                  1.27
  sum                              2                       162                  1.25
  triad                            2                       162                  1.27

increase in the workload does not increase the system's throughput.

Table 2 presents the AIM Suite III Benchmark Suite performance scaling for AlphaServer configurations of one to four processor modules. These results validate improvements in the ability of KN470 processor modules to scale in multiprocessor configurations.

Summary

The implementation of the write-invalidate coherence protocol combined with synchronous clocking, a duplicate tag store, and pipelining cache-miss requests

Acknowledgments

Caley, Dick Beaven, and Ginny Lamere proved the design's integrity. Andy Ebert provided the ASIC test strategy and applications support. Peter Woods and Traci Post were responsible for the serial control subsystem and diagnostics. Stephen Shirron made the OpenVMS system boot on the new machine, and Kevin Peterson and Harold Buckingham provided the firmware and console bits and consultation regarding software issues. Janet Walsh and Jeff Kerrigan contributed operations support. Steve Brooks, Rich Freiss, and Dean Gagne

copy   c(j) = a(j)                ;copy a to c
scale  b(j) = 3.0 * c(j)          ;multiply c times 3, store result in b
sum    c(j) = a(j) + b(j)         ;add a to b and store in c
triad  a(j) = b(j) + 3.0 * c(j)   ;multiply c times 3, add to b, store result in a

Figure 4 The Four Parts of the McCalpin Benchmark
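The four kernels in Figure 4 can be transcribed directly into a runnable form; the real benchmark runs them over arrays far larger than any cache and reports sustained memory bandwidth, so this small transcription is purely illustrative:

```python
# Runnable Python transcription of the four McCalpin (STREAM) kernels
# shown in Figure 4. Illustrative only: the real benchmark uses very
# large arrays and times each loop to report memory bandwidth.

def stream_kernels(a, b, c):
    n = len(a)
    for j in range(n):
        c[j] = a[j]                # copy: copy a to c
    for j in range(n):
        b[j] = 3.0 * c[j]          # scale: multiply c times 3, store in b
    for j in range(n):
        c[j] = a[j] + b[j]         # sum: add a to b and store in c
    for j in range(n):
        a[j] = b[j] + 3.0 * c[j]   # triad: b plus 3 times c, store in a
    return a, b, c

a, b, c = stream_kernels([1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
assert b == [3.0, 6.0]    # scale of the copied values
assert c == [4.0, 8.0]    # sum of a and b
assert a == [15.0, 30.0]  # triad result
```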

Table 2
AIM Suite III Benchmark Suite Performance Scaling (Estimated) for the AlphaServer 2100 System

Number of Processor Modules             1        2        3        4
Model 5/250
  Maximum Throughput (Jobs/Minute)      2,178    3,882    5,249    7,047
  Scaling                               1.0      1.8      2.4      3.2
Model 4/275
  Maximum Throughput (Jobs/Minute)      1,451    2,229    2,998    3,587
  Scaling                               1.0      1.5      2.1      2.5

developed the operating system software that supports this system. Simon Steely, Zarka Cvetanovic, and John Shakshober carried out performance analysis and validations. John Edmondson, Pete Bannon, Anil Jain, and Paul Rubinfeld designed the KN470-specific functionality in the Alpha 21164 microprocessor.
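The scaling rows in Table 2 are simply each throughput figure divided by the one-module figure, rounded to one decimal place; a quick check against the published numbers:

```python
# Verify the scaling rows of Table 2: throughput relative to the
# one-module configuration, rounded to one decimal place.

m5_250 = [2178, 3882, 5249, 7047]
assert [round(t / m5_250[0], 1) for t in m5_250] == [1.0, 1.8, 2.4, 3.2]

m4_275 = [1451, 2229, 2998, 3587]
assert [round(t / m4_275[0], 1) for t in m4_275] == [1.0, 1.5, 2.1, 2.5]
```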

References

2. B. Maskas, S. Shirron, and N. Warchol, "Design and Performance of the DEC 4000 AXP Departmental Server Computing Systems," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 82-99.

3. J. Edmondson et al., "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor," Digital Technical Journal, vol. 7, no. 1 (1995).

Biographies

Barry A. Maskas
A consultant engineer in Digital's Server Product Development Group, Barry Maskas was the project leader responsible for the development of the Alpha 21164 microprocessor-based AlphaServer 2100 and 2000 processors and systems. He is currently involved in further Alpha-based server system development work. In earlier work, Barry was the project leader and architect for the DEC 4000 system bus, backplane, and processor modules and for the architecture and development of custom VLSI peripheral chip sets for VAX 4000 and MicroVAX systems.

Nitin D. Godiwala
Nitin Godiwala is a principal engineer in Digital's Server Product Development Group. His area of expertise is digital system architecture and pipelined machines. As a contributor to the AlphaServer 2100 server product, he was the project leader and principal architect of the arbitration ASIC and the system interface ASICs for the Alpha 21064 and 21164 microprocessor-based processor modules. In previous work, he was a principal architect and designer of the system interface ASIC for the DEC 4000 processor module. Before coming to Digital in 1986, Nitin worked for Analogic Corp., Gould Modicon, and Honeywell Inc. He received a B.E. from Bombay University and an M.S. in computer and electrical engineering from the University of Wisconsin, Madison. He holds four patents and has eight patents pending.

John H. Zurawski
John E. Murray
Paul J. Lemmon

The Design and Verification of the AlphaStation 600 5-series Workstation

The AlphaStation 600 5-series workstation is a high-performance, uniprocessor design based on the Alpha 21164 CMOS microprocessor and on the PCI bus. Six ASICs provide high-bandwidth, low-latency interconnects between the CPU, the main memory, and the I/O subsystem. The verification effort used directed, pseudorandom testing on a VERILOG software model. A hardware-based verification technique provided a test throughput that resulted in a significant improvement over software tests. This technique currently involves the use of graphics cards to emulate generic DMA devices. A PCI hardware demon is under development to further enhance the capability of the hardware-based verification.

The high-performance AlphaStation 600 5-series workstation is based on the fastest Alpha microprocessor to date—the Alpha 21164.1 The I/O subsystem uses the 64-bit version of the Peripheral Component Interconnect (PCI) and the Extended Industry Standard Architecture (EISA) bus. The AlphaStation 600 supports three operating systems: Digital UNIX (formerly DEC OSF/1), OpenVMS, and Microsoft's Windows NT. This workstation series uses the DECchip 21171 chip set designed and built by Digital. These chips provide high-bandwidth, low-latency interconnects between the CPU, the main memory, and the PCI bus.

This paper describes the architecture and features of the AlphaStation 600 5-series workstation and the DECchip 21171 chip set. The system overview is first presented, followed by a detailed discussion of the chip set. The paper then describes the cache and memory designs, detailing how the memory design evolved from the workstation's requirements. The latter part of the paper describes the functional verification of the workstation. The paper concludes with a description of the hardware-based verification effort.

System Overview

The AlphaStation 600 5-series workstation consists of the Alpha 21164 microprocessor, a third-level cache that is external to the CPU chip, and a system chip set that interfaces between the CPU, the memory, and the PCI bus. The DECchip 21171 chip set consists of three designs: a data slice, one PCI interface and memory-control chip (called the control chip), and a miscellaneous chip that includes the PCI interrupt logic and flash read-only memory (ROM) control. The Intel 82374 and 82375 chip sets provide the bridge to the EISA bus.2 Figure 1 shows a block diagram of the workstation.

The SysData bus transfers data between the processor, the CPU's tertiary cache, and the data slices. The 128-bit-wide SysData bus is protected by error-correcting code (ECC) and is clocked every 30 nanoseconds (ns). The data slices provide a 256-bit-wide data path to memory.
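As a quick arithmetic cross-check of the bus widths and clock rates quoted above (an illustrative calculation, not a figure from the paper), a 128-bit bus clocked every 30 ns and a 256-bit memory path cycling every 60 ns have the same peak transfer rate:

```python
# Peak-rate sanity check for the SysData bus and the memory path.
# Figures from the text: 128-bit (16-byte) SysData bus clocked every
# 30 ns; 256-bit (32-byte) memory path with DRAMs cycling at 60 ns.

def peak_mb_per_s(bytes_per_transfer: int, cycle_ns: float) -> float:
    """Peak transfer rate in MB/s (1 MB taken as 10**6 bytes)."""
    return bytes_per_transfer / (cycle_ns * 1e-9) / 1e6

sysdata = peak_mb_per_s(16, 30)   # 128 bits every 30 ns
memory  = peak_mb_per_s(32, 60)   # 256 bits every 60 ns

print(round(sysdata, 1), round(memory, 1))  # both ~533.3 MB/s
```

The equality of the two rates is what lets the data slices feed the narrower SysData bus from the wider memory path without stalling.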

Figure 1 AlphaStation 600 5-series Workstation Block Diagram

Data transfers between the PCI and the processor, the external cache (typically 4 megabytes [MB]), and memory take place through the control chip and four data slices. The control chip and the data slices communicate over the 64-bit, ECC-protected I/O data bus.

The major components and features of the system board are the following:

- The Alpha 21164 microprocessor supports all speed selections from 266 to 333 megahertz (MHz).

- The plug-in, external write-back cache (2 MB to 16 MB) has a block size of 64 bytes. Access time is a multiple of the processor cycle time and is dependent on the static random-access memory (SRAM) part used. With 12-ns SRAMs, typical access times are 24 ns for the first 128 bits of data, 21 ns for remaining data.

- The system board contains a 256-bit data path to memory (284 megabytes per second [MB/s] for sustained CPU reads of memory).

- From 32 MB to 1 gigabyte (GB) of main memory can be used in industry-standard, 36-bit, single in-line memory modules (SIMMs). All memory banks support single-sided and double-sided SIMMs.

- Eight option slots are available for expansion: four PCI, three EISA, and one PCI/EISA shared slot. The system design minimized logic on the mother board in favor of more expansion slots, which allow customers to configure to their requirements. The system uses option cards for small computer systems interface (SCSI), graphics, and audio.

- The system supports 64-bit PCI address and data capability.

- Due to its synchronous design, the system's memory, cache, and PCI timing are multiples of the processor cycle time.

- The system provides an X bus for the real-time clock, keyboard controller, control panel logic, and the configuration RAM.

Data Slice Chips

Four data slice chips implement the primary data path in the system. Collectively, the data slices constitute a 32-byte bus to main memory, a 16-byte bus to the CPU and its secondary cache, and an 8-byte bus to the control chip (and then to the PCI bus).

Figure 2 shows a block diagram of the data slice chip. The data slice contains internal buffers that provide temporary storage for direct memory access (DMA), I/O, and CPU traffic. A 64-byte victim buffer holds the displaced cache entry for a CPU fill operation. The Memory-Data-In register accepts 288 bits (including ECC) of memory data every 60 ns. This register clocks the memory data on the optimal 15-ns clock to reduce memory latency. The memory data then proceeds to the CPU on the 30-ns, 144-bit bidirectional data bus. A set of four, 32-byte I/O write buffers helps maximize the performance of copy operations from memory to I/O space. A 32-byte buffer holds the I/O read data. Finally, there are a pair of DMA buffers, each consisting of three 64-byte storage areas. DMA read operations use two of these three locations: the first holds the requested memory data, and the other holds the external cache data in the case of a cache hit. DMA writes use all three locations: one location holds the DMA write data, and the other two hold the memory and cache data used during a DMA write merge.

The data slice allows for simultaneous operations. For instance, the I/O write buffers can empty to the control chip (and then to the PCI) while a concurrent read from CPU to main memory is in progress.

Control Chip

The control chip controls the data slices and main memory and provides a fully compliant host interface to the 64-bit PCI local bus. The PCI local bus is a high-performance, processor-independent bus, intended to interconnect peripheral controller components to a processor and memory subsystem. The PCI local bus offers the promise of an industry-standard interconnect, suitable for a large class of computers ranging from personal computers to large servers.

Figure 3 shows a block diagram of the control chip. The control chip contains five segments of logic:

- The address and command interface to the Alpha 21164 microprocessor

- The data path from the PCI bus to the data slices by means of the I/O data bus

- DMA address logic, including a 32-entry scatter/gather (S/G) map (This is discussed in the section Scatter/Gather Address Map.)

- Programmed I/O read/write address logic

- The memory address and control logic
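The DMA write merge performed in the data-slice buffers, described above, can be modeled as a simple byte-mask merge over a 64-byte block. The sketch below is illustrative only; the byte-valid mask and the preference for cache data over the stale memory copy are assumptions consistent with the description, not the actual hardware algorithm.

```python
# Illustrative model of a DMA write merge into a 64-byte block.
# The data slice holds three 64-byte areas: the DMA write data, the
# block read from memory, and (on a cache probe hit) the block from
# the CPU's cache. For a partial-block DMA write, the DMA bytes are
# merged over the freshest background copy; for a full-block write,
# the background data is simply discarded.

BLOCK = 64

def merge_dma_write(dma_data, dma_mask, memory_data, cache_data=None):
    """Return the 64-byte block to write back to memory.

    dma_mask[i] is True where the DMA supplied byte i. cache_data is
    None unless the cache probe hit (cache data then takes precedence
    over the stale memory copy).
    """
    background = cache_data if cache_data is not None else memory_data
    if all(dma_mask):                  # full-block write: discard background
        return bytes(dma_data)
    return bytes(d if m else b
                 for d, m, b in zip(dma_data, dma_mask, background))

mem = bytes(range(BLOCK))
dma = bytes([0xFF]) * BLOCK
mask = [i < 16 for i in range(BLOCK)]   # DMA wrote only the first 16 bytes
merged = merge_dma_write(dma, mask, mem)
print(merged[:4], merged[16:20])
```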

[Figure 2 block diagram: victim buffer, Memory-Data-In register, I/O write buffers, and DMA read/write buffers connecting the CPU, the memory banks, and the I/O data bus to the control chip.]

Figure 2 Data Slice Block Diagram

Figure 3 Control Chip Block Diagram

CPU Interface

A three-deep queue can hold two outstanding read requests, together with the address of a victim block associated with one of these read requests. During a DMA write, the Flush Address register holds the address of the cache block that the CPU must move to memory (and invalidate in the cache). In this manner, the system maintains cache coherency during DMA write operations.

DMA Read Buffering

In addition to the two 64-byte buffers inside the data slice, the control chip has two 32-byte DMA read buffers. These buffers prefetch DMA read data when the initiating PCI read command so indicates. This arrangement provides data to a 64-bit PCI device at a rate of more than 260 MB/s.

Scatter/Gather Address Map

The S/G mapping address table translates contiguous PCI addresses to any arbitrary memory address on an 8-kilobyte (KB) granularity. For software compatibility with other Alpha system designs, the S/G map uses a translation lookaside buffer (TLB).3 The designers enhanced the TLB: First, each of the eight TLB entries holds four consecutive page table entries (PTEs). This is useful when addressing large 32-KB contiguous regions on the PCI bus. For instance, the NCR 53C810 PCI-to-SCSI device requires nearly 24 KB of script space.4 Second, software can lock as many as one half of the TLB entries to prevent the hardware-controlled replacement algorithm from displacing them. This feature reduces TLB thrashing.

PCI Address Space Windows

PCI devices use address space windows to access main memory. During discussions with the developers of the operating system, we determined that four PCI address space windows would be desirable. EISA devices use one window. S/G mapping uses a second. The third window directly maps a contiguous PCI address region to a contiguous region of main memory. The fourth window supports 64-bit PCI addresses. Future system designs may provide more than 4 GB of main memory, thus requiring the 64-bit address window.

DMA Write Buffering

The control chip provides a single-entry, 64-byte DMA write buffer. Once the buffer is full, the data is transferred to the DMA buffers in the data slices. The design can support 97-MB/s DMA write bandwidth from a 32-bit PCI device.

Programmed I/O (PIO) Writes

The designers focused on improving the performance of the functionality that allows a processor to copy data from memory to I/O space.
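The scatter/gather map described in the Scatter/Gather Address Map section can be sketched as a page-table lookup with a small TLB whose entries each cache four consecutive PTEs, so that one entry covers a 32-KB contiguous PCI region. The entry format, lookup, and replacement policy below are simplified assumptions, not the hardware design.

```python
# Illustrative model of the scatter/gather (S/G) address map: a PCI
# address is split into an 8-KB page number and offset, and a page
# table maps each PCI page to an arbitrary memory page. The TLB
# caches translations in 8 entries of 4 consecutive PTEs each.

PAGE_SHIFT = 13            # 8-KB translation granularity
PTES_PER_ENTRY = 4         # one TLB entry spans a 32-KB PCI region
TLB_ENTRIES = 8

class ScatterGatherMap:
    def __init__(self, page_table):
        self.page_table = page_table      # PCI page -> memory page
        self.tlb = {}                     # aligned tag -> list of 4 PTEs

    def translate(self, pci_addr):
        page = pci_addr >> PAGE_SHIFT
        offset = pci_addr & ((1 << PAGE_SHIFT) - 1)
        tag = page - page % PTES_PER_ENTRY
        if tag not in self.tlb:           # miss: fetch 4 consecutive PTEs
            if len(self.tlb) == TLB_ENTRIES:   # simplistic replacement
                self.tlb.pop(next(iter(self.tlb)))
            self.tlb[tag] = [self.page_table[tag + i]
                             for i in range(PTES_PER_ENTRY)]
        mem_page = self.tlb[tag][page - tag]
        return (mem_page << PAGE_SHIFT) | offset

# Contiguous PCI pages 0..3 scattered to arbitrary memory pages:
sg = ScatterGatherMap({0: 7, 1: 42, 2: 3, 3: 19})
print(hex(sg.translate(0x2004)))   # PCI page 1, offset 4 -> memory page 42
```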

High-end graphics device drivers use this functionality to load the graphics command into the device's first-in, first-out (FIFO) buffer. The data slice has four buffers, and the control chip contains the corresponding four-entry address queue. Four buffers hold enough I/O write transactions to mask the latency of the processor's read of memory. The control chip provides two additional 32-byte data buffers. While one drives data on the PCI bus, the other accepts the next 32 bytes of data from the data slices.

Memory Controller

The memory controller logic in the control chip supports as many as eight banks of dynamic random-access memory (DRAM). The current memory backplane, however, provides for only 4 banks, allowing from 32 MB to 1 GB of memory. The memory controller supports a wide range of DRAM sizes and speeds across multiple banks in a system. Registers program the DRAM timing parameters, the DRAM configuration, and the base address and size for each memory bank. The memory timing uses a 15-ns granularity and supports SIMM speeds ranging from 80 ns down to 50 ns.

Cache Design

The Alpha 21164 microprocessor contains significant on-chip caching: an 8-KB virtual instruction cache; an 8-KB data cache; and a 96-KB, 3-way, set-associative, write-back, second-level mixed instruction and data cache. The system allows for an external cache as a plug-in option. This cache is typically 2 MB to 4 MB in size, and the block size is 64 bytes. The access time for the external cache depends on the CPU frequency and the speed variant of the cache. Typically, the first data requires 7 to 8 CPU cycles; subsequent data items require 1 or 2 fewer cycles. The actual value depends on both the minimum propagation time through the cache loop and on the CPU cycle time. The external cache data bus is 16 bytes wide, providing almost 1 GB/s of bandwidth with a 333-MHz CPU and a 5-cycle cache access.

The processor always controls the external cache, but during a cache miss, the system and the processor work together to update the cache or displace the cache victim. For an external cache miss, the system performs four 16-byte loads at 30 ns. Any dirty cache block is sent to the victim buffer in the data slices, in parallel with the read of memory. Fast page-mode memory writes are used to write the victim into memory quickly. (This is discussed in the section Memory Addressing Scheme.)

During DMA transactions, the system interrogates the CPU for relevant cache data. There is no duplicate tag in the system. DMA reads cause main memory to be read in parallel with probes of the CPU's caches. If a cache probe hits, the cache data is used for the DMA read; otherwise main memory data is used. Each DMA write to memory results in a FLUSH command to the CPU. If the block is present in any of the caches, then the data is sent to the DMA buffers in the data slice and the cache blocks are invalidated. This cache data is discarded if the DMA write is sent to a complete block. In the case of a DMA write to a partial block, the DMA write data is merged with cache data or the memory data as appropriate. In this manner, the system maintains cache coherency, removing this burden from the software.

Memory Bandwidth

The memory bandwidth realized by the CPU depends on a number of factors. These include the cache block size, the latency of the memory system, and the data bandwidth into the CPU.

Cache Block Size

The Alpha 21164 microprocessor supports either a 32- or 64-byte cache block size. The AlphaStation 600 workstation uses the 64-byte size, which is ideal for many applications, but suffers on certain vector-type programs with contiguous memory references.5 An example of a larger block size design is the RISC System/6000 Model 590 workstation from International Business Machines Corporation.6 This design supports a 256-byte cache block size, allowing it to amortize a long memory latency by a large memory fetch. For certain vector programs, the Model 590 performs well; but in other applications, the large block size wastes bandwidth by fetching more data than the CPU requires.

The AlphaStation 600 provides a hardware feature to gain the benefit of a larger block size when appropriate. The Alpha 21164 microprocessor can issue a pair of read requests to memory. If these two reads reside in the same memory page, the control chip treats them as a single 128-byte memory read. In this way, the system approximates the benefit of a larger block and achieves 284 MB/s of memory read bandwidth.

Memory Latency

The 180-ns memory latency consists of five parts. First, the address is transferred from the microprocessor to the control chip in 15 ns. The control chip sends the memory row-address pulse 15 ns later, and the data is received by the data slices 105 ns later. The data slices require 15 ns to merge the wider memory data onto the narrower SysData bus, and the last 30 ns are spent updating the external cache and loading the Alpha 21164 microprocessor.

Although the 105 ns to access the memory may appear to be generous, the designers had to meet the significant challenge of implementing the required 1 GB of memory with inexpensive 36-bit SIMMs.
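The five parts of the 180-ns latency quoted in the Memory Latency section sum up as follows (a restatement of the numbers in the text, in nanoseconds):

```python
# The 180-ns CPU memory-read latency, broken into the five parts
# given in the text (all values in nanoseconds).
latency_parts = {
    "address from CPU to control chip":            15,
    "control chip issues row-address pulse":       15,
    "DRAM access until data reaches data slices": 105,
    "data slices merge onto SysData bus":          15,
    "update external cache, load the 21164":       30,
}
total = sum(latency_parts.values())
print(total)  # 180
```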

The JEDEC standard for these SIMMs only specifies the pinning and dimensions. It does not specify the etch lengths, which can vary by many inches from vendor to vendor. Neither does it specify the electrical loading distribution, nor the DRAM type or location (1-bit parts have 2 data loads whereas 4-bit parts have a single, bidirectional load). With a 1-GB memory system, the loading variation between a lightly loaded memory and a fully loaded memory is significant. All these factors contributed to significant signal-integrity problems with severe signal reflections. The memory mother-board etch was carefully placed and balanced, and numerous termination schemes were investigated to dampen the signal reflections.

Data Bandwidth

The SysData bus transfers data between the processor, the tertiary cache, and the data slices. This 128-bit bus is clocked every 30 ns to satisfy the write timing of the external cache and to be synchronous with the PCI bus. Typical memory DRAM parts cycle at 60 ns, thus requiring a 32-byte-wide memory bus to match the bandwidth of the SysData bus. The data slice chips merge the wider memory data onto the SysData bus; consequently, the memory system is logically equivalent to a 2-way interleaved memory design.

New memory technologies with superior data bandwidths are becoming available. Synchronous DRAMs are an exciting technology, but they lack a firm standard and are subject to a significant price premium over plain 5-volt DRAM parts. Extended-data-out (EDO) DRAMs allow greater burst memory bandwidth, but the latency to the first data is not reduced. Consequently, the memory bandwidth to the CPU is not significantly improved. The major advantage of using EDO parts is their easier memory timing: The output data of EDO parts is valid for a longer period than standard DRAMs. In addition, an EDO memory can be cycled at 30 ns, which allows a 128-bit memory width instead of the 256-bit width. The designers would have used EDO parts had they been available earlier.

Memory Addressing Scheme

The adopted addressing scheme helps improve memory bandwidth. Whenever the CPU requests a new block of data, the write-back cache may have to displace current data (the victim block) to allow space for the incoming data. The writing of the victim block to memory should occur quickly, otherwise it will impede the CPU's request for new data.

Figure 4 shows the method used to address the external cache and memory. The CPU address <31:6> directly accesses the cache: the low-order bits <19:6> form the index for a 1-MB cache, and the remaining bits <31:20> form the cache tag. The CPU address does not directly address memory. Instead, the memory address interchanges the index portion of the address field with the tag portion. The number of address bits interchanged depends on the row and column dimensions of the DRAM used. For the DRAMs used in this system, the tag bits interchange with bits <16:6>, and the remaining bits select the memory bank. This addressing scheme has the following effect: a CPU address that is incrementing by units of 1 MB now accesses consecutive memory locations. DRAM memory provides a fast addressing mode, called page mode, whenever accessing consecutive locations. For a 1-MB cache, objects separated by a multiple of 1 MB correspond to cache victim blocks. Consequently, a CPU read request of memory that involves a victim write to memory gains the benefit of page mode and proceeds faster than it would with a traditionally addressed memory.

Although this address scheme is ideal for CPU memory accesses, it creates the converse effect for DMA transactions. It scatters consecutive DMA blocks by 1 MB in memory.
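The tag/index interchange described above can be sketched as follows. The exact bit ranges swapped depend on the DRAM geometry; swapping the 12-bit tag field <31:20> with index bits <17:6> is an assumed simplification, but it still shows the key property: addresses 1 MB apart (cache-victim pairs) map to consecutive 64-byte blocks in memory.

```python
# Illustrative model of the cache/memory address interchange. For a
# 1-MB cache, CPU address bits <19:6> index the cache and bits
# <31:20> form the tag. The memory address swaps tag and index fields
# so that cache-victim blocks (addresses 1 MB apart) land in
# consecutive memory locations and enjoy DRAM page mode. Swapping two
# equal 12-bit fields is an assumption; the real bit ranges depend on
# the DRAM's row and column dimensions.

FIELD = 12           # bits interchanged (assumed field width)
LO, HI = 6, 20       # low field starts at bit 6, high field at bit 20

def interchange(addr):
    mask = (1 << FIELD) - 1
    low = (addr >> LO) & mask
    high = (addr >> HI) & mask
    addr &= ~((mask << LO) | (mask << HI))
    return addr | (high << LO) | (low << HI)

a = 0x100000          # two victim-related addresses 1 MB apart
b = a + 0x100000
print(interchange(b) - interchange(a))   # 64: consecutive 64-byte blocks
```

Because the swap is its own inverse, applying it twice recovers the original CPU address, which is how a single interchange stage can sit between the cache and the memory controller.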

[Figure 4: the physical address is divided into a block offset, a cache index <19:6>, and a tag <31:20>; the memory address interchanges the tag and index fields, with the high-order bits providing the bank select.]

Figure 4 Memory Address Scheme

These locations fall outside the DRAM page-mode region, resulting in lower performance. The solution is to enlarge the memory blocks; for example, start the memory interchange at bit <8> instead of bit <6>. This compromise allows 256-byte DMA bursts to run at full speed. Slightly fewer victim blocks, however, gain the benefit of page mode.

The bit assignment for this address scheme depends on the row and column structure of the DRAM part and on the external cache size. Power-on software automatically configures the address-interchange hardware in the system.

Design Considerations

In this section, we discuss the design choices made for system clocking, timing verification, and the application-specific integrated circuit (ASIC) design.

System Clocking

The chip set is a synchronous design: The system clock is an integer multiple of the CPU cycle time. Consequently, the PCI clock, the memory clock, and the cache loop are all synchronous to each other. The designers avoided an asynchronous design for two reasons. It suffers from longer latencies due to the synchronizers, and it is more difficult to verify its timing. Unlike the memory controller, which uses a double-frequency clock to provide a finer 15-ns resolution for the memory timing pulses, the synchronous design of the chip set uses a single-phase clock. This simplified clocking scheme eased the timing verification work. Phase-locked-loop (PLL) devices control the clock skew on the system board and in the ASICs. The PLL in the ASICs also generates the double-frequency clock.

Timing Verification

The complete system was verified for static timing. A signal-integrity tool similar to SPICE was used to analyze all the module etch and to feed the delays into the module timing verification effort. The final ASIC timing verification used the actual ASIC etch delays. This process was so successful that the actual hardware was free of any timing-related bug or signal-integrity problem.

ASIC Design

The chip designers chose to implement the gate array using the 300K technology from LSI Logic Corporation. The control chip uses over 100K gates, and each data slice consumes 24K gates. Originally, the designers considered the slower 100K technology, but it proved unable to satisfy the timing requirements for a 64-bit-wide PCI bus.

The designers used the VERILOG hardware description language to define all the logic within the ASICs. Schematics were not used. The SYNOPSYS gate-synthesizer tool generated the gates. The designers had to partition the logic into small 3,000 to 5,000 gate segments to allow SYNOPSYS to complete within 12 to 15 hours on a DECstation 5000 workstation. Currently, the same synthesis requires 1 hour on the AlphaStation 600 5/260. The designers developed a custom program that helped balance the timing constraints across these small gate segments. This allowed the SYNOPSYS tool to focus its attention on the segments with the greatest potential for improvement.

Performance

Table 1 gives the bandwidths of the workstation for the 32-bit and 64-bit PCI options. A structural simulation model verified this data, using a 180-ns memory latency and a 30-ns system clock. The 284-MB/s read bandwidth of the CPU memory is impressive considering that the memory system is 1 GB. Eventually, the memory size will reach 4 GB when 64-Mb memory chips become available.

The I/O write bandwidth is important for certain 3D graphics options that rely on PIO to fill the command queue. Current high-end graphics devices require approximately 80 MB/s to 100 MB/s. The 213 MB/s of I/O write bandwidth on the 64-bit PCI can support a double-headed 3D graphics configuration without saturating the PCI bus. Other 3D graphics options use large DMA reads to fill their command queue. This approach offers additional bandwidth at 263 MB/s. The system did not optimize DMA writes to the same extent as DMA reads. Most options are amply satisfied with 100 MB/s of bandwidth.

Table 1
Bandwidth Data (MB/s)

Transaction Type          32-bit PCI    64-bit PCI
CPU memory read:
  64 bytes                284           284
I/O write:
  Contiguous 32 bytes     119           213
  Random 4 bytes          44            44
I/O read:
  4 bytes                 12            12
  32 bytes                56            56
DMA read:
  64 bytes                79            112
  8 KB                    132           263
DMA write:
  64 bytes                97            102
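For comparison with Table 1 (a back-of-the-envelope calculation, not a figure from the paper), the peak PCI transfer rates implied by a 30-ns (33.3-MHz) PCI clock, which the synchronous design makes a multiple of the CPU cycle time, are:

```python
# Rough peak PCI rates for comparison with Table 1, assuming a 30-ns
# (33.3-MHz) PCI clock synchronous with the system clock. These are
# illustrative ceilings, not measurements from the paper.
CYCLE_NS = 30

def pci_peak_mb_per_s(bus_bytes: int) -> float:
    return bus_bytes / (CYCLE_NS * 1e-9) / 1e6

peak32 = pci_peak_mb_per_s(4)   # 32-bit PCI: 4 bytes per cycle
peak64 = pci_peak_mb_per_s(8)   # 64-bit PCI: 8 bytes per cycle
print(round(peak32), round(peak64))   # ~133 and ~267 MB/s
```

Under these assumptions, the 263 MB/s quoted for 8-KB DMA reads on the 64-bit PCI runs close to the theoretical ceiling of the bus.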

Table 2 gives the performance for several benchmarks. The data is for a system with a 300-MHz processor and a 4-MB cache built out of 12-ns SRAM parts. The SPECmark data is preliminary and clearly world-class. The LINPACK data is for double-precision operands. Even greater performance is possible with faster cache options (for instance, a cache using 8-ns parts) and faster speed variants of the Alpha 21164 microprocessor.

Functional Verification

The functional verification is an ongoing effort. Three factors contribute to the need for greater, more efficient verification. First, the design complexity of each new project increases with the quest for more performance. Next, the quality expectations are rising: the prototype hardware must boot an operating system with no hardware problems. Finally, time to market is decreasing, providing less time for functional verification.

A number of projects at Digital have successfully used the SEGUE high-level language for functional verification.7 SEGUE allows simple handling of randomness and percentage weightings. As an example, a code sequence may express that 30 percent of the DMA tests should target the scatter/gather TLB, and that the DMA length should be selected at random from a specified range. Each evocation of SEGUE generates a test sequence with different random variations. These test sequences are run across many workstations to achieve a high throughput. The project used 20 workstations for 12 months.

The test suite focused on the ASIC verification in the context of the complete system. It was not a goal to verify the Alpha 21164 microprocessor; neither was the EISA logic verified (this logic was copied from other projects). The test environment used the VERILOG simulator and included the Alpha 21164 behavioral model, a PCI transactor (a bus functional model), and a memory and cache model. The SEGUE code generated C-language test programs for CPU-to-memory and CPU-to-I/O transactions, as well as DMA scripts for the PCI transactor.

The goal of verification went beyond ensuring that the prototype hardware functioned correctly. The major objective was to ensure that the hardware is reliable many years hence, when new, as yet undeveloped, PCI options populate the system. Today, the PCI bus uses only a small number of expansion option cards. It is quite probable that a perfunctory verification of the PCI logic would result in a working system at the time of hardware power-on and for many months thereafter. It is only as more option cards become available that the likelihood of system failure grows. Consequently, the verification team developed a detailed PCI transactor and subjected the PCI interface in the control chip to heavy stressors. The complexity of the PCI transactor far exceeds that of the PCI interface logic within the ASIC. The reason is that the ASIC design implements only the subset of the PCI architecture appropriate to its design. The PCI transactor, however, has to emulate any possible PCI device and thus must implement all possible cases. Furthermore, it must model poorly designed PCI option cards (the word "should" is common in the PCI specification).

The verification experience included the following:

- Directed tests. Specific, directed tests are needed to supplement pseudorandom testing. For example, a certain intricate sequence of events is best verified with a specific test, rather than relying on the random process to generate the sequence by chance.

- Staff hours. In prior projects, the hardware team exceeded the verification team in size. Over the years, the importance of verification has grown. On this project, twice as much time was spent on the verification effort as on the hardware coding.

- Degree of randomness. Pure randomness is not always desirable. For instance, an interesting test can be conducted when a DMA write and a CPU read target the same block in memory (although, for coherency reasons, not the same data). Random addresses are unlikely to create this interaction; instead careful address selection is necessary.

- Error tests. The pseudorandom test process added a different error condition, such as a PCI addressing error, within each test. The hardware logic, upon detecting the error, would vector by sending an interrupt to the error-handling code. The handler would check if the hardware had captured the correct error status and, if it had, would resume the execution of the test program. This strategy uncovered bugs when the hardware continued functioning after an error condition, only to fail many cycles later.

- Hardware simulation accelerator. The project team did not use a hardware simulation accelerator for a number of reasons. In the early phase of verification, bugs are so frequent that there is no value in finding more bugs; the limiting resource is the isolation and fixing of the bugs. Second, porting the code onto the hardware simulator uses resources that are better spent improving the test suite: running poor tests faster is of no value. Finally, the hardware-based verification technique offers far greater performance.

- Bug curve. The project team maintained a bug curve. The first-pass ASIC was released when the bug curve flattened.

This 5-orders-of-magnitude improvement suggests that all the tests performed in the past 12 months of software-based verification can be completed during the hardware-based debugging in 5 minutes. A secondary, but very useful, advantage of hardware-based testing is the ability to stress the chips electrically. For instance, by selecting a data pattern of 1's and 0's for the DMA, memory, and PIO tests, verification engineers can severely test the capability of the chips.

Table 2
Benchmark Performance

Benchmark                   Performance
LINPACK 100 X 100
LINPACK 1000 X 1000
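The SEGUE-style weighted random selection described in the Functional Verification section (30 percent of DMA tests targeting the scatter/gather TLB, with the DMA length drawn at random from a range) can be sketched as follows. This is an illustrative stand-in in Python, not SEGUE syntax; the window names and test-record shape are invented for the example.

```python
import random

# Illustrative stand-in for SEGUE-style weighted random test
# generation: 30 percent of DMA tests target the scatter/gather TLB,
# and the DMA length is chosen at random from a given range. The
# target names and field layout are hypothetical.

def make_dma_test(rng, min_len=4, max_len=8192):
    target = rng.choices(["sg_tlb", "direct_window"], weights=[30, 70])[0]
    return {"target": target,
            "length": rng.randrange(min_len, max_len + 1, 4)}

rng = random.Random(1995)             # a fixed seed reproduces one variation
tests = [make_dma_test(rng) for _ in range(1000)]
sg_fraction = sum(t["target"] == "sg_tlb" for t in tests) / len(tests)
print(round(sg_fraction, 2))          # close to 0.30
```

Re-seeding the generator yields a different test sequence each time, which is how many workstations can each run distinct random variations of the same weighted test description.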

… DMA transactions run, self-checking, random CPU traffic to memory and I/O will also run. These events provide the random mix of interacting instructions. At the completion of the test, a miscomparison of the two DMA write regions indicates an error.

Graphics Demon

A number of PCI option cards were investigated as potential PCI demon cards. The requirements for a PCI demon card are twofold: it must be able to perform DMA of various lengths, and it must have memory for the storage of DMA and PIO data. The DEC ZLXp-E1 graphics card was selected because it offers …

… operation by loading the appropriate pattern into this memory. In reality, the architecture of the PCI demon has to diverge from this ideal model for at least two reasons. First, the PCI demon has to be able to emulate the fastest possible PCI device, and this forces the use of an ASIC. However, ASICs have limited memory capacity. It is desirable to store the scripts for many thousands of DMAs in this memory. The scripts are approximately 100 bits wide (64-bit PCI data and control) and require several megabytes of memory. This memory requirement forces the design to use external memory. Second, the PCI architecture has a few handshake …

… McPherson, Jim Nietupski, Sub Pal, Nick Paluzzi, Rick Rudman, Jim …

References

1. J. Edmondson et al., "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor," Digital Technical Journal, vol. 7, no. 1 (1995, this issue): 119-133.

3. S. Nadkarni et al., "Development of Digital's PCI Chip Sets and Evaluation Kit for the DECchip 21064 Microprocessor," Digital Technical Journal, vol. 6, no. 2 (Spring 1994): 49-61.

4. NCR 53C810 Data Manual (Dayton, Ohio: NCR Corporation, 1992).

5. A. Agarwal, Analysis of Cache Performance for Operating Systems and Multiprogramming (Boston: Kluwer Academic Publishers, 1989).

6. S. White and S. Dhawan, "POWER2: Next Generation of the RISC System/6000 Family," IBM Journal of Research and Development, vol. 38, no. 5 (September 1994).

7. W. Anderson, "Logical Verification of the NVAX CPU Chip Design," Digital Technical Journal, vol. 4, no. 3 (Summer 1992): 38-46.

Biographies

John E. Murray
A consulting engineer in the Alpha Personal Systems Group, John Murray was the logic design architect for the AlphaStation 600 5-series. In previous work, John led the design team for the instruction fetch and decode unit on the VAX 9000 system. Prior to joining Digital in 1982, he was with ICL in the United Kingdom. He holds eleven patents.

Paul J. Lemmon
Paul Lemmon joined Digital in 1987; he is a principal engineer. Paul was the ASIC team leader and the architect of the control ASIC for the AlphaStation 600 5-series. He was previously employed at Datapoint, where he was a design engineer/project engineer. Paul received a B.S. in electrical engineering from Ohio State University in 1980. He holds two patents.

John H. Zurawski
John Zurawski was the system architect for the AlphaStation 600 5-series workstation. Prior to this project, John was the system architect for the DECstation 5000 series of MIPS R4000 workstations. He has also led the verification effort for the DEC 3000 workstation and led the team that designed the floating-point unit for the VAX 8800 family. John holds a B.Sc. degree in physics (1976), and M.Sc. (1977) and Ph.D. (1980) degrees in computer science, all from Manchester University. John is a member of IEEE. He holds seven patents and has published six papers on computer technology. He joined Digital in 1982 after completing post-doctoral research at Manchester University.

Digital Technical Journal Vol. 7 No. 1 1995

William J. Bowhill, Shane L. Bell, Bradley J. Benschneider, Andrew J. Black, Sharon M. Britton, Ruben W. Castelino, Dale R. Donchin, John H. Edmondson, Harry R. Fair, III, Paul E. Gronowski, Anil K. Jain, Patricia L. Kroesen, Marc E. Lamere, Bruce J. Loughlin, Shekhar Mehta, Robert O. Mueller, Ronald P. Preston, Sribalan Santhanam, Timothy A. Shedd, Michael J. Smith, Stephen C. Thierauf

Circuit Implementation of a 300-MHz 64-bit Second-generation CMOS Alpha CPU

A 300-MHz, custom 64-bit VLSI, second-generation Alpha CPU chip has been developed. The chip was designed in a 0.5-µm CMOS technology using four levels of metal. The die size is 16.5 mm by 18.1 mm, contains 9.3 million transistors, operates at 3.3 V, and supports 3.3-V/5.0-V interfaces. Power dissipation is 50 W. It contains an 8-KB instruction cache; an 8-KB data cache; and a 96-KB unified second-level cache. The chip can issue four instructions per cycle and delivers 1,200 mips/600 MFLOPS (peak). Several noteworthy circuit and implementation techniques were used to attain the target operating frequency.

The Alpha 21164 chip is a 300-megahertz (MHz), quad-issue, custom very large-scale integration (VLSI) implementation of the Alpha architecture that delivers peak performance of 1,200 million instructions per second (mips)/600 million floating-point operations per second (MFLOPS). The chip is designed in a 0.5-micrometer (µm) complementary metal-oxide semiconductor (CMOS) technology using four levels of metal. The die measures 16.5 millimeters (mm) by 18.1 mm and contains 9.3 million transistors. It operates at 3.3 volts (V) and supports 3.3-V and 5.0-V interfaces. The chip dissipates 50 watts (W) at 300 MHz (internal clock frequency). Switching noise on the power supplies is controlled by an on-chip distributed coupling capacitance between power and ground of 160 nanofarads (nF). The chip contains an 8-kilobyte (KB), first-level (L1) instruction cache; an 8-KB L1 data cache; and a 96-KB second-level (L2) unified data and instruction cache.

This paper focuses on the circuit implementation of the Alpha 21164 CPU. Space does not permit a description of the complete design process utilized throughout the project. Instead, some of the significant circuit design challenges encountered during the project are discussed.
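As a quick sanity check, the headline peak rates follow directly from the clock frequency and the issue width. The sketch below is not from the paper; the two-floating-point-operations-per-cycle figure is an assumption inferred from the quoted 600 MFLOPS at 300 MHz.

```python
# Peak-rate arithmetic for the numbers quoted above.
CLOCK_MHZ = 300          # internal clock frequency
ISSUE_WIDTH = 4          # quad-issue: up to four instructions per cycle
FP_OPS_PER_CYCLE = 2     # assumed; inferred from the 600 MFLOPS figure

peak_mips = CLOCK_MHZ * ISSUE_WIDTH          # million instructions per second
peak_mflops = CLOCK_MHZ * FP_OPS_PER_CYCLE   # million FP operations per second

print(peak_mips, peak_mflops)  # 1200 600
```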
The paper begins with an introductory overview of the chip microarchitecture. It continues with a description of the floorplan and the physical layout of the chip. The next section discusses the clock distribution and latch design. This is followed by an overview of the circuit design strategy and some specific circuit design examples. The paper concludes with information about design (physical and electrical) verification and CAD tools.

Microarchitecture Overview

The Alpha 21164 chip is a completely new implementation of the Alpha architecture. Figure 1 shows a block diagram of the Alpha 21164 chip. The microprocessor consists of five functional units: the instruction fetch, decode, and branch unit (I-box); the integer execution unit (E-box); the memory management unit (M-box); the cache control and bus interface unit (C-box); and the floating-point unit (F-box).

Figure 2 Floorplan of the Alpha 21164 Chip

much as possible. The S-cache was implemented with metal 4 read and write buses spanning the entire height. This provided the access needed to both the top and bottom of the S-cache for routing to the I-cache and D-cache.

The D-cache supports two loads per cycle, requiring a dual-ported read design. The D-cache was implemented as two single-ported caches containing identical data instead of one dual-ported cache. The major consideration that led to this decision was the ability to share the single-ported design with the I-cache. Sharing the design also reduced the overall analysis and verification required.

Interconnect routing was another important part of the floorplanning process. Four metal layers were available for routing. The lower metal layers, metal 1 and metal 2, were used for local transistor hookup and signal routing. The upper metal layers, metal 3 and metal 4, were used primarily for clock, power, and ground distribution. When necessary, the upper metal layers were also used to route critical signals and long buses. The metal orientations were chosen to accommodate both the cache structures and the data paths of the functional units. With reference to Figure 2, metal 2 and metal 4 lines were arranged to run vertically and metal 1 and metal 3 were arranged to run horizontally. Most of the global routing was done by hand. Local cell routing was done by hand with some assistance from auto-routing CAD tools.

The upper metal layers are organized as a fine-pitched regular grid structure placed over the entire chip. The typical drawn line width used in this grid is 12 µm. Power and ground lines are alternated with a single clock line interspersed every few pairs. A limited number of critical signals and buses are also routed in metal 3 and metal 4. In the pad ring, metal 4 is used to
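The dual-ported-by-duplication scheme can be sketched behaviorally: stores update both single-ported copies so they stay identical, and the two loads in a cycle each read a different copy. This is a toy model to illustrate the idea, not the actual array design; the class and method names are invented for illustration.

```python
class DualLoadDCache:
    """Toy model of a dual-read cache built from two
    single-ported copies that hold identical data."""

    def __init__(self):
        self.copy_a = {}   # single-ported array serving load port 0
        self.copy_b = {}   # single-ported array serving load port 1

    def write(self, addr, data):
        # A store must update both copies to keep them identical.
        self.copy_a[addr] = data
        self.copy_b[addr] = data

    def dual_load(self, addr0, addr1):
        # Two independent loads per cycle, one per single-ported copy.
        return self.copy_a.get(addr0), self.copy_b.get(addr1)

cache = DualLoadDCache()
cache.write(0x100, 42)
cache.write(0x200, 7)
print(cache.dual_load(0x100, 0x200))  # (42, 7)
```

The design choice the paper describes trades area (two arrays) for a simpler, shareable single-ported array design.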


Figure 3 Schematic of Clock Distribution System

usage and allowed designers to utilize latches that had already been verified over a range of operating conditions and process corners.

The Alpha 21164 chip uses level-sensitive, transmission gate latches as shown in Figure 6. Two basic types of latches were developed: A-latches (Figures 6a and 6c) and B-latches (Figures 6b and 6d). The A-latches are open when CLK

The use of this latch style yields a 10 percent improvement in speed over the 21064 microprocessor. The additional skew in the clock, resulting from the local clock buffer delay, increases the possibility that data could race through a pair of latches during the transition of the clock. Although the overall skew of the internal clock is low, this was not considered sufficient
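The behavior of a level-sensitive latch pair can be sketched as follows. This is a behavioral model only, and it assumes the conventional arrangement in which A- and B-latches are transparent on opposite clock phases (the phase polarity is an assumption, since the text above is truncated at that point).

```python
class TransparentLatch:
    """Toy level-sensitive latch: passes its input while open
    (transparent) and holds the last value while closed."""

    def __init__(self, open_when_clk_high):
        self.open_when_clk_high = open_when_clk_high
        self.q = 0

    def evaluate(self, clk, d):
        if clk == self.open_when_clk_high:
            self.q = d       # transparent: output follows input
        return self.q        # closed: output holds last value

# A- and B-latches transparent on opposite phases (assumed polarity),
# so data advances one latch per phase and cannot race through both
# latches on a single clock transition.
a_latch = TransparentLatch(open_when_clk_high=True)
b_latch = TransparentLatch(open_when_clk_high=False)

# Phase 1: CLK high -- the A-latch captures new data, the B-latch holds.
a_out = b_out = 0
a_out = a_latch.evaluate(True, 1)
b_out = b_latch.evaluate(True, a_out)   # still holds its old value (0)
# Phase 2: CLK low -- the B-latch passes the value the A-latch captured.
b_out = b_latch.evaluate(False, a_out)
print(a_out, b_out)  # 1 1
```

The race hazard the text describes arises when clock skew lets both latches be momentarily open at once, which this two-phase model excludes by construction.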

Figure 5 Clock RC Delay

[Figure 6 schematics: panels include (b) B-latch and an AND gate A-latch, with signals CLK, DIN-B, DOUT-B, DIN1-B, and DIN2-B.]

Figure 6 Alpha 21164 Standard Latch Examples

at least one additional gate delay between all latches, thus guaranteeing a race-free design. Designers had the option of designating these gates as logic functions or simple inverters. The delay did not affect critical speed paths, since critical paths tended to have more than one delay between latches.

Circuit Design Strategy

Due to the complexity of the Alpha 21164 chip and the large size of the design team, a comprehensive design methodology was developed. A design guide was created to provide a consistent set of rules and methods for the development of circuit schematics and layout. This document helped ensure that all designers worked under the same design assumptions. In addition, it relieved time-consuming analysis of each circuit by providing guidelines and "rules of thumb" that guaranteed correct operation and minimized the possibility of reliability problems.

Guidelines for common circuit structures such as complementary, cascode, dynamic, and static circuits were created by characterizing their behavior over all process corners. Adequate noise margins were ensured by specifying operating envelopes for such design parameters as device size, stack height, and beta ratio. Reliability guidelines were specified for electromigration, hot carrier effects, and substrate charge injection. Most circuits were designed within the rules specified in the guide; however, a few circuit designs violated the rules. These designs were allowed only when performance and area advantages would be gained. These exceptions were carefully verified for functionality and reliability.

An extensive suite of in-house CAD tools was used to aid and structure the design process. In all cases, the tools supplemented the design process and automated repetitive work. Engineering judgment and iterative use of the software were required to create the final production schematics. Tools that aided schematic generation included a schematic editor, a logic synthesis tool, and a device-sizing tool. Post-schematic tools included a latching methodology checker, a circuit verifier that highlighted design methodology violations, and a timing verifier that analyzed potential critical speed paths. The use of the design tools varied across the chip, based on the degree of customized logic required. For example, the I-box did not rely heavily on the synthesis tools because of the need for optimized circuit structures. However, the C-box used the synthesis tools extensively to produce baseline schematics, which were then modified by hand as necessary.

Circuit Design Examples

The designers of the Alpha 21164 chip were faced with a number of implementation challenges. The most significant challenge was to design a chip that could run at 300 MHz, 50 percent faster than the previous Alpha implementation. Device scaling, process development, and architectural improvements delivered part, but not all, of the required speedup. The additional improvement had to be obtained using

circuit design techniques. Other challenges included a much more complicated microarchitecture and the reduction in latency of a number of instructions from the previous implementation. Finally, the large physical size of the chip also led to challenges in circuit design and power management.

The following sections describe several circuit design challenges encountered during the implementation of the Alpha 21164 chip.

I-box Design-Issue Stage Dynamic Dirty/Bypass Logic

The issue stage of the I-box coordinates the release of instructions into the E-box, F-box, and M-box pipelines. The deep pipelines and sophisticated memory management unit along with the high clock frequency presented significant challenges to the implementation team. The Alpha 21164 microarchitecture allows up to 37 instructions to be in progress at the same time (7 integer operates, 9 floating operates, and 21 loads that missed). Superscalar issue of 4 instructions requires that 8 operands and 4 new destinations must be checked against these 37 outstanding instructions in every cycle. In addition, 44 bypass paths are built into the E-box and F-box pipelines in the Alpha 21164 chip. Each of the 8 operands must be checked against several of these bypass paths to ensure that the most up-to-date data is forwarded to the issuing instruction.

The register comparisons were implemented using domino logic. As each instruction is issued, its destination register address is decoded into a 31-bit mask that is entered into a shift register that mimics the appropriate execution pipeline. The second domino stage is a bit-wise AND function of the operand/destination decode mask and the dirty bit mask followed by a zero detector (logical OR of the 31 bits). A transmission gate forms a second AND function in this stage that qualifies the detected register conflict with an instruction valid signal. The third domino stage is used to further qualify the detected conflict with instruction type decode information and to start a logical OR of the 38 conflict outputs into a single stall wire. In the case of bypasses, the third domino stage is used to priority-encode the bypasses so that only the most up-to-date data is bypassed.

Special attention was given to several circuit design issues when the domino logic was implemented. Careful preplanning of the routing provided large lateral spacing on the dynamic lines to reduce coupling. Noise margins were protected by ensuring that all dynamic inputs were driven from local inverters with a common ground reference. Charge-share problems in the large second domino stage (31-bit-wide AND-OR function) were minimized due to the fact that only a single bit will be set in the new instruction's operand decode bit mask, which is used as the upper input in the 31 AND stacks. Therefore, only a single internal node may charge-share with the large output capacitance.

Another critical concern in such a large dynamic structure was power consumption. The logic was implemented in such a way as to minimize the number of nodes that discharge each phase. To minimize short-circuit currents, the second and third domino stages are precharged by means of matched delay signals
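The dirty-register check described above can be sketched at the bit level. This is a behavioral model only, not the domino circuit: one-hot 31-bit masks stand in for the decoded destination register addresses, the OR over in-flight destinations stands in for the dirty bit mask, and the final `any` stands in for the AND-followed-by-zero-detect. The function and variable names are invented for illustration.

```python
REG_MASK_BITS = 31  # width of the one-hot destination decode mask

def decode(reg):
    """One-hot decode of a register number into a 31-bit mask."""
    assert 0 <= reg < REG_MASK_BITS
    return 1 << reg

def dirty_conflict(operand_regs, in_flight_dests):
    """Bit-wise AND of each operand's decode mask against the dirty
    bit mask of outstanding destinations, followed by a zero detect:
    any nonzero AND result is a register conflict (a stall)."""
    dirty_mask = 0
    for dest in in_flight_dests:
        dirty_mask |= decode(dest)          # accumulate dirty bits
    return any(decode(r) & dirty_mask for r in operand_regs)

# R5's producer is still in flight, so an instruction reading R5 stalls;
# an instruction reading only R3 and R7 does not.
print(dirty_conflict([5, 7], in_flight_dests=[5, 12]))   # True
print(dirty_conflict([3, 7], in_flight_dests=[5, 12]))   # False
```

Because each decode mask is one-hot, at most one bit of the AND can fire per operand, which mirrors the paper's observation that only a single internal node of the wide AND-OR stage can charge-share with the output.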