


Cover Design
The performance advantage of very large memory technology for commercial applications is a major theme in this issue of the Journal. The cover is a collage of images from the development of the AlphaServer 4100 four-processor symmetric multiprocessing system, which offers 8 gigabytes of memory and industry leadership performance. The system is characterized not only by very large memory but by low latency, high bandwidth, and 400-megahertz Alpha microprocessors. The cover design is by Lucinda O'Neill of DIGITAL's Corporate Design Group.

Contents

ALPHASERVER 4100 SYSTEM

AlphaServer 4100 Performance Characterization
Zarka Cvetanovic and Darrel D. Donaldson

The AlphaServer 4100 Cached Processor Module Architecture and Design
Maurice B. Steinman, George J. Harris, Andrej Kocev, Virginia C. Lamere, and Roger D. Pannell

The AlphaServer 4100 Low-cost Clock Distribution System
Roger A. Dame

Design and Implementation of the AlphaServer 4100 CPU and Memory Architecture
Glenn A. Herdeg

High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System
Samuel H. Duncan, Craig D. Keefer, and Thomas A. McLaughlin

ORACLE AND SYBASE DATABASE PRODUCTS FOR VLM

Design of the 64-bit Option for the Oracle7 Relational Database Management System
Vipin Gokhale

VLM Capabilities of the Sybase System 11 SQL Server

INSTRUCTION EXECUTION ON ALPHA PROCESSORS

Measured Effects of Adding Byte and Word Instructions to the Alpha Architecture
David P. Hunter and Eric B. Betts

Vol. 8 No. 4 1996

Editor's Introduction

Just 40 years ago, a machine called the TX-0, a successor to Whirlwind, was built at MIT's Lincoln Laboratory to find out, among other things, if a core memory as large as 64 Kwords could be built. Over the years memory sizes have grown so large that, in the '90s, the industry has felt the need to characterize memory in big machines as very large. At five orders of magnitude greater in size than the TX-0 memory, the AlphaServer 4100 8-gigabyte memory is indeed very large.

The AlphaServer 4100 is a symmetric multiprocessing system designed for industry-leading performance at a low cost. The system accommodates up to four 64-bit Alpha 21164 microprocessors operating at 400 megahertz, four 64-bit PCI bus bridges, and 8 gigabytes of main memory. Opening the section about the 4100 system, Zarka Cvetanovic and Darrel Donaldson describe the project team's performance characterization of different AlphaServer 4100 models under technical and commercial workloads. Both the process and the findings are of interest. As one example set of data demonstrates, the model 5/300 is not only faster than its DIGITAL predecessors but 30 to 60 percent faster than a comparative industry platform when running memory-intensive workloads from the SPECfp95 benchmark.

The four papers that follow examine areas of the system that challenged designers to keep costs low and at the same time deliver high performance. The AlphaServer 4100 cached processor module design is presented by Mo Steinman, George Harris, Andrej Kocev, Ginny Lamere, and Roger Pannell. Built around the Alpha 21164 64-bit RISC microprocessor, the module is the first from DIGITAL to employ a high-performance, cost-effective synchronous cache rather than a traditional asynchronous one. Next, Roger Dame reviews the clock distribution system and the use of off-the-shelf phase-locked loop technology. Glenn Herdeg then describes the design and implementation of the CPU and memory architecture, which takes advantage of the multiple-issue feature of the Alpha 21164 microprocessor. Closing the section on the 4100 design is the I/O subsystem's contribution to the system goals of low latency and high memory and I/O bandwidth. Sam Duncan, Craig Keefer, and Tom McLaughlin present several innovative techniques developed for the system bus-to-PCI bus bridge design, including partial cache line writes, peer-to-peer transactions across PCI bridges, and support for large bursts of data.

All efforts to make the hardware run faster are for the benefit of the applications that run on those systems. A paper from Oracle Corporation and another from Sybase, Inc., examine ways in which their respective database systems take advantage of VLM. Vipin Gokhale describes the 64 Bit Option implementation for the Oracle7 relational database system. A primary project goal was to demonstrate a clear performance benefit for decision support systems and online transaction processing. The author summarizes data that show a clear benefit for a database with the 64 Bit Option enabled running on the AlphaServer 8400 with 8 gigabytes of memory; in some cases, the performance increase was 200 times that of the standard configuration. Sybase engineers T.K. Rengarajan, Max Berenson, Ganesan Gopal, Bruce McCready, and their coauthors then describe the VLM capabilities of the Sybase System 11 SQL Server.

The final paper examines the measured effects of adding byte and word instructions to the Alpha Architecture. David Hunter and Eric Betts describe the process of analyzing how these additions affect the performance of a commercial database. For testing, the team used prototype hardware, rebuilt Microsoft Corporation's SQL Server to use the new instructions, and ran the TPC-B benchmark.

The editors thank Darrel Donaldson of the AlphaServer 4100 team and Kurt Chung of the Database Application Partners group for their efforts to acquire the papers presented in this issue. Our upcoming issue will feature CMOS-6 process technologies.

Jane C. Blake
Managing Editor

Zarka Cvetanovic
Darrel D. Donaldson

AlphaServer 4100 Performance Characterization

The AlphaServer 4100 is the newest four-processor symmetric multiprocessing addition to DIGITAL's line of midrange Alpha servers. The DIGITAL AlphaServer 4100 family, which consists of models 5/300E, 5/300, and 5/400, has major platform performance advantages as compared to previous-generation Alpha platforms and leading industry midrange systems. The primary performance strengths are low memory latency, high bandwidth, low-latency I/O, and very large memory (VLM) technology. Evaluating the characteristics of both technical and commercial workloads against each family member yielded recommendations for the best application match for each model. The performance of the model with no module-level cache and the advantages of using 2- and 4-megabyte module-level caches are quantified. The profiles based on the built-in performance monitors are used to evaluate cycles per instruction, stall time, multiple-issuing benefits, instruction frequencies, and the effect of cache misses, branch mispredictions, and replay traps. The authors propose a time allocation-based model for evaluating the performance effects of various stall components and for predicting future performance trends.

The AlphaServer 4100 is DIGITAL's latest four-processor symmetric multiprocessing (SMP) midrange Alpha server. This paper characterizes the performance of the AlphaServer 4100 family, which consists of three models:

1. AlphaServer 4100 model 5/300E, which has up to four 300-megahertz (MHz) Alpha 21164 microprocessors, each without a module-level, third-level, write-back cache (B-cache) (a design referred to as uncached in this paper)

2. AlphaServer 4100 model 5/300, which has up to four 300-MHz Alpha 21164 microprocessors, each with a 2-megabyte (MB) B-cache

3. AlphaServer 4100 model 5/400, which has up to four 400-MHz Alpha 21164 microprocessors, each with a 4-MB B-cache

The performance analysis undertaken examined a number of workloads with different characteristics, including the SPEC95 benchmark suites (floating-point and integer), the LINPACK benchmark, AIM Suite VII (a UNIX multiuser benchmark), the TPC-C transaction processing benchmark, image rendering, and memory latency and bandwidth tests. Note that both commercial (AIM and TPC-C) and technical/scientific (SPEC, LINPACK, and image rendering) classes of workloads were included in this analysis.

The results of the analysis indicate that the major AlphaServer 4100 performance advantages result from the following server features:

- Significantly higher bandwidth (up to 2.6 times) and lower latency compared to the previous-generation midrange AlphaServer platforms and leading industry midrange systems. These improvements benefit the large, multistream applications that do not fit in the B-cache. For example, the AlphaServer 4100 5/300 is 30 to 60 percent faster than the HP 9000 K420 server in the memory-intensive workloads from the SPECfp95 benchmark suite.

(Note that all competitive performance data presented in this paper is valid as of the submission of this paper in July 1996. The references cited refer the reader to the literature and the appropriate Web sites for the latest performance information.)

- An expanded very large memory (VLM). The maximum memory size increased from 2 gigabytes (GB) to 8 GB without sacrificing CPU slots. This increase in memory size benefits primarily the commercial, multistream applications. For example, the AlphaServer 4100 5/300 server achieves approximately twice the throughput of the Compaq ProLiant 4500 server and 1.4 times the throughput of the AlphaServer 2100 on the AIM Suite VII benchmark tests.

- A 4-MB B-cache and a clock speed of 400 MHz in the AlphaServer 4100 5/400 system. The larger B-cache size and 33 percent faster clock resulted in a 30 to 40 percent performance improvement over the AlphaServer 4100 5/300 system.

The performance improvement from the larger B-cache increases with the number of CPUs. For example, the AlphaServer 4100 5/300 system with its 2-MB B-cache design performs 5 to 20 percent faster with one CPU and 30 to 50 percent faster with four CPUs than the uncached 5/300E system. The majority of workloads included in this analysis benefit from the B-cache; however, the uncached system outperforms the cached implementation by 10 to 20 percent for large applications that do not fit in the 2-MB B-cache.

The performance counter profiles, based on the built-in hardware monitors, indicate that the majority of issuing time is spent on single and dual issuing and that a small number of floating-point workloads take advantage of triple and quad issuing. The load/store instructions make up 30 to 40 percent of all instructions issued. The stall time associated with waiting for data that missed in the various levels of the cache hierarchy accounts for the most significant portion of the time the server spends processing commercial workloads.

Memory Latency

The AlphaServer 4100 achieves lower memory latency than previous-generation servers based on the Alpha 21164 microprocessor and than multiprocessor products by leading industry vendors. The major benefits come from the simpler interface, the use of synchronous dynamic random-access memory (DRAM) chips (i.e., synchronous memory), and the lower fill time. Figure 1 shows the measured memory load latency using the lmbench benchmark with a 512-byte stride. In this benchmark, each load depends on the result from the previous load, and therefore latency is not a good measure of performance for systems that can have multiple outstanding loads. (AlphaServer 4100 systems can have up to two outstanding requests per CPU on the bus.) The lmbench benchmark data indicates that the AlphaServer 4100 has the lowest memory latency of all industry-leading reduced-instruction set computing (RISC) vendors' servers.

As shown in Figure 2, using a slightly different workload where there is no dependence between consecutive loads, the AlphaServer 4100 achieves even lower per-load latency, since the latency for two consecutive loads can be overlapped. The plateaus in Figure 2 show the load latency at each of the following levels of the cache/memory hierarchy: the 8-kilobyte (KB) on-chip data cache (D-cache), the 96-KB on-chip secondary instruction/data cache (S-cache), the 2- and 4-MB off-chip B-caches (except for model 5/300E), and memory. The uncached AlphaServer 4100 5/300E achieves an 85 percent lower memory load latency than the previous-generation AlphaServer 2100. The AlphaServer 4100 5/300, with its 2-MB B-cache, increases memory latency 30 percent for load operations and 6 percent for store operations compared to the uncached 5/300E system because of the time spent checking for data in the B-cache.
The synchronous memory shows one cycle lower latency than the asynchronous extended data out (EDO) DRAM (i.e., asynchronous memory), which results in 9 percent faster load operations and 5 percent faster store operations. Note that the cached AlphaServer 4100 and AlphaServer 8200 systems, which have the same clock speed of 300 MHz, achieve comparable memory latency.
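The dependent-load measurement is easy to reproduce in miniature. The C sketch below uses the same pointer-chasing idea as lmbench: each load returns the address used by the next load, so the chain runs at full memory latency. The array size, iteration count, and clock()-based timing are illustrative assumptions, not the lmbench source.

/*
 * A minimal sketch of the dependent-load technique: each load
 * supplies the address of the next, so the chain exposes the full
 * memory latency.  The 512-byte stride matches the measurement
 * described in the text; sizes and timing are assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE  512                 /* bytes between chained elements */
#define ARRAY   (64 * 1024 * 1024)  /* larger than any cache level */
#define LOADS   10000000

int main(void)
{
    char *mem = malloc(ARRAY);
    size_t n = ARRAY / STRIDE, i;

    /* Build a pointer chain: element i points to element i+1. */
    for (i = 0; i < n - 1; i++)
        *(char **)(mem + i * STRIDE) = mem + (i + 1) * STRIDE;
    *(char **)(mem + (n - 1) * STRIDE) = mem;   /* wrap around */

    char **p = (char **)mem;
    clock_t t0 = clock();
    for (i = 0; i < LOADS; i++)
        p = (char **)*p;            /* load i+1 depends on load i */
    clock_t t1 = clock();

    /* Print p so the compiler cannot discard the chase loop. */
    printf("%p: %.1f ns/load\n", (void *)p,
           1e9 * (t1 - t0) / CLOCKS_PER_SEC / LOADS);
    free(mem);
    return 0;
}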

[Figure 1: Bar chart of lmbench dependent-load memory latency (stride = 512 bytes) for the AlphaServer 4100 5/400 (400 MHz), AlphaServer 4100 5/300 (300 MHz), AlphaServer 4100 5/300E (300 MHz), Intel (200 MHz), Sun UltraSPARC (167 MHz), SGI POWER CHALLENGE (200 MHz), and IBM RS/6000 43P PowerPC (133 MHz); memory latency axis from 0 to 1,200 nanoseconds.]

Figure 1
lmbench Benchmark Test Results Showing Memory Latency for Dependent Loads

Memory Bandwidth

The AlphaServer 4100 system bus achieves a peak bandwidth of 1.06 gigabytes per second (GB/s). The STREAM McCalpin benchmark measures sustainable memory bandwidth in megabytes per second (MB/s) across four vector kernels: Copy, Scale, Sum, and SAXPY. Figure 3 shows measured memory bandwidth using the Copy kernel from the STREAM benchmark. Note that the STREAM bandwidth is 33 percent lower than the actual bandwidth observed on the AlphaServer 4100 bus because the bus data cycles are allocated for three transactions: read source, read destination, and write destination. The AlphaServer 4100 shows the best memory bandwidth of all multiprocessor platforms designed to support up to four CPUs. The platforms designed to support more than four CPUs (i.e., the AlphaServer 8400, the Silicon Graphics POWER CHALLENGE R10000, and the Sun Ultra Enterprise 6000 systems) show a higher bandwidth for four CPUs than the AlphaServer 4100.

The STREAM bandwidth on the AlphaServer 4100 5/300 is 2.2 times higher than on the previous-generation AlphaServer 2100 5/250 (2.6 times higher with the AlphaServer 4100 5/400). The uncached AlphaServer 4100 model shows 22 percent higher memory bandwidth than the cached model 5/300. The AlphaServer 4100 memory bandwidth improvement from synchronous memory compared to EDO ranges from 8 to 12 percent. The synchronous memory benefit increases with the number of CPUs, as shown in Table 1.

Low memory latency and high bandwidth have a significant effect on the performance of workloads that do not fit in 2- to 4-MB B-caches. For example, the majority of the SPECfp95 benchmarks do not fit in the 2-MB cache. (Figure 20, which appears later in this paper, shows the cache misses.) The SPECfp95 performance comparison presented in Figure 4 shows that the uncached AlphaServer 4100 5/300E system outperforms the 2-MB B-cache model 5/300 in the benchmarks with the highest number of B-cache misses (tomcatv, swim, applu, and hydro2d). The performance of the uncached model 5/300E is comparable to that of the 4-MB B-cache model 5/400 for the swim benchmark. However, the benchmarks that fit better in the 4-MB cache (apsi and wave5) run significantly slower on the 5/300E than on the 5/400.
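For reference, the Copy kernel itself is a one-line loop. The sketch below shows it together with the bandwidth accounting the text describes: STREAM credits two transfers per element (one read, one write), while a write-allocate bus performs three, which is why the reported figure sits 33 percent below the bus bandwidth. Sizes and the timing method are illustrative assumptions, not the STREAM source.

/*
 * A sketch of the STREAM Copy kernel.  The arrays are sized (as an
 * assumption) to exceed the largest B-cache so the loop streams
 * through main memory.
 */
#include <stdio.h>
#include <time.h>

#define N (4 * 1024 * 1024)        /* 32 MB per array of doubles */
static double a[N], c[N];

int main(void)
{
    long j;
    for (j = 0; j < N; j++)
        a[j] = 1.0;

    clock_t t0 = clock();
    for (j = 0; j < N; j++)        /* Copy kernel: c = a */
        c[j] = a[j];
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    /* STREAM counts 16 bytes per iteration (one read, one write);
     * a write-allocate bus actually moves ~1.5 times this amount. */
    double mbs = 16.0 * N / secs / 1e6;
    printf("Copy: %.1f MB/s\n", mbs);
    return (int)c[N - 1];          /* keep c live */
}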

[Figure 2: Plot of independent-load latency (stride = 64 bytes) versus data set size for the AlphaServer 4100 5/300E, AlphaServer 4100 5/300, AlphaServer 4100 5/400, AlphaServer 8200 5/300, and AlphaServer 2100 5/300.]

Figure 2
Cache/Memory Latency for Independent Loads

[Figure 3: STREAM McCalpin memory Copy bandwidth versus number of CPUs (1 to 6) for the AlphaServer 8400 5/300 and 5/350, IBM RS/6000-990, SGI POWER CHALLENGE R10000, AlphaServer 4100 5/300E, 5/300, and 5/400, HP 9000 J210, AlphaServer 2100 5/250, Sun SPARCserver 2000E, Intel Alder Pentium Pro, and Sun Ultra Enterprise 6000.]

Figure 3
STREAM McCalpin Memory Copy Bandwidth Comparison

Table 1
Bandwidth Improvement from Synchronous Memory Compared to Asynchronous Memory

Number of CPUs            1     2     3     4
Bandwidth improvement     8%    8%    9%    12%

Figure 4 shows that the AlphaServer 4100 5/300 system has a significant (up to two times) performance advantage over the previous-generation AlphaServer 2100 system in the SPECfp95 benchmark tests with the highest number of B-cache misses. The SPECfp95 tests indicate that the 300-MHz AlphaServer 4100 is more than 50 percent faster than the HP 9000 K420 server, and the 400-MHz AlphaServer 4100 is twice as fast as the HP 9000 K420 in the SPECfp95 benchmarks that stress the memory subsystem.

SPEC95 Benchmarks

The SPEC95 benchmark suites measure processor, memory subsystem, and compiler performance. The integer SPEC95 suite (CINT95) contains eight compute-intensive integer benchmarks written in C and includes the benchmarks shown in Table 2. The floating-point SPEC95 suite (CFP95) contains 10 compute-intensive floating-point benchmarks written in FORTRAN and includes the benchmarks shown in Table 3. The SPEC Homogeneous Capacity Method (SPEC95 rate) measures how fast an SMP system can perform multiple CINT95 or CFP95 copies (tasks). The SPEC95 rate metric measures the throughput of the system running a number of tasks and is used for evaluating multiprocessor system performance.

Table 2
CINT95 Benchmarks (SPECint95)

Benchmark       Description
099.go          Artificial intelligence, plays the game of Go
124.m88ksim     A Motorola 88100 microprocessor simulator
126.gcc         A GNU C compiler that generates optimized machine code
129.compress    An in-memory data compression program
130.li          A Lisp interpreter
132.ijpeg       A JPEG image compression/decompression program
134.perl        An interpreter for the Perl language
147.vortex      An object-oriented database program

Table 3
CFP95 Benchmarks (SPECfp95)

Benchmark       Description
101.tomcatv     A fluid dynamics mesh generation program
102.swim        A weather prediction shallow water model
103.su2cor      A quantum physics particle mass computation (Monte Carlo)
104.hydro2d     An astrophysics hydrodynamical Navier-Stokes equation
107.mgrid       A multigrid solver in a 3-D potential field (electromagnetism)
110.applu       Parabolic/elliptic partial differential equations (fluid dynamics)
125.turb3d      A program that simulates turbulence in a cube
141.apsi        A program that simulates temperature, wind, velocity, and pollutants (weather prediction)
145.fpppp       A quantum chemistry program that performs multielectron derivatives
146.wave5       A solver of Maxwell's equations on a Cartesian mesh (electromagnetics)

[Figure 4: Bar chart comparing SPECfp95 benchmark performance (per benchmark, scale 0 to 35) on the HP 9000 K420, AlphaServer 2100 5/300, and AlphaServer 4100 5/400, 5/300, and 5/300E.]

Figure 4
SPECfp95 Benchmarks Performance Comparison

Figure 5 compares the SPEC95 performance of the AlphaServer 4100 systems to that of the other industry-leading vendors using published results as of July 1996. Figure 6 shows the same comparison in the multistream SPEC95 rates. Note that all the SPEC95 comparisons in this paper are based on the peak results that include extensive compiler optimizations. Figure 5 indicates that even the uncached AlphaServer 4100 5/300E performs better than the HP 9000 K420 system, and the AlphaServer 4100 5/400 shows approximately a two times performance advantage over the HP system. The AlphaServer 4100 5/300 SPECint95 performance exceeds that of the Intel Pentium Pro system, and the AlphaServer 4100 5/300 SPECfp95 performance is double that of the Pentium Pro. The AlphaServer 4100 5/400 is 1.5 times (SPECint95) and 2.5 times (SPECfp95) faster than the Pentium Pro system. The multiple-processor SPECfp95 on the AlphaServer 4100 is obtained by decomposing benchmarks using the KAP preprocessor from Kuck & Associates. Note that the cached four-CPU AlphaServer 4100 5/300 outperforms the Sun Ultra Enterprise 3000 with six CPUs in the SPECfp95 parallel test. The performance benefit of the cached versus the uncached AlphaServer 4100 is greater in multiprocessor configurations than in uniprocessor configurations.

[Figure 6: SPEC95 throughput (SPECint_rate95 and SPECfp_rate95) for the AlphaServer 4100 5/300E, 5/300, and 5/400 (4 CPUs each); HP 9000 K420 PA-RISC 7200 120 MHz (4 CPUs); Sun Ultra Enterprise 3000 UltraSPARC 167 MHz (4 CPUs); Intel C Alder Pentium Pro 200 MHz (1 CPU); and IBM RS/6000 J40 PowerPC 604 112 MHz (6 CPUs).]

Figure 6
SPEC95 Throughput Results (SPEC95 Rates)

SPEC95 Multistream Performance Scaling

Figures 7 and 8 show SPEC95 multistream performance as the number of CPUs increases.

[Figure 5: SPEC95 speed results (SPECint95 with 1 CPU, SPECfp95 with 1 CPU, and SPECfp95 with 4 CPUs; 6 CPUs for the Sun system) for the AlphaServer 4100 5/300E, 5/300, and 5/400; HP 9000 K420 PA-RISC 7200 (120 MHz); Sun Ultra Enterprise 3000 UltraSPARC (167 MHz); SGI POWER CHALLENGE R10000 (195 MHz); Intel C Alder Pentium Pro (200 MHz); and IBM RS/6000 43P PowerPC 604e (166 MHz).]

Figure 5
SPEC95 Speed Results

The SMP scaling on the AlphaServer 4100 is comparable to that on the AlphaServer 2100 for integer workloads (which fit in the 5/300 2-MB B-cache). Note that SPECint_rate95 scales proportionally to the number of CPUs in the majority of systems, since these workloads do not stress the memory subsystem. The SMP scaling in SPECfp_rate95 is lower, since the majority of these workloads do not fit in 1- to 4-MB caches. In the majority of applications, the AlphaServer 4100 5/300 and 5/400 systems improve SMP scaling compared to the uncached AlphaServer 4100 5/300E by reducing the bus traffic (from fewer B-cache misses) and by taking advantage of the duplicate tag store (DTAG) to reduce the number of S-cache probes. The cached 5/300 scaling, however, is lower than the uncached 5/300E scaling in memory bandwidth-intensive applications (e.g., tomcatv and swim). The analysis of traces collected by the logic analyzer that monitors system bus traffic indicates that the lower scaling is caused by (1) SetDirty overhead, where SetDirty is a cache coherency operation used to mark data as modified in the initiating CPU's cache; (2) stall cycles on the memory bus; and (3) memory bank conflicts.

[Figure 7: SPECint_rate95 versus number of CPUs (1 to 4) for the AlphaServer 4100 5/300E, 5/300, and 5/400; AlphaServer 2100 5/300; HP 9000 K420; Sun Ultra Enterprise 3000; and IBM RS/6000 J40.]

Figure 7
SPECint-rate95 Performance Scaling

Symmetric Multiprocessing Performance Scaling for Parallel Workloads

Parallel workloads have higher data sharing and lower memory bandwidth requirements than multistream workloads. As shown in Figures 9 and 10, the AlphaServer 4100 models with module-level caches improve the SMP scaling compared to the uncached AlphaServer 4100 model in the LINPACK 1000 x 1000 (million floating-point operations per second [MFLOPS]) and the parallel SPECfp95 benchmarks.

[Figure 8: SPECfp_rate95 versus number of CPUs (1 to 4) for the AlphaServer 4100 5/300E, 5/300, and 5/400; AlphaServer 2100 5/300; HP 9000 K420; Sun Ultra Enterprise 3000; and IBM RS/6000 J40.]

Figure 8
SPECfp-rate95 Performance Scaling

Very Large Memory Advantage: Commercial Performance

As shown in Figures 11 and 12, the AlphaServer 4100 performs well in the commercial benchmarks TPC-C and AIM Suite VII. In addition to the low memory and I/O latency, the AlphaServer 4100 takes advantage of the VLM design in these I/O-intensive workloads: with four CPUs, the platform can support up to 8 GB of memory compared to 1 GB of memory on the AlphaServer 2100 system with four CPUs and 2 GB with three CPUs.

[Figures 9 and 10: LINPACK 1000 x 1000 parallel MFLOPS (1 to 4 CPUs) and parallel SPECfp95 (1 to 6 CPUs) for the AlphaServer 4100 5/300E, 5/300, and 5/400 and the AlphaServer 2100 5/300, compared with systems including the HP 9000 K420, Sun Ultra Enterprise 3000, SGI Origin 2000 R10000 (195 MHz), IBM ES/9000 VF, and HP Exemplar S-Class PA 8000 (180 MHz).]

Figure 9
LINPACK 1000 x 1000 Parallel Performance Scaling

Figure 10
Parallel SPECfp95 Performance Scaling

[Figure 11: TPC-C throughput (tpmC, 0 to 7,000 transactions per minute) for the IBM RS/6000 J30 (8 CPUs), Compaq ProLiant 4500/166, Sun SPARCserver 2000E, and AlphaServer 4100 5/400.]

Figure 11
Transaction Processing Performance (TPC-C Using an Oracle Database)

[Figure 12: AIM Suite VII throughput (0 to 3,500 jobs per minute) for the Compaq ProLiant 4500 Pentium (166 MHz), Compaq ProLiant 5000 6/200 Pentium Pro (200 MHz), and AlphaServer 4100 models. *These internally generated results have not been AIM certified.]

Figure 12
AIM Suite VII Multiuser/Shared UNIX Mix Performance

Figures 11 and 12 show the AlphaServer 4100 system's TPC-C performance (using an Oracle database) and AIM Suite VII throughput performance as compared to other industry-leading vendors. Note that the performance of the uncached AlphaServer 4100 5/300E is comparable to that of the 300-MHz AlphaServer 2100. (The AlphaServer 2100 system used in this test had three CPUs and 2 GB of memory, whereas the AlphaServer 4100 system had four CPUs and 2 GB of memory.)

With its 2-MB B-cache, the AlphaServer 4100 5/300 improves throughput by 40 percent in the AIM Suite VII benchmark tests as compared to the uncached AlphaServer 4100 5/300E. The AlphaServer 4100 5/400, with its 4-MB B-cache, benefits from its 33 percent faster clock and two times larger B-cache and provides 40 percent improvement over the AlphaServer 4100 5/300. Note that the AlphaServer 4100 5/300 and 5/300E results were obtained through internal testing and have not been AIM certified. The AlphaServer 5/400 results have AIM certification.

Compared to the best published industry AIM Suite VII performance, the AlphaServer 4100 5/300 throughput is almost twice that of the Compaq ProLiant 4500 server, and the AlphaServer 4100 5/400 throughput is more than 50 percent higher than that of the Compaq ProLiant 5000 server. At the October 1996 UNIX Expo, the AlphaServer 4100 family won three AIM Hot Iron Awards: for the best performance on the Windows NT operating system (for systems priced at more than $50,000) and for the best price/performance in two UNIX mixes, multiuser shared and file system (for systems priced at more than $150,000).

Cache Improvement on the AlphaServer 4100 System

Figures 13 and 14 show the percentage performance improvement provided by the 2-MB B-cache in the AlphaServer 4100 5/300 as compared to the uncached AlphaServer 4100 5/300E. Figure 13 shows the improvement across a variety of workloads; Figure 14 shows the improvement in individual SPEC95 benchmarks for one and four CPUs.

As shown in Figure 13, the 2-MB B-cache in the AlphaServer 4100 5/300 improves the performance by 5 to 20 percent for one CPU and 25 to 40 percent for four CPUs as compared to the uncached AlphaServer 4100 5/300E system. The benefits derived from having larger caches are significantly greater for four CPUs compared to one CPU, since large caches help alleviate bus traffic in multiprocessor systems.

[Figure 13: Performance improvement from the 2-MB B-cache (0 to 45 percent) for AIM Suite VII maximum users and jobs per minute (4 CPUs), LINPACK 1000 x 1000 (1 and 4 CPUs), SPECint92 and SPECfp92 (1 and 4 CPUs), and SPECint95 and SPECfp95 (1 and 4 CPUs).]

Figure 13
Performance Improvement across Various Workloads from a 2-MB B-Cache

The workloads that do not fit in the 2- to 4-MB B-cache (i.e., tomcatv, swim, and applu) in Figure 14 run faster on the uncached AlphaServer 4100 than on the cached AlphaServer 4100 (up to 10 percent faster on one CPU and 20 percent faster on four CPUs) due to the overhead for probing the B-cache and the increase in SetDirty bandwidth. The majority of the other workloads benefit from larger caches.

The AlphaServer 4100 5/400 further improves the performance by increasing the size of the B-cache from 2 MB to 4 MB. In addition, the CPU clock improvement of 33 percent, the B-cache improvement of 7 percent in latency and 11 percent in bandwidth, and the memory bus speed improvement of 11 percent together yield an overall 30 to 40 percent improvement in the AlphaServer 4100 model 5/400 performance as compared to that of the AlphaServer 4100 model 5/300.

Large Scientific Applications: Sparse LINPACK

The Sparse LINPACK benchmark solves a large, sparse symmetric system of linear equations using the conjugate gradient (CG) iterative method. The benchmark has three cases, each with a different type of preconditioner. Cases 1 and 2 use the incomplete Cholesky (IC) factorization as the preconditioner, whereas Case 3 uses the diagonal preconditioner. This workload is representative of large scientific applications that do not fit in megabyte-size caches. The workload is important in large applications, e.g., models of electrical networks, economic systems, diffusion, radiation, and elasticity. It was decomposed to run on multiprocessor systems using the KAP preprocessor.

Figure 15 shows that the uncached AlphaServer 4100 5/300E outperforms the AlphaServer 8400 by 41 percent for one CPU and by 9 percent for two CPUs because of higher delivered system bus bandwidth. However, the AlphaServer 4100 5/300E falls behind with three and four CPUs, as it does in the McCalpin memory bandwidth tests shown in Figure 3. Note that with one CPU, the 300-MHz uncached AlphaServer 4100 performs at the same level as the 400-MHz cached AlphaServer 4100 and performs 18 percent better than the 300-MHz cached AlphaServer 4100. This is an example of the type of application for which the cache diminishes the performance. The AlphaServer 4100 5/300E is a better match for this class of applications than the cached systems.
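For readers unfamiliar with the method, the following compact C sketch shows the Case 3 configuration, a conjugate gradient iteration with a diagonal (Jacobi) preconditioner, applied to a small stand-in matrix. The tridiagonal stencil, sizes, and tolerance are illustrative assumptions, not the Sparse LINPACK code.

/*
 * Diagonally preconditioned conjugate gradient on an SPD
 * tridiagonal system A = tridiag(-1, 4, -1), A*x = b.
 */
#include <stdio.h>
#include <math.h>

#define N 1000

static void matvec(const double *x, double *y)  /* y = A*x */
{
    for (int i = 0; i < N; i++) {
        y[i] = 4.0 * x[i];
        if (i > 0)     y[i] -= x[i - 1];
        if (i < N - 1) y[i] -= x[i + 1];
    }
}

static double dot(const double *u, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += u[i] * v[i];
    return s;
}

int main(void)
{
    static double x[N], b[N], r[N], z[N], p[N], q[N];
    for (int i = 0; i < N; i++) { b[i] = 1.0; x[i] = 0.0; }

    /* r = b - A*x = b (x starts at zero); z = D^{-1} r is the
     * diagonal preconditioning step, which is trivially parallel. */
    for (int i = 0; i < N; i++) { r[i] = b[i]; z[i] = r[i] / 4.0; p[i] = z[i]; }

    double rz = dot(r, z);
    int it;
    for (it = 0; it < N && sqrt(dot(r, r)) > 1e-10; it++) {
        matvec(p, q);
        double alpha = rz / dot(p, q);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        for (int i = 0; i < N; i++) z[i] = r[i] / 4.0;   /* M = diag(A) */
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
    }
    printf("converged in %d iterations, x[0] = %f\n", it, x[0]);
    return 0;
}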

[Figure 14: SPEC95 performance improvement from the 2-MB B-cache per benchmark (including 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d, 107.mgrid, and 110.applu) for one and four CPUs; scale -20 to 120 percent improvement.]

Figure 14
SPEC95 Performance Improvement from a 2-MB B-Cache

Image Rendering

The AlphaServer 4100 shows a significant performance advantage in image rendering applications compared to the other industry-leading vendors. Figure 16 shows that the AlphaServer 4100 5/400 system is approximately 4 times faster than the Sun SPARC system that was used in the movie Toy Story, as measured in RenderMarks. The AlphaServer 4100 is 2.6 times faster than the Silicon Graphics POWER CHALLENGE system and 2.4 times faster than the HP/Convex Exemplar SPP-1200 system on the Mental Ray image rendering application from Mental Images. These image rendering applications take advantage of larger caches, and the performance improves as the cache size increases, particularly with four CPUs.

Performance Counter Profiles

The figures in this section, Figures 17 through 22, show the performance statistics collected using the built-in Alpha 21164 performance counters on the AlphaServer 4100 5/400 system. These hardware monitors collect various events, including the number and type of instructions issued, multiple issues, single issues, branch mispredictions, stall components, and cache misses. These statistics are useful for analyzing the system behavior under various workloads. The results of this analysis can be used by computer architects to drive hardware design trade-offs in future system designs.

The SPEC95 cycles per instruction (CPI) data presented in Figure 17 shows lower CPI values for the integer benchmarks (CPI values of 0.9 to 1.5) than for the floating-point benchmarks (CPI values of 0.9 to 2.2). The CPI in commercial workloads (e.g., TPC-C) is higher than in the SPEC benchmarks, primarily because commercial workloads have a higher stall time, as shown in Figure 18. Note that the performance counter statistics were collected with four CPUs running TPC-C (with a Sybase database), while SPEC95 statistics were collected on a single CPU.

The Alpha 21164 has two integer and two floating-point pipelines and is capable of issuing up to four instructions simultaneously. The integer pipeline 0 executes arithmetic, logical, load/store, and shift operations. The integer pipeline 1 executes arithmetic, logical, load, and branch/jump operations.

l>igit.llI'cchnic.~l Jot~r~lal \'uI. 8 No. 4 1996 13 SPARSE LINPACK

KEY: ALPHASERVER 4100 51300E ALPHASERVER 41 00 51300 ALPHASERVER 4100 51400 ALPHASERVER 8400

1 2 3 4 NUMBER OF CPUs

Figure 15 Spnrsc LINI'ACK I'cl.Fo1.11i3ncc

[Figure 16: Pixar RenderMarks (0 to 2,500) for the IBM RS/6000 390, SGI CHALLENGE R4400 (200 MHz), Sun SPARCstation 20 (100 MHz), and AlphaServer 4100 5/400, 5/300, and 5/300E, with one and four CPUs.]

Figure 16
Image Rendering Performance

The floating-point pipeline 0 executes add, subtract, compare, and floating-point branch instructions. The floating-point pipeline 1 executes multiply instructions. As shown in Figure 18, the majority of the issuing time is spent on single and dual issuing. Triple and quad issuing is noticeable in several floating-point benchmarks.

[Figure 17: Cycles per instruction (0 to 4.0) for TPC-C, the SPECint95 benchmarks (vortex, m88ksim, ijpeg, go, gcc, compress, and others), and the SPECfp95 benchmarks (wave5, turb3d, tomcatv, swim, su2cor, mgrid, hydro2d, fpppp, apsi, and applu).]

Figure 17
SPEC95 Cycles-per-Instruction Comparison

[Figure 18: Time distribution (0 to 100 percent) for TPC-C, the SPECint95 benchmarks, and the SPECfp95 benchmarks, broken into single issue, dual issue, triple issue, quad issue, dry stall, and frozen stall.]

Figure 18
Issuing and Stall Time

The stall time (dry plus frozen stalls in Figure 18) is higher in the floating-point benchmarks than in the integer benchmarks and higher in the TPC-C benchmark than in the SPEC95 benchmarks. Dry stalls include instruction stream (I-stream) stalls caused by branch mispredictions, program counter (PC) mispredictions, replay traps, I-stream cache misses, and exception drain. Frozen stalls include data stream (D-stream) stalls caused by D-stream cache misses as well as register conflicts and unit busy. Dry stalls are higher in SPECint95 and TPC-C (mainly because of I-stream cache misses and replay traps), whereas frozen stalls are higher in SPECfp95 and TPC-C (mainly because of D-stream cache misses).

The Alpha 21164 microprocessor reduces the performance penalty due to cache misses by implementing a large, 96-KB on-chip S-cache. This cache is three-way set associative and contains both instructions and data. The four-entry prefetch buffer allows prefetching of the next four consecutive cache blocks on an instruction cache (I-cache) miss. This reduces the penalty for I-stream stalls. The six-entry miss address file (MAF) merges loads in the same 32-byte block and allows servicing multiple load misses with one data fill. A six-entry write buffer is used to reduce the store bus traffic and to aggregate stores into 32-byte blocks.

Figure 19 shows the instruction mix in SPEC95. The Alpha instructions are grouped into the following categories: load (both floating-point and integer), store (both floating-point and integer), integer (all integer instructions, excluding ones with only R31 or a literal as operands), branch (all branch instructions including unconditional), and floating-point (except floating-point load and store instructions). Figure 19 shows the percentage of instructions in each category relative to the total number of instructions executed. Note that load/store instructions account for 30 to 40 percent of all instructions issued. Integer instructions are present in both integer and floating-point benchmarks, but no floating-point instructions exist in the SPECint95 and commercial TPC-C workloads. The integer and commercial workloads execute more branches, while the branch instructions make up only a few percent of all instructions issued in the floating-point workloads.

The cache misses shown in Figure 20 are higher in the floating-point benchmarks than in the integer benchmarks. The I-cache misses are low in the floating-point benchmarks (except for fpppp) and higher in the SPECint95 benchmarks and the TPC-C benchmark. The D-cache misses are high in the majority of the benchmarks, which indicates that a larger D-cache would reduce D-stream misses. The TPC-C benchmark would benefit from a larger S-cache and faster B-cache, since the number of S-cache misses is high.
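The MAF behavior described above (merging, allocation, and the full-MAF condition) can be modeled in a few lines. The C sketch below is a toy model of that rule; the data structure and function shape are assumptions for illustration, not the 21164 implementation.

/*
 * Toy model of the miss address file (MAF) merging rule: a load miss
 * joins an existing entry if it falls in the same 32-byte block;
 * otherwise it takes a free entry, and a full MAF forces a replay.
 * The six-entry size and 32-byte block follow the text.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAF_ENTRIES 6
#define BLOCK_SHIFT 5          /* 32-byte blocks */

struct maf {
    uint64_t block[MAF_ENTRIES];
    bool     valid[MAF_ENTRIES];
};

/* Returns false when the MAF is full and the load must replay. */
static bool maf_insert(struct maf *m, uint64_t addr)
{
    uint64_t blk = addr >> BLOCK_SHIFT;
    int free_slot = -1;

    for (int i = 0; i < MAF_ENTRIES; i++) {
        if (m->valid[i] && m->block[i] == blk)
            return true;        /* merged: one fill satisfies both */
        if (!m->valid[i])
            free_slot = i;
    }
    if (free_slot < 0)
        return false;           /* full MAF: replay trap */
    m->valid[free_slot] = true;
    m->block[free_slot] = blk;
    return true;
}

int main(void)
{
    struct maf m = {0};
    /* Two loads to the same 32-byte block occupy one entry. */
    maf_insert(&m, 0x1000);
    maf_insert(&m, 0x1018);
    printf("entry 0 holds block 0x%llx\n",
           (unsigned long long)m.block[0]);
    return 0;
}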

[Figure 19: Instruction statistics for TPC-C, the SPECint95 benchmarks, and the SPECfp95 benchmarks, showing the percentage of stores, loads, integer operations, floating-point operations, and branches (0 to 100 percent of instructions).]

Figure 19
SPEC95 Instruction Profiles

[Figure 20: Cache misses per 1,000 instructions (0 to 200) for TPC-C, the SPECint95 benchmarks, and the SPECfp95 benchmarks, broken into I-cache, D-cache, S-cache, and B-cache misses.]

Figure 20
Cache Misses

The B-cache misses are negligible in the SPECint95 benchmarks and higher in the majority of the SPECfp95 and TPC-C benchmarks. This data indicates that complex commercial workloads, such as TPC-C, are more profoundly affected by the cache design than simpler workloads, such as SPEC95.

The replay traps are generally caused by (1) full write-buffer (WB) traps (a full write buffer when a store instruction is executed) and full miss address file (MAF) traps (a full MAF when a load instruction is executed); and (2) load traps (speculative execution of an instruction that depends on a load instruction, and the load misses in the D-cache) and load-after-store traps (a load following a store that hits in the D-cache, and both access the same location). The replay traps and branch/PC mispredictions shown in Figure 21 are not the major reason for the high stall time in the commercial workloads (TPC-C), since traps and mispredictions are higher in some of the SPECint95 benchmarks than in TPC-C. Instead, a high number of cache misses (see Figure 20) correlates well with the high stall time and CPI (see Figure 17) in TPC-C.

Figure 22 shows the estimated stall components in SPEC95 and TPC-C. A time-allocation model is used to analyze the performance effect of different stall components. The total execution time is divided into two components: the compute component (where the CPU is issuing instructions) and the stall component (where the CPU is not issuing instructions). The stall component is further divided into the dry and frozen stalls:

time = compute + stall
compute = single + dual + triple + quad issuing
stall = dry + frozen
dry = branch mispredictions + PC mispredictions + replay traps + I-stream cache misses + exception drain stalls
frozen = D-stream cache misses + register conflicts and unit busy

The branch and PC mispredictions affect the performance of SPECint95 workloads (6 percent of the time is spent in branch and PC mispredictions in SPECint95) and have little effect on the performance of SPECfp95 workloads (less than 1 percent of the time) and the TPC-C benchmark (1.4 percent of the time). The SPECint95 workloads are affected primarily by the load traps, whereas the SPECfp95 benchmarks are affected by both load and WB/MAF traps. Note that the time spent on a load replay trap is overlapped with the load-miss time.

The S-cache and B-cache stalls are high in the SPECfp95 and TPC-C benchmarks, where the stall time is dominated by the B-cache and memory latencies.
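To make the model concrete, the following C sketch applies the time-allocation arithmetic to hypothetical event counts. All counter values and names are made-up placeholders; only the decomposition itself comes from the equations above.

/*
 * Sketch of the time-allocation model: total time is split into
 * compute (issuing) and stall (dry + frozen) components.
 */
#include <stdio.h>

int main(void)
{
    /* Illustrative per-1,000-cycle event totals (invented). */
    double single = 310, dual = 180, triple = 40, quad = 10;
    double branch_mispred = 20, pc_mispred = 5, replay = 35,
           istream_miss = 90, exception_drain = 10;
    double dstream_miss = 250, reg_conflict = 50;

    double compute = single + dual + triple + quad;
    double dry     = branch_mispred + pc_mispred + replay +
                     istream_miss + exception_drain;
    double frozen  = dstream_miss + reg_conflict;
    double total   = compute + dry + frozen;

    printf("compute %.1f%%  dry %.1f%%  frozen %.1f%%\n",
           100 * compute / total, 100 * dry / total,
           100 * frozen / total);
    return 0;
}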

Digital Tcchnical Journal Vol. 8 No. 4 1996 .-. - , REPLAY TRAPS AND BRANCH MlSPREDlCTlONS I TPC-C I SPECINT95 VORTEX PERL M88KSIM LI IJPEG GO - GCC COMPRESS - SPECFP95 WAVE5 -rn TURB3D rn TOMCATV rn SWIM rn SUPCOR -rn MGRlD KEY. HYDROPD - LDU REPLAY TRAPS FPPPP -= WBIMAF REPLAY TRAPS APSl 0 W BRANCH MlSPREDlCTlONS APPLU .- PC MlSPREDlCTlONS 0 10 20 30 40 50 ,, 70 80 REPLAY TRAPS AND BRANCHIPC MlSPREDlCTlONS PER 1.000 INSTRUCTIONS

Figure 21 Rcplay Traps and BI.~IICII/PCMispredictio~~s

[Figure 22: Estimated stall time distribution (0 to 100 percent of total time) for TPC-C, the SPECint95 benchmarks, and the SPECfp95 benchmarks, broken into branch and PC mispredictions, LDU replay traps, WB/MAF replay traps, I-cache miss to S-cache, D-cache miss to S-cache, S-cache miss to B-cache, B-cache miss to memory, and register conflict and unit busy.]

Figure 22
Estimated Stall Time Distribution

Note the high stall time resulting from waiting for data from memory (close to 40 percent) in several of these workloads.

Using several performance metrics and a variety of workloads, we have demonstrated that the DIGITAL AlphaServer 4100 family of midrange servers provides significant performance improvements over the previous-generation AlphaServer platform and provides performance leadership compared to the leading industry vendors' platforms. The major AlphaServer 4100 performance strengths are the low memory and I/O latency and high memory bandwidth, the large-memory support (VLM), and the fast Alpha 21164 microprocessor. The work described in this paper has led to design changes that are expected to be implemented in future versions of the AlphaServer 4100 platform. The anticipated performance benefits will come from a faster CPU, faster and larger caches, faster memory, and improved memory bandwidth.

Acknowledgments

The authors would like to acknowledge the contributions of John Shakshober, Dave Stanley, Greg Tarsa, Dave Wilson, Paula Smith, John Henning, Michael Delaney, and Huy Phan for providing many of the benchmark results presented in this paper.

References

2. M. Steinman, G. Harris, A. Kocev, V. Lamere, and R. Pannell, "The AlphaServer 4100 Cached Processor Module Architecture and Design," Digital Technical Journal, vol. 8, no. 4 (1996).

9. J. Gray, ed., The Benchmark Handbook for Database and Transaction Processing Systems (San Mateo, Calif.: Morgan Kaufmann, 1991).

10. Information about the lmbench suite of benchmarks is available at http://reality.sgi.com/employees/lm_engr/lmbench/whatis_lmbench.html.

11. The STREAM benchmark program is described on-line by the University of Virginia, Department of Computer Science (Charlottesville, Va.) at http://www.cs.virginia.edu/stream.

12. The Standard Performance Evaluation Corporation (SPEC) makes available submitted results, benchmark descriptions, background information, and tools at http://www.specbench.org.

13. Information about the Transaction Processing Performance Council (TPC) is available at http://www.tpc.org.

14. Information about system performance benchmarking products from AIM Technology, Inc. (Menlo Park, Calif.) is available at http://www.aim.com.

15. Information about Pixar Animation Studio's ...

Biographies

Zarka Cvetanovic
A consulting engineer in DIGITAL's Server Product Development Group, Zarka Cvetanovic was responsible for the performance characterization and analysis of the AlphaServer 4100, AlphaServer 8400/8200, AlphaServer 2100, DEC 7000, VAX 7000, and VAX 6000 systems, and for the performance modeling and definition of future AlphaServer platforms. Since joining DIGITAL in 1986, she has been involved in the development of fast database applications and efficient parallel applications for multiprocessor systems. Zarka received a Ph.D. in electrical and computer engineering from the University of Massachusetts, Amherst. She has published over a dozen technical papers at computer architecture conferences and in leading industry journals.

Darrel D. Donaldson
Darrel Donaldson is a senior consulting engineer and the technical leader and engineering manager for the AlphaServer 4100 project. He joined DIGITAL in 1983 and served as the lead technologist for the VAX 6000, VAX 7000, AlphaServer 7000, and AlphaServer 4100 projects. Darrel has a bachelor's degree in mathematics/physics from Miami University and a master's degree in electrical engineering from Cincinnati University, Cincinnati, Ohio. He holds 12 patents and has 10 patents pending, all related to protocols, signal integrity, and chip transceiver design for multiprocessor systems and nonvolatile memory chip design. Darrel maintains membership in the IEEE Electron Devices Society and the Solid-State Circuits Society.

Maurice B. Steinman
George J. Harris
Andrej Kocev
Virginia C. Lamere
Roger D. Pannell

The AlphaServer 4100 Cached Processor Module Architecture and Design

The DIGITAL AlphaServer 4100 processor module uses the Alpha 21164 microprocessor series combined with a large, module-level backup cache (B-cache). The cache uses synchronous cache memory chips and includes a duplicate tag store that allows CPU modules to monitor the state of each other's caches.

The DIGITAL AlphaServer 4100 series of servers represents the third generation of Alpha microprocessor-based, mid-range computer systems. Among the technical goals achieved in the system design were the use of four CPU modules, 8 gigabytes (GB) of memory, and partial block writes to improve I/O performance.

[Figure 1: CPU module block diagram showing the microprocessor, tag RAM (with index, write enable, and output enable), data RAMs, programmable logic, the VCTY ASIC, the bus arbiter, the 144-bit data bus, the system address and command path, the snoop address path, and the data and address transceivers connecting to the system bus.]

Figure 1
CPU Module

A miss in the B-cache results in the address being sent to the system bus. The memory receives this address and, after a delay (the memory latency), it sends the data on the system bus. Data transceivers on the CPU module receive the data and start a cache fill transaction that results in 64 bytes (a cache block) being written into the cache as four consecutive 128-bit words with their associated check bits.

In an SMP system, two or more CPUs may have the same data in their cache memories. Such data is known as shared, and the shared bit is set in the TAG for that address. The cache protocol used in the AlphaServer 4100 series of servers allows each CPU to modify entries in its own cache. Such modified data is known as dirty, and the dirty bit is set in the TAG. If the data about to be modified is shared, however, the microprocessor resets the shared bit, and other CPUs invalidate that data in their own caches. The need is thus apparent for a way to let all CPUs keep track of data in other caches. This is accomplished by the process known as snooping, aided by several dedicated bus signals.

To facilitate snooping, a separate copy of the TAG is maintained in a dedicated cache chip, called the duplicate tag or DTAG. The DTAG is controlled by an application-specific integrated circuit (ASIC) called VCTY. The VCTY and DTAG are located next to each other and in close proximity to the address transceivers. Their timing is tied to the system bus so that the address associated with a bus transaction can easily be applied to the DTAG, which is a synchronous memory device, and the state of the cache at that address can be read out. If that cache location is valid and the address that is stored in the DTAG matches that of the system bus command (a hit in the DTAG), the signal MC-SHARED may be asserted on the system bus by the VCTY. If that location has been modified by the microprocessor, then MC-DIRTY is asserted. Thus each CPU is aware of the state of all the caches on the system. Other actions also take place on the module as part of this process, which is explained in greater detail in the section dealing specifically with the VCTY.

Because of the write-back cache organization, a special type of miss transaction occurs when new data needs to be stored in a cache location that is occupied by dirty data. The old data needs to be put back into the main memory; otherwise, the changes that the microprocessor made will be lost. The process of returning that data to memory is called a victim write-back transaction, and the cache location is said to be victimized. This process involves moving data out of the cache, through the system bus, and into the main memory, followed by new data moving from the main memory into the cache as in an ordinary fill transaction. Completing this fill quickly reduces the time that the microprocessor is waiting for the data. To speed up this process, a hardware data buffer on the module is used for storing the old data while the new data is being loaded into the cache. This buffer is physically a part of the data transceiver, since each bit of the transceiver is a shift register four bits long. One hundred twenty-eight shift registers can hold the entire cache block (512 bits) of victim data while the new data is being read in through the bus receiver portion of the data transceiver chip.

Digital Technical lournal before the fill portion of the transaction can take place. occurred liad it not been delayed by one microproces- When tlie fill is complctcd, tlie \victim data is shifted sor cyclc, and the address at thc lWiM is fi~rtlierdelayed OLI~of the ~ictilnbuffer 2nd into tlie main nicniory. by indcs buffer and nenvork ticlays. Index setup at the This is 1<1io\~1i3s 311 c~clia~igc,since the victim \\trite- 1 as the indcs into preset at p~\\~er-i~p.Tlic follo\\~ingdata bloclts arc the DTAG. If thc DTAG locatio~iliad pre\~iouslybccu latclicd after a shol.tcr delay, RC:-1IEAD-SPEED- marlted \lalid, ancl there is a ~iintcli bet\vccn the WAVE. Thc HC-l and the data processor cycles and the WAVI: value is set to 4, so that from the IYTAG lookup, then the block is present in B(:-IWAl>-SI'EED-WAVL is 6. Tlius, after tl~cfirst the microprocessor's cachc. Tliis sccllario is called a dclay of 10 microprocessor cycles, successive data cache hit. blocks are deli\~redevery 6 microprocessor cycles. In parallel \vith tliis, the V<:TY decodes the rcceiveo Figure 2 illustrates thcsc concepts. system bus command to dctcrminc thc appropriate In Fig~~re2, tlie RAM clock at the ~nicroprocessoris ~~pdateto the DTAG and dctcrminc tlic correct system dclnycd by one microproccssor cycle. The 1Wcloclt bi~srcsporise and CI'U command nccdcd to mdintain at the RAM dc\rice is f~rtlicrdelayed by clock buffer s!!stcm-wide cachc coherency. A fe\\~cases are illus- and nct\\.ork delays on the modulc. The address at tlic trated here, without any attempt at 3 co~nprehensi\~e microprocessor is drivcn whcrc the clock would have discussion of all possible transactions. i 6 16 6+ MICROPROCESSOR j = 10 +-;-! CYCLES

HOLD AT MICROPROCESSOR i FORDATA;;;;;;;; ... i i i ... DATA AT DATA 0 DATA 1 DATA 2 DATA 3 MICROPROCESSOR : : : : : : : : : : : : : : : : : ......

Figure 2 Cache Read Timing
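As a worked example of the timing arithmetic above (first data after BC_READ_SPEED cycles, each later block after BC_READ_SPEED minus WAVE cycles), here is a small C sketch that reproduces the 10- and 6-cycle pattern quoted in the text; it is illustrative only.

```c
/* Cache-read data timing: first block at BC_READ_SPEED cycles, later
 * blocks every BC_READ_SPEED - WAVE cycles (values from the text). */
#include <stdio.h>

int main(void)
{
    const int bc_read_speed = 10;                   /* first-data latency */
    const int wave          = 4;                    /* pipelining overlap */
    const int repeat        = bc_read_speed - wave; /* = 6 cycles         */

    int t = 0;
    for (int block = 0; block < 4; block++) {       /* 4 blocks per fill  */
        t += (block == 0) ? bc_read_speed : repeat;
        printf("data block %d latched at cycle %d\n", block, t);
    }
    return 0;   /* prints cycles 10, 16, 22, 28 */
}
```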

Assume that the DTAG shared bit is found to be set at this address, the dirty bit is not set, and the bus command indicates a write transaction. The DTAG valid bit is then reset by the VCTY, and the microprocessor is interrupted to do the same in the TAG.

If the dirty bit is found to be set, and the command is a read, the MC_DIRTY_EN signal is asserted on the system bus to tell the other CPU that the location it is trying to access is in cache and has been modified by this CPU. At the same time, a signal is sent to the microprocessor requesting it to supply the modified data to the bus so the other CPU can get an up-to-date version of the data.

If the address being examined by the VCTY was not shared in the DTAG and the transaction was a write, the valid bit is reset in the DTAG, and no bus signals are generated. The microprocessor is requested to reset the valid bit in the TAG. However, if the transaction was not a write, then shared is set in the DTAG, MC_SHARED is asserted on the bus, and a signal is sent to the microprocessor to set shared in the TAG.

From these examples, it becomes obvious that only transactions that change the state of the valid, shared, or dirty TAG bits require any action on the part of the microprocessor. Since these transactions are relatively infrequent, the DTAG saves a great deal of microprocessor time and improves overall system performance.

If the VCTY detects that the command originated from the microprocessor co-resident on the module, then the block is not checked for a hit, but the command is decoded so that the DTAG block is updated (if already valid) or allocated (i.e., marked valid, if not already valid). In the latter case, a fill transaction follows and the VCTY writes the valid bit into the TAG at that time. The fill transaction is the only one for which the VCTY writes directly into the TAG.

All cycles of a system bus transaction are numbered, with cycle 1 being the cycle in which the system bus address and command are valid on the bus. The controllers internal to VCTY rely on the cycle numbering scheme to remain synchronized with the system bus. By remaining synchronized with the system bus, all accesses to the DTAG and accesses from the VCTY to the microprocessor occur in fixed cycles relative to the system bus transaction in progress.

The index used for lookups to the DTAG is presented to the DTAG in cycle 1 of the system bus transaction.
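The snoop cases illustrated above reduce to a small decision procedure. The following C sketch is condensed, illustrative pseudologic under that reading; the structure and signal names mirror the text (MC_SHARED, MC_DIRTY_EN) but are assumptions, not the VCTY gate-level design.

```c
/* Condensed sketch of the snoop decisions described in the text. */
#include <stdbool.h>

typedef struct { bool valid, shared, dirty; } dtag_entry_t;

typedef struct {
    bool assert_mc_shared;    /* drive MC_SHARED on the system bus       */
    bool assert_mc_dirty_en;  /* drive MC_DIRTY_EN: we own modified data */
    bool cpu_invalidate;      /* ask microprocessor to clear TAG valid   */
    bool cpu_set_shared;      /* ask microprocessor to set TAG shared    */
    bool cpu_supply_data;     /* ask microprocessor to source dirty data */
} snoop_response_t;

snoop_response_t snoop(dtag_entry_t *e, bool is_write, bool addr_hit)
{
    snoop_response_t r = {0};
    if (!e->valid || !addr_hit)
        return r;                        /* not in this cache: no action  */
    if (e->dirty && !is_write) {         /* read of data we modified      */
        r.assert_mc_dirty_en = true;     /* tell the other CPU it's dirty */
        r.cpu_supply_data    = true;     /* microprocessor sources data   */
    } else if (is_write) {               /* another CPU will modify it    */
        e->valid = false;                /* our copy becomes stale        */
        r.cpu_invalidate = true;
    } else {                             /* clean read: data now shared   */
        e->shared = true;
        r.assert_mc_shared = true;
        r.cpu_set_shared   = true;
    }
    return r;
}
```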

In the event of a hit requiring an update of the DTAG and primary TAG, the microprocessor interface signal, EV_ABUS_REQ, is asserted in cycles 5 and 6 of that system bus transaction, with the appropriate system-to-CPU command being driven in cycle 6. The actual update to the DTAG occurs in cycle 7, as does the allocation of blocks in the DTAG.

Figure 3 shows the timing relationship of a system bus command to the update of the DTAG, including the sending of a system-to-CPU command to the microprocessor. The numbers along the top of the diagram indicate the cycle numbering. In cycle 1, when the signal MC_CA_L goes low, the system bus address is valid and is presented to the DTAG as the DTAG_INDEX bits. By the end of cycle 2, the DTAG data is valid and is clocked into the VCTY, where it is compared with the system bus address.

DTAG Initialization
Another important feature built into the VCTY design is cursory self-test and initialization of the DTAG. After system reset, the VCTY writes all locations of the DTAG with a unique data pattern, and then reads the entire DTAG, comparing the data read versus what was written and checking the parity. A second write-read-compare pass is made using the inverted data pattern. This inversion ensures that all DTAG data bits are written and checked as both a 1 and a 0. In addition, the second pass of the initialization leaves each block of the DTAG marked as invalid (not present in the B-cache) and with good parity. The entire initialization sequence takes approximately 1 millisecond.
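A minimal sketch of the two-pass self-test described above follows, in C: write a pattern, read it back with a parity check, then repeat with the inverted pattern so every bit is exercised as both 0 and 1. The array size, pattern value, and parity helper are illustrative assumptions.

```c
/* Two-pass write-read-compare DTAG initialization sketch. */
#include <stdbool.h>
#include <stdint.h>

#define DTAG_ENTRIES 65536u          /* assumed 16-bit index space      */
static uint32_t dtag[DTAG_ENTRIES];  /* stand-in for the DTAG RAM       */

/* Stand-in parity check; the real DTAG stores a parity bit per entry. */
static bool even_parity(uint32_t v)
{
    v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return (v & 1u) == 0;
}

static bool pass(uint32_t pattern)
{
    for (uint32_t i = 0; i < DTAG_ENTRIES; i++)
        dtag[i] = pattern;                    /* write the data pattern  */
    for (uint32_t i = 0; i < DTAG_ENTRIES; i++)
        if (dtag[i] != pattern || !even_parity(dtag[i]))
            return false;                     /* compare and check parity */
    return true;
}

bool dtag_selftest(void)
{
    const uint32_t pattern = 0x5555AAAA;      /* illustrative pattern    */
    /* The second (inverted) pass would also leave every block marked
     * invalid with good parity in the real design. */
    return pass(pattern) && pass(~pattern);
}
```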

[Figure 3 waveforms: system bus cycle numbers; DTAG_INDEX<15:0> driven from MC_ADDR<21:6>; DTAG_DATA (valid, shared, not dirty) compared with MC_ADDR<38:22>; DTAG_OE_L; DTAG_WE1_L; DTAG_WE0_L; EV_ABUS_REQ; EV_ADDR<39:4>, EV_CMD<3:0> (set shared) driven by the microprocessor.]

Figure 3 DTAG Operation

Logic synthesis converted the Verilog source files into schematics to represent the gate-level design. The synthesis process is shown in flowchart form in Figure 4. Logic verification is an integral part of this process, and the flowchart depicts both the synthesis and verification, and their interaction. Only the synthesis is explained at this time. The verification process depicted on the right side of the flowchart is covered in a later section of this paper.

As shown on the left side of the flowchart, the logic synthesis process consists of multiple phases, in which the Design Compiler is invoked repeatedly on each subblock of the design, feeding back the results from the previous phase. The Synopsys Design Compiler was supplied with timing, loading, and area constraints to synthesize the VCTY into a physical design that met technology and cycle-time requirements. Since the ASIC is a small design compared to technology capabilities, the Design Compiler was run without an area constraint to facilitate timing optimization.

The process requires the designer to supply timing constraints only to the periphery of the ASIC (i.e., the I/O pins). The initial phase of the synthesis process calculates the timing constraints for internal networks that connect between subblocks by invoking the Design Compiler with a gross target cycle time of 100 nanoseconds (the actual cycle time of the ASIC is 15 nanoseconds). At the completion of this phase, the process analyzes all paths that traverse multiple hierarchical subblocks within the design to determine the percentage of time spent in each block. The process then scales this data using the actual cycle time of 15 nanoseconds and assigns the timing constraints for internal networks at subblock boundaries. Multiple iterations may be required to ensure that each subblock is mapped to logic gates with the best timing optimization.

Once the Design Compiler completes the subblock optimization phase, an industry-standard electronic design interchange format (EDIF) file is output. The EDIF file is postprocessed by the SPIDER tool to generate files that are read into a timing analyzer, Topaz. A variety of industry-standard file formats can be input into SPIDER to process the data. Output files can then be generated for the other analysis tools in the flow.
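The constraint-scaling step above can be illustrated with a short C sketch: measure what fraction of a cross-subblock path each subblock consumes in the 100-ns pass, then scale that fraction by the real 15-ns cycle time. The per-block delays below are assumed example values, not measurements from the VCTY.

```c
/* Scale gross-pass path fractions to real-cycle timing budgets. */
#include <stdio.h>

int main(void)
{
    const double actual_ns = 15.0;   /* real ASIC cycle time           */
    /* Assumed delays (ns) for one path crossing three subblocks in
     * the 100-ns gross synthesis pass. */
    double block_delay[3] = { 40.0, 35.0, 25.0 };
    double path_total = block_delay[0] + block_delay[1] + block_delay[2];

    for (int i = 0; i < 3; i++) {
        double fraction = block_delay[i] / path_total;
        printf("subblock %d budget: %.2f ns\n", i, fraction * actual_ns);
    }
    return 0;
}
```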

[Figure 4 flowchart: Verilog source files feed two paths. Synthesis (left): 100-ns cycle-time gross synthesis, 15-ns cycle-time subblock optimization, Design Compiler outputs EDIF file, SPIDER processes EDIF file, timing analyzer, fix minimum-delay hold-time and timing violations. Verification (right): V2BDS, FC parse, DECSIM compile and link, DECSIM simulation (random exerciser, focused tests, system simulation), FC analyze, FC report, write new tests, fix timing violations and/or logic bugs.]

Figure 4 ASIC Design Synthesis and Verification Flow


Bus Monitor
The bus monitor is a collection of DECSIM simulation watches that monitor the system bus and the CPU internal bus. The watches also report when various bus signals are being asserted and deasserted and have the ability to halt simulation if they encounter cache incoherency or a violation. Cache incoherency is a data inconsistency, for example, a piece of nondirty data residing in the B-cache and differing from data residing in main memory. A data inconsistency can occur among the CPU modules: for example, two CPU modules may have different data in their caches at the same memory address. Data inconsistencies are detected by the CPU. Each one maintains an exclusive (nonshared) copy of its data that it uses to compare with the data it reads from the test addresses. If the two copies differ, the CPU signals to the bus monitor to stop the simulation and report an error.

The bus monitor also detects other violations:

1. No activity on the system bus for 1,000 consecutive cycles
2. Stalled system bus for 100 cycles
3. Illegal commands on the system bus and CPU internal bus
4. Catastrophic system error (machine check)

The combination of random CPU and I/O activity flooded the system bus with heavy traffic. With the help of the bus monitor, this technique exposed bugs quickly.

As mentioned, a few directed tests were also written. Directed tests were used to re-create a situation that occurred in random tests. If a bug was uncovered using a random test that ran three days, a directed test was written to re-create the same failing scenario. Then, after the bug was fixed, a quick run of the directed test confirmed that the problem was indeed corrected.

Functional Checker
During the initial design stages, the verification team developed the Functional Checker (FC) for the following purposes:

1. To functionally verify the HDL models of all ASICs in the AlphaServer 4100 system
2. To assess the test coverage

The FC tool consists of three applications: the parser, the analyzer, and the report generator. The right-hand side of Figure 4 illustrates how the FC was used to aid in the functional verification process.

Parser
Since DECSIM was the chosen logic simulator, the first step was to translate all HDL code to BDS, a DECSIM behavior language. This task was performed using a tool called V2BDS. The parser's task was to postprocess a BDS file: extract information and generate a modified version of it. The information extracted was a list of control signals and logic statements (such as logical expressions, if-then-else statements, case statements, and loop constructs). This information was later supplied to the analyzer. The modified BDS was functionally equivalent to the original code, but it contained some embedded calls to routines whose task was to monitor the activity of the control signals in the context of the logic statements.

Analyzer
Written in C, the analyzer is a collection of monitoring routines. Along with the modified BDS code, the analyzer is compiled and linked to form the simulation model. During simulation, the analyzer is invoked and the routines begin to monitor the activity of the control signals. It keeps a record of all control signals that form a logic statement. For example, assume the following statement was recognized by the parser as one to be monitored:

(A XOR B) AND C

The analyzer created a table of all possible combinations of logic values for A, B, and C; it then recorded which ones were achieved. At the start of simulation, there was zero coverage achieved:

ABC   Achieved
000   No
001   No
010   No
011   No
100   No
101   No
110   No
111   No

Achieved coverage = 0 percent

Further assume that during one of the simulation tests generated by the Random Exerciser, A assumed both 0 and 1 logic states, while B and C remained constantly at 0. At the end of simulation, the state of the table would be the following:

ABC   Achieved
000   Yes
001   No
010   No
011   No
100   Yes
101   No
110   No
111   No

Achieved coverage = 25 percent
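A toy version of the analyzer's coverage table for the statement (A XOR B) AND C is sketched below in C; it records every (A, B, C) combination seen during simulation and reports the fraction achieved, reproducing the 25 percent result of the example. The function names are illustrative, not the FC implementation.

```c
/* Coverage-table sketch for the (A XOR B) AND C example. */
#include <stdbool.h>
#include <stdio.h>

static bool seen[8];                 /* one slot per ABC combination */

/* Called by the instrumented model whenever the statement executes. */
void observe(bool a, bool b, bool c)
{
    seen[(a << 2) | (b << 1) | c] = true;
}

int main(void)
{
    /* Mimic the example: A toggles, B and C stay at 0. */
    observe(false, false, false);
    observe(true,  false, false);

    int achieved = 0;
    for (int i = 0; i < 8; i++) {
        printf("%d%d%d  %s\n", (i >> 2) & 1, (i >> 1) & 1, i & 1,
               seen[i] ? "Yes" : "No");
        achieved += seen[i];
    }
    printf("Achieved coverage = %d percent\n", achieved * 100 / 8);
    return 0;
}
```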

Report Generator
The report generator application gathered all tables created by the analyzer and generated a report file indicating which combinations were not achieved. The report file was then reviewed by the verification team and by the logic design team.

The report pointed out deficiencies in the verification tests. The verification team created more tests that would increase the "yes" count in the "Achieved" column. For the example shown above, new tests might be created that would make signals B and C assume both 0 and 1 logic states.

The report also pointed out faults in the design, such as redundant logic. In the example shown, the logic that produces signal B might be the same as the logic that produces signal C, a case of redundant logic.

The FC tool proved to be an invaluable aid to the verification process. It was a transparent addition to the simulation environment. With FC, the incurred degradation in compilation and simulation time was negligible. It performed two types of coverage analysis: exhaustive combinatorial analysis (as was described above) and bit-toggle analysis, which was used for vectored signals such as data and address buses. Perhaps the most valuable feature of the tool was the fact that it replaced the time-consuming and compute-intensive process of fault grading the physical design to verify test coverage. FC established a new measure of test coverage, the percentage of achieved coverage. In the above example, the calculated coverage would be two out of eight possible achievable combinations, or 25 percent. For the verification of the cached CPU module, the FC tool achieved a final test coverage of 95.3 percent.

Module Design Process
As the first step in the module design process, we used the Powerview schematic editor, part of the Viewlogic CAD tool suite, for schematic capture. An internally developed tool, V2LD, converted the schematic to a form that could be simulated by DECSIM. This process was repeated until DECSIM ran without errors.

During this time, the printed circuit (PC) layout of the module was proceeding independently, using the ALLEGRO CAD tools. The layout process was partly manual and partly automated with the CCT router, which was effective in following the layout engineer's design rules contained in the DO files.

Each version of the completed layout was translated to a format suitable for signal integrity modeling, using the internally developed tools ADSconvert and MODULEX. The MODULEX tool was used to extract a module's electrical parameters from its physical description. Signal integrity modeling was performed with the HSPICE analog simulator. We selected HSPICE because of its universal acceptance by the industry. Virtually all component vendors will, on request, supply HSPICE models of their products. Problems detected by HSPICE were corrected either by layout modifications or by schematic changes. The module design process flow is depicted in Figure 5.

Software Tools and Models
Three internally developed tools were of great value. One was MSPG, which was used to display the HSPICE plots; another was MODULEX, which automatically generated HSPICE subcircuits from PC layout files and performed cross-talk calculations. Cross-talk amplitude violations were reported by MODULEX, and the offending PC traces were moved to reduce coupling. Finally, SALT, a visual PC display tool, was used to verify that signal routing and branching conformed to the design requirements.

One of the important successes was in data line modeling, where the signal lengths from the RAMs to the microprocessor and the transceivers were very critical. By using the HSPICE .ALTER statement and the MODULEX subcircuit generator command, we could configure a single HSPICE deck to simulate as many as 36 data lines. As a result, the entire data line group could be simulated in only four HSPICE runs. In an excellent example of synergy between tools, the script capability of the MSPG plotting tool was used to extract, annotate, and create PostScript files of waveform plots directly from the simulation results, without having to manually display each waveform on the screen. A mass printing command was then used to print all stored PostScript files.

Another useful HSPICE statement was .MEASURE, which measured signal delays at the specified threshold levels and sent the results to a file. From this, a separate program extracted clean delay values and calculated the maximum and minimum delays, tabulating the results in a separate file. Reflections crossing the threshold levels caused incorrect results to be reported by the .MEASURE statement, which were easily seen in the tabulation. We then simply looked at the waveform printout to see where the reflections were occurring. The layout engineer was then asked to modify those signals by changing the PC trace lengths to either the microprocessor or the transceiver. The modified signals were then resimulated to verify the changes.

Timing Verification
Overall cache timing was verified with the Timing Designer timing analyzer from Chronology Corporation. Relevant timing diagrams were drawn using the waveform plotting facility, and delay values and controlling parameters such as the microprocessor cycle interval, read speed, wave, and other constants were entered into the associated spreadsheet.
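The kind of post-processing program described above can be sketched in a few lines of C: collect the delay values a .MEASURE run emits and tabulate the minimum and maximum, where a reflection recrossing a threshold shows up as an outlier. The input values and format below are assumptions, not the actual DIGITAL tooling.

```c
/* Tabulate min/max delays from (assumed) .MEASURE output values. */
#include <stdio.h>

int main(void)
{
    /* Stand-in for values parsed from an HSPICE .MEASURE output file. */
    double delay_ns[] = { 1.1, 1.4, 0.9, 1.7, 1.3 };
    int n = sizeof delay_ns / sizeof delay_ns[0];

    double min = delay_ns[0], max = delay_ns[0];
    for (int i = 1; i < n; i++) {
        if (delay_ns[i] < min) min = delay_ns[i];
        if (delay_ns[i] > max) max = delay_ns[i];
    }
    printf("min delay %.2f ns, max delay %.2f ns\n", min, max);
    return 0;
}
```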

[Figure 5 flow: Powerview schematic editor; V2LD (converts digital logic to DECSIM); DECSIM simulator; ALLEGRO layout tool with DO-file restrictions and constraints; CCT router (ALLEGRO .BRD); ADSconvert; MODULEX produces VLS.ADS for HSPICE analog simulator and Timing Designer timing analyzer compatibility; release for manufacturing.]

Figure 5 Design Process Flow

All delays were expressed in terms of HSPICE-simulated values and those constants, as appropriate. This method simplified changing parameters to try various "what if" strategies. The timing analyzer would instantly recalculate the delays and the resulting margins and report all constraint violations. This tool was also used to check timing elsewhere on the module, outside of the cache area, and it provided a reasonable level of confidence that the design did not contain any timing violations.

Signal Integrity
In high-speed designs, where signal propagation times are a significant portion of the clock-to-clock interval, reflections due to impedance mismatches can degrade the signal quality to such an extent that the system will fail. For this reason, signal integrity (SI) analysis is an important part of the design process. Electrical connections on a module can be made following a direct point-to-point path, but in high-speed designs, many signals must be routed in more complicated patterns. The most common pattern involves bringing a signal to a point where it branches out in several directions, and each branch is connected to one or more receivers. This method is referred to as treeing.

The SI design of this module was based on the principle that component placement and proper signal treeing are the two most important elements of a good SI design. However, ideal component placement is not always achievable due to overriding factors other than SI. This section describes how successful design was achieved in spite of less than ideal component placement.

Data Line Length Optimization
Most of the SI work was directed to optimizing the B-cache, which presented a difficult challenge because of long data paths.

The placement of major module data bus components (microprocessor and data transceivers) was dictated by the enclosure requirements and the need to fit four CPUs and eight memory modules into the system box. Rather than allowing the microprocessor heat-sink height to dictate module spacing, the system designers opted for fitting smaller memory modules next to the CPUs, filling the space that would have been left empty if module spacing were uniform. As a consequence, the microprocessor and data transceivers had to be placed on opposite ends of the module, which made the data bus exceed 11 inches in length. Figure 6 shows the placement of the major components.

Each cache data line is connected to four components: the microprocessor chip, two RAMs, and the bus transceiver. As shown in Table 1, any one of these components can act as the driver, depending on the transaction in progress.

The goal of data line design was to obtain clean signals at the receivers. Assuming that the microprocessor, RAMs, and the transceiver are all located in-line without branching, with the distance between the two RAMs near zero, and since the positions of the microprocessor and the transceivers are fixed, the only variable is the location of the two RAMs on the data line. As shown in the waveform plots of Figures 7 and 8, the quality of the received signals is strongly affected by this variable. In Figure 7, the reflections are so large that they exceed threshold levels. By contrast, the reflections in Figure 8 are very small, and their waveforms show signs of cancellation. From this it can be inferred that optimum PC trace lengths cause the reflections to cancel. A range of acceptable RAM positions was found through HSPICE simulation. The results of these simulations are summarized in Table 2.

[Figure 6 placement diagram: index buffers (three more on the other side), data RAMs (eight more on the other side), clock circuitry, programmable logic, data transceivers, programmable address and command transceivers, system bus connector.]

Figure 6 Placement of Major Components

Table 1 Data Line Components

Transaction Driver Receiver

Private cache read       RAM              Microprocessor
Private cache write      Microprocessor   RAM
Cache fill               Transceiver      RAM and microprocessor
Cache miss with victim   RAM              Transceiver
Write block              Microprocessor   RAM and transceiver

In the series of simulations given in Table 2, the threshold levels were set at 1.1 and 1.8 volts. This was justified by the use of perfect transmission lines. The lines were lossless, had no vias, and were at the lowest impedance level theoretically possible on the module (55 ohms). The entries labeled SR in Table 2 indicate unacceptably large delays caused by signal reflections recrossing the threshold levels. Discarding these entries leaves only those with a microprocessor-to-RAM distance of 3 or more inches and a RAM-to-transceiver distance of at least 6 inches, with the total microprocessor-to-transceiver distance not exceeding 11 inches. The layout was done within this range, and all data lines were then simulated using the network subcircuits generated by MODULEX with threshold levels set at 0.8 and 2.0 volts. These subcircuits included the effect of vias and PC traces run on several signal planes. That simulation showed that all but 12 of the 144 data- and check-bit lines had good signal integrity and did not recross any threshold levels. The failing lines were recrossing the 0.8-volt threshold at the transceiver. Increasing the length of the RAM-to-transceiver segment by 0.5 inches corrected this problem and kept signal delays within acceptable limits.

Approaches other than placing the components in-line were investigated but discarded. Extra signal lengths require additional signal layers and increase the cost of the module and its thickness.

Figure 7 Private Cache Read Showing Large Reflections Due to Unfavorable Trace Length Ratios

Figure 8 Private Cache Read Showing Reduced Reflections

RAM Clock Design
We selected Texas Instruments' CDC2351 clock drivers to handle the RAM clock distribution network. The CDC2351 device has a well-controlled input-to-output delay.

Table 2

PC Trace Length (Inches)             Write Delay (Nanoseconds)   Read Delay (Nanoseconds)
Microprocessor    RAM to
to RAM            Transceiver        Rise      Fall              Rise      Fall
0.7               2.3                0.9       SR                1.1       1.4
0.7               2.7                SR        SR                1.5       1.4
0.6               3.1                SR        SR                1.7       1.5
0.9               2.1                1.2       1.1               0.9       1.0
0.9               2.4                1.0       1.1               1.4       1.3
0.9               2.9                1.0       1.3               1.5       1.3
1.1               1.8                1.2       1.4               0.9       SR
1.3               2.2                1.4       1.4               0.9       1.0
1.2               2.6                1.3       1.4               1.2       1.2
1.5               1.7                1.5       1.7               SR        SR
1.4               2.1                1.8       1.7               SR        SR
1.6               2.4                1.7       1.4               0.9       1.2

Note: Signal reflections recrossing the threshold levels caused unacceptable delays; these entries were discarded.

[Figures 9 and 10: RAM clock driver and data line damping networks, with series damping resistors and transmission-line impedance values in ohms. Figure 11 scales: data line 1.00 volt/division, offset 2.000 volts, input DC 50 ohms; time base 10.0 nanoseconds/division.]

Figure 11 Handling of Difficult Waveforms

to access the RAM without using multiple cycles per read operation, and since the fill transfer involving memory comprises four of these operations, the penalty mounts considerably. Due to pipelining, the synchronous cache enables this type of read operation to occur at a rate of one per system cycle, which is 15 nanoseconds in the AlphaServer 4100 system, greatly increasing the bandwidth for data transfers to and from memory. Since the synchronous RAM is a pipeline stage, rather than a delay element, the window of valid data available to be captured at the bus interface is large. By driving the RAMs with a delayed copy of the system clock, delay components 1 and 2 are hidden, allowing faster cycling of the B-cache. When an asynchronous cache communicates with the system bus, all data read out from the cache must be synchronized with the bus clock, which can add as many as two clock cycles to the transaction. The synchronous B-cache avoids this performance penalty by cycling at the same rate as the system bus.2

In addition, the choice of synchronous RAMs provides a strategic benefit; other microprocessor vendors are moving toward synchronous caches. For example, numerous Intel Pentium microprocessor-based systems employ pipeline-burst, module-level caches using synchronous RAM devices. The popularity of these systems has a large bearing on the RAM industry.9 It is in DIGITAL's best interest to follow the synchronous RAM trend of the industry, even for Alpha-based systems, since the vendor base will be larger. These vendors will also be likely to put their efforts into improving the speeds and densities of the best-selling synchronous RAM products, which will facilitate improving the cache performance in future variants of the processor modules.

Effect of Duplicate Tag Store (DTAG)
As mentioned previously, the DTAG provides a mechanism to filter irrelevant bus transactions from the Alpha 21164 microprocessor. In addition, it provides an opportunity to speed up memory writes by the I/O bridge when they modify an amount of data that is smaller than the cache block size of 64 bytes (partial block writes).

The AlphaServer 4100 I/O subsystem consists of a PCI mother board and a bridge. The PCI mother board accepts I/O adapters such as network interfaces, disk controllers, or video controllers. The bridge provides the interface between PCI devices and between the CPUs and system memory. The I/O bridge reads and writes memory in much the same way as the CPUs, but special extensions are built into the system bus protocol to handle the requirements of the I/O bridge.

Typically, writes by the I/O bridge that are smaller than the cache block size require a read-modify-write sequence on the system bus to merge the new data with data from main memory or a processor's cache. The AlphaServer 4100 memory system typically transfers data in 64-byte blocks; however, it has the ability to accept writes to aligned 16-byte locations when the I/O bridge is sourcing the data. When such a partial block write occurs, the processor module checks the DTAG to determine if the address hits in the Alpha 21164 cache hierarchy. If it misses, the partial write is permitted to complete unhindered. If there is a hit, and the processor module contains the most recently modified copy of the data, the I/O bridge is alerted to replay the partial write as a read-modify-write sequence. This feature enhances DMA write performance for transfers smaller than 64 bytes since most of these references do not hit in the processor cache.4

Conclusions
The synchronous B-cache allows the CPU modules to provide high performance with a simple architecture, achieving the price and performance goals of the AlphaServer 4100 system. The AlphaServer 4100
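The partial-block-write check described above amounts to a two-input decision. The following C fragment is an illustrative condensation under that reading; the enum and function names are assumptions, not the module's actual logic.

```c
/* Partial-block-write decision: an I/O bridge write smaller than the
 * 64-byte block completes directly on a DTAG miss, but a hit on locally
 * modified data makes the bridge replay it as read-modify-write. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { COMPLETE_DIRECT, REPLAY_AS_RMW } partial_write_action_t;

partial_write_action_t partial_block_write(bool dtag_hit, bool dirty_here)
{
    if (dtag_hit && dirty_here)      /* this module owns the newest data */
        return REPLAY_AS_RMW;        /* merge via read-modify-write      */
    return COMPLETE_DIRECT;          /* miss: the write may proceed      */
}

int main(void)
{
    printf("%d\n", partial_block_write(true, true));   /* REPLAY_AS_RMW  */
    printf("%d\n", partial_block_write(false, false)); /* COMPLETE_DIRECT*/
    return 0;
}
```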

CPU design team pioneered the use of synchronous RAMs in an Alpha microprocessor-based system design, and the knowledge gained in bringing a design from conception to volume shipment will benefit future upgrades in the AlphaServer 4100 server family, as well as products in other platforms.

Acknowledgments

The development of this processor module would not have been possible without the support of numerous individuals. Rick Hetherington performed early conceptual design and built the project team. Pete Bannon implemented the synchronous RAM support features in the CPU design. Ed Rozman championed the use of random testing techniques. Norm Plante's skill and patience in implementing the often tedious PC layout requirements contributed in no small measure to the project's success. Many others contributed to firmware design, system testing, and performance analysis, and their contributions are gratefully acknowledged. Special thanks must go to Darrel Donaldson for supporting this project throughout the entire development cycle.

References

1. DIGITAL AlphaServer Family DIGITAL UNIX Performance Flash (Maynard, Mass.: Digital Equipment Corporation, 1996), http://www.europe.digital.com/info/performance/sp/unix-svr-flash-9.abs.html.

2. Z. Cvetanovic and D. Donaldson, "AlphaServer 4100 Performance Characterization," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 3-20.

3. G. Herdeg, "Design and Implementation of the AlphaServer 4100 CPU and Memory Architecture," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 48-60.

4. S. Duncan, C. Keefer, and T. McLaughlin, "High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 61-75.

5. "Microprocessor Report," MicroDesign Resources, vol. 8, no. 15 (1994).

6. IBM Personal Computer Power Series 800 Performance (Armonk, N.Y.: International Business Machines Corporation, 1995), http://ike.engr.washington.edu/news/whitep/ps-perf.html.

7. L. Saunders and Y. Trivedi, "Testbench Tutorial," Integrated System Design, vol. 7 (April and May 1995).

8. DIGITAL Semiconductor 21164 (366 MHz Through 433 MHz) Alpha Microprocessor Hardware Reference Manual (Hudson, Mass.: Digital Equipment Corporation, 1996).

9. J. Handy, "Synchronous SRAM Roundup," Dataquest (September 11, 1995).

General Reference

R. Sites, ed., Alpha Architecture Reference Manual (Burlington, Mass.: Digital Press, 1992).

Biographies

Maurice B. Steinman
Maurice Steinman is a hardware principal engineer in the Server Product Development Group and was the leader of the CPU design team for the DIGITAL AlphaServer 4100 system. In previous projects, he was one of the designers of the AlphaServer 8400 CPU module and a designer of the cache control subsystem for the VAX 9000 computer system. Maurice received a B.S. in computer and systems engineering from Rensselaer Polytechnic Institute in 1986. He was awarded two patents related to cache control and coherence and has two patents pending.

George J. Harris
George Harris was responsible for the signal integrity and cache design of the CPU module in the AlphaServer 4100 series. He joined DIGITAL in 1981 and is a hardware principal engineer in the Server Product Development Group. Before joining DIGITAL, he designed digital circuits at the computer divisions of Honeywell, RCA, and Ferranti. He also designed computer-assisted medical monitoring systems using PDP-11 computers for the American Optical Division of Warner Lambert. He received a master's degree in electronic communications from McGill University, Montreal, Quebec, and was awarded ten patents relating to computer-assisted medical monitoring and one patent related to work at DIGITAL in the area of circuit design.

Vol. 8 No. 4 1996 Andrej Kocev Andrcj Koscv joined 1)IGITAL in 1994 after rccci\,ing a I1.S. in computer science fi.oln llnsscl.~crPolytccllnic Institute. Hc is n scllior hard\\,~rcengineer in tllc Scr\,cr I'roduct I)c\,cloplilcnt Group 2nd '1 mcml>c~.oftllc (:1'1' \,crificntion team. He designed the logic \ crilicatio~lsofi- \\arc tic~cribctiin this paper.

Virginia C. Lamere
Virginia Lamere is a hardware principal engineer in the Server Product Development Group and was responsible for CPU module design in the DIGITAL AlphaServer 4100 series. Ginny was a member of the verification teams for the AlphaServer 8400 and AlphaServer 2000 CPU modules. Prior to those projects, she contributed to the design of the floating-point processor on the VAX 8600 and the execution unit on the VAX 9000 computer system. She received a B.S. in electrical engineering and computer science from Princeton University in 1981. Ginny was awarded two patents in the area of the execution unit design and is a co-author of the paper "Floating Point Processor for the VAX 8600" published in this Journal.

Roger D. Pannell
Roger Pannell was the leader of the VCTY ASIC design team for the AlphaServer 4100 system. He is a hardware principal engineer in the Server Product Development Group. Roger has worked on several projects since joining Digital in 1977. Most recently, he has been a module/ASIC designer on the AlphaServer 8400 and VAX 7000 I/O port modules and a bus-to-bus I/O bridge. Roger received a B.S. in electronic engineering technology from the University of Lowell.

Roger A. Dame

The AlphaServer 4100 Low-cost Clock Distribution System

High-performance server systems generally require expensive custom clock distribution systems to meet tight timing constraints. These clock systems typically have expensive, application-specific integrated circuits for the bus interface and require controlled etch impedance for the clock distribution on each module. The DIGITAL AlphaServer 4100 clock distribution allows off-the-shelf components to be used for the bus interface, which in turn results in lower costs and a quicker system power-up. Component placement and network compensation eliminated the need for controlled-impedance circuit boards. The clock system design makes it possible to upgrade servers with faster processor options and bus speeds without changing components.

Every digital computer system needs a clock distribution system to synchronize electronic communication. The primary metric used to quantify the performance of a clock distribution system is clock skew. Synchronous systems require multiple copies (outputs) of the same clock, and clock skew is the unwanted delay between any two of the copies. In general, the lower the skew, the better the clock system.

The DIGITAL AlphaServer 4100 clock distribution system is a compact, low-cost solution for a high-performance midrange server. The clock system provides more copies of the clock than machines in the same class typically need. The distribution system allows expansion on those module designs where more copies of the clock are needed with minimal skew. The system is based on a low-cost, off-the-shelf phase-locked loop (PLL) as the basic building block. The simple application of the PLL alone would not provide low clock skew, though. Signal integrity techniques and trade-offs were needed to manage skew throughout the system. The technical challenges were to design a low-cost system that would (1) require only a small area on the printed wiring boards (PWBs), (2) be adaptable to various speed grades (options) of CPUs, and (3) have good performance, i.e., low skew. This paper discusses the techniques used to optimize the performance of an off-the-shelf PLL-based clock distribution system.

Design Goals

Based on its experience with previous platform designs, the design team considered a clock system built from off-the-shelf parts: it was cost-effective, required less space than custom circuits, and provided adequate fan-out. The skew performance, however, ranged from 2 ns to 4 ns, which exceeded the design goal. Given the project time constraints and the design goals, the team chose to optimize the off-the-shelf PLL-based design rather than build a custom one.

Figure 1 Bus Settling Time as a Function of Dispersion Etch Length

A bus protocol feature holds the bus at a logic state during long idle times until another module wants to use the bus. Without this feature, the bus would settle at the terminator Thevenin voltage if no modules were driving the bus. Both protocols allow the Thevenin voltage to be close to the thresholds of the receivers. Normally this is avoided if the bus is left idle, because the receivers can go metastable, i.e., arrive at the unstable condition where the input voltage is between its specified logic 0 and logic 1 voltage levels, resulting in uncontrolled oscillation. Centering the Thevenin voltage in the normal full voltage swing had two advantages: (1) it balanced the settling time for both transitions, and (2) it reduced the driver current. The reduced driver current allowed for a lower Thevenin resistance, which brought the terminators closer to the unloaded (no modules) impedance of the bus, thus ensuring that the bus would settle within 6 ns.

Figure 2 Typical Phase-Locked Loop Connection

The Basic Building Block
Texas Instruments' CDC586 clock distribution circuit was chosen as the basic building block for the system because of its low cost and functionality. The device has a fan-out of 12 outputs with a single compensation loop and a frequency range of 25 megahertz (MHz) to 100 MHz, and is a 3.3-volt (V) bipolar complementary metal-oxide semiconductor (BiCMOS) part. Process skew is 1 ns between any two parts with the same reference input clock, and root mean square (RMS) jitter is 25 picoseconds (ps). The CDC586 has a built-in loop filter, which reduces the number of support components. Unlike custom clock circuits with multiple, independent compensation loops, the simple, single-loop design required critical attention to the layout of each module design to ensure the best possible skew performance. The circuit board layout designer had to determine the maximum etch length from the PLL to the receiver. All copies of the clock had to be precisely matched in length to the maximum length found, and routed on the same etch layer with 0.51 mm (20 mil) spacing to other etches and minimum etch crossovers from other etch layers on dual strip-line lay-ups.

Etch Layout
The PWB lay-ups used on various modules in the AlphaServer 4100 system contain microstrip etch (surface etch) and dual strip-line etch. Ideally, single strip-line etch would be optimum for clock etch; however, it requires more layers at higher cost for PWB material. One drawback to dual strip-line lay-ups is etch crossover. A crossover is a point along an etch trace where another etch, one on a different layer not separated by a reference plane, crosses. The crossover forms small capacitance patches, which can load the clock etch and affect its impedance and velocity of propagation. The result is additional skew from clock etch to clock etch. Designers avoided crossovers on all clock etch, and the design does not permit parallel etch on the other layer within the dual strip-line, which could induce cross talk.

Typical strip-line etch in multilayer PWBs is a signal layer that has reference planes, usually assigned to power or ground, in the layer above and the layer below. This design allows better impedance control and eliminates cross talk from other signal layers. PWB thickness and cost constraints often result in modified forms on the inner layers, however. Dual strip-line etch is often used in these cases. This design consists of two signal layers sandwiched between reference planes in the layers above and below. Generally the dielectric thickness between the two signal layers is greater than the dielectric thickness between either signal layer and its related (nearest) reference plane to minimize cross talk between the two signal layers. Figure 2 illustrates a typical application.

Figure 2 shows matched etch lengths L1, L2, and L3. On some module designs, this etch can be fairly long. The layout designers would generally "serpentine" or "trombone" these long etch runs to comply with the aforementioned layout rules. Spacing between the loops on the same etch run in the serpentine or trombone is critical. If the spacing is too close, then coupling will occur, thus changing the velocity of propagation as well as signal quality. Designers used simulation to determine a minimum etch-to-etch spacing for each PWB lay-up. The maximum allowable cross-talk noise level for any minimum spacing was 400 millivolts (mV). This level is within the maximum transistor-transistor logic (TTL) low-state level of 800 mV. Larger spacings were used where no other layout rules would be affected.

The Use of External Series Terminating Resistors
External series terminating resistors (also called terminators), denoted by R, are used at the source (see Figure 2). Although Texas Instruments offers another version of the PLL, namely CDC2586, which has
Generally the diclcctric thickncss bcnvecn the nvo signal I;i)crs is The Use of External Series Terminating Resistors greater than the diclcctric thick~~cssbenvccn cithcr External scrics tcrmindting resistors (~Isoculled termi- signal la!ler and its related (nearest) rcfcrcncc planc to nators), dcnotcd by R, arc ~~scdat the source (see minimize cross talk bcn\lccn the t\vo signal In)rcrs. Figure 2). Althoi~ghTexas Instruments offers another Figure 2 illustrates n typical application. version of the I'LL, namely ClX:2586, which has

Vol. S No. 4 1996 built-in series terminators, the AlphaServcr 4100 dcsign- the PLl, and a capacitor C from the same pocver pins ers did not use this variation for thc following reasons: to ground. Tlie L-C filter can be implemented in nvo \ways: Some forms ofclock treeing (a method of connect- (1)by using a surface mount inductor and (2) by using ing multiple receivers to the same clock output) a length ofetcli for the inductor. In either case, the Q require niultiple source tern~inators. ofthe circuit has to be kept low to prevent oscillation. The nominal value for the internal series tcnninator Q is a dimensionless number referred to as the quality \\[as not optimum for the target j~npedanceof the factor and is computed from the inductance L and PWBs. resistance R (in this case the inductor's resjstance) of The tolerance of the internal series ter~ninators a resonant circuit using the formula Q = oL/N.\vherc over the process range of the part could be as high w equals 27r/,' and J is the frequency. A low-value resis- as 20 pcrcent compared to 1 percent for external tor in series ~litlithe inductor can help. Estrc~necare resistors. should be taken ifthe length-of-etch (used to generate inductance) implementation is considered. Tlie etch Local Power Decoupling must be strip-line-etch isolated from any other adja- I't,Ls are analog components and are susceptible to cent etch or etch on other lavers not separated by power supply noise. One major point source for noise power or ground planes. A nvo-dimensional (2-D) is the PLL itself Most applications require all 12 out- modeling tool sl~ouldbe used to calculate the length puts to drive s~~bstantialloads, which generates local of etch needed to gct the proper inductance \lalue for noise. Asubstantial number of local decoupling capac- the filter. Simple rules of thun~bfor inductance \\rill itors (one for every four outp~~tpins) and short, \vide not work with reference planes (i.c., power and dispersion etch on the power and ground pins of ground planes). the PLL were required to help counter the noise. The R-C filter is limited to PLLs with moderately Designers also used tangential vias to minimize para- lo\\! current draw on the analog po\ver pins. The cur- sitic inductancc, which can severely rcdirce the effec- rent generates an IR drop (the voltage drop caused by tiveness of the dccoupling capacitors. Typical surhce the current through the resistor) across the resistor R. mount components have dispersion etch, which con- Typical PLL analog power inputs requirc less than nects the surface pad to a via. Tangential vias attach 1 ~nillia~np(rnA), \\lhicl? u~oilldallow a reasonable directly to the pad and eliminate any surface etch that \ialue resistor R. Two capacitors should be used in the can act liltc ind~1cta11ceat high frccli~ency.The 1'LLs R-C type filter: a bulk capacitor For basic filtcr response were also located away from other potential noise and a radio frequency (RF)capacitor to filter higher sources such as the Alpha microprocessor chip. frequencies. Bulk capacitors are any clcctrolytic-style capacitor 1 microfarad (FF) or greater. These capaci- Analog Power Supply Filter tors have intrinsic parasitics that keep them fio111 The most important external circuit to the PLL is the responding to high-frequency noise. The benefit of low-pass filter on tlic analog power pins. 
Typically, PLL tlie L-C filter is that, although a single capacitor can be designs have separate analog and digital po\ver and used (~VOare still suggested \\lit11 this style filter), the ground pins. This allo\\s thc use of a low-pass filter to reactance of the inductor increases with freqi~encyand prevent local s\vitching noise fi-om entering the analog helps bloclc noise. Both filter styles \\lerc used in tlic core of the PLL (primarily the voltage-controlled oscil- AIphaScrvcr 4100 systeln. lator [VCO]).Ifa fi ltcr is not used, then large edge-to- edge jitter \\rill develop and \\!ill greatly increase clock System Distribution Description skew. Most PI,L vendors suggest filter designs and PWB layout patterns to help reduce the noise entering The AlphaServer n~otlierboard has four CPU slots, the analog corc. The CDCS86 PLL was introduced at eight memory slots, and an 1/0 bridge 11iodule slot. the beginning of tlie Alphaserver 4100 design, and the Each module in the system, including the mother- vendor had not yet specified a filter for the analog board, has at least one PLL. The starting point of the ~xnverinpi~t. It is inlporta~itto note tlii~tif any neur system is the CPU that plugs into CPU slot 0. Each PLI, is considered and preli~ninaryvendor specifica- CPU module has an oscillator and a buffer to drive the tions do 110t include details about the analog po\ver, 111ain system distribution, but tlie Cl'U that plugs into thc dcsigner slio~~ldcontact the vendor for details. slot 0 actually drives the system distribution. A PLL on T\vo forms of Ion-pass filters \\,ere co~lsidered:L-C the motl~erboardreceives the clock source generated and R-C. The L-C filter consists of a series inductor L by the CPU in slot 0 and distributes low sl

Skew Management Techniques

The AlphaServer 4100 system had four design teams. Each team was assigned a portion of the system. Signal integrity techniques had to be developed to keep the skew across the system as low as possible. These techniques were structured into a set of design rules that each team had to apply to their portion of the design. To develop these rules, designers explored several

areas, including impedance range, termination, treeing, PLL placement, and compensation.

Figure 3 System Clock Flow Diagram

Impedance Range
Controlled impedance (+/-10 percent from a target impedance) raises the PWB cost by 10 percent to 20 percent, depending on board size. Each raw PWB has to be tested and documented by the PWB suppliers, which results in a fixed charge for each PWB, regardless of size. Therefore, smaller PWBs have the highest cost burden. The AlphaServer 4100 uses relatively small daughter cards. Since low system cost was a primary goal, noncontrolled impedance PWBs had to be considered. Unfortunately, allowing the PWB impedance range (over process) to spread to greater than +/-10 percent makes the task of keeping clock skew low more difficult. Specification of mechanical dimensions with tolerances was the only way to provide some control of the impedance range with no additional costs.

The Alpha microprocessor used on all CPU options for the AlphaServer 4100 system has its own local clock circuitry. The microprocessor uses a built-in digital PLL that allows it to lock to an external reference clock at a multiple of its internal clock. In the context of the AlphaServer 4100 system, the reference clock is generated by the local clock distribution system. The AlphaServer 4100 is fully synchronous. Each CPU in the system has two clock sources: one for the bus distribution (system cycle time) and one for the microprocessor. This design may appear to be a costly one, but this approach is extremely cost-effective when field upgrades are considered. When new, faster versions of the Alpha microprocessor become available, new CPU options will be introduced. To remain synchronous, the Alpha microprocessor internal clocks need to run at a multiple of the system cycle time. Although the system cycle time goal is 15 ns, the cycle time needs to be adjusted to the speed of the CPU option used. Placing the bus oscillator, which drives the primary PLL for the clock system (cycle time), on the CPU module and designing the clock distribution system to function over a wide frequency range makes field upgrades as simple as replacing the CPU modules. The motherboard does not need to be changed.

Table 1 contains the results of simulations performed using SIMPEST, a 2-D modeling tool developed by DIGITAL, for a six-layer PWB used on one of the AlphaServer 4100 modules. The PWB dimensions and tolerances specified to the vendors were used in the simulations. The dielectric constant, the only parameter not specified to the vendor, ranged from 3.8 to 5.2, which overlaps the typical industry-published range of 4.0 to 5.0 for FR4-type material (epoxy-glass PWB). Since our PWB material acceptance with the vendor is based on meeting dimension tolerances, we used the 6-sigma impedance range on all SPICE simulations, thus ensuring that all acceptable PWB material would work electrically. Table 2 shows the impedance range for a controlled-impedance PWB for the same target impedance.

Table 1 Vendor Impedance Ranges Specifying Dimensions Only

                        4-Sigma Yield         6-Sigma Yield
Mean target impedance   71 ohms               71 ohms
Impedance range         62 ohms to 83 ohms    57 ohms to 89 ohms
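The synchronous-clocking constraint described above (the microprocessor's internal clock runs at an integer multiple of the bus cycle, set by the oscillator on the CPU module) can be illustrated with a small C sketch. The CPU speed and multiple below are assumed example values, not a specific AlphaServer 4100 option.

```c
/* Derive the bus clock from an assumed CPU speed grade and PLL multiple. */
#include <stdio.h>

int main(void)
{
    const double cpu_mhz = 400.0;          /* assumed CPU option speed */
    const int    mult    = 6;              /* assumed clock multiple   */

    double bus_mhz  = cpu_mhz / mult;      /* bus clock derived from it */
    double cycle_ns = 1000.0 / bus_mhz;

    /* The ~15-ns system cycle goal is met by picking the multiple. */
    printf("bus clock %.1f MHz, cycle time %.1f ns\n", bus_mhz, cycle_ns);
    return 0;
}
```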

\/ol. 8 No.4 1996 Table 2 stressed. If the tests i~ldicatcdstressed parts, designers Vendor lmpedance Range for an lmpedance would adjust thc terminator \raluc accordingl\~. Tolerance of +I- 10 Percent +I- 10 Specification Range Treeing Treeing is a method of distributing clocks from a Mean target 71 ohms single source driver to many rccei\,crs. This practice, lmpedance \vhicli is \\tell I

Treeing
Treeing is a method of distributing clocks from a single source driver to many receivers. This practice, which is well known, is illustrated in Figure 4.

(high driver irnpcdancc and low etch impedance), SHARED which can be substantial. To mininiizc this cfkct, OUTPUT dcsigncrs selected terminator \~alucsthnt allow over- slioot and solnc bounce-away from the tlircshold rcgioll at the extreme proccss corncr. O\~crsliootcan reach the rnasi~mii~nispccifcd alternating current (AC) input of tlic receivers O\~CI- tlic orst- st-case proccss rmgc. Some rccci\lcrs have built-in ciiodc clamping to tlicir po\vcr supply rails as a rcs~~ltof ESl> circuits in tlicir input structures (ESD circuits arc uscd for static KEY: discharge protection). In these c.lscs, the clock signal is FB FEEDBACK LOOP INPUT FOR THE PHASE-LOCKED LOOP R SERIES TERMINATING RESISTOR clamped, \vliicli in turn dampens bounce. The jnjec- tion currents caused by clamping \\.oulJ be tested in Figure 4 SPICE siniulations to be sure tli.~ttlic parts were not Clock Trccing

PLL Placement
Placement of the PLL on each module is critical. Figure 5 is a simplified view of the primary distribution up to and including the PLL on a module. The AlphaServer 4100 system has two types of module connectors: a Metral connector (Futurebus+-style connector) is used on the CPU modules and the I/O bridge module, and an Extended Industry Standard Architecture (EISA) connector is used on the memory modules. In Figure 5, the delay from the PLL to a receiver includes the etch delay T(L1), while the delay seen at the PLL's feedback (FB) input is the delay around its compensation loop, T(L2). For equal etch lengths, T(L1) = T(L2), the two delays match and cancel; adding the connector delay to the compensation path yields the same cancellation for clocks that cross a connector, so the delays through the two connector types can be matched.

[Figure 5 diagram: primary distribution on the motherboard and typical module local distribution; clock in from CPU 0, phase-locked loops, series terminating resistors (R), dispersion etch, connector, receivers, and compensation loops. Key: FB = feedback loop input for the phase-locked loop; R = series terminating resistor; L1, L2 = etch lengths.]

Figure 5 Primary Distribution

Compensation
Some modules have a wide variety of circuits receiving clocks that, because of input loading, do not balance well with the various treeing methods. Designers used dummy capacitor loading to help balance the treeing. This approach was particularly useful on memory modules, which could be depopulated to provide different options using the same etch. Surface-mount pads were added to the etch such that if the depopulated version were built, a capacitor could be added to replicate the missing load on the tree, thus keeping it in balance. The CPU modules have a wide variety of clock needs, which results in two forms of skew: (1) load-to-load skew at the module and (2) control logic-to-CPU skew, for control logic located on the motherboard. The local load-to-load skew is acceptable because only one PLL is used and the output-to-output skew is only 500 ps. Motherboard-to-CPU control logic skew, though, is critical because of timing constraints.

Dummy capacitor loading at each lightly loaded receiver would have reduced the skew, but to compensate for just one heavily loaded receiver would have required many capacitors. PWB surface area and the requirement of simplicity dictated the need for an alternative. The solution was to keep the clock edges as fast as possible (by adjusting the series terminators) and to add a compensation capacitor at the input (the feedback [FB]) of the PLL's compensation loop. This effectively reduced the skew from the slowest load on the CPU to the control logic on the motherboard. Figure 6 shows the disparity between light and heavy loading, from T1 to T2. Without feedback compensation, the PLL self-adjusts to the lightly loaded receiver. This adjustment results in skew T1 to T2 from the heavy load to the control logic on the motherboard. A capacitor on the FB input of the PLL split the difference between T1 and T2 (placing the FB edge at T3) and minimized the perceived skew.

[Figure 6 waveforms: lightly loaded receiver; heavily loaded receiver; compensation loop FB input (PLL) with no capacitor; compensation loop FB input (PLL) with capacitor. Key: T1 = lightly loaded receiver clock edge time (reference); T2 = heavily loaded receiver clock edge time; T3 = compensation loop FB input edge time with capacitor; FB = feedback loop input for the phase-locked loop.]

Figure 6 Feedback Loop Compensation

Skew Target
Designers generated the worst-case module-to-module clock skew specification for the AlphaServer 4100 from vendor specifications, SPICE simulations, and bench tests using the techniques discussed in this paper. The worst-case skew goal is 2.2 ns and is summarized in Table 3.

The reader will note that eight times the vendor's specification may appear to be a rather conservative specification. The use of this value was based on two concerns: (1) the PLL was new at the time, and experience suggested that the vendor's specification was aggressive; and (2) some level of padding was required if an exception to the rules was needed. Actual system testing bore out these concerns. The vendor had to relax the jitter specification from 25 ps to 70 ps RMS, and there were some difficulties getting good load balance. The specification did not change, however. Reassessing the allocated bus settling time involves the bus cycle, the transmitting module's clock-to-output delay (Tco), the setup and hold time for the receiving module, and the clock skew; the remainder is the time allocated for bus settling. SPICE simulations for a fully loaded bus with the worst possible driver-receiver position yielded a bus settling time of 5.7 ns. The relaxed skew of 2.2 ns maximum was acceptable for the design.

Comparative Analysis
A comparison of clock distribution systems between two other platforms best summarizes the AlphaServer 4100 system. The AlphaServer 4100 has a price and performance target between those of the AlphaServer 2100 and the AlphaServer 8400 systems. Table 4 compares the basic differences among these systems relating to clock distribution for a CPU module from each platform.

Both the AlphaServer 2100 and the AlphaServer 8400 systems have large custom ASICs for their module's bus interface. The AlphaServer 4100 and the AlphaServer 8400 systems have bus termination; the AlphaServer 2100 system does not.
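The bus-settling budget discussed in the Skew Target section above must close against the 15-ns cycle. The C sketch below works that arithmetic; the 15-ns cycle, 2.2-ns skew, and 5.7-ns settling figures come from the text, while the remaining margin for Tco plus setup/hold is simply derived, not a published value.

```c
/* Bus-settling budget check: cycle = Tco + setup/hold + skew + settle. */
#include <stdio.h>

int main(void)
{
    const double cycle_ns  = 15.0;  /* system bus cycle (from text)     */
    const double skew_ns   = 2.2;   /* worst-case clock skew (text)     */
    const double settle_ns = 5.7;   /* simulated settling time (text)   */

    /* Whatever Tco and setup/hold actually were, they must fit here: */
    double tco_plus_setup = cycle_ns - skew_ns - settle_ns;  /* 7.1 ns */
    printf("margin left for Tco + setup/hold: %.1f ns\n", tco_plus_setup);
    return 0;
}
```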
The reader will note that eight times the vendor's specification may appear to be a rather conservative value. The use of this value was based on two concerns: (1) the PLL was new at the time, and experience suggested that the vendor's specification was aggressive; and (2) some level of padding was required if an exception to the rules was needed. Actual system testing bore out these concerns: the vendor had to relax the jitter specification from 25 ps to 70 ps RMS, and there were some difficulties getting good load balance. The design specification did not change, however. Reassessing the allocated bus settling time yields the following components: the bus cycle, the transmitting module (Tco), the setup and hold time for the receiving module, the clock skew, and the time allocated for bus settling. SPICE simulations for a fully loaded bus with the worst possible driver-receiver position yielded a bus settling time of 5.7 ns. The relaxed maximum skew of 2.2 ns was acceptable for the design.

Comparative Analysis

A comparison of clock distribution systems with two other platforms best summarizes the AlphaServer 4100 system. The AlphaServer 4100 has price and performance targets between those of the AlphaServer 2100 and the AlphaServer 8400 systems. Table 4 compares the basic differences among these systems relating to clock distribution for a CPU module from each platform. Both the AlphaServer 2100 and the AlphaServer 8400 systems have large custom ASICs for their modules' bus interfaces. The AlphaServer 4100 and the AlphaServer 8400 systems have bus termination; the AlphaServer 2100 system does not.

Table 3
Worst-Case Clock Skew

Stage                 Source           Skew Component
Motherboard           Out-to-out skew  500 ps (vendor specification)
Inputs to modules     Load mismatch    100 ps (simulation/bench test)
Module to module      PLL process      1,000 ps (vendor specification)
Inputs to receivers   Load mismatch    200 ps (simulation/bench test)
Inputs to receivers   PLL jitter       400 ps (eight times the vendor specification)
Total clock skew                       2,200 ps = 2.2 ns
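The total is the arithmetic sum of the stage skews; as a worked check of the budget:

\[ 500 + 100 + 1{,}000 + 200 + 400\ \text{ps} = 2{,}200\ \text{ps} = 2.2\ \text{ns} \]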

Table 4
Clock Distribution Comparison of Three Platforms

                          AlphaServer 2100   AlphaServer 4100   AlphaServer 8400
Bus width                 128 + ECC          128 + ECC          256 + ECC
Bus speed                 24 ns              15 ns              10 ns
Clock skew                1.5 ns             2.2 ns (max.)      1.1 ns (max.)
Inputs requiring clocks   10                 25                 14
Clock drivers used        12                 13                 11
Number of clock phases    4                  1                  1

Allowing a bus to settle naturally (with no termination), as in the case of the AlphaServer 2100 system, requires a tighter skew budget from the clock system. The trade-off is higher cost, power, and PWB area for lower bus speed. Higher performance systems, such as the AlphaServer 8400 and AlphaServer 4100 systems, generally require faster bus speeds with terminators. The AlphaServer 4100 has shorter bus stubbing (module transceiver to connector dispersion etch) and slower bus speed than the AlphaServer 8400, which allows larger skew (as a percentage of the bus speed).

Table 5 is a comparison of board area needed and cost for the clock system. Designers analyzed an entry-level system consisting of one CPU module, one memory module, and one I/O bridge or interface module. The board area shows the space required by the active components only (the digital phase-locked loops, PLLs, drivers, etc.).

Both Tables 4 and 5 show that the clock system design for the AlphaServer 4100 system requires only one-third the space of either the AlphaServer 2100 system or the AlphaServer 8400 system at a fraction of the cost and distributes more copies of the clock.

Conclusions

An effective, low-cost, high-performance clock distribution system can be designed using an off-the-shelf component as the basic building block. DIGITAL AlphaServer 4100 system designers accomplished this by optimizing the bus and developing simple techniques structured in the form of design rules. These rules are:

- Use positive edges for critical clocking.
- Match delay through different connectors using appropriate pinning.
- Use a fixed dispersion etch length from the connector to the PLL.
- Route and balance all clock nets on the same PWB layer.
- Minimize adjacent-layer crossovers and maximize spacings.
- Use minimum-value terminators.
- Use tree and loop compensation where needed.
- Use conservative local decoupling and a low-pass filter on the PLL (analog power).

Table 5
Board Utilization and Cost Comparison

                              AlphaServer 2100   AlphaServer 4100   AlphaServer 8400
Board area used* (sq cm)      352.8              111.4              371.3
Normalized cost               1.00               0.46               4.40

*Note that these measurements do not include decoupling capacitors and terminators.

The worst-case lab measurement of clock skew between any two modules in a fully configured system was 1.1 ns, which is well within the 2.2-ns calculated maximum skew.

Acknowledgments

Terry Skrypek and Bruce Alford assisted with the prototyping and measurements. Cheryl Preston, Andy Koning, Steve Coe, George Harris, and Larry Berenne worked with the designers to ensure compliance with the signal integrity rules. Darrel Donaldson, Don Smelser, Glenn Herdeg, and Dan Wissell provided invaluable technical guidance.

Note and References

1. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley.

2. CDC Clock Distribution Circuits Data Book (Dallas, Tex.: Texas Instruments Incorporated, 1994).

3. Alpha 21164 Microprocessor Hardware Reference Manual (Maynard, Mass.: Digital Equipment Corporation, September 1994).

4. C. Guiles, Everything You Ever Wanted to Know About Laminates... But Were Afraid to Ask, 4th ed. (Maitland, Fla.: Arlon, Inc., January 1989).

Biography

Roger A. Dame
A principal signal integrity engineer in the Midrange Servers group, Roger Dame is currently working on the AlphaServer 4100 project. During the 10 years he has been with this group, he has also contributed to the VAX 6000, VAX 8800, VAX 7000, DEC 7000, and DEC 10000 projects. In earlier work at DIGITAL, in the Industrial Products group, he developed analog-to-digital process control system interfaces. Roger joined DIGITAL in 1971. He holds an A.S.E.E.T. degree from Springfield Technical Community College and a B.S.E.E.T. (summa cum laude) from Central New England College. Roger is coinventor of the laser bus used in the DEC 7000 and DEC 10000 systems.

Glenn A. Herdeg

Design and Implementation of the AlphaServer 4100 CPU and Memory Architecture

The DIGITAL AlphaServer 4100 system is Digital Equipment Corporation's newest four-processor midrange server product. The server design is based on the Alpha 21164 CPU, DIGITAL's latest 64-bit microprocessor, operating at speeds of up to 400 megahertz and beyond. The memory architecture was designed to interconnect up to four Alpha 21164 CPU chips and up to four 64-bit PCI bus bridges (the AlphaServer 4100 supports up to two buses) to as much as 8 gigabytes of main memory. The performance goal for the AlphaServer 4100 memory interconnect was to deliver a four-multiprocessor server with the lowest memory latency and highest memory bandwidth in the industry by the end of June 1996. These goals were met by the time the AlphaServer 4100 system was introduced in May 1996. The memory interconnect design enables the server system to achieve a minimum memory latency of 120 nanoseconds and a maximum memory bandwidth of 1 gigabyte per second by using off-the-shelf data path and address components and programmable logic between the CPU and the main memory, which is based on the new synchronous dynamic random-access memory technology.

The DIGITAL AlphaServer 4100 system is a symmetric multiprocessing (SMP) midrange server that supports up to four Alpha 21164 microprocessors. A single Alpha 21164 CPU chip may simultaneously issue multiple external accesses to main memory. The AlphaServer 4100 memory interconnect was designed to maximize this multiple-issue feature of the Alpha 21164 CPU chip and to take advantage of the performance benefits of the new family of memory chips called synchronous dynamic random-access memories (SDRAMs). To meet the best-in-industry latency and bandwidth performance goals, DIGITAL developed a simple memory interconnect architecture that combines the existing Alpha 21164 CPU memory interface with the industry-standard SDRAM interface. Throughout this paper the term latency refers to the time required to return data from the memory chips to the CPU chips; the lower the latency, the better the performance. The AlphaServer 4100 system achieves a minimum latency of 120 nanoseconds (ns) from the time the address appears at the pins of the Alpha 21164 CPU to the time the CPU internally receives the corresponding data from any address in main memory. The term bandwidth refers to the amount of data, i.e., the number of bytes, transferred between the memory chips and the CPU chips per unit of time; the higher the bandwidth, the better the performance. The AlphaServer 4100 delivers a maximum memory bandwidth of 1 gigabyte per second (GB/s).

Before introducing the DIGITAL AlphaServer 4100 product in May 1996, the development team conducted an extensive performance comparison of the top servers in the industry. The benchmark tests showed that the AlphaServer 4100 delivered the lowest memory latency and the highest McCalpin memory bandwidth of all the two- to four-processor systems in the industry. A companion paper in this issue of the Journal, "AlphaServer 4100 Performance Characterization," contains the comparative information.¹

This paper focuses on the architecture and design of the three core modules that were developed concurrently to optimize the performance of the entire
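As a worked restatement of these two figures (using the 15-ns cycle time and 16-byte-wide data path described later in this paper):

\[ 8\ \text{cycles} \times 15\ \text{ns} = 120\ \text{ns}, \qquad \frac{16\ \text{bytes}}{15\ \text{ns}} \approx 1.07\ \text{GB/s} \approx 1\ \text{GB/s} \]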

memory architecture. These three modules (the motherboard, the synchronous memory module, and the no-external-cache processor module) are shown in Figure 1.

No-External-Cache Processor Module

The no-external-cache processor module is a plug-in card with a 144-bit memory interface that contains one Alpha 21164 CPU chip, eight 18-bit clocked data transceivers, four 12-bit bidirectional address latches, and control provided by 5-ns 28-pin PALs and 90-MHz 44-pin PLDs clocked at 66 MHz. The Alpha 21164 CPU chip is programmed to operate at a synchronous memory interface cycle time of 66 MHz (15 ns) to match the speed of the SDRAM chips on the memory modules. Although there are no external cache random-access memory (RAM) chips on the module, the Alpha 21164 itself contains two levels of on-chip caches: a primary 8-kilobyte (KB) data cache and a primary 8-KB instruction cache, plus a second-level 96-KB data and instruction cache.

Motherboard

The motherboard contains connectors for up to four processor modules, up to four memory module pairs, up to two I/O interface modules (four peripheral component interconnect [PCI] bus bridge chips total), memory address multiplexers/drivers, and logic for memory control and arbitration. All control logic on the motherboard is implemented using simple 5-ns 28-pin programmable array logic (PAL) devices and more complex 90-megahertz (MHz) 44-pin programmable logic devices (PLDs).

Figure 2
Data Path between the CPU and Memory (no-external-cache processor modules [1 to 4], the motherboard, and the synchronous memory modules [1 to 4 pairs] connected by the data and ECC path)

Table 1 CPU Read Memory Data Timing

Table 2 CPU Write Memory Data Timing

Write data transfer always incurs a one-cycle gap between transactions. As a result, all but the first two consecutive write transactions have address bus commands five cycles apart. Since the AlphaServer 4100 interconnect between the CPU and main memory was optimized for the SDRAM memory chip, the transaction timing, as shown in Tables 1 and 2, was designed to provide data in the correct cycles for the SDRAMs without the need for custom ASICs to buffer the data between the motherboard and SDRAM chips. This design works well for an infinite stream of all reads or all writes because of the SDRAM pipelined interface; however, when a write transaction immediately follows a read transaction, a gap or "bubble" must be inserted in the data stream to account for the fact that read data is returned later in the transaction than write data. As a result, every write transaction that immediately follows a read transaction produces a five-cycle gap in the command pipeline. Table 3 shows the read/write transaction timing.

Address Path between the CPU and Memory

The Alpha 21164 provides 36 address signals (byte address <39:4>, i.e., bits 4 through 39), 5 command bits, and 1 bit of parity protection. These 42 signals are connected directly to four 12-bit bidirectional latched transceivers on the processor module, as illustrated in Figure 3. The motherboard latches the full address and drives first the row and then the column portion of the address to the memory modules. Each synchronous memory module buffers the row/column address and fans out a copy to each of the SDRAM chips using four 24-bit buffers. Similar to traditional dynamic random-access memory (DRAM) chips, SDRAM chips use the row address on their pins to access the page in their memory arrays and the column address that appears later on the same pins to read or write the desired location within the page. Consequently, there is no need to provide the entire 36-bit-wide address to the memory modules. All address components used for transceivers, latches, multiplexers, and drivers on the no-external-cache processor module, the motherboard, and the synchronous memory module consist of the 56-pin ALVC16260 or the ALVC162260, which is the same part with internal output resistors. Address parity is checked by the PCI bridge chips on all transactions, and any errors are reported to the operating system.
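To make the turnaround rule concrete, here is a minimal C sketch of our reading of these spacing rules; the enum names, the spacing function, and the example stream are invented for illustration and are not from the paper:

#include <stdio.h>

enum op { RD, WR };

/* Simplified spacing model (our interpretation, not cycle accurate):
 * reads pipeline every 4 cycles; consecutive writes run 5 cycles
 * apart; a write immediately following a read inserts a 5-cycle
 * bubble on top of the normal spacing.                              */
static int spacing(enum op prev, enum op cur)
{
    if (prev == RD && cur == WR) return 4 + 5;
    if (prev == WR && cur == WR) return 5;
    return 4;
}

int main(void)
{
    enum op stream[] = { RD, RD, WR, WR, RD };
    const char *name[] = { "read", "write" };
    int cycle = 0;
    printf("%-5s issues at cycle %2d\n", name[stream[0]], cycle);
    for (int i = 1; i < 5; i++) {
        cycle += spacing(stream[i - 1], stream[i]);
        printf("%-5s issues at cycle %2d\n", name[stream[i]], cycle);
    }
    return 0;
}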

Table 3 CPU Read/Write Memory Data Timing

Figure 3
Address Path between the CPU and Memory (no-external-cache processor modules [1 to 4] with address latches, the motherboard address latch and multiplexer, and the synchronous memory modules [1 to 4 pairs] with row/column buffers)

The address path uses flow-through latches for the first half of the address transfer (i.e., the row address) from the Alpha 21164 to the SDRAMs. When the address appears at the pins of the Alpha 21164, the latched transceiver on the processor module, the multiplexed row address driver on the motherboard, and the fan-out buffers on the memory modules are all open and turned on, enabling the address information to propagate directly from the Alpha 21164 pins to the SDRAM pins in two cycles. The motherboard then switches the multiplexer and drives the column address to the memory modules to complete the transaction (see Table 4). Back-to-back memory transactions are pipelined to deliver a new address to the SDRAM chips every four cycles. The full memory address is driven to the motherboard in two cycles (cycles 0-1, 4-5, 8-9), whereas additional information about the corresponding transaction (which is used only by the processor and the I/O modules) follows in a third cycle (cycles 2, 6, 10). To avoid tri-state overlap, the fourth cycle is allocated as a dead cycle, which allows the address drivers of the current transaction to be turned off before the address drivers for the next transaction can be turned on (cycles 3, 7, 11). These four cycles constitute the address transfer that is repeated every four or five cycles for consecutive transactions. Note that the one-cycle gap inserted between transactions Rd3 and Rd4, for reasons indicated earlier in the read data timing description, causes the row address for transaction Rd4 to appear at the pins of the SDRAMs for three cycles instead of two.

Control Path between the CPU and Memory

The Alpha 21164 provides five command bits (four Alpha 21164 CMD signals plus the Alpha 21164 Victim-Pending signal) that indicate the operation being requested by the Alpha 21164 external interface. These five command bits are included in the 42 command/address (CA) signals indicated in Figure 3 and are driven directly and unmodified through the latched address transceivers on the processor module to become the motherboard command/address. Since the AlphaServer 4100 interconnect between the CPU and main memory was optimized for the Alpha 21164 CPU chip, the Alpha 21164 external CMD signals map directly into the 6-bit encoding of the memory interconnect command used on the motherboard, thus avoiding the need for custom ASICs to manipulate the commands between the CPU and the motherboard.

Prudently chosen encodings of the Alpha 21164 external CMD signals resulted in only two command bits (to determine a read or a write transaction) and one address bit (to determine the memory bank) being used by a 5-ns PAL on the processor module to directly assert a Request signal to the motherboard to use the memory interconnect. Figure 4 shows the control path between the CPU and memory. If the central arbiter is ready to allow a new transaction when the processor module asserts a Request signal (i.e., if the memory interconnect is not in use), then a 5-ns PAL on the motherboard asserts the control signal Row_CS to each of the memory modules in the following cycle. At the same time, another 5-ns PAL on the motherboard decodes 7 bits of the address and drives the Sel signal to all memory modules to indicate which of the four memory module pairs is being selected by the transaction. Each synchronous memory module uses another 5-ns PAL to immediately send the corresponding chip select (CS) signal to the requested SDRAM chips on one of the CS<3:0> signals when the Row_CS control signal is asserted, if that module pair is selected by the value encoded on Sel, as shown in Figure 4.
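The two-phase transfer can be sketched in C as follows; the row/column bit positions are assumptions chosen only to illustrate the multiplexing, since the paper does not give the exact field boundaries:

#include <stdint.h>
#include <stdio.h>

/* Assumed field positions for illustration only. */
static unsigned row_bits(uint64_t addr)    { return (addr >> 18) & 0xFFF; }
static unsigned column_bits(uint64_t addr) { return (addr >> 6)  & 0xFFF; }

int main(void)
{
    uint64_t byte_addr = 0x0000012345670ULL;  /* byte address <39:4> */
    /* Cycles 0-1: row address flows through the open latches.      */
    printf("row address:    0x%03X\n", row_bits(byte_addr));
    /* Cycle 2: multiplexer switches; column address is driven.     */
    printf("column address: 0x%03X\n", column_bits(byte_addr));
    /* Cycle 3: dead cycle so the current drivers turn off before
     * the next transaction's drivers turn on.                      */
    return 0;
}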

Table 4 CPU Read Memory Address Timing

Cycle (15 ns):        0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
Address bus command:  Rd1, Rd2, Rd3, Rd4
SDRAM address:        Row Addr1, Col Addr1; Row Addr2, Col Addr2; Row Addr3, Col Addr3; Row Addr4, Col Addr4
Motherboard address:  Mem Addr1/Info1; Mem Addr2/Info2; Mem Addr3/Info3; Mem Addr4/Info4
Alpha 21164 address:  Addr1, Addr2, Addr3, Addr4, Addr5

Figure 4
Control Path between the CPU and Memory (no-external-cache processor modules [1 to 4], the motherboard arbiter and decode PALs, and the synchronous memory modules [1 to 4 pairs] with their chip-select PALs)

Table 5 shows the control signals between the processor modules, the memory modules, and the central arbiter on the motherboard for multiple processor modules issuing single read transactions. The central arbiter receives one or more Request signals from the processor modules and asserts a unique Grant<3:0> signal to the processor module that currently owns the bus. The arbiter then drives a copy of the CA signal to every processor module along with the identical Row_CS signal to every memory module to mark cycle 1 of a new transaction. Note that the cycle counter begins at cycle 1 with each new CA/Row_CS assertion and may stall for one or more cycles when gaps appear on the memory interconnect. Two transactions may be pipelined at the same time. For simplicity of implementation in programmable logic devices, the cycle counter of each transaction is always exactly four cycles from the other.

Table 6 shows a single processor module issuing two consecutive read transactions (dual-issue) followed by a third read transaction at a later time. Normally, the node issuing the transaction on the bus deasserts the Request signal in cycle 2. If a node continues to assert the Request signal, the central arbiter continues to assert the Grant signal to that node to allow guaranteed back-to-back transactions to occur. Note that the first CA cycle occurs three cycles after the assertion of the Request signal because of the delay within the central arbiter to switch the Grant signal between processors. The third CA cycle occurs only one cycle after the node asserts the Request signal, however, because of bus parking. Bus parking is an arbitration feature that causes the central arbiter to assert the Grant signal to the last node to use the bus when the bus is idle (following cycle 7 of transaction Rd2). Consequently, if the same processor wishes to use the bus again, the assertion of the CA and Row_CS signals occurs two cycles earlier than it would without the bus parking feature.
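A simplified C model of the bus-parking rule may help; it handles one requester per cycle and invents its data structures, so it illustrates the policy rather than the motherboard's actual PAL logic:

#include <stdio.h>

#define NO_REQUEST (-1)

static int parked_on = 1;   /* node currently holding Grant */

/* Returns the node granted this cycle given one requester (or none);
 * a real arbiter also enforces fairness among several requesters.  */
static int arbitrate(int requester)
{
    if (requester != NO_REQUEST)
        parked_on = requester;   /* Grant follows the active requester */
    return parked_on;            /* otherwise stay parked on last user */
}

int main(void)
{
    int reqs[] = { 1, NO_REQUEST, NO_REQUEST, 1, 2, NO_REQUEST };
    for (int c = 0; c < 6; c++)
        printf("cycle %d: grant -> CPU%d%s\n", c, arbitrate(reqs[c]),
               reqs[c] == NO_REQUEST ? " (parked)" : "");
    return 0;
}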

Table 5 Multiple CPU Read Memory Control Timing

Rows: cycle counter (15-ns cycles); Request (per processor 1-4); Grant (per processor 1-4); CA and Row_CS (marking cycle 1 of each new transaction); address/command bus (Addr/Rd1 Info1 through Addr/Rd4 Info4); SDRAM CMD (RAS, CAS, WE): ACT 1, Read 1, ACT 2, Read 2, ACT 3, Read 3, ACT 4, Read 4.

Table 6 Single CPU Read Memory Control Timing

Data Transfers between Two CPU Chips (Dirty Read Data)

The Alpha 21164 CPU chips contain internal write-back caches. When a CPU writes to a block of data, the modified data is held locally in the write-back cache until it is written back to main memory at a later time. The modified (dirty) copy of the block of data must be returned in place of the unmodified (stale) copy from main memory when another CPU issues a read transaction on the memory interconnect. The memory modules return the stale data at the normal time on the memory interconnect, and the dirty data is returned by the processor module containing the modified copy in the cycles that follow. The processor module issuing the read transaction ignores the stale data from memory.

Therefore, to maintain cache coherency between the write-back caches contained in multiple Alpha 21164 CPU chips, each read transaction that appears on the memory interconnect causes a cache probe (snoop) to occur at all other CPU chips to determine if a modified (dirty) copy of the requested data is found in one of the internal caches of another Alpha 21164 CPU chip. If it is, then the appropriate processor module asserts the signal Dirty_Enable for a minimum of five cycles to allow the memory module to finish driving the old data. The processor module deasserts the signal when the dirty data has been fetched from one of the internal caches and is ready to be driven onto the motherboard data bus. Table 7 shows read data corresponding to transaction Rd1 being returned from CPU2 to CPU1 five cycles later than the data from memory, which is ignored by CPU1. Note the one-cycle gap in cycles 10 and 15 to avoid tri-state overlap between the memory module and processor module data path drivers.

As discussed earlier in this section, the AlphaServer 4100 system implements memory address decoding and memory control without using custom ASICs on the motherboard, synchronous memory, or no-external-cache processor modules. Using PALs allows the address decode function and the fan-out buffering to the large number of SDRAMs to be performed at the same time, thus reducing the component count and the access time to main memory. All the necessary glue logic between the Alpha 21164 CPU and the SDRAMs, including the central arbiter on the motherboard, was implemented using 5-ns 28-pin programmable PALs or 90-MHz 44-pin ispLSI 1016 in-circuit reprogrammable PLDs produced by Lattice Semiconductor. These devices can be reprogrammed directly on the module using the parallel port of a laptop personal computer. Each no-external-cache processor module uses five PALs and four PLDs; the motherboard (arbiter and memory control) uses eight PALs and three PLDs; and each synchronous memory module uses three PALs.

As shown in Table 1, the minimum memory read latency (read data access time) is eight cycles (120 ns) from the time a new command and address arrive at the pins of the Alpha 21164 chip to the time the first data arrives back at the pins. The SDRAMs are programmed for a burst of four data cycles, so data is returned in four consecutive 15-ns cycles. Two transactions at a time are interleaved on the memory interconnect (one to each of the two memory banks), which allows data to be continuously driven in every bus cycle. This results in the maximum memory read bandwidth of 1 GB/s.

Trade-offs Made to Reduce Complexity

The Alpha 21164 external interface contains many commands required exclusively to support an external cache. By not including a module-level cache on the no-external-cache processor module, only Read, Write, and Fetch commands are generated by the Alpha 21164 external interface; the Lock, MB, SetDirty, WriteBlockLock, BCacheVictim, and ReadMissModSTC commands are not used. This design allows the logic on the processor module that is asserting the Request signal to the central arbiter to be implemented simply in a small 28-pin PAL because only two of the Alpha 21164 CMD signals are required to encode a Read or a Write command. Similarly, allowing a maximum of two memory banks in the system, independent of the number of memory modules installed, enables the Request logic to the central arbiter to be implemented in the 28-pin PAL, since only one address bit (byte address <6>) is required to determine the memory bank.
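The decode described here is small enough to sketch directly; the bit encodings below are assumptions (the paper states only that two CMD bits and byte address <6> suffice), so treat this as an illustration of the PAL's scope, not its equations:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct request {
    bool     assert_request; /* Request signal to the central arbiter */
    bool     is_write;
    unsigned bank;           /* memory bank 0 or 1                    */
};

/* Illustrative decode: cmd2 holds the two relevant CMD bits. */
static struct request decode(unsigned cmd2, uint64_t byte_addr)
{
    struct request r;
    r.assert_request = (cmd2 != 0);        /* assumed: 0 = no access  */
    r.is_write       = (cmd2 & 0x2) != 0;  /* assumed write bit       */
    r.bank           = (byte_addr >> 6) & 1;
    return r;
}

int main(void)
{
    struct request r = decode(0x1, 0x1040);
    printf("request=%d write=%d bank=%u\n",
           r.assert_request, r.is_write, r.bank);
    return 0;
}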

Table 7 Dirty Read Data Timing

To decode memory addresses in 28-pin PALs, the AlphaServer 4100 system uses the concept of memory holes. The memory interconnect architecture and console code support seven different sizes of memory modules and up to four pairs of memory modules per system for a total system memory capacity of 32 MB to 8 GB. Any mix of memory module pairs is supported as long as the largest memory pair is placed in the lowest-numbered memory slot. The physical memory address range for each of the four memory slots is assigned as if all four memory module pairs are the same size. Consequently, if two additional memory pairs that are smaller than the pair in the lowest-numbered slot are installed in the upper memory slots, there will be a gap or "hole" in the physical memory space between the two smaller memory pairs (see Table 8). Rather than require each memory module to compare the full memory address to a base address and size register to determine if it should respond to the memory transaction, the 28-pin PAL driving Sel on the motherboard (see Figure 4) uses the seven address bits Addr<32:26> and the size of the memory module in the lowest-numbered slot to encode the memory slot number of the selected memory module pair. Console code detects any memory holes at power-up and tells the operating systems that these are unusable physical memory addresses.

Another simplification that the AlphaServer 4100 system uses is to remove I/O space registers from the data path of the processor and memory modules. Because there are no custom ASICs on these modules, reading and writing control registers would have required additional data path components. Since all the error checking is performed by either the 21164 CPU chip or the PCI bridge chips, and since there are no address decoding control registers required on the memory modules, there was no need for more than a few bits of control information to be accessed by software on the processor or memory modules. The I2C bus (a slow serial bus) already present in the I/O subsystem was used for transferring this small amount of information.

Furthermore, in the process of removing the I/O space data path from the motherboard and processor modules, the firmware (i.e., the console code, Alpha 21164 PAL code, and diagnostic software), which is often placed in read-only memories (ROMs) on the processor module or motherboard, was moved to the I/O subsystem. Only a small 8-KB, single-bit serial ROM (SROM) was placed on each processor module that would initialize the Alpha 21164 chip on power-up and instruct the Alpha 21164 to access the rest of the firmware code from the I/O subsystem.

Quick Design Time

To provide stable CPU and memory hardware for I/O subsystem hardware debug and operating system software debug, and thus allow the DIGITAL AlphaServer 4100 to be introduced on schedule in May 1996, the core module set was designed and powered on in less than six months. This primary goal of the AlphaServer 4100 project was achieved by keeping the design team small, by using only programmable logic and existing data path components, and by keeping the amount of documentation of design interfaces to a minimum.

The design team for the motherboard, no-external-cache processor module, and synchronous memory module consisted of one design engineer, one schematic/layout assistant, one signal integrity engineer, and two simulation engineers. The team also enlisted the help of members of the other AlphaServer 4100 design teams.

The architecture and actual final logic design of the core module set were developed at the same time. By using programmable logic and off-the-shelf address and data path components, the logic was written in ABEL code (a language used to describe the logic functions of programmable devices) and compiled immediately into the PALs and PLDs while the architecture was being specified. If the desired functionality did not fit into the programmable devices, the architecture was modified until the logic did fit. All three modules were designed by the same engineer at the same time, so there was no need for interface specifications to be written for each module. Furthermore, modifications and enhancements could be made in parallel to each design to optimize performance and reduce complexity across all three modules.

Table 8 Memory Hole Example Memory Slot 1 2-GB Module Pair 000000000-07FFFFFFF

I 1 Memory Slot 2 2-GB Module Pair 080000000-OFFFFFFFF

I Memory Slot 3 1-GB Module Pair 1 100000000- 13FFFFFFF :, M$*ij H-&', ::. 1 Memory Slot 4 1-GB Module Pair 1 180000000 - 1BFFFFFFF

Unused Memory I

VoI. 8 No. 4 1996 Rccai~sethe dcsign did not incorporate any custom Many multiprocessor servers share a common ASICs, the core s)rstcni \\(as po\vcrcd on as soon as the command/address bus by issuing a request to use the modulcs were built. Any last-n~inutclogic changes bus in one cycle, by either \\laiting for a grant to bc required to fix problems identified by simulation returned kom a central arbiter or performing local arbi- could be made directly to the reprogrammable logic tration in the next cycle, and by driving the command/ devices installed on the modulcs in the laboratory. In address on the bus in the cycle that follows. This particular, the reset and po\ver sequencing logic on the sequence occurs for all transactions, eve11 when the noth her board was not even sj~niilatedbefore power-on memory bus is not being used by other nodes. The and was dc\~clopecidirectly on actual I1 a1.d ware. Alphaserver 4100 nic1nor)I interconnect implements Since the 1/0 subsystcrn \\.as not available \vhen the bus parlung, ~~liicliallo\vs a modu1.e to turn on its core module set was first powered on, the soh\iarc that acidress dri\lers even though it is not currently using ran on the core llard\\/arc \\/as loaded from the serial the bus. If the Alpha 21 164 on that ~nodclleinitiates a port of a laptop personal coniputer and through the new transaction, the command/addrcss tlo\vs directly Alpha 21 164 serial port, and then ivrittcn directly into to memory in two less cycles than it \vould taltc to pcr- main memory. 13jagnostic programs that had been form a costly arbitration sequence. Transaction Rd3 in de\reloped for simulation were loaded into the mcmory Table 6 sho\\rs an example of the cffects of bus parking. of act~lalhard\vare and run to tcst a four- processor, fi~lly loaded memory configi~ration.This testing cnabled High Memory Bandwidth signal integrity fixes to be made on the Ilard\vare at h~ll speed before thc I/O subsystcn~was available. When One of the most important features of the SL>IUM the I/O subsystcln was powered on, the core module chip is that a single chip can provide or consumc data set was operating bug free at full speed, allo\ving the jn every cyclc for long burst lengths. The AlpIiaServer AlphaScr\,er 4100 to ship in volume six ~i~onthslater. 4100 operates the SDRAMs with a burst length offour As mentioned in the section Sinlplc Design, the cycles for both reads and writes. Each S1)RAM chip central arbiter logic 011 the motherboard was irnple- contains nvo banks dctermined by Addr<6>, \vhich mcnted in progranlmable logic. Consequentl!~, by selects consecuti\le memory blocks. If accesses are q~~icltlychanging to the rep-ogramn~ablclogic on the made to alternating banks, then a single SDRAM can motlicrboard instead of performing a I&ngthyredesign continuously drive rcad data in every cjlclc. Tlic arbi- of a CIIS~OI~ASIC, designers were able to avoid several tration of the AlphaServer 4100 Incnior!! intcrconncct logic desjgn bugs that \\(ere found later in the custom supports only n4~1mcnlory banlcs, so tlic smallest ASICs ofother AlphaScrvcr 4100 processor and men- 1ne1nory niodule, Ivhich consists of one set of ory modules. SDRAlMs, can provide the salnc 1-GR/s masimun~ rcad bandwidth as a fi~llypopulated mcmory configu- Low Memory Latency ration, i.c., a system configured with the minimum amount of memory can perform as well as a fully con- Minimizing the access time of data being rcturncd to figured system. 
the C1'U on a rcad transaction was a major design goal To incrcasc the single-processor memory bandwidth, for the core module set. The core module set dcsign \\!as the arbitration allows two simultaneoi~srcad trans- optimized to deliver the Addr and CS signals to the actions to he iss~~edfiom a single processor module. As SDRAMs in nvo cycles (30 ns) from the pins of long as the arbitration nlemory bank restrictions and the Alpha 21164 CPU and to return the data fro111 the arbitration himess restrictions are obeyed, it is possible S~)KAIMSto the Alpha 21 164 pins in another two cycles to issue back-to-back read transactions to ~~icmoqlfrom (30 ns). VVith the SDRAlMs operating at a two-cycle a single CPU with read data being returned to the Alpha internal ro\v acccss and a nvo-cycle internal column 21164 CPU in eight consecutive cycles instead of the access to tlie frst data (60 ns total internal SDRAM usual hur (see Tables 1 and 6). This dual-issue feati~re acccss time), the ~iiainmcmory latency is 120 ns. and the other lo~lmemory latency and high memory The lo\v latency \\,as acco~iiplishedin four \\laps: bandwidth features of the AlphaSer\ie~-4100 architec- ture enabled the AIpliaScr\~cr4100 system to mect the 1. By rcmo\~ingcusto~ii ASICs and error checl

A dead cycle is inserted between data bursts from different drivers so that one driver turns off before the next driver turns on. By keeping the lower-order address bits connected to all SDRAMs, i.e., by not interleaving additional banks of memory chips on low-order address bits, consecutive accesses to alternating memory banks, such as large direct memory access (DMA) sequences, can potentially achieve the full 1-GB/s read bandwidth of the data bus. With the dead cycle inserted, the read bandwidth of the memory interconnect is reduced by 20 percent.

The data bus connecting the processor, memory, and I/O modules was implemented as a traditional shared 3.3-volt tri-state bus with a single-phase synchronous clock at all modules. As a result, the bus becomes saturated as more processors are added and bus traffic increases. To keep the design time as short as possible, the AlphaServer 4100 designers chose not to explore the concept of a switched bus, on which more than one private transfer may occur at a time between multiple pairs of nodes. Clearly, the AlphaServer 4100 system has reached the practical upper limit of bus bandwidth using the traditional tri-state bus approach.

Reconfigurability

The AlphaServer 4100 hardware modules were designed to allow enhancements to be made in the future without having to redesign every element in the system.

Motherboard Options

The AlphaServer 4100 motherboard contains four dedicated processor slots, eight dedicated memory slots (four memory pairs), and one slot for an I/O module with two PCI bus bridges. Designed at the same time but not produced until after the AlphaServer 4100 motherboard was available, the AlphaServer 4000 motherboard contains only two processor slots, four memory slots (two memory pairs), and slots for two I/O modules allowing four PCI bus bridges. Since module hardware verification in the laboratory is a lengthy process, the AlphaServer 4000 motherboard was designed to use the same logic as the AlphaServer 4100 except for the programmable arbitration logic, which had a different algorithm because of the extra I/O module. When the signals on the AlphaServer 4000 motherboard were routed, all nets were kept shorter than the corresponding nets on the AlphaServer 4100 motherboard so that every signal did not need to be reexamined. Only those signals that were uniquely different were subject to the full signal integrity verification process.
Memory Options

The synchronous memory modules available for the AlphaServer 4100 are all based on the 16-Mb SDRAM. Using this size chip allowed designers to build synchronous memory modules that contain 9, 18, 36, and 72 SDRAMs and provide, respectively, 32 MB, 64 MB, 128 MB, and 256 MB of main memory per pair. The memory architecture supports synchronous memory modules that contain up to 1 GB of main memory per pair (up to 4 GB per system) by using the 64-Mb SDRAMs; however, when the AlphaServer 4100 system was introduced, the pricing and availability of the 64-Mb SDRAM did not allow these larger capacity synchronous memory modules to be built.

At the same time the synchronous memory modules were being designed, a family of plug-in compatible memory modules built with EDO DRAMs was designed and built. The memory architecture supports EDO memory modules containing up to 2 GB of main memory per pair (up to 8 GB per system) by using the 64-Mb EDO DRAM. When the AlphaServer 4100 system was introduced, the 64-Mb EDO DRAM was available, and EDO memory modules containing 72 or 144 EDO DRAMs were built, providing 1 GB and 2 GB of main memory per pair. To round out the range of memory capacities and to provide an alternative to the synchronous memory modules in case there was a cost or design problem with the new 16-Mb SDRAM chips, a family of EDO memory modules was also built using 16-Mb and 4-Mb EDO DRAMs, providing 64 MB, 256 MB, and 512 MB of main memory per pair.

Although EDO DRAMs can provide data at a higher bandwidth than standard DRAMs, a single EDO DRAM cannot return data in four consecutive 15-ns cycles like the single SDRAM used on the synchronous memory modules. Therefore, a custom ASIC was used on the EDO memory module to access 288 bits of data every 30 ns from the EDO DRAMs and multiplex the data onto the 144-bit memory interconnect every 15 ns. To imitate the two-bank feature of a single SDRAM, a second bank of EDO DRAMs is required. Consequently, the minimum number of memory chips per EDO memory module is 72 four-bit-wide EDO DRAM chips, whereas the minimum number of memory chips per synchronous memory module is only 18 four-bit-wide SDRAM chips or as few as 9 eight-bit-wide SDRAM chips.

When the AlphaServer 4100 system was introduced, the fastest EDO DRAM available that met the pricing requirements was the 60-ns version. When this chip is used on the EDO memory module, data cannot be returned to the motherboard as fast as data can be returned from the synchronous memory modules. To support the 60-ns EDO DRAMs, a one-cycle (15-ns) increase in the access time to main memory is required. Support for this extra cycle of latency was designed into the memory interconnect by placing a one-cycle gap between cycles 2 and 3 (see Table 1) of any read transaction accessing a 60-ns EDO memory module. Consequently, the read memory latency is one cycle longer, and the maximum read bandwidth is 20 percent less, when using EDO memory modules built with 60-ns EDO DRAMs.
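As a worked restatement of the 60-ns EDO penalty (the cycle counts come from the text):

\[ t_{read} = (8 + 1) \times 15\ \text{ns} = 135\ \text{ns} \quad\text{versus}\quad 8 \times 15\ \text{ns} = 120\ \text{ns}, \]
\[ \frac{4\ \text{data cycles}}{4 + 1\ \text{bus cycles}} = 0.8, \]

that is, one extra cycle per burst costs 20 percent of the read bandwidth.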

Note that it is possible to have a mixture of EDO memory modules and synchronous memory modules in the same system. In such a case, only the memory read transactions to the 60-ns EDO memory modules result in a loss of performance.

New versions of the EDO memory modules that contain 50-ns EDO DRAMs, providing up to 8 GB of total system memory, are scheduled to be introduced within a year after the introduction of the AlphaServer 4100. These modules will not require the additional cycle of latency, and as a result they will have performance identical to the synchronous memory modules.

Processor Options

The no-external-cache processor module was designed to support either a 300-MHz Alpha 21164 CPU chip with a 60-MHz (16.6-ns) synchronous memory interconnect or a 400-MHz Alpha 21164 CPU chip with a 66-MHz (15-ns) synchronous memory interconnect. As previously mentioned, the Alpha 21164 itself contains a primary 8-KB data cache, a primary 8-KB instruction cache, and a second-level 96-KB three-way set-associative data and instruction cache. The no-external-cache processor module contains no third-level cache, but by keeping the latency to main memory low and by issuing multiple references from the same Alpha 21164 to main memory at the same time to increase memory bandwidth, the performance of many applications is better than that of a processor module containing a third-level external cache. Applications that are small enough to fit in a large third-level cache perform better with an external cache, however, so the AlphaServer 4100 offers several variants of plug-in compatible processor modules containing a 2-MB, 4-MB, or greater module-level cache.

The architecture of the cached processor module was developed in parallel with the core module set. In addition, cached processor modules are being designed to support Alpha 21164 CPU chips that run faster than 400 MHz while still maintaining the maximum 66-MHz synchronous memory interconnect. These faster CPU chips operate at 2 volts rather than 3.3 volts. The AlphaServer 4100 motherboard does not provide 2 volts of power to the processor module connectors; consequently, a 3.3-to-2-volt converter card is used on the higher-speed processor modules to provide this unique voltage. Each new version of the processor module is plug-in compatible, and systems can be upgraded without changing the motherboard. This is true even if the frequency of the synchronous memory interconnect changes, although all processor modules in the system must be configured to operate at the same speed. The oscillators for both the high-speed internal CPU clock and the memory interconnect bus clock are located on the processor modules to allow processor upgrades to be made without modifying the motherboard.

Summary

The high-performance DIGITAL AlphaServer 4100 SMP server, which supports up to four Alpha 21164 CPUs, was designed simply and quickly using off-the-shelf components and programmable logic. When the AlphaServer 4100 system was introduced in May 1996, the memory interconnect design enabled the server to achieve a minimum memory latency of 120 nanoseconds and a maximum memory bandwidth of 1 gigabyte per second. This industry-leading performance was achieved by using off-the-shelf data path and address components and programmable logic between the CPU and the SDRAM-based main memory. The motherboard, the synchronous memory module, and the no-external-cache processor module were developed concurrently to optimize the performance of the memory architecture. These core modules were operating successfully within six months of the start of the design. The AlphaServer 4100 hardware modules were designed to allow future enhancements without redesigning the system.
Acknowledgments

Bruce Alford from Revenue Systems Engineering assisted with the schematic entry, module layout, manufacturing issues, and power-up logic design, and succeeded in smoothly transitioning the core module set to his long-term engineering support organization. Roger Dame handled signal integrity and timing analysis.

References and Note

1. "AlphaServer 4100 Performance Characterization," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 3-20.

2. S. Duncan, C. Keefer, and T. McLaughlin, "High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 61-75.

3. The AlphaServer 4000 system contains the same CPU-to-memory interface as the AlphaServer 4100 system but supports half the number of processors and memory modules and twice the number of PCI bridges. The AlphaServer 4000 motherboard was designed at the same time as the AlphaServer 4100 motherboard but was not produced until after the AlphaServer 4100 motherboard was available.

4. M. Steinman et al., "The AlphaServer 4100 Cached Processor Module Architecture and Design," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 21-37.

5. R. Dame, "The AlphaServer 4100 Low-cost Clock Distribution System," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 38-47.

6. Alpha 21164 Microprocessor Hardware Reference Manual (Maynard, Mass.: Digital Equipment Corporation, September 1994).

7. The Fetch command is not implemented on the AlphaServer 4100 system, but there is no mechanism to keep it from appearing on the CMD pins of the Alpha 21164 CPU chip. The Fetch command is simply terminated without any additional action.

Biography

Glenn A. Herdeg
Glenn Herdeg has been working on the design of computer modules since joining Digital in 1983. A principal hardware engineer in the AlphaServer Platform Development group, he was the project leader, architect, logic designer, and module designer for the AlphaServer 4100 motherboard, no-external-cache processor modules, and synchronous memory modules. He also led the design of the AlphaServer 4000 motherboard. In earlier work, Glenn served as the principal ASIC and module designer for several DEC 7000, VAX 7000, and VAX 6000 projects. He holds a B.A. in physics from Colby College and an M.S. in computer systems from Rensselaer Polytechnic Institute and has two patents.

Samuel H. Duncan
Craig D. Keefer
Thomas A. McLaughlin

High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System

The DIGITAL AlphaServer 4100 symmetric multiprocessing system is based on the Alpha 64-bit RISC microprocessor and is designed for fast CPU performance, low memory latency, and high memory and I/O bandwidth. The server's I/O subsystem contributes to the achievement of these goals by implementing several innovative design techniques, primarily in the system bus-to-PCI bus bridge. A partial cache line write technique for small transactions reduces traffic on the system bus and improves memory latency. A design for deadlock-free peer-to-peer transactions across multiple 64-bit PCI bus bridges reduces system bus, PCI bus, and CPU utilization by as much as 70 percent when measured in DIGITAL MEMORY CHANNEL clusters. Prefetch logic and buffering supports very large bursts of data without stalls, yielding a system that can amortize overhead and deliver performance limited only by the PCI devices used in the system.

The AlphaServer 4100 is a symmetric multiprocessing system based on the Alpha 21164 64-bit RISC microprocessor. This midrange system supports one to four CPUs, one to four 64-bit-wide peer bridges to the peripheral component interconnect (PCI), and one to four logical memory slots. The goals for the AlphaServer 4100 system were fast CPU performance, low memory latency, and high memory and I/O bandwidth. One measure of success in achieving these goals is the AIM benchmark multiprocessor performance results. The AlphaServer 4100 system was audited at 3,337 peak jobs per minute, with a sustained number of 3,018 user loads, and won the AIM Hot Iron price/performance award in October 1996.¹

The subject of this paper is the contribution of the I/O subsystem to these high-performance goals. In an in-house test, I/O performance of an AlphaServer 4100 system based on a 300-megahertz (MHz) processor shows a 10 to 19 percent improvement in I/O when compared with a previous-generation midrange Alpha system based on a 350-MHz processor. Reduction in CPU utilization is particularly beneficial for applications that use small transfers, e.g., transaction processing.

I/O Subsystem Goals

The goal for the AlphaServer 4100 I/O subsystem was to increase overall system performance by

- Reducing CPU and system bus utilization for all applications
- Delivering full I/O bandwidth, specifically, a bandwidth limited only by the PCI standard protocol, which is 266 megabytes per second (MB/s) on 64-bit option cards and 133 MB/s on 32-bit option cards
- Minimizing latency for all direct memory access (DMA) and programmed I/O (PIO) transactions

Our discussion focuses on several innovative techniques used in the design of the I/O subsystem's 64-bit-wide peer host bus bridges that dramatically reduce CPU and bus utilization and deliver full PCI bandwidth:

- A partial cache line write technique for coherent DMA writes. This technique permits an I/O device to insert data that is smaller than a cache line, or block, into the cache-coherent domain without first obtaining ownership of the cache block and performing a read-modify-write operation. Partial cache line writes reduce traffic on the system bus and improve latency, particularly for messages passed in a MEMORY CHANNEL cluster.
- Support for device-initiated transactions that target other devices (peers) across multiple (peer) PCI buses. Peer-to-peer transactions reduce system bus utilization, PCI bus utilization, and CPU utilization by as much as 70 percent when measured in MEMORY CHANNEL clusters.
- Prefetch logic and buffering that support very large bursts of data without stalls.

AlphaServer 4100 System Overview

Each PCI bus bridge is implemented in application-specific integrated circuit (ASIC) chips: one control chip and two sliced data path chips. The two independent PCI bus bridges are the interfaces between the system bus and their respective PCI buses. A PCI bus is 64 or 32 bits wide, transferring data at a peak of 266 MB/s or 133 MB/s, respectively. In the AlphaServer 4100 system, the PCI buses are 64 bits wide.

The PCI buses connect to a PCI backplane module with a number of expansion slots and a bridge to the Extended Industry Standard Architecture (EISA) bus. In Figure 1, each PCI bus is shown to support up to four devices in option slots. The AlphaServer 4000 series also supports a configuration in which two of the CPU cards are replaced by I/O modules, allowing four PCI bus bridges (see Figure 2).
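The quoted peak rates follow from the bus widths and the standard 33.3-MHz PCI clock:

\[ 8\ \text{bytes} \times 33.3\ \text{MHz} \approx 266\ \text{MB/s}, \qquad 4\ \text{bytes} \times 33.3\ \text{MHz} \approx 133\ \text{MB/s} \]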

Figure 1
AlphaServer 4100 System with Four CPUs, Two 64-bit Buses (the system bus carries command/address and data with ECC among the CPUs, memory, and the PCI bridge module; each 64-bit PCI connects to a PCI backplane module with one dedicated PCI slot, three shared PCI/EISA slots, and an EISA bridge)

Figure 2
AlphaServer 4000 System with Two CPUs, Four 64-bit Buses (two PCI bridge modules provide 64-bit PCI buses; one backplane offers a dedicated PCI slot, three shared PCI/EISA slots, and an EISA bridge, and a second provides four PCI expansion slots)

Partial Cache Line Writes

A DMA write that is smaller than a cache line must be merged into the cache-coherent view of memory maintained for all CPUs on the system bus. This merging of write data into the cache-coherent domain is typically done on the PCI bus bridge, which reads the cache line, merges the new bytes, and writes the cache line back to memory.

Movement of blocks of less than 64 bytes is important to application performance because there are high-performance devices that move less than 64 bytes. One example is DIGITAL's MEMORY CHANNEL adapter: a DMA sequence that performed a single 16-byte-aligned memory write would have consumed system bus bandwidth that could have moved 256 bytes of data, or 16 times the amount of data. We therefore had to find a more efficient approach to writing subblocks into the cache-coherent domain.

We first examined opportunities for efficiency gains in the memory system. The AlphaServer 4100 memory system interface is 16 bytes wide; a 64-byte cache line read or write takes four cycles on the system bus. The memory modules themselves can be designed to mask one or more of the writes and allow aligned blocks that are multiples of 16 bytes to be written to memory in a single system bus transaction. The problem with permitting a less than complete cache line write, i.e., less than 64 bytes, is that the write goes to main memory, but the only up-to-date, complete copy of the cache line may be in a CPU card's cache.

To permit the more efficient partial cache line write operations, we modified the system bus cache-coherency protocol. When a PCI bus bridge issues a partial cache line write on the system bus, each CPU card performs a cache lookup to see if the target of the write is dirty. In the event that the target cache block is dirty, the CPU signals the PCI bus bridge before the end of the partial write. On dirty partial cache line write transactions, the bridge simply performs a second transaction as a read-modify-write. If the target cache block is not dirty, the operation completes in a single system bus transaction.
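The protocol decision can be summarized in a few lines of C; the snoop predicate and transaction counts are illustrative stand-ins, not the bridge's actual logic:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical snoop result: true if some CPU holds the block dirty. */
static bool snoop_dirty(unsigned long block) { return block == 0x40; }

/* Returns the number of system bus transactions the bridge performs. */
static int partial_cache_line_write(unsigned long block)
{
    if (!snoop_dirty(block))
        return 1;   /* completes in a single system bus transaction  */
    /* Dirty case (rare in practice, per the paper's traces):
     * read the line, merge the subblock, write the line back.       */
    return 2;       /* read-modify-write: two transactions           */
}

int main(void)
{
    printf("clean block: %d transaction(s)\n", partial_cache_line_write(0x80));
    printf("dirty block: %d transaction(s)\n", partial_cache_line_write(0x40));
    return 0;
}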
cnclic linc write trans~ctions,the bricigc simply pcr- forms a second transaction as a rend-modi~-write.If Peer-to-Peer Transaction Support the targct caclic block is not dirty, the operation com- plctcs in a si~iglcsystem bus transaction. System bus anti 1'<;1 bus utilization can be optimizccl Address traces taken during product dc\clopment for certain applications by limiting tlie nunibcr of ti~iics \verc simulated to determine the freq~~cncyof dirty the sanic block of data moves through the system. cache blocks tliat are targets of DMA writcs. Our sim- As noted in the section AlphaScrvcr 4100 Systcm ulations showed that, for the address trace we used, Overview, the I'CI si~bs!~stc~iican contain t\\lo or bur frccli~c~~cy\\pas extremely rare. Mcnsurcmcnt taken independent PC1 bus bridges. Our design al1ou.s cxtcr- from several applications and bcnclimnrlts confirmcci nal de\riccs connected to tlicsc scparatc pccr I'CI bus that a dirty cnclie block is alniost never asserted \\/it11 bridges to slinrc darn \\~itlio~~t,icccssing main mcmor!r a piirtial caclic linc \\!rite. and by sing a nii~ii~nalaniollnt ofhost bus L>ancl\\riilrli. 'Tlie l>MA transfcr of blocks that arc aligned In other \\,ords, external dc\,iccs can effect direct access m~~ltiplcsof 16 bytes but less than a c;lclic line is four to clata on 3 peer-to-peer basis. timcs more efficient in tlie 4100 system tlia~iin earlier l>IGITAL. implcmcntations. 111 conventional syste~iis,a data file on a disk that is a Ci'U daughter card. Each PC1 bridge to the system reqi~estedby a clic~ltnode is transferred by DMA from bus has a translation look-aside buffer (TLB) that con- the disk, across the PC1 and the system bus, and into verts PC1 addresses into system bus addresses. The use main memory. Once the data is in ~iiainmemory, a net- of a TLB permits hardware to make all of physical u~orkdevice can rcad the data directly jn memory and memory visible through a relatively small region of send it across the network to the client node. 111a 4100 addrcss space that \\re call a DMA \vindow. system, device peer-to-peer transaction circumvents A DMA windo\\/ can be specified as "direct the transfer to main memory. Ho\vever, peer-to-pwr mapped" or "scatter-gather mapped." A direct- transaction rcquires that the target dc\~iceIiave certain riiapped Db1A wi~ido\iladds an offset to the PC1 properties. The csscntial property is that the device tar- addrcss and passes it on to the system bus. A scatter- get appear to the source device as if it is main memory. gather mapped DMA window ilses the TL,B to look up The balance of this section explains how conven- the system bus address. tional 1)MA reads and \\?rites are performed on the Figure 3 is an example of how PC1 memory address AlpliaServer 4100 system, how the infrastructure for space might be allocated for DMA windo\vs and for conventional DMA can be used for pcer-to-pccr trans- PC1 device control status registers (CSRs) and memory. actions, and ho\v deadlock avoidance is accomplished. A PC1 device initiates a DMA writc by driving an address on the bus. In Figure 4, data from PC1 dcviccs Conventional DMA 0 and 1 are sent to the scatter-gather 13MA windows; We extended the features of con\cntional DMA on the data from 1'CI dcvicc 2 are sent to tlie direct-mapped NphaScrvcr 4100 system to support peer-to-pccr DAM n~indo\v.When an address hits in one of the transaction. Conventional DA4A in the 4100 systc111 DMA windows, the PC1 bus bridge acluiowledges works as follows. 
the addrcss and immediately begins to accept write Address space 011the Alpha processor is 2:" or 1 tera- data. While consu~ni~lgwrite data in a buffer, the PC1 byte; the AIphaServer 4100 system supports up to bus bridge translates the 1'CI address into a system 8 gigabytes (GB) of main memory. To directly address address. The bridge then arbitrates for the system bus a11 of memory without using memory management and, using the translated address, completes the write liard\varc, an addrcss must be 33 bits. (Eight GB is transaction. The write transaction completes on the ecli~ivalentto 2.'.' bytes.) 1'CI before it completes on the system bus. Becai~sethe amount of memory is large compared to A DIMA rcad transaction has a longer latency than address space available on the PCI, some sort of inem- a 1)MA write because the PC1 bus bridge niust first or)! management hardware and sohvare is needed to translate the PC1 address into a syste~iibus address and mal

SYSTEM ADDRESS SPACE PC1 MEMORY ADDRESS SPACE (240 BYTES) (232 BYTES)

PC1 DEVICE CSRs SCATTER-GATHER WINDOW 0 112 ME PC1 DEVICE CSRs 384 MB (UNUSED) a512 MB SCATTER-GATHER WINDOW 1 PC1 DEVICE PREFETCHABLE MEMORY SPACE

DIRECT-MAPPEDWINDOW 2

Figure 3 Esamplc oEl'CI Memory Addrcss Space Mnppcd to DMA Windows

Vol. 8 No. 4 1996 65 I PC DEVICE 2 1 y-1 y-1 !--I- J+I WINDOW I 7PC1 DEVICE 1 PC1 DEVICE 0 I I SYSTEMADDRESS SCATER-GATHER PC1 MEMORY I I------lSPACE DMA WINDOWS ADDRESS SPACE

Figure 4 Esarnple of PC1 De\.ice Reads or Writcs to DMA Windo\\,s and Add1.c~~Trar~slation to System Rus Addrcsscs

SYSTEM BUS

I PC1 BUS I BRIDGE I

DMA READ PREFETCH

4DATA

POSTED PI0 WRITES BYPASS INTERRUPTS PENDED PI0 OR READ READS f t t

Figure 5 Diagram of Data Paths in a Singlc PC1 1311s Bsidgc

through the outgoitig queuc (OQ)en route to the sys- Followi~lgis an esatnplc of how a con\~entional tern bus. DMA read data is passed through an incom- "bo~~nce"DMA operation is itscd to move a file from a ing queue (IQ) bypass by \vay ofa DMA fill data buffer local storage device to a ncnvork device. The cxa~nple en route to the PCI. illustrates ho\v data is \\,rittcn inro memory by one Note that the IQ orders Cl'U-initiated PI0 transac- cic\,icc \\'here it is temporarily storcd. Later thc data is tions. IQ bypass is necessary for correct, dead- read by another DhllA cit.\~icc.This operation is called lock-frec operation ofpecr-to-peer transactions, \\lhich a "bounce I/O" because thc data "bou~iccs" off are esplaincd in the next section.

66 1)igitnl Tcclinicill Journal Vol. 8 No. 4 1996 memory and out a network port, a common operation the PC1 master: The device driver provides the master for a network file server application. device with a target address, size of tlie transfer, and Assume PC1 device A is a storage controller and PC1 identificatio~iofdata to be moved. 111the case in which device R is a network device: a data file is to be read from a disk, the device dri\ier sohvare gives the PC1 device that controls the disl< a 1. Tlie storagc controller, PC1 device A, writes the file "handle," \vhich is an identifier for tlie data file and the into a buffer on the PC1 bus bridge using an PC1 target address to \vhich the file should be written. address that hits a DhL4 windo\\. To reiterate, in a con\~entionalDIVA transaction, thc 2. The PC1 bridge translates the PC1 rncmory address target address is in one of tlie PC1 bus bridge DMA into a system bus address and \\{rites the data into \\~indo\\~s.-The DMI4 window logic translates the nlcnlor)l. address into a main memory address on the system bus. 3. 7'11~ CPU passes the network device a PC1 niemory In a peer-to-peer transaction, tlie target address is space address that corresponds to the system bus translated to a2 address assigned to another PC1 device. address of the data in memory. Any PC1 device capable of DMA can perfor111peer- 4. Thc ncnvorl< controller, PC1 device R, reads the file to-peer transactions on tlie AIphaSer\rer 4100 s)atem. in main memory using a DIMwindow and sends For example, in Figure 6, PC1 dcvice A can transfer the data across the network. data to or from PC1 device R ~'ithoutusing any resources or facilities in the system bus bridge. Tlie use If both controllers are on the same PC1 bus segment of a peer-to-pcer transaction is controlled entirely by and if the storage controller (PC1 device A) could sofm~are:The device driver passes a target address to write directly to the network controller (PC1 device PC1 device A, and device A uses the address as the B), 110 traffic '~\iouldbe introduced on the system bus. DMA data source or desti~lation. Traffic on the system bus is reduced by saving one If the target of the transaction is PC1 device C, then DMA write, possibly one copy operation, and one system services sofis\lare allocates a region in a scatter- DMA read. 011tlie PC1 bus, traffic is also rcduced gather map and specifies a translation that maps the bccausc there is one transaction rather than two. scatter-gather-~iiappedaddress on 1'CI bus 0 to a sys- When the target of a transaction is a de\iice other than tem bus address that maps to PC1 device C. This main memory, the transaction is called a peer-to-peer. address translation is placed in tlic scatter-gather map. Peer-to-peer transactions on a single-bus system arc Wlien PC1 device A initiates a transaction, the address simple, borderi~igon trivial; but deadlock-free support matches one of the DMA windows that has been ini- on a system with multiple peel- 1'CI b~~sesis quite a bit tialized for scatter-gather. The PC1 bus bridge accepts morc difficult. the transaction, lool

Digital Technical Journal Vol. 8 No. 4 1996 67 .) .) .) J 4 COMMANDIADDRESS SYSTEM BUS DATA AND ECC V t 1 V 7 T 4 4 r------I BRIDGE 0 BRIDGE 1 I .) DMA READ PREFETCH ADDRESS

I I DATA DATA I 6 & I I -7POSTEDPIO POSTED PI0 I WRITES BYPASS I WRITES BYPASS I PI0 FlLL INTERRUPTS PENDED PI0 PI0 FILL INTERRUPTS PENDED PI0 OR READ OR READ I READS READS I I * * 4 I * -74 * I

PC1 DEVICE E oV o PC1 DEVICE F PC1 DEVICE G oU - PC1 DEVICE H

Figure 6 Alpli;lScrvc1.4100System Diagram Sl~o\\ii~~gDatn Patlis through I'(:I Ilus 131.idgcs

A dcadlock situation is allalogous to highway gridlock This section assumes that the reader is f~milinrwith in which two lines of automobiles face each other on the l'(:I protocol and ordcritlg rules.-' a single-lanc road; there is no rootn to pass and no Figure 6 sliou,s the data paths through two P(:I to back LIP. liulcs for dcadlock avoidance arc analo- bus hricigcs. Transactions pass through thcsc briiigcs gous to the rules for directing vehicle traffic on n rial-- as hllo\\rs: row bridge. (:PU sofi\\,arc-initiated PI0 reds and 1'10 \\)rites An cs317iplc of peer-to-peer deadlock is one in arc cntrics in the IQ. which two 1'CI cic\~icesare dependent ou the coniplc- tion of a \\:rite as masters before they \\rill accept \\!rites l)c\,icc pccr-to-peer transactions targeting dc\iccs as targets. When tliesc two devices target one another, on pcu 1'CI segments also LISC the IQ. tlic result is dcadlock; each device responds \\?it11 P(:I-de\.icc-initiated reads and \\,rites (1')lMA or luTKY to cvcry \\

Digital Tecli~iicalJournal COMMANDIADDRESS SYSTEM BUS DATA AND ECC

I PEER WRITE PEER WRITE 1 PEER WRITE PEER WRITE I PEER WRITE PEER WRITE PEER WRITE PEER WRITE PEER WRITE PEER WRITE I PI0 READ REQUEST I PI0 READ REQUEST I

PC1 DEVICE A PC1 DEVICE 8 PC1 DEVICE C PC1 DEVICE D MASTER OF - TARGET OF PEER WRITES PEER WRITE PEER WRITES PEER WRITE

I I PC1 DEVICE F PC1 DEVICE H PC1 DEVICE E PC1 DEVICE G

PI0 READ READREQUEST PI0 READ READ

- - - -- Figure 7 Block 1)iagrarn Shu\ving Deadlock Case \vithout IQ Bypass Path

Required Characteristics for Deadlock-free Peer-to-Peer I10 Bandwidth and Efficiency Target Devices PC1 devices must follo\\~all PC1 standard ordering With o\.erall system performance as our goal, we r~~lesfor deadlock-free peer-to-peer transaction. The selected nvo design approaches to deli\,cr fill1 PC1 specific rule relevant to the Alphaserver 4100 design batid\vidth ~\,ithoutbus stalls. These \\;ere support for For peer-to-peer transaction support is Delayed large bursts of PCI-device-initiated DMA, and suffi- Trans'ictio~iRule 6, \vhich guarantees that the IQ \\,ill cient buffering and prefetching logic to keep up \vith al\\,ays cmpn.: tl~cPC1 and a\.oid introducing stalls. We open this sec- tion with a revie\\. of the bandwidth and latency issues A target rn~~staccept all memory \\.rites \\re exanlined in our efforts to achieve greater band- addressed to it while completing a request using \vidth efficiency. Delayed Transaction terrninatio~i.~ Thc band\~idthavailable on a platfor111 is dependent Our design includes a linlc ~nechanisniusing scatter- on the efficiency of the design and on the type of gat her TLBs to create a logical connection benveen two transactions performed. Band\vidth is measured in PC1 devices. It includes a set ofrules for bypassing data millions of bytes per second (MR/s). On a 32-bit that ensures deudlock-free operation when all partici- l'C1, the available bandwidth is efficiency ~n~~ltiplicd pants in a peer-to-peer transaction follow the ordering by 133 MB/s; on a 64-bit PCI, available bandwidth is rulcs in the PC1 standard. The link mechanism provides efficiency multiplied by 266 MB/s. By efficiency, nJe a logical path for peer-to-peer transactions and the mean the amount of time spent actually transferring bypassing r~~lesguamntee the IQ \\ill al\\.~!rs drain. data as compared \\,it11 total transaction time. The key fcature, then, is a bmarantee that the IQ \\;111 Both parties in a transaction contribute to efficiency al\\.iiys dl-ain, thus ensuring deadlock-free operation. on the bus. The Alphaserver 4100 1/0 design kecps the o\,erhead introduced by the system to a minimum and s~~pportslarge burst sizes o\,er \vhic.Ii thc pcr- tr~nsactionoverhead can be amortized. Support for Large Burst Sizes To predict the etliciency of a given design, one must break a transaction into its constituent parts. For exam- ple, when an 1/0 device initiates a transaction it must Arbitrate for the bus PERCENT 80'o 1 AVAILABLE 70% Connect to the bus (by driving the address of the transaction target) SPENT MOVING 40% DATA Transfer data (one or more bytes move in one or (EFFICIENCY) 30% more bus cycles) 20% DATA 10% Disconnect from the bus 0% IN A Time actually spent in an 1/0 transaction is the OVERHEAD CYCLES sum of arbitration, connection, data transfer, and (LATENCY PLUS STALLS) disconnection. The period of time before any data is transferred KEY: is typically called latency. With small burst sizes, band- 90% - 10O0/o 40% - 50% 80% - 90% .30% - 40'' width is limited regardless of latency. Latency of 70%-80% .20% - 30% arbitration, connection, and disconnection is fairly 60% - 70% .10% 20% .50%-60% .0%-10% constant, but the amount of data moved per unit of . time can increase by making the 1/0 bus wider. The Alphaserver 4100 PC1 buses are 64 bits wide, yielding Figure 8 PC1 Efficiency as a Function of Burst Sizc 2nd Latency (etlicienql X 266 MB/s) of available bandwidth. 
As shown in Figure 8, efficiency improves as burst size increases and overhead (i.e., latency plus stall cycles. Peer-to-peer reads of devices on different bus time) decreases. Overhead introduced by the segments are always converted to delayed-read trans- Alphaserver 4100 is fairly constant. As discussed ear- actions because the best-case initial latency will be lier, a DIMA write can complete on the PC1 before it longer than 32 PC1 cycles. completes on the system bus. As a consequence, we PC1 initial latency for DM reads on the were able to keep overhead introduced by the plat- Alphaserver 4100 system is commensurate with form to a minimum for DMA writes. Recognizing that expectations for current generation quad-processor efficiency improves with burst size, we used a queuing SMP systems. To maximize efficiency, we designed model of the system to predict how many posted write prefetching logic to stream data to a 64-bit PC1 device buffers were needed to sustain DAM \\/rite bursts with- ~vitl~outstalls after the initial-latency pe~ialtyhas been o~~tstalling the PC1 bus. Bascd on a sirn~~lationmodel paid. To make sure the design could keep up with an of the configurations shown in Figures 1 and 2, we uninterrupted 64-bit DMA read, we used the qi~euing determined that three 64-byte buffers werc sufficient model and analysis of the slatem bus protocol and to stream DMA writes from the (266 MB/s) 1'CI bus decided that three cache-line-size prefetch buffers to the (1GB/s) system bus. would be sufficient. The algorithm for prefetching Later in this paper, \ve present measured perfor- uses the advanced PC1 commands as hints to deter- mance of DMA write bandwidth that matches the sim- mine how far memory data prefetching should stay ulation model results and, with large burst sizes, ahead of the PC1 bus: actually exceeds 95 percent efficiency. Memory Read (MR): Fetch a single 64-byte cache line. Prefetch Logic DMA writes complete on the PC1 before they com- Memory Read Line (MRL): Fetch two 64-byte plete on the system bus, but DMA reads must wait for cache lines. data fetched from memory or from a peer on another Memory Read Multiple (MRM): Fetch nvo PCI. As such, latency for DMA reads is al\vays worse 64-byte cache lines, and then fetch one line at than it is for writes. PC? Local B~isSpec@ccition a time to keep the pipeline full. Keuisioi7 2.1provides a delayed-transaction mechanism Mer the PC1 bus bridge responds to an MRM corn- for devices with latencies that exceed the PC1 initial- mand by fetching two 64-byte cache lines and the scc- latency requirement.' The initial-latency requirement ond line 1s returned, the bridge posts another read; as on host bus bridges is 32 PC1 cycles, ~\/hicliis the max- the oldest buffer is unloaded, new reads arc posted, imum overhead that may be introduced before the keeping one buffer ahead of the PCI. The third first data cycle. The Alphaserver 4100 initial latency prefetch buffer is reserved for the case in which a DMA for memory DMA reads is between 18 and 20 PC1

DigitJ Technical Journal Vol. 8 No. 4 1996 71 MRM completes while there arc still prefctcli rcads outstanding. Reservation of this buffer accomplishes two things: (1)it eliminates a time-delay bi~bblethat would appear between consec~~tiveDNA read trans- actions, and (2) it maintains a resource to fetch a scatter-gather translation in the event that tlie next transaction address is not in the TLR. Measured DMA bandwidth is presented later in this paper. The point at which the design stops prefetching is on page boundaries. As the DlMA window scatter-gather map is partitioned into 8-IU3 pages, the jnterface is designed to disconnect on 8-IU3-aligned addresscs. BURST SIZE (BYTES) The advantage of prefetching reads and absorbing KEY: posted \\?rites on this system is that the burst size can IDEAL PC1 be as large as 8 KR. With large burst size, the overliead MEMORY WRITE (MEASURED) of connecting and disconnecting from the bus is amortized and approaches a negligible penalty. Figure 9 Conlparison of l\/lensured DIMA Write Performance on an Idcal 64-bit PC1 3rd on 3n Alph,~Ser\~er4100 System DMA and PI0 Performance Results

We haire discussed tlie relationship bct\veen burst size, buffers. Si~ni~lntionpredicted that this ~iu~nberof initial latency, and bandwidth and described several buffcrs \vo~~ldbe sufficient to sustain hll band\vidth techniques we used in tlie AlpliaScrver 4100 PC1 bus DIVA writ~s-e\~en \vhen the systern bus is extremely bridge design tcj meet the goals for high-band\vidth busy-because tlie bridges to the PC1 are on a shared I/O. This section presents tlie performance delivered system bus that has roughly 1 GB/s available band- by the 4100 1/0 subsystem design, which has been width. The PC1 bus bridges arbitrate for the shared measured using a high-performance PC1 transaction systenl bus at a priority liiglier than the CPUs, but the generator. bridges arc permitted to execute only a single transac- Wc collected performance data under thc UNIX tion each tinlc t1ie)l \\!in the systcnl bus. Therefore, in operating system \\it11 a reconfigurable interfilce card tlie nrorst case, a PC1 bus bridge \vill wait behind three developed at DIGITAL, called PC1 Pamette. It is a other PC1 bus bridges for a slot on the bus, and each 64-bit PC1 option \vith a Xllinx FPCA interface to bridge \\,ill have at least one quarter of the available PCI. The board nras configured as a programmable system bus band\vidth. With 250 MB/s available but PC1 transaction generator. In this configuration, the with potential delay in accessing the bus, three posted board can generate burst lengths of 1 to 512 c)rclcs. \\{ritebuffers are sufficient to niaintain f~~llPC1 band- DMA either runs to a fixed count of \vords transferred \vidth for memory \\?rites. or runs continuous.ly (software selected). 'I'lic DMA The ideal PC1 system is represented by calculates engine runs at a fised cadence (delay benvecn bursts) performance data for comparison plIlQOSCS. It is a sys- of 0 to 15cycles in tlie case of a fixed count and at 0 to tem that has three cycles of target latency for \vrites. 63 cycles when run continuous1)r. Three cycles is the best possible for a medium decode The source of the DMA is a combination of a fire- device. The goal for DkM writes was to deliver perfor- running counter that is clocked using the PC1 clock lnance limited only by the PC1 device itself, and this and a PC1 transaction count. The free-running counter goal \\,as achic\led. Figure 9 demonstrates that mea- time-stamps successi\lc \vords and dctccts wq'~t states SLI~C~DMA write performance on the AlpliaSer\ler and delays benveen transactions. 'The transaction count 4100 system approaches theoretical masimunis. The identifies retries as \dlas transaction boundaries. colnbination of optimizations and innovations used As the target of PI0 read or write, the board can on this platform yielded an implementation that meets accept arbitrarily large bursts ofeither 32 or 64 bits. It the goal for DMA writes. is a medium decode device and alwaja operates with zero wait states. DMA Read Efficiency and Performance As noted in the section Prefctcll Logic, band\vidth DMA Write Efficiency and Performance performance of DMA rcads nlill be lo\ves than the pcr- Figure 9 sho\vs the close cornpanson bctwccn the formancc of f>MA \vritcs on all systelns because thcre AlphaServer 4100 system and a ncarly pcrfcct I'CI is delay in fetching the read data from niemory. For design In me'~sured DMA ~~1-1tebancl\vidth. As this reason, we incl~~dedthree cache-linc-size prefetch explained above, to sustain large bursts of DlMA buffers in the design. 
\vrites, wve implemented thrcc 64-byte posted \\trite

72 Digital Tcchnir.il Journal \'ol. 8 No. 4 1996 Figure 10 compares DA/M read bandwidth mea- system with a single CPU, and the results are pre- sured 011 the AlphaServer 4100 system with a PC1 sys- sented in Figure 11. The pended protocol for flow tem that has 8 cycles of initial latency in deli\lering control on the system bus limits the number of read DMA read data. This figure shows that delivered transactions that call be outstanding from a single bandwidth improves on the AlphaServer 4100 system (:PU. A single CPU issuing reads \vill stall waiting for as burst size increases, and that the effect of initial read-return data and cannot issue enough reads to latency on measured performance is diminished with approach the bandwidth limit of the bridge, LMeasured larger DMA bursts. read performance is quite a bit lower than the tlieoret- Tlie ideal PC1 system used calculated performalice ical limit. A system with multiple CPUs doing PI0 data for co~iiparison,assuming a read target latency of reads-or peer-to-peer reads-\\rill deliver PI0 read 8 cycles; 2 cycles are for medium decode of the bandwidth that approaches the predicted performance address, and 6 cycles are for memory latency of 180 of the PC1 bus bridge. PI0 writes are posted and the nanoseconds (ns). This represents about the best per- CPU stalls only when the writes reach the IQ thresh- formance that can be achieved today. old. Figure 11 shows that PI0 writes approach the Figure 10 shows memory read and memory read theoretical limit of the host bus bridge. line commands with burst sizes limited to what is PI0 bursts are li~iiitedby the size of the 1/0 read expected from these commands. As explained else- and write merge buffers on the CPU. A single where in this paper, memory read is used for bursts of Alphaserver 4100 CPU is capable of bursts up to less than a cache line; menzoly read line is used for 32 bytes. PI0 writes are posted; therefore, to avoid transactions that cross one cache line boundary but are stalling the slatem with syste11i bus flo\\~co~itrol, in the less than two cache lines; and mcmory read ~nulk@le maximum configuration (see Figure 2), we provide a is for transactions that cross mfo or more cache line minimum of three posted write buffers that may be boundaries. filled before flow control is used. Configurations with Tlie efficiency of nzeinoly read and nze~noly fewer than the maximum number of CPUs can post read line does not improve with larger bursts because more PI0 writes before encountering flow control. there is no prefetching beyond the first or second cache line respectively. This sho\.vs that large bursts Summary and use of the appropriate PC1 commands are both necessary for efficiency. The DIGITAL AlphaServer 4100 system incorporates design innovations in the PC1 bus bridge that provide Performance of PI0 Operations a highly efficient interface to 1/0 devices. Partial PI0 transactions are initiated by a CPU. AlphaServer cache line writes improve the efficiency of small writes 4100 PI0 performance has been measured 011 a to memory. The peer link niecha~iisniuses TLBs to

- 32 64 128 256 512 1024 2048 4096 BURST SIZE (BYTES) KEY: IDEAL PC1 (8 CYCLES TARGET LATENCY) MEMORY READ MULTIPLE (MEASURED) MEMORY READ LINE (MEASURED) MEMORY READ (MEASURED)

Figure 10 Comparison of DMRead Bandwidth on the AlphaServer 4100 System and on an Ideal PC1 System

Digital Tcchlical Journal - - PI0 WRITE, 32-BIT PC1 PI0 READ, 32-BIT PC1 PI0 WRITE, 64-BIT PC1 PI0 READ, 64-BIT PC1 KEY: MEASURED PERFORMANCE THEORETICAL PEAK PERFORMANCE

Figure 11 Comparison ofAlpliaSer\ter 4100 PI0 Performance \\litli Tlicoretical 32-bvtc Burst Peak Performance

map dcvicc address space on independent peer 1'CI References and Note buses to permit direct peer transactions. Reordering of transactions in queues on the PC1 bridge, combined 1. Willtc~.UNIX Hot Iron A\\rards, UNIS EXPO I'l~rs, with thc use of 1'CI delayed transactions, pro\,ides a Octobcr 9, 1996, http://\\,\\.\\..aiiv corn (iblcnlo I'.irk, deadlock-free design for peer transactions. Buffers and Calif.: Alhl Technology). prefetch logic that support very large bursts without 2. R. Gillett, ''1MEA~101

74 Digiral 'lkchnical Journal Vol. S No.4 1996 " Craig I(ccfer is a principal liard\vare engineer \\/hose engi- neering expertise is designing gate arrays. He \vas tlie gate array designer for one of the two 2351< CMOS gate arrays in the AlphaScr\~er8200 systern and the team leader for the con~niand.ind address gate drray in the AlphaSer\~cr8400 1/0 nlodule. A mc~nbcrof the Server product Development Group, he is now responsible for designing gate arrays for hierarchical switch hubs. Craig joined DIGITAL in 1977 and holds a B.S.E.E from the University of lon~ell.

Thomas A. McLaughlin Tom iMcLaughlin is a principal lhard\vare engineer \vork ing in DIGITAL'S Server Product Development Group. HE is currently involved with the next generation of higli- end server platforms and is focusing 011 logic synthesis and ASIC design processes. For the AlphaServer 4100 project, he was responsible for the logic design of the I/O subsystem, including ASIC design, logic synthesis, logic verification, and timing verification. Prior to joining the AlphaServer 4100 project, he was a member of Design and Applications Engineering within DIGITAL'S External Semiconductor Technology Group. Tom joined DIGITAL in 1986 after receiving a R.T.E.E.T. fkom the Rochester Institute ofTeclinology; lie also holds an M.S.C.S. degree from tlie Worcester Polytechnic Institute.

Digit31 Technical Journal I Viphi V. Gokhale Design of the 64-bit Option for the Oracle7 Relational Data base Management System

Like most database management systems, the Introduction Oracle7 database server uses memory to cache data in disk files and improve the performance. Historically, the li~uitingfactor for the 0riicle7 rela- tio~ialdatabase management system (RDBlMS) perfor- In general, larger memory caches result in better mance on any given platform has bcen'tlie amount of performance. Until recently, the practical limit computational and 1/0 rcsourccs available on a single on the amount of memory the Oracle7 server node. Although CPUs have beconic faster by an order could use was well under 3 gigabytes on most of lnagnitude over the last sc\reral years, 1/0 speeds 32-bit system platforms. Digital Equipment ha1.c 11ot irnprojred co1nmcnsur3tcl!,. For instance, the Corporation's combination of the 64-bit Alpha Alp11.1 CPU clock speed alone has increased four times since its introduction; during the same time period, system and the DIGITAL UNlX operating system disk access times have improvcd by a factor of nvo at differentiates itself from the rest of the com- bcst. The overall thro~~ghp~~tof database soft\\lare is puter industry by being the first standards- critically depende~lton the spccd of access to data. compliant UhllX implementation to support To overcome the I/O speed limitation and to maxi- linear 64-bit memory addressing and 64-bit mize performance, the standard Oracle7 database server application programming interfaces, allowing already ~~tilizesand is optimized for various parallelizu- tion tcchniclues in soh\,arc (e.g., intelligent caclung, high-performance applications to directly access data prefetching, and parallel query execution) and in memory in excess of 4 gigabytes. The Oracle7 hard\vare (e.g., symmetric niultiprocessing [SIMP] sys- database server is the first commercial data- tems, clusters, and massively parallel processuig [LMPP] base product in the industry to exploit the per- systems). Given the disparity ill latency for data access formance potential of the very large memory bct\\rcn memory (a fe\v tens of nanoseconds) and disk configurations provided by DIGITAL. This paper (a fc\v milliseconds), a common technique for maximiz- ing perfor~nanceis to minimize disk I/O. Our project explores aspects of the design and implementa- originated as an investigation into possible additio~lal tion of the Oracle 64 Bit Option. performance improvements in tile Oracle7 database server in the context of increased memory addressability and eseciltion speed pro\rided by the AlphaSenler and DIGITAL UNLY system. Work done as part of tl1.i~proj- ect subsequently became the foundation for product de\,elopment of the Oracle 64 Bit Option. Of the memory resource that the Oracle7 database uses, the largest portion is used to cache the most fre- quently used data blocks. With hardware and operat- ing system support for 64-bit memory addresses, new possibilities have opened LIPfor high-performance applicatio~~software to take advantage of large mem- ory configurations. Two of the concepts utilized are hardly new in data- base development, i.e., impro\.ing database server per- formance by caching more data in memory and irnpro\ling 1/0 subsystem througliput by increasing data transfer sizes. Ho\vever, various conflicting fac- tors contribute to the practical upper bounds on

76 Digiral Technical Journal perforniance impro\~ement.These hctors include basic unit for I/O and disk space allocation in the CPU architectures; memory addressability; operating Oracle7 RDBMS. Large block sizes mean greater den- system features; cost; and product requirements for sity in the rows per block for the data and indexes, and portability, compatibility, and time-to-market. An typically benefit decision-support applications. Large additional design challenge for the Oracle 64 Bit blocks are also useful to applications that require long, Option project was a requirement for sig~lifica~itper- contiguous rows, for example, applicatio~isthat store formance increases for a broad class of existing data- multimedia data such as images and sound. Rows that base applications that use an open, general-purpose span multiple blocks in Oracle7 require proportion- operating system and database software. ately more I/O transactions to read all the picccs, This paper provides an overview of the Oracle 64 resulting in performance degradation. Most platforms Bit Option, factors that influenced its design and that run the Oracle7 system support a maximum data- implementation, and performance implications for base bloclt size of 8 kilobytes (ICB); the DIGITAL some database applicatio~iareas. In-depth information UNIX system supports bloclt sizes of up to 32 IU3. on Oracle7 RDBMS architecture, administrative COIII- The shared global area (SGA) is that area of memory mands, and tuning guidelines can be found in the used by Oracle7 processes to hold critical shared data OI-LLC~~~Se~-ocrDocc~rne~~tn/ion Sel.' Detailed analysis, structures such as process state, structured query lan- database server, and application-tuning issues arc guage (SQL)-level caches, session and tra~lsaction deferred to tlie references cited. Overall observations states, and redo buffcrs. The bulk of the SGA in terms and conclusions from cxpcrimcnts, rather than specific of size, however, is the database buffer (or block) details and data points, are i~sedin this paper except cache. Use of the buffer cache means that costly disk \+theresuch data is publicly available. 1/O is avoided; therefore, the performance of the Oracle7 database server relates directly to the arnount Oracle 64 Bit Option Goals of data cached in the buffer cache. LSGA seeks to use as much memory as possible to cache database blocks. The goals for the Oracle 64 Bit Option project were as Ideally, an entire database can be cached in memory follows: (an "in-menior)r" database) and avoid almost all I/O during normal operation. Demonstrate a clearly identifiable performance A transaction whose data request is satisfied fi-om increase for Oracle7 running on DIGITAL UNIX the database buffer cache executes an order of rnagni- systems across two commonlp used classes of data- tude faster than a transaction that n3~1stread its data base applications: decision support systems (DSS) from disk. The difference in pcrforniauce is a direct and online transaction processing (OLTP). consequence of tlie disparity ill access ti~iicsfor main Ensure that 64-bit addressability and large memory memory and disk storage. A database block found in configurations are the only two control variabl.es the buffer cache is termed a "cache hit." A caclic miss, that influence o\~crallapplication perfor~iiance. in contrast, is the single largest contributor to dcgra- Rrcak tlie 1- to 2-GB barrier 011 tlie a~~iount dation in transaction latency. 
Both BOB and LSGA use of directly accessible memory that can practically memory to avoid cache misses. The Oracle7 buffer be used for typical Oracle7 database cache cache implementation is the same as that of a typical implemcntations. write-back cache. As such, a cache miss, in addition to Add scalability and performance features that com- resulting in a costly disk 1/0, can have secondary plement, rather than replace, current Oracle7 effects. For instance, one or more of thc lcast recently server SMP and cluster offerings. used buffers may be evicted from the buffer cache if no free buffers are available, and additional 1/0 transac- Implement all of the above goals without signifi- tions may be incurred if the evicted block has been cantly rewriting Oracle7 code or introducing appli- cation incompatibilities across any of the other modified since the last time it was read fro111the disk. platforms on which the Oracle7 system runs. Oracle7 buffer cache management algoritlinis already implcment aggressive and intelligent caching sche~ncs and seek to avoid disk I/O. Although cache-miss Oracle 64 Bit Option Components penalties apply with or \vithout tlie 64-bit option, Tcvo major components make up the Oracle 64 Bit "cache thrashing" that results from constrained cache sizes and large data sets can be reduced wit11 the Option: big Oracle hloclts (BOB) and large shared option to the bcnefit of many existing applications. global area (LSGA). They arc briefly described in this section. :I'lie Oracle7 buffer cache is specifically designed and optimized for Oracle's multi-versioning read- The BOB cornponcnt takes advantage of large memory by making individual database blocks larger consistency transactional model. (Oracle7 buffer than those on 32-bit platforms. A database block is a cache is independent of the DIGITAL UNIX unified buffer cache, or UBC.) Since Oracle7 can manage its

Digital Technical Journal Vol. 8 No. 4 1996 7 o\vn buffer cache more effectively than file system strained by the fact tliat this resource is also shared by but'fer caches, it is ohcn recommended that the filc man)! other critical data structures in the SGA besides system cache sizc be rcciuccd in favor of a Idrger tllc Lx~ffercache and tllc mcmor!! nccded by the oper- Oracle7 buffcr cache \vhcn the database resides on ating system. 1311 eliminating the nccd to choose 3 file system. Reducing filc systcm caclie size also mini- bct\vccn the size of the database blocks and bi~ffcr mizes redunda~itcaching of data at the file system caclic, Oracle7 or1 a 64-bit platform can run a greatcr level. For this reason, we rcjcctcci early on the obilio~~s nppliciition mix \vitliout sacrificing performance. dcsign solutio~iof i~singthe DIGITAL UNIX file sys- Despite the codependency and the common goal tem as a large cache for taking advantage of large of reducing costl!~ disk I/O, 13OB and LSGA address memory ~01ifig~11-atio1~s-c\~~11tl~o~~gh it had tlic nvo different dimensions of ci;lt;ih3se scnlabilin: I30B appeal of complete transparency and no code changes addresses on-disk database sizc, and the LSGA addresscs to the Oracle7 s!.stcrn. in-memory database sizc. Applicarion developers and datab,lsc administrators l~avccomplete flesibility to Background and Rationale for Design Decisions hvor one o\rr the other or to use tlic~nin combination. In Oracle7, the on-disk data structures that locate The primary impeti~sfor this projcct was to e\~aluatc a row of data in the elatabase LIS~a block-address- the i~nplcton the Oracle7 database server of emerging byte-offset tuple. The data block address (DBA) is n 64-bit platforms, such as tlie Alphaserver system and 32-bit cli~antity,\\/hich is f~rtlicrbrokcn up into filc 1)IGITAL UNIS operating system. Goals set forth number and block ofket within that filc. The byte off- for this project and subscclucnt dcsign considerations set within a block is a 16-bit quantity. Although the therefore escluded any performance and fi~nctionality n~~liibcrof bits in the DRA i~scdfor file nirmber and enhancements in tlie Oracle7 1U)BMS that could not block offsct are platform dcpcndc~it(10 bits for the filc bc attributed to the benefits offercd by a typical 64-bit niunbcr and 22 bits for tlic block offsct is a common platform or otherwise encapsulated \\tithin platti)r~n- ti)rmat), there exists a theoretical ~~pperlimit to tile spccific layers of the databasc server code or the oper- size of an Oracle7 databasc. With some cxccptions, ating system itself. most 32-bit platfor~ussupport a masimium data block Common arcas of potential benefit for a typical size of 8 KB, with 2 I(R as the dcfa~~lt.For example, 64-bit platform (rvhcn compnrcd to its 32-bit cotrn- using a 2-KB block sizc, the upper limit for the size tcrpart) are (a) i~~crcascddirect memory addressabilit): of tlic database on l>IGITAL. UNIS is slightly undcr and (b) the potential ti)r configuring systems with S terabytes (TB); whereas a 32-13 hlock size raises greater than 4 GB of Inemor!.. As noted above, appli- that limit to slightly undcr 128 TI%.The ability to SLIP- cation perfor~iianceof tlic Oracle7 d'ltabasc ser\.er port buffer cache sizes \\.ell bc!,ond those of 32-bit clcpc~~dson \\lllether or not data are fi)i~ndin the data- pl'ltforms \\,as a critical prcrcqi~isitcto enabling larger base buffer cache. 
A 64-bit platform provides the sized data blocks and consecli~entlylargcr sized data- opport~~nityto expand the database buffer cache in bases. Some 32-bit platforms arc also constrained by Oracle7 to sizes \vcll beyond those of a 32-bit plat- the tjct tliat each data file cannot cxcecd a size of4 G'B form. BOB and LSGA reflect the o~ilylogical dcsign (especially if the data filc is a file system managed choices available in Oracle7 to take advantage of this object) and therefore may not be able to use all of the cstended addressability and meet the project goals. n\railablc block offset range in the esisting DKA for- Implcmentatio~~of thcsc components focused on mat. The largest databasc sizc tliat c,ln be s~ipportedin ensuring scalability and maximizing the effectiveness such a case is e\.eli smaller. Addressing the perceived of avnilable nicmory resources. linlits on the size of an Oracle7 databasc \ifas an i~npor- tant consideration. Design alternatives that recluired BOB: Decisions Relevant to On-disk Database Size changes to the layout or an intcrpl-ctation of DBA hr- Larger database blocks consume proportionately niat \\'ere rejected, at least in this project, because such largcr amounts of memory \vIicn the data contained in changes would have introduced incompatibilities in those blocks are read from tlie disk into the databasc on-disk data structures. buffcr cache. Consequently, the size of the buther It slio~~ldbe pointed out that on current Alplia cache itself must be increased if an application reqi~ires processors using an 8-ICE page sizc, a 32-KB data a greater number of thcsc largcr blocks to be cached. block spans four memory pages, and 1/0 code paths For an!! given size of databasc buffer cache, Oracle7 in the operating systenl 11ccd to Iock/~~nloc[~~OLIK database administrators of 32-bit platforms have times as man11 pages \\,hen pcrtbrming an 1/0 trans- had to choose benvecn the sizc of each database block action. The larger transfer sizc also adds to the total and tlie number of databasc blocks tliat must be in time taken to perform an I/O. For instance, four the cache to niini~nizedisk I/O, t11c choice depe~lding pages of memory that contain the 32-K13 data block on data access patterns of the applications. Memory may not be physically contiguous, and a scatter-gather available for the database buffcr cache is fi~rthercon- operation map be recluired. Althoi~ghthe Oracle7

Val. 8 No. 4 I996 database supports ro\v-level locliing for maximum would use segmented allocations if the size of concurrency in cases where unrelated transactions Ilia)[ the mcniory allocation request exceeds a platform- be acccssirig different rows within a given data block, dependent tliresliold. In particular, the size in bytes access to the data block is serialized as each individual for each nlernory allocation request (a platfor~il- change (a transaction-level change is broken down dependent value) was assumed to be \veil under 4 GR, into multiple, smaller units ofchange) is applicd to the \vJlich was a correct assumption for all 32-bit plat- data block. Larger data blocks accommodate more forms (and even for a 64-bit platform without ISGA). rows ofdata and consequently increase the probability Internal data structures i~sed32-bit integers to repre- of contention at tlie data block level if applications sent the size of a memory allocation rcqucst. change (insert, update, delete) data and have a locality For each buffer in the buffcr cachc, SGA also ofrcfcrence. Experiments have sho\4~1i,however, that contains an additional data structure (buffer header) this added cost is only marginal relative to the overall to hold all the metadata associated with that buf- pcrforrnance gains and can be offset easily by carefully fer. Although memory for the buffer cache itself \\/as tuning the a,pplication. Moreover, applications that allocated using a special interface into the lncniory mostly clucr)~tlic data rather than niodi5 it (c.g., DSS Iiianagcmcrlt layer, rnenior)~allocation for buffcr applications) greatly benefit fi-om larger block sizes headers used conventional interfaces. A different since in this case access to the data block need not be allocation scheme \\,as needed to allocate meniory serialized. Subtle costs such as the ones mentioned for buffer headers. The buffer header is the only above ne\lertheless help explain \vhp some applications major data structure in Oracle7 code whose size may not necessarily see, for example, a fourfold pcr- requirements are directly dependent on the number of formance increase when tlie change is made from an buffers in tlie buffer cache. Existing memory man- 8-IU3 block size to a 32-IUS block size. agement interfices and algorithms used prior to LSGA As with Oracle7 iniplcnicntations on other platforms, work were adequate until the number of buffers in database block sizc \i~itlithe 64-bit option is determined the buffer cache exceeded approximately 700,000 at database creation time using db-block-size con- (or bufkr cache size of approximately 6.5 GH). Minor figuration paranictcr.' It cannot be changed dynan~ically code changes were necessary in memory manage- at a latcr time. ment algorithms to accommodate bigger allocation requests possible with existing high-end and future LSGA: Decisions Relevant to In-memory Database Size Alphaserver configurations. The focus for the LSGA effort was to idcnti@ and eli~ii- The AlphaServer 8400 platform can support Inem- inate any constraints in Oracle7 on the sizes to which ory config~trationsranging from 2 to 14 GB, using the database buffcr cachc could grow. DIGITAL UNIX 2-GB memory niodules. 
Some existing 32-bit plat- meinory allocation application progrrunming interfaces forms car1 support physical memory configurations (APIs) and proccss address space layout rnaltc it fairly that exceed thcir 4-GB addressing limit by way ofseg- straightfor\\larcl to allocate and mallage Systcrn V mentation, such that only 4 GB of that meniory is shared nmnory scgmcnts. Although tlie sizc of each directly accessible at any time. Program~ningcomplex- shared mcmory segment is limited to a maximum of ity associated with such segmented memory models 2 GB (duc to the requirement to comply with UNIX precluded any serious consideration in the design standards), multiplc scgmcnts can be used to work process to extend LSGA work to such platforms. around this restriction. The memory management Significantly rewriting thc Oracle7 code \\!as specifi- layer in Oracle7 code therefore \vas the initial area of cally identified as a goal not to be pursued by this proj- focus. Much of the CIracle7 code is \vrittcn and archi- ect. The Alpha processor and DIGITAL UNIX system tectcd to make it highly portable across a diverse rangc provides a flat 64-bit model to of platforms, i~icludingmemory-constrained 16-bit the applications. DIGITAL UNIX extends standard desktop platforms. A particularly interesting aspect of UNIX APIs into a 64-bit programming en\iironment. 16-bitplatforms with respect to memory management Our choice of tlie Alphaserver and DIGITAL UNlX as is that these platforms cannot support contiguous a development platform for this project was a fairly memory allocations bcyond 64 KB. Users arc forced simple one from a time-to-market perspective because to resort to a segmented memory model such that it allowed us to keep code changes to a minimum. each individual segment does not exceed 64 1U3 in Efficiently managing a buffer cache of, for example, size. Although such restrictions are so~newhatcon- 8 or 10 GB in size was an interesting challenge. More straining (and perhaps irrelevant) for most 32-bit than five million buffcrs can be accommodated in a platforms-more so for 64-bit platforms-which can 10-GB cache, with a 2-KB block size. That number of easily handle contiguous liiemory allocations well buffers is already an order of magnitude greater than in excess of 64 1U3, memory nianagcrnent layers in what we u7ere able to experiment with prior to the Oracle7 code are designed to be sensitive and cautious LSGA work. The Oracle7 buffer cache is organized as about large co~ltigi~ousmemory allocations and an associative write-back cache. The mechanism for

Digiral Technical Joun~al locatbig a data block of interest in tliis c~cheis s~~pported cntrp allo\\*the Alpha <:PU to use a single translation by common dgorith~nsand data structures such as hash look-aside buffer (TLB) entry to map a 512K physical ti~nctio~ismd li~lkedlists. In man![ cases, tral~ersingcriti- mcmory space. Using one 1'LR entry to map larger cal linked lists is serialized among contending threads of ph!lsical memory has the potential to reduce proccsbor cscc~~tionto maintain the integrity ofthc lists tlic~iiscl\~cs stalls during 'TLB misses anci ref 11s. Also, beca~~seof the and secondary data structures managed by these lists. As rcquircmcnt that the grani~larityhint rcgion be both a result, the sizc ofsuch critical lists, fix example, has an virtually and physically contig~~ous,it is allocated at sys- impact on overall concurrency. The larger buffer count tem startup time and is not subject to normal virtu.11 no\\.possible in LSGA configi~~itionshad the net effcct nicmory management; tbr csamplc, it is never paged in of reduced concurrency bcca~~scthe sizc of tlicse lists is or OLI~,n~id subseq~~cntly the cost ofn page fa~~ltis mini- proportion at el!^ larger. 1SGA pro\'idcd a fi-amc\vork to mal. Since pages in gran~~l;i~in,liint rcgio~isare pli!rsj- tcst contributions from other unrelated projects that cally contiguous, my I/() done fiom this rcgion of addrcssccl such potelitin1 bottlcncclSS, An in dust^-!^- Examples of this \vork arc in the basc operating system sta~ldardbcnchmark \\,auld Iin\'c bccn a logicnl choice grani~larityhint regions ancl in the shared page tables.?' for a \\lorkload that \\loulci dcmonstratc performance For every page of physical and virtual lileniory, a11 bcncf ts. Ho\lre\rcr, the cnor~nousti~iic, rcsoLIrccs, and opcrxing system niust ~iiaintainvarious data structurcs cfhrt rcquircd to stage an a~~ditcdTlY: benchmark such as pagc tables, data stl.ucturcs to track regions oF and the strict g~~idelincsFor nny direct coniparison of mcmory with certain nttributcs (such ns S~~stemV slixcd p~~blisllcdbcnclimark results \\.ere major factors in nicmory regions, or tczt and data segments), or diit.1 tlic decision to dc\rclop 3 \\rorkloaci for this projcct structures that track processes which lia\.e references to that matched the spirit of the TI'(; benchmark but not these mcniory regions. Ancillary opcr.iting system data ncccssarily the letter. structures sucli as pagc tnblcs gro\tr in size pro- In late 1995, Oraclc Corporation ran a series ofpcr- ~x)rtion,ltclyto the size ofpli!~sical mcmory. C:hangcs tbrmancc tcsts for a DSS-class \\forkloadofthe Oracle7 to page table managcmcllt associntccl \\'it11 System V systcm, \\lit11 ancl \\zithoL~tthe 64-bit option on the shared mcmory regions ucrc ~nadcsuch that processes AlplinSer\~cr8400 system run~ii~igthe DIGITAL UNIS that mapped the sharcd mcmorp regions could share operating system \\fit11 8 GR of physical mcmory. A page tables in addition to the data pages themsclvcs. detailed report on this tcst is published and available fi-om Oraclc C~rporation.~Thcsc results, shown in Prior- to this change, each process mapping the shared memory region ~~seda copy of'associnted pagc tables. Fig~~re1, clcarly demonstrate the benefits of a large i\ i\ clinngc like this rcduccii physical mcmorjr consump- nmount of physical ~ncmoryin ,I config~~ration\\,it11 the 64-bit option. A sunlmilr!p ofthc tcsts conducted is tion by the operating system. 
For csnniple, on an Alpha <:PU supporting an SKI3 page sizc, it would take 8 KH ~wcsuitcdhere along \\fit11 somc data points and kc! observations. in pagc table entries to m;ip 8 MR of physical memory. (1Zcnders interested in pcrfi)rmancc characteristics ot For nn SGA of 8 GB, it \\lould tnkc 1 1Ml3 in page tablc entries. It is not ~~nco~ii~iio~iin the 01-acle7 systenl for nn audited industry-standard O1,Tl' bcnc11nia1-k arc li~~ndredsof processes to connect to the database, and referred to the nigital ~~'c./~I~~c~I/,/oIII~I~LI/,V~ILIIIIC8, thcrcforc niap tlie 8 GI3 ofSGA. Witlio~~tshared pagc Number 3. T\\,opapers present performance character- tables, 100 such processes would have consumed 100 istics of Oracle7 Par,illcl Scr\lcr rclc.lsc 7.3 using 5.0 GR MR of physical mcmor!l by maintaining a per-process SGA, and a TPC-<: workload on a fi)ur-node cl~~stcr.') copy of page tablcs. Tlic tcst database co~lsistcdoff 1.c tables, reprcscnt- ing approximately 6 GB of ciata. The tests incl~tdcd A gran~~larityhint rcgion is n rcgion ofph!aicaJly con- tiguous pages of mcmory thnt shnrc \.irtual and physical t\vo separate configurations: mappi~igsbetween all the proccsscs tliat map them. A "standard" config~~ration\\~tIl ,I 128-MI3 SGA Such a mcmor!l layout nllows 1)IGITAL UNIX to take \vith a 2-KB database block sizc nd\,antagc of tlie granularity hilit feature supported by A 64-bit option-e~iabledconf guration \\~itha 7-GI3 Alpha processors. Gra~~ularityhilit bits in a page tablc SGA and 32-KB database bloclc sizc

Vol. 8 No. 4 1996 PERFORMANCE RATIOS OF LSGA TO SGA of the 64-bit option) is a standard offering in tlie Oracle7 database server product since release 7.1. Use of parallel query in this tcst illustrates the effcct of the 64-bit option enhancements on preexisting mecha- nisms for database performance irnprove~~ient. All other things being equal, if tlie only difference between a standard configuration and a 64-bit- option+nabIcd co~~fig~~rationis that the entire data set is cached in memory in the latter configuration and tliat typical times for main memory accesses arc a few tens of nanoseconds \\/hereas times for disk accesses arc a fc\v milliseconds, only the six to seven tinics performance " .- 1 2 3 4 5 6 increase i~itra~lsactio~i 1 \\io~lld seem hr belo\v TRANSACTION TYPE expectation. For a till1 table scan operation, the Oracle7 server is already optimized to use aggressi\ie data Figure 1 prefctch. Before tlie server begins processing data in Pcrform'i~~ccI~nprovcrncnt~ for J DSS-class \Vorkload, Kat~osof LSGA to SGA a given data bloclc, it launches a read operation for the next block. This technique significantly reduces application-\iisible disk access latencies by o\,erlapping The e\~aluationinclucled running six separate trans- conip~~tationand I/O. Disparity in access time for main action types ag'iinst these t\lroconfigurations: memory and disk is still largc enough to cause the com- putation to stall while waiting for the read-ahead 1/0 to 1. Full table scan against a table with 42 million rows finish. When data is cached in memory, this rc~naining (witliout tlie Parallel Query Option) stall point in the qllery pl-ocessing is eliminated. 2. Full table scan against a table \\/it11 42 million rows It is also important to note that a f~lltable scan (with tlie P~rallelQ~~ery Option) operation tends to access the disk sequentially. It is 3. Set of ad lioc queries against a talsle with typical for disk access times to be better by a factor of 42 nill lion rows at lcast nvo in sequential access .is compared with rail- 4. Set of ad hoc clucrics invol\~inga join against dom access. Icccping block sizc and disk and ~iiai~i three tables with 10.5 million, 1.4 million, and memory access times the sanic as bcforc in this equa- 42 million rows, I-cspecti\~clp tion, '1 faster Alpha Cl'U \\~ouldyield better ratios in this test because it \vould finish co~i~pl~tationpropor- 5. Set of ad hoc q~~eriesin\rol\~ing a join against four tionately faster and \\lould \\lait longer for the read- tables with 1 million, 10.5 million, 1.4 million, and ahead I/O to finish. Follo\v-on tests with faster CPUs 42 million rows, respectively supported this observation. Overlapping computution 6. Set of ad hoc qileries involving a join against and 1/0 as in ;I tablc scan operation may not be possi- five tables \\,it11 70,000, 1 million, 10.5 million, ble in an indcx lool

Vol. 8 No. 4 1996 Unlike fill1 table scans, tlie sort/mergc operation Acknowledgments gcncrntcs intcrmecliatc res~~lts.Depending on the size of thcsc partial rcsults, they may be stored in main I\/I.III\, pcol>lc tvithin sc\.cral groups and disciplines st both Iiicmory if an adeq~~ateamount of menlory is avail- Oracle .~ndDIGITAL hxccontrib~rted to the succc.ss of this able; or they may be \vrittcn back to temporary storage ~>rojcct.I \\ OLIICIlike to tlia~ikthe follo\vinp intii\,idu.ils liom Or,lclc: kV.lltcr Bartistclla, Saar Maoz, let' Kcllllcd!, r11lti space ill the ci'it,lbase. The latter operatio11 resi~ltsin additional l/Os, proportionately more in nunibcr as l).l\.iti Ir\\,in of the 1)IGITAI. System Busincss ['nit; ,~ndfiorn DIGITAL.: Jim Wood\\zarci, PIILII;~L,o~ip, l).~srcll i~ipt~tstothc sort/mcrgc gro\l1 in size or count. The I)~~nn~~cli,and L).l\,c Wincliell of the l)IC;l'l';iI, LSIS 64-bit option makes it possible to eliminate thcsc I/Os Engineering ~~OLIP.Mcr~~bers of tllc Computer S!,stclns as well, as illustrated in transaction types 4 thro~~gh6. I)i\,ibion's Pcsfornlancc Group at DIGITAL. hn\rc also con- Pcrformance improve~ne~itsare greater as the conl- tr~burcdto this projcct. plexity of queries increases.

Conclusion References

The disparity between memory speeds and disk speeds I . 01 wclc 7 SCJI.[IL'I.DOCI.II~?C'I~~U~~OII .Ye/ ( Kcci\\,ood SI~orcs, Calif.: Ol-nclc Corporation). is likely to continue for the hrcsccablc future. Ldrgc mcmory configurations represent an opportunity to 2. I~I(;I'l;Jl.L!VIX I V.0 Releclst~,\iol~s (Maynard, IM~ss.: o\,ercornc this disparity and to increase application 1)igiral Equipment Corporation, 1996). perhl-niancc by caching a large amount of data in 3. R. Sites and R. Witek, eds., Alphu A~rhilcc'l~itr~KC;/~,I- memory. Even though the Oracle 64 Hit Option c~rrcc~.Ilc~~~~~c~/(Ne\\.to~i,1Mas.s.: Digir~l I'rcss, 1995). impro\,cs database perforn~ance-two orders of mag- nitude in so11ic cases-specific application characteris- 4. O~z~clc>64 Bit Option Pel:/bl.n~u~lceRepol? 011 I)i~:,i/ril I :\%Y (lLd\vood Sliorcs, Calif.: Or~cle<:orpol-ation, tics must bc cvaluatcd to determine tlie best means k)r p~rt~~urnbcr C10430, 1996). maximizing o\rcrall perfoslilancc and to balance thc significant increase in hardware cost for tlie largc 5. J. Pi;lntcdosi, A. Sathaye, and D. Shnkshobcr, "l'crfor- amount of mcmor!,. TIic Oracle 64 Bit Option com- mancc Mcns~~rc~ncntof TruClustcr Systems under tllc plements existing Oracle7 feati~resand fi~nctionalin,. 'I'I'(:-<: 13cncIirnark," and T. Kawnf, 1). Sllakshobcr, and I). Stnnlcy, "Pcrfor~nanccAnalysis Using Vcl.!~ L,;lrgc The cxact cstcnt of thc increases in speed with the iVlcmosy on thc 64-bit Alphaserver Systcnl," 1)i~ilol 64-bit option varies based on the type of database 'li.cl~~~ical,/o~l~~i~ul,\,ol. 8, no. 3 (1996):46-05. opwruion. Faster (:PUS and denser memory allow for c\rc~~morc pcrforniance improve~-nentsthan lia\,c Biography bccn clcmo~latmtcd.Factors of ilnportance to nc\v or existing applications, particularly those sensiti\,e to response time, are an order of magnitude performance Vipin V. Gokhale in terms of spced increases and the ability to utilize Vipin Gokh~lcis (1 (:onsulti~igSofnva1.c Engineer ;lr Orciclc nicmory configurations much larger than prc\fio~~sly <:orlx)l-.)tion irl the I)TGI?'.-IL Systcm Busi~lcssUnit \\.licrc he 1135 co~itributedto porting, optimizatioll, and plarform- possiblc in Oracle7 or fix applications that use specific fcnturcs ant1 fr~lcrio~ialityexte~lsiorls to Ornclc's n~odcratc-sizcdata sets. With sufficient physical mem- d.lrabasc scr\-cr on I)IGI'I'.4L's operating h!.stclns and scr ory, the dat~bases~~sed bv these rcsponse-time- vcl-s. Hc was ~rcsporisiblefbr delivering the ti rst 01-aclc7 scnsiti\rc applications can now be entircl!~ cached in pol-t to tllc 1)IC;I~I'AI. UNIX platform. Prior to joininp memory, cliniinaring \rirtually all disk I/O, \vhicIi is Ornclc in 1990, Vipin \\..IS n Senior Sofn\.are E~iginccr in India, dc\,clopingrclccorn~~~~~nications soft\\.arc. Hc oftcn a major constraint to response time. In-memory received .I B.'li.ch. in Electronics and l'cIccom~nuniz.~- (or f~llycaclicd) Oracle7 databases do not conipro- rions fi-om the Institute ofTcchnology, Banaras Hinciir niise transactional integrity in any wa),; nor do such University, Indin, in 1985. configurations require special hardware (for example, nonvolatile random access menlory [RAM]). I

Vol. 8 No. 4 1996 T.K. Rengarajan Maxwell Berenson Ganesan Gopal VLM Capabilities of Bruce McCreadp Sapan Panigrahi the Sybase System 11 Srikant Subramaniam SQL Server lMarc B. Sugiyama

Software applications must be enhanced to The advent of the System 11 SQL Server from Sybase, take advantage of very large memory (VLM) Inc. coincided with the widespread availability and system capabilities. The System 11 SQL Server use of very large memory (VLM) technology on DIGITAL'S Alpha n~icroprocessor-based computer from Sybase, Inc. has expanded the semantics systems. Tccl~nologicalfeatures of the System 11 SQL of database tables for better use of memory Server werc used to achieve record results of 14,176 on DIGITAL 64-bit Alpha microprocessor-based transactions-per-minute C (tpmC) at $198/tpmC systems. Database memory management for on the DIGITAL Alpha~erver8400 server product.' the Sybase System 11 SQL Server includes the One of thcse features, the Logical Memory Manager, ability to partition the physical memory avail- provides thc ability to fine-tune memory manage- ment. It is the first step in exploiting the semantics of able to database buffers into multiple caches database tables for better use of memory in VLM sys- and subdivide the named caches into multiple tems. To partition memory, a database adnlinistrator buffer pools for various I10 sizes. The database (DBA) creates multiple named buffer caches. The management system can bind a database or 1)11A then subdivides each named cache into multiple one table in a database to any cache. A new buffer pools for various 1/0 sizes. The DBA can bind a facility on the SQL Server engine provides database or one table in a database to any cache. A ncw thread in the SQL Server engine, called the nonintrusive checkpoints in a VLM system. Houseltccper, uses idle cycles to provide free (non- intrusive) checl

VLM Technology

The term very largc rncmor)l is subjective, and its widespread meaning changes with time. By VLM, wc meall systems with more than 4 gigabytes (GB) of memory. In late 1996, perso~lalcomputer servers with 4 GB of memory appeared in the marketplace. At $10 per megabyte (MB), 4 GB of memory becomes afford- able ($40,000) at the departmental level for corpora- tions. We expect that most of the mid-range and high-end systems \vIII be built with more memory in 1997. Gro\vth in the amount of system memory is 'in ongoing trend. Growth beyond 4 GB, ho\vevcr, is a significant expansion; 32-bit systcms run out of mem- ory after 4 GB. DIGITAL developed 64-bit computing with its Alpha line of microprocessors. Digital is now

Digiral Technical Journal Vol. 8 No. 4 1996 83 \vcll-lxxitioncd to facilitate the transition from 32-bit provides technological ,~d\rancesthat take advantage of to 64-hit s!,stcms. S!~base, Inc. pro\idcd one ofthc frst VLIM systcms. Thesc Jrc the L,ogical iMcmory relational database ma~iagcmcntsystcnis to LISC V1,iM Manager, VLA4 q~~cl-yoptiniizatio~i, the Houscltccpcr technology. The Sybasc System 11 SQI. Ser\,er pro- thre~d,and hzzy checkpoints. We discuss the signiti- \ricics hll, native support of 64-bit Alplid microproces- cance of tliesc ad\~,lncesin the remaining sections of sors and tlic 64-bit 1)IGITAL UNlX operating systcni. this paper. l)IGITAL, UNIS is the first operating s),stcm to provide a 64-bit .~ddrcssspacc for 311 proccsscs. The System 11 Logical Memory Manager SQT. Scrvcr uses this large address spacc primarily to caclic large portions of the database in mcmory The Sybase SQL, Server consists of several I)IC;ITAI, VI,i\/l technology is appropriate fix use \\lit11 applica- UNIX processes, called engines. The DBA confgures tions that Iia\~cstringent response time rccl~~ircmcnts. the number of engines. As sIiol\~nin Fig~~rc1, each With thcsc applications, for cs'~mplc,call-routing, it engine is perrnancnrl\, dedicated to one CPU of .I sym- bcco~ncsnecessar!, to fit the cntirc databasc in nicm- metric ni~~ltiprocessing(SIMP) machine. The Syb~sc or!!.' ' Tl~cuse of VLiM systcms can also bc bcncfcial engines share \'irti~.llmcmory, \\rliich has bccn sizcci to \\.hen the pricc/pcrfor~mance is impro\,ed by adding include the SQL, Scr\rcr csecutable. The virtual mcm- mor-c rnc~nory.~ or\, is lockcd to pliysical Iiicmor!i. As a result, tlicrc is never any opclating system paging for the Sybnsc Main Memory Database Systems memory. This sliarcd mcmor!~region also uses large operating system pages to minimize translation look- ,I hc \\~idcsprcad a\~ailabilityof VLM systcms raises aside buffcr (TLR) entries for the <:PU.TThc sliarcd tlic possibility of building main mcmory database memory holds the cl~tab~scbufkrs, stored procedure (IMIM~)R)systcms. Se\cral techniques to imp-o\rc the caclic, sort buffers, and othcr dynamic ~-~icrnor\!.This 1>ufi)r1i1.1liceof h?i\/11)13 systems have bccn discussed memory is managed csclusi\~clyby the SQI, Scr\u-. in tllc d,ltabase literature. l1',A partitions the nicm- tist in IMIMDBs,access mcthocis can be developed \\it11 ory, it can tlic~ibind databasr entities to a particular no Ikcy \ralucs in the index but only poi~itcrstod,lta cache. The database entity is one of the follo\\,ing: an ro\\.s in mcmory.Q~~eryoptirnizcrs nccci to consider <:1'U costs, not 1/0 costs, \\,hen comparing \,ario~~s

,~ltcrnati\rcplans for a query. In all 1\/1Ml)l3, clicck- CPU CPU pointing and failure recolYer!' arc the orll~,reasons for SYBASE SYBASE performing disk operations. A chcckpoint process can ENGINE ENGINE hc made "f~~zzy"\\rich lo\\' impact on transaction rlirouglip~~t.Akcr a system failure, incrcmcntal rcco\.- cry processing allo\\~stransaction ~~occssi~igto resume SECOND-LEVEL bcti)rc the rcco\{crp is complctc.' CACHE As memory sizes incrcasc with VLM systcms, dam- I I b,lsc sizes ;1rc also incrrasilig. 111 gc11cra1, \tlc cspcct C 4 BUS thnt d~t~[>ases\dl not fit in mcmo~-\~in the next decade. Therefore, for most of the databases, i\/Ih41)13 MEMORY tccl~niqucscan be exploited only for those p'irts of tlic n dntabasc that cio fit in memor!l.' Figure 1 In adciition to the capability oFcacliing the cntirc SQI, Server on .in Slcll' S\\rcm database in buffers, the Sybase System 11 SQI, Scr\,er

Vol 8 No. 4 I996 entire database, onc table in a database, or one index costs are reduced to an estimate of time. Since the on one table in a database. There is no limit to the number of 1/0 operations depends on the amount of number ofsuch entities that can be bound to a caclie. memory available, the optimizer uses the size of the This cache binding directs the SQL Server to use only cache in the cost calculations. With LMM, the opti- that cache for the pages that belong to the entity. mizer uses the size of the llamed cache to which a cer- Thus, tlie DBA can bind a small database to one cache. tain table is bound. Therefore, in the case ofa database In a VLM system, if the cache were sized to be larger that completely fits in tnemory in a VLM system, the than the database, an MMDB would result. optimizer choices are made purely on the basis of CPU Figure 2 shows the table bindings to named caches cost. In particular, the 1/0 cost is zero, when a table with the LMM. The procedure cache is used only or an index fits in a named cache. for keeping compiled stored procedures in memory The Spbase System 11 SQL Server introduced the and is shown for completeness. The item cache is a notion of tlie fetch-and-discard buffer replacement small cache of 1 GB in size and is used for storing policy. This strategy indicates that a buffer read from a s~iiallread-only table (item) in memory. The default disk will not be used in the near klture and hence is cache holds the remaining tables. Figure 2 shows one a good candidate to be replaced fi-om the caclie. The table bound to the item cache and the other tables buffer management algorithms leave this buffer close bound to the default cache. By being able to partition to the least-recently-used end of the buffer chain. In the use of memory for the item table separately, the the simplest example, a sequential scan of a table uses SQL Server is now able to take advantage of MMDB this strategy. With VLM, this strategy is turned off techniql~esfor only the item cache. if the table can be co~npletelycached in memory. The Each named cache can be larger than 4 CB. The size fetch-and-discard strategy can also be tiuied by appli- is liniited only by the amount of memory present in cation developers and DBAs if necessary. the system. Although we do not expect such a need, it is also possible to have hundreds of named caches; Housekeeper 64-bit pointers are used throughout the SQL Server to address large memory spaces. One of the motivations for developing VLM was the The LMM enables the DBA to fine-tune the use of extremely quick response time requirements for trans- memory. The I,MM also allows for the introduction actions. These environments also require high avail- ofspecific MMDB algorithms in the SQL Server based ability of systems. A key component in achieving high on the semantics of database entities and the size of availability is the recovery time. Database systems named caches. For csample, in tlie f~~ture,it becomes write dirty pages to disk primarily for page replace- possible for a DBA to express the fact that most of one ment. The checkpoint procedure writes dirty pages to table fits in one named cache in memory, so that SQL disk to minimize recovery time. Server can use clock buffer replacement. The Sybase System 1 1 SQL Server introduces a new thread called the Housekeeper that runs only at idle VLM Query Optimization time for the system and does useful work. 
This thread is the basis for lazy processing in the SQL Server for The SQL Servcr q~~eryoptimizer co~nputesthe cost now and the k~ture.I11 System 11, the Houselteeper of query plans in terms of CPU as well as I/O. Both writes dirty pages to disk. At first, it writes pages to disk from the least-recently-used buffer. In this sense, it helps page replacement. In addition to ensuring that there are enough clean buffers, the Housekeeper also PROCEDURE CACHE, 0.5 GB attempts to minimize both tlie checkpoint time and I I the recovery time. If the system becomes idle at any I ITEM CACHE, 1 GB time during transaction processing, even for a few mil- liseconds, tlie Housekeeper keeps the dislts (as Inany as possible) busy by writing dirty pages to disk. It also makes sure that none of the dislts is overloaded, thus DEFAULT CACHE, preventing an undue delay if transaction processing resumes. In the best case, the Ho~~sekeeperautomati- cally generates a free checkpoint for the system, thereby reducing the performance impact of tlie checkpoint during transaction processing. In steady Figure 2 state, the Housekeeper continuously writes dirty pages Table Bindings to Named Caches with Logical lMe11101.)~Manager to disk, while minimizing the number of extra writes incurred by premature \vrites to disk."'

Digiral Ethnical Journ;ll Checkpoint and Recovery the average) 8 transactions to complete, assuming mi- form arrival rates at commit point. This indicates a nat- As thc sizc of memory increases, the following nvo ural grouping of S transactions per log \\lritc. For the hctors increase as well: (1) thc number of\\.rites to same system, if the log dislc is ratcd at 3,600 rpm, the disk during the checkpoint and (2) the nu~nbcrof same calculation yields 16 transactions per log write. disk I/Os to be done during recovery. The Sybase The group conimit algorithm used by the SQL Syste~n11 SQL Server allo\vs the 1'>13A to tune the Server also talccs advantage of disk arrays by initiating amount of buffers that will be kept clean ill1 tlic time. multiple asyuclironous \\~ritcsto different mcmbcrs of This is called the wash region. In essence, tile \vash the dislc array. 'l'hc SQL Scrvcr is also able to jss~~cup rcgion represents the amount of memory that is al\\~ays to 16 kulobytcs in one write to a single disk. Togetlicr, clean (or strictly, in the process of being written to the group commit algorithnl, large writes, and the disk). For example, if the total amount of mcmory for ability to drive multiple disks in a disk array eliminate database buffers is 6 GB and the wash rcgion is 2 GB, the log bottleneclc k)r high-throughput systems. then at any time, only 4 GB of memory can be in an ~~pdatedstate (dirty). The ability to tune the \\rash Future Work rcgion reduces the load 011 the checkpoint procedure, as \vcll as recovery. VWien a VLM systcm tails, tlie large number ofclatn- The Sybase System 11 SQL, Server 113s implemented base buffers in mcmor!, that are dirty necd to be a f~zzyclieckpoint that allows transactions to proceed reco\~esed.Thercforc, database rccovery time gro\ils even cl~~ringa checkpoint operution. Tr;uisactions with the size of mcmory in the VLM systcm, at least :Ire stalled only when they try to ~~pdatc3 database for all database systems that irsc log-based recovery. page that is being written to disk by the checkpoint. In addition, since there are a large number of dirty Even in that case, the stall lasts only hr tlie time buffers in memory, the pcrfi)rmance impact of clicck- it takes the disk write to complete. In addition, in point on transactions also increases with memory sizc. the SQL Server, the chcckpoint process can keep mul- To minimize tlic recovery timc, one may increase the tiple disks busy by issuing a large number of asynchro- checkpoint frcclucncv. The checkpoints have a higher nous \vrites one after another. During the timc of impact, liowe\,cr, anci need to be done infrecluentl!: the checkpoint, the Ho~~sekecpcroften becomes TIicsc conflicting requirements need to be acldresscti active due to cstra idle time created by the clieckpoint. for VLIM s!stcms. The Ho~~sckeeperis self-pacing; it docs not S\\~JIII~the When a ~i'itab,iscfirs in mcmor!l, the buffer replacc- storage systcm \\lit11 writes. 1iie1it algorithm can be eliminated. For example, fi)r a single table that fits in one named cache, this opti- Commit Processing mization can be done with the LMM. In addition, if a table is read-only, it is possiblc to minimizc the syn- Thc SQL Server uses the group commit algorithm to chronization necessary to access tlic buffers in mcm- improve tl~roughput."'~The group commit algorithm ory. 
These optimizations require syntax for the DBA collects the log records of multiple transactions and to speci@propcrtics (for cxa~nplc,read-only) of tables, \\!rites them to thc disk in one I/O. This allo\vs higher as well as propcrtics of named caches (for cxamplc, transaction througliput due to the a~nortiz~tionof buffer rcplaccment ;~Igorithms). disk 1/0 costs, as \\tell as coniniitting more and more Tliesc two arcas .is \\lcll as other IMMDK tcchniclucs transactions in each disk \\!rite to the log file. The SQL \\!ill be cxplorcd by the SQL Serves developers for Scrvcr does not use a timer, however, to improve the incorporation in fi~ti~rcreleases. gro~~pingof transactions. Instead, the duration of the pr~\lio~~slog I/O is used to collect transactions to bc Summary committed in the next batch. Thc sizc ofthc batch is determined by the number of transactions that rcach The Sybasc System 11 SQL Scrver supports VLM commit processing during one rotation of the log systems built and sold by DIGITAL. Tlic SQL Server disk. This self-tuning algorithm adapts itself to \~nrio~~s can complctcly c;~chcparts of a database in memory. speeds of disks. For tlie same transaction processing It can also cache tl~centire database in mcnqory if systcm, the grouping occurs more often \vith slo\vcr tlie datdbasc sizc is smaller than tlie amount of mcnl- disks than with hster disks. ory. Svstc~n11 has hcilitics that address issues of Consider, for example, a system performing 1,000 fast access, checkpoint, and recovery ofVLM systems; transactions per second. Let LIS assilme the log disk is these facilities arc tlic Logical ~MernoryManager, the ratcd at 7,200 rpm. Each rotarion of the disk takes VLM clucry optimizer, the Housekeeper, and fi~zzy 8 nlilliseconds. Within this duration, we expect (on checkpoint. The SQL Scrvcr product achic\.ed

Vol. 8 No 4 1996 SIMP T1'C performance of 14,176 tpmC at Biographies $19S/tpmC on a DIGITAL VLlCI system. The tcch- nolo81de\ielopcci in System 11 lays the ground\vork for f~~rtherimplementation of MM1)R tcchniqucs in the SQL Scrvcr.

Acknowledgments

PVe gratefi~llyackliowlcdge the various members of thc SQLServer devcloprncnt team who contributed to the VLM capabilities described in this paper.

References and Notes T.I<. Rengarajan 1. For more i~iforniationabout a~~ditcdtpmC rncnsure- 'r. I<. Rc~~garaj'inhas been building high-performance ments, see the Transaction Processing Pcrfor~iiance databasc systems for tlie past 10 years. He no\\3leads the Council home page on the World Wide Web, Server Pcrfornlancc Enginccring and llcveloprnent (SPeel)) littp://\\o\?v.tpc.org. Group in SQI,Ser\,er Engineering at Sybase, Inc. His most rcccnt focus has been System 11 scalability and self-tuning 2. S.-0. Hvasslio\rd, 0. Torbjornsen, S. UI-atsbcrg, and algorirli~ns.Prior to joining Sybasc, hc contributed to tlic 1'. Holagcr, "The ClustRa 'T'clcco~l~l)at,ibase: Higli DEC Kdb s!~stc~ii'it LIIGITAI. in the drcas of buffer man- Availability, Higli '~Ihroughput, 21ld Ileal-Time agclncnt, high availability, 01.1'1' pcrforniancc on Alpli~ b1.S. Response," Pt.ocecdi71gs (q. /he Llst I/o:y Large systems, and multimedia databdses. Hc holds degrees in co~iip~~ter-aideddesign and computer science from tlie Datal?asc Cor2Ji.r-o?ce, Zurich, Switzerland, 1995. University of ICentucky and rhc Uni\,crsity of Wisconsin, 3. H. Jagatiisli, 1). Licu\vcn, R. Rastogi, A. Silbcrscliatz, respectively. and S. Sudharshan, "lhli: A High Pcrtbrrnance &lain ih4emory Storage l\4anager," Procee~lirzgsqf'tbe 20tIg Vcty Lotfc Datnh~lse CoiIJC.r-erzcc~Conjkrence, Santiago, Chile, 1994. 4. iM. Hc!jrc~is, S. Listg'~rtcn, &I.-A. Nejrnar, and I<. Widkinson, "Smallbase: A Main-iMcmory 1ILIMS fol- High-l'crforrnance Applic,itionsn (1995). 5. H. Garcia-Molina and I<. Salem, "ivlnin Memory Database Systems: An O\,er\~ic\\/,"1tl:'E T~zitzsnctions otz K?lo~cler(qectrzd D61lu Er?girreerir~g.\lol. 4, no. 6 (1992): 509-516. 6. D. Ga\vlick and D. IGnkadc, "Var~cticsofConcurrcnc)~ Max\vell Berenson Control in IMS/VS Fnst Path," D61tctl?nseErzgirzeer- Mas Bercnson is a staffsofnva~.~cnginccr in the Ser\,e~. ir?g B11llelir7.vol. 8, no. 2 (1985): 3-10. Pcrformancc Enginccring and De\~cloprncntGroup in SQI, 7. E. Levy and A. Silbcrscliatz, Ir~cteir~e~z/ulRecouo:)~ Server E~iginceringat Syb'ise, Inc. L3uring his fbur !tears ~t i17 ibI61itz 11/Iet1lo~lDU~~IIZIS~ S)c\,cloprncnt Group 'it S!ib,~sc, Inc. Professional Technical Reference, 1996). He \\(as a member oftlic [cam th~timplcmcntcd the House- keeper in System 11. In 'lddition, lie has \vorkcd on a num- 10. Sl'bctse S)lstetil 1 1 SQL .Set~'erDOCIII~IC~I~~I~~~I~ Set ber of projects that II~I\TCc~ih~~~iccd tlic pcrfor~n.i~ice and (Eniery\~illc,C,ilif.: Sybase, Inc., 1996). sc,lling of the Sybasc SQL Server. At prcscnt, he is wol-king on a pcrforniancc fc,lrurc for a11 upcoming release. He 11. 1'. Spil-o, A. Joslii, and T. licngdrajan, "Designing holds bachelor degrccs in ad\,'inccd physics and in elec- 'In Optilnizcd Tr.i~isdctio~lCommit I'rotocol," Di;qital tronics and commurlication engineering kom the Indian Tcchr7ical ,/out.~zal.vol. 3, no. 1 (Winter 1991): Institute of Science, Ba~~galorc,111di'i. 70-78.

111g1rATcchnic~l Journal Vol. 8 No. 4 1996 87 Bruce McCready B~LICCMcCrcady is an SQL Scr\,cr performance cngincer in the Scrvcr Pcrfornlancc Engiiiccriiig and De\,elopment Group at Sybase, Inc. Iirucc rccci\.cd a K.S. in computer scie~~cekom the University of <:cllili)rnia at Bcrkclcy in 1989.

Sapan Panigrahi A scnior performance engiliccr, Sapan Panigrah works in rlic Scrvcr Pcrformancc Engineering and Development Group ;~tSybasc, Inc. Hc \\,,is responsible for TPC bencli- marks and perforniancc analysis li)r the S!rbnsc SQL Scr\,cr.

Srikant Subramaniam A member of the Server Pcrli)rmance Engineering and l)c\lcloprnent Group at Syhasc, Inc., Srikant Subraninniam \\,as invol\led in the design and implementation of the VLM support in the Sybase SQI, Scrvcr. He was a member of the team that iinplcmcntcd the Logical R/Ieniory Manager in System 11. In ;iddition, lie Iias worlied on projects ttint Iia\lc cnhanccd the perfol~iiiancca~id scaling ofthc Sybnse SQL Scrver. At present, lie is working on performance optimizations for an upcoming rclcasc. He holds an M.S. in comlxiter science fi-on1 the University ofSaskatchc\\~an, C:anaiia. His specialn arcn was thc performance of sliarcd- ~~~~~~~~y ~ii~~ltiproccssorsystems.

Marc B. Sugiyania Marc Sugiyarna is a statTsofn\~nrccnginccr in the SQL Scr\lcr Pcrfor~nanccEnginccri~ig and 1)cvclopment C~LIF) ;~tSpbasc, Inc. He \\,.IS the tcchoical lcad for the original port ofSybase SQL Servcr to the 1)IC;ITIL Alpha OSF/I system. He is coauthor of Syh~~.sol'c~t:/i)rr~~n~~ce TLIIZI~I~. pi~blislicdby Prentice Hall, 1996.

88 1)igiral Technical Journal Vol. 8 No. 4 1996 I David P. Hunter Eric B. Betts Measured Effects of Adding Byte and Word Instructions to the Alpha Arch itect ure

The performance of an application can be The Alpha Architecture and its initial implementations expressed as the product of three variables: were limited in their ability to manipulate data values (I)the number of instructions executed, (2) the at the byte and word granularity. Instead of allowing single instructions to manipulate byte and word val- average number of machine cycles required to ues, tlie original Alplla Architecture rcquired as nlall)l execute a single instruction, and (3) the cycle as sixteen instructions. Recently, DIGITAL extended time of the machine. The recent decision to thc Alpha Architecti~reto manipulate byte and word add byte and word manipulation instructions data values with a single instruction. The second gen- to the DIGITAL Alpha Architecture has an effect eration of the Alpha 2 1164 microprocessor, operating upon the first of these variables. The perfor- at 400 megahertz (MHz) or greater, is tlie first imple- mentation to include the new instructions. mance of a commercial database running on This paper presents the results of an analysis of the Windows NT operating system has been the effccts that the new instructions in the Alpha analyzed to determine the effect of the addition Architecture have on the performance, code size, and of the new byte and word instructions. Static dynamic instruction distribution ofa consistent esecu- and dynamic analysis of the new instructions' tion path through a comnlercial database. To exercise effect on instruction counts, function calls, and tlie database, \ye modified the Transaction Processing Performance Council's (TPC) obsolcte Tl'C-I3 bench- instruction distribution have been conducted. mark. Although it is no longer a valid Tl'C bench- Test measurements indicate an increase in per- mark, the TI'C-B benchmark, along with other TPC formance of 5 percent and a decrease of 4 to benchmarl

Dig~talTechnical Jo~~rnnl Vol. 8 No. 4 1996 89 transaction ciccrcased from 115,895 to 107,854 and 12-entry ITB. Tlic chip contains tlircc on-chip (a 7 percent rc~i~~ctio~l). cachcs. Tlic Ic\~clone (Ll) cachcs inclucic an 8-KB, Tlic rest of this papcr is divided as follo\\~s:\\,c begin ciircct-~nappcd I-caclic and an 8-ICE, du;il-ported, \\lit11 a brief o\~cr\~ic\\~of the Alplia Architccturc n~lciits clircct-nxlppcd, \\,rite-through D-cache. A tliirci introduction of tlic new bl~cand \\.ord manip~~latio~i on-chip c~clicis a 96-Ia, three-\\nay set-associnti\,c, i~istructions.Next, \\,c cicscribe the hardnrare, sob\-arc, \\.rite. hack ~iiixed instruction and data cachc. The and tools i~sedin our cspcriments. Lastly, \vc pro\*idc floating-point pipeline was reduced to nine stages, ancl an analysis of tlie instruction distribution and coiunt. tlic <:I'U has t\\.o integer units and turo tloating-point cscci~tion1111its.' Alpha Architecture The Exclusion of Byte and Word Instructions The Alplin Architccturc is a 64-bit, loacl and store,

rcciuccd instr~~ctionsct computcr (RISC) ? ~c.-IIltccturc ' The original Alpha Architect~~rcintended that opcra- that was designed \\lit11 high performance and longcv- tions in\/ol\lcci in loading or storing aligncci bytes and ity in mind. Its major areas of concentration arc \\lords would involve sequences as given in Tables 1 tlic processor clock spccd, the multiple instruction and 2."' As many as 16 additional instructions ure issue, and multiple proccssor implementations. For 11 rccli~ircdto accomplish these operations on unaligned detailed account of tlic Alplia Architecture, its major darn. Tlicsc same operations in the MIPS Arcliitccturc dcsign choices, and o\,crall benefits, see the paper invol\,c only a single instruction: I,R, I.W, SR, and by 11. Sitcs." l'hc original arcliitect~~redid not dctinc SW." Tlic MIl'S Architect~~rcalso incl~rdcssinglc tlic capabilit\r to manipulate b!,tc- and \$lord-lc\'cl instructions to do the same for ~~naligncdJat~. (;i\vn data \\*it11a singlc instruction. As a result, the first n situation in \\~hichall other factors arc consistent, this three implcmcntatio~isof the Alpha A-chitecturc, tlic \vould appcw to give the MIPS Architecture an ad\,an- 2 1064, the 2 1064A, and the 2 1 164 microproccssors, tngc in its ability to reduce the nunbcr of instructions \vcrc forced to LIS~as many as sistccn additional cxccutcci per \\lorkload. instructions to accomplish this task. The Alpl~a Sitcs has prcscntcd several key Alpha Architccturc Arcliitcct~rrc\\!as recently cstcnded to include six new dcsign decisions.' Among them is the decision not to i~.~structionsfor manipulating data at byte and \\lord includc byte load and store instructions. Iky dcsign boundaries. The sccond implenientation of the 21 164 nssumptions related to the esclusio~iof tlicsc fcaturcs L~~iiilyof microproccssors i~lcludesthese extensions. inclutic thc follo\\ling: The f rst i~iiplc~iicntatio~iof the Alpha Arclii- The majority of operations wo~ildinvolve naturally tccturc, the 21064 microprocessor, was intro- aligned data elements. duced in Novc~iibcr 1992. It was fabricated in a 0.75-micromctcr (prn) complementary metal-oxide se~niconductor((;IMOS) process and operatcci at Table 1 spxds up to 200 !MHz. It had both an 8-kilobyte Loading Aligned Bytes and Words on Alpha (ItB), direct-mapped, \\!rite-through, 32-byte line instr~~ctioncachc (I-cache)and data cache (D-cache). Load and Sign Extend a Byte Tlic 21064 rnicroproccssor nlas able to issuc hvo LDL R1, D.lw(Rx) instructions per clock cycle to a 7-stage intcgcr EXTBL R1, #D.mod, R1 pipcline or a 10-stage floating-point pipcli~ie.' The sccond implcmcntatio~iof the 21064 generation \\!as Load and Zero Extend a Byte the Alpha 21064A niicroprocessor, introduced ill October 1993. It was manufactured in a 0.5-pm LDL R1, D.lw(Rx) CMOS p~)ccssm~i operated at spceds of 233 MHz to SLL R1, #56-8*D.m0d, R1 275 MHz. This implementation increased the size of SRA R1, #56, R1 tlic I-caclic and 1)-cache to 16 1U3. Various other tiif- fcrcnccs exist between the two implementations and Load and Sign Extend a Word arc outlinccl in tlie product data sheet." The Alpha 21164 microprocessor was thc second- LDL R1, D.lw(Rx) gcncration implc~ncntationof the Alpha Architecture EXTWL R1, #D.mod, R1 and was introduced in October 1994. It was rnanu- fact~~rcdill a 0.5-p~iiC:NIC)S technology and has tlic Load and Zero Extend a Word abiliy to issuc fo~~rinstructions per clock c),clc. It contains a 64-cntrp data. 
translation buffer (l>TB) and LDL R1, D.lw(Rx) a 48-entry instruction translation buffer (ITB) co111- SLL R1, #48-8*D.mod, R1 narcd to the 2 1064A microprocessor's 32-entry DTB SRA R1, #48, R1

90 1)ipit;ll Tc~hnic.ilJournal Table 2 Windows NT, to the Alpha platform. The Windows Storing Aligned Bytes and Words on Alpha NT operating system had strong links to the Intel st36 Store a Byte and the MIPS Architectures, both of which included instructions for single byte and word manipulation." LDL R1, D.lw(Rx) This strong connection influenced the Microsofi devel- INSBL R5,#D.mod, R3 opers and independent sohare vendors (ISVs) to MSKBL R1, #D.mod, R1 favor those architectures ovcr the Alpha design. BIS R3, R1, R1 Another factor contributed to this issue: the major- STL R1, D.lw(Rx) ity of code bcing run on the new operating system came from the Microsoh Windosvs and MS-DOS en\+ Store a Word ronments. In designing soh~arcapplications for these two en\~ironments,the n~anipulationof data at the LDL R1, D.lw(Rx) byte and word boundary is prevalent. With the Alpha INSWL R5,#D.mod, R3 n~icroprocessor'sinability to accomplish this manipil- MSKWL R1, #D.mod, R1 lation in a single instruction, it suffered an average of BIS R3, R1, R1 3:l and 4:l instructions per workload on load and STL R1, D.lw(Rx) store operations, respectively, compared to those architectures with single instructions for byte and \\lord manipulation. In the best possible scheme for multiple instruction To assist in rilnning the ISV applications ~~nderthe issue, single bpte and write instructions to memory Windows NT operating system, a new tec.hnolog\~was are not allowed. needed that would allo\v 16-bit applications to run as if they cvere on the older operating system. Microsoft The addition of bytc and write i~istructionswould developed the Virtual DOS Machine (Vl>M) environ- require an additional byte shiher in the load and ment for the Intel Architecture and the Windows- store path. 011-Windows (WOW) environment to allocv 16-bit These factors indicated that tlie exclusion of specific Windows applications to work. For non-Intel architec- instructions to ~nanipulatcbytcs and words would be turcs, Insignia developed a VDM envjronment that advantageous to tlie performance of the Alpha emulated an Intel 80286 microprocessor-based com- Architecture. puter. Upon examining this emulator more closely, The decision not to include byte and word manipu- DIGITAL found opportunities for improving perfor- lation instructions is not without precedents. The mance if the Alpha Architecture had single bpte and original MIPS Architecture developed at Stanford word instructions. University did not have bytc instructions." Hennessy Rased upon this information and other factors, a et al. have disc~isscda series of hardware and software corporate task force was commissioned in March 1994 trade-offs for pcrhrma~icewith respect to the MIPS to investigate improving the general performance of processor.'%~ong those trade-offs are reasons for Windows NT running on Alpha machines. The hrther not including the ability to do byte addressing opera- DIGITAL studied the issues, the more convincing the tions. Hennessy et al, argue that the additional cost argument became to extend tlie Alpha Architecture to of including the mechanisms to do byte addressing include single byte and word instructions. was not justified. Their studies showed that word ref- This reversal in position on byte and word instruc- erences occur more frequently in applications than do tions was also seen in the evolution of the MIPS byte references. Hennessy et al. conclude that to make Architecture. 
In the original MIPS Architecture devel- a word-addressed nxichine feasible, special instruc- oped at Stanford University, there were no load or tions are required for inserting and extracting bytes. store byte i~istructions.'~However, for the first com- These instructions arc available in both the MIPS and mercially produced chip of the lMIPS Architecture, the the Alpha Architectures. MIPS R2000 RISC processor, developers added instructions for the loading and storing of bytes." One Reversing the Byte and Word Instructions Decision reason for this choice stemmed from the challenges posed by the UNIX operating system. Many implicit During the dc\~clop~iientof tlie Alpha Architecture, byte assumptions inside the UNIS lcernel caused per- DIGITAL supported two operating systems, OpenVMS formance problems. Since the operating system being and UL,TIUS. The developers had as a goal the ability implemented was UNIX, it made sense to add the bpte to maintain both customer bases and to facilitate their instructions to the MIPS Architecture.'" transitions to tlie new Alpha microprocessor-based In June 1994, one of the coarchitects of the Alpha machines. In 1991, Microsoft and DIGITAL began Architecture, Richard Sites, submitted an Engineering work on porting Microsoft's new operating system,

Digiral Technical Journal \hl.8 No. 4 1996 91 per e~ilulatedinstruction. When it is disabled or not Sites and Perl used an early \u-sio~lof the Microsoft present, the action taken depends upon the hardware SQL Server version 6.0, in which the fastest network support for tlie new instructions. If disabled in hard- transport available at that time was Named Pipes. In ware, thc instruction is treated as an illegal instruction; the final release of SQL Server version 6.0 and sub- if enabled, it is executed like any other instruction. sequent versions of the product, thc Transmission Control Protocol/Internet Protocol (TCP/IP) Microsoft SQL Server replaced Named Pipes in this category. Based upon this, we rebiiilt the libraries associated with TCP/11' To observe the effects of thc new instructions, we instead of those associated with Named Pipes. Other chose the Microsoh SQL Server, a relational database networking libraries, such as those for DECnet and management system (LWBMS) for the Windows NT Internehvork Packet Exchange/Sequenced Packet operating system. Microsofi SQL Server \\!as engi- Exchange (IPX/SPX), were not rebuilt. neered to bc a scalable, multiplattbrm, multithreaded RDBMS, supporting symmetric multiprocessing (SMP) systems. It nras designed specifically for distrib- Table 5 uted client-server computing, data warehousing, and Images and DLLs Modified for the Microsoft SQL database applications on the Internet. Server In an earlier investigation, Sites and Perl present a profile of the mcrosoft SQL Server running the TPC-B Sites V6.0 Function DLLIEXE DLLIEXE ben~hmark.~They identi@ the executables and DLLs that are involved in running the benchmark and break sqlserver.exe sqlservr.exe SQL Server Main down the percentage of time that each contributes to Executable the benchmark. Their results, summarized in Figure 1, ntwdblib.dll ntwdblib.dll Network show that only a few SQL Server executables and Communications DLLs were heavilp exercised during the benchmark. Library Merveri~ing these results with the SQL Server devel- opends50.dll opends60.dll Open Data Services Networking Library opment group at Microsofi, we decided to rebuild dbnmpntw.dll N/A V4.21A Client Side only the images and DLLs identified in Figure 1 to use Named Pipes Library the new byte and word instri~ctions. ssnmpntw.dll N/A V4.21A Named Pipes Table 5 lists thc cxccutables and DLLs that we niodi- Library fied and their correlation to tlie ones identified by Sites NIA dbmssocn.dll V6.5 Client Side and Perl. The variations exist because of lame changes TCPIIP Library of DLLs or the use of a different network protocol. We NIA ssmsso60.dll V6.5 Netlibs TCPllP changed nenvork protocols for performance reasons. Library

Figure 1 Images/DLL\ Invol\,ed in a TPC-B Trailsaction for Microsofi SQL Server Based on Sites and l'erl's Pui3l!rsis

Digital Technical lourd Vol. 8 No. 4 1996 93 Compiling Microsoft SQL Server to The application benchmark can be run in two dif- Use the New Instructions ferent modes: cached and scaled. Tlic cachcd, or in- menlory n~odc,is ~lsed to estimate the svste~n's Our goal nras to nicasure o~ilythe effects introduced rnaxini~~mperfi)r~nancc in this benchmark en\'r~ro~i- by using the new instructions and not effccts intro- mcnt. Tliis is accornplishecl by building a small database duced by diffcre~it\lersions or generations of compil- tliat resides complctcly in tlic databasc cache, which in ers. Therefore, \ile ~iccdcdto tind a way to use the same turn tits within the system's physical random-access version of a compiler that diffcrcd only in its use or mcmory (RAM). Since the entire databasc resides in nonuse of the necv instructions. To do tliis, \\lc used mcmory, 'dl I/O acti\~ityis eliminated \\it11 the cscep- a conipiler option available on the Microsok Visual tion of log \\!rites. Consequently, the bcnchmark only C++ conipiler. Tliis switch, available on all 1USC plat- pertbrms one disk 1/0 for cach transaction, once the forms that support Visual C++, allo\~athe gcncmtion c~ltircdatabase is rcad otYtlic disk and into the database of opti~nizcdcode for a spccifc proccssor within a caclic. The result is n representation of the r~iaximum proccssor family while maintaining binary conipatibil- number of tps tlint the system is capable of sustaining. ity with all processors in the processor family. Processor The scalcd mode is run l~si~iga bigger database cvith optimizations are accomplislicd by a combination of a largcr amount of ciisk I/O acti\,ity. The increase in specific code-pattern selection and codc scl~cduling. clisk I/O is a result of the need to rcad ;wd write data to The default action of the conlpiler is to LISC a blended locations tliat arc not \vithin tlic database cache. These model, resulting in codc that csccutcs ccl~~ally\\,ell addition31 rcads and \\,rites acid cstra clisk I/Os. The across all processors within a platkxrn Cimily. result is nonnall!l characterized as having to do one Using this compiler option, wc built two versions read ancl one \\?riteto the database and 3 single write to of the aforenlentioned images \\jitliin the SQI, the transaction log for cach transaction. Tl~ecombina- Server application, varying only their use of the codc- tion of a largcr database and additional 1/0 activity generation switch. The first version, refcrrcd to as the decreases the tps wluc from the cached \.crsion. Rased Original build, \\!as built \vitho~ltspcci@ing a11 argil- upon our prc\,io~~sc~pcrie~~cc running this bcnchmark, ment for the code-generation switch. The sccond one, the scalcci bc11cl11na1-kcan be cspcctcd to reacli approx- referred to as Byte/VVord, set the s\iritcl~to generate imately 80 percent of the cachcd pcrfi)rmance. code patterns sing the new byte and ~i~ordnlanipula- For the scalcd tests, \\re built a datab;ise sizcd to tion instructions. All other rccl~liredfiles came from rlic accommodate 50 tps. Tliis \\,as less than SO percent SQL Server \.ersion 6.5 Beta I1 distribution C1)-ROM. of the masim~~mtps produced by tlie cached results. We chose this size because \\re \\!ere concentrating The Benchmark on isolating n single scalcd tunsaction under a ~nodcr- The be~~chmarkwe chose was derived fio~rlthe TPC-1% ate load and not under the maximum scaled perfor- benchmark. As prc\riously mentioned, the TI'<:-B mance possible. 
bcnchmark is now obsolete; howe\rer, it is still uscfi~l for stressing a database and its interaction with a com- Image Tracing and Analysis Tools puter system. The TPC-B benchmark is relatively Collecting only static nicas~~remcntsof tlic executables easy to set up and scales readily. It has been used by and l>I..Ls affcctcd was insufticicnt to determine tlie both database vendors and computer ~iianufacturcrs applicability of the new instructions. We collected the to measure the performance of either the computer actual instruction traccs of SQL Server while it exc- systeni or the actual database. We did lot inclucic all c~ltedthc application bcncIi~narl<.Furtliermorc, \llc the required metrics of tlic TI'<:-B bclich~llark;tlicre- decided tliat thc ~bilityto trace the actual instri~ctions fore, it is not in full co~npliance\ilith p~~blislicdg~~ide- being cscci~tcd\\pas more desirable than dc\~clopingor lines of the TPC. We refer to it henceforth sinlply as cstcnding a simulator. To obtain tlie traccs, we needed thc application benchmarl<. n tool that \\fo~~ldallow us to The application benchmark is characterized by sig- Collect both system- and user-modc codc. nificant disk 1/0 activity, moderate systc1n and applica- tion execution tin~e,and transaction integrity. The C:ollecr f~nctiontr'iccs, \i~liicli\\could allow us to application benchmark exercises and measures the cffi- align the starting 31id stopping points of different ciency of the processor, 1/0 architecture, and RDBMS. benchmark runs. The results lncasure perforn~anccby indicating how Work witlio~~tniodi%ing cithcr the application or many sinlulated banking transactions can be corn- the operating system. pleted per second. This is defined as transactions per In tlic past, tlic only tool that would provide second (tps) and is the total n~~~iibcrof committed instruction traccs ~~nclcrtlic Windoiia NT optrating transactions that were started and co~npletcdduring system WAS the debugger running in single-step mode. the measurement interval.

Digital Technical Journal Vol. 8 No. 4 I996 Obtaining traces through either the ntsd or tlie Results windbg debugger is quite limited due to the following problems: We collected data on three different experiments. In the first investigation, we loolted at the relative perfor- The tracing rate is only about 500 instructions per mance of the three different versions of the Microsofi second. This is far too slo\v to tracc an)~tliingother SQL Server o~~tlinedin Table 4. \Ye conipared the than isolated pieces of code. three variations using the cached version of tlie appli- Thc tracc fiils across spstcln calls. cation benchmark. The trace loops infinitely in critical section code. I11 tlie secolid experiment, we observed how tlie Register contents arc not easily displayed for each new instructions affect the instruction distribution in instruction. the static images and DLLs that we rebuilt. We com- pared the Byte/Word versions to the Original \lersio~is Real-time analysis of instruction usage and cache of the images and DL,Ls. We also attenlpted to link tlie niisses are not possible. differences in instruction counts to tlie use of the new Instruction traces can also be obtained using the instri~ctions. PatchWrks trace analysis tool.' Although this tool Lastly, we in\estigated the variation benveen the operates with near real-time performance and can Original and tlie Uyte/\'Vord versions with respect to trace instructions execiiting in I

Digital T&d Journal Vol. 8 No. 4 1996 Table 6 lnstruction Deltas (Normal Minus ByteIWord) for the SQL Server Images and DLLs

Instrutlion dbmrsocn.dll ntwdblib.dll opendr60.dll rqlse~r.exe rrmsso60.dll Instruction dbmrrocn.dll ntwdblib.dll o~ends60.dll ralservr.exe srmsso60.dll

Ida xor ldah 511 Idl sra Idq srl Idq-l cmpbge Idq-u mskbl st1 rnskwl stb zapnot stw addl stq addq s1q-c s4addl beq crnovge bge crnovne bgt crnovlt blbc cmovlbc blbs callsys blt extqh bne ldwu br ldbu bsr mull ret sub1 cmpeq subq crnplt insll crnple inswl cmpult call-pal crnpule extlh and insbl bic extll bis extbl ornot extwl

size, tlie nunibcr of fi~nctions,the number and type of Of the rebuilt iniages and Dl,J.s, sqlscr\~r.csealia nc\v byte and \vord instructions, and lastly, nop and opc11ds60.dll sho\\cd the most \,ariations, \\,it11the fie\\, 11-apbinstr~~ctions. The results are presented in Tablcs instructions making 3.73 percent ;lnd 3.9 percent 7 through 10. of these tiles. The most frcqlrcntly occul-ring nc\v We expected that the instructions used to ~iianipulatc instruction was Idbu, follo\vcd b!l Id\\,~r.Thc least- bytes and \\lords in the original Alpha Arcliitccturc used instructions \\we sextb and scsnv. The size of (Tables 1 and 2) ~\~oulcldecrease proportionally to the the imagcs \\,as rcduccd in thrcc out of tivc images. usage of the nc\v instructions. Thcse assumptions hcld The image size rcd~~ctionranged from negligible to true fix all the images ancl DLLs that llsccl the new just over 4 pcrccnt. In all cascs, the size of the code instr~~ctions.For example, in the original Alpha section \\,as ~.cduccdand rangcci from insig~iificant Architecture, the illstructions MSKBL and MSKWI, arc to approximately 8.5 percent. There \\,as no change in used to store a byre and ivord, respccti\,cly. In tlic the number of fi~nctionsin any of the tiles. sqlser\~r.excimagc, tlicsc two instructions sho\\.cti a dccrcase of 3,647 and 1,604 instructions, rcspccti\,clv. Dynamic Instruction Counts Compare this \\rich the corresponding addition of 3,969 We gathered dntn fron~the application benchmark STB and 2,798 STW instructions in the s,lmc inl.~gc. running in both cached and scaled ~nocics.M7c ran at Looking furtiicr into the sqlservr.cse imagc, wc also saw least ouc iteration of the benchmark tcst prior to gath- t11at 10,231 L1>13U instl-uctions \vcre ~~scciand tlic ering trace data to allo\\! both the Winclo\vs NT oper- usage ofthc EXTB1,instruction was reduced hy 10,656. ating systcm ancl tlic Microsofi SQl. Scr\cr ciat~bascto Although these numbers do not correlate on a one-tbr- reach a stcady stntc ofopcratio~~on the system u~lder one basis, \vc bclic\c this is clue to other ~~s'igcof thcsc tcst (SL"I'). Stci~ci!~st.ttc \\,as acliic\,cci \\,hen the SQL instructions. Other usagc might includc the compiler Server caclic-hit ratio reached 95 percent or greater, SC~CIIICfor introducing the II~L\~instructions in plilccs the n~unbcrof transactio~isper sccond \vas constant, \\~licreit uscd an I .l)L, or nn Ll3Q in tlie Original image. 2nd the <:1'U utilization \\pas as close to 100 pcrccnt JS possible. The traces were gathered over a sufficient Table 7 ByteIWord Images and DLLs

ImageIDLL Total Total Total Number Total File Text Code of Byte1 % Byte1 LDBU LDBU LDWU LDWU STB STB STW STW SEXTB SEXTB SEXTW SEXTW Total Total Bytes Bytes Bytes Functions Word Word Count 9b Count % Count % Count % Count % Count 9b NOPs TRAPB

sqlservrexe 8053624 2981148 2884776 3364 dbrnssocn.dll 13824 5884 5520 13 ntwdblib.dll 318464 246316 231688 429 opends60.dll 212992 104204 97240 243 ssrnsso60.dll 70760 9884 9128 19

Table 8 Original Build of Images and DLLs

ImageIDLL Total Total Total Number Total File Text Code of Byte1 %Byte/ LDBU LDBU LDWU LDWU STB STB STW STW SEXTB SEXTB SEXTW SEXTW Total Total Bytes Bytes Bytes Fundions Word Word Count % Count % Count % Count % Count % Count % NOPs TRAPB

sqlservrexe 8337248 3264108 3153480 3364 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6207 2252 dbmssocn.dll 13824 6012 5656 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 ntwdblib.dl1 318464 246620 231904 429 0 0 0 0 0 0 0 0 0 0 0 0 0 0 770 10 openddO.dll 222720 114012 105536 243 0 0 0 0 0 0 0 0 0 0 0 0 0 0 405 128 ssmsso60.dll 71284 10300 9424 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0

Table 9 Numerical Differences of Original Minus ByteIWord Images and DLLs

ImagelDLL Total Total Total Number Total File Text Code of Byte1 %Byte/ LDBU LDBU LDWU LDWU STB STB STW STW SEXTB SEXTB SEXTW SEXTW Total Total Bytes Bytes Bytes Fundlons Word Word Count % Count % Count % Count % Count % Count % NOPs TRAPB

lsqlservrexe -283624 -282960 -268704 0 t26869 * 4 +I0231 -38 +6320 t24 +3969 +15 +2798 ~10 +I39 +1 +3412 +13 278 -33 dbrnssocndl1 0 -128 -136 0 -18 +1 t9 +SO -4 t22 .-3 t17 -2 +ll 0 0 0 0 +5 0 ntwdbl~bdl1 0 -304 -216 0 +9 0 +3 -33 0 0 -1 +ll -5 A-56 0 0 0 0 3 0 opcnds60dll 9728 -9808 -8296 0 +948 +4 +464 +49 +I93 +20 2216 +23 +59 16 -9 -1 +7 +1 14 0

ssmsso60.dll -524 --416 -296 0 +67 I3 +18 +27 -35 +52 +7 +10 +3 +4 -4 .-6 0 0 +7 0

Table 10 Percentage Variation of Original Minus ByteIWord Images and DLLs

ImagelDLL Total Total Total Number Total File Text Code of Byte1 % Byte/ LDBU LDBU LDWU LDWU STB STB STW STW SEXTB SEXTB SEXTW SEXTW Total Total Bytes Bytes Bytes Functions Word Word Count h Count % Count % Count % Count % Count % NOPs TRAPB sqlservrexe 3 402% -8.669% -8.521% 0000% NIA NIA N/A NIA NIA NIA NIA NIA NIA NIA N/A N/A NIA NIA 4.479% - 1.465% dbmssocndl1 0 000% -2.129% -2.405% 0000% N/A NIA N/A NIA NIA N/A N/A NIA NIA N/A N/A N/A NIA NIA +31.250% N/A ntwdblib dl1 0 000% -0.123% -0.093% 0000% NIA N/A NIA NIA NIA N/A N/A NIA NIA NIA N/A NIA NIA NIA -0.390% 0.000% opends60.dll 4 368% -8.603% -7.861% 0 000% N/A NIA NIA N/A NIA NIA N/A N/A NIA N/A N/A NIA NIA N/A -3.457% 0.000% ssrnsso60 dl1 -0 735% -4.039% -3.141% 0 000% NIA NIA N/A NIA NIA NIA NIA NIA NIA NIA NIA NIA NIA NIA +38.889% NIA period of time to ensure that we capt~rredsc\icral tr~ce,ancl Fig~~rc3 sIio\vs an example output for a transactions. The traces were then cclitcd into separate hrnction tracc from Ntstcp. Since Ntstcp can attach to individual transactions. The gcomctr-ic mean was 3 r~~nningprocess, \vc al lowed the application bencli- talcen from the resulting traces and used for all subse- mark to achieve steady state prior to data collection. quent analysis. 1 his :lpproach ensured that did not scc the effects of We used Ntstcp to gather cornpletc instruction and \\)arming ~rpcithcr the machine caclics or the SQL hnction traces of both versions ofthe SQL Server data- Server database c~clic.Each instruction tracc consisted base while it executed the application benchmarl<. of approxim,ltcIy one n~illioninstr~rctions, \vliich \\,as Figure 2 shows an example o~~tputfor an insrl-~lction s~rfficiclitto cover rii~rltib>lctr~risactio~is. The data \\,as

** Breakpoint (Pid Oxdl, Tid Oxb2) SQLSERVR.EXE pc 77f39b34 ** Trace begins at 242698 opends60!FetchNextCommand 00242698: 23deffbO Ida sp, -5O(sp) I/ sp now 72bff00 0024269~: b53e0000 stq SO, O(sp) I/ 3072bff00 = 148440 002426aO: b55e0008 stq sl, 8(sp) /I @072bff08 = 0 002426a4: b57e0010 stq s2, IO(sp) I/ @072bff10 = 5 002426a8: b59e0018 stq s3, 18(sp) // @072bff18 = 1476a8 002426ac: b5be0020 stq s4, 20(sp) /I @072bff20 = 2c4 002426bO: b5de0028 stq s5, 28(sp) /I 3072bff28 = 41 002426b4: b5fe0030 stq fp, 30(sp) /I 3072bff30 = 0 002426b8: b75e0038 stq ra, 38(sp) I/ @072bff38 = 242398 002426bc: 47f00409 bis zero, aO, SO /I SO now 148440 002426~0: 47f1040a bis zero, al, sl /I sl now 72bffa0 002426~4: 47f2040b bis zero, a2, s2 /I s2 now 72bffa8 002426~8: d3404e67 bsr ra, 00256068 I/ ra now 2426cc opends60!netIOReadData 00256068: 23deffa0 Lda sp, -6O(sp) /I sp now 72bfea0 0025606~: 43f 10002 add1 zero, al, tl /I tl now 72bffa0 00256070: b53e0000 stq SO, O(sp) /I @072bfea0 = 148440 00256074: b55e0008 stq sl, 8(sp) /I @072bfea8 = 72bffa0 00256078: b57e0010 stq s2, IO(sp) /I @072bfeb0 = 72bffa8 0025607~: b59e0018 stq 53, 18(sp) /I @072bfeb8 = 1476a8 00256080: b5be0020 stq s4, 20(sp) /I @072bfec0 = 2c4 00256084: b5de0028 stq s5, 28(sp) // @072bfec8 = 41 00256088: b5fe0030 stq fp, 30(sp) // @072bfed0 = 0 0025608~: b75e0038 stq ra, 38(sp) /I @072bfed8 = 2426cc 00256090: ald01140 Id1 s5, 1140(a0) I/ @00149580 1479e8 00256094: 47f00409 bis zero, aO, SO /I SO now 148440 00256098: alfOOldO Id1 fp, ldO(a0) /I @00148610 dbbaO 0025609~: 47e0340d bis zero, #I, s4 /I s4 now 1 002560aO: a0620000 Id1 t2, O(t1) I/ @072bffa0 155~58 002560a4: b23e004c st1 al, 4c(sp) /I @072bfeec = 72bffa0 002560a8: b25e0050 st1 a2, 50(sp) /I @072bfef0 = 72bffa8 002560ac: b27e0054 st1 a3, 54(sp) I/ @072bfef4 = 1476a8 002560bO: e460001d beq t2, 00256128 // (t2 is 155~58) 002560b4: 220303e0 Ida aO, 3eO(t2) I/ a0 now 156038 002560b8: 47f00404 bis zero, aO, t3 I/ t3 now 156038 002560bc: 63ff4000 mb I1 002560~0: 47e03400 bis zero, #I, vO I/ vO now 1 002560~4: a8240000 ldl-L to, O(t3) I/ @00156038 0 002560~8: b8040000 stl-c vO, O(t3) /I @00156038 = 1 002560cc: e40000b6 beq vO, 002563a8 /I (v0 is 1) 002560dO: 63ff4000 mb I/ 002560d4: e4200001 beq to, 002560dc /I (to is 0) opends60!netIOReadData+Ox74: 002560dc: albe004c Id1 s4, 4c(sp) /I @072bfeec 72bffa0 002560eO: aOOdOOOO Id1 v0, O(s4) /I 3072bffaO 155~58 002560e4: a04003dc Id1 tl, 3dc(v0) /I @00156034 0 002560e8: 20800404 Ida t3, 404(v0) /I t3 now 15605c 002560ec: 405f05a2 cmpeq tl, zero, tl I/ tl now 1 002560fO: e4400003 beq tl, 00256100 // (tl is 1) 002560f4: a0600404 Id1 t2, 404(v0) /I @0015605c 15605c 002560f8: 406405a3 cmpeq t2, t3, t2 // t2 now 1 002560fc: 47e30402 bis zero, t2, tl /I tl now 1 00256100: 47e2040d bis zero, tl, s4 /I s4 now 1

Figure 2 Examplc of Instruction Trace Output fro111 Ntstcp

98 Digiral Tcchnrcill ]our11:1l Vol. 8 No.4 1990 00256104: e4400005 beq tl, 0025611~ I1 (tl is I) 00256108: aOaOOOOO Ldl t4, O(v0) /I@00155c58 204200 0025610~: 24df0080 Ldah t5, 80(zero) I1 t5 now 800000 00256110: 48a07625 zapnot t4, #3, t4 /It4 now 4200 00256114: 40a60005 add1 t4, t5, t4 /It4 now 804200 00256118: bOaOOOOO st1 t4, O(v0) 11 @00155c58 = 804200 002561 1c: aOfe004c Ldl t6, 4c(sp) /I@072bfeec 72bffa0 00256120: aOe70000 Ldl t6, O(t6) /I @072bffa0 155~58 00256124: b3e703e0 st1 zero, 3eO(t6) I/ @00156038 = 0 00256128: e5a00061 beq s4, 002562b0 /I(s4 is 1) 0025612~: 257f0026 Ldah s2, 26czero) I1 s2 now 260000 00256130: 216b62f8 Lda s2, 62f8Cs2) /Is2 now 2662f8 00256134: 5fff041f cpys f31, f31, f31 I1 00256138: a2le0054 Ldl aO, 54(sp) /I@072bfef4 1476a8 0025613~: 225e0040 Lda a2, 40(sp) /Ia2 now 72bfee0 002561 40: aOObOOOO Ldl v0, O(s2) I1 3002662f8 77e985a0 00256144: 227e0048 Ida a3, 48(sp) /I a3 now 72bfee8 00256148: a23e0050 Id1 al, 50(sp) I/ 3072bfefO 72bffa8 0025614~: 47ef0414 bis zero, fp, a4 // a4 now dbbaO 002561 50: a2100000 Ldl aO, O(a0) /I@001476a8 2cO 00256154: 6b404000 jsr ra, (vO),O I1 ra now 256158 KERNEL32!GetQueuedCompLetionStatus: 77e985aO: 23deffc0 Lda sp, -4O(sp) 11 sp now 72bfe60 77e985a4: b53e0000 stq SO, O(sp) I1 @072bfe60 = 148440 77e985a8: b55e0008 stq sl, 8(sp) //@072bfe68 = 72bffa0 77e985ac: b57e0010 stq s2, IO(sp) I1 ii1072bfe70 = 2662f8 77e985b0: b59e0018 stq s3, 18(sp) 11 @072bfe78 = 1476a8 77e985b4: b75e0020 stq ra, 2O(sp) /I@072bfe80 = 256158 77e985b8: 47f00409 bis zero, aO, SO // SO now 2cO 77e985bc : 47f 1040a bis zero, al, s1 // sl now 72bffa8 77e985cO: 47f2040b bis zero, a2, s2 11 s2 now 72bfee0 77e985c4: 47f3040c bis zero, a3, s3 /Is3 now 72bfee8 77e985c8: 47f40411 bis zero, a4, a1 I1 a1 now dbbaO 77e985cc: 221 e0038 Lda aO, 38(sp) I1 a0 now 72bfe98 77e985d0: d3405893 bsr ra, 77eae820 /Ira now 77e985d4

Figure 2 (continued) Example of Instruction Trace Output from Ntstep

then reduced to a series of single transactions and ana- trace. The remaining four instructions combine to lyzed for instruction distribution. For both the caclied- makc up 2.6 percent and 2.7 percent of the instruc- and the scaled-transaction instruction counts, we com- tions executed per scaled and cached transaction, bined at .least three separate transactions and took the respecti\~ely.Other observations include the following: geometric mean of the instructions executed, which The number of instructions executed decreased caused slight variations in the instruction counts. All 7 percent for scaled and 4 percent for cached resulting instruction counts were within an acceptable transactions. standard deviation as cornpared to individual transac- tion instruction counts. The number of ldl-l/stl-c scquenccs decreased We collccted the fiulction traces in a similar fashion. 3 percent for scaled transactions. Once the application benchmark was at a steady state, All the instructions that are identified in Tables 1 we began collecting thc tiinction call tree. Based on and 2 show a decrease in usage. Not surprisingly, previous work with the SQL Scrvcr database and con- the instructions mskwl and niskbl coniplctely disap- sultation \vitl~~Microsofi cnginccrs, we could pinpoint peared. The ins~\~Iand insbl instructions decreased thc beginning of a single transaction. \Ve then began by 47 percent and 90 percent, respectively. The sll collecting samples for both traces at the same instant, instruction decreased by 38 percent, and the sra using an Ntstep feature that allowed us to start or stop instruction usage decreased by 53 percent. These sample collection bascd upon a particular address. reductions hold true within 1 to 2 percent for both The dynamic instruction counts for both the scaled scaled and cached transactions. and the cached transactions are given in Tables 11 and The instructions Jdq-u and Ida, which are used 12. We also show thc \lariation and percentage \{aria- in unaligned load and store operations, show a ti011 between the Original and the Ryte/Word versions dccrcase in the range of20 to 22 percent and 15 to of the SQL Server. Two of the six new instructio~ls, 16 percent, respectively. sextb and sexbv, are not present in the Ryte/Word

Digital Technical Journal Vol. 8 No. 4 1996 99 0 ** Breakpoint (Pid Oxd7, lid Oxdb) SQLSERVR.EXE pc 77f39b34 0 ** Trace begins at 00242698 0 ** . opends60!FetchNextCommand 13 ** . . opends60!netIOReadData 72 **... KERNEL32!GetQueuedCompletionStatus 85 **.... KERNEL32!BaseFormatTimeOut 99 **.... ntdll!NtRemoveIoCompletion 129 **... opends60!netI0CompletionRoutine 272 ** . . opends60!netIORequestRead 285 **... KERNEL32!ResetEvent 290 **.... ntdll!NtCLearEvent 318 **... SSNMPN60!*0x06a131f0* 348 **.... KERNEL32!ReadFile 399 **..... ntdLl!NtReadFiLe 412 **..... KERNEL32!BaseSetLastNTError 417 ...... ntdll!RtlNtStatusToDosError 423 k*...... ntdll!RtlNtStatusToDosErrorNoTeb 509 **.... KERNEL32!GetLastError 560 ** . opends60!get-client-event 665 ** . . opends60!processRPC 682 **... opends60!unpack-rpc 749 ** . opends60!execute-event 762 ** . . opends60!execute~sqLserver~event 802 **... opends60!unpack-rpc 864 **... SQLSERVR!execrpc 911 ...... KERNEL32!WaitForSingleObjectEx 937 ...... KERNEL32!BaseFormatTimeOut 950 **..... ntdll!NtWaitForSingleObject 1024 **.... SQLSERVR!UserPerfStats 1038 ** . . . . KERNEL32!GetThreadTimes 1055 ** . . ...ntdll!NtQueryInformationThread 1173 ** . . . SQLSERVR!init-recvbuf 1208 ** . . . SQLSERVR!init-sendbuf 1227 ** . . . SQLSERVR!port-ex-handle 1263 ** . . . SQLSERVR!-Otssetjmp3 1313 **.... SQLSERVR!memalLoc 1365 X*..... SQLSERVR!-OtsZero 1405 **.... SQLSERVR!recvhost 1437 ...... SQLSERVR!-OtsMove 1500 **.... SQLSERVR!rnemalloc 1577 ** . . . SQLSERVR!rn-char 1580 ** . . . . SQLSERVR!recvhost 1612 ** . . ...SQLSERVR!-OtsMove 1777 ** . . . SQLSERVR!parse-name 1808 ** . . . . SQLSERVR!dbcs-strnchr 2115 ** . . . SQLSERVR!rpcprot 2131 ** . . . . SQLSERVR!memalloc 2183 ** . . ...SQLSERVR!-OtsZero 2252 ** . . . . SQLSERVR!getprocid 2319 ** . . ...SQLSERVR!procrelink+Ox1250 2546 ** . . ...SQLSERVR!-OtsRemainder32 2559 ...... SQLSERVR!-OtsDivide32+0x94 2597 ...... SQLSERVR!opentable 2642 ...... SQLSERVR!parse-name 2673 ...... SQLSERVR!dbcs-strnchr 2979 ...... SQLSERVR!parse-name 3010 ...... SQLSERVR!dbcs-strnchr 3323 ...... SQLSERVR!opentabid 3363 ...... SQLSERVR!getdes 3493 ...... SQLSERVR!GetRunidFromDefid+0~40 3510 ...... SQLSERVR!-OtsZero 3658 ...... SQLSERVR!initarg 3668 **...... SQLSERVR!setarg 3703 ** ...... SQLSERVR!-OtsFieLdInsert 3764 ** ...... SQLSERVR!setarg 3799 ** ...... SQLSERVR!-OtsFieldInsert 3857 ** ...... SQLSERVR!startscan 3901 ** ...... SQLSERVR!getindexZ 3978 ...... SQLSERVR!getkeepslot 4064 ...... SQLSERVR!rowoffset 4109 ...... SQLSERVR!rouoffset 4170 ...... SQLSERVR!-OtsMove 4331 ...... SQLSERVR!memcmp 5323 ...... SQLSERVR!bufunhold 5436 ...... SQLSERVR!prepscan 5550 ...... SQLSERVR!match-sargs-to-index

Figure 3 Example of Function Trace Output from Ntstep

Figure 3 (continued) Example of Function Trace Output from Ntstep
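The call depth in this listing is encoded by the run of dots on each line. As a small illustration of consuming such output, the sketch below parses lines of the shape shown in Figure 3; the line format is our reading of the figure, not a documented Ntstep interface.

import re

# sample number, optional '**' marker, dot run (depth), module!function name
LINE = re.compile(r"^\s*(\d+)\s+(?:\*\*)?\s*((?:\.\s*)+)(\S+)\s*$")

def parse_trace(lines):
    """Yield (sample, depth, name) for each call-tree line; skip headers."""
    for line in lines:
        m = LINE.match(line)
        if m:
            sample, dots, name = m.groups()
            yield int(sample), dots.count("."), name

# Example with two lines in the Figure 3 format:
for sample, depth, name in parse_trace([
    "13 ** . . opends60!netIOReadData",
    "72 ** . . . KERNEL32!GetQueuedCompletionStatus",
]):
    print(f"{sample:>6} {'  ' * depth}{name}")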

Table 11 Instruction Count and Variations for Scaled Transaction
[Columns: Instruction, Original, Byte/Word, Delta, % Delta. Rows (83 instructions plus Totals): stb, stt, stw, cmple, ldwu, inswl, ldbu, srl, cmpbge, extqh, cmovlbs, cmpule, addt, cmpult, cmovlbc, cmplt, cmovle, rdteb, insqh, extwl, cmovgt, stq_u, callsys, blt, mulq, bic, s8subq, extll, cmovlt, extlh, ldt, bge, zap, mb, umulh, sll, mull, subl, ornot, br, cmpeq, sra, insql, bsr, blbs, s4addl, s8addl, ret, mskwl, zapnot, jsr, addq, cpys, subq, mskqh, ldah, cmovne, extbl, mskbl, xor, cmoveq, and, insbl, bne, extwh, addl, trapb, ldq_u, mskql, stl, jmp, lda, cmovge, stq, blbc, ldq, bgt, beq, ldl_l, bis, stl_c, ldl, extql. The numeric entries are not recoverable from the source.]
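The arithmetic behind the Delta and % Delta columns is straightforward. The sketch below, with invented function and variable names (the paper does not describe its tooling), combines at least three per-transaction counts by geometric mean, as described earlier, and then differences the two builds.

from math import prod

def geometric_mean(counts):
    """Geometric mean of per-transaction dynamic counts."""
    return prod(counts) ** (1.0 / len(counts))

def table_row(insn, original_runs, byteword_runs):
    """Compute one table row from per-transaction counts for each build."""
    orig = geometric_mean(original_runs)
    bw = geometric_mean(byteword_runs)
    delta = bw - orig
    pct = 100.0 * delta / orig if orig else 0.0
    return insn, orig, bw, delta, pct

# Hypothetical counts for one instruction across three transactions,
# shaped like the ~90 percent insbl reduction reported in the text:
print(table_row("insbl", [950, 1010, 980], [95, 102, 99]))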

For the scaled transaction, a decrease in 58 out of 83 instruction types occurred. Of the remaining 25 instructions, 21 had no change and only 4 instructions, mull, s8addl, trapb, and subl, showed an increase. For cached transactions, 32 instruction counts decreased, 29 increased, and 22 remained unchanged.

The performance gain of 3.5 percent measured for the cached version of the application benchmark correlates closely to the decrease in the number of instructions per transaction measured in Tables 11 and 12. If this correlation holds true, we would expect to see an increase in performance of approximately 7 percent for scaled transaction runs.
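This correlation amounts to a simple proportional model: if throughput varies inversely with instructions per transaction, the measured path-length reductions translate directly into expected gains. A back-of-the-envelope sketch, stated as an assumption rather than a measurement:

def predicted_gain(original_insns: float, byteword_insns: float) -> float:
    """Speedup predicted if throughput ~ 1 / (instructions per transaction)."""
    return original_insns / byteword_insns - 1.0

# The 7 percent instruction reduction reported for scaled transactions:
print(f"{predicted_gain(1.00, 0.93):.1%}")  # about 7.5 percent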

Table 12 Instruction Count and Variations for Cached Transaction
[Columns: Instruction, Original, Byte/Word, Delta, % Delta. Rows: the same 83 instructions and Totals as in Table 11. The numeric entries are not recoverable from the source.]

Dynamic Instruction Distribution

The performance of the Alpha microprocessor using technical and commercial workloads has been evaluated.¹ The commercial workload used was debit-credit, which is similar to the TPC-A benchmark. The TPC-B benchmark is similar to the TPC-A, differing only in its method of execution. Cvetanovic and Bhandarkar presented an instruction distribution matrix for the debit-credit workload. The Alpha instruction type mix is dominated by the integer class, followed by other, load, branch, and store instructions, in descending order.¹⁷ We took a similar approach but divided the instructions into more groups to achieve a finer detailed distribution. Table 13 gives the instruction makeup of each group. Figure 4 shows the percentage of instructions in each group for the four alternatives we studied. In all four cases, INTEGER LOADS make up 32 percent of the instructions executed. In the scaled Byte/Word category, the new ldbu and ldwu instructions compose 1 percent of the integer instructions, and the new stb and stw instructions accounted for 18 percent of the integer store instructions executed.

Table 13 Instruction Groupings
[Columns: Instruction Group, Group Members. The group-membership rows are not recoverable from the source.]

… the method of loading or storing bytes and words on the original Alpha Architecture made heavy use of these types of instructions. In our last examination, we look …

Figure 4 Instruction Group Distribution
[Bar chart: one bar each for CACHED BYTE/WORD, CACHED ORIGINAL, SCALED BYTE/WORD, and SCALED ORIGINAL, spanning 0 to 100 percent. Key: INTEGER LOAD, INTEGER STORE, INTEGER CONTROL, INTEGER ARITHMETIC, LOGICAL SHIFT, BYTE MANIPULATION, OTHER.]
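Reproducing a distribution like Figure 4 from raw counts is a small folding exercise: map each mnemonic to its Table 13 group and normalize. The sketch below uses a deliberately partial, illustrative group map; it is not the paper's full Table 13 membership.

GROUPS = {
    "ldl": "INTEGER LOAD", "ldq": "INTEGER LOAD",
    "ldbu": "INTEGER LOAD", "ldwu": "INTEGER LOAD",
    "stl": "INTEGER STORE", "stq": "INTEGER STORE",
    "stb": "INTEGER STORE", "stw": "INTEGER STORE",
    "beq": "INTEGER CONTROL", "bne": "INTEGER CONTROL",
    "jsr": "INTEGER CONTROL", "ret": "INTEGER CONTROL",
    "addl": "INTEGER ARITHMETIC", "addq": "INTEGER ARITHMETIC",
    "and": "LOGICAL SHIFT", "bis": "LOGICAL SHIFT", "sll": "LOGICAL SHIFT",
    "extbl": "BYTE MANIPULATION", "insbl": "BYTE MANIPULATION",
    "mskbl": "BYTE MANIPULATION",
}

def group_distribution(counts):
    """counts: mnemonic -> dynamic count. Returns group -> percent of total."""
    totals = {}
    for insn, n in counts.items():
        g = GROUPS.get(insn, "OTHER")      # anything unmapped falls into OTHER
        totals[g] = totals.get(g, 0) + n
    grand = sum(totals.values()) or 1
    return {g: 100.0 * n / grand for g, n in totals.items()}

# Hypothetical counts: integer loads dominate, as in all four bars above.
print(group_distribution({"ldl": 320, "stl": 90, "beq": 150, "addq": 200,
                          "bis": 110, "extbl": 40, "cpys": 90}))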

Table 14 Instruction Variations (Scaled Minus Cached Transactions)
[Columns: Instruction, Original, Byte/Word, in three column groups. Rows: stw, cmplt, subl, ldwu, rdteb, br, ldbu, extwl, sra, cmovlbc, stq_u, bsr, callsys, blt, s4addl, s8subq, bic, ret, zap, extll, zapnot, umulh, extlh, addq, mull, bge, subq, ornot, mb, ldah, cmpeq, sll, extbl, blbs, cmovge, xor, s8addl, blbc, and, mskwl, bgt, bne, jsr, ldl_l, addl, cpys, stl_c, ldq_u, mskqh, extql, stl, cmovne, cmple, lda, cmoveq, inswl, stq, extwh, srl, ldq, trapb, extqh, beq, mskql, cmpule, bis, jmp, cmpult, ldl, Totals. The numeric entries are not recoverable from the source.]

… produce a different set of results and would merit investigation. The use of another database, such as the Oracle RDBMS, which makes greater use of byte operations, would possibly result in an even greater performance impact. Lastly, rebuilding the entire operating system to use the new instructions would make an interesting and worthwhile study.

Acknowledgments

As with any project, many people were instrumental in its completion. Steve Jenness gave us numerous insights into the Windows NT operating system. Tom Van Baak provided several analysis and tracing/simulation tools for the Windows NT environment. Rich Grove provided access to early builds of the GEM compiler back end that contained byte and word support. Stan Gazaway built the SQL Server application with the modifications. Velibi Tasar provided encouragement and sanity checking. John Shakshober lent insight into the world of TPC. Peter Bannon provided the early prototype machine. Contributors from Microsoft Corporation included Todd Ragland, who helped rebuild the SQL Server; Rick Vicik, who provided detailed insights into the operation of the SQL Server; and Damien Lindauer, who helped set up and run the TPC benchmark. Finally, we thank Dick Sites for encouraging us to undertake this effort.

References and Notes

1. Z. Cvetanovic and D. Bhandarkar, "Characterization of Alpha AXP Performance Using TP and SPEC Workloads," 21st Annual International Symposium on Computer Architecture, Chicago (1994).
2. W. Kohler et al., "Performance Evaluation of Transaction Processing," Digital Technical Journal, vol. 3, no. 1 (Winter 1991): 45-57.
3. S. Leutenegger and D. Dias, "A Modeling Study of the TPC-C Benchmark," Proceedings of the International Conference on Management of Data, SIGMOD Record 22 (2) (June 1993).
4. R. Sites and E. Perl, PatchWrx - A Dynamic Execution Tracing Tool (Palo Alto, Calif.: Digital Equipment Corporation, Systems Research Center, 1995).
5. W. Kohler, A. Shah, and F. Raab, Overview of TPC Benchmark C: The Order-Entry Benchmark (San Jose, Calif.: Transaction Processing Performance Council Technical Report, 1991).
6. R. Sites, "Alpha AXP Architecture," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 19-34.
7. Alpha AXP Systems Handbook (Maynard, Mass.: Digital Equipment Corporation, 1993).
8. DECchip 21064A-233, 275 Alpha AXP Microprocessor Data Sheet (Maynard, Mass.: Digital Equipment Corporation, 1994).

9. Alpha 21164 Microprocessor Hardware Reference Manual (Maynard, Mass.: Digital Equipment Corporation, 1994).
10. R. Sites and R. Witek, Alpha AXP Architecture Reference Manual, 2d ed. (Newton, Mass.: Digital Press, 1995).
11. G. Kane, MIPS R2000 RISC Architecture (Englewood Cliffs, N.J.: Prentice-Hall, 1987).

12. J. Hennessy, N. Jouppi, F. Baskett, and J. Gill, MIPS: A VLSI Processor Architecture (Stanford, Calif.: Computer Systems Laboratory, Stanford University, Technical Report No. 223, 1981).
13. J. Hennessy, N. Jouppi, F. Baskett, T. Gross, J. Gill, and S. Przybylski, Hardware/Software Tradeoffs for Increased Performance (Stanford, Calif.: Computer Systems Laboratory, Stanford University, Technical Report …).

… During the early days of the port of Windows NT to Alpha, DIGITAL's engineers ported the ARC firmware to the Alpha platform.

17. The Alpha instruction type mix included PALcode calls, barriers, and other implementation-specific PALcode instructions.

Biographies

Eric B. Betts
Eric Betts is a principal software engineer in the DIGITAL Software Partners Engineering Group, where he has been involved with performance engineering, project management, and benchmarking for the Microsoft SQL Server and Windows NT products. Previously with the Federal Government Region, Eric was a member of the technical support group and a technical lead on several government programs. …

David P. Hunter
David Hunter is the engineering manager of the DIGITAL Software Partners Advanced Development Group, where he has led performance investigations of databases and their interactions with UNIX and Windows NT. Prior to this work, he held positions in the Alpha Migration Office, the ISV Porting Group, and the Government Group's Technical Program Management Office. He joined DIGITAL in the Laboratory Data Products Group in 1983, where he developed the VAXlab User Management system. He was the project leader of an advanced development project, an executive information system, for which he developed hardware and software components. David has patent applications pending in the area of software engineering. He holds a degree in electrical and computer engineering from Northeastern University.

