ALPHASERVER 4100 SYSTEM

ORACLE AND SYBASE DATABASE PRODUCTS FOR VLM

INSTRUCTION EXECUTION ON ALPHA PROCESSORS

Digital Technical Journal
Volume 8 Number 4 1996

The Digital Technical Journal is a refereed journal published quarterly by Digital Equipment Corporation, 50 Nagog Park, AK02-3/B3, Acton, MA 01720-9843.

Hard-copy subscriptions can be ordered by sending a check in U.S. funds (made payable to Digital Equipment Corporation) to the published-by address. General subscription rates are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. DIGITAL's customers may qualify for gift subscriptions and are encouraged to contact their account representatives.

Electronic subscriptions are available at no charge by accessing URL http://www.digital.com/info/subscription. This service will send an electronic mail notification when a new issue is available on the Internet.

Single copies and back issues are available for $16.00 (non-U.S. $18) each and can be ordered by sending the requested issue's volume and number and a check to the published-by address. See the Further Readings section in the back of this issue for a complete listing. Recent issues are also available on the Internet at http://www.digital.com/info/dtj.

Editorial: Jane C. Blake, Managing Editor; Kathleen M. Stetson, Editor; Helen L. Patterson, Editor
Circulation: Catherine M. Phillips, Administrator; Dorothea B. Cassady, Secretary
Production: Christa W. Jessica, Production Editor; Anne S. Katzeff, Typographer; Peter R. Woodbury, Illustrator
Advisory Board: Samuel H. Fuller, Chairman; Richard W. Beane; Donald Z. Harbert; Richard J. Hollingsworth; William A. Laing; Richard F. Lary; Alan G. Nemeth; Robert M. Supnik

The following are trademarks of Digital Equipment Corporation: AlphaServer, AlphaStation, DEC, DECnet, DIGITAL, the DIGITAL logo, VAX, VMS, and ULTRIX. AIM is a trademark of AIM Technology, Inc. CCT is a registered trademark of Cooper and Chyan Technologies, Inc. CHALLENGE and Silicon Graphics are registered trademarks and POWER CHALLENGE is a trademark of Silicon Graphics, Inc. Compaq is a registered trademark and ProLiant is a trademark of Compaq Computer Corporation. HP is a registered trademark of Hewlett-Packard Company. HSPICE is a registered trademark of Metasoftware Corporation. IBM, PowerPC, and PowerPC 604 are registered trademarks and RS/6000 is a trademark of International Business Machines Corporation. Insignia is a trademark of Insignia Solutions, Inc. Intel and Pentium are trademarks of Intel Corporation. IPX/SPX is a trademark of Novell, Inc. ispLSI and Lattice Semiconductor are registered trademarks of Lattice Semiconductor Corporation. KAP is a trademark of Kuck & Associates, Inc. MEMORY CHANNEL is a trademark of Encore Computer Corporation. Mental Ray is a trademark of Mental Images. Metral is a trademark of Berg Technology, Inc. Microsoft, MS-DOS, and Visual C++ are registered trademarks and Windows and Windows NT are trademarks of Microsoft Corporation.

The cover design is by Lucinda O'Neill of DIGITAL's Corporate Design Group.

Contents

ALPHASERVER 4100 SYSTEM

AlphaServer 4100 Performance Characterization Zarka Cvetanovic and Darrel D. Donaldson 3

The AlphaServer 4100 Cached Processor Module Architecture and Design  Maurice B. Steinman, George J. Harris, Andrej Kocev, Virginia C. Lamere, and Roger D. Pannell  21

The AlphaServer 4100 Low-cost Clock Distribution System Roger A. Dame 38

Design and Implementation of the AlphaServer 4100 CPU and Memory Architecture  Glenn A. Herdeg  48

High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System  Samuel H. Duncan, Craig D. Keefer, and Thomas A. McLaughlin  61

ORACLE AND SYBASE DATABASE PRODUCTS FOR VLM

Design of the 64-bit Option for the Oracle7 Relational Database Management System  Vipin V. Gokhale  76

VLM Capabilities of the Sybase System 11 SQL Server  T. K. Rengarajan, Maxwell Berenson, Ganesan Gopal, Bruce McCready, Sapan Panigrahi, Srikant Subramaniam, and Marc B. Sugiyama  83

INSTRUCTION EXECUTION ON ALPHA PROCESSORS

Measured Effects of Adding Byte and Word Instructions to the Alpha Architecture  David P. Hunter and Eric B. Betts  89

Editor's Introduction

Just 40 years ago, a machine called the TX-0, a successor to Whirlwind, was built at MIT's Lincoln Laboratory to find out, among other things, if a ...

The AlphaServer 4100 cached processor module design is presented by Mo Steinman, George Harris, Andrej Kocev, Ginny Lamere, ...

... demonstrate a clear performance benefit for decision support systems and online transaction processing. The ...

Zarka Cvetanovic
Darrel D. Donaldson

AlphaServer 4100 Performance Characterization

The AlphaServer 4100 is the newest four-processor symmetric multiprocessing addition to DIGITAL's line of midrange Alpha servers. The DIGITAL AlphaServer 4100 family, which consists of models 5/300E, 5/300, and 5/400, has major platform performance advantages as compared to previous-generation Alpha platforms and leading industry midrange systems. The primary performance strengths are low memory latency, high bandwidth, low-latency I/O, and very large memory (VLM) technology. Evaluating the characteristics of both technical and commercial workloads against each family member yielded recommendations for the best application match for each model. The performance of the model with no module-level cache and the advantages of using 2- and 4-megabyte module-level caches are quantified. The profiles based on the built-in performance monitors are used to evaluate cycles per instruction, stall time, multiple-issuing benefits, instruction frequencies, and the effect of cache misses, branch mispredictions, and replay traps. The authors propose a time allocation-based model for evaluating the performance effects of various stall components and for predicting future performance trends.

The AlphaServer 4100 is DIGITAL's latest four-processor symmetric multiprocessing (SMP) midrange Alpha server. This paper characterizes the performance of the AlphaServer 4100 family, which consists of three models:1-5

1. AlphaServer 4100 model 5/300E, which has up to four 300-megahertz (MHz) Alpha 21164 microprocessors, each without a module-level, third-level, write-back cache (B-cache) (a design referred to as uncached in this paper)

2. AlphaServer 4100 model 5/300, which has up to four 300-MHz Alpha 21164 microprocessors, each with a 2-megabyte (MB) B-cache

3. AlphaServer 4100 model 5/400, which has up to four 400-MHz Alpha 21164 microprocessors, each with a 4-MB B-cache

The performance analysis undertaken examined a number of workloads with different characteristics, including the SPEC95 benchmark suites (floating-point and integer), the LINPACK benchmark, AIM Suite VII (UNIX multiuser benchmark), the TPC-C transaction processing benchmark, image rendering, and memory latency and bandwidth tests.6-15 Note that both commercial (AIM and TPC-C) and technical/scientific (SPEC, LINPACK, and image rendering) classes of workloads were included in this analysis. The results of the analysis indicate that the major AlphaServer 4100 performance advantages result from the following server features:

• Significantly higher bandwidth (up to 2.6 times) and lower latency compared to the previous-generation midrange AlphaServer platforms and leading industry midrange systems. These improvements benefit the large, multistream applications that do not fit in the B-cache. For example, the AlphaServer 4100 5/300 is 30 to 60 percent faster than the HP 9000 K420 server in the memory-intensive workloads from the SPECfp95 benchmark suite. (Note that all competitive performance data presented in this paper is valid as

of the submission of this paper in July 1996. The references cited refer the reader to the literature and the appropriate Web sites for the latest performance information.)

• An expanded very large memory (VLM). The maximum memory size increased from 2 gigabytes (GB) to 8 GB without sacrificing CPU slots. This increase in memory size benefits primarily the commercial, multistream applications. For example, the AlphaServer 4100 5/300 server achieves approximately twice the throughput of the Compaq ProLiant 4500 server and 1.4 times the throughput of the AlphaServer 2100 on the AIM Suite VII benchmark tests.

• A 4-MB B-cache and a clock speed of 400 MHz in the AlphaServer 4100 5/400 system. The larger B-cache size and 33 percent faster clock resulted in a 30 to 40 percent performance improvement over the AlphaServer 4100 5/300 system. The performance improvement from the larger B-cache increases with the number of CPUs. For example, the AlphaServer 4100 5/300 system with its 2-MB B-cache design performs 5 to 20 percent faster with one CPU and 30 to 50 percent faster with four CPUs than the uncached 5/300E system. The majority of workloads included in this analysis benefit from the B-cache; however, the uncached system outperforms the cached implementation by 10 to 20 percent for large applications that do not fit in the 2-MB B-cache.

The performance counter profiles, based on the built-in hardware monitors, indicate that the majority of issuing time is spent on single and dual issuing and that a small number of floating-point workloads take advantage of triple and quad issuing. Load/store instructions make up 30 to 40 percent of all instructions issued. The stall time associated with waiting for data that missed in the various levels of cache hierarchy accounts for the most significant portion of the time the server spends processing commercial workloads.

Memory Latency

Memory latency and bandwidth have been recognized as important performance factors in the early Alpha-based implementations.16,17 Since CPU speed is increasing at a much higher rate than memory speed, the "memory wall" limitation is expected to become even more important in the future. Therefore, reducing memory latency and increasing bandwidth have been major design goals for the AlphaServer 4100 platform.1 The AlphaServer 4100 achieved the lowest memory latency of all DIGITAL products based on the Alpha 21164 microprocessor and all multiprocessor products by leading industry vendors. The major benefits come from the simpler interface, the use of synchronous dynamic random-access memory (DRAM) chips (i.e., synchronous memory), and the lower fill time.2 Figure 1 shows the measured memory load latency using the lmbench benchmark with a 512-byte stride.10 In this benchmark, each load depends on the result from the previous load, and therefore latency is not a good measure of performance for systems that can have multiple outstanding loads. (AlphaServer 4100 systems can have up to two outstanding requests per CPU on the bus.) The lmbench benchmark data indicates that the AlphaServer 4100 has the lowest memory latency of all industry-leading reduced-instruction set computing (RISC) vendors' servers.

As shown in Figure 2, using a slightly different workload where there is no dependency between consecutive loads, the AlphaServer 4100 achieves even lower per-load latency, since the latency for two consecutive loads can be overlapped. The plateaus in Figure 2 show the load latency at each of the following levels of the cache/memory hierarchy: 8-kilobyte (KB) on-chip data cache (D-cache), 96-KB on-chip secondary instruction/data cache (S-cache), 2- and 4-MB off-chip B-caches (except for model 5/300E), and memory. The uncached AlphaServer 4100 5/300E achieves an 85 percent lower memory load latency than the previous-generation AlphaServer 2100. The AlphaServer 4100 5/300, with its 2-MB B-cache, increases memory latency 30 percent for load operations and 6 percent for store operations compared to the uncached 5/300E system because of the time spent checking for data in the B-cache. The synchronous memory shows one cycle lower latency than the asynchronous extended data out (EDO) DRAM (i.e., asynchronous memory), which results in 9 percent faster load operations and 5 percent faster store operations.3 Note that the cached AlphaServer 4100 and AlphaServer 8200 systems, which have the same clock speeds of 300 MHz, achieve comparable B-cache latency, while the memory latency for all AlphaServer 4100 systems is significantly lower than on both the AlphaServer 8200 and the AlphaServer 2100 systems. The latency to the B-cache in this test is lower on the AlphaServer 2100 than on the other AlphaServer systems due to 32-byte blocks (compared to 64-byte blocks in the 4100 and 8200 systems). Although not shown in this test, many applications can benefit from the larger cache block size. The 400-MHz AlphaServer 4100 system uses a 33 percent faster CPU and achieves an 11 percent reduction in B-cache and memory latency compared to the 300-MHz AlphaServer 4100 system.
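The dependent-load behavior that Figure 1 measures can be reproduced with a pointer-chasing loop. The following C sketch is illustrative, not the lmbench source: each element of a ring points 512 bytes ahead, so every load address depends on the previous load, and the overlap afforded by the bus's two outstanding requests cannot be exploited. All sizes and counts below are assumptions chosen for the sketch.

    /* Minimal pointer-chasing latency sketch (not the lmbench source).
     * Each load depends on the previous one, so the memory system cannot
     * overlap requests; elapsed time / count approximates load latency. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define STRIDE  512                    /* bytes between successive loads */
    #define SIZE    (16 * 1024 * 1024)     /* working set larger than the B-cache */
    #define COUNT   10000000L              /* dependent loads to time */

    int main(void)
    {
        size_t n = SIZE / sizeof(void *);
        size_t step = STRIDE / sizeof(void *);
        void **ring = malloc(n * sizeof(void *));
        size_t i;

        /* Link each element to the one STRIDE bytes ahead, wrapping at the
         * end, so the traversal forms one long dependent chain through memory. */
        for (i = 0; i < n; i++)
            ring[i] = (void *)&ring[(i + step) % n];

        void **p = (void **)ring[0];
        clock_t t0 = clock();
        for (long j = 0; j < COUNT; j++)
            p = (void **)*p;               /* address depends on the last load */
        clock_t t1 = clock();

        /* Print the pointer so the compiler cannot discard the chain. */
        printf("final %p: %.1f ns/load\n", (void *)p,
               (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / COUNT);
        free(ring);
        return 0;
    }

Removing the dependence (for example, summing ring[j] over a plain index) lets consecutive misses overlap, which is exactly the difference between Figures 1 and 2.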

[Figure 1 is a bar chart of dependent-load memory latency in nanoseconds (lmbench, stride = 512 bytes), on an axis from 0 to 1,200, for the AlphaServer 8200 (300 MHz), AlphaServer 4100 5/400 (400 MHz), AlphaServer 4100 5/300 (300 MHz), AlphaServer 4100 5/300E (300 MHz), Intel Pentium Pro (200 MHz), Sun UltraSPARC (167 MHz), HP 9000 K210 (119 MHz), SGI POWER CHALLENGE R10000 (200 MHz), and IBM RS/6000 43P PowerPC (133 MHz).]

Figure 1 lmbench Benchmark Test Results Showing Memory Latency for Dependent Loads

Memory Bandwidth

The AlphaServer 4100 system bus achieves a peak bandwidth of 1.06 gigabytes per second (GB/s). The STREAM McCalpin benchmark measures sustainable memory bandwidth in megabytes per second (MB/s) across four vector kernels: Copy, Scale, Sum, and SAXPY.11 Figure 3 shows measured memory bandwidth using the Copy kernel from the STREAM benchmark. Note that the STREAM bandwidth is 33 percent lower than the actual bandwidth observed on the AlphaServer 4100 bus because the bus data cycles are allocated for three transactions: read source, read destination, and write destination. The AlphaServer 4100 shows the best memory bandwidth of all multiprocessor platforms designed to support up to four CPUs. The platforms designed to support more than four CPUs (i.e., the AlphaServer 8400, the Silicon Graphics POWER CHALLENGE R10000, and the Sun Ultra Enterprise 6000 systems) show a higher bandwidth for four CPUs than the AlphaServer 4100. The STREAM bandwidth on the AlphaServer 4100 5/300 is 2.2 times higher than on the previous-generation AlphaServer 2100 5/250 (2.6 times higher with the AlphaServer 4100 5/400). The uncached AlphaServer 4100 model shows 22 percent higher memory bandwidth than the cached model 5/300. The AlphaServer 4100 memory bandwidth improvement from synchronous memory compared to EDO ranges from 8 to 12 percent. The synchronous memory benefit increases with the number of CPUs, as shown in Table 1.

Low memory latency and high bandwidth have a significant effect on the performance of workloads that do not fit in 2- to 4-MB B-caches. For example, the majority of the SPECfp95 benchmarks do not fit in the 2-MB cache. (Figure 20, which appears later in this paper, shows the cache misses.) The SPECfp95 performance comparison presented in Figure 4 shows that the uncached AlphaServer 4100 5/300E system outperforms the 2-MB B-cache model 5/300 in the benchmarks with the highest number of B-cache misses (tomcatv, swim, applu, and hydro2d). The performance of the uncached model 5/300E is comparable to that of the 4-MB B-cache model 5/400 for the swim benchmark. However, the benchmarks that fit better in the 4-MB cache (apsi and wave5) run significantly slower on the 5/300E than on the 5/400.
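The Copy kernel described above is the simplest of the four; a minimal STREAM-style version is sketched below (the official STREAM code adds the other kernels, timing calibration, and result checking, so this is only an assumption-laden illustration). It also makes the 33 percent accounting concrete: STREAM credits two memory operations per element (read a, write b), while a write-allocate bus performs three (read source, read destination, write destination).

    /* Minimal STREAM-style Copy kernel (a sketch, not the official STREAM code). */
    #include <stdio.h>
    #include <time.h>

    #define N (4 * 1024 * 1024)   /* elements; 32 MB per array, beyond the B-cache */

    static double a[N], b[N];

    int main(void)
    {
        long i;
        for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.0; }  /* touch all pages */

        clock_t t0 = clock();
        for (i = 0; i < N; i++)
            b[i] = a[i];                  /* Copy kernel: one read + one write */
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        /* STREAM counts 2 x N x sizeof(double) bytes (read a, write b).  With a
         * write-allocate cache the bus carries a third transaction per block
         * (read b before writing it), so the bus moves roughly 1.5 times the
         * bytes that STREAM reports. */
        printf("Copy: %.1f MB/s\n", 2.0 * N * sizeof(double) / secs / 1e6);
        return 0;
    }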

[Figure 2 plots independent-load latency in nanoseconds (0 to 300, stride = 64 bytes) against data set size from 4 KB to 16 MB for the AlphaServer 4100 5/300E, AlphaServer 4100 5/300, AlphaServer 4100 5/400, AlphaServer 8200 5/300, and AlphaServer 2100 5/300.]

Figure 2 Cache/Memory Latency for Independent Loads

[Figure 3 plots STREAM McCalpin Copy bandwidth in megabytes per second (up to 1,000) against the number of CPUs (1 to 6) for the AlphaServer 8400 5/300 and 5/350, IBM RS/6000-990, SGI POWER CHALLENGE R10000, AlphaServer 4100 5/300E, 5/300, and 5/400, HP 9000 J210, AlphaServer 2100 5/250, Sun SPARCserver 2000E, Intel Alder Pentium Pro, and Sun Ultra Enterprise 6000.]

Figure 3 STREAM McCalpin Memory Copy Bandwidth Comparison

Table 1
Bandwidth Improvement from Synchronous Memory Compared with Asynchronous Memory

Number of CPUs            1     2     3     4
Bandwidth improvement     8%    8%    9%    12%

Figure 4 shows that the AlphaServer 4100 5/300 system has a significant (up to two times) performance advantage over the previous-generation AlphaServer 2100 system in the SPECfp95 benchmark tests with the highest number of B-cache misses. The SPECfp95 tests indicate that the 300-MHz AlphaServer 4100 is more than 50 percent faster than the HP 9000 K420 server, and the 400-MHz AlphaServer 4100 is twice as fast as the HP 9000 K420 in the SPECfp95 benchmarks that stress the memory subsystem.

SPEC95 Benchmarks

The SPEC95 benchmarks provide a measure of processor, memory hierarchy, and compiler performance; they do not stress graphics, network, or I/O performance. The integer SPEC95 suite (CINT95) contains eight compute-intensive integer benchmarks written in C and includes the benchmarks shown in Table 2.11,12 The floating-point SPEC95 suite (CFP95) contains 10 compute-intensive floating-point benchmarks written in FORTRAN and includes the benchmarks shown in Table 3.11,12 The SPEC Homogeneous Capacity Method (SPEC95 rate) measures how fast an SMP system can perform multiple CINT95 or CFP95 copies (tasks). The SPEC95 rate metric measures the throughput of the system running a number of tasks and is used for evaluating multiprocessor system performance.

Table 2
CINT95 Benchmarks (SPECint95)

Benchmark       Description
099.go          Artificial intelligence, plays the game of Go
124.m88ksim     A Motorola 88100 microprocessor simulator
126.gcc         A GNU C compiler that generates SPARC assembly code
129.compress    A program that compresses large text files
130.li          A LISP interpreter
132.ijpeg       A program that compresses and decompresses JPEG images
134.perl        A Perl language interpreter
147.vortex      An object-oriented database program

Table 3
CFP95 Benchmarks (SPECfp95)

Benchmark       Description
101.tomcatv     A fluid dynamics mesh generation program
102.swim        A weather prediction shallow water model
103.su2cor      A quantum physics particle mass computation (Monte Carlo)
104.hydro2d     An astrophysics hydrodynamical Navier-Stokes equation solver
107.mgrid       A multigrid solver in a 3-D potential field (electromagnetism)
110.applu       Parabolic/elliptic partial differential equations (fluid dynamics)
125.turb3d      A program that simulates turbulence in a cube
141.apsi        A program that simulates temperature, wind, velocity, and pollutants (weather prediction)
145.fpppp       A quantum chemistry program that performs multielectron derivatives
146.wave5       A solver of Maxwell's equations on a Cartesian mesh (electromagnetics)

[Figure 4 is a bar chart of SPECfp95 ratios (0 to 35) for each CFP95 benchmark on the HP 9000 K420, the AlphaServer 2100 5/300, and the AlphaServer 4100 5/400, 5/300, and 5/300E.]

Figure 4 SPECfp95 Benchmarks Performance Comparison

Figure 5 compares the SPEC95 performance of the AlphaServer 4100 systems to that of the other industry-leading vendors using published results as of July 1996. Figure 6 shows the same comparison in the multistream SPEC95 rates.12 Note that all the SPEC95 comparisons in this paper are based on the peak results that include extensive compiler optimizations.12 Figure 5 indicates that even the uncached AlphaServer 4100 5/300E performs better than the HP 9000 K420 system, and the AlphaServer 4100 5/400 shows approximately a two times performance advantage over the HP system. The AlphaServer 4100 5/300 SPECint95 performance exceeds that of the Intel Pentium Pro system, and the AlphaServer 4100 5/300 SPECfp95 performance is double that of the Pentium Pro. The AlphaServer 4100 5/400 is 1.5 times (SPECint95) and 2.5 times (SPECfp95) faster than the Pentium Pro system. The multiple-processor SPECfp95 on the AlphaServer 4100 is obtained by decomposing benchmarks using the KAP preprocessor from Kuck & Associates. Note that the cached four-CPU AlphaServer 4100 5/300 outperforms the Sun Ultra Enterprise 3000 with six CPUs in the SPECfp95 parallel test. The performance benefit of the cached versus the uncached AlphaServer 4100 is greater in multiprocessor configurations than in uniprocessor configurations.

[Figure 6 is a bar chart of SPECint_rate95 and SPECfp_rate95 results (0 to 450) for the AlphaServer 4100 5/300E, 5/300, and 5/400 (4 CPUs each), the HP 9000 K420 PA-RISC 7200 120 MHz (4 CPUs), the Sun Ultra Enterprise 3000 UltraSPARC 167 MHz (4 CPUs), the Intel Calder Pentium Pro 200 MHz (1 CPU), and the IBM RS/6000 J40 PowerPC 604 112 MHz (6 CPUs).]

Figure 6 SPEC95 Throughput Results (SPEC95 Rates)

SPEC95 Multistream Performance Scaling

[Figure 5 is a bar chart of SPECint95 (1 CPU), SPECfp95 (1 CPU), and SPECfp95 (4 CPUs; 6 CPUs for the Sun system) results, on a scale of 0 to 35, for the AlphaServer 4100 5/300E, 5/300, and 5/400, the HP 9000 K420 PA-RISC 7200 (120 MHz), the Sun Ultra Enterprise 3000 UltraSPARC (167 MHz), the SGI POWER CHALLENGE R10000 (195 MHz), the Intel Calder Pentium Pro (200 MHz), and the IBM RS/6000 43P PowerPC 604e (166 MHz).]

Figure 5 SPEC95 Speed Results

Figures 7 and 8 show SPEC95 multistream performance as the number of CPUs increases. The SMP scaling on the AlphaServer 4100 is comparable to that on the AlphaServer 2100 for integer workloads (that fit in the 5/300 2-MB B-cache). Note that SPECint_rate95 scales proportionally to the number of CPUs in the majority of systems, since these workloads do not stress the memory subsystem. The SMP scaling in SPECfp_rate95 is lower, since the majority of these workloads do not fit in 1- to 4-MB caches. In the majority of applications, the AlphaServer 4100 5/300 and 5/400 systems improve SMP scaling compared to the uncached AlphaServer 4100 5/300E by reducing the bus traffic (from fewer B-cache misses) and by taking advantage of the duplicate tag store (DTAG) to reduce the number of S-cache probes. The cached 5/300 scaling, however, is lower than the uncached 5/300E scaling in memory bandwidth-intensive applications (e.g., tomcatv and swim). The analysis of traces collected by the logic analyzer that monitors system bus traffic indicates that the lower scaling is caused by (1) Set Dirty overhead, where Set Dirty is a cache coherency operation used to mark data as modified in the initiating CPU's cache; (2) stall cycles on the memory bus; and (3) memory bank conflicts.2,3

[Figure 7 plots SPECint_rate95 (0 to 450) against the number of CPUs (1 to 4) for the AlphaServer 4100 5/300E, 5/300, and 5/400, the AlphaServer 2100 5/300, the HP 9000 K420, the Sun Ultra Enterprise 3000, and the IBM RS/6000 J40.]

Figure 7 SPECint_rate95 Performance Scaling

Symmetric Multiprocessing Performance Scaling for Parallel Workloads

Parallel workloads have higher data sharing and lower memory bandwidth requirements than multistream workloads. As shown in Figures 9 and 10, the AlphaServer 4100 models with module-level caches improve the SMP scaling compared to the uncached AlphaServer 4100 model in the LINPACK 1000 x 1000 (million floating-point operations per second [MFLOPS]) and the parallel SPECfp95 benchmarks that benefit from 2- and 4-MB B-caches. Figure 9 indicates that the AlphaServer 4100 5/400 outperforms the SGI Origin 2000 system in the LINPACK 1000 x 1000 benchmark by 40 percent. Figure 10 indicates that the four-CPU AlphaServer 4100 5/400 shows better scaling than any other system in its class and outperforms the six-CPU Sun Ultra Enterprise 3000 system by more than 70 percent.

[Figure 8 plots SPECfp_rate95 (0 to 450) against the number of CPUs (1 to 4) for the same systems as Figure 7.]

Figure 8 SPECfp_rate95 Performance Scaling

Very Large Memory Advantage: Commercial Performance

As shown in Figures 11 and 12, the AlphaServer 4100 performs well in the commercial benchmarks TPC-C and AIM Suite VII.13,14 In addition to the low memory and I/O latency, the AlphaServer 4100 takes advantage of the VLM design in these I/O-intensive workloads: with four CPUs, the platform can support up to 8 GB of memory, compared to 1 GB of memory on the AlphaServer 2100 system with four CPUs and 2 GB with three CPUs.
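The scaling factors quoted in the two preceding sections reduce to simple arithmetic: throughput with N CPUs divided by throughput with one CPU. A minimal sketch follows; the throughput numbers in it are invented placeholders, not measured AlphaServer 4100 results.

    /* SMP scaling and parallel efficiency from per-configuration throughputs. */
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical throughputs at 1, 2, 3, and 4 CPUs (jobs per minute). */
        double rate[] = { 100.0, 195.0, 285.0, 360.0 };

        for (int n = 1; n <= 4; n++)
            printf("%d CPUs: scaling %.2f, efficiency %.0f%%\n",
                   n, rate[n - 1] / rate[0], 100.0 * rate[n - 1] / (n * rate[0]));
        return 0;
    }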

[Figure 9 plots LINPACK 1000 x 1000 performance in MFLOPS (0 to 2,000) against the number of CPUs (1 to 4) for the AlphaServer 4100 5/300E, 5/300, and 5/400, the AlphaServer 2100 5/300, the SGI Origin 2000 R10000 (195 MHz), the IBM ES/9000 VF, and the HP Exemplar S-Class PA 8000 (180 MHz).]

Figure 9 LINPACK 1000 x 1000 Parallel Performance Scaling

[Figure 10 plots parallel SPECfp95 results (0 to 35) against the number of CPUs (1 to 6) for the AlphaServer 4100 5/300E, 5/300, and 5/400, the AlphaServer 2100 5/300, the HP 9000 K420, and the Sun Ultra Enterprise 3000.]

Figure 10 Parallel SPECfp95 Performance Scaling

[Figure 11 is a bar chart of TPC-C throughput in transactions per minute (tpmC, 0 to 7,000) for the IBM RS/6000 J30 (8 CPUs), the Compaq ProLiant 4500/166, the HP 9000 K420, the Sun SPARCserver 2000E, and the AlphaServer 4100 5/400.]

Figure 11 Transaction Processing Performance (TPC-C Using an Oracle Database)

[Figure 12 is a bar chart of AIM Suite VII throughput in jobs per minute (0 to 3,500) for the Compaq ProLiant 4500 Pentium (166 MHz), the Compaq ProLiant 5000 6/200 Pentium Pro (200 MHz), the AlphaServer 2100 5/300, the AlphaServer 4100 5/400, the AlphaServer 4100 5/300*, and the AlphaServer 4100 5/300E*. *These internally generated results have not been AIM certified.]

Figure 12 AIM Suite VII Multiuser/Shared UNIX Mix Performance

Figures 11 and 12 show the AlphaServer 4100 system's TPC-C performance (using an Oracle database) and AIM Suite VII throughput performance as compared to other industry-leading vendors. Note that the performance of the uncached AlphaServer 4100 5/300E is comparable to that of the 300-MHz AlphaServer 2100. (The AlphaServer 2100 system used in this test had three CPUs and 2 GB of memory, whereas the AlphaServer 4100 system had four CPUs and 2 GB of memory.)

With its 2-MB B-cache, the AlphaServer 4100 5/300 improves throughput by 40 percent in the AIM Suite VII benchmark tests as compared to the uncached AlphaServer 4100 5/300E. The AlphaServer 4100 5/400, with its 4-MB B-cache, benefits from its 33 percent faster clock and two times larger B-cache and provides 40 percent improvement over the AlphaServer 4100 5/300. Note that the AlphaServer 4100 5/300 and 5/300E results were obtained through internal testing and have not been AIM certified. The AlphaServer 4100 5/400 results have AIM certification.

Compared to the best published industry AIM Suite VII performance, the AlphaServer 4100 5/300 throughput is almost twice that of the Compaq ProLiant 4500 server, and the AlphaServer 4100 5/400 throughput is more than 50 percent higher than that of the Compaq ProLiant 5000 server.14 At the October 1996 UNIX Expo, the AlphaServer 4100 family won three AIM Hot Iron Awards: for the best performance on Windows NT (for systems priced at more than $50,000) and for the best price/performance in two UNIX mixes, multiuser shared and file system (for systems priced at more than $150,000).14

Cache Improvement on the AlphaServer 4100 System

Figures 13 and 14 show the percentage performance improvement provided by the 2-MB B-cache in the AlphaServer 4100 5/300 as compared to the uncached AlphaServer 4100 5/300E. Figure 13 shows the improvement across a variety of workloads; Figure 14 shows the improvement in individual SPEC95 benchmarks for one and four CPUs.

As shown in Figure 13, the 2-MB B-cache in the AlphaServer 4100 5/300 improves the performance by 5 to 20 percent for one CPU and 25 to 40 percent for four CPUs as compared to the uncached AlphaServer 4100 5/300E system. The benefits derived from having larger caches are significantly greater for four CPUs compared to one CPU, since large caches help alleviate bus traffic in multiprocessor systems.

[Figure 13 is a bar chart of the percent performance improvement (0 to 45) from the 2-MB B-cache for SPECint95 and SPECfp95 (1 CPU and 4 CPUs), SPECint92 and SPECfp92 (1 CPU and 4 CPUs), LINPACK_1K (1 CPU and 4 CPUs), and AIM Suite VII jobs per minute and maximum user loads (4 CPUs).]

Figure 13 Performance Improvement across Various Workloads

The workloads that do not fit in the 2- to 4-MB B-cache (i.e., tomcatv, swim, applu) in Figure 14 run faster on the uncached AlphaServer 4100 than on the cached model (10 percent faster on one CPU and 20 percent faster on four CPUs) due to the overhead for probing the B-cache and the increase in Set Dirty bandwidth. The majority of the other workloads benefit from larger caches.

The AlphaServer 4100 5/400 further improves the performance by increasing the size of the B-cache from 2 MB to 4 MB. In addition, the CPU clock improvement of 33 percent, B-cache improvement of 7 percent in latency and 11 percent in bandwidth, and the memory bus speed improvement of 11 percent together yield an overall 30 to 40 percent improvement in the AlphaServer 4100 model 5/400 performance as compared to that of the AlphaServer 4100 model 5/300.

Large Scientific Applications: Sparse LINPACK

The Sparse LINPACK benchmark solves a large, sparse symmetric system of linear equations using the conjugate gradient (CG) iterative method. The benchmark has three cases, each with a different type of preconditioner. Cases 1 and 2 use the incomplete Cholesky (IC) preconditioner. This workload is representative of large scientific applications that do not fit in megabyte-size caches. The workload is important in large applications, e.g., models of electrical networks, economic systems, diffusion, radiation, and elasticity. It was decomposed to run on multiprocessor systems using the KAP preprocessor.

Figure 15 shows that the uncached AlphaServer 4100 5/300E outperforms the AlphaServer 8400 by 41 percent for one CPU and by 9 percent for two CPUs because of higher delivered system bus bandwidth. However, the AlphaServer 4100 5/300E falls behind with three and four CPUs, as it does in the McCalpin memory bandwidth tests shown in Figure 3. Note that with one CPU, the 300-MHz uncached AlphaServer 4100 performs at the same level as the 400-MHz cached AlphaServer 4100 and performs 18 percent better than the 300-MHz cached AlphaServer 4100. This is an example of the type of application for which the cache diminishes the performance. The AlphaServer 4100 5/300E is a better match for this class of applications than the cached systems.
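The CG iteration at the core of Sparse LINPACK is compact enough to sketch. The following is a minimal unpreconditioned CG over a matrix stored in compressed sparse row (CSR) form; it is not the benchmark source, which adds the preconditioners discussed above and its own data sets. The sparse matrix-vector product streams the entire matrix through the cache once per iteration, which is why megabyte-size caches help little here.

    /* Minimal unpreconditioned conjugate gradient on a CSR sparse matrix:
     * a sketch of the Sparse LINPACK kernel, not the benchmark itself. */
    #include <math.h>
    #include <stdio.h>

    /* y = A*x for a symmetric matrix in compressed sparse row form. */
    static void spmv(int n, const int *rowptr, const int *col,
                     const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += val[k] * x[col[k]];   /* streams val[] and col[] from memory */
            y[i] = s;
        }
    }

    static double dot(int n, const double *u, const double *v)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += u[i] * v[i];
        return s;
    }

    /* Solve A*x = b; x holds the initial guess on entry, the solution on exit. */
    static int cg(int n, const int *rowptr, const int *col, const double *val,
                  const double *b, double *x, double *r, double *p, double *ap,
                  int maxit, double tol)
    {
        spmv(n, rowptr, col, val, x, r);
        for (int i = 0; i < n; i++) { r[i] = b[i] - r[i]; p[i] = r[i]; }
        double rs = dot(n, r, r);

        for (int it = 0; it < maxit; it++) {
            if (sqrt(rs) < tol) return it;                 /* converged */
            spmv(n, rowptr, col, val, p, ap);
            double alpha = rs / dot(n, p, ap);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * ap[i]; }
            double rs_new = dot(n, r, r);
            for (int i = 0; i < n; i++) p[i] = r[i] + (rs_new / rs) * p[i];
            rs = rs_new;
        }
        return maxit;
    }

    int main(void)
    {
        /* Tiny 3x3 symmetric positive-definite example: tridiagonal [2 -1 0; -1 2 -1; 0 -1 2]. */
        int rowptr[4] = { 0, 2, 5, 7 };
        int col[7]    = { 0, 1,  0, 1, 2,  1, 2 };
        double val[7] = { 2, -1,  -1, 2, -1,  -1, 2 };
        double b[3] = { 1, 0, 1 }, x[3] = { 0, 0, 0 };
        double r[3], p[3], ap[3];
        int it = cg(3, rowptr, col, val, b, x, r, p, ap, 100, 1e-10);
        printf("converged in %d iterations: x = %.3f %.3f %.3f\n", it, x[0], x[1], x[2]);
        return 0;
    }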

[Figure 14 is a bar chart of the percent performance improvement (-20 to 120) from the 2-MB B-cache for each SPECint95 benchmark (147.vortex, 134.perl, 132.ijpeg, 130.li, 129.compress, 126.gcc, 124.m88ksim, 099.go) and each SPECfp95 benchmark (146.wave5, 145.fpppp, 141.apsi, 125.turb3d, 110.applu, 107.mgrid, 104.hydro2d, 103.su2cor, 102.swim, 101.tomcatv), with one bar for 1 CPU and one for 4 CPUs.]

Figure 14 SPEC95 Performance Improvement from a 2-MB B-Cache

Image Rendering

The AlphaServer 4100 shows significant performance advantage in image rendering applications compared to the other industry-leading vendors. Figure 16 shows that the AlphaServer 4100 5/400 system is approximately 4 times faster than the Sun SPARC system that was used in the movie Toy Story, as measured in RenderMarks.15 The AlphaServer 4100 is 2.6 times faster than the Silicon Graphics POWER CHALLENGE system and 2.4 times faster than the HP/Convex Exemplar SPP-1200 system on the Mental Ray image rendering application from Mental Images. These image rendering applications take advantage of larger caches, and the performance improves as the cache size increases, particularly with four CPUs.

Performance Counter Profiles

The figures in this section, Figures 17 through 22, show the performance statistics collected using the built-in Alpha 21164 performance counters on the AlphaServer 4100 5/400 system. These hardware monitors collect various events, including the number and type of instructions issued, multiple issues, single issues, branch mispredictions, stall components, and cache misses.3,16,17 These statistics are useful for analyzing the system behavior under various workloads. The results of this analysis can be used by computer architects to drive hardware design trade-offs in future system designs.

The SPEC95 cycles per instruction (CPI) data presented in Figure 17 shows lower CPI values for the integer benchmarks (CPI values of 0.9 to 1.5) than for the floating-point benchmarks (CPI values of 0.9 to 2.2). The CPI in commercial workloads (e.g., TPC-C) is higher than in the SPEC benchmarks, primarily because commercial workloads have a higher stall time, as shown in Figure 18. Note that the performance counter statistics were collected with four CPUs running TPC-C (with a Sybase database), while SPEC95 statistics were collected on a single CPU.

The Alpha 21164 has two integer and two floating-point pipelines and is capable of issuing up to four instructions simultaneously.

[Figure 15 is a bar chart of Sparse LINPACK performance in MFLOPS (0 to 70) for 1 to 4 CPUs on the AlphaServer 4100 5/300E, 5/300, and 5/400 and the AlphaServer 8400.]

Figure 15 Sparse LINPACK Performance

[Figure 16 is a bar chart of image rendering performance in Pixar RenderMarks (0 to 2,500), with 1-CPU and 4-CPU results, for systems including the IBM RS/6000 390, the HP 9000 735 (125 MHz), and the SGI CHALLENGE R4400 (200 MHz).]

Figure 16 Image Rendering Performance

The integer pipeline 0 executes arithmetic, logical, load/store, and shift operations. The integer pipeline 1 executes arithmetic, logical, load, and branch/jump operations. The floating-point pipeline 0 executes add, subtract, compare, and floating-point branch instructions. The floating-point pipeline 1 executes multiply instructions. The time distribution illustrated in Figure 18 indicates that most of the issuing time is spent in single and dual issuing. Triple and quad issuing is noticeable in several floating-point benchmarks, but, on average, only 3 percent of the time is spent on triple and quad issuing in the SPECfp95 benchmarks.


TPC-C

SPECINT9S VORTEX PERL M88KSIM

Ll IJPEG GO GCC COMPRESS

SPECFP9S WAVES TURB3D TOMCATV SWIM SU2COR MGRID HYDR02D FPPPP A PSI APPLU

0 o.s 1.0 1.S 2.0 2.S 3.0 3.S 4.0 CYCLES PER INSTRUCTION

Figure 17 SPEC95 Cycles-per-instruction Comparison

[Figure 18 is a stacked bar chart of the time distribution (0 to 100 percent) for TPC-C, each SPECint95 benchmark, and each SPECfp95 benchmark, divided into single issue, dual issue, triple issue, quad issue, dry stall, and frozen stall.]

Figure 18 Issuing and Stall Time

Digit,ll T�chnical Journal Vol . 8 No. 4 1996 15 The stall time (dry plus fr ozen stalls in Figure 18) categories: load (both floating-point and integer), is higher in the floating-point benchmarks than in store (both floating-point and integer), integer (all the integer benchmarks and higher in the TPC-C integer instructions, excluding ones with only R3 l or benchmarks than in the SPEC95 benchmarks. Dry literal as operands), branch (all branch instructions stalls include instruction stream (!-stream) stalls including unconditiorral), and floating-point (except caused by the branch mispredictions, program counter floating-point load and store instructions). Figure 19 (PC) mispredictions, replay traps, I-stream cache shows the percentage ofin structions in each category misses, and exception drain. Frozen stalls include data relative to the total number of instructions executed. stream (D-stream) stalls caused by D-stream cache Note that load/store instructions account fo r 30 to misses as well as register conflicts and unit busy. Dry 40 percent of all instructions issued. Integer instruc­ stalls are higher in SPECint95 and TPC-C (mainly tions are present in both integer and floating-point because of I -stream cache misses and replay traps), benchmarks, but no floating-point instructions exist in whereas frozen stalls are higher in SPEC!p95 and the SPECint95 and commercial TPC-C workloads. TPC-C (mainly because of D-stream cache misses). The integer and commercial workloads execute more The Alpha 21164 microprocessor reduces the per­ branches, while the branch instructions make up only fo rmance penalty due to cache misses by implement­ a kw percent of all instructions issued in the floating­ ing a brge, 96-KB on-chip S-cache:'' This cache is point workloads. three-way set associative and contains both instruc­ The cache misses shown in Figure 20 are higher tions and data. The fo ur-entry prdetch bufter allows in the floating-point benchmarks than in the inte­ prefetching of the next fo ur consecutive cache blocks ger benchmarks. The I -cache misses arc low in the on an instruction cache (I-cache) miss. This reduces floating-point benchmarks (except tor !pppp) :1 11d the penalty tor !-stream stalls. The six-entry miss higher in the SPECint95 benchmarks and the TPC-C address file(MAF) merges loads in the same 32-bytc benchmark. The D-cache misses are high in the major­ block and allows servicing multiple load misses with ity of the benchmarks, which indicates that a larger D­ one data fill.A six-entry write buffer is used to reduce cachc would reduce D-stream misses. The TPC-C the store bus traffic and to aggregate stores into benchmark would benefitfr om a larger 5-cache ami 32-byte blocks.'"' taster R-oche, since the number of 5-cachc misses is Figure 19 shows the instruction mix in SPEC95. high. The B-cache misses are negligible in the The Alpha instructions are grouped into the fo llowing SPECint95 benchmarks and higher in the majority of

[Figure 19 is a stacked bar chart of the instruction mix (0 to 100 percent of instructions) for TPC-C, each SPECint95 benchmark, and each SPECfp95 benchmark, divided into stores, loads, integer operations, floating-point operations, and branches.]

Figure 19 SPEC95 Instruction Profiles

[Figure 20 is a bar chart of cache misses per 1,000 instructions for TPC-C, each SPECint95 benchmark, and each SPECfp95 benchmark, divided into I-cache, D-cache, S-cache, and B-cache misses.]

Figure 20 Cache Misses

The cache misses shown in Figure 20 are higher in the floating-point benchmarks than in the integer benchmarks. The I-cache misses are low in the floating-point benchmarks (except for fpppp) and higher in the SPECint95 benchmarks and the TPC-C benchmark. The D-cache misses are high in the majority of the benchmarks, which indicates that a larger D-cache would reduce D-stream misses. The TPC-C benchmark would benefit from a larger S-cache and faster B-cache, since the number of S-cache misses is high. The B-cache misses are negligible in the SPECint95 benchmarks and higher in the majority of the SPECfp95 and TPC-C benchmarks. This data indicates that complex commercial workloads, such as TPC-C, are more profoundly affected by the cache design than simpler workloads, such as SPEC95.

The replay traps are generally caused by (1) full write-buffer (WB) traps (a full write buffer when a store instruction is executed) and full miss address file (MAF) traps (a full MAF when a load instruction is executed); and (2) load traps (speculative execution of an instruction that depends on a load instruction, and the load misses in the D-cache) and load-after-store traps (a load following a store that hits in the D-cache, and both access the same location).3 The replay traps and branch/PC mispredictions shown in Figure 21 are not the major reason for the high stall time in the commercial workloads (TPC-C), since traps and mispredictions are higher in some of the SPECint95 benchmarks than in TPC-C. Instead, a high number of cache misses (see Figure 20) correlates well with the high stall time and CPI (see Figure 17) in TPC-C.

Figure 22 shows the estimated stall components in SPEC95 and TPC-C. A time-allocation model is used to analyze the performance effect of different stall components. The total execution time is divided into two components: the compute component (where the CPU is issuing instructions) and the stall component (where the CPU is not issuing instructions). The stall component is further divided into the dry and frozen stalls:

time    = compute + stall
compute = single + dual + triple + quad issuing
stall   = dry + frozen
dry     = branch mispredictions + PC mispredictions + replay traps
          + I-stream cache misses + exception drain stalls
frozen  = D-stream cache misses + register conflicts and unit busy

The branch and PC mispredictions affect the performance of SPECint95 workloads (6 percent of the time is spent in branch and PC mispredictions in SPECint95) and have little effect on the performance of SPECfp95 workloads (less than 1 percent of the time) and the TPC-C benchmark (1.4 percent of the time). The SPECint95 workloads are affected primarily by the load traps, whereas the SPECfp95 benchmarks are affected by both load and WB/MAF traps. Note that the time spent on a load replay trap is overlapped with the load-miss time.

The S-cache and B-cache stalls are high in the SPECfp95 and TPC-C benchmarks, where the stall time is dominated by the B-cache and memory latencies.
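The model's bookkeeping is plain addition over counter-derived components. The sketch below applies the equations above to made-up cycle counts, producing a stall budget of the kind plotted in Figure 22; none of the numbers are measured AlphaServer 4100 data.

    /* The time-allocation model applied to hypothetical per-component cycle
     * counts.  The values are invented for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        /* Compute component: cycles spent issuing 1, 2, 3, or 4 instructions. */
        double single = 30, dual = 20, triple = 2, quad = 1;
        /* Dry-stall component. */
        double br_mispred = 3, pc_mispred = 1, replay = 4, istream_miss = 6, drain = 1;
        /* Frozen-stall component. */
        double dstream_miss = 25, reg_conflict = 7;

        double compute = single + dual + triple + quad;
        double dry     = br_mispred + pc_mispred + replay + istream_miss + drain;
        double frozen  = dstream_miss + reg_conflict;
        double total   = compute + dry + frozen;

        printf("compute %.0f%%  dry %.0f%%  frozen %.0f%%\n",
               100 * compute / total, 100 * dry / total, 100 * frozen / total);
        return 0;
    }

Once the components are calibrated against the hardware counters, varying one term (say, the D-stream miss cycles under a larger cache) predicts the new CPI without a full architectural simulation, which is the use the authors describe in the Conclusion.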

[Figure 21 is a bar chart of replay traps and branch/PC mispredictions per 1,000 instructions (0 to 80) for TPC-C, each SPECint95 benchmark, and each SPECfp95 benchmark, divided into LDU replay traps, WB/MAF replay traps, branch mispredictions, and PC mispredictions.]

Figure 21 Replay Traps and Branch/PC Mispredictions

[Figure 22 is a stacked bar chart of the estimated stall time distribution (0 to 100 percent of total time) for TPC-C, each SPECint95 benchmark, and each SPECfp95 benchmark, divided into branch and PC mispredictions, LDU replay traps, WB/MAF replay traps, I-cache misses to S-cache, D-cache misses to S-cache, S-cache misses to B-cache, B-cache misses to memory, and register conflict and unit busy.]

Figure 22 Estimated Stall Time Distribution

18 Dip:itJI Technical JournJI Vo l. 8 No. 4 1996 data fr om memory (close to 40 percent) in several of 2. M. Steinman, G. Harris, A. Kocev, V. Lamere, and the SPECfp95 benchmarks that do not fit in a 4-MB R. Pannell, "The AlphaServer 4100 Cached Processor cache. Although it contributes to the high SPECfP95 J'vlodule Architecture and Design," Digital Te chnical stall time, the memory component has a negligible Jo urnal, vol. 8, no. 4 (1996, this issue): 21-37. effect on SPECint95 performance, since these bench­ 3. Alpha 21 164 Microprocessor Ha rdware Reference marks generate only a small number of B-cache misses J\!/a nual (Maynard, Mass.: Digital Equipment Corpo­ (see Figure 20). Figure 22 indicates that stalls caused ration, Order No. EC-QAEQA-TE, 1994). by cache misses are the largest component of the total 4. J. Edmondson, P. Rubinfeld, and V. Rajagopalan, stall time; therefore, reducing cache misses and "Superscalar Instruction Execution in the 21164 improving cache and memory latencies would yield Alpha Microprocessor," Il::tl:: Jl1icro, vol . 15, no. 2 the largest performance benefit. (April 1995 ). Once calibrated and validated with measurements, 5. R. Sires, cd., Alpha Architecture Reference Ma nual this model is an effective tool fo r evaluating the perfor­ (Burlington, Mass.: Digital Press, ISBN 1-55558-098-X, mance impact of various components on the overall 1992). system design. System architects can vary parameters, like the cache or memory access times or cache size, 6. SPEC9 5 Benchmarks (Manassas, Va .: Standard Perfor­ and adjust the appropriate stall component to predict mance Evaluation Corporation, 1995 ). performance of alternative designs without carrying 7. J. Dongarra, "Performance of' Va rious Computers out detailed and often time-consuming architectural Using Standard Linear Equation Sofrware" (Oak simulations. Ridge, Te nn .: Oak Ridge National Laboratory, 1996).

8. UNIX System Price Performance Guide (Menlo Park, Conclusion Calif.: AIM Technology, Summer 1996 ).

Using several performance metrics and a variety of 9. J. Gray, ed ., The Ha ndbool< fo r Database and workloads, we have demonstrated that the DIGITAL Tra nsaction Process ing Systems (San Mateo, Calif.: JVlorgan Kau ftinan, 1991 ). AJphaServer41 00 fa mily of midrange servers provides significant pertormance improvements over the 10. Information about the 1m bench suite of benchmarks previous-generation AJphaServer platform and pro­ is available at http://rea liry.sgi .com/employees/ vides performance leadership compared to the leading I m_engr /lm bench/whatis_l m bench. html. industry vendors' platforms. The major AJphaServer 11. The STREAM benchmark program is described 4100 performance strengths are the low memory and on-line by the University of Virginia, Department I/0 latency and high memory bandwidth, the large­ of Computer Science (Charlottesville, Va .) at memory support (Vl,M), and the fast Alpha 21164 http:/jww\v .cs.virginia.edu/stream. microprocessor. The work described in this paper has 12. The Standard Performance Evaluation Corporation led ro design changes that are expected to be imple­ (SPEC) makes available submitted results, benchmark mented in tlHure versions of the AJphaServer 4100 descriptions, background information, and tools at platform. The anticipated performance benefits will http:/jwww. specbench.org. come ti-oma fa ster CPU, fa ster and larger caches, faster memory, and improved memory bandwidth. 13. Information about the Transaction Processing Performance Council (TPC) is available ar http://

www.tpc.org. Acknowledgments 14. Information about system performance benchmarking The authors would like to acknowledge the contribu­ products from AIM Technology, Inc. (Menlo Park, tions of John Shakshober, Dave Stanley, Greg Tarsa, Calif) is available at http:/jwww .aim.com. Dave Wilson, Paula Smith, John Henning, Michael 15. Information about Pixar An imation Studio's Delaney, and Huy Phan fo r providing many of the RenderMark benchmark is available at http:// benchmark measurements. In addition, special thanks www.europe.digital.com/info/alphaserver/news/ go to Maurice Steinman, Glenn Herdeg, and Ted pixar.html.

Gent fo r dedicating system resources and to Masood 16. Z. Cvetanovic and D. Bhandarkar, "Characterization Heydari tor supporting this work. of Alpha f\.,'\ P Performance Using TP and SPEC Wo rk­ loads," 77Je 21st Annual International Symposium References on Co mputer Architecture (April 1994 ): 60-70.

17. Z. Cvetanovic and D. Bhandarkar, "Performance l. G. He:rdeg, "Design and Implementation of the Characterization of the Alpha 21164 Microprocessor AlphaSe:rver 4100 CPU and Memory Architecture," Using TP and SPEC Wo rkloads," 77Je Seco nd Digital Te chnical jo urnal, vol . 8, no. 4 ( 1996, this International Symposium on High-Performance issue): 48-60. Co mputerAr chitecture (February 1996 ): 270-280.

Biographies

Zarka Cvetanovic
A consulting engineer in DIGITAL's Server Product Development Group, Zarka Cvetanovic was responsible for the performance characterization and analysis of the AlphaServer 4100, AlphaServer 8400/8200, AlphaServer 2100, DEC 7000, VAX 7000, and VAX 6000 systems, and for the performance modeling and definition of future AlphaServer platforms. Since joining DIGITAL in 1986, she has been involved in the development of fast database applications and efficient parallel applications for multiprocessor systems. Zarka received a Ph.D. in electrical and computer engineering from the University of Massachusetts, Amherst. She has published over a dozen technical papers at computer architecture conferences and in leading industry journals.

Darrel D. Donaldson
Darrel Donaldson is a senior consulting engineer and the technical leader and engineering manager for the AlphaServer 4100 project. He joined DIGITAL in 1983 and served as the lead technologist for the VAX 6000, VAX 7000, AlphaServer 7000, and AlphaServer 4100 projects. Darrel has a bachelor's degree in mathematics/physics from Miami University and a master's degree in electrical engineering from Cincinnati University, Cincinnati, Ohio. He holds 12 patents and has 10 patents pending, all related to protocols, signal integrity, and chip transceiver design for multiprocessor systems and nonvolatile memory chip design. Darrel maintains membership in the IEEE Electron Devices Society and the Solid State Circuits Society.

Maurice B. Steinman
George J. Harris
Andrej Kocev
Virginia C. Lamere
Roger D. Pannell

The AlphaServer 4100 Cached Processor Module Architecture and Design

The DIGITAL AlphaServer 4100 processor module uses the Alpha 21164 microprocessor series combined with a large, module-level backup cache (B-cache). The cache uses synchronous cache memory chips and includes a duplicate tag store that allows CPU modules to monitor the state of each other's cache memories with minimal disturbance to the microprocessor. The synchronous B-cache, which can be easily synchronized with the system bus, permits short B-cache access times and data transfer to or from main memory without the need for re-synchronization or data buffering.

The DIGITAL AlphaServer 4100 series of servers represents the third generation of Alpha microprocessor-based, mid-range computer systems. Among the technical goals achieved in the system design were the use of four CPU modules, 8 gigabytes (GB) of memory, and partial block writes to improve I/O performance. Unlike the previous generation of mid-range servers, the AlphaServer 4100 series can accommodate four processor modules while retaining the maximum memory capacity. Using multiple CPUs to share the workload is known as symmetric multiprocessing (SMP). For a system to be perfectly scalable, the performance of a system with four CPUs would have to be exactly four times that of a single-CPU system. One of the goals of the design was to keep scalability as high as possible yet consistent with low cost. For example, the AlphaServer 4100 system

achieves a scalability factor of 3.33 on the Linpack benchmark, an improvement compared with the previous generation of mid-range servers.2 The new memory is also faster in terms of the data volume flowing over the bus (bandwidth) and data access time (latency). Again, compared with the previous generation, available memory bandwidth is improved by a factor of 2.7 and latency is reduced by a factor of 0.6.

In systems of this class, memory is usually addressed in large blocks of 32 to 64 bytes. This can be inefficient when one or two bytes need to be modified because the entire block might have to be read out from memory, modified, and then written back into memory to achieve this minor modification. The ability to modify a small fraction of the block without having to extract the entire block from memory results in partial block writes. This capability also represents an advance over the previous generation of servers.
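The inefficiency just described is easy to quantify. The sketch below counts the bytes carried on the memory bus when a program updates a 2-byte field, with and without partial block writes; the block size follows the text, and the rest of the accounting is illustrative rather than a description of the actual bus protocol.

    /* Bus bytes moved to update a small field, with and without partial writes. */
    #include <stdio.h>

    int main(void)
    {
        int block = 64;       /* bytes per memory block */
        int modified = 2;     /* bytes the program actually changes */

        /* Without partial writes: read the whole block, then write it back. */
        int rmw_bytes = 2 * block;
        /* With partial writes: only the modified bytes need to cross the bus. */
        int partial_bytes = modified;

        printf("read-modify-write: %3d bus bytes\n", rmw_bytes);
        printf("partial write:     %3d bus bytes\n", partial_bytes);
        return 0;
    }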

To take full advantage of the Alpha 21164 series of microprocessors, a new system bus was needed. The bus used in the previous generation of servers was not fast enough, and the cost and size of the bus used in high-end servers was not adaptable to mid-range servers.

Three separate teams worked on the project. One team defined the system architecture and the system bus, and designed the bus control logic and the memory modules.3 The second team designed the peripheral interface (I/O), which consists of the Peripheral Component Interconnect (PCI) and the Extended Industry Standard Architecture (EISA) buses, and its interface to the system bus (I/O bridge).4 The third team designed the CPU module. The remainder of this paper describes the CPU module design in detail. Before delving into the discussion of the CPU module, however, it is necessary to briefly describe how the system bus functions.

The system bus consists of 128 data bits, 16 check bits with the capability of correcting single-bit errors, 36 address bits, and some 30 control signals. As many as 4 CPU modules, 8 memory modules, and 1 I/O module plug into the bus. The bus is 10 inches long and, with all modules in place, occupies a space of 11 by 13 by 9 inches. With power supplies and the console, the entire system fits into an enclosure that is 26 by 12 by 17.5 inches in dimension.

CPU Module

The CPU module is built around the Alpha 21164 microprocessor. The module's main function is to provide an extended cache memory for the microprocessor and to allow it to access the system bus. The microprocessor has its own internal cache memory consisting of a separate primary data cache (D-cache), a primary instruction cache (I-cache), and a second-level data and instruction cache (S-cache). These internal caches are relatively small, ranging in size from 8 kilobytes (KB) for the primary caches to 96 KB for the secondary cache. Although the internal cache operates at microprocessor speeds in the 400-megahertz (MHz) range, its small size would limit performance in most applications. To remedy this, the microprocessor has the controls for an optional external cache as large as 64 megabytes (MB) in size. As implemented on the CPU module, the external cache, also known as the backup cache or B-cache, ranges from 2 MB to 4 MB in size, depending on the size of the memory chips used. In this paper, all references to the cache assume the 4-MB implementation.

The cache is organized as a physical, direct-mapped, write-back cache with a 144-bit-wide data bus consisting of 128 data bits and 16 check bits, which matches the system bus. The check bits protect data integrity by providing a means for single-bit-error correction and double-bit-error detection. A physical cache is one in which the address used to address the cache memory is translated by a table inside the microprocessor that converts software addresses to physical memory locations. Direct-mapped refers to the way the cache memory is addressed, in which a subset of the physical address bits is used to uniquely place a main memory location at a particular location in the cache. When the microprocessor modifies data in a write-back cache, it only updates its local cache. Main memory is updated later, when the cache block needs to be used for a different memory address. When the microprocessor needs to access data not stored in the cache, it performs a system bus transaction (fill) that brings a 64-byte block of data from main memory into the cache. Thus the cache is said to have a 64-byte block size.

Two types of cache chips are in common use in modern computers: synchronous and asynchronous. The synchronous memory chips accept and deliver data at discrete times linked to an external clock. The asynchronous memory elements respond to input signals as they are received, without regard to a clock. Clocked cache memory is easier to interface to the clock-based system bus. As a result, all transactions involving data flowing from the bus to the cache (fill transactions) and from the cache to the bus (write transactions) are easier to implement and faster to execute. Across the industry, personal computer and server vendors have moved from the traditional asynchronous cache designs to the higher-performing synchronous solutions. Small synchronous caches provide a cost-effective performance boost to personal computer designs. Server vendors push synchronous-memory technology to its limit to achieve data rates as high as 200 MHz; that is, the cache provides new data to the microprocessor every 5 nanoseconds.5,6 The AlphaServer 4100 server is DIGITAL's first product to employ a synchronous module-level cache.

At power-up, the cache contains no useful data, so the first memory access the microprocessor makes results in a miss. In the block diagram shown in Figure 1, the microprocessor sends the address out on two sets of lines: the index lines connected to the cache and the address lines connected to the system bus address transceivers. One of the cache chips, called the TAG, is not used for data but instead contains a table of valid cache-block addresses, each of which is associated with a valid bit. When the microprocessor addresses the cache, a subset of the high-order bits addresses the tag table. A miss occurs when either of the following conditions has been met.

1. The addressed valid bit is clear, i.e., there is no valid data at that cache location.

2. The addressed valid bit is set, but the block address stored at that location does not match the address requested by the microprocessor.
22 Digital Tec hnical journal Vo l. 8 No. 4 l996 �------;

BUS ARBITER I : I I ------" ------I � I r-- I TAG RAM I PROGRAMMABLE INDEX I - LOGIC I I -1 WRITE ENABLE, I ALPHA 21164 I I I OUTPUT ENABLE MICROPROCESSOR t DATA RAMS I I I CLOCK ASIC (VCTY) I ______I r I J SYSTEM ADDRESS t I AND DTAG RAM 144-BIT COMMAND DATA BUS SNOOP ADDRESS I I I DATA TRANSCEIVER I I ADDRESS TRANSC8VER I SYSTEM BUS !

1 Figure CPU Module that results in the address being sent th<.:system bus. command (a hit in DTAG ), the signal MC�SHARED to Th<.: m<.:mory r<.:c<.:ives this address and after a delay may be asserted on the system bus by VCTY. If that (memory lat<.:ncy), it sends the data on the system bus. location has been modified by the microprocessor, Data transe<.:ivers on the CPU modul<.: rec<.:ive the then MC_DIRTY is asserted. Thus each CPU is aware data and start a cache fill transaction that r<.:suIt s of the state of all the caches on the system. Other in 64 byt<.:s (a cache block) being written into th<.:cache acrions also take place on the module as part of this as t(>ur consecutive 128 -bit words \-vith their associated process, which is explained in greater detail in the sec­ check bits. tion dealing specifically with the VCTY. In an SMP system, two or more C:PUs may have the Because of the write-back cache organization, a spe­ same data in their cache memories. Such data is known cial type of miss transaction occurs when new data as shared , and the shared bit is set in th<.: TAG tc>r that needs to be stored in a cache location tbat is occupied address. The cache protocol used in the AlphaScrver by dirty data. The old data needs to be put back into 4100 s<.:riesof servers allows each CPU to modifY<.:ntries the main memory; otherwis<.:, the changes that the in its own cache. Such modified data is known as djny, microprocessor made will be lost. The process of and th<.:di rty bit is set in t e TAG . If the data about to be returningtl 1at data memory is called a victim write­ h to modified is shared, however, th<.: microproc<.:ssor resets back transaction, and the cache location is said to be the shar<.:d bit, and other CPUs invalidate that data in victimized. This process involves moving data out of their own cach<.:s. The need is thus apparent t( >r a way the cache, through the system bus, and into the main to k:t all Cl'Us keep track of data in oth<.:r caches. This memory, fo llowed by n<.:w data moving fr om the main is accomplished by the process known as snooping, memory into the cache as in an ordinary filltransac­ aid<.:d bys<.:veral dedicated bus signals. tion. Completing this fillqui ckly reduces the tim<.: that To facilitate snooping, a separate copy of the TAG is the microprocessor is waiting fo r the data. To speed up maintain<.:d in a dedicated cache chip, calk:d duplicate this process, a hardware data bufkr on the module is tag or DTAG . DTAG is controlkd by an application­ used fo r storing the old data while the new data is specificin tegrated circuit (ASIC) calkd VCTY. VCTY being loaded into the cache. This buffer is physically and DTAG arc located next to <.:ach other and in close a part of the data transceiver since each bit of the trans­ proximity to the address transc<.:iv<.:rs. Their timing is ceiver is a shift register fo ur bits long. One hundred tied the system bus so that th<.: address associated twenty-eight shiftregisters can hold the entire cache to with a bus transaction can easily applied the block (512 bits) of victim data while the new data is lx to DTAG, which is a synchronous memory d<.:vice, and being read in through the bus receiver portion of the th<.:stat<.: of the cache at that address can be read out. data transceiver chip. 
In this manner, the microproces­ If that cache location is valid and the addr<.:ss that is sor does not have to wait the victim data is trans­ until stor<.:d in the DTAG matches that of the system bus tcrred along the system bus and into the main memory

Digital T� (hni,al journal Vo l. 8 No. 4 1996 23 bd()re the fill portion of the tra nsanion can take pl ace . occurred had it not been delayed by one microproces­ When the fill is completed, the victim data is shifted sor cycle, and the address at the RAM is further delayed out of the victim bufte r and into the main memory. by index bufkr and network dcl:�ys. Index setup at the This is known as Jn exchange, since the victi m write­ RAM satisfies the minimum setup time required by the back and fillt ransactions execute on the system bus in c hip , and so does address hold. D:�ta is shown as reverse of the order that was initiated by the micro­ :1ppearing after data access time (a chip property), and processor. The transceiver has a signal called BYPASS; data setup at tbe microprocessor is also illustrated. when asserted, it causes three of the to ur bits of the victim shiftr egister to be bypassed. Consequently, t()r VCTY ordi nary block write tr:msactions, the transcei\'Cr oper­ ates without im·olving the \'ictirnbuf ter. As described earlier, a dup l icne copv of the micro­

processor's pri ma rv TAG is m:� inta i ned in the DTAG

B-Cache Design RAM. If DTAG were not prese nt, cJch bus address would have to be applied by the microprocessor to the As previously mentioned , the B-cache uses synchro­ TAG ro decide if the datJ at this address is p resent in nous random-access memory ( RAJ'vl ) devices. Each the B-cache. This activity would i mpose a very large

device requ i res a cloc k that loads signal inp uts i nto !oJd on the microprocessor, th us red uci ng the amount

a register. The RAM operates in the registered inpu t, of usdi. d work it could pcrt(mn. The 111�1in p u rpose of flow-throug h output mode. This means that an input the DTAG and its supporting logic contained in the flip-flop captures addresses, write enables, and write VCTY is to relie\·e the microprocessor ti·om h avi ng to datJ, but the internal RAM arrJy drives read ou tp ut exam i ne each address presented bv the svstem bus. d ata directly as soon as it becomes availabl e, withou t The microprocessor is only interrupted when its pri­ n:gard to the clock. The output enable signal acti\'atcs mary TAG must be u pdated or \\'hen data must be RA1vl output drin:rs asynchronously, indepcndemly of prm·ided to satisf)' tbe bus request. the clock. One of the fi.mdamenral properties of clocked logic VCTYOperation is the requ iremem t()r the data to be present to r some The VCTY contai ns a system bus interrace consisting of

defined time (setup time ) bdon: the clock edge, �m d to the system bus command and address signals, as well as

remain u nchanged tor another imervJI fol lowi ng the some system bus control signa ls req ui red tor the VCTY

clock edge (hold time). Obviously, to meet the setup to monitor each system bus tr�msaction. There is also time, the clock must arrive at the RAMsome time after :�n interrace to the mi croprocessor so that the VCTY

the data or other sign als needed by the RAM . To h elp can send commands to the microprocessor (system -to­

the module designer meet this requirement, the micro­ CPU commands) and monitor the commands and processor may delay the RAM clock by one internal :�ddresscs issued by the microprocessor. Last but not microprocessor cycle time (approximately 2.5 nanosec­ l e.l st, a bidirectional interface between the VCTY and onds). A programmable register in the microprocessor the DTAG allows only those system bus addresses that controls whether or not this de lay is invoked. This require action to reach the microprocessor. delay is used in the AlphaServer 4100 series CPU mod­ While monitoring the system bus f(>r commands

ules, and it eliminates the need t<>r ex ternal delay lines. ti·om other nodes, the VCTY checks tor matches for increased data bandwidth, the cache chips used between the received system bus �1 ddress Jnd the data on CPU mod u les are designed to overlap portions of fr om the DTAG lookup . A DTAC; lookup is initiated successive data accesses. The first d ata block becomes anyti me a valid svstem bus address is received bv the available Jt the microprocessor input after a de lay module. The DTAG location rc>r the lookup is selected equal to the BC_READ_SPEED p arameter, wh ich is by using system bus Address<2 1 :6> as the index into

preset at power-up. The t()l lowing data blocks arc the DTAG. Ifthe DTAG locJtion had previously been

! Jtched after a shor ter delay, BC_READ_SPEED­ marked valid , and there is J 111�1tch between the WAVE . The BC_READ_Sl'EED is set at 10 micro­ received system bus Addrcss<38:22> and the data processor cycles and the WAV E val ue is set to 4, so that ti·om the DTAG lookup, then the block is present in BC_READ_SPEED-WAV E is 6. Thus, after the first the microprocessor's cache. This scenario is caJ icd a de l ay of 10 microprocessor cyc les, successive data cache hit.

blocks arc delivered every 6 microprocessor cycles. In parall el with this, the VCTY decodes the received Figure 2 illustrates these concepts . system bus command to determine the appropriate In Figure 2, the RAM clock at the microprocessor is updJte to the DTAG and determine the correct system dciJyed by one microprocessor cycle. The RAM clock bus response and CPU command needed to mai ntai n

Jt the RAM device is further delayed by clock bufkr syste m- wide cache coherency. A tew cases are i l l us­ and network de lays on the mod u le. The address at the trated here, without any attempt at a comprehensive microprocessor is driven where the clock would have discussion of a l l possible tra nsactions.

24 Digir.tl Tcchniol )oumal Vol. 8 No. 4 1996 6 6 MICROPROCESSOR 10 6 CYCLES

MICROPROCESSOR CLOCK

RAM CLOCK AT MICROPROCESSOR

INDEX AT MICROPROCESSOR ...... ' . . ' ...... ' . . : : : ..: : :' :. :. . . . :... : : : : :: : : • • ••••••• : � : "DEX A ' DEX O "DEX 'NO EX ' X. INDEX 3 ;;�� j � .! ' X ·� , . AT RAM INDEX HOLD AT RAM RAM CLOCK AT RAM I

DATA ACCESS TIME TO MICROPROCESSOR

DATA AT DATA DATA MICROPROCESSOR 1

Figure 2 Cache Read Tra nsaction Sho\\'ing Timing

Assume that the DTAG shared bit is �( >LIIld ro be set microprocessor. Since these transactions arc relatively at this address, the dirty bit is not set, and the bus infrequent, the DTAG saves J great deal of microproces­ command indicates a write transaction. The OTAG sor time and improves over:tll system performance. valid bit is then reset by the VCTY, and the micro­ If the VCTY detects that the command originated processor is int errupted to do the same in rhe TA G. trom the microprocessor co- resident on the module, Ifrhe dirty bit is fo und to be set, and the command then the block is not checked t< >r a hit, but the com­ is a rc1d, rhe MC_DIRTY_EN signal is asserted on rhe mand is decoded so that the DTAG block is updated system bus to tell the other CPU that the loc:�tion it is (if already valid) or allocated ( i .c., marked valid, if not trving to access is in cache and has been modified lw already valid). In the latter case, a filltra nsaction td ­ this CPU. At the same time, a signal is sent to the lows and the VC:TY writes rhe valid bit into the TAG

Digital T..:chni..:al [ourn.tl Vol. 8 No. 4 1996 25 DTAG and primary TAG, the microprocessor interface DTAG Initializa tion signal, EV_ABUS_REQ, is asserted in cycles 5 and 6 of Another important feature built into the VC:TY design that system bus transaction, with the appropriate is a cursory self-test and initialization of the DTAG. system-to-CPU command being driven in cycle 6. The Aftersystem reset, the VCTY writes all locations of the actual update to the DTAG occurs in cycle 7, as does DTAG with a unique data pattern,and then reads the the allocation of blocks in the DTAG. entire DTAG, comparing the data read versus what Figure 3 shows the timing relationship of a system was written and checking the parity. A second write­ bus command to the update of the DTAG, including read-compare pass is made using the inverted data pat­ the sending of a system-to-CPU command to the tern. This inversion ensures that all DTAG data bits are microprocessor. The numbers along the top of the written and checked as both a l and a 0. In addition, diagram indicate the cycle numbering. In cycle 1, the second pass of the initialization leaves each block when the signal MC_CA_L goes low, the system bus of the DTAG marked as invalid (not present in the address is valid and is presented to the DTAG as the B-cache) and with good parity. The entire initializa­ DTAG_INDEX bits. By the end of cycle 2, the DTAG tion sequence takes approximately l millisecond per data is valid and is clocked into the VCTY where it is megabyte of cache and Finishes before the micro­ checked fo r good parity and a match with the upper processor completes its sclftest, avoiding special han­ received system bus address bits. In the event ofa hit, as dling by firmware. is the case in this example, the microprocessor intertace signal F.V_ ABUS_REQ is asserted in cycle 5 to indicate Logic Syn thesis that the VCTY will be driving the microprocessor com­ The VCTY ASIC was designed using the Verilog mand and address bus in the next cycle. In cycle 6, the Hardware Description Language (HDL). The use of address that was received from the system bus is driven HDL enabled the design team to begin behavioral to the microprocessor along with the SETSHARED simulations quickly to start the debug process. command. The microprocessor uses this command In parallel with this, the Vc rilog code was loaded and address to update the primary tag control bits t(x into the Synopsys Design Compiler, which synthe­ that block. In cycle 7, the control signals DTAG_OE_L sized the behavioral equations into a gate-level design. and DTAG_WE l_L arc asserted low to update the con­ The use ofHDL and the Design Compiler enabled the trol bits in the DTAG, thus indicating that the block is designers to maintain a single set of behavioral models now shared by another module. f(x the ASIC, without the need to manuallv enter SYSTEM BUS CYCLE NUMBER 2 3 4 5 6 7

MC_CA_L DTAG_INDEX<15:0> '-�M_C_:_:_A_D_D_R_<2_1_:6_>�- A_1_,c_..JL-A_AA__A _A_AAA_A_A_A_JA AAAA �-' · DOAG_ < ' MC ADDR<38:22> MC ADDR<3822> ::: :: --'--- �-�-�-��:·- --V-A-L 1-D---":._------. � ---:----- �-� � _ �____�______;_J DTAG_WE1_L ------DTAG_WEO_L ---,�"'--���\�\, --'-----�-�--�--�--�--+--�--�--�--

EV_ABUS_REQ --'------:------�--�--�--� MC_ADDR - - EV_ADDR<39:4> ____D_ RI_V_E�N_B_Y_M_IC_�R_PO__RO_C_E_S_S_O_R_ __ _.,�-· - ______--,. •� SETSHARED >DRIVEN BY EV_ :0 ----0 _ ___ - RO CMD<3 > -RIVENBYM-ICROPROCESSOR ���CESSOR

Figure 3 DTAG Operation

26 Digital Technical Journal Vo l. 8 No. 4 1996 schematics to represent the gate-level design. The syn­ I/0 pins). The initial phase of the synthesis process cal­ thesis process is shown in a flowchartfo rm in Figure 4. culates the timing constraints fo r internal nen:vorksthat Logic verification is an integral part of this process, connect between subblocks by invoking the Design and the flowchartdepi cts both the synthesis and verifi­ Compiler with a gross target cycle time of 100 nanosec­ cation, and their interaction . onds (actual cycle time ofthe AS IC is 15 nanoseconds). Only the synthesis is explained at this time. The ver­ At the completion of this phase, the process analyzes ificationprocess depicted on the right side of the flow­ all paths that traverse multiple hierarchical subblocks chart is covered in a later section ofthis paper. within the design to determine the percentage of time As shown on the left side of the flowchart, the logic spent in each block. The process then scales this data synthesis process consists of multiple phases, in which using the actual cycle time of 15 nanoseconds and the Design Compiler is invoked repeatedly on each assigns the timing constraints for internal networks at subblock of the design, feeding back the results fr om subblock boundaries. Multiple iterations may be the previous phase. The Synopsys Design Compiler required to ensure that each subblock is mapped to was supplied with timing, loading, and area constraints logic gates ,,rjth thebest timing optimization. to synthesize the VCTY into a physical design that met Once the Design Compiler completes the subblock technology and cycle-time requirements. Since the optimization phase, an industry-standard electronic ASIC is a small design compared to technology capa­ design interchange fo rmat (EDIF) fileis output. The bilities, the Design Compiler was run without an area EDIF file is postprocessed by the SPIDER tool to gen­ constraint to facilitate timing optimization. erate .filesthat are read into a timing analyzer, Topaz. A The process requires the designer to supply timing variety of industry-standard filefo rmats can be input constraints only to the periphery of the ASIC (i.e., the into SPIDER to process the data. Output filescan then

VERILOG SOURCE FILES I r I r � 100-NS CYCLE-TIME V2BDS GROSS SYNTHESIS 1 15-NS CYCLE-TIME SUBBLOCK FC PA RSE r OPTIMIZAT ION T FIX MINIMUM-DEL' AY DECSIM: COMPILE HOLD-TIME AND LINK VIOLATIONS � DESIGN COMPILER DECS IM SIMULATION OUTPUTS EDIF FILE RANDOM EXERCISER FOCUSED TESTS SYSTEM SIMULATION I FC ANALYZE WRITE NEW � TESTS SPIDER PROCESSES I I EDIF FILE � t � FC REPORT i

� DECSIM� TOPAZ TIMING GATE-LEVEL FIX TIMING VIOLATIONS ANALYZER SIMULAT ION AND/OR LOGIC BUGS I (NO FC) H I I FIX TIMING VIOLATIONS I I -I r Figure 4 AS IC Flow Design Synthesis and Ve rification

Vo l. 8 No. 4 1996 Digital T�dmical Journal 27 be gcncrJ.ted :md easily read by intcrnJ.l CAD tools tests quickly and efficiently without sacrificingflcx ibil such :.�s the DECSIM logic simulator :l!1d the Topaz itv and portability. It consisted of three parts: the test riming analvzcr. generator, the exercise,- code, :1nd the bus monitor. Topn uses information contJincd in the ASIC tech­ nology library to analyze the timing ofthe design as it Te st Generator This collection ofDECSIM commCuses on the use ofbehavior:�l model simulation. cuted b�· the microprocessors. Each routine performs It should also be noted that once the Design Compiler a unique task using one of the addresses supplied by had mapped the design to gates, SPIDER. was also the test generator. For CX

that em accomplish the task in a shorter rime. The ver­ the errors were handled properly, and another routine ification team developed two such tools: the R:llldom exercised lock-type instructions more heavily. Exerciser and the Functional Checker. They are The activity on the system bus generated by the described in detail in this section. crus was not Cllough to veriry the logic. Two addi­ tional system bus agents (models ofsystem bus devices) Random Exerciser simulating the 1/0 were needed to simulate a fu ll Ve rification strategv is crucial to the success of the system-le\·cl em·ironment. The 1/0 was modeled using

design. There arc t\\'o approaches to \'erificuion rest­ so-called commander models. These arc nor HDL or

ing, directed and random. Directed or t( >cuscd tests DECSIM behavioral models ofthe logic but arc \\'rirretl require short run rimes and target specific parts oftbe in a high-b-cl languagc, such as C. From the pcrspcc­ design. To fu llv rest a complex design using directed ti\·e of the CPU, the commander models beh�1ve .l ike tests requires a very large number of tests, which rake real logic and rherdcltT arc adequate for the purpose of a long rime to write and to run. Moreover, a directed verit)•ing the CPU module. There were several reasons rest strategy �1ssumes that the designer can hxcsce f()r using a commander model instead of a logic/

every possible system interaction and is able to write behavioral model. A complete 1/0 model was not yet a rest that will adequately exercise it. For these t-c:�sons, available when the CPU module design began. The random resting has become the pretCrrcd methodol­ commander model was Jn evolution ofa model used in ogy in modern logicdesign s.7 Directed rests were not a previous project, ;md it oHcrcd much needed flexibil­ completely abandoned, but they compose only :1 sm:�ll ity. It could be configured to act :JSei ther an 1/0 inter­

, portion of the test suite. bee or a CPU module :�nd was eJsily progr umnablc to ftmdom rests relv on a random sequence of events flood the system ous 1\'ith even more activity: memon· to create the ta iling conditions. The goal of the reads and writes; interrupts to the crus by randomh· Ra ndom Exerciser \\'aS to create a ri·JlllC\\'Ork that inserting stall cvcles in the pipeline; and assertion of­ would allow the verification team to create random S\'Stem bus sign:1lsat random times.

2S Digital Tcc hnicll )omnal Vo l. S No. 4 1996 Bus Monitor The bus monitor is a collection of pcrt(xmcd using a tool called V213DS. The parser's DEC:SIM simulation watches that monitor the system task was to postprocess a BDS file:ex tract inf(mlution bus and the CPU internal bus. The watches also report and generate a modifiedver sion of it. The intormation when various bus signals arc being asserted and extracted was a list of control signals and logic state­ deasserted and lu\'c the ability to halt simulation if ments (such as logical expressions, it�then-else state­ they cm:ounter uchc incoherency or a violation . ments, case statements, and loop constructs). This CJchc incohcn:ncv is a datJ inconsistency, t(J r exam­ int()rmation was later supplied to the analyzer. The ple, a piece of nondirtv data residing in the B-cachc modified BDS was fi.tnctionally equivalent to the origi­ :.Hld difkring ti·om data residing in main memory. nal code, but it contained some embedded calls to A data inconsistency can occur among the CPU mod­ routines whose task was to monitor the activity of the ules: t()r example, two CPU modu les may have difte r­ comrol signals in the context of the logic statements. cnt data in their caches at the same memory address. D:HJ inconsistencies are detected by the CPU. Each Analyzer Written in C, the analyzer is a collection of one maintains an exclusive (nonsharcd ) copy of its monitoring routines. Along with the modified BDS data that it uses compare with the data it reads ti·om ro code, the analyzer is compiled and linked to t(Jrm rhc the test Jddrcsscs. If the two copies diffe r, the CPU simulation model. During simulation, the analyzer signals to the bus monitor to stop the simulation and is invoked and the routines begin to monitor the acriv­ report an error. iry of the control signals. It keeps J record of all con­ The bus monitor also detects other violations: trol signals that fo rm a logic statement. For example, assu me the t()llowing statement was recognized bv the No activity on the system bus tor 1,000 consecutive 1. parser JS one to be monitored . cvcles XOR 2. Stalled system bus t(>r 100 c�'cles (A B) AND C 3. Illegal commands on the system bus and CPU The analyzer created a table of all possible combina­ internal bus tions of logic v::llues for A, B, and C; it then recorded Catastrophic system error (machine check) which ones were achieved. At the start of simulation, 4. there was zero coverage achieved. The combination of ra ndom CPU and 1/0 activity flooded the system bus with heavy traffic. With the ABC Achieved help of the bus monitor, this technique exposed bugs 000 No quickly. 001 No A� mentioned, a rc w directed tests were also written. oro No Directed tc�ts were u�cd to re -create a situation that OJ l No occurred in r:mdom tests. lt'J bug was uncovered using 100 No a random test that ran three days, a directed test was 10 1 No written rc-cn:atc the same failing scenario. Then, 110 No to 111 No alter the bug was fixed, a quick run of the directed test confirmed th�n the problem was indeed corrected . Achieved coverage = 0 percent

Functional Checker further assume that during one of the simulation During the initial design stages, the verification team tests gencrJted by the Random Exerciser, A assumed both 0 and logic states, \\'hile R and C remained con­ dc,-clopcd the Functional Cbecker (FC) fo r the J<.)l ­ I lowing purposes: stantly at 0. At the end of simulation, the state of the table wo uld be the following: • To fu nctionally vcrit\r the HDL models of all ASI C:s in the AlphaScrvcr 00 system ABC Achieved 4 I 000 Yes To

No. 4 1996 29 Digiral T.:.:hni,,d Journal Vol. I! Report Generator The report generator app!iGLtion industry. Virtually all component vendors will, on gathered all tables created by the analyzer and gener­ request, supply HSPICE models of their products. ated a report filein dicating which combinations were Problems detected by HSPICE were corrected either not achieved. The report file wasthen reviewed by the by layout modifications or by schematic changes. The verification teamand by the logic design team. module design process flow is depicted in figure 5. The report pointed out deficiencies in the verifica­ tion tests. The verification team created more tests Software To ols and Models that would increase the "yes" count in the "Achieved" Three internallydeveloped tools were of great value. column. For the example shown above, new tests One was MSPG, which was used to display the might be created that would make signals B and C HSPICF plots; another was MODULEX, which auto­ assume both 0 and l logic states. matically generated HSPICE subcircuits fr om PC The report also pointed out faults in the design, layout files and performed cross-talk calculations. such as redundant logic. In the example shown, the Cross-talk amplitude violations were reported by

logic that produces signal B might be the same rm plots directly from the simulation results, with­ eight possible achievable combinations, or 25 percent. out having to manually display each waveform on the For the verificationof the cached CPU module, the screen. A mass printing command was then used to fC tool achieved a finaltest coverage of95.3 percent. print all stored PostScript files. Another useful HSPICE statement was .MEASU R.E , Module Design Process which measured signal delays at the specifiedthreshold

levels and sent the results to a file. From this, a separate As the first step in the module design process, we used program extracted clean delay values and calculated the the Powerview schematic editor, part of the Viewlogic maximum and minimum delays, tabulating the results CAD tool suite, t()r schematic capture. An internally in a separate file. Reflections crossing the threshold developed tool, V2 LD, converted the schematic to a levels caused incorrect results to be reported by form that could be simulated by DECSIM. This process the .MEASURE statement, which were easily seen in was repeated until DECSIM ran \\�thout errors. the tabulation. \Ve then simply looked at the waveform During this time, the printed circuit (PC) layout of printout to see where the reflections were occurring. the module was proceeding independently, using the The layout engineer was then asked to modi�' those ALLEGRO CAD tools. The layout process was partly signals by changing the PC trace lengths to either the manual and partly automated with the CCT router, microprocessor or the transceiver. The modifiedsignals which was eftective in t()Jlowing the layout engineer's were then rcsimulated to verifYthe changes. design rules contained in the DO files. Each version of the completed layout was translated Timing Verifica tion to a format suitable fo r signal integrity model ing, OverJII cache timing was verified with the Timing using the internally developed tools ADSconvert and Designer timing analyzer from Chronology Corpor­ MODULEX. The MODULEX tool was used to extract ation. RdevJnt timing diagrams wen; drawn using a module's electrical parameters from its physical the waveform plotting facility, and delay values and description. Signal integrity modeling was performed controlling parameters such as the microprocessor with the HSPICE analog simulator. V•/e selected cycle interval, read speed, wave, and other constants HSPICE because of its universal acceptance by the were entered into the associated spreadsheet. All

30 Di�it.li T�d lllical )ourn.li Vol. ll No. 4 1996 DECSIM SIMUDIGITLAALTOR LOGIC TOVL2D DECSIM) (CONVERTS ,--- � SCHPOWERVIEWEMATIC EDIT OR - - � VIEWDRAW.NET - ANALYSIS

- ALLEGRO� "DO" FILES LAYOUT TOOL 1- - CONSTRAINTSRESTRICTIONS AND + .RTE

,--- ALLEGRO.BRD .DSN CCT ROUTER ..-- tADSCONVERT MODULEX t t VLS.ADSFOR MODULEX TOOL HSPICE TIMING DESIGNER COMPATIBILITY ANALOG SIMULATOR I---+ TIMING ANALYZER MDA FILES � FOR MANUFACTURING

Figure 5 Design Process flow

delays were expressed in terms ofHSPICE-simulated point-to-point path, but in high-speed designs, many values and those constants, as appropriate. This signals must be routed in more complicated patterns. method simplifiedchanging parameters to try various The most common pattern involves bringing a signal "what if" strategies. The timing analyzer would to a point where it branches out in several directions, instantly recalculate the delays and the resulting mar­ and each branch is connected to one or more gins and report all constraint violations. This tool was receivers. This method is referred to as treeing. also used to check timing elsewhere on the module, The SI design of this module was based on the outside ofth e cache area, and it provided a reasonable principle that component placement and proper sig­ level of confidence that the design did not contain any nal treeing are the two most important elements of timing violations. a good SI design. However, ideal component place­ ment is not always achievable due to overriding fa ctors Signa/Integrity other than SI. This section describes how successful In high-speed designs, where signal propagation times design was achieved in spite of less than ideal compo­ are a significant portion ofthe clock-to-clock interval, nent placement. reflections due to impedance mismatches can degrade the signal quality to such an extent that the system will Data Line Length Optimization fa il. For this reason, signal integrity (SI) analysis is an Most of the SI work was directed to optimizing the important part of the design process. Electrical con­ B-cache, which presented a difficult challenge because nections on a module can be made fo llowing a direct of long data paths. The placement of major module

Digital Technical )ourn:1l Vol . 8 No. 4 1996 31 data bus components (microprocessor and data trans­ The goal of data line d esign was to obtain clean sig­ ceivers) was dictated by the enclosure requirements nals at the receivers. Assuming that the microproces­ and the need to fit k>ur CPUs and eight memory mod­ sor, Ri\.Ms, and the transceiver are all located in-line ules into the system box. Rather than allowing the without branching, with the distance between the two microprocessor heat-sink height to dictate module RAMs ncar zero, and since the positions of the micro­ spacing, the system designers opted for fitting smaller processor and the transceivers are fixed,the only vari­ memory modules next to the crus, filling the space able is the location of the two RAMson the data line. that would have been left empty if module spacing As shown in the waveform plots of Figures 7 and 8, were unit(xm. As a consequence, the microprocessor the quality of the received signals is strongly affected and data transceivers had ro be placed on opposite by this variable. In Figure 7, the reflections arc so large

, ends of the module, which made the data bus exceed that they exceed threshold levels. By conrrJst the

II i nch es in length . Figure 6 shows the placement of reflections in Figure 8 are very small, and their wavc­

the major components. t(>rms show signs of cancellation. From this it CJil Each cache data line is connected to fo ur compo­ be inferred that optimum PC trace lengths cause the nents: the microprocessor chip, two RAMs, and the reflections to cancel. A range of acceptable RAM posi­ bus transceiver. As shown in Table l, any one of these tions was t()lmd through HSPICE simulation. The components can act as the driver, depending on the results ofthe se simulations are summarized in Table 2. transaction in progress. (THINDEXREE BUFFERS MORE ON (EIGHDATA RAMST MORE ON THE OTHER SIDE) �-----LTHE OTHER SIDE)

����0 r------, D ITAGl CLOCKE SSOR MICRO- o D � UITRY PROCESSOR D D D D D D / PROGRAMMABLE Dl� ASIC D OJ LOGIC jDD 11 D l.::lGLl D DATA TRANSCEIVERS DODDO DDDDDDDDD ;______-- --______

PROGRAMMABLE____ A DDRES NDCOMMAN_____D SYSTEM BUS LOGIC 0 TRANSCEIVERJ S \ CONNECTOR

Figure 6 P!Jccmcm of JV! Jjor Componems

Ta ble 1 Data Line Components

Tra nsaction Driver Receiver

Private cache read RAM Microprocessor Private cache write Microprocessor RAM Cache fill Transceiver RAM and microprocessor Cache miss with victim RAM Tra nsceiver Write block Microprocessor RAM and transceiver

32 Dif!:italTe chnical )ounul Vol. 8 No. 4 1996 In the series of simulations given in Table 2, the threshold levels were set at l.l and 1.8 volts. This was justified by the use of pertect transmission lines. The lines were lossless, had no vias, and were at the lowest impedance level theoretically possible on the module (55 ohms). The entries labeled SR in Table 2 indicate unacceptably large delays caused by signal reflections recrossing the threshold levels. Discarding these entries leaves only those with microprocessor-to­ -1.0 RAM distance of 3 or more inches and the RAM­ to-transceiver distance of at least 6 inches, with the total -2.0'----�-�-�--�-�-�--'------'-- microprocessor-to-transceiver distance not exceeding 40 45 50 55 60 65 70 75 80 NANOSECONDS ll inches. The layout was done within this range, and all data Jines were then simulated using the network subcircuits generated by MODULEX with threshold Figure 7 levels set at 0.8 and 2.0 volts. These subcircuits Private Cache Read Showing Large Reflections Due to Unfavorable Trace Length Ratios included the effect of vias and PC traces run on several signal planes. That simulation showed that all but 4.0 12 of the 144 data- and check-bit lines had good sig­ nal integrity and did not recross any threshold levels. The fa iling lines were recrossing the 0.8-volt thresh­ old at the transceiver. Increasing the length of the RAM -to-transceiver segment by 0.5 inches corrected this problem and kept signal delays within accept­ able limits. Approaches other than placing the components in-line were investigated but discarded. Extra signal -1.0 lengths require additional signal layers and increase -2.0 '-----'----'---�-----''----'-----'------'---"- the cost of the module and its thickness. 40 45 50 55 60 65 70 75 80 NANOSECONDS RAM Clock Design We selected Te xas Instruments' CDC2351 clock drivers to handle the RAM clock distribution network. The Figure 8 CDC235l device has a well-controlled input-to-output Private Cache Read Showing Reduced Reflections with delay (3.8 to 4.8 nanoseconds) and 10 drivers in each Optimized Trace Lengths package that are controlled fi·om one input. The fa irly

Ta ble 2 Acceptable RAM Positions Found with HSPICE Simulations

PC Trace Length Write Delay Read Delay (Inches) (Nanoseconds) (Nanoseconds)

Microprocessor RAM to Microprocessor RAM to RAM to to RAM Tra nsceiver to RAM Microprocessor Tra nsceiver

Rise Fall Rise Fall Rise Fall 2 7 0.7 2.3 0.9 SR 1.1 1.4 2 8 0.7 2.7 SR SR 1.5 1.4 2 9 0.6 3.1 SR SR 1.7 1.5 3 6 0.9 2.1 1.2 1.1 0.9 1.0 3 7 0.9 2.4 1.0 1.1 1.4 1.3 3 8 0.9 2.9 1.0 1.3 1.5 1.3 4 5 1.1 1.8 1.2 1.4 0.9 SR 4 6 1.3 2.2 1.4 1.4 0.9 1.0 4 7 1.2 2.6 1.3 1.4 1.2 1.2 5 4 1.5 1.7 1.5 1.7 SR SR 5 5 1.4 2.1 1.8 1.7 SR SR 5 6 1.6 2.4 1.7 1.4 0.9 1.2

Note: Signal reflections recrossing the threshold levels caused unacceptable delays; these entries were discarded.

Digit�! Technical journal Vo l. 8 No. 4 1996 33 long delay through the part was beneficial because, series-damping resistors in each cache data line, as as shown in Figure 2, clock delay is needed to achieve shown in Figure 10. Automatic component placement adequate setup rimes. Two CDC235l clock drivers, machines and availability of resistors in small pacbges mounted back to back on both sides of the PC board, made mounting 288 resistors on the module a painless were required to deliver clock signals to the 17 RAMs. task, and the payoff was huge: nearly perkct signals The RA Ms were divided into seven groups based on even in the presence of spurious data transitions their physical proximity. As shown in Figure 9, there caused by the microprocessor's architectural katurcs are fo ur groups of three, rwo groups of two, and a sin­ and RAM characteristics. Figure ll illustrates the han­ gle RAM. Each of the first six groups was dri\'en by dling ofsome of the more difficult wavdcm11s. two clock driver sections connected in parallel through resistors in series with each driver to achieve good load Performance Features sharing. The seventh group has only one load, and one CDC235 l section was sufficient to drive it. HSPICE This section discusses the perr()rnlance of the simulation showed that multiple drivers were needed AlphaServer 4100 system derived ti·o1n the physical to adequately drive the transmission line and the load. aspects of the CPU module design and the effects of The load conm:ctions were made by short equal the duplicate TAG store. branches oftewer than two inches each. The .length of the branches was critical tor achieving good signal Physical Aspects of the Design integrity at the RA Ms. As previously mentioned, the synchronous cache was chosen primarily tor perfo rmance reasons. The archi­ Data Line Damping tecture of the Alpha 21164 microprocessor is such th<1t In the ideal world, all signals switch only once per clock its data bus is used f()r transters to and from main mem­ interval, al lowing plenty of setup and hold time. In the ory (fills and writes) as wel l as its B-cache:' As system real world, however, narrow pulses often precede valid cycle times decrease, it becomes a challenge to manage data transitions. These tend to create multiple reAec­ memory transactions without requiri ng wait cycles tions superimposed on the edges of valid signals. The using asynchronous cache RAM devices. for example, reAcctions can recross the threshold levels and incre

3. The RA.Jv1s retrie\'e data. . 4. The RAl\! ls drive data to the bus intcrf1ce device.

5. The bus interface device req uires a setup time .

vVorst-case delay values t()r the above items might CLOCKDRIVER 1-----Y\�--, be the fo llowing: l. 2.6 nanoseconds'

2. 5.0 nanoseconds

3. 9.0 nanoseconds

4. 2.0 nanoseconds 30 OHMS 5. l.O nanoseconds CPU Total: 19.6 nanoseconds

Thus, tor system cycle times rhar arc significantly CLOCK shorter than 20 nanoseconds, it becomes impossible 0 RIVER 1-----YI�---'

Figure 10 Figure 9 RAJ'vl. Driving the Micropmccssor Jnci TI·J nsccivci· rhmugh

RAM Clock Distribution 10-ohm Series Resistors

34 DigitJI Tcc hnic1l journal Vol. 8 No. 4 l996 DATA LINE SCALE: 1.00 VOLT/D IVISION, OFFSET 2.000 VOLTS, INPUT DC 50 OHMS

TIME BASE SCALE: 10.0 NANOSECONDS/ DIVISION

Figure 11

Handling of DifficultWavd orms

to access the without using multiple cycles per Alpha 21164 microprocessor. In addition, it provides RAM read operation, and since the full transter involving an opportunity to speed up memory writes by the I/0 memory comprises fo ur of these operations, the bridge when they modif)r an amount of data that is penalty mounts considerably. Due to pipelining, the smaller than the cache block size of 64 bytes (partial synchronous cache enables this type of read operation block writes). to occur at a rate of one per system cycle, which is The AlphaServer 4100 I/0 subsystem consists of

15 nanoseconds in the AlphaServer 4100 system, a PC! mother board and a bridge. The PC! mother greatly increasing the bandwidth fo r data transfers to board accepts I/0 adapters such as network interfaces, and from memory. Since the synchronous is disk controllers, or video controllers. The bridge pro­ RAM a pipeline stage, rather than a delay element, the win­ vides the inter£1ce between PCI devices and between dow of valid data available to be captured at the bus the CPUs and system memory. The I/0 bridge reads interface is large. By driving the R.A!vlswith a delayed and writes memory in much the same way as the CPUs, copy of the system clock, delay components 1 and 2 but special extensions are built into the system bus pro­ are hidden, allowing tastercycling of the B-cache. tocol to handJe the requirements of the I/0 bridge. 'When an asynchronous cache communicates with Typically, writes by the f/0 bridge that are smaller the system bus, all data read out fi·om the cache must than the cache block size require a read-modifY-write be synchronized with the bus clock, which can add sequence on the system bus to merge the new data as many as two clock cycles to the transaction. The with data from main memory or a processor's cache. synchronous B-cache avoids this performance penalty The AJphaServer 4100 memory system typically trans­ by cycling at the same rate as the system bus.2 fe rs data in 64-byte blocks; however, it has the ability In addition, the choice of synchronous RAMs pro­ to accept writes to aligned 16-byte locations when the vides a strategic benefit;other microprocessor vendors I/0 bridge is sourcing the data. When such a partial are moving toward synchronous caches. For example, block write occurs, the processor module checks the numerous Intel Pentium microprocessor-based sys­ DTAG to determine if address bits in the Alpha the tems employ pipeline-burst, module-level caches using 21164 cache hierarchy. I fit misses, the partial write is synchronous RAM devices. The popularity of these permitted to complete unhindered. If there is a hit, systems has a large bearing on the industry.9 It is and the processor module contains the most recently RAM in DIGITAL's best interest to to llow the synchronous modified copy of the data, the l/0 bridge is alerted RAM trend of the industry, even tor Alpha-based to replay the partial write as a read -modifY-write systems, since the vendor base will be larger. These sequence. This fe ature enhances DMA write perfor­ vendors will also be likely to put their efforts into mance fo r transfers smaller than 64 bytes since most of improving the speeds and densities of the best-selling these references do nor hit in the processor cache.< synchronous products, which will fa cilitate RAM improving the cache performance in future variants of Conclusions the processor modules. The synchronous B-cache allows the CPU modules Effectof Duplicate Ta g Store (D TAG) to provide high performance with a simple architec­ As mentioned previously, the DTAG provides a mech­ ture, achieving the price and performance goals of anism to filter irrelevant bus transactions fr om the the AlphaServer 4100 system. The AlphaServer 4100

Digiral Technical Journal Vol. 8 No. 4 1996 35 CPU design team pioneered the use of synchronous 9. ] . Handy, "Synchronous SRAlvl Ro undup," Dataquest RAMs in an Alpha microprocessor- based system (September ll, 1995). design, and the knowledge gained in bringing a design fr om conception to volume shipment will benefit General Reference fu ture upgrades in the AlphaServer41 00 server fa mily, as well as products in other platf-orms. R. Sites, ed., Alpha Architecture R(fere1lce Manual (Burlingron, Mass.: Digit:� Press, 1992 ) . I Acknowledgments

The development of this processor module would not Biographies have been possible without the support of numerous individuals. Ri ck Hetherington pertormed early conceptual design and built the project team. Pete

Bannon implemented the synchronous RAM support fe atures in the CPU design. Ed Rozman championed the use of random testjng techniques. Norm Plante's

skill and patience in implementing the often tedious PC layout requirements contributed in no small mea­ sure to the project's success. Many others contributed to firmware design, system testing, and performance analysis, and their contributions are gratefully acknowledged. Special thanks must go to Darrel Donaldson fo r supporting this project throughout the Maurice B. Steinman entire development cycle. JV laurice Ste inman is a hardware principal engineer in the Server Product Development Group and was the leader of the design tc1m fo r rhe DIGITAL AlphaServer 4100 References CPU system. In previous projects, he was one ofthe designers of the AlpluServcr 8400 module and a designer of DIGITAL AlphaServer Family DIGITAL UNIX Perfor­ CPU 1. rhe cache conn·ol subsystem tor rhc 9000 com puter mance Flash (Maynard, Mass .: Digital Eq uipm ent VAX system. Maurice received a B.S. in computer and systems 1996 , Corporation, ) http:/jwww. europe.digital.com/ engineering ti·om Rensselaer Polytechnic I nsriturc in 1986. -9 info/ performance/ sys/ unix -svr-flash .abs.html. He was :�warded two patents related to cache control and coherence and has p:Henrs pending. Cveranovie and D. Donaldson, "AiphaServer 4 [\VO 2. Z. LOO Performance Characterization," IJi,t.;ital Te ch nical Jo urnal, val. 8, no. 4 ( L 996, this issue): 3-20.

3. G. Hcrdeg, "Design and Implementation of the AlphaServer 4100 and Memory Architecture," CPU D(f!,ital Te chnica! Joumal, vol. 8, no. 4 (I996, this issue): 48-60.

S. Duncan, C. Keefer, and T. McLaughlin, "High 4. Performance 1/0 Design in rhe AJphaServer 4100 Sy m­ metric Multiprocessing System," D(c;ital Te chnical Jo urnal, vol. 8, no. 4 ( 1996, this issue): 61-75.

5. "Microprocessor Report," MicroDesign Resources, val. no. 15 (1994 ). 8, George J. Harris 6. Perso nal Cnmputer Power Series Perj'or­ JIJM 800 George Harris was responsible fo r the signal integrity and n cache design ofrhe module in the AlphaServer 4100 ma ce (A.rmonk,N.Y.: International Business Machines CPU DIGITAL I 98 1 Corporation, 1995 ), http://ike.e ngr.washingron.edu/ series. He joined in and is a hardware prin­ news/whirep/ps-perf.hrml. cipal engineer in the Server Product Development Group. Iktore joining DIGITAL, he designed digital circuits at 7. L. Saunders and Y. Trivedi, "Testbench Tutorial," Inte­ the computer divisions ofHoneywell, and Ferranti . RCA, grated System Desig n, val. 7 (April and May 1995 ). He also designed computer-assisted medical monitoring svsrems using computers fo r the American Optical PDP-11 Semiconductor Through 8. [)!GJ'JAL 27 764 (366 i\'1 !-lz Division ofWarncrLambert. He received a master's degree Alpha Microprocessor J-Ja rdware 433 J\1/ Hz) in electronic communications fr om McGill Universi ty, Reje rence t\llanual (Hudson, Mass.: Digital Eq uipment J\il ontreal, Quebec, and was aw arded ten parents relati ng Corporation, 1996 ) . to computer-assisted medical monitoring and one patent related to work at DIGITAL in the area of circuit design.

36 Digital Tec hnical journal Vol. 4 1996 8 1 o. And rej Kocev Andrej Koccv joined DIGITAL in 1994 after receiving a B.S. in con1putcr science ti·om Rensselaer Polytcdmic I nstirute. He is a senior hardware engineer in the Sen·er Product Development Group and a n1emhn ofrhc CPL' l'crification team. He designed the logic 1crilicarion sol[­ " arc described in this paper.

Virginia C. Lamere

Virginia LmH.:n: is a hardll'are principal enginee r in the Scrn-r Product De,·clopment Group Jnd ll'as responsible

ttlr C:l'l m odu le design in the DIGITAl. AlphaSen-cr 4100 series. l;innv 11·Js J nu:mbe1· ofrhe verification n::1ms t(.>r rhe AlphaSenn 8400 and AlphaServer 2000 C PU mod· uks. Prior to those projects, she conrribured to rile design ofrhc floating-point processor on rhe VAX 8600 and the execution unit on the VAvX 9000 computer system. She n.:ccivcd '' B.S. in clccrrical engineeri ng and computer science t'rom Princeton Unive1·sity in 1981. Ginny was awarded two p:ucnts in the area of the C.\ecution unit

design and is J co-author of the paper "Floating Point Processor t(n· the VAX 8600" published in this jounw/.

Roger D. Pannell

Roger I\m nell was the l eader of the VCTY AS IC design

tc:1m t(Jr the Alph:1Sen•e1· 4100 svste m. He is :1 hardware princip:1l engi neer in the Server Product De,·elopmcnt (;roup. Roger Ius II'Orked on several projects since join­ ing l)igitJI in 1977. Most recent!\', he has been a module/ ASIC (ksignn on rhe AlphaSer�t:r 8400 and VAX 7000

1/0 port modules and ,1 bus-to-bus I/0 bridge. Roger ree<.:iH:d a B.S. in elccrmnic engineering r..:chnologv ti·om the University of Lowell.

Digital ·tcdmi ·;11 )ourn;11 Vol. R No. 4 1996 37 I Ro ger A. Dame

The Al pha Server 4100 low-cost Clock Distribution System

High-performance server systems generally Every digital computer system needs a clock distribu­ require expensive custom clock distribution tion system to synchronize electronic communication. systems to meet tight timing constraints. The primary metric used to quanti!)' the performance of a dock distribution svstem is clock skew. S\'llch­ These clock systems typically have expensive, ronous S\'Stems require multiple copies (outputs) of application-specific integrated circuits for the same clock, and clock skew is the unwanted delay the bus interface and require controlled etch of between any t\\'O the copies. In general, the Jo\\'er impedance for the clock distribution on each the skew, the better the clock svstcm. Clock skew is one module in the server system. The DIGITAL ofsever:d parameters that aftcct bus speed . Bus length, AlphaServer 4100 system utilizes phase-locked bus loading, driver and receiver technology, and bus signal voltage swing also affect bus speed . If problems loop circuits, clock treeing, and termination arise that jcop:mi izc meeting timing goals, though, techniques to provide a cost-effective, low­ these addirion:tl parameters arc difficult to change skew clock distribution system. This system because of ph,·siCll and architectural constraims. provides multiple co pies of the clock, which D rrAL AlphaSenn4100 distribution The rc; clock allows off-the-shelf components to be used system is a compact, low-cost solution fo r a high­ for the bus interface, which in turn results in performance midr::� ngc ser\'er. The clock system pro­ vides more copies of the clock than machines in the lower costs and a quicker system power-up. same class typically need. The d istri bu tion system Component placement and network com­ allows expansion on those module designs where pensation eliminated the need for controlled­ more copies of the clock are needed with minimal impedance circuit boards. The clock system h f ske\\'. The svsrem is based on a low-cost, oH�rhe-s el design makes it possible to upgrade servers phase- locked loop (PLL) as the basic building block. PLL with faster processor options and bus speeds The simple application of the alone ,,·ould not provide low clock skew, though. Signal integrity tech­ without changing components. niques and rrade-offs were needed to m�magc skew throughout the system. The technical challenges were

ro design J low-cost system that would ( l) require only a small area on the printed wiring bo::mis (PW"Bs),

( 2) be adap t�l ble ro ,·a rious speed grades (options) of Cl'Us, and (3) h�we good performance, i.e., lo\\' skew.

This paper discusses the techniques used to optimize the perti:Jrmance of an offthe-shelf PLL- based clock distribution system.

Design Goals

Based on irs experience with previous plartorm designs,

the design rea m considered a cloc k ske\\' under 10 per­ cent of the bus cycle rime a reasonable t:t rget tOr a midrange server S\'Stem. The cycle rime d esign target of rhe AlphaSenn 4100 system was lS nanoseconds (ns);

consequently, the skew goal was 1.5 ns or less. This goal would �1 llow a total of 13.5 ns h>r clock to out­ put of the rransmi rri ng module (Teo) (the time the

Vo l. No. 4 Digit,ll Tec hnic�! )ourn:d 8 I 996 transmitting module needs to dri\'e d:�ta to a stable benefitsof the oft�the-shelfsolution, it was paramount state ti·om a clock edge); setup and hold time require­ that we make the oft�the-shdfsolution work. ments tc lr the receiving module (the minimum time th:lt data needs to be stable at the recei,·er ( tl op] betore Bus Trade-offs and ati:er the local clock edge); and bus settling time. The fo llowing is a breakdown of the timing based on The design philosophy of using stock components tor the se lection of components f(lr the bus interbcc: the bus interface allowed some latitude in the bus design. Typical bus interfaces use large ASICs, each Bus cycle 15.0 ns handling up to 50 percent of the data bits. Such a Transmitting mod ule (Teo) 5.1 ns design results in a relatively long dispersion etch ti·om Setup and hold time tor the the connector to the ASIC. These devices can range receiving module 1.5 ns in size from 200 to 400 pins and can require up to Clock skew 1.5 JlS 38 mm of etch ti·om the ASIC to the connector. SPICE Time ::tllocated for bus settling 6.9 JlS simulations demonstrated that the length of each The selection of components was based on a,·ailabil­ module's dispersion etch or bus "stubbing" had a pro­ ity, speed, cost, and size. The goal was to eliminate the found effect on bus settling time.' Figure l shows bus need tclr costly appJication-specinc integrated circuits settling time (worst-case dri\'er-receiver combination) (ASICs) :md still meet the critical timing perf(mnance. as a fi.mction of module dispersion etch. The bus trunk

The AlphaSer\'er 4100 bus is a simple distributed length was nxed at 305 111 111 . bus, 305 millimeters (mm) long, with 10 loads (mod­ The designers used an 18-bit-wide transceiver in ules) and parallel termination at both ends. The fi rst­ a low-profile surbce mount package with a pin pitch order estimate of bus settling time assumed one fu ll of 0.5 111111. The location of the 1/0 pins tor the bus rdlection or twice the loaded velocity ofpropagation connections on the interrace transceiver (located on delay end to end. The estimate took into account bus the same side of the package, which allows the device timing optimization, which is discussed btcr in this to be placed very close to the bus connector) and the paper. It was also estimated that 25 copies of the clock connector pitch facilitated short dispersion etch (less wou ld be required tor the processor modules, and than 13 mm ). This design decreased by 1 ns the set­ 46 copies of the dock would be required tclr certain tling time typically t()lll)d on ASIC-based intert:lces memory modules (synchronous dynamic random­ with comparable trunk lengths and loading. access memory [SDRAM]-based designs). Only the Bus termination is another parameter that designers rising edge of the clock could be used fo r critical tim­ can manipulate to fu rther improve settling ti me. We ing. If the blling edge were used tclr latches, then used parallel terminators at both ends of the bus on the clock skew would dramatical ly increase because of the AlphaServer 4100 system. The bus protocol has rwo duty cycle distortion associated with PLLs. The mem­ katures that allow aggressive termination, approaching ory module design allowed very little space for clock the unloaded impedance of the trunk. 'vVe placed an circuitry and needed more copies of thc clock than any anticontention cycle between the module that relin­ other module design in the system. Further, the physi­ quishes the bus

Vo l. 8 No. 4 1996 39 a logic state dl!ling long id le times until another module CLOCK R L RECEIVER 1 wants to use the bus. Without this katurc, the bus ---INPUT 1 would settle at the tcrminZ�tor Thevcnin voltage if no I PHASE- L2 modules were driving the bus. Both protocols allow f( x LOCKED A LOOP Thevcnin voltage to be close to the thresholds of the FB A receivers. Normally this is avoided if the bus is lettidle, because the receivers can go metastable, i.e., arrive at L3 the unstable condition \\'here its input voltage is between its specified logic 0 and logic 1 voltage levels, resu lting in uncontrolled oscillation. Centering the Thevenin ,·ol tage in the normal fu ll voltage swing had KEY: two ad,·antages: ( l) it balanced the settling time t(x FB FEEDBACK LOOP INPUT FOR THE PHASE-LOCKED LOOP SERIES TERMINATING RESISTOR A both transitions, and (2) it reduced the driver current. L 1 , L2, ETCH LENGTHS The reduced driver current allowed t(x a lower AND L3 Thevenin resistance, which brought the tcrmi1utors closer to the unlo�1dcd (no modules) impedance of the Figure 2 bus, thus ensuring that the bus wo uld s<::ttle 1\'ithin 6 ns. Tvpic

The Basic Building Block Etch Layout The PWB lay-ups used on various modules in the Texas Instruments' CDC:586 dock distribution circuit AlphaServer 4100 system contain microstrip etch was chosen as the basic building block t()!· the system (surbce etch)

40 Vol. No. 4 !996 Digiral Tc chniol Journal R built-ill series terminators, the AlphaServcr 4100 design­ the PLL and a capacitor C tl·om the same power pins ers did not use this variation tor the tollowing reasons: to ground. The L-C fi lter can be implem<.:nted in two ways: Some tOnns ofclock treeing (a method of connect­ • ( 1) by using a surtace mount inductor and (2) by using ing multiple receivers to the same clock output) a length of etch f()r the inductor. In either case, the Q require multiple source terminators. of the circuit has to be kept low to prevent oscillation.

• The nominal value fo r the internal series terminator Q is a dimensionless number reterred to as the quality was not optimum fo r the target impedance of the fa ctor and is computed from the inductance L and PWBs. resistance R (in this case the inductor's resistance) of

= • The tolerance of the internal series terminators a resonant circuit using the formula Q wL/ R. where

over the process range of the part could be as high w equals 2'IT/: and/ is the frequency. A low-value resis­ as 20 percent compared to 1 percent fo r external tor in series with the inductor can help. Extreme care resistors. shou ld be taken if the lengtb-ot� etch (used to generate inductance) implementation is considered. The etch Local Power Decoupiing must be strip-.line-ctch isolated from any other adja­ PLLs arc analog components and are susceptible to cent etch or etch on other layers not separated by power supply noise. One major point source tor noise power or ground planes. A rwo-dimensional (2� D) is the PLL itself. Most applications require all 12 out­ modeling tool should be used to calculate the length puts to drive substantial loads, which generates local of etch needed to get the proper inductance value tor noise. A substantial number of local decoupling capac­ the filter. Simple rules of thumb tor inductance will itors (one tor every tour output pins) and short, wide not work with reference planes (i.e., power and dispersion etch on the power and ground pins of ground planes). the PLL w�::re required to help counter the noise. The R- C tl lter is limited to PLLs with moderately Designers also used tangential vias to minimize para­ low current draw on the analog power pins. The cur­ sitic inductance, which can severely reduce the t:ffec­ rent generates an IR drop (the voltage d rop caused by tiveness of the decoupling capacitors. Typical surface the current through the resistor) across the resistor R. mount components have dispersion etch, which con­ Typical PLL analog power inputs requir<.: kss than nects the surface pad to a via. Tangential vias attach 1 milliamp (mA), which would allow a reasonable directly to the pad and eliminate any surface etch that value resistor R. Two capacitors should be used in the can act like inductance at high frequency. The PLLs R-C type filter: a bulk capacitor fo r basic tl ltcr response were also located away from other potential no1se and a radio tl-cquency ( Rf' ) capacitor to filter higher sources such as the Alpha microprocessor chip. frequencies. Bulk capacitors are any electrolytic-style capacitor 1 microfarad ( J.LF) or greater. These capaci­ Analog Power Supply Filter tors have intrinsic parasitics that keep them fi·om The most important external circuit to the PLL is the responding to high-frequency noise. The benefit of low-pass filter on the analog power pins. Typically, PLL the L-C fi lter is that, although a single capacitor can b<.: designs have separate analog and digital power and used (r.vo arc still suggested with this style filter), the ground pins. This allows the usc of a low-pass filter to reactance of the inductor increases with ti·equency and prevent local switching noise from entering the analog helps block noise. Both tlltcr styles were us<.:d in the core of the PLL (primarily the voltage-controlled oscil­ Al phaScrvcr 4100 system. lator [VCO]). If a filter is not used, then large edge-to­ edge jitter will develop and will greatly increase clock System Distribution Description skew. Most PLL vendors suggest ti lter designs and PWB layout patterns to help reduce the noise entering The AlphaServer motherboard has fo ur CPU slots, the analog core. The CDC586 PLL was introduced at eight memory slots, and an I/0 bridge module slot. the beginning of the AlphaScrvcr 4100 design, and the Each module in the system, including the mother­ vendor had not yet specified a filter tor the analog board, has at least one PLL. 
The starting point of the power input. lt is important to note that if any new system is the CPU that plugs into CPU slot 0. Each PLL is considered and preliminary vendor specifica­ CPU module has an oscillator and a bufte r to drive the tions do not include details about the analog power, main system distribution, but the CPU that plugs into the dcsign<.:rshould contact the vendor tor details. slot 0 actually driv<.:s the system distribution. A PLL on Two torms of low-pass tl lters were considered: L-C the motherboard receives the clock source generated and R-C. The L-C filter consists of a series inductor L by the CPU in slot 0 and distributes low skew copies of trom the power source to the analog power pins of the clock to each module slot in the system. Each the PLL and a capacitor C from the same power pins module in the system has one and in some cases r.vo to ground. The R-C tl lter consists of a series resistor PLLs to supply the required copies of the clock locally. R trom the power source to the analog power pins of Figure 3 shows the basic system tlow of clocks.

Digital Tc dlnic;ll )omn.1l Vo l. 8 No. 4 1996 41 CONTROLMOTHERBOARD Skew Management Te chniques LOGIC MEMORY 7 The AlphaScrvcr 4100 system h ad t-( nJr design teams. Each team was assigned a portion of the system. Signal r � intcgriry techniques had to be developed to keep the MEMORYO skew across the system as low as possible. These tech­ niques were structured into a set of design rules that � each team had to applv to their portion ohhc design. To develop these rules, designers explored several MOTHERBOARD areas, including impedance rJngc, termination, tree­ ing, PLL pLlcemenr, and compensation. I CPU 3

Impedance Range PRIMARY CPU 0 DISTRIBUTION Controlled impedance ( +/- 10 perce nt from a target � impedance) r::�iscs the FWB cost bv percent to ,..... lO 20 percent,de pending on board size. Each raw PWJ3 has to be tested and documented lw the PWB sup­ � pliers, which results in a fixed charge t( >r each PWB, J regJrdlcss of size. Theref-ore, smaller PWBs have the highest cost burden. The AJphaScrvcr 4100 uses rela­ 1/0 tively small daughter cards. Since low system cost was � BRIDGE a primary goal, noncontrollcd impedance PWBs had to be considered. Unt()rtunatcly, allowing the PvVB impedance r;mgc (over process) to spread to greater than +/-10 percent makes the task of keeping clock Figure 3 System Clock flow Diagr::un skew low more difticult. Specification of mechanical dimensions with tolerances was the only wav to

provide some control of the impedance range with no additional costs. The Alpha microprocessor used on all CPU options Ta ble l comains the results of simulations per­ to r rhe AJphaScrvcr 4100 system has irs own local to rmed using SIMPEST, a 2-D modeling tool devel­ clock circuitry. The microprocessor uses a built-in oped by DIGITAL, fo r a six-layer PW B used on one of

digital PLL that allows it to lock to an exte rn a l rckr­ the AJphaServer 4100 modules. The PWB dimensions cncc clock at a multiple of its internal clock.' In the and tolerances specified to the vendors were used in context of the AJphaServcr 4100 system, the rekrcnce the simulations. The dielectric constant, the onlv para­ clock is generated by the local clock distribution sys­ meter nor specif-ied to the vendor, ranged fr om 3.8 to tem. The AJphaScrvn 4100 is fullysynchr onous. 5.2, which ovcrbps the rypical industry-published Each CPU in the system has two clock sources: range of 4.0 to 5.0 tor fR4-type material ( epm..)r-glass one fo r the bus distribution (system cycle rime) and PWB).'' Since our PWB material acceptance with the one fo r the microprocessor. This design may appear to vendor is based on meeting dimension tolerances, we be a costly one, but this :1pproach is extremely cost­ used the 6cr impedance range on all SPICE simula­ eftecrive when f-ield upgr.1des are considered . W he n tions, rhus ensuring that all acceptable PWB material AJpha microprocessor new, faster versions of the would work electrically.

become avaibblc, new CPU options will be intro­ Ta ble 2 shows the impedance range t-(>r J controlled duced. To remain svnchronous, the Alpha micro­ impedance PWB t-(>r the target impedance reported in processor internal clocks need to run at a multiple of the system cycle rime. Although the system cycle rime Ta ble 1 goal is 15 ns, the cycle ri me needs to be adjusted the ro Vendor Impedance Ranges Specifying s e e CPU p ed of th option used. Placing the bus oscilb­ Dimensions Only ror, which drives the primary PLL fo r the clock system (cycle rime), on the CPU module and designing the 4cr Yield 6a Yield

clock distribution system to fu nction over a wide t-i·e­ Mean target 71 ohms 71 ohms quency range makes field upgrades as simple as replac­ impedance i ng the CPU modules. The motherboard docs not Impedance 62 ohms to 57 ohms to need to be changed . range 83 ohms 89 ohms

42 Digital Technical journal Vol. 8 No. 4 1996 Ta ble 2 stressed. If the tests indicated stressed parts, designers Vendor Impedance Range for an Impedance would adjust the terminator value accordinglv. To lera nce of +/- 10 Percent

+I-10 Specification Range Tr eeing

Treeing is :-� method of distributing clocks fr om a Mean target 71 ohms single source driver to many receivers. This practice, Impedance which is well known to memory designers, was used Impedance range 64 ohms to 78 ohms on the AlphaServer 4100 memory modules, bus inter­ fa ce logic, and primary distribution clocks on the motherboard . The designers used two basic fo rms of Table I. The difkrence in impedance r;mgc between treeing: the bal:�nced H tr<.:e and the shared output specifYing dimensions and impedance is -7 ohms to tree. The balanced H tree is best suited to r fi xed loads ll ohms. The simulations suggested that the range (receivers) of the same type (i.e., memories, trans­ ditkrcnccs have a minor impact on signal beh;wior. ceivers, ere.). A single, series-terminated clock output The target impedance was based on nominal fe eds a trunk line to a via and then branches to each dimensions and dielectric constant. The target of load. Each branch is equal in length. The total com­ 71 ohn1s \\'aS chosen to optimize routing density and pensated path includes the pre-terminator stub, the to keep the l:�ycr count down ti:>r most designs. main trunk, and the branch ext<.:nding to the load. Another J.dv:tntage was that keeping the minimum Figure 4 illustrates the clock treeing topoiOb'Y The imped:�nce above 50 ohms would minimize loading. shared output tree was used where various module The impedance range covers th<.: fu ll mechanical configurations could altn clock load ing. Specitically, dimensions and dielectric constant ranges. Propt.:rly the distribution on the motherboard is restricted to impkmcntcd, the PLLs would dkctivcly eliminate one PLL to keep the clock skew low. Consequent ly, local etch delay module to module over the ti.d l some outputs needed to drive more than one slot. process rang<.: of the PWJ3s. The main chalknge was A single output driver drove two terminators-one to adequately terminate without sacriflcing skew to r each load. The low driver impedance isolated pert(>rmance at th<.: extreme process r:t nge ( 6u) of rd1ections tl·om perturbing a module when a module the PW B material. slot was leftop en.

Te rmination PLL Pla cement The designers used series termination 011 �111clo cks in Placement of the PLL on each module is critical . Figure the system. P<1 r;11ld terminators would have <.:xceeded 5 is a simplitied view of the primary distribution up to the drive capability of the CDC586. Diode damping and including the PLL on a module. The Al phaServer was not practical when so many copies of the clock were required because of PW B surbce area con­ straints. Normally, the optimal termination value is one that provides critical damping ti:Jr the case where the driver's impedance is the lmvest and the etch R impedance is the highest. Designers can then make adjustments :�t the other extreme corner (high driver impedance and low etch impedance ) to avoid non mo­ notonic behavior such as plateaus. This generally LOCKEDPHASE­ introd uces slope delay uncertainty at the slow corner LOOP MODULE (high driver impedance and low etch impedance), OUTPUTSHARED which c:�n be substantial. To minimize this cftcct, TREE designers selected terminator values th;lt allow over­ MODULE shoot and some bounce-away ti·om the threshold FB region at tiJe extreme process corner. Overshoot can reach the maximum specitied altern:�ting current (AC) input oF the receivers over the worst-case process range. Some receivers have built-in diode clamping to their power supply rails as a resu lt or· ESD circuits in KEY: their input structures (ESD circuits :�re used tc.>rstatic FB FEEDBACK LOOP INPUT FOR THE PHASE-LOCKED LOOP discharge protection). In these cases, the clock sign;ll is R SERIES TERMINATING RESISTOR clamped, which in turn dampens bounce. The injec­ tion currents c:�used by clamping would be tested in Figure 4 SPICE simulations to be sure that the parts were not Clock Treeing

Digital T(( hnical journal Vo l. 8 No. 4 1996 43 . 7 = 4100 system has two types of module connectors: ;d ( T{, + 0) - 7, 2. a Metra connector (Fururcbus+ -style connector) is = ! For 'f{ , 02 (equal etch lengths) , 'f,·d = 7f. 1 used on the CPU modules and the I/0 bridge module, Adding 7/,to the compcns:n ion path yields and an Extended Industrv Standard Architecture ( EISA) connector is used on the memory modules. 'l;'d = ( 7; , + 0) - ( lj2 + ·r; Intrinsic delay on these connectors could differ sub­ For 7{ = (etch e ual = 0 ns, , 7{2 q lengths), 7id stan tially depending on pinning and the signal-to­ where returnratio in the application. The Mcn·al connector is

a right-angle, nvo-picce connector with to ur rows of Tid = the inserti on delay f-rom the connector pins: rows A, B, C, and D. The row A pins arc the pin to the receiver input

shortest, and the row D pins arc the longest. The EISA = lj , the etch delay f-i·om the PLL output connector is an edge connector with nvo rows of pins to the receiver input

with minor length diffe rences pin to pin on either side 'fj 2 = the etch deLl\' of the PLL ofthe connector. Designers had to balance the pinning compensation loop

of these connectors fo r the clock circuits in such a way l/, = the dispersion etch delay connector that the module-to-module skew would not be to the clock-in of the PLL. atlectcd. The Metra! con nector was pinned to replicate One drawback to this method is that the etch lengths the loop i nductance of the EISA connector. could get birlv large , which would result in edge r:ne Dispersion etch is required on each module to con­ degradation . AlphaServer 4100 designers did not usc nect the PLL to the connector. This etch can have dif this placcmcnr method on the current set oh11od ulcs; krent i mpedance and velocity of propagation ti·mn however, they will consider using it on new designs that modu le to module as a result of P\VB process range , require diff-Crcnt location t( >r the PLL. which translates into addition::�! module-to-module :1 The second way of dealing wi th the dispersion etch clock skew. Designers can deal wi th this problem in ti·om the module connector to the clock-in pi n ohhc t\vo wavs. PLL is to make the dispersion etch very short and to First, adding the same dispersion length L, (sec take a skew penalty over the l)WB process. Placement Figure 5) to the compensation loop L2 nulls this error. studies on the various module designs suggest that This becomes obvious if you look at the PLL's basic a 25-mm dispersion etch would allow rclSOILlble 7 · fu nction . The insertion de!Jy d f-1-om the clock-in pin i placcmcllt of PLLs. The :1dditional skew is just under of the PLL to the input pin of the receiver is approxi­ 50 ps, based on a velocity of propagation range of mately 0 ns ifL1 = L2, or 5.59 ps/mm to 7.36 ps/m m. MOTHERBOARD DISPRIMARYTRIBUTION LOCALTYPICAL DIS MODULETRIBUTION DISPERSION R L, CLOCK IN ETCH L3\ FROM CPU 0 R PHASE- \CLOCK IN PHASE- TORECEIVERS LOOPLOCKED CONNECTOR LOOPLOCKED R FB R FB R

COMPENSATION KEY LOOPS FBR SERIEFEEDBSACK TERMI LOOPNATI INNGPUT RESIS FORTOR THE PHASE-LOCKED LOOP ANDL1, L2, L3 ETCH LENGTHS

Figure 5 PrinlJrl' Disn·iburion

44 Digital Tcchniul )ourn,li Vol. R No. 4 1996 Compensation Some modules have a wide variety of circuits receiving LIGHTLY LOADED clocks that, because of input loading, do not balance RECEIVER well with the various treeing methods. Designers used dummy capacitor loading to help balance the HEAVILY LOADED treeing. This approach was particularly useful on RECEIVER memory modules, which could be depopulated to provide different options using the same etch. Surface­ COMPEFB INPUTNSA (PLL)TION WITH LOOP mount pads were added to the etch such that if the NO CAPACITOR depopulated version were built, a capacitor could be added to replicate the missing load on the tree, thus COMPENSATION LOOP keeping it in balance. The CPU modules have a wide CAPFB INPUTACITOR (PLL) WITH variety of clock needs, which results in two fo rms of skew: ( 1) load-to-load sknv at the module and (2) control logic-to-CPU skew, to r control logic KEY: located on the motherboard . The local load-to­ T1 LIGHTLY LOADED RECEIVER CLOCK EDGE TIME load skew is acceptable because only one PLL is T (REFEREHEAVILY LOADEDNCE) RECEIVER CLOCK EDGE TIME used and the output-to-output skew is only 500 ps. T32 COMPENSATION LOOP FB INPUT EDGE TIME WITH Motherboard-to-CPU control logic skew, though, is CAPACITOR critical because of timing constraints. FB FEEDBACK LOOP INPUT FOR THE PHASE-LOCKED LOOP Dummy capacitor loading at each lightly loaded receiver would have reduced the skew, but to compen­ Figure 6 sate tor just one heavily loaded receiver would have feedback Loop Compensation required many capacitors. PWR surrace area and the req uirement of simplicity dictated the need tor an to relax the jitter specitication ti·om 25 ps to 70 ps alternative. The solution was to keep the clock edges RMS, and there were some diffi culties getting good as fast as possible (by adjusting the series terminators) load balance. The specitication did not change, how­ and to add a compensation capacitor at the input (the ever. Reassessing the allocated bus settling time yields fe edback [ Fl3 J) of the PLL's compensation loop. This the fo llowing: effectively reduced the skew from the slowest load on the CPU to the control logic on the motherboard . Bus cycle 15.0 ns Figure 6 shows the disparity between light and heavy Transmitting module (Teo) 5.1 ns loading from T1 to 72. Without teedback compensa­ Setup and hold time fo r the

tion, the PLL self-adjusts to the lightly loaded receiver. receiving module 1.5 115 Tbis ;�djustment results in skew T1 to 72 fr om the Clock skew 2.2 ns heavy load to the control logic on the motherboard . Time allocated fo r bus settling 6.2 ns A capacitor on the fB input of the PLL split the dif SPICE simulations tor a fu lly lo;�ded bus with the fe rcnce berween 73 ro 72 and T.1 to 7] ;�nd minimized worst possible driver receiver position yielded a bus the perceived skew. settling time of 5.7 ns. The relaxed skew of 2.2 ns maximum was acceptable to r the design. Skew Ta rget

Comparative Analysis Designers generated the worst-case module-to-module clock skew specification tor the AlphaServer 4100 A comparison of clock distribution systems between trom vendor specitications, SPlCE simulations, and two other platforms best summarizes the AlphaScrver bench tests using the techniques discussed in this 4100 ystem . The AlphaServer 4100 has price and paper. The worst-case skew goal is 2.2 ns and is sum­ s a performance target berween those of the AlphaServe marized in Ta ble 3. r 2100 and the AlphaServer 8400 systems. Table 4 com­ The reader wi ll note that eight times the vendor's pares the basic difrerences among these systems relat­ specification may appear to be rather conservative <1 ing to clock distribution to r a CPU module ti-om each specification. The usc of this value was based on two platform . concerns: ( 1) the PLL was new at the time, and experi­ Both the Al phaServer 2100 and the AlphaServer enc suggested that the vendor's specification was e 8400 systems have large custom AS!Cs f(x their mod­ aggressive; and (2) some level of padding w;�s required ule's bus interface. The AlphaServer 4100 and the if the exception to the rules was needed . Actual system AlphaServer 8400 systems have bus termination; the testing bore out these concerns. The vendor had AlphaServer 2100 system does not. Allowing a bus to

Digiul Technical Journal Vol. S No. 4 1996 45 Ta ble 3 Worst-case Clock Skew

Stage Source Skew Component

Motherboard Out-to-out skew 500 ps (vendor specification)2 Inputs to modules Load mismatch 100 ps (simulation/bench test) Module to module PLL process 1,000 ps (vendor specification)'- Inputs to receivers Load mismatch 200 ps (simulation/bench test) Inputs to receivers PLL jitter 400 ps (eight times the vendor specification)2

Total clock skew 2,200 ps = 2.2 ns

Ta ble 4 Clock Distribution Comparison of Three Platforms

AlphaServer 2100 System AlphaServer 4100 System Alpha Server 8400 System

Bus width 128 + ECC 128 + ECC 256 + ECC Bus speed 24 ns 15 ns 10 ns Clock skew 1.5 ns 2.2 ns (max.) 1.1 ns (max.) Inputs requiring clocks 10 25 14 Clock drivers used 12 13 11 Number of clock phases 4

settle naturally (with no termination), as in the case of Conclusions the AlphaServer 2100 system, req uires a tighter skew budget fr om the clock system. The trade-off is higher An etkctive, low-cost, high-pcrh>rmance clock distri­ cost, power, and PWJ3 area t()r lower bus speed. bution system can be lk signcd using an off. the-shclf Higher performance systems, such �lS the AlphaServer componcnt as the basic building block. DfGfTA L 8400 and AJphaServer 4100 systems, generally requirc AJ phaServer 4100 s�·stem dcsigncrs accomplished this bstcr bus speeds with terminators. The AJ phaServcr by optimizing the bus and den:loping simple tech­ 4100 has shorter bus stubbing (module transceiver to niqucs structured in the t()rm of dcsign rules. Thcsc connector dispersion etch) Jnd slower bus speed than rules arc the AlphaServer 8400, which allows larger skew (Js • Use positive edges t(x critical clocking. a percentJge of the bus spccd). Table 5 is a comparison of board areJ needed and • Match dcl ay through diftCrc nt connectors usmg cost fo r the clock system. Dcsigncrs analyzed an entry­ appropriate pinning.

levcl system consisting of one CPU module, one mem­ • Usc a fixeddi spersion ctch length fr om the connec­ ory module, and one 1/0 bridgc or interface module. tor to the PLI,. Thc board area shows the spacc required by the active • Ro ute and balance all dock nets on the same PWB components only (the digitJI phase-locked loops, laycr. PLLs, drivers, etc.). • Minimizc adjaccnt-laycr crossovcrs and maximize Both Tables 4 and 5 show that the clock system spacmgs. dcsign t(>r the AJphaScrvcr 4100 system requ ires only one-third the space of either thc Al p haServer 2100 • Use minimum valuc tcrminarors.

systcm or the AJphaServcr 8400 system at a fr action of • Usc tree and loop comrxns

Ta ble 5 Board Utilization and Cost Comparison

AlphaServer 2100 System Alpha Server 4100 System Alpha Server 8400 System

Board area used* 352.8 square centimeters 111.4 square centimeters 371.3 square centimeters Normalized cost 1.00 0.46 4.40

*Note that these measurements do not include decoupling capacitors and terminators.

46 Digital Technical Jourml Vo l. 8 No. 4 1996 The worst-case lab measurement of clock skew between any two modules in a rLd ly conhgured system was l.l ns, which is well within the 2.2 ns calculated mJximum skew.

Acknowledgments

Te rry Skrypek and Bruce Alford assisted with the prototyping and measurements. Cheryl Preston, Andy Koning, Steve Coe, George Harris, and Larrv Derenne worked with the designers to ensure compliance with the signal integrity rules. Darrel Donaldson, Don Smelser, Glenn Herdeg, Jnd Dan Wissel! provided invJiuablc technical guidance.

Note and References

I. S l' I C E is a general-purpose circuit simulc1tor program developed Lw Lawrence Nagel and Ellis Collen of tile \)epclrtmcm of Elecnical Engineering and Computer Sciences, University ofCalirornia at Berkeley.

2. CDC-Clock Distribution Circuits, Data .Book (Dallas, Tex .: Texas Instruments Incorporated, 1994).

3. Alpha 2! 164 /Vlicroprocessor 1-fctrdware Heference /vlmwal (Maynard, Mass.: Digital Equipmem Corpora­

tion, September 1994 ) .

4. C. Cuiks, f:"t ·etything Vo u l:"uer Wa nted to Know Ah(mt Laminates. . But \h're Aj i-aid to As/,o, 4th ed. (Maitland, fla.: Arion, Inc., January 1989).

Biography

Roger A. Dame A principal signal integrity engineer in the !VI idr

Digiul Technical Journal Vo l. 8 No. 4 1996 47 I Glenn A. Herdeg

Design and Implementation of the Alpha Server 41 00 CPU and Memory Architecture

The DIGITAL AlphaServer 4100 system is Digital The DIGI TAL AlphaSuvn 4100 s�'stem is a svrnmet­ Equipment Corporation's newest four-processor ric multiprocessing (SMP) midrange suver that sup­ ports up to four Alph:1 2 164 microprocessors. midrange server product. The server design is I A singk Alph<� 21 164 CPU chip may simultaneously based on the Alpha 21 164 CPU, DIGITAL's latest issue multiple extern:�! accesses to main memory. The 64-bit microprocessor, operating at speeds of Alph:1Servu4100 memory imuconncct was designed up to 400 megahertz and beyond. The memory to maximize this multiple-issue ti::ature of the Alpha architecture was designed to interconnect up 21 164 CPU chip :�nd to t:�kc :llh':111tageoh he perfor­ to four Alpha 21164 CPU chips and up to four mance benefits of the new bmily of memory chips 64-bit PCI bus bridges (the AlphaServer 4100 called synchronous d\'n:unic random·access memories (SDRAMs). To meet the best-in-industry latency <111d supports up to two buses) to as much as 8 giga­ b:mdwidth pertorm:1ncc goa ls, DIGITAL de,-eloped bytes of main memory. The performance goal :1 simple memory interconnect ,1rchitccturc th<�t com­ for the AlphaServer 4100 memory interconnect bines the existing Alpha 2!164 CPU memory inter­ was to deliver a four-multiprocessor server with race with the industry-standard SDRAM interrace . the lowest memory latency and highest mem­ Throughout this paper the term latency reters to the time required to return data ti·om the mcmorv chips ory bandwidth in the industry by the end of ro the CPU chips-the lo\\'er the late y, the better the June 1996. These goals were met by the time the nc put(>rmancc. The AlphaScr\'er 4100 svstcm achic,·cs AlphaServer 4100 system was introduced in May :1 mininlllm latencv of 120 nanoseconds (ns) tl-omrhc 1996. The memory interconnect design enables rime the address appc::lrS ar rbe pins of rhc Alplu 21164

the server system to achieve a minimum mem­ CPU ro the time the CPU internaltv receives tl1e corre­ ory latency of 120 nanoseconds and a maximum sponding data hom any address in m:1in memory. The memory bandwidth of 1 gigabyte per second by term ba ndwidth rdcrs to the ::tmount or' data, i.e., the number of bytes, transferred berwecn the memory using off-the-shelf data path and address com­ chips and the CPU chips per unit of rime-the higher ponents and programmable logic between the the bandwidth, the better the pcrt(mnance. The CPU and the main memory, which is based on AlphaServer 4100 delivers '1 nJ,lXimum memory band­ the new synchronous dynamic random-access width of gig<�byte per second (GB/s). l memory technology. Beh-c in troducing the DIGITALAlphaServer4lOO product in M:1y 1996, rhc development ream con­ ducted :m extensi,·e pcd(mllancc comparison of the top sen·crs in the industry. The bencbnurk tests showed that the AlphaServcr 4100 delivered the lowest memory latency :md rhc highest McC<�Ipin memory b:111dwidth of all the t'vVO- four-processor to systems in the industry. A companion p<�per in this issue of the ]o umol "AipluServer 4100 Pcr­ t(>nllJnce Characterization," contains the comparative int(mnation.1 This p:�perfo cuses on the '1 rchitecturc and design of the rhn:e core modules that \\'ere developed concur­ rently to optimize the ped(mn:�ncc of the entire

41l Di�;i tcll journal Vol. 8 No. 4 1996 Te chnical memory architecture. These three mod u les-the No-External-Cache Processor Module motherboard, the synchronous memory module, and the no-external-cache processor modulc-�H"C shown The no-external-cache processor module is a plug-in in Figure l. card with a 144-bit rncmor�r inred�1 ce that contains one Al pha 21164 CPU chip, eight 18-bit clocked data Motherboard transceivers, to ur 12-bit bidirectional address latches, and control provided by 5 - ns 28-pin PALs and The motherboard contains connectors t()r up to t(>ur 90-MHz 44-pin PLDs clocked at 66 MHz. The Alpha processor modu les, up to t(Jur memory module pairs, 21164 CPU chip is programmed to operate at a syn­ up to two 1/0 interrace modules (tcH1 r peripheral chronous memory intcrtacc cycle time of 66 M Hz component interconnect [PC!] bus bridge chips ( 15 ns) to match the speed ofthe SO RAM chips on the ror:�l), memory address multiplexers/drivers, :m d memory modules. Although there are no external logic t( >r memory control and arbitration.' All con­ cache random-access memory ( RAlvl ) chips on the trol logic on the motherboard is implemented using module, the Alphur processor latency to main memory low and by issuing multiple modules, one to t< >ur memory module pJirs (8-GB references trom the Alpha 21164 CPU tO main mem­ maximum memory), and one I/0 intcrbcc mod ule ory at the same time to increase memory bandwidth, (up to two PCI buses).' the pedormancc of many applications actually exceeds the pertormancc of a processor module with a third­ Synchronous Memory Module level external cache.' Numerous applications perform

better, however, with a large on-board cache. For this The synchronous memory modules arc custom­ reason, the Al phaScrver 4100 ofkrs several variants of designed, 72- bit-wide plug-in cJrds instJIIcd in plug-in compatible processor modules containing a pairs to co1Tr the fu ll width of the 144-bit memory 2-MB, 4-MB, or greater module-level cache. The paper data bus. Synchronous memory modules that provide "The AlphJScrvcr 4100 Cached Processor Module 32 megabytes (MJ)) to 256 MR per pair were designed Architecture and Design," which appears in this issue usmg 16- mcgabit (Mb) SDRAM chips. These ofthejourua/, contains more related information! memory modules contain nine, eighteen, thirty-six, The three components of the core module set were or seventy-two 100- MHz SDR.AM chips clocked at designed concurrently to address fiveissues: 66 MHz, t( >ur 18-bit clocked data rcmsccivcrs, address 1. Simple design bn-our bufkrs, and control provided by 5-ns 28-pin PA Ls. To increase the maximum amount of memor v 2. Quick design rime in the system, a ta mily of plug-in compati ble memory 3. Lowmemory latencv modules was designed, providing up to 2GB per pair 4. High memory bandwidth using 64 -Mb extended data our dynamic random­ 5. ReconfigurJbiliry access mcmorv (EDO DRAM) chips. These modules contain 72 or 144 EDO DRAM chips controlled by Simple Design two custom applic1tion -specitic integrated circuits (ASJC:s) providing data multiplexing and control, t(>ur The Alpha 21164 CPU chip is based on a reduced 18-bir clocked data transceivers, Jnd address bn-out instruction set computing (R.ISC) architecture, which buftl: rs. Consequently, the Al phaServcr 4100 memory has a small, simple set of instructions operating as tast architecture provides main memory capacities of as possible. AlphJScrvcr 4100 designers set the same 32 MB to 8GB with a minimum latency of 120 ns to goal of simplicity t()l· the rest of the server system. :111y address. This paper concentrates on the imple­ The AlphaScrvcr 4100 interconnect between rhc mentation of the synchronous memory modules, CPU and main memory was optimized tor the Alpha although the EDO memory modules arc fu nctionally 21164 chip and the SDRAJ\11 chip. To keep the design compatible. The recontigurabil ity description later in simple, only off the-shelf data path and address com­ this paper pro\·idcs more derails of the implementation ponents and rcprogrJmmable control logic devices of the EDO memory modules. were placed between the Alpha 21164 and SDRAM

Digital Tcchnic11 jounLli Vol. 8 No. 4 1996 49 MOTHERBOARD I (ALPHASERVERMEMORY PAIR 4 410- 320 MBONL TOY) 2GB I MEMORY(ALPHASE PAIRRVE R3 410-320 MBON LTOY) 2GB ..1 1---�------� r-�--��----MEMORY PAIR 2 -32 MB TO 2 GB �1 ------1 MEMORY PAIR 1 -32 MB TO 2GB � �--+------�� I FLOP 1 1 -D RAM_R�O W_A�D�D�R�ES�S� (I + __ _ DRAMS �1-----+�� DRAM COLUMN ADDRESS • c_ : ���1;�� ----1 �;��RoL-; __CON_ TR()L-<} ______1 _ �---AND------CENTRAL _ ...-1 :- __ : ARBITRATION

_ PROCESSORr--- CARD -1 ------: - - , ______CONTROL J :------ALPHA CMD R ..______- . 21CPU16 4 , MDD •ii�� L�A�T;C�tH l.------�----�----- DATA FLOP

PROCESSOR CARD 2 _ ------:, ______CONTROL :------ALPHA CM ADDR J 21164 I�:��D / ... ______LATCH _ CPU DAT�A�� �:.����1:� ======- I FLOP I == 1======:ll-_.l "'•2

PROCESSOR CARD 3 (ALPHASERVER 4100 ONLY) -· ,--- f- _ : . __ _ ...... -.: -coN-T�o� /I j / 44 ALPHA CM /ADDR en ,.______LATCH ::J 21164 . D

s a r------PCI sLoTs To PCI BRIDGE 21 ---'-"C..::.::C:.:.::..::.:..::::..::_I----• 11 1/0 MODULE 2 (ALPHASERVER --..!..:===:c.:.:�9 =---+---4000 ONLY)-----; ... PCI SLOTS TO 12 I1 PCI BRIDGE 3-,11+-.---.l-- -__:_--- +1 --· 1 I ------l-�-.. ,. :_P_:::C.:_:I :::S:.::LO:,cT_:::S__:1"3_:_-T"0---':_61-+-----+-� PCI BRIDGE 4 r� -- -· _-1·--__·_:_· __ _ 1

Note that the AlphaServer 4000 system contains the same inteliace as the AlphaServer 4100 supports half number ol processors and memory modulesCPU-to-memory and twice the ol PCI bridges. The AlphaServer mo o rd was des1gned at the same ti as the Alpha Server o her d but was but 4000the lherb a number4100 m l b ar not produced until alter the AlphaServer 4100 motherboard wasme ava1lable. o

Figure 1 4100 Memory nrerconnecr AlphaServer I

50 DigitJI Te chnical journal Vo l. No. 4 8 1996 chips. The designers removed excess logic and hard­ The data path is clocked at each stage by a copy of ware fe atures, minimized the "glue" logic between the a single-phase clock. The clock is provided by a low­ CPU chip and main memory, reduced memory laten­ skew clock distribution system built from the 52-pin cies as much as possible, and used custom ASlCs only CDC586 phase-locked loop clock driver.' The clock when necessary. cycle is controlled by an oscillator on the processor module and runs as fast as 66 MHz (15-ns minimum Data Path between the CPU and Memory cycle time) while delivering less than a 2-ns worst-case The externalin terface of the Alpha 21164 chip pro­ skew (i.e., the difte rence in the rising edge of me clock) vides 128 bits of data plus 16 bits of error-correcting between any tvvo components, including the Alpha code (ECC), thus enabling single-bit error correction 21164, SDRAMs, and any transceiver on any module. and multiple-bit error detection over the full width of Read transaction data is returned from the pins the data path, which is shown in Figure 2. These 144 of the SDRAMs to the pins of the Alpha 21164 in signals are connected to eight 18-bit bidirectional two dock cycles ( 30 ns ), as shown in Table l. The no­ transceivers on the processor module. As illustrated external-cache processor has no module-level data in Figure l, the motherboard connects up to to ur cache, so data is clocked directly into the Alpha 21164 processor modules and up to fo ur memory mod­ from the transceiver. In Table 1, read data that corre­ ule pairs. Each memory module contains 72 bits of sponds to transactions Rd l and Rd2 is returned fr om information; therdore, a pair of memory modules the same set of SDRAM chips in consecutive cycles. is required to provide the necessary 144 data sig­ Read data that corresponds to transaction Rd3 is nals. Each pair of memory modules contains eight returned from a diffe rent set of SDRAM chips with a additional 18-bit bidirectional transceivers that are one-cycle gap to allow the data path drivers fr om trans­ connected directly to a number of SDRAM chips. action Rd2 to be turned offbetore the data path drivers The data transceiver used on the processor module tor transaction Rd3 can be turned on. This process pre­ and on tl1e memory module is the 56-pin Philips vents tri-state overlap. As a result, consecutive read ALVC16260l in a 14-millimeter (mm )-long package transactions have address bus commands either fo ur or with 0.5-mm pitch pins. Error detection and correc­ fivecycles apart. Note that the Alpha 21164 data, com­ tion using tbe 16 ECC bits is pertormed inside the mand, and address signals are shown tor only one Alpha 21164 chip on all read transactions. Data path processor (CPU1), which issues transactjon Rd l. The errors are checked by the PCI bridge chips on all trans­ other transactions are issued by otl1er processors. actions, including read and write transactions between Write transaction data is also transferred from the each CPU and memory, and any errors are reported pins of the Alpha 21164 CPU to the pins of the to the operating system. SDRAMs in two clock cycles (see Table 2). Write data

MOTHERBOARD r------, r------, ,------L, NO-EXTERNAL-CACHE PROCESSOR I SYNCHRONOUS MEMORY MODULE (1 TO 4) 1 (1 TO 4 PA IRS) 72 SDRAMs ALPHA DATA AND ECC 21 164 FL -_.__��----__!_ ,L- /. OP I ---.., .J...... ,--. �E�� �ER CPU 144 144 72 PAIR) - : A ______I ~ B J L------j

Figure 2 Data Parb between rhe CPU and Memory

Ta ble 1 CPU Read Memory Data Timing

Cycle (15 ns) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Address Bus Command Rd 1 Rd2 Rd3 Rd4 SDRAM Data 1 1 1 1 2 2 2 2 3 3 3 3 Motherboard Data 1 1 1 1 2 2 2 2 3 3 3 CPU 1: Alpha 21 164 Data 1 1 1 1 CPU 1: Alpha 21 164 Command Rd 1 CPU 1: Alpha 21 164 Address Addr1

Oig;ital Tcd111ical Journal Vol . 8 No. 4 1996 51 Ta ble 2 CPU Write Memory Data Timing Cycle ( 15 ns) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1118 Address Bus Command Wr1 Wr2 Wr3 Wr4 SDRAM Data 1 1 1 1 2 2 2 2 3 3 3 3 4 Mot herboard Data 1 1 1 1 2 2 2 2 3 3 3 3 4 4 Alpha 21164 Data 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4

always incurs a one-cycle gap between transactions. Figu re 3. The motherboard latches the fu ll address

As a result, all but the first two consecutive write trans­ and d r ves ti rst the row and then the column portion actions have address bus commands t-ive cycles apart. of the iaddress the memorv modules. Each sy ch­ to n Since the Alph:�Scrver 4100 interconnect between ronous memory module bu fkrs the row col u m n the CPU and main memory was optimized t<>r the address and ta ns our a copy to e:tch ot" the/ SDRA.M SDRAM memory chip, the transaction timing, as ch ips using tclLir 24-bit bufkrs. Similar traditional ro shown in Ta bles and 2, was designed provide data dynamic ra ndom-access memory (DRAM) chips, l to in the correct cycles the SDRA.Ms without the need SDRAM chips usc the roll" address on their pins for ro to r custom AS !Cs to buffer the data between the access the page in their memor�· arr;ws �1 11d the column motherboard and SDRAM chips . This design works address that appears later on the s me pins read or a ro well t()[ an infinite stream of all reads or ;: tilwrites write the desired location within the page. Conse­

because of the SDRAM pi pc lined interface; however, quently, there is no need to provide the enrire 36-bit­ when a write transaction immediately fo llows a read wide address to tbe memory module.� . All address transaction, a gap or "bubble" must be inserted in the components used tor transceivers, btches, multi­ data stream ro account tor the t ct that read ta is plexers, and drivers on the no·exrernal-cKhe proces­ a Lb returned later in the rr:1nsaction than write data. As :1 sor module, rhc motherboard, and rhe synchronous result, every write transaction that irnmedi:ltelv t(>llows memorv module consist ofrbc 56-pin ALVC16260 or the ALV I 62260, which is the s:�mel1arr ll'ith internal a read transaction produces a five-cycle g:�p in the : command pipe l ine. Ta ble 3 shows the read write output resistors.C Address parity is checked by rhe PCJ

transaction timing. / bridge chips on all transact ons , :�nd :�ny errors arc reported the operari ng systemi . ro Address Path between the CPU and Memory The address path uses How-through latches tor the The Alpha 21164 provides 36 address signa ls (byte tl rst half of the address transfer (i.e., the row address) address <39:4>, i.e., bits 4 throug 39 ), 5 command fr om rhe Alpha 21164 ro the SDRAMs. When tile bits, and bit of parity protection. Theseh 42 signals are address appears ar rhe pins of rhe Alpha 21164, l connected directly to t(1ur 12-bit bidirectional latched the latched rranscei,·cr on the processor mod ule, the

transceivers on the processor module, as ill ustrated in multiplexed row address dri,·er 011 the motherboard,

Ta ble 3 CPU Read/Write Memory Data Timing

Cycle (15 ns) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Address Bus Command Rd1 Wr2 Wr3 SDRAM Data 1 1 1 1 2 2 2 2 3 3 Mot herboard Data 1 1 1 1 2 2 2 2 3 3 3

MOTHERBOARD

r------, r------, NO-EXTERNAL-CAC HE PROCESSOR SYNCHRONOUS MEMORY MODULE (1 TO 4) (1 TO 4 PA I RS)

SDRAM S DRAMs A ADDRESS ADD ESS TO ___R__ . ( 1 4 �---�- N --+- _ -- CH SETS PER CPU 42 PA IR) -�;-���� ��g� 1---T---t---+ � EJ BUFFER L. A

______------� � ]

Figure 3

Address Path between the CPU and .Memory

52 Dig;it;t[ Tcchnictl journal Vol. 8 No. 4 1996 and the fa n-out butlers on the memory modules are all and are driven directly and unmodified through the open and turned on, enabling the address information latched address transceivers on the processor module to propagate directly ti·om the Alpha 21164 pins to to become the motherboard command/address. Since the SDR.A.Nl pins in two cycles. The motherboard then the AlphaServer 4100 interconnect between the CPU switches the multiplexer and drives the column and main memory was optimized fix the Alpha 21164 address to the memory modules to complete the CPU chip, the Alpha 21164 externalCMD signals map transaction (see Table 4). Back-to-back memory trans­ directly into the 6-bit encoding of the memory inter­ actions are pipelined to deliver a new address to the connect command used on the motherboard, thus SDRAM chips every fo ur cycles. The fu ll memory avoiding the need fo r custom AS!Cs to manipulate the address is driven to the motherboard in two cycles commands between the CPU and motherboard. (cycles 0-l , 4-5, 8-9), whereas additional intonna­ Prudently chosen encodings of the Alpha 21164 tion about the corresponding transaction (which is external CMD signals resulted in only two command used only by the processor and the l/0 modules) bits (to determine a read or a write transaction) and fo llows in a third cycle (cycles 2, 6, 10). To avoid tri­ one address bit (to determine the memory bank) state overlap, the fo urth cycle is allocated as a dead being used by a 5-ns PAL on the processor module to cycle, which allows the address drivers of the current directly assert a Req uest signal to the motherboard to transaction to be turned offbdore the address drivers use the memory interconnect. Figure 4 shows the tor the next transaction can be turnedon (cycles 3, 7, control path between the CPU and memory. If the ll) . These to ur cycles constitute the address transfer central arbiter is ready to allow a new transaction by that is repeated every to ur or live cycles tor consecutive the processor module asserting a Request signal (i.e., if transactions. Note that the one-cycle gap inserted the memory interconnect is not in usc ) , then a 5-ns between transactions Rcl 3 and Rd4 fo r reasons indi­ PAL on the motherboard asserts th<.: control signal cated earlier in the read data timing description causes Row_CS to each of the memory modules in the tal­ the row address fo r transaction Rd4 to appear at the lowing cycle. At the same time, another 5-ns PAL on pins of the SDRAMs tor three cycles instead of two. the motherboard decodes 7 bits of th<.: address and drives the Sck 1:0> signal to all memory modules to Control Path between the CPU and Memory indicate which of the fo ur memory module pairs is The Alpha 21164 provides five command bits (tour being selected by the transaction. Each synchronous Alpha 21164 CMD signals plus the Alpha 21164 memory module uses another 5-ns PAL to immedi­ Victim_Pending signal) that indicate the operation ately send the corresponding chip select ( CS) signal to being requested by the Alpha 21164 external inter­ the requested SO RAM chips on one of the CS<3:0> f:lCe -" These live command bits arc included in the 42 signals when the Row_CS control signal is asserted if command/address (CA) signals indicated in Figure 3 selected by the value encoded on Sek l:(l>, as shown in Figure 4.

Ta ble 4 CPU Read Memory Address Timing

Cycle (15 ns) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Address Bus Command Rd1 Rd2 Rd3 Rd4 SDRAM Address Row Addr1 Col Addr1 Row Addr2 Col Addr2 Row Addr3 Col Addr3 . .. Row Addr4 Col Addr4 Motherboard Address Mem Addr1 lnfo1 Mem Addr2 lnfo2 Mem Addr3 lnfo3 .. , Mem Addr4 lnfo4 Alpha 21 164 Address Addr1 Addr2 Addr3 Addr4 AddrS

MOTHERBOARD

------� ADDRESS , ------­ 5-NS SYNCHRONOUS MEMORY rNO -EXTERNAL-CACHE PROCESSOR PA L (1 TO 4 PA IRS) MO DULE (1 T0 4) 77 SEL<1:0> CMO/ CS<3:0> SORAMs ALPHA ADDR 5-NS (1 T0 4 5-NS REQUEST n / 21 164 • • PA L 4 SETS PER CPU 73" PAL 5-NS ROW_CS PA IR) PAL A I A A I ------_J ------� A

Figure 4 Conrrol l'ath between rhc CPU and Memory

Dig;it;ll Tcdmical Journal Vol. S No. 4 1996 53 Table 5 shows the control signals between the bet\\'een processors. The third CA cycle occurs onlv processor modules, the memory modules, :md the one cycle after the node asserts the Request signal, cent r:d arbiter on the motherboard t(x multiple however, because of bus parking. Bus parking is an processor modules is uing single read tr:msactions. arbitration te�nu re that causes the central arbiter to The central arbiter receives one or more Request< II> assert the Gram sigrul to the last node to use the bus signals fi·om the processor module� and asserts a when the bus is idle u< )llowing cycle 7 of transaction unique Grant< l'l>signal to the processor mod ule that Rd2 ). Consequently, if the same processor wishes to currently owns the bus. The arbiter then drives a copy use tht bus again, the ::tssertion of CA and Row_CS of the CA signal to every processor module along with signals occurs two cycles e�1dier than it wou ld without

the identical Row_CS signal to every memory module the bus parking katurt . to mark cycle l of a new transaction. Note that the cyck: counter begins at cycle l with e:�ch new Data Tr ansfers between Two CPU Chips CA/Row_C:S assertion and may stall t<>r one or more (Dirty Read Data) 21164 cycles when g:�ps appear on the memon· illterconnect. The Alpha CPU chips contain internal 1\'J'ite­ Two trans::tctions may be pipelined �lt the s::tmc time. back caches. When a CPU writes to a block ofdat::t, the for simplicity of implementation in pro�r:lmmable modified data is IJcld loc::tlly in the write-lnck cache logic de,·ices, the cycle coumer of c:�ch transaction is until it is written back to main memorv at a bter rime. always exactly tour cycles from the other. The modi tied (dirty) copy ohhe block of d:-tta must T:1blc 6 shows a singJc processor module issuing be returned in place of the unmodified (stale) copy two consecutive read transactions (dual-issue) t< >l­ ti·om main memory when another CPU issues a n.:::td lowcd by a third read transaction at a later ti me. transaction on the memory i merconnect. The mem­ Normally, the node issuing the transaction on the bus ory modules return the stale dat::t at the normal time de::tsserrsthe Req uest signal in cycle 2. If a node con­ on the memory interconnect, and the dirty data is tinues to assert the Request sign:d, the centr:-tl arbiter returned by the processor module containing the continues to assert the Grant sign�1l to that node tO moditicd copy in the cycles that tollow. The processor :IIlow guaranteed back-to-back tr;ms::tctions to occur. module issuing the rc:td tr::tnsaction ignores the st::tlc Note th:-tt the tirst CA cycle occurs three cycles after data trorn memo1-v. the :-�sscrtionoftbe Request sigr1:11bee:� use ofthe delay Therer<)re, to m:�int:-tin cache coherencv bet\\'ctn within the central arbiter to switch the Gr:�nt signal the write-b:�ck caches contained in multiple Alpha

Ta ble 5 Multiple CPU Read Memory Control Timing

Cycle Counter 2 (3) 6 7 (1 5-ns cycle) 6 (7) 2 3

J 4 Request 1234 1234 24 24 24 24 3 3 3 3 4 4 4 4

Grant 2 2 2 2 3 3 3 3 4 4 4 4 4

CA, Row CS (New transaction) Address/Command Bus Addr/RdI X1 lnfo1I AddrIX/Rd2 l nfo2I Addr/Rd3 lnfo3l l Addr l/Rd4 X lnfo4 SDRAM CMD (RAS,CAS,WE} ACT 1 ReadJ 1 ACT 2 ReadJ 2 ACT 3 1 Read 3 ACT 4 Read 4

SDRAM CS l I X I X I X lx I X I I X X

Ta ble 6 Single CPU Read Memory Control Timing

Cycle Counter 1 2 3 4 5 6 7 - 2 (1 5-ns cycle) I I 1 2 3 4 5 6 7 - _I, Request 1 1 1 1 1 1 1 1 1 1 Grant 2 2 1 1 1 1 1 1 1 1 1 1 1 1 CA, Row_CS (New transaction) Address/Command Bus Addr/Rd1I X lnfo1 Addr/Rd2l X lnfo2 Addr/Rd3l x lnfo3 CPU1: Alpha 21164 Data 1 1 1 1 2 2 2 2 I l 1

Vo . 54 l H No. 4 19<)6 21164 CPU chips, each read transaction that appears board (arbiter and memory control) uses eight PALs on the memory interconnect causes a cache probe and three PLDs; and each synchronous memory mod­ (snoop) to occur at all other CPU chips to determine if ule uses three PALs. a moditied (dirty) copy of the requested data is found As shown in Table 1, the minimum memory read in one of the internal caches of another Alpha 21164 latency (read data access time) is eight cycles ( 120 ns) CPU chip. I fit is, then the appropriate processor mod­ ti·om the time a new command and address arrive at ule asserts the signal Dirty_Enable fo r a minimum the pins of the Alpha 21164 chip to the time the first of ti.ve cycles to allow the memory module to fi nish data arrives back at the pins. The SDR.Alv'ls are pro­ driving the old data. The processor module deasserts grammed for a burst of tour data cycles, so data is the signal when the dirty data has been fe tched fr om returned in tour consecutive I 5-ns cycles. Two trans­ one of the internal caches and is ready to be driven actions at a time are interleaved on the memory inter­ onto the motherboard data bus. Table 7 shows read connect (one to each of the two memory banks), data corresponding to transaction Rd1 being returned which allows data to be continuously driven in every tl·om CPU2 to CPU 1 five cycles later than the data bus cycle. This results in the maximum memory read ti-om memory, which is ignored by CPU 1. Note the bandwidth of l GB/s. one-cycle gap in cycles 10 and 15 to avoid tri-state overlap between the memory module and processor Trade-offs Made to Reduce Complexity module data path drivers. The Alpha 21164 external interf:1ce contains many As discussed earlier in this section, the AlphaServcr commands required exclusively to support an external 4100 system implements memory address decoding cache. By not including a module-level cache on the and memory control without using custom AS!Cs no-external-cache processor module, only Read, on the motherboard, synchronous memory, or no­ Write, and Fetch commands are generated by the external-cache processor modules. Using PALs allows Alpha 21164 external interface; the Lock, MB, the address decode function and the tim-out buffe ring SetDirty, WriteBiockLock, BCacheVictim, and to the large number of SDRAMs to be performed at ReadMissModSTC commands are not used."·7 This the same time, thus reducing the component count design allows the logic on the processor module that is and the access rime to main memory. All the necessary asserting the Request signal to the central arbiter to be glue logic between the Alpha 21164 CPU and the implemented simply in a small 28-pin PAL because SDRAJvls,including the central arbiter on the mother­ only rwo of the Alpha 21164 CMD signals are board, was implemented using 5-ns 28-pin program­ required to encode a Read or a Write command. mable PALs or 90-JV!Hz 44-pin ispLSI 1016 in-circuit Similarly, allowing a maximum of two memory banks reprogrammable PLDs produced by Lattice Semicon­ in the system, independent of the number of memory ductor. These devices can be reprogrammed directly modules installed, enables the Request logic to the on the module using the parallel port of a laptop per­ central arbiter to be implemented in the 28-pin PAL, sonal computer. Eacb no-external-cache processor since only one address bit (byte address <6>) is module uses t!ve PALs and four PLDs; the motl1er- required to determine the memory bank.

Ta ble 7 Dirty Read Data Timing

Cycle (15 ns) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Address Bus Command Rd1 Rd2 Rd3

SDRAM CS X X X X X X

SDRAM CMD (RAS,CAS,WE) AQ 1 Read 1 AQ2 ...... Read 2 AQ3 Read 3

SDRAM Data 1 1 1 1 2 2 2

Motherboard Data 1 1 1 1 Dirty1 Dirty1 Dirty1 Dirty1 2 2

CPU 1: Alpha 21164 Command Rd1 Rd3 Snp2 Rd5

CPU 1: Alpha 21164 Address Addr1 Addr3 Addr2 Addr5

CPU 1: Alpha 21164 Response Miss2

CPU1: Alpha 21164 Data (1) (1) (1) (1) Dirty1 Dirty1 Dirty1 Dirty1 CPU2: Alpha 21164 Command Rd2 Snp1 Rd4 Snp3 I � � CPU2: Alpha 21164 Address Addr2 Addr1 ddr4 ddr3 I CPU2: Alpha 21164 Response Dirty1

CPU2: Alpha 21164 Data Dirty1 Dirty1 Dirty1 Dirty1 2

Dirty_Enable Dirty Dirty Dirty Dirty Dirty

Digital Tt:c hnical )ounul Vo l. 8 No. 4 1996 55 To decode memorv addresses in 28-pin PA Ls,the 21 164 PAL code, and diagnostic sotiw:�re), \\'hich is AJ phaScr\'er 4100 system usts the concept of memory oti:en placed in re:td- onlv memories (ROMs) on the holes. The memory interconnect architecture and con­ processor module or motherboard, was moved to the sole code support se\·en difkrent sizes of me mory 1/0 subsystem. Only a smJ!l 8-K.B single-bit seri;:tl mod ules and up to t(Jur pairs oF memory modules per ROM (SROM) was placed on each processor module system tor a total system memory capacity of 32 MB to th�H would initialize the Alpha 21!64 chip on power­ 8 GB. Any mix of memory module pairs is supported as up and instruct the AlphJ 21164 to access the rest of long as the largest rnemorvp� lir is plactd in the Jowest­ the tirmwarc code from the 1/0 subsysrem. nurnhered memory slot. The physicJI memorv address Lmge fo r each of the t(ntr memory slots is assigned as Quick Design Time iF all tt>Ltr memory module p:1irs are the same size. Consequently, iF additional memorv pairs that arc t\\'O To prO\·idc stable CPU :m d memory h:miware tor 1/0 smJI !cr than the pair in the lmn:st-numbercd slot subsystem hardware debug ami operating system soli:­ arc installed in the upper memory slots, there will be a warc debug and thus allow the DIGITAL AJphaServn g�l p or "hole" in the phvsica l memory space between 4100 to introduced on sched ule in Mav 1996, the be the two smaller memory pairs (sec T:1blc 8). Ra ther core module set was designed and powered on in less th:tn req uire each memory mod ule to compare the Fu ll than six months. This prim�lry goal of the AlphaServer munory address to a base :t ddrcss and size register to 4!00 project was :t chieved by keeping the design tclm determine if it should respond to the memory transac­ small, by using only programm:1blc logic and existing tion, the 28-pin PAL driving Sek 1 :0> on the mother­ d:tta p:1th components, and by keeping the amount of board (sec Figure 4) uses the seven address bits docume ntation of design intcrt:Kes to a minimum. Addr<32:26> and the size of the memorv mod ule in The dtsign team tor the motherboJrd , no-external­ the lowest-numbered slot to encode the memory slot cache processor module, Jnd S\'nchronous memory number oF the selected memory module pair. Console mod ule consisted of one design engineer, one code detects any memory holes Jt power-up and tells schematic/layout assistant, one sigtd integritv engi ­ the operating systems th:n these arc unusabJe physic:�! neer, and two simulation engineers. The team also memory addresses. enlisted the hdp of members ofthe other AlphaServer Another simplitlcation that the AlphaScrver 4100 4100 design teams. system uses is to remove 1/0 space registers fr om the The Jrchitccture :�nd actual ti n:1l logic design of the data p:�th of the processor Jtlli memory modules. core module set were developed at the same time. Bv Because there are no custom AS !Cs on these mod uks, using pmgrammable Jogic :m d oft�thc-shelf address reading and \\Tiring control registers wo uld ha\·c �l nd data p:1th components, the logic \\\lS \\'rittcn in required :�d dition:�l data path components. 
Since all ABL code (:1langu� lgc used to tksc ribe the logic fu nc­ the error checking is pedormcd by either the 21164 tions of programmable de\·iccs) and compiled immt­ CPU chip or the PCJ bridge chips :�nd since there arc diatcly into the PALs and l'LDs while the architecture no address decoding control registers required on the was being specified. If the desired timctionality did not memory modules, there was no need tor more than tit into the programmable devices, the architecture a tl:w bits of control int(>rmation to be accessed by was moditied. until the logic did ti t. All three modules sothvJre on the processor or me mory modules. The were designed by the s�lllle engineer Jt the same time, bus (slow serial bus) �l lrcH.iy present in the l2C I/0 so there was no need t( >r interbcc speciticuions to be su bsystem was used tc> r tr:mskrring this small amount written tor each module. FurrlJnmorc, modifications of informJtion . and cn hJncements could be nLldc in parallel to eJch Furthermore, in the process of removing the 1/0 design to optimize pertormancc �md reduce complex­ sp:�ce d:�ta p:tth fr om the motherboard and processor it\' Jcross all three modules. mod ules, the ti rmw:�re (i.e., the console code, Alpha

Ta ble 8 Me mory Hole Example

Memory Slot 1 2-GB Module Pair 000000000 - 07FFFFFFF

Memory Slot 2 2-GB Module Pair 080000000 - OFFFFFFFF

Me mory Slot 3 1-GB Module Pair ·-- 100000000 - 13FFFFFFF Memory Hole 140000000 - 17FFFFFFF Memory Slot 4 1-GB Module Pair 180000000 - 1 BFFFFFFF Unused Memory 1FFFFFFFF 1 COOOOOOO -

56 Vc.>l. 1996 8 No. 4 Because the design did not incorporate any custom Many multiprocessor servers share a common ASICs, the core system was powered on as soon as the command/address bus by issuing a request to use the modules were built. Any last-minute logic changes bus in one cycle, by either waiting fo r a grant to be required to fix problems identified by simulation returnedfrom a central arbiter or performing local arbi­ could be made directly to the reprogrammablc logic n·ation in the next cycle, and by driving the command/ devices installed on the modules in the laboratory. In address on the bus in the cycle that fo llows. This particular, the reset and power sequencing logic on the seq uence occurs for all transactions, even when the motherboard was not even simulated betore power-on memory bus is not being used by other nodes. The and was developed directly on actual hardware. AlphaServer 4100 memory interconnect implements Since the I/0 subsystem was not available when the bus parking, which allows a module to turn on its core module set was first powered on, the software that address drivers even though it is not currently using ran on the core hardware was loaded fi·om the serial the bus. If the Al pha 21164 on that module initiates a port of a laptop personal computer and through the new transaction, the command/address flows directly Alpha 21164 serial port, and then written directly into to memory in t\vo less cycles than it would take to per­ main memory. Diagnostic programs that had been form a costly arbitration seq uence. Transaction Rd 3 in developed for simulation were loaded into the memory Table 6 shows an example of the dkcts of bus parking. of actual hardware and run to test a to ur- processor, fu lly loaded memory configuration. This testing enabled High Memory Bandwidth signal integrity fixes to be made on the hardware at f-ld l speed bet(>re the I/0 subsystem was available. When One of the most important fe atures of the SDRAM the l/0 su bsystem was powered on, the core module chip is that a single chip can provide or consume data set was operating bug fr ee at fi.dl speed, allowing the in every cycle fo r long burst lengths. The AlphaServer AlphaServer 4100 to ship in volume six months later. 4100 operates the SDRAMs with a burst length oftc1L1 r As mentioned in tbe section Simple Design, the cycles fo r both reads and writes. Each SDRAM chip central arbiter logic on the motherboard was imple­ contains t\.vo banks determined by Addr<6>, which mented in programmable logic. Conseq uently, by selects consecutive memory blocks. If accesses are quickly changing to the reprogrammable logic on the made to alternating banks, then a single SDRAM can motherboard instead of perf()l'ming a lengthy redesign continuously drive read data in every cycle. The arbi­ of a custom ASIC, designers were able to avoid several tration of the AlphaServer 4100 memory interconnect logic design bugs that were f(>umi later in the custom supports only t\vo memory banks, so the smallest ASICs of other AlphaServer 4100 processor and mem­ memory module, which consists of one set of ory modules. SDRAMs, can provide the same 1-G B/s maximum read bandwidth as a fu lly populated memory configu­ Low Memory Latency ration, i.e., a system configured with the minimum amount of memory can pertonn as well as a fu lly con­ Minimizing the access time of data being returned to figuredsystem. 
the CPU on a read transaction was a major design goal To increase the single-processor memory bandwidth, for the core module set. The core module set design was the arbitration allows two simultaneous read trans­ optimized to deliver the Addr and CS signals to the actions to be issued fi·om a single processor module. As SDRA.Ms in two cycles (30 ns) fi·om the pins of long as the arbitration memory bank restrictions and the Alpha 21164 CPU and to returnthe data fromthe arbitration tairness restrictions are obeyed, it is possible SD RAMs tothe Alpha 21164 pins in another two cycles to issue back-to-back read transactions to memory fr om ( 30 ns ). vVith the SO RA Ms operating at a two-cycle a single CPU with read data being returned to theAlpha internal row access and a t\.vo-cycle internal column 21 164 CPU in eigh t consecutive cycles instead of the access to the first data (60 ns total internal SDR.AM usual f(m r (see Ta bles I and 6). This dual-issue kature access ti me), the main memory latency is 120 ns. and the other low memory latency and high memory The low latency was accomplished in f(>Lir ways: bandwidth features of the AlphaServer 4100 architec­ ture enabled the AlphaServer 4100 system to meet the l. By removing custom ASICs and error checking best-in-industry pertonnance goals tor McCalpin mem­ from the data path bet\veen the pins of the Alpha ory bandwidth .' 21164 CPU chip and main memory As discussed in the section Simple Design and illus­ 2. By combining the SDRA.Jvl row/column address trated in Figure 3, to avoid tri-state overlap, whenever multiplexer with addr ss ta n-out buffering on the e read data is returned by a difkrent set of SDR.AMs motherboard (on the same memory module or on a difterent mem­ 3. By simpli�'ing the memory address decode and ory module), a dead cycle is placed bet\veen bursts memory interconnect request logic of ft) Ur data cycles to allow one driver to turn off 4. By using bus parking

Digital Tcdmic1l jounLll Vo l. 8 No. 4 1996 57 bcf(m� the next driver turnson. By keeping the lower­ Using this size chip allowed designns to build synchro ­ order address bits connected ro all SDRAMs, i.e., by nous memon' modules thar cont:1 in 9, 18, 36, and nor interleaving additional banks of memory ch ips on 72 SDRAMs and provide, respectively, 32 MB, 64 MB, low-order address bits, consecutive accesses to alter­ 128 1VI B, and 256 MB of main memory per pair. The nating memory banks such as large direct memory mt.:mory architecture supports synchronous memory access (DMA) sequences can potentially achieve the modules that contain up to 1GB of main memory pa fu ll 1-GB/s read bandwidth of the data bus. With the pair (up to 4 GB per system) by using the 64-Mb

dead cycle inserted, the read bandwidth of the mem­ SDRAi\tl s; however, when the AlphaSt.:rver 4100 sys­ ory interconnect is reduced by 20 percent. tem WJS introduced, the pricing :md availability of the The data bus connecting the processor, memory, 64-Mb SDRAM did not allow these larger capacit:vsvn ­ and 1/0 modules was implemented as a tradition:.JI chronous memory modules ro bt.: built. shared 3.3-volt tri-state bus with a single-phase syn­ At the same time the svnchronous memorv modules chronous clock at all modules. As a result, the bus were being designed, a ramilv of plug-in compatible becomes saturated as more processors are added and memory modules built with EDO DRAMs was bus traftic increases. To keep the design time as short designed and built. The memory architecture supports as possible, the AlphaServer 4100 designers chose nor 1--:DO memory modules containing up to 2 GB of main to explore the concept of a switched bus, on which memory per pair (up to 8 GB per system) by using the more than one private transkr may occur at a time 64-Mb EDO DRAM. When the AlphaServer 4100 sys­ between multiple pairs of nodes. Clearly, the tem w�1s introduced, rhe 64-Mb EDO DRAM was AlphaServcr 4100 system bas reached the practical available and EDO memory modult.:s containing 72 or upper limit of bus bandwidth using the traditional tri­ 144 EDO DRAMs were built providing 1GB and 2GB

state bus approach. of main memorv per pair. To round our the range of memorv capacities and to provide Jn altcrnati ,·e to the

Reconfigurability svnchronous memory moduks in case there was a cost

or design problem with the new 16-Mb SDRAM chips, The AlphaServer 4100 hardware modules were a rJ mily ofEDO memory modules was also built using designed to allow enhancements to be made in the 16-Mb and 4-Mb EDO DRAMs, p roviding 64 MB, fu ture without having to redesign every eleme nt in 256 M B, and 512 t\t!Bof main memory per pair. the system. Although EDO DRAMs can provide data at a higher b�1ndwidth than standard DRAMs, a singk EDO Motherboard Options DRAM cannot return cb ra in t( Hir consecutive 15-ns The AlphaServer 4100 motherboard contains t(J ur C\'clt.:s l i ke the single SDRAM used on the S\'nchronous dedicated processor slots, eight dedicated memorv memory modules. Therdi:>JT, J custom AS IC was used slots (tour memory pairs), and one slot ri:Jr :�n on the EDO memory module to Jccess 288 bits of 1/0 module with two PC! bus bridges. Designed at Lb ta every 30 ns fl-om the EDO DRAMs and multiplex tht.: same time but not produced until after rlw the d:tta onto the 144-bit memory interconnect every AlphaServer 4100 morhnboard was availabk, 15 ns. To imitate the two-bank tcature of a single rht.: AlphaServer 4000 morht.: rboard contains on ly two SDRAM, a second bank of EDO DRAMs is required . proct.:ssor slots, rou r nKmory slots (two memory Consequently, the minimum number of memory pairs), and slots tor rwo 1/0 moduks Jllowing ti.Jur chips per EDO memory moduk is 72 ri:> ur- bit-wide PCI bus bridges. Since module hardware veritication EDO DRAM chips, whereas the minimum number in rhc laboratory is a lengthy process, rhc AlphaServer of memon' chips per svnchronous memorv module 4000 mother board '' as designed ro usc the same l ogic is onlv 18 r(>ur-bit-wide SDRAM chips or as rew as as the AlphaServer 4100 except ri.)r rhe programmabk 9 eight-bit-wide SDRAM chips. arbitration logic, w hic h hJd a different algorithm When rhe AlphaServer 4100 systt.:mwas introduced, bec:wse of the extra I/0 module Wht.:n the signals on tht.: bstest EDO DRAM avJilabk that met the pricing the Al p haServer 4000 motherboard were routed , all requirements was the 60-ns vnsion . When this chip nets were kept shorrt.:rthan tht.:correspo nding nets on is ust.:d on the EDO memory module, data cannot the AlphaServer 4100 motherbo:trd so that every sig­ be returned to the motherboard as bst as data can bt.: nal did not need to be rt.:cxamincd. Only those signals returned tl-om the synchronous memory modules. To

that wert: uniquely ditkrcnt were su bject to tht.: fu ll support the 60-ns EDO DRA.Ms, a one-cvcle (15 ns)

signal integrity veritication process. increase in the access ti me to main mcmorv is required . Support fi: >r this extra n'clc ofbtcncv was designed into

Memory Options the memory interconnect Lw placing; a one-cvcle gap T·he synchronous memory modules available tc>r the between cycles 2 and 3 (st.:eTa ble I)o f anv read trans­ AlphaScrver 4100 arc all based on the 16-Mb SDRAM. Jction :1ccessing a 60-ns EDO nKmory module. Con­ sequently, the read memory Lltt.:ncy is one cycle longer

58 Di,_il.ll Tec hnical journal Vo l. 8 No. 4 !996 and the maximum read bandwidth is 20 percent less processor module is plug-in compatible, and systems when using EDO memory modules built with 60-ns can be upgraded without changing the motherboard. EDO DRAJ\1s. Note that it is possible to have a mixture This is true even if the ti·equency of the synchronous of EDO memory modules and synchronous memory memory interconnect changes, although all processor modules in the same system. In such a case, only the modules in the system must be configured to operate memory read transactions the 60-ns EDO memory at the same speed. The oscillators for both the high­ to module would result in a loss of performance. speed internal CPU clock and the memory intercon­ New versions of the EDO memory modules that nect bus clock are located on the processor modules contain 50-ns EDO DRAMs providing up to 8GB of to allow processor upgrades to be made without mod­ total system memory arc scheduled to be introduced ifYing the motherboard. within a year afterthe introduction of the AJphaServer 4100. These modules will not require the additional Summary cycle oflatency, and as a result they will have identical pertormance to the synchronous memory modules. The high-pertormance DIGITAL AlphaServer 4100 SMP server, which supports up to to ur AJpha 21164 Processor Options CPUs, was designed simply and quickly using offthe­ The no-external-cache processor module was designed shelf components and programmable logic. vVhen the to support either a 300-MHz Alpha 21164 CPU chip AlphaServer 4100 system was introduced in May with a 60-rVlHz (16.6-ns) synchronous memory inter­ 1996, the memory interconnect design enabled the connect or a 400-MHz AJpha 21164 CPU chip with server to achieve a minimum memory latency of a 66 MHz ( 1 5-ns) synchronous memory interconnect. 120 nanoseconds and a maximum memory band­ As previously mentioned, the Alpha 21164 itself width of l gigabyte per second. This industry-leading contains a primary 8-KH data cache, a primary 8-KB performance was achieved by using oH�the-shelf data instruction cache, and a second-level 96-KB three­ path and address components and programmable way set-associative data and instruction cache. The logic between the CPU and th e SDRAM-based main no-external-cache processor module contains no third­ memory. The motherboard , the synchronous memory level cache, but by keeping the latency to main mem­ module, and the no-external-cache processor module ory low and by issuing multiple references from the were developed concurrently to optimize the perfor­ same AJpha 21164 main memory at the same time to mance of the memory architecture. These core mod­ to increase memory bandwidth, the performance of ules were operating successfully within six months of many applications is better than that of a processor the starr of the design. The AJphaServer 4100 hard­ module containing a third-level external cache.' ware modules were designed to allow fu ture enhance­ Applications that are small enough to fitin a large ments without redesigning the system. third-level cache perform better with an external cache, however, so the AJphaServer 4100 offe rs several Acknowledgments variants of plug-in compatible processor modules con­ taining a 2-MB, 4-MB, or greater module-level cache. 
Bruce AJford from Revenue Systems Engineering In addition, cached processor mod ules are being assisted with the schematic entry, module layout, designed to support AJpha 21164 CPU chips that run manufacturing issues, and power-up logic design, and t:lSter than 400 MHz while still maintaining the maxi­ succeeded in smoothly transitioning the core module mum 66-MHz synchronous memory interconnect. set to his long-term engineering support organization. The architecture of the cached processor module Roger Dame hand led signal integrity and ti ming was developed in parallel with the core module set, analysis, while Dale Keck and Arina Finkelstein and several enhancements were made to the CPU and worked on simulation. Don Smelser and Darrel memory architecture to support the module-level Donaldson provided technical guidance and moral cache. See the companion paper "The AlphaServer support. 4100 Cached Processor Module Architecture and Design" t(>r more int(>rmation.' References and Notes Ve rsions of the AJ pha 21164 chip that operate Cveranovic and D. Donaldson, "AiphaServer 4100 at 400 MHz and faster require 2-vo lt power, while l. Z. slower versions of the Alpha 21164 req uire only Pc rtormance Characterization," Digital Tecb nical 3.3 volts. The AlphaServer 4100 motherboard does Jo urnal, vo l. 8, no. 4 ( 1996, this issue): 3-20. not provide 2 volts of power to the processor module 2. S. Duncan, C. Keefer, :111d T. McLaughlin, "High connectors; consequently, a 3.3-to-2-volt converter Pertormance 1/0 Design in rhe AlphaServer 4100 Sym­ card is used on the higher-speed processor modules merr.ic Multiprocessing System," Digital Te clmicctf provide this unique voltage. Each new version of Jo urnal, vo l. 8, no. 4 1996, this issue): 61-75. to (

Digital Tcdl !1ical journal Vo l. 8 1o. 1996 59 4 3. 'l'lll' .'\l['hc1Scr1·er 4000 wstL·m contc1ins the same CPU­ to-mcmot·y inrcrhcc �s rhe Alphc1Scrvcr 4100 s1·stcn1 bur suppmrs halfrhe number of prm:essors c1 11d nKmorv modules �nd rll'ice the number of PC! bridges . The . Alph�Sen·er 4000 motherbo�rd ,, ,1s designed at the same rime as the Al phaSenTr 4100 morhcrbocnd but

11 cts nor prod uced until after rile AlpluScrver 4100 mothCI'boclrd was available.

4. iVI . Steinman et <1 1., "The AlphaServer 4100 CJChcd Processor Mod ule Architecture and Design," D(�itaf 'l(,cbuicaf Jo urnal. vol. 8, no. 4 ( 1996, this issue): 21-37.

5. R. [);lme, "The AlphaSen er 4100 Low-cost Clock Dis­ tribution s,·srem," D��itof 'J'ecbnicol journ({f. vol. 8,

no . 4 ( 1996, r h is issue) 38-47.

6. Afpho 2 I 16'-i .\ticrop mcessor Hwdtiw·e Nej'e rence ,\ [{{1/IIC!f ( ."v!.n·nard, lviass.: Digital Fquipment Corpora­ tion, Order �o. EC-QAEQA- lT , September 1994 ).

7. The fnch command is not implementni on the

AlphaScn·cr 4100 system, bur tlJL:rc is no mechanism to

keep ir h-om appea ring on rbc CMD pins of the Alph c1 21164 CPI chip. The Ferch comn1and is simply te nni­ n;Jted without anv additional acrion.

Biography

Glenn A. Herdeg c_; lenn Herdcg has been II'Orking on the design ofcom­

putn mod uks since joining Digi ta l in l 9R 3 . A princip;ll

hclrdll';m: engineer in the A l phaScri'C.:r Pbt�(mn Del·elop­

ment p:roup, he 11 .1S the project leader, architect, logic designer, ;l lld module designer fo r the AlphaServer 4100

mothcrbo;ll·d, no o: ternal-ochc processor m od ules, ;l lld

s\'llchronous mcmot-�' modules. He al so led the design of the .\1 pluScrvcr 4000 motherbo:lrd . In earlier work, c_; lenn served as the principal ASIC .1 11d module designer tCJr se1·eral DEC 7000, VAX 7000, and VA X 6000 projects.

He holds a B.A. in physics ti'omC ol by Collep:e ;l nd <1 11 M.S. in computer systems ti-om RcnssL·IJcr Polytechnic Institute and ILlS two p;ltents. Glenn is currently involved in tl nTher

.\ lpl.1;1-b;1sed server system development.

60 Dit>-it.tl Te chnical )ounJ:JI Vol. X No. 4 I 996 I Samuel H. Dnncan Craig D. Keefer Thomas A. McLaughlin High Performance 1/0 Design in the AlphaServ er 4100 Symmetric Multiprocessing System

The DIGITAL AlphaServer 4100 symmetric multi­ The AlpbaServer 4100 is a symmetric multiprocess­ processing system is based on the Alpha 64-bit ing system based on the Alpha 21164 64-bit RJSC RISC microprocessor and is designed for fast microprocessor. This midrange system supports one to fo ur crus, one to tour 64-bit-widc peer bridges to CPU performance, low memory latency, and the peripheral component interconnect ( PCI ), and high memory and 1/0 bandwidth. The server's one to to ur logical memory slots. The goals fo r the 1/0 subsystem contributes to the achievement AlphaServer 4100 system were fa st CPU performance, of these goals by implementing several innova­ low memory latency, and high memory and I/0 tive design techniques, primarily in the system bandwidth. One measure of success in achieving these bus-to-PCI bus bridge. A partial cache line write goals is the AIM benchmark multiprocessor perfor­ mance results. The AJphaServer 4100 system was technique for small transactions reduces traffic audited at 3,337 peak jobs per minute, with a sus­ on the system bus and improves memory latency. tained number of3,018 user loads, and won the AIM A design for deadlock-free peer-to-peer transac­ Hot Iron price/performance award in October 1996.' tions across multiple 64-bit PCI bus bridges reduces The subject of this paper is the contribution of the system bus, PCI bus, and CPU utilization by as T/0 subsystem to these higb-pertonnance goals. In an much as 70 percent when measured in DIGITA L in-house test, 1/0 performance of an AJphaServer 4100 system based on a 300-mcgabertz ( MHz) AlphaServer 4100 MEMORY CHANNEL clusters. processor shows a 10 to 19 percent improvement in Prefetch logic and buffering supports very large I/0 when compared with a previous-generation bursts of data without stalls, yielding a system midrange Al pha system based on a 350-MHz proces­ that can amortizeoverhead and deliver perfor­ sor. Reduction in CPU utilization is particularly bene­ mance limited only by the PCI devices used in ficial fo r applications that usc small transfers, e.g., the system. transaction processing.

1/0 Subsystem Goals

The goal fo r the AlphaServer 4100 I/0 subsystem was to increase overall system performance by

• Reducing CPU and system bus utilization fo r all applications

• Delivering full I/0 bandwidth, specifically, a band­ width limited only by the PCI standard protocol, which is 266 megabytes per second (MB/s ) on 64-bit option cards and 133 MB/s on 32-bit option cards

• Minimizing latency t()r all direct memory access (DMA) and programmed I/0 (PIO) transactions

Our discussion t(x uses on several innovative techniques used in the design of the I/0 subsystem 64-bit-wide peer host bus bridges that dramatically red uce CPU and bus utilization and deliver full PCI bandwidth:

Tc dmic1l Journal Vol. No. 4 1996 61 Digital 8 • A p:�rrial cache line write technique fo r coherent application-specific intcgr:ncd circuit (ASIC) chips, DMA writes. This technique permits :�nr;o device one control chip, and t\\'O sliced data path chips. ro insert data that is smaller than a cache line or The two independent PCI bus bridges arc the inter­ block, into the cache-coherent domain without flrst bees between the system bus and their respective PC! obtaining ownership of the cache bJ ock and pcr­ buses. A PC! bus is 64 or 32 bits wide, transferring tc.Jrming a read-modit)r-write operation. Partial dat:� at a peak of266 MB/s or 133 MB/s, respectively. cache line writes reduce traffic on the svstem bus , In the AlphaServcr 4100 system, the PC! buses arc and improve latency, p

passed in a MEMORY CHANNEL cluster.' The PCT buses connect to �1 PC:! backplane module with a number of expansion slots and a bridge to the • Support to r device-initiated transactions that target other devices (peers) across multiple (peer) PC! Extended Industry Standard Architecture (EISA) bus. buses. Peer-to-peer transactions reduce svstem I n Figure I, each PC! b us is shown to support up to bus utilization, PC! bus utilization, and C PU uti­ r(J ur devices in option slots. The AlphaScrver 4000 series

nuncc PC! devices. When used in combination Ineffi cient use of system resources can limit perfor­ with the PCI delayed-read protocol, the buftering mance on heavilv loaded systems. Svstem designers

62 Digital Te chnical journal Vo l. 8 No. 4 1996 PCI BACKPLANE MODULE I STANDARD 1/0 PORTS 1.--t.,.----t.,.---t;------, ONE DEDICATED ���::S�6 N /# /# /#/# PCI AND THREE SLOTS SHARED PCI/EISA ifif if if SLOTS

------I -- I : PCI BRIDGE MODULE - MEMORY PCI BUS BRIDGE PCI BUS BRIDGE : :I I II I :I I :_-- ____ -}------}---- J COMMAND/ADDRESS ! SYSTEM BUS DATA AND ECC

CPU! CARD I I CPU! CARD I I CPU! CARD I I CPU! CARD

Figure 1 AlphaServer 4100 System with Four CPUs, Two 64-bit Buses

PCI BACKPLANE MODULE I STANDARD 1/0 PORTS 1...--t.------t;-----.,t;------, ����{s?�wg g g� i�ri�:��:A

:" ___ �-- l 8:--�------,1 -�l---t-� - j_ ------":'��I _),")I : !��"'_I_------I PCI BRIDGE MODULE 1 I I i I PCI BUS BRIDGE II PCI BUS BRIDGE I i MEMORY L---}------}---- ____ ] COMMAND/ADDRESS ! SYSTEM BUS DATA AND ECC " t :-----!------!------: ...... -:--t ____.:.._..., CPU CARD CPU CARD I I i I PCI BUS BRIDGE II PCI BUS BRIDGE I i I I : PCI BRIDGE MODULE I

------���������� ��� �����������-�------"" PCI 3 ! 1 , I I ! I T:i�f ������6w ggg �1g � � ������6,

Figure 2 AlphaServer4000 System with Two CPUs, Four 64-bit Buses

Digital Te chnical Journal Vo l. 8 No. 4 1996 63 memory to r all CPUs on the system bus . Th is mergi ng MmTnKnt of blocks of less than 64 lw tcs is of write data into the cache-coherent donlJin is typi­ important ro :1pp lieation performance because there cally done on the PC! bus bridge, which reads the are high -pcrti:mnance dc\'iees that move less thJn cache li ne, merges the new b�'tes, :�nd writes the cache 64 byres. One cx<�mple is DIGITAL's MLMORY

line b::�ck ou t to memory. The read-modir\'-wrin: must CHANNEl. Jdaptcr, which moves 32-byte blocks in �1 be pert<:>rmed as an atomic ope ration to m:lintain b u rst.2 As MEMORY CHANNEL adapters move l:1rge memory consistency. For the duration of tht: atom ic numbers of blocks that art: all Jess than a cache line of read -m odi �1-write operation , tht: system bus is busy. data, the 1/0 subsystem partia l cache line write tC;�rure Const:qut:ntly, a write of less than a cacht: lint:rt:s ults i mproves system bus utilization and eliminates the in a rcad - mod i �� -write that ta kt:s at least thrct: timt:s :�s system bus as a bottlenec k. Message latency across the m:�ny cycles on the system bus as a simple 64-byte­ tab ric of an Alph:1Servn 4100 tv! EMORY CHAN !\II-:!, alignt.:d cache lint: write. cluster ( version 1.0) is <1 pproxi matel\' 6 microseconds For example, if we bad used an urlit:r DIGITAL (fJ.s). Thnc art: two DMA writes in the message: the i mple mentation of a system bus protoco l on the first is a message, and tht:seco nd is a flag to va l ida te the A l phaSt:rver 4100 system, an 1/0 devict: operation message. Thest: DMA writes on the target A lphaSenn on the l)CJ that performed a singl e 16-bytt.:-aligned 4100 contri bu te to mcssJgc brency. The i mprme memory write would have consumt:d system bus ment in l atencv provided by tht: partial cache line 11ritc bandwidth that could have moved 256 bytt:so f data, tCature is approxi mately 0.5 11-s per write. With two or 16 ti mes the amount ofdat a . We tht:rdi:>rt: had to writes per message, l att: ney is red uced b�' approx i­ find a more efficient approach to writi ng subblocks matdy 15 percent over an AlphaServer 4100 system into tht:cache-c oherent domain. with the partia l cache line write tearurc. With version Wt: first examined opportunities ti:>r efficiencygains 1.5 of MEMORY CHANNEL adapters, net Luc nev in tht: memory system 3 Tht: AlphaServn 4100 mem­ will i m prove by �1 bout 3 fLS, and the etlect of pani;:d ory system interrace is 16 bytes wide; a 64-byte cache cache line writt:s will ::tpproach a 30 pcrct:nt i m prove ­ line read or write takes fo u r cycl es on the system bus. ment in message la tency. The memory mod ules themselves can be designed to In summar�', tht: chJIIcngc is to efficientlv mm·c a nnsk one or more of the writes and alloll' :1ligncd block of dat<1 of a common size (mu l tiple of 16 bvtes ) blocks that arc multiples of 16 byres to be ll'rittcn to that is smaller than a cache line into the cache-coherent memory in a si ngle system bus trans�lction. Tht: prob­ domain. vVithout anv t(u·ther imprm'emellt, the tech­

lem wi th permitting a Jess than compkte c:1che line nique reduces system bus u tilization bv as much as �1 write, i .e . , less tha n 64 bytes, is that the writt: goes to tacror of t<:>ur. This tcclmique allows su bblocks to be

main memorv, but the only up- to- date/complt:tc merged \\ · ithour incurring the overhead ofre:�d- modi�'­

copv of a cache l i ne may be in a CPU card 's cache. write, yet m;�inrains c:�che coherency. The on ly draw­ To permit the more efficient partial cache l i ne back to the technique is some increased complexi ty in wri te operati ons, we mod i fied th e system bus cache­ the CPU cache controller to support this modt: . We cohuency protocol . When a PCI bus bridge issues considered the alternative of adding a small cache to the a parti;�l CKhc line write on the system bus, c:.

the write is dirty. In the evt:nt that tht: target cache into a c::tcht:.This appro::tchadds significant complexitY block is dirtv, the CPU sign;�ls tht: PC! bus bridge and increases pcrtcm11ance onlv if transactions that tar­ bdi:>re rhe end of the partia l wrirt: . On dirty partial get the same cache line art: \UV close together i n time . eacht: lint: write transactions, the bridge simp!�, per­

ti: ml1S <1 second transaction as a read -modit\1 -write . If Peer-to-Peer Tra nsaction Support the t<1rgct cach e block is nor dirty, rht:o perJtion com­

pletes in a singk systt:m bus transaction. Systt:m bus and PC! bus u ri li z:1tion can be optimi zed Addrt:ss traces taken during prod uct development fc:>rcert ain applications by limiting the numbt:rof times

were sim ulated to determi ne the ti·equt:ncy of di rty the sanK block of dar:� moves throu gh the system. cache blocks that are targets of DMA writes. Our sim­ As noted in tht: section A lp haScrvcr 4100 Svstem ulations showed that, tor the add ress trace wt: used, Overview, the PCI subsystem can contain two or ti: H!r tl-cquency was extremely rare . Mt:asurcmcnr ta ken indepcndcllt PC ! bus bridges . Our design al.kl\l·s exter­

ti·om St:VeraJ appJiutiOJlS and benchmarks con fl rmed nal devices eonncctt:d to these separate peer PC! bus

that a di rty cache block is almost never asserted with bridges to sh�u-c data without accessing main mt:mor\'

a parri;� l cache line wri te. and bv using a min imal amount of host bus bandwidth. The DMA transft:r of blocks thJt arc aligned In other words, external dC\'iccs can efkct direct access

mu l tiples of 16 bytes but less t han :1 cache line is ti: >ur to data on 3 peer-to-peer basis. times more efficientin the 4100 svstem than in earlier DIGITAL i mpl ementations .

64 Vol .� No. 4 Jl)96 In conventional systems, a data file on a disk that is a CPU daughter card. Each PCI bridge to the system requested by a client node is transferred by DMA trom bus has a translation look-aside butle r (TLB) that con­ the disk, across the PC! and the system bus, and into verts PC! addresses into system bus addresses. The use main memory. Once the data is in main memory, a net­ of a TLB permits hardware to make all of physical work device can n.:ad the data di n.:ctly in memory and memory visi ble th rough a relatively small region of send it across the network to the client node. In a 4100 address space that we call a DMA window. system, device peer-to-peer transaction circumvents A DMA window can be specified as "direct the transter to main memory. However, peer-to-peer mapped" or " scatter-gather mapped." A direct­ transaction requires that the target device have certain mapped DMA window adds an offset to the PCI sca properties. Tbe essential property is that the device tar­ address and passes it on to the system bus. A tter­ get appear to the source device as if it is main memory. gather mapped DMA window uses the TLB to look up The balance of this section explains how conven­ the system bus address . tional DMA reads and writes are performed on the Figure 3 is an example of how PCI memory address

AlphaServer 4100 system, bow the in frastructure fo r space might be allocated tor DMA windows and tor conventional DMA can be used for peer-to-peer trans­ PCI device control status registers ( CSRs) and memory. actions, and how dead loci( avoidance is accomplished. A PCl device initiates a DMA write by driving an address on the bus. In Figu re 4, data from PCl devices Conventional DMA 0 and l are sent to the scatter-gather DMA windows; We e xtended the k

. Alp haSaver 4100 system to support peer-to-peer Di\1A wi ndow When an address hits in one of the

transaction. Conventional Di\1A in the 4100 system DMA windows, the PC! bus bridge acknowledges

. works as ta l lows the address and immediatel y begins to accept write Address space on the Alpha processor is 2 ��� or l tera­ data. While consuming write data in a bufter, the PC! byte; the AlphaServer 4100 system supports up to bus bridge translates the PCl address into a system 8 gigabytes ( GB) of main memory. To directly address address. The bridge then arbitrates tor the system bus all of memory ·without using memory management and, using the translated address, completes the write

hardware, an address must be 33 bits. (Eight GB is transaction. The write transaction completes on the equivalent to 2'' bytes.) PC! before it completes on the system bus . Because the amounr ofme mory is large compared to A DMA read transaction has a longer latency than address space available on the PCI, some sort of mem­ a DMA write because the PCI bus bridge must first ory management hardware and soft-ware is needed to translate the PC! address into a system bus address and make memory directly addressable by PC! devices. tCtch the data before completing the transaction. That Most PCI devices use 32-bit Dlvi.A addresses. To pro­ is to say, the read transaction completes on the system vide direct access fo r every PC! device to all of the sys­ bus before it can complete on the PCI . tem address space, the PC! bus bridge has memory Figure 5 shows the address path through the PC!

management hardware similar to that which is used on bus bridge . All DMA writes and reads are ordered

SYSTEM ADDRESS SPACE PCI MEMORY ADDRESS SPACE (240 BYTES) (232 BYTES)

8MB PCI DEVICE CSRs 8MB SCATTER-GATHER WINDOW 0 112MB PCI DEVICE CSRs I· 384 MB (UNUSED) 1GB 512 MB SCATTER-GATHER WINDOW 1 1-----jl , PCI DEVICE PREFETCHABLE 1GB MEMORY SPACE

1GB DIRECT-MAPPED WINDOW 2

1 GB SCATTER-GATHER WINDOW 3

T T

Figure 3 Es �mplc ofl'Cl Memory Address Space M�ppcd ro DMA Windows

Digiral Tcc" hnical journal Vol. S No. 4 1996 65 ....- .. - .. - - ...... ------

1------1-

11__.------.

Figure 4 Exam ple of PCI Dn·ice ReJds or Wr ites to DMA Windo"s Jnd Address Translation ro S1·stem Bus Addresses

SYSTEM BUS

�------t t ----- PCI BUS 1 BRIDGE I I I I I I 10 I I I I I I I I I I I I I I I -rr------64-BIT PCI

Figure 5 in l DiJgram of Data Paths J Si ng e PC! Bus Bridge

through the outgoing queue (OQ) en route to the sys­ Following is an example of how a conventional

tem bus. DMA read data is passed through an incom­ " bounce " DJ\!lA operation is used to move a filefro m a ing queue (IQ) bypass by way of a DMA filldata butfc r local storage device to a network device. The example en route to the PC!. illustrates how data is ll'rirren into memory by one Note that the IQ orders CPU-initiated PIO transac­ de1 icc ll'here it is temporarilv stored. Later the data is tions. The IQ bypass is neccssarv fo r correct, dead­ read by another DlVIA device . This operation is called lock-tree operation ofpeer-ro-pcer transactions, which a "bounce I/0" because the data "bounces" off are next section. explained in the

66 1996 Diginl Te chnical journal Vol. R No. 4 memory and out a network port, a common operation the PCI master: The device driver provides the master tor a network file server application. device with a target address, size of the transfer, and Assume PC! device A is a storage controller and PC! identification of data to be moved. In the case in which device B is a net\vork device: a data file is to be read from a disk, the device driver software gives the PC! device that controls the disk a l. The storage control ler, PC! device A, writes the file "handle," which is an identifierfo r the data fileand the into a buffe r on the PC! bus bridge using an PCI target address to which the fileshould be written. address that hits a DJ\1A window. To reiterate, in a conventional DMA transaction, the 2. The PCI bridge translates the PC! memory address target address is in one of the PCI bus bridge DMA into a system bus address and writes the data into windows. The DMA. window logic translates the memory. address into a main memory address on the system bus. 3. The CPU passes the net\vork device a PCI memory In a peer-to-peer transaction, the target address is

space address that corresponds to the system bus translated to an address assigned to another PCI device. address of the data in memory. Any PC! device capable of DMA can perform peer­ 4. The network controller, PC! device B, reads the file to-peer transactions on the AlphaServer 4100 system. in main memory using a DMA window and sends For example, in Figure 6, PCI device A can transfer the data across the nct\vork. data to or from PC! device B without using any resources or fa cilities in the system bus bridge. The use If both controllers are on the same PC! bus segment of a peer-to-peer transaction is controlled entirely by and if the storage controller (PC! device A) could soft\vare: The device driver passes a target address to write directly to the nctvvork controller ( PCI device PCI device A, and device A uses the address as the B), no traffic would be introduced on the system bus. DMA data source or destination. Traffic on the system bus is reduced by saving one If the target of the transaction is PCI device C, then DMA write, possibly one copy operation, and one system services software allocates a region in a scatter­ DMA read. On the PC! bus, traffic is also red uced gather map and specifies a translation that maps the because there is one transaction rather than two. scatter-gather-mapped address on PCI bus 0 to a sys­ When the target of a transaction is a device other than tem bus address that maps to PC! device C. This main memory, the transaction is called a peer-to-peer. address translation is placed in the scatter-gather map. Peer-to-peer transactions on a single-bus system arc When PC! device A initiates a transaction, the address simple, bordering on trivial; but deadlock-free support matches one of the DMA windows that has been ini­ on a system with multiple peer PCI buses is quite a bit tialized for scatter-gather. The PCI bus bridge accepts more difficult. the transaction, looks up the translation in the scatter­ This section has presented a high-level description gather map, and uses a system address that maps of how a device address is translated into PC! DMA through PCI bus bridge l to hit PC! device C. The a svstem bus address and data arc moved to or fr om ; transaction on the system bus is between the two PCI m in memory. In the next section, we show how the bridges, with no involvement by memory or CPUs. In same mechanism is used to support device peer-to­ this transaction, the system bus is utilized, but the data peer transactions and bow trafficis managed fo r dead­ is not stored in main memory. This eliminates the lock avoidance. intermediate steps and overhead associated with con­ ventional DMA, traditionally done by the "bounce" of A Peer-to-Peer Link Mechanism the data through main memory. For direct peer-to-peer transactions to work, the target The fe atures that allow software to make a device on device must behave as if it is main memory; that is, one PCI bus segment visible to a device on another are it must have a target address in pretetchable PCI mem­ all implicit in the scatter-gather mapping TLB. For ory space.' The PCI specification fu rther states that peer-to-peer transaction support, we extended the devices are not allowed to depend on completion of range of translated addresses to include memory space a transaction as master.' Two devices supported by on peer PC! buses. This allows address space on one the DIGITAL UNIX operating system meet these independent PC! 
bus segment to appear in a window criteria today with some restrictions; these arc the of address space on a second independent peer PC! MEMORY CHANNEL adapter noted earlier and bus segment. On the system bus, th e peer transaction the Prestoscrve NVRAM, a nonvolatile memory stor­ hits in the address space of the other PC! bridge. age device used as an accelerator for transaction

processing. The PNVRAM was part of the configura­ Deadlock Avoidance in Device Peer-to-Peer Tra nsactions tion in which the AIM benchmark results cited in the The definition of deadlock, as it is solved in this introduction were achieved. design, is the state in which no progress can be made Both conventional and peer-to-peer trans­ DMA on any transaction across a bridge because the queues actions work the same way trom the perspective of are fil led with transactions that will never complete.

DigitJI Tcc hnic1l journal Vo l. 8 No. 4 1996 67 CPU CPU CPU2 CPU MAIN 0 1 3 MEMORY I I I I I I I I I A COMMAND/ADDRESS t t t t t SYST EM BUS DATA AND ECC )" ______r BRIDGE --- -- r BRIDGE------0 : 1 1 t t I I I I I 10 10 I i J�POSTED PIO I WRITESPENDED BYP PIOASS I READS I I I I I I

� - � ---

---- �:>�C�, ::: :,:,�,- - -- �C:C�C�C �;- �:,:,�0- - PCI DEVICE E PCI DEVICE � PCI DEVICE G PCI DEVICE � J.._:ll:.._. F J.._:ll:.._. H

Figure 6 4100 AlpluScrvcr System DiagrJm Showing Dat� P.nhs through PC! Bus Ih idgcs

A deadlock situation is analogous to highway gridlock This section assumes th�t the reader is bmiliar with in which two lines of automobiles race each other on the PC! protocol and ordering rules ' a single-lane road; there is no room to pass and no way Figu re 6 shows the data paths through two PC:! to back up. Rules tor deadlock avoidance arc analo­ bus bridges. Transactions pass through these bridges go us to the rules fo r directing vehicle tr

t(>r deadlock. req uired to avoid deadlocks that m�1y occur during The design t(n dcJdlock-ti-eepeer-to-peer transaction device peer-to-peer transactions. All PC! ordering ru les support in the Alph:�Scrvcr410 0 system includes the arc satisfied ti·om the point ohicw of any single device in the system. The to llowing example dcmonstr:Hcs • Implemcnt

68 Digit;J[ Vol. 8 No. 4 1<)96 Tcchnic1l jound The configuration in the example is an AJphaServer shown in Figure 6. In the AlpbaServer 4100 deadlock­ 4100 system with fo ur CPUs and two PCI bus bridges. avoidance design, the IQ will always empty, which in Devices A and C are simple master-capable DMA turn allows the OQ to empty. controllers, and devices B and D are simple targets, Note that the IQ bypass logic implemented for deadlock avoidance on the AJphaServer 4100 system e.g., video RAMs, nerwork controllers, Pl'-TV RAJ.'vl,or any device with pretetchable memory as defined in the may appear to violate General Rule 5 tl-om the PC! PC! standard . specification,Append ix E: Example of device peer-to-peer write block comple­ A read transaction must push ahead of it through tion ofpended PIO read-return data: the bridge any posted writes originating on l. PCI device A initiates a peer-to-peer burst write the same side of the bridge and posted before the targeting PCI device D. read. Before the read transaction can complete on 2. Write data enters the OQ on bridge 0, filling three its originating bus, it must pull out of the bridge posted write buffers. any posted writes that originated on the opposite side and were posted before the read command 3. The target bridge, bridge 1, writes data from completes on tbe read-destination bus.' bridge 0. In fa ct, because of the characteristics of the CPUs 4. When the IQ on bridge l hits a threshold, it uses the system bus flow-control to bold off the and the flow-control mechanism on the system bus, all next write. rules are followed as observed fl· om any single CPU or PCI device in the system. Because reads that target 5. As each 64-byte block of write data is retired out a PCI address are always split into separate request and of the JQ on bridge 1, an additional 64-byte response transactions, the appropriate ordering rule (cache line size) write of data is allowed to move for this case is PCI Specification Delayed Transaction fr om the OQ on bridge 0 to the JQ on bridge l. Rule 7 in Section 3.3.3.3 of the PC! specification: 6. Ifthe OQ on bridge 0 is fu ll, bridge 0 will discon­ nect from the current PCI transaction and will Delayed Requests and Delayed Completions retry all transactions on PC! 0 until an OQ slot have no ordering requirements with respect to becomes available . themselves or each other. Only a Delayed Write Completion can pass a Posted Memory Write. A 7. PCI device C initiates a peer-to-peer burst write, Posted Memory Write must be given an oppor­ targeting PCI device B; the same scenario fo llows tunity to pass everything except another Posted as steps 1 through 6 above but in the opposite Memory Write.' direction. 8. CPU 0 posts a read of PCI memory space on PCI Also note that, as shown in Figure 6, the DMA fill device E. data buffe rs bypass the IQ, apparently violating General Rule 5. The purpose of General Rule 5 is to 9. CPU 1 posts a read ofPC! memory space on PCI provide a mechanism in a device on one side of a bridge device G. to ensure that all posted writes have completed. This 10. CPU 2 posts a read ofPCI memory space on PCI rule is required because interrupts on PC! are side­ device F. band signals that may bypass all posted data and signal 11. CPU 3 posts a read ofPCI memory space on PCI completion of a transaction before the transaction has device H. actually completed . In the AJphaServer 4100 system, 12. 
Deadlock: all writes to or from PCI devices are strictly ordered, and there is no side-band signal notit),inga PCI device -Both OQs are stalled waiting fo r the corre­ of an event. These system characteristics allow the PCI sponding IQ to complete an earlier posted write. bus bridge to permit DMA fill data (in PC! lexicon, tl1is -The design has two PIO read-return data (fill) could be a delayed-read completion, or read data in a buffers; each is fi.dl. connected transaction) to bypass posted memory -The PIO read-return data must stay behind the writes in the IQ. This bypass is necessary to limit PCI posted writes to satis f)' PCI -specified posted target latency on DMA read transactions. write buffer flushing rules. We have presented two IQ bypass paths in the -A third read is at the bottom of each IQ, and it AJphaServer 4100 design. We describe one IQ bypass cannot complete because there is no fill buffe r as a required fe ature fo r deadlock avoidance in peer­ available in which to put the data. to-peer transactions between devices on diffe rent buses. The second bypass is required fo r performance To avoid this deadlock, posted writes are allowed reasons and is discussed in the section JjO Bandwidth to bypass delayed (pended) reads in the IQ, as and Efficiency.

Journal 1996 69 Digit31 Te chnical Vol. 8 No.4 CPU CPU CPU CPU 0 I I 1 I I 2 I I 3

COMMAND/ADDRESS t SYSTEM BUS DATA AND ECC -- t - ---:::�- - ·r------. ------�------�------�------�--·' ------� - - - BRIDGE 1 I E 0 ' I -; � 1 ' "' ''-t - � ...... ''-� - I ; 10 I 10 / " , , I I I + I 1-----� 1 I DMAR EAD 1-----� 1 PR EFETCH I I AD DRESS 1-----� 1 I I t I ' 00 00 PEER WRITE PEER WRITE 1---:P:-::E:=E::-R--:W:::R--:IT=E-1 I PEER WRITE I I WRITE PEER PEER WRITE 1-----'-P-=E=E--'-R-'-W---R"-'IT-=E-1 I PEER WRITE DMA I PEER WRITE I PEER WRITE ""' 1 PEER WRITE FILL 1-----'-P-=E =E--'-R-'-w---R IT-=E - PEER WRITE I DATA I I PIO READ FILL PEER WRITE PEER WRITE I PIO READ FILL I PIO READ REQUEST I PIO READ REQUEST I I I I t I I / " I I I I DMA W RITE I I TERRUPTS I I OR R EAD t I I I I EJ I � I I I I t t t I I ______I ------1------PCI O PCI 1

PCI DEVICE A PCI DEVICE B PCI DEVICE C PCI DEVICE D

MASTER OF ..._... - TA RGET OF MASTER OF - � TARGET OF PEER WRITES PEER WRITE PEER WRITES PEER WRITE

PCI DEVICE E PCI DEVICE F PCI DEVICE G PCI DEVICE H TARGET OF - TARGET OF PIO TARGET OF - - TARGET OF PIO PIO READ READ REQUEST PIO READ READ

Figure 7 Block Di:�gra111 Showing Deadlock Case without IQ Bypass l\1th

Required Characteristics for Deadlock-free Peer-to-Peer 1/0 Bandwidth and Efficiency Ta rget Devices PC! devices must fo llow all PCI standard ordering With overall system performance as our goal, 1\'e rules for dead lock-free peer-to-peer transaction. The selected rwo design approaches to deJi,·er fu ll PC! specificrul e reln·ant to the AlphaServer 4100 design bandwidth without bus stalls. These were support to r fo r peer-to-peer transaction support is Delayed large bursts of PCI-de,·ice-initiated DMA, and suffi­ Transaction Rule 6, which guarantees that the IQ wi ll cient buffe ring and prefetching logic to keep up \\'ith always emprv: the PCI and a\'oid introducing stalls. vVe open this sec­ tion with a re,·iew of the bandwidth and latency issues A target must accept all memory writes 11-e examined in our efforts ro achie\'e greater band­ add ressed it while completing a request using to width efficiency. Delayed Transaction termination.' The bandwidth available on a plattorm is dependent Our design includes a link mechanism using scatter­ on rhe efficiency of the design and on rhe type of gat her TLBs to create a logical connection between two transactions performed . Bandwidth is measured in PC! devices. It includes a set of rules tor bypassing data millions of bytes per second (MB/s). On a 32-bit that ensures deadlock-tree operation when all partici­ l'Cl, the available bandwidth is effi.ciencv multiplied pants in a peer-to-peer transaction follow the ordering by 133 MB/s; on a 64-bit PCI, available bandwidth is rules in the PC! standard . The link mechanism provides efficiency multiplied by 266 MB/s. By efficiency, we a logical path to r peer-to-peer transactions and the mean the amount of rime spent actually rranskrring

bypassing rules guarantee the IQ will aJ,,·avs drain. data as compared with total transaction rime. The key kature, then, is a guarantee that the lQ will Both parties in a transaction contribute to efficiency al\\·ays drain, thusensuring deadlock-tYee operation. on the bus. The AlphaServer 4100 1/0 design keeps the o,·crhead introduced by the system to a minimum and supports large burst sizes O\'er which the per­ tr;.ll1sacrion m·erhead can be amortized .

Vol. 8 No. 4 1996 70 Support for Large Burst Sizes To predict the efficiency of a given design, one must break a transaction into its constituent parts. For exam­ 100% ple, when an ljO device initiates a transaction it must 90%

• Arbitratetor the bus PERCENT AVA ILABLE Connect to the bus (by driving the address of the CYCLES • 512 transaction target) SPENT 256 MOVING 128 DATA 64 • Transfer data (one or more bytes move in one or (EFFICI ENCY) 30% 32 16 more bus cycles) 20% DATA 8 4 CYCLES • Disconnect from the bus 2 IN A 12 BURST 16 20 Time actually spent in an I/0 transaction is the 24 28 OVERHEAD CYCLES sum of arbitration, connection, data transter, and (LATENCY PLUS STALLS) disconnection. The period of time before any data is transferred KEY: 90 %-100% 40% - 50% is typically called latency. With small burst sizes, band­ • • 80% -90% 30% -40% • D width is limited regardless of latency. Latency of 70% -80% 20% - 30% . • D arbitration, connection, and disconnection is f1irly 60% - 70% 10%-20% D 50% - 60% 0% -10% constant, but the amount of data moved per unit of • • time can increase by making the I/0 bus wider. The AJphaServer4100 PCI buses are 64 bits wide, yielding Figure S PC! Efficiencyas a Function of Burst Size and L-ttency (efficiency 266 MB/s) of available bandwidth. X As shown in Figure 8, efficiency improves as burst size increases and overhead (i.e., latency plus stall cycles. Peer-to-peer reads of devices on different bus time ) decreases. Overhead introduced by the segments are always converted to de layed-read trans­ AlphaServer 4100 is fa irly constant. As discussed ear­ actions because the best-case initial latency will be lier, a DMA write can complete on the PCI before it longer than 32 PCI cycles. completes on the system bus. As a consequence, we PCI initial latency tor DMA reads on the were able to keep overhead introduced by the plat­ AlphaServer 4100 system is commensurate with fo rm to a minimum for DMA writes. Recognizing that expectations tor current generation quad-processor efficiency improves with burst size, we used a queuing SMP systems. To maximize efficiency, we designed model of the system to predict how many posted write prefetching logic to stream data to a 64-bit PCI device buffers were needed to sustain D.MAwrite bursts with­ without stalls afterthe initial-latency penalty bas been out stalling the PCI bus. Based on a simulation model paid. To make sure the design could keep up with an of the configurations shown in Figures and 2, we l uninterrupted 64-bit DMA read, we used the queuing determined that three 64-byte butlers were sufficient model and analysis of the system bus protocol and to stream DMA writes ti·om the (266 MB/s) PCI bus decided that three cache-line-size pretetch bufters to the (I Gl3/s) system bus. would be sufficient. The algorithm to r pretetching Later in this paper, we present measured perfor­ uses the advanced PCI commands as hints to deter­ mance ofDMA write bandwidth that matches the sim­ mine how fa r memory data prefetching should stay ulation model results and, with large burst stzes, ahead of the PCI bus: actually exceeds 95 percent efficiency. • Memory Read (MR): Fetch a single 64-byte cache line. Prefetch Logic DMA writes complete on the PCI before they com­ • Memory Read Line (MRL): Fetch t\vo 64-byte plete on the system bus, but DMA reads must wait for cache lines.

data fe tched fr om memory or fr om a peer on another • Memory Read Multiple (MR.M): Fetch t\vo PCI. As such, latency for DMA reads is always worse 64-byte cache lines, and then ktch one line at than it is tor writes. PC! LocaL Bus Sp ec ification a time to keep the pipeline fu lL Revision provides a delayed -transaction mechanism 2. 1 After the PCI bus bridge responds to an M com­ to r devices with latencies that exceed the PCT initial­ RM mand by fe tching t\vo 64-byte cache lines and the sec­ latency requirement.' The initial-latency requirement ond line is returned, the bridge posts another read; as on host bus bridges is 32 PC:I cycles, which is the max­ the oldest bufter is unloaded, new reads arc posted, imum overhead that may be introduced betore the keeping one buffer ahead of the PCI. The third first data cycle. The AlphaServer 4100 initial latency pretetch buffer is reserved tor the case in which a DMA fo r memory DMA reads is bet\veen 18 and 20 PCI

Technical Journal Vol. 8 No. 4 l996 Digital 71 MRM completes while there arc still prdctch reads 300

outstanding. Reservation of this buffer accomplishes � 250 I- two things: (1) it eliminates a time-delay bubble that 0 would appear between consecutive DMA read trans­ u � 200 actions, and (2) it maintains a resource to fe tch a a: w scatter-gather translation in the event that the next � 150 transaction address is not in the TLB. Measured DMA w f- hi bandwidth is presented later in this paper. in 100 <{ The point at which the design stops prefetching is on (9 w 50 page boundaries. As the Dl'vlA window scatter-gather 2 map is partitioned into 8-K.B pages, the interface is 0 32 64 128 256 512 1024 2048 4096 designed to disconnect on 8-KB-aligned addresses. BURST SIZE (BYTES) The advantage of prefetching reads and absorbing KEY: posted writes on this system is that the burst size can 0 IDEAL PCI be as large as 8 KB. With large burst size, the overhead 0 MEMORY WRITE (MEASURED) of connecting and disconnecting fr om the bus is amortized and approaches a negligi ble penalty. Figure 9 Comp�1rison of Measured DMA Write Performance on an Ideal 64-bit PC! :�nd on an AlphaServer 4100 Svsrem DMA and PIO Performance Results

We have discussed the relationship between burst size, buffers. Simulation predicted rhat rhis number of initial latency, and bandwidth and described several bufkrs would be sufficient to sustain fu ll bandwidth techniques we used in the Alph:�Scrver 4] 00 PC! bus DNIA writes-even when the system bus is extremely bridge design w meet the goals fo r high-bandwidth busy-because the bridges to the PCI arc on a shared I/0. This section presents the perf-(xmance delivered svstem bus that has roughly l GB/s available band­ by the 4100 I/0 subsystem design, which has been ,�·idth. The PC! bus bridges arbitrate fo r the shared measured using a high-performance PC! tr:�nsaction system bus at a priority higher than the CPUs, but the generator. bridges arc permitted to execute onlv a single transac­ We coJ iected performance dat:t under the UNIX tion each rime rhcy win the system bus. Therefore, in operating system with a reconfigurablein terf:1 ce card the worst case, a PCI bus bridge will wait behind three developed at DIGITAL, called PCI Pamette. It is a other PC! bus bridges t(x a slor on rhe bus, and each 64-bit PCI option with a Xilinx FPGA interface to bridge will have at least one quarter of the ;w ailable PCI. The board was configured as a programmable svstem bus bandwidth. With 250 MB/s available but PCI transaction generator. In this configuration, the \�•ith potential delay in accessing the bus, three posted board can generate burst lengths of l ro 512 cyc les. write buffers are sufficientto maintain fu ll PCI band­ DMA either runs to a fixed count of words transferred width t

zero wait states. DMA Read Efficiency and Performance As noted in the section Prefctch Logic, bandwidth DMA Write Efficiency and Performance pertemllJilCC ofDMA reads will be lower than the per­ Figure 9 shows the close comparison between the formance of DMA writes on all systems because there AlphaServer 4100 system and a nearly perkct PC! is de!Jy in ktching the read data fi-om memory. For design in measured DMA write bandwidth. As this reason, we included three cache-line-size preferch explained above, to sustain large bursts of DMA buffe rs in the design. writes, we implemented three 64-byte posted write

72 Digiral Tc chnic.ll journal Vo l. 8 No. 4 1996 Figure 10 compares DMA read bandwidth mea­ system with a single CPU, and the results are pre­ sured on the AJphaServer 4100 system with a PCI sys­ sented in Figure 11. The pended protocol to r flow tem that has 8 cycles of initial latency in delivering control on the system bus limits the number of read DMA read data. This figure shows that delivered transactions that can be outstanding from a single bandwidth improves on the AJphaServer 4100 system CPU. A single CPU issuing reads will stall waiting fo r as burst size increases, and that the effect of initial read-return data and cannot issue enough reads to latency on measured performance is diminished with approach the bandwidth limit of the bridge. Measured larger DMAbur sts. read performance is quite a bit lower than the theoret­ The ideal PCI system used calculated performance ical limit. A system with multiple CPUs doing PIO data fo r comparison, assuming a read target latency of reads-or peer-to-peer reads-will deliver PIO read 8 cycles; 2 cycles are fo r medium decode of the bandwidth that approaches the predicted performance address, and 6 cycles are tor memory latency of 180 of the PCI bus bridge. PIO writes are posted and the nanoseconds ( ns). This represents about the best per­ CPU stalls only when the writes reach the IQ thresh­ fo rmance that can be achieved today. old. Figure 11 shows that PIO writes approach the Figure 10 shows memory read and memory read theoretical limit of the host bus bridge. line commands with burst sizes limited to what is PIO bursts are limited by the size of the I/0 read expected from these commands. As explained else­ and write merge buffers on the CPU. A single where in this paper, memory read is used fo r bursts of AJphaServer 4100 CPU is capable of bursts up to less than a cache line; memory read line is used tor 32 bytes. PIO writes are posted; therefore, to avoid transactions that cross one cache line boundary but are stalling the system with system bus flow control, in the less than two cache lines; and memory read multiple maximum configuration (see Figure 2 ), we provide a is fo r transactions that cross two or more cache line minimum of three posted write buffe rs that may be boundaries. filled before flow control is used. Configurations with The efficiency of memo1y read and memory fe wer than the maximum number of CPUs can post read line does not improve with larger bursts because more PIO writes betore encountering flowcontrol . there is no prefetching beyond the first or second cache line respectively. This sho·ws that large bursts Summary and use of the appropriate PC! commands are both necessary for efficiency. The DIGITAL AJphaServer 4100 system incorporates design innovations in the PC! bus bridge that provide Performance of P/0 Operations a highly efficient interface to 1/0 devices. Partial PIO transactions are initiated by a CPU. AJphaServer cache line writes improve the efficiency ofsmall writes 4100 PIO performance has been measured on a to memory. The peer link mechanism uses TLBs to

300

250 Cl z 0 &l 200 - F <300 100 ::;:LU 50

0 Ill32 64 128 256 512 1024 2048 4096 BURST SIZE (BYTES) KEY: 0 IDEAL PCI (8 CYCLES TARGET LATENCY) • MEMORY READ MULTIPLE (MEASURED) 0 MEMORY READ LINE (MEASURED) 0 MEMORY READ (MEASURED)

Figure 10

Comparison of DMARead Bandwidth on the AlphaServer 4100 System and on an Ideal PCI System

Digital Tl'c hnical Journal Vo l. 8 No. 4 1996 73 0 z 160 0 0 140 w (f) 120 a: w 100 o._ (f) 80 w f- 60 � 40 20 w<3 � PIO WRITE, 32-BIT PCI PIO READ, 32-BIT PCI PIO WRITE, 64-BIT PCI PIO READ, 64-BIT PCI

KEY: MEASURED PERFORMANCE 0 THEORETICAL PEAK PERFORMANCE 0

Figure 11 Comparison of Al phaServer 4100 I O with Theoretical 32-byte Burst Peak erfo rmance P Pertormancc P

map device address space on independent peer PCI References and Note buses ro permit direct peer transactions. Reordering of Wimer UNIX Hot Iron A\\'ards, U�IX EXPO Plus, transactions in queues on the PCI bridge, combined l. October 9, 1996, http://WI\w .aim.com (Menlo witb the use of PCI delayed transactions, provides a Pcnk, deadlock-free design to r peer transactions. Bufrers and Calif. AIM Te chnolog\' ).

prdetch logic that support very large bursts without 2. R. Cillett, CHAl,NEL Net11 0rk PC!," ",v! H·l ORY t()l" stalls yield a system that can amortize overhead and /f:Ff: Jiicro (FcbruarY !996): 12-18. deliver performance limited only by the PC! devices 3. G. He d cg, "Design and Implementation of the used in the system. r AlphaSen·er 4100 and Memorv Architecwre," In summary, this system meets and exceeds the per­ CPU Di.�itul Te ch nical }oumal. vol. 8, no. 4 ( 1996, this fo rmance goals established fo r the I/0 su bsystem. issue ): 48-60. Notably, I/0 subsystem support fo r partial cache line 4. Local Bus Sp ecification, Ret'ision ortla nd, writes and fo r direct peer-to-peer transactions signifi­ PC! 2. 1 ( P Oreg.: PC! Spe i Interest Group, I995) cantly improves efficiencyof operation in a MEMORY c al . CHANNEL cluster system. 5. In PC! terminology, a master is any device that arbitrates fo r the bus and initiates transactions on the (i.e., PC! ped rms DMA) before accepting a transaction as target. Acknowledgments o Biographies The DIG!Ti\L Al phaServer 4100 IjO design te am was responsible f()r the I /0 subsystem implementa­ tion. The design team included Bill Bruce, Steve Coe, Dennis Hayes, Craig Keefer, Andy Koning, Tom McLaughlin, and John Lynch. The I/0 design verin­ cation team was also key to delivering this product: Dick Beaven, Dmen·o Ko rmeluk, Singer, and Art Hitesb Vyas, with CAD support f!· om Mark Matulatis and Dick Lombard. Several system team members contributed to inven­ tions that improved product performance; most notable Samuel H. Duncan architect were Paul Guglielmi, Rick Hetherington, Glen Herdeg, A consultant enginee1· and the fo r the AlphaServer 4100 1/0 subsystem design, Sam Dunec1nis cu rremly and Maurice Steinman . We also extend thanks to our working on core logic design and architecwre fo r the next performance partners Zarka Cvetanovic and Susan generation of Alpha servers and . Since join­ Carr, who developed and ran the gueujng models. ing DIGITAL 979, he has been part ofAlplu and VAX in I Mark Shand designed the PC! Pamette and pro­ svstem engineering teams and has rep resented DICITA I. on scvcrc1l ind us r standards bodies, includin!J, PC! vided the ped()fmance measurements used in this t y the Spccic1l Interest Gwup. also chaired t group that paper. Many thanks for the nights and weekends spent He h e dcl'eloped the Sr:111dard fo r CommunicHing Among remotely connected to the system in our lab to gather IEEE roCl: ssors and Peripherals Using Shared tV !cmon·. P He Ius this data . been a11·ardcd one pcuenr and has u patems filed fo r tc)l" i1m:ntions in the AlphaSerl'er 4l00 S\'Stem. Sam rccci,·ed a fmm Tu ftsUn i,·ersitv. B.S.E.E.

1996 74 Digital Technical journal Vol. 8 No. 4 Craig D. Keefer Craig Keefer is a principal hardware engineer whose engi­ neering experrise is designing gate arrays. He was the gate array designer fo r one of the two 235K CMOS gate arrays in the AJphaServer 8200 system and the team leader fo r the comn1and and address gate array in the AJphaServer 8400 l/0 module. A member of the Server Product Development Group, he is now responsible fo r designing gate arrays fo r hierarchical switch hubs. Craig joined DIGITAL in 1977 and holds a B.S.E.E fr om the University of LowelL

Thomas A. McLaughlin

Tom McLaughlin is a principal hardware engineer work­ ing in DIGITAL's Server Product Development Group. He is currently involved with the next generation of high­ end server platforms and is fo cusing on logic synthesis and ASIC design processes. For the AJphaServer 4100 project, he was responsible fo r the logic design ofthe l/0 subsystem, including ASIC design, logic synthesis, logic verification, and riming verification. Prior to joining the AJphaServer 4100 project, he was a member of Design and Applications Engineering within DIGITAL's External Semiconductor Technology Group. Tom joined DIGITAL in 1986 after receiving aRT E.ET fr om the Rochester Institute ofTechnology; he also holds an M.S.C.S. degree ti·om the Worcester Polytechnic Institute.

Digital Te chnical Journal VoL 8 No. 4 1996 75 I Vipi.n V. Gokhale

Design of the 64-bit Option for the Oracle7 Relational Database Management System

Like most database management systems, the Introduction Oracle7 database server uses memory to cache data in disk files and improve the performance. Historically, the limiting tacror tor the Oracle7 rela­ tional database managcment system (RDBMS) pertor­ In general. larger memory caches result in better mancc on any given platform has been" thc amount of performance. Until recently, the practical limit computational and I/0 rcsources available on a single on the amount of memory the Oracle7 server node. Although CPUs havc bccomc taster by an order could use was well under 3 gigabytes on most of magnitude over thc last sc1·eral ycars, I/0 speeds 32-bit system platforms. Digital Equipment ha1·c not imprm·ed commensur:nclv. For instance, the Corporation's co mbination of the 64-bit Alpha Alpha CPU clock speed alone has increased to ur times since its introduction; during the same time period, system and the DIGITAL UNIX operating system disk access times have improved by a t: Kror of two differentiates itself from the rest of the com­ at bcst. The overall throughput of database software is puter industry by being the first standards­ critically dependent on the speed of access to data. compliant UNIX implementation to support To overcome the ljO specd limitation and to maxi­ linear 64-bit memory addressing and 64-bit mize performance, the standard Oracle7 database server application programming interfaces, allowing alreadv utilizes and is optimized tor various paraUeliza­ tion techniques in software (e.g., intelligent caching, high-performance applications to directly access data prcfctching, and parallel query execution) and in memory in excess of 4 gigabytes. The Oracle7 hardware (e.g., symmeu·ic multiprocessing [SMP] sys­ database server is the first commercial data­ tems, clusters, and massi1·clv parallel processing [MPP] base product in the industry to exploit the per­ systems). Given the disparity in latency fo r data access formance potential of the very large memory between memory (a tew tcns of nanoseconds) and disk configurations provided by DIGITAL. This paper (a te w milliseconds), a common technique fo r maximiz­ ing performance is to mini mize disk ljO. Our project explores aspects of the design and implementa­ originated as an investigation into possible additional tion of the Oracle 64 Bit Option. performance improvements in the Oracle7 database scrver in tl1e context of increased memory addressability and execution speed provided by the AlphaServer and DIGITAL UNDC system. Work done as part oftl1is proj­

ect subsequently became the foundation tor product development of the Oracle 64 Bit Option. Of the memory resource that the Oracle7 database uses, the largest portion is used to cache the most fr e­ quently used data blocks. With hardware and operat­ ing system support fo r 64-bit memory addresses, new possibilities have opened up for high-performance applic:nion software to take advantage of large mem­ ory configurations. Two of the concepts utilized are hardly new in data­ base development, i.e., improl'ing database server per­ fo rmance by caching more data in memory and improving ljO subsystem throughput by increasing data transfEr sizes. However, various conflicting ftc­ tors contribute to the practical upper bounds on

76 Journal Vol . 8 1996 Digital Te chnic� I No. 4 performance improvement. These fa ctors include basic unit tor I/0 and disk space allocation in the CPU architectures; memory addressability; operating Oracle7 RDBMS. Large block sizes mean greater den­ system fe atures; cost; and product requirements tor sity in the rows per block to r the data and indexes, and portability, compatibility, and time-to-market. An typically benefitdecisio n-support applications. Large additional design challenge for the Oracle 64 Bit blocks are also usdi.d to applications that require long, Option project was a requirement for significantper­ contiguous rows, tor example, applications that store fo rmance increases fo r a broad class of existing data­ multimedia data such as images and sound. Rows that base applications that use an open, general-purpose span multiple blocks in Oracle7 require proportion­ operating system and database software. ately more 1/0 transactions to read all the pieces, This paper provides an overview of the Oracle 64 resulting in performance degradation. Most platforms Bit Option, fa ctors that influenced its design and that run the Oracle7 system support a maximum data­ implementation, and performance implications tor base block size of 8 kilobytes (KB); the DLGrTAL some database application areas. In-depth information UNIX system supports block sizes of up to 32 KB. on Oracle7 RDBMS architecture, administrative com­ The shared global area ( SGA) is that area ofmemory mands, and tuning guidelines can be found in the used by Oracle7 processes to hold critical shared data Orac!e 7 Seruer Docu mentation Set .' Detailed analysis, structures such as process state, structured query lan­ database server, and application-tuning issues arc guage (SQL)-Ievel caches, session and transaction deferred to the references cited. Overall observations states, and redo buffers. The bulk of the SGA in terms and conclusions from experiments, rather than specific of size, however, is the database buffer (or block) details and data points, are used in this paper except cache. Use of the buffer cache means that costly disk where such data is publicly available. l/0 is avoided; therefore, the performance of the Oracle7 database server relates directly to the amounr Oracle 64 Bit Option Goals of data cached in the buffercache . LSGA seeks to use as much memory as possible to cache database blocks. The goals fo r the Oracle 64 Bit Option project were as Ideally, an entire database can be cached in memory follows: (an "in-memory" database) and avoid almost all 1/0 during normal operation. Demonstrate a clearly identifiable performance • A transaction whose data request is satisfied ti·om increase for Oracle7 running on DIGITAL UNJX the database buffer cache executes an order of magni­ systems across two commonly used classes ofda ta­ tude faster than a transaction that must read its data base applications: decision support systems ( DSS) fi·om disk. The difference in pcrtcxmance is a direct and online transaction processing (OLTP). consequence of the disparity in access times tor main • Ensure that 64-bit addrcssability and large memory memory and disk storage. A database block tound in configurations arc the only two control variables the buffer cache is termed a "cache hit." A cache miss, that influence overall application performance. in contrast, is the single largest contributor to degra­

• Break the 1- to 2-GB barrier on the amount dation in transaction latency. Both BO l3 and LSGA use of directly accessible memory that can practically memory to avoid cache misses. The Oracle7 bufkr be used tor typical Oracle7 database cache cache implementation is the same as that of a typical implementations. write-back cache. As such, a cache miss, in addition to resulting in a costly disk can have secondary • Add scalability and performance features that com­ 1/0, plement, rather than replace, current Orade7 efkcts. For instance, one or more of the least recently server SMP and duster ofterings. used buffers may be evicted from the butkr cache if no tree bufkrs arc available, and additional transac­ Implement all of the above goals without signifi­ 1/0 • tions may be incurred if the evicted block has been cantly rewriting Oracle7 code or introducing appli­ modifiedsince the last time it was read trom the disk. cation incompatibilities across any of the other Oracle7 buffe r cache management algorithms already platforms on which the Oracle7 system runs. implement aggressive and intelligent caching schemes and seek to avoid disk Although cache-miss Oracle 64 Bit Option Components l/0. penalties apply with or without the 64-bit option, "cache thrashing" that results from constrained cache Two major components make up the Oracle 64 Bit sizes and large data sets can be reduced with the Option: big Oracle blocks (BOB) and large shared option to the benefitof many existing applications. global area (LSGA). They are brieflydescribed in this The Oracle7 buffer cache is specifically designed section. and optimized tor Oracle's multi-versioning read­ The BOB component takes advantage of large consistency transactional model. ( Oracle7 buffer memory by maki ng individual database blocks larger cache is independent of the DIGITAL UNIX unified than those on 32-bit platf(mm. A database block is a buffe r cache, or UBC.) Since Oracle7 can manage irs

Digital li:dmical journal Vol. S No. 4 I 9Y6 77 own butkr cache more etkcti,·clv than fill:: system strained by the bet that this resource is also shared by butkr caches, it is oth.:n recommended that the file many other critical dau structures in rhe SGA besides system cache size be red uced in tiwor of a larger the bu tler cache and the memory needed by the oper­ Or;�clc7 buffe r cache when rhc database resides on <�ting system. By eliminating the need to choose

�1 filesyste m. Red ucing filesystem cache size also mini­ between the size of the database blocks and bufkr mizes redundant c;�ching of dJta at the file system cJche, Oracle? on a 64-bir pl:!rtcm11 can run a greater level. For this reason, we rejected early on the obvious application mix without sacrificingper formance. design solution of using the DIGITAL UNIX filesys­ Despite the codependency and the common go;�l tem as a large cache t( >r taking advantage of brge of red ucing costly disk 1/0, BOB and LSGA address memory configurations-even though it had the t\\'o diftcrent dimensions of d'1tabase scalability : BOB appeal of complete transpJrency and no code changes ;�ddresses on-disk dat;�basc size, Jnd the LSGA add resses the Oracle? system. to in-memory database size. Application de,·elopers ;�nd daubase ad ministrators have complete flexi bility to Background and Rationale for Design Decisions bvor one over the other or usc them in combin;�rion. to In Oracle?, the on-disk d;�ta structures that locate

The primary impetus t(>r this project was to evaluate :1 row of data in the datJbasc usc <1 block-address­ the impact on the Oracle? dJtabase server of emerging byte-onset tuple. The data block address (DBA) is a 64-bit platfo rms, such as the AlphaServer system and 32-bit quantity, which is fu rther broken up into file D!GITAL UNIX operating system. Goals set t(>rth number and block other within rlut fi le. The byte off­ t(>r this project and subscqw.:nr design considerations set within a block is a 16-bit quantit\'. Although the therctore excluded any pcrt(mnJnce and fu nctionality number of bits in the DBA used t<>r filenumb er and cnhJncements in the Oracle? RDBMS that could not block oftsct are platt(mn dependent ( 10 bits tor the file be attributed to the benefits ofkrcd by a typical 64 -bit number and 22 bits t()r the block oftsct is a common platt(mn or otherwise cncapsuiJtcd within platt(mn­ r< m1ut ), there exists a theoretic:d upper limit to the spccificlayers of the dat;�bsc server code or the oper­ size of an Oracle? dJtabase. With some exceptions, ating system itself. most 32-bit platto rms support a maximum data block Common areas of potential benefit for a typical size of 8 with 2 as the dd�ult. For example, K.B, K.B 64-bit pl;�ttorm (when compJred to its 32-bit coun­ using a 2- block size, the upper limit for the size KB terpart) are (a) increased direct memory address<�bility, of the database on DIGITAL UNIX is slightly under and (b) the potential t(>r configuring systems with 8 te r<�Lwtes (TB); whereas ,1 32-K.B block size raises greater than 4GB of memor�'· As noted above, appli­ that limit to slightly under 128 TB. The abilit\' to sup­ cation performance of the Oracle? database sen·cr port bufte r cache sizes ,,·ell bcvond those of 32-bit depends on whether or not data Jre t( JLllld in the datJ­ plartnn. BOB and LSGA reflect the only logical design (especially if the data file is a file system managed choices available in Oracle? take advantage of this object) Jnd therefore may not be able to use all of the to extended addressability and meet the project goals. ,wJiiJblc block oHset r;�ngc in the existing DBA fo r­ Implementation of these components focused on nut. The largest database size that can be supported in ensuring scalability and maximizing the eftectiveness such a case is eYen smaller. Addressing the percei,·ed ofav,1ilable memory resources. limits on the size of an Oracle? databJse was an impor­ tant consider;�tion. Design Jl tcmatiYes that required BOB: Decisions Relevant to On-disk Database Size ch;�nges to the lavout or an interpretation of DBA tc>r­ L<�rgcr database blocks consume proportionately mat were rejected, at least in this project, because such larger ;�mounts of memory when the data conr;�ined in chJnges would have introduced incompatibilities in those blocks are re<�d fr om the disk into the database on-disk data structures. butler cache. Consequently, the size of the buftc r It should be pointed out th<�t on current Al ph:.1 cache itself must be increased ibn application requires processors using an 8-KB page size, a 32-KB data a greater number of these larger blocks to be cached . block sp<�ns to ur memory pages, and 1/0 code p<�ths in the operating system need lock/u nlock tc) ltr For any given size of database buftc r cache, Oracle? to cb tabase administrators of 32-bir platforms have times as m;�ny pages when pert(>rming an 1/0 trans­ had to choose between the size of each database block :.lction. 
The larger transkr size also :�dds to the total :m d the number of cbtabasc blocks that must be in time t<�ken to pedorm an 1/0. For instance, tour the cache minimize disk 1/0, the choice depending pJges of memory that cont<� in the 32- data block to KB on data access patterns of the applications. Memory may not be physically contiguous, :�nd " scatter-gather available to r rhe database buftc r cache is fu rther con- operation may be required. Although the Oracle7

No.4 J996 Dig;itJITe chnical journal VoL X database supports row-Jevel locking tor maximum would use segmented allocations if the size of concurrency in c:tseswhere unrelated transactions may the memory allocation request exceeds a platform­ be accessing differentrows within a given data block, dependent threshold. In particular, the size in bytes access to the data block is serialized as each individual for each memory allocation request (a platt(.>rm­ change (a transaction-level chJnge is broken down dependent value) was assumed to be well under 4 GB, into multiple, smaller units of change) is applied to the which was a correct assumption tor all 32-bit plat­ datJ block. Larger data blocks accommodate more forms (and even fo r a 64-bit platform without LSGA). rows of data and consequently increase the probability Internal data structures used 32-bit integers to repre­ of contention at the data bJock level if applications sent the size of a memory allocation request. change (insert, update, delete) data and have a locality For each buffe r in the buffe r cache, SGA also of rekrence. Experiments have shown, however, that contains an additional d::tta structure (bufkr header) this added cost is only marginal relative to the overall to hold all the metadata associated with that buf.. performance gains and can be oftset easily by carefully fer. Although memory tor the buffer cache itself was tuning the application. Moreover, applications that allocated using a special interface into the memory mostly query the data rather than modif)' it (e.g., DSS management layer, memory allocation tor butkr applications) greatly benefit fr om larger block sizes headers used conventional interfaces. A ditkrent since in this case access to the data block need not be allocation scheme was needed to allocate memory serialized. Subtle costs such as the ones mentioned for buffe r headers. The bufkr header is the only above nevertheless help explain why some applications major data structure in Oracle7 code whose size may not necessarily see, tor example, a fourfOld per­ requirements are directly dependent on the number of formance increase when the change is made fi-om an buffers in the bufter cache. Existing memory man­ 8-KB block size to a 32-KB block size. agement interfaces and algorithms used prior to LSGA Aswith Oracle7 implementations on other platfonm, work were adequate until the number of buffers in database block size with the 64-bit option is determined the buffer cache exceeded approximately 700,000 at database creation time using db_block_size con­ (or buffe r cache size of approximately 6.5 GB). Minor figurationparamet er.' It cannot be changed dynamically code changes were necessary in memory manage­ at a later time. ment algorithms to accommodate bigger allocation requests possible with existing high-end and fu ture LSGA: Decisions Relevant to In-memoryDatabase Size AlphaServer configurations. The focus f<>r the LSGA eftort was to idcntif)'and elim­ The AlphaServer 8400 platform can support mem­ inate any constraints in Oracle7 on the sizes to which ory configurations ranging from 2 to 14 GB, using the database buffer cache could grow. DIGITAL UNIX 2-GB memory modules. 
Some existing 32-bit plat­ memory allocation application programming interfaces forms can support physical memory configurations (APis) and process address space layout make it fa irly that exceed their 4-GB addressing limit by way of seg­ straightforward to allocate and manage System V mentation, such that only 4 GB of that memory is shared memory segments. Although the size of each directly accessible at any rime. Programming complex­ shared memory segment is limited to a maximum of ity associated with such segmented memory models 2GB (due to the requirement to comply with UNIX precluded any serious consideration in the design standards), multiple segments can be used to work process to extend LSGA work to such platforms. around this restriction. The memory management Significantly rewriting the Oracle7 code was specifi­ layer in Oracle7 code therefore was the initial area of cally identifiedas a goal not to be pursued by this proj­ focus. Much of the Oracle7 code is written and archi­ ect. The Alpha processor and DIGITAL UNIX system tected to make it highly portable across a diverse range provides a Aat 64-bit virtual address space model to of platf(>rms, including memory-constrained 16-bit the applications. DIGITAL UNIX extends standard desktop platforms. A particularly interesting aspect of UNIX APis into a 64-bit programming environment. 16-bit platforms with respect to memory management Our choice of the AlphaScrver and DIGITAL UNIX as is that these platforms cannot support contiguous a development platform fo r this project was a fairly memory allocations beyond 64 K.B. Usersarc fo rced simple one ti-om a time-to-market perspective because to resort to a segmented memory model such that it allowed us to keep code changes to a minimum. each individual segment docs not exceed 64 K.B in Efficiently managing a buffer cache of� for example, size. Although such restrictions are somewhat con­ 8 or 10 GB in size was an interesting challenge. More straining (and perhaps irrelevant) tor most 32-bit than five million buffers can be accommodated in a platforms-more so tor 64-bit platforms-which can 10-GB cache, with a 2-KB block size. That number of easily handle contiguous memory allocations well buffers is already an order of magnitude greater than in excess of 64 K.B, memory management layers in what we were able to experiment with prior to the Oracle7 code are designed to be sensitive and cautious LSGA work. The Oracle7 butter cache is organized as about large contiguous memory allocations and an associative write-back cache. The mechanism tor

Digital Technical Joumal Vol. 8 No. 4 1996 79 lo-=ating a data blo-=k of interest in this -=a-= he is supported entry a/lo\\' Alpha usc :1 single trans/Jtion the CPU to by -=omrnon algorithms :md data structures such as hash look-aside buffe r (TLB) entrv to map a 512K physic:1l fu nctions and linked lists. In manv cases, traversing criti­ memory space. Using one TLB entry to map larger cal /inked lists is serialized among contending threads of physiul memory has the potential to reduce proces:;or execution to maintain the imcgrity ofthe lists themselves stalls during TLB misses and re fills.Also, because oftl1e and secondary data structures managed by these lists. requirement that the grJnularity region be both As hint a result, the size of such critical lists, t()r example, has an virtually and physically contiguous, it is Jllocated at sys­ impact on overall cotKUtTcncy. The larger buHer count tem startup time and is not subject to normal virtual now possible in LSCA conf-igurationsh: Ki the net eftcct memory management; t(>r example, it is never paged in of reduced concurrency because the size of these lists is or out, and subsequcntlv the cost of a page ta ult is mini­ proportionate/\' larger. LSCA pro\'ided a ti·amework to mal. Since pages in granulatity hint regions are ph)'Si ­ test contributions ti-om other unrelated projects that callv contiguous, anv I/0 done h·om this region of addressed such potenti:�l bottl enecks concurrency, as memory is rdati\'eiv more efficient because it need not to it could realistically simubte re latively more stringent go through the scatter-garber plusc. boundary conditions than bd(>rc. Summaryof Te st Results Scalability Issues One of the project gor both applications, that lyzing and addressing the scaL1bility issues in the base is, TPC-C: t()r OLTP and TPC-D t(Jr DSS. An industrv­ operating system

Such :1 mcmorv layout Jllows Dil;ITAL UNIX to take with a 2- databJse block size KB JdvJntJge of the gr.:mularity hint tl:Jture supported by • A 64-bit option-enabled configm:1tion with a 7-CB Alpha processors. Granularity hint bits in a page table SCA �md 32- database block size KB

Dig:it:1l Tcchnic1l Journal Vol. R No. 4 !996 PERFORMANCE RATIOS OF LSGA TO SGA of the 64-bit option) is a standard offering in the 251 9 Oracle7 database server product since release 7 .l. Use 250.0 ,---:.. �2 2E:_8 of parallel query in this test illustrates the efkct of the 64-bit option enhancements on preexisting mecha­ Q 200.0 � nisms tor database performance improvement. a: All other things being equal, if the only difference w 150.0 0 z between a standard configuration and a 64-bit­

Digital Tec hnical )ourn;ll Vol. 8 No. 4 1996 81 Unlike fu ll table scans, the sort/merge operation Acknowledgments generates intermediate results. Depending on the size of these partial results, they may be stored in main i'vL lnl' people \\'ithin se,·nal groups and d iscip l im:s <1 t botl1 DIGITAL memory if an adequate amount of memory is avail ­ Oracle .md ha\·e contributed to the succc'' of this pmjcct. I \\ Ould like thank the r(JIIo\\'ing:i n di,·id u .Jis li·om able; or they mav be written back to temporary storage ro Oracle: vV,l ltcr Bnt, istc ll , Saar Maoz, et' Ke n nnh· and space in the database. The latter operation results in a l D,l\ id lr\\'in of the DIGITAL Svsrem nu,i ncss Lnir; additional 1/0s, proportionately more in number as a nd h·om DIGITAL: Jim Wo odward, PauLl Lo ng, ]),lrrc\1 inputs to the sort/merge grow in size or count. The Dunnuck, and Da\'l: Winchell of the D!Cl'I':\ L C :--.: IX 64-bit option makes it possi ble to eliminate these TjOs t-: nginecring group. M e m bers of the Compu rn s\'StCillS as well, as illustrated in transaction types 4 through 6. Division's Performance Group ar DIGITAL have also con ­ Pcrtim11ance i mprovements are greater as the com­ tri buted ro this project. plexityof queries increases .

Conclusion References

I. Omclc7 Scrt•cr Documentatiou Set (Redwood Shtncs, The disparity between memory speeds and disk speeds C:�lif.: Oracle Corpor:�tion ). is likely to continue t( )r the f(xcseeablc fu ture. L1 rgc memory configurations represent an opportunity to 2. t>t(,'f'liU l.\/X \·4.0 Release Notes ( M avn:t rd, M ass . : overcome this disparity .m d to increase application Digital Equipment Corporation, 1996 ) . ( by a h ng large amount of data in pert mluJKe c c i a 3. R. S ites :�nd R. Wi tek , eels., Alpba Archi!cc/11re Hefc r­ memory. Fven though the Oracle 64 Bit Option ence ,\'hunwl ( ':'-Je,,ron, Mass.: Digiul l'rcss, 199S ). improves database pertormance-two orders of mag­ 4. Orocle 64 Bil Op tion Peifomtmtce Hcport ou f)(r;i!al nitude in some Glses-specific application characteris­ I '. \'IX ( Rcd\\'ood Shores, C:�lif: Or,lclc C:orpor tion , tics must be evaluated to determine the best means fo r :� parr number C 10430, 1996) maximizing overall pedormancc and to balance the

significant increase in hardware cost fo r the large S. ]. Pian tcdosi , 1\ . S;lth ;we, and D. Shakshobcr, " Pnr(Jr­ amount of memory. The Oracle 64 Bit Option com ­ mance McclSUJ'elllctH of Tru C:Iustcr Svsrems under rile 'l'PC-C: lknchm;lrk," and T. Kawar; D. Slukshobcr, and pl eme nts existing Oracle7 features and functionali ty. D. St n l ey, "Perform;HKe An:tlvsis Using Vcrv Luge The exact extent of the increases in speed with the a McnJOry on the 64- bir AlphaSen·er System," f)ip, i!al 64-bit option varies based on the type of database 'f(>ch n ical jn umul. vol. 8, n o. 3 ( 1996 ): 46-6S. operation . F;:�stcr CPUs and denser memory allow tiJr even more pcrt(mllancc improvements than have Biography been demonstrated . Factors of importance to new or existing app lications, particularly those sensitive to response time, arc an order of magnitude performance Vipin V. Gokhale

in terms of speed increases and the abi l i ty to utilize Vi pin (;okh;l\c is ;1 Consulting Sofnv

configur;1 tions require special hardware (tor example, nonvolatile random access memory [RAM]). Because a 64-bit AlphaServer and DIGITAL UNIX

operating system transparently extends existing 32-bit AP!s into a 64-bit programming model, applications can rake advant;'lge of added addressability without

using specialized A Pis or m a king significant code changes. Pcrt(mll:tnce levels equal to or better than previously possible wi th specia l ized hardware and soft­ ware can now be achieved with industry-standard, ope n, ge ne ral-purpose platforms.

82 Digital Tc dmic;�l )ourn�l Vol. 8 No. 4 1996 I T. K. Rengarajan Maxwell Berenson Ganesan Gopal VLM Capabilities of Bruce McCready 11 Sapan Panigrahi the Sybase System Srikant Subrarnaniarn SQL Server Marc B. Sugiyanu

Software applications must be enhanced to The advent of the System ll SQL Server trom Sybase, take advantage of very large memory (VLM) Inc. coincided with the widespread availability and use of very large memory (VLM) technology on system capabilities. The System 11 SQL Server DTGITAL's Alpha microprocessor-based computer from Sybase, Inc. has expanded the semantics systems. Technological features of the System ll SQL of database tables for better use of memory Server were used to achieve record results of 14,176 on DIGITAL 64-bit Alpha microprocessor-based transactions-per-minute C (tpmC) at $198/tpmC systems. Database memory management for on the DIGITAL AJphaServer 8400 server product.' the Sybase System 11 SQL Server includes the One ofthese features, the Logical Memory Manager, provides the ability to tine-tune memory manage­ ability to partition the physical memory avail­ ment. It is the first step in exploiting the semantics of able to database buffers into multiple caches database tables ror better usc of memory in VLM sys­ and subdivide the named caches into multiple tems. To partition memory, a database administrator buffer pools for various 1/0 sizes. The database (DBA) creates multiple named bufkr caches. The management system can bind a database or DBA then subdivides each named cache into multiple one table in a database to any cache. A new buffe r pools fo r various 1/0 sizes. The DBA can bind a datab<1Se or one table in a dat<1base to anv cache. facility on the SQL Server engine provides A new thread in the SQL Server engine, called the nonintrusive checkpoints in a VLM system. Housekeeper, uses idle cycles to provide free (non­ intrusive) checkpoints in a large memory system. In this paper, we briet1y discuss VLM technology. Then we describe the capabilities of the Sybasc System l l SQL Server that address the issues of fast access, checkpoint, and recovery ofVLM systems, namely, the Logical Memory Manager, a VLM query optimizer, the Housekeeper, and fuzzy checkpoint.

VLM Te chnology

The term very large memory is subjective, and its widespread meaning changes with time. By VLM, we mean systems with more than 4 gigabytes (GB) of memory. In late 1996, personal computer servers with 4 GB of memory appeared in the marketplace. At $10 per megabyte ( M B), 4GB of memory becomes afford­ able ($40,000) at the departmental level fo r corpora­ tions. \Ve expect that most of the mid-range and high-end systems will be built with more memory in 1997. Growth in the amount ofsystem memory is an ongoing trend. Growth beyond 4 GB, however, is a significant expansion; 32-bit systems run out of mem­ ory alter 4 GB. DIGITAL developed 64-bit computing with its Alpha line of microprocessors. Digital is now

Digital Te chnical Journal Vo l. 8 No. 4 !996 83 \\Til-positioned to facilitate the transition ti-om 32-bit provides technological ad\·ances that take advantage of to 64-bit S\'Stcms. Sybase, Inc. provided one of the first VLM systems. These arc the Logical Memory relational database management svstcms to usc VLM Manager, VLM query optimization, the Housekeeper technology. The Svbase System ll SQL Server pro­ thread, and fu zzy checkpoints. We discuss the signifi­ ,·idcs fu ll, native support of64-bit Alpha microproces­ cance of these adv,mccs in the remaining sections of sors and the 64-bit DIGITAL UNIX operating system. this paper. DIGITAL UNIX is the firstoperating system to provide a 64-bit address space fo r all processes. The System 11 Logical Memory Manager SQL Server uses this large address space primarily to c:tchcla rge portions of the database in memory. The Sybase SQL Server consists of several DIGITAL VLM technology is appropriate t(x usc with applica­ UNIX processes, called engines. The DBA configures tions that have stringent response time requirements. the number of engines. As shown in Figure l, each With these applications, tor example, call-routing, it engine is permanently dedicated to one CPU ofa sym­ becomes necessarv to fit the entire database in mcm­ metric multiprocessing (SMP) machine. The Sybasc orv.' 'The usc of VLM svstcms can also be beneficial engines share virtual memory, which has been sized to when the priccjperform::mce is improved by adding include the SQL Server executable. The virtual mem­

more memory.' ory is locked to physic:.1l memorv. As :.1 result, there is never any operating system paging f( .>r the S�'bJSC l\llain Memory Database Systems memory. This shared memory region also uses large operating system pages to minimize translation look­ The widespread availability of VLM systems raises aside buffer(T LB ) entries t(x the CPU.-'The shared the possibility of building main memory database memory holds the database buffers, stored procedure (MMDB) systems. Several techniques to improve the cache, sort bu ftc rs, and other dynamic memory. This pcd(xmance of MMDB systems have been discussed memory is managed exclusively by the SQL Server. in the database literature. R.ckrcncc 5 provides an One SQL Server usu;�lly processes transactions on excellent, detailed sun•qr. \Vc provide a brief discus­ multiple databases. EKh database has its own log. sion in this section. Transactions can span databases using two-phase com­ Lock contention is low in MMDB svstcms since the mit. For fu rther details on the SQL Server architec­ datcl resides in memorv. Hence, the granularity ofcon­ ture, please sec rclcrcncc 9. cutTcncv control can be increased to minimize the The Logical Mcmorv Nl anager (LMM) provides the overhead of lock operations. The lock manager data ability tor a DBA to partition the physical memory structures can be combined with the database objects available to database bufkrs. The DBA can partition to reduce memory usage. Specialized, stable memory the mcmorv used f(Jr the database buffe rs into multi­ hardware c:m be used to minimize latency of logging. ple caches. The DBA needs to specifYa size and a name Early release of transaction locks and group commit t(x each cache. After all named caches have been during commit processing c:m be used to increase defined, the system ddincs the remaining memory as concurrency and tbroughput. Since random access is the default cache. Once the DBA partitions the mem­ bst in MMDBs, access methods can be developed with ory, it can then bind database entities to a particular no key ,·alucs in the index but only pointers to data cache. The datab:.1sc entity is one of the fo llowing: an rows in mcmory.6 Querv optimizcrs need to consider CPU costs, not 1/0 costs, when comparing various altcmativc plans to r a querv. In an i'viMDB, check­ CPU CPU pointing and ta ilure recovery arc the only reasons fo r pcrt(m11ing disk operations. A checkpoint process can be made "h.1 z.zy" with low impact on transaction throughput. After a system failure, incremental recov­ ery processing allows transaction processing to resume bd(H-c the recovery is complctc.7 As memory sizes increase with VLM systems, data­ base sizes arc also increasing. In general, we expect that databases will not fi t in mcmorv in the next decade. Therefore, tor most of the databases, MiviDB MEMORY techniques can be exploited onlv for those p

Dig:italTe chnical )ourn,ll Vo l. 1l No. 4 1996 entire database, one table in a database, or one index costs are reduced to an estimate of time. Since the on one table in a database. There is no limit to the number of I/0 operations depends on the amount of number of such entities that can be bound to a cache. memory available, the optimizer uses the size of the This cache binding directs the SQL Server to use only cache in the cost calculations. With LMM, the opti­ that cache tor the pages that belong to the entity. mizer uses the size of the named cache to which a cer­ Thus, the DRA can bind a small database to one cache. tain table is bound. Therefore, in the case of a database In a VLM system, if the cache were sized to be larger that completely ti ts in memory in a VLM system, the than the database, an MMDB would result. optimizer choices are made purely on the basis of CPU Figure 2 shows the table bindings to named caches cost. In particular, the I/0 cost is zero, when a table with the LMM. The procedure cache is used only or an index fits in a named cache. fo r keeping compiled stored procedures in memory The Sybase System 11 SQL Server introduced the and is shown tor completeness. The item cache is a notion of the fe tch-and-discard buffe r replacement small cache of l GB in size and is used fo r storing policy. This strategy indicates that a buffer read from a small read-only table (item) in memory. The default disk will not be used in the near fi1ture and hence is cache holds the remaining tables. Figure 2 shows one a good candidate to be replaced fi·om the cache. The table bound to the item cache and the other tables buffe r management algorithms leave this buffe r close bound to the ddault cache. By being able to partition to the least-recently-used end of the buffe r chain. In the use of memory fo r the item table separately, the the simplest example, a sequential scan of a table uses SQL Server is now able to take advantage of MMDB this strategy. With VLM, this strategy is turned off techniques tor only the item cache. if the table can be completely cached in memory. The Each named cache can be larger than 4 GB. The size fetch-and-discard strategy can also be tuned by appli­ is limited only by the amount of memory present in cation developers and DBAs if necessary. the system. Although we do not expect such a need, it is also possible to have hundreds of named caches; Housekeeper 64-bit pointers are used throughout the SQL Server to address large memory spaces. One of the motivations fo r developing VLM was the The LMM enables the DBA to fine-tunethe use of extremely quick response time requirements fo r trans­ memory. The LMM also allows fo r the introduction actions. These environments also require high avail­ ofspecitlc MMDB algorithms in the SQL Server based ability of systems. A key component in achieving high on the semantics of database entities and the size of availability is the recovery time. Database systems named caches. For example, in the fu ture, it becomes write dirty pages to disk primarily for page replace­ possible fo r a DBA to express the tact that most of one ment. The checkpoint procedure writes dirty pages to table tits in one named cache in memory, so that SQL disk to minimize recovery ti me. Server can use clock butfe r replacement. The Sybase System 11 SQL Server introduces a new thread called the Housekeeper that runs only at idle VLM Query Optimization time tor the system and does usefulwork. 
This thread is the basis fo r lazy processing in the SQL Server fo r The SQL Server query optimizer computes the cost now and the future. In System 11, the Housekeeper of query plans in terms of CPU as well as I/0. Both writes dirty pages to disk. At first, it writes pages to disk from the least-recently-used buffe r. In this sense, it helps page replacement. In addition to ensuring that there are enough clean buffers, the Housekeeper also PROCEDURE CACHE, 0.5 GB attempts to minimize both the checkpoint time and the recovery time. If the system becomes idle at any ITEM CACHE, 1 GB time during transaction processing, even fo r a few mil­ m liseconds, the Housekeeper keeps the disks (as many as possible) busy by writing dirty pages to disk. It also m makes sure that none of the disks is overloaded, thus DECAULT CACHC, 45 GB • preventing an undue delay if transaction processing resumes. In the best case, the Housekeeper automati­ cally generates a fr ee checkpoint tor the system, I thereby reducing the performance impact of the checkpoint during transaction processing. In steady Figure 2 state, the Housekeeper continuously writes dirty pages Table Bindings to Named Caches with Logical Memory Manager to disk, while minimizing the number of extra writes incurred by premature writes to disk.10

Digital ·rcchnical journal Vo l. 8 No. 4 1996 85 Checkpoint and Recovery the aver:1ge) 8 transactions to complete, assuming uni­ f(Jrm arrival rates at commit point. This indicates a nat­ As the size of memory increases, the h: >llowing two ural grouping of 8 transactions per log write. For the bctors increase as well: (1) the number of writes to same system, if the log disk is rated at 3,600 rpm, the disk during the checkpoint �m d (2) the number of same calculation yields 16 transactions per log write. disk !jOs to be done during recovery. The Sybasc The group commit algorithm used by the SQL System 11 SQL Server allows the DBA to tune the Server also takes ack111tagc of disk arrays by initiating amount of buffe rs that will be kept clean all the time. multiple asynchronous writes to diffe rent members of This is called the wash region. In essence, the wash the disk array. The SQL Server is also able to issue up

region represents the amount of memory that is always to 16 ki lobytes in one write to a single disk. Together, clean ( or strictly, in the process of being written to the group commit algorithm, large writes, and the disk). For example, if the total amount of memory tcJr ability to drive multiple disks in a disk array climin;He databJse butle rs is 6GB and the wash region is 2 GB, the log bottleneck t(Jr high-throughput systems. rhen at any time, only 4 GB of memor y can be in an

updated StJte (dirty ). The ability to tunc the wash Future Work region reduces the load on the checkpoint procedure, as well :�srecovery. 'vVhen a VLM svstcm tj ils, the large number of data­ The Sybase System 11 SQL Server has implemented base butle rs in mcmorv that are dirtv need to lx: . . a fu zzy checkpoint thJt allows transactions to proceed recovered . Therd(Jre, database recovery time grows

even during a checkpoint operation. Trans::�crions with the size of m emory in the VLM system, at least are srallcd only when they try to upd:ltc a database tor all database systems that usc log-based recovery. page that is being written to disk by the checkpoint. In addition, since there arc a large number of dirty Even in that case, the stall lasts only tcJr the time buffers in memory, the pcrt(xmance impact of check­ it takes the disk write to complete. In addition, in point on transactions also increases with memory size. tbe SQL Server, the checkpoint process can keep mul­ To minimize the recovery time, one may increase the tiple disks busy by issuing a large number ofasy nchro­ checkpoint ti-equency. The checkpoints have a higher nous writes one after another. During the time of impact, however, ;m d need to be done infrequentlv. rhc checkpoint, the Housekeeper ofTen becomes These conflicting requirements need to be addressed active due to extra idle time cre:ued by the checkpoint. for VLM systems. The Housekeeper is selfpacing; it docs nor swamp the When a database tits in mcmorv, the buffer replace­ storage system with writes. ment algorithm can be eliminated. For example, t( Jr a single table that tits in one named cache, this opti­ Commit Processing mization can be done with the LMM. In addition, if a table is read-only, it is possible to minimize the syn­ The SQL Server uses the group commit algorithm to chronization necessary ro access the buffers in mem­ improve throughput.8·" The group commit algorithm ory. These optimizations require syntax f()f the DBA collects the log records of m u l tiple trJnS

writes them to the disk in one l/0. This allows hightT as well as properties of named caches (for example,

transaction throughput due to the amortization of bufkr replacement a l gori thms). disk I/0 costs, as well as committing more ;:m d more These two areas as well as other MlvlDBtec hniques trJnsactions in each disk write to the Jog ti le. The SQL will be explored by the SQL Server developers tc Jr Server docs not use a timer, however, to improve the incorporation in ti.1turc releases. grouping of transactions. Instead, the duration of the previous log I/0 is used to collect transactions to be Summary committed in the next batch. The size ohhe batch is determined by the number of transactions that reach The Sybasc System I 1 SQL Server supports VLM

commit processing during one ro ta tion of the log systems built and sold by DIGITAL. The SQL Server disk. This self-tuning algorithm adapts itself to various can completely cache parts of a database in memory. speeds of disks. For the same transaction processing It can also cache the entire database in memory if system, the grouping occurs more often with slower the database size is smaller than the amount of mem­ disks than with Elster disks. ory. Svstcm 11 has bcilirics that address issues of Consider, t()r example, a system pedcmning l ,000 fa st access, checkpoinr, �md recovcrv ofVLM systems; transactions per second. Let us assume the log disk is these L1 cilities arc the Logic::d Memory Manager, the rated at 7,200 rpm. Each rotation of the disk takes VLM query optimizer, the Housekeeper, and fu zzv � milliseconds. Within this duration, we expect (on checkpoint. The SQL Server product achic1'Cd

86 Di�ircd Technical Journal Vo l. 8 No. 4 1996 SMP TPC performance of 14,176 tpmC at Biographies $198/tpmC on a DIGITAL VLM system. The tech­ nology developed in System ll lays the groundwork fo r fu rther implementation of MMDB techniques in the SQL Server.

Acknowledgments

We gratefully acknowledge the various members of the SQL Server development team who contributed to the VLM capabilities described in this paper.

T. K. Rengarajan
T. K. Rengarajan has been building high-performance database systems for the past 10 years. He now leads the Server Performance Engineering and Development (SPeeD) Group in SQL Server Engineering at Sybase, Inc. His most recent focus has been System 11 scalability and self-tuning algorithms. Prior to joining Sybase, he contributed to the DEC Rdb system at DIGITAL in the areas of buffer management, high availability, OLTP performance on Alpha systems, and multimedia databases. He holds M.S. degrees in computer-aided design and computer science from the University of Kentucky and the University of Wisconsin, respectively.

Maxwell Berenson
Max Berenson is a staff software engineer in the Server Performance Engineering and Development Group in SQL Server Engineering at Sybase, Inc. During his four years at Sybase, Max has developed the Logical Memory Manager for System 11 and has made many buffer manager modifications to improve SMP scalability. Prior to joining Sybase, Max worked at DIGITAL, where he developed a relational database engine.

Ganesan Gopal

Bruce McCready
Bruce McCready is an SQL Server performance engineer in the Server Performance Engineering and Development Group at Sybase, Inc. Bruce received a B.S. in computer science from the University of California at Berkeley in 1989.

Sapan Panigrahi
A senior performance engineer, Sapan Panigrahi works in the Server Performance Engineering and Development Group at Sybase, Inc. He was responsible for TPC benchmarks and performance analysis for the Sybase SQL Server.

Srikant Subramaniam
A member of the Server Performance Engineering and Development Group at Sybase, Inc., Srikant Subramaniam was involved in the design and implementation of the VLM support in the Sybase SQL Server. He was a member of the team that implemented the Logical Memory Manager. His specialty area was the performance of shared-memory multiprocessor systems.

Marc B. Sugiyama
Marc Sugiyama is a staff software engineer in the SQL Server Performance Engineering and Development Group at Sybase, Inc.

The performance of an application can be expressed as the product of three variables: (1) the number of instructions executed, (2) the average number of machine cycles required to execute a single instruction, and (3) the cycle time of the machine. The recent decision to add byte and word manipulation instructions to the DIGITAL Alpha Architecture has an effect upon the first of these variables. The performance of a commercial database running on the Windows NT operating system has been analyzed to determine the effect of the addition of the new byte and word instructions. Static and dynamic analyses of the new instructions' effect on instruction counts, function calls, and instruction distribution have been conducted. Test measurements indicate an increase in performance of 5 percent and a decrease of 4 to 7 percent in instructions executed. The use of prototype Alpha 21164 microprocessor-based hardware and instruction tracing tools showed that these two measurements are due to the use of the Alpha Architecture's new instructions within the application.

The Alpha Architecture and its initial implementations were limited in their ability to manipulate data values at the byte and word granularity. Instead of allowing single instructions to manipulate byte and word values, the original Alpha Architecture required as many as sixteen instructions. Recently, DIGITAL extended the Alpha Architecture to manipulate byte and word data values with a single instruction. The second generation of the Alpha 21164 microprocessor, operating at 400 megahertz (MHz) or greater, is the first implementation to include the new instructions.

This paper presents the results of an analysis of the effects that the new instructions in the Alpha Architecture have on the performance, code size, and dynamic instruction distribution of a consistent execution path through a commercial database. To exercise the database, we modified the Transaction Processing Performance Council's (TPC) obsolete TPC-B benchmark. Although it is no longer a valid TPC benchmark, the TPC-B benchmark, along with other TPC benchmarks, has been widely used to study database performance.1-5

We began our project by rebuilding Microsoft Corporation's SQL Server product to use the new Alpha instructions. We proceeded to conduct a static code analysis of the resulting images and dynamic link libraries (DLLs). The focus of the study was to investigate the impact that the new instructions had upon a large application and not their impact upon the operating system. To this end, we did not rebuild the Windows NT operating system to use the new byte and word instructions. We measured the dynamic effects by gathering instruction and function traces with several profiling and image analysis tools.

The results indicate that the Microsoft SQL Server product benefits from the addition of byte and word instructions to the Alpha microprocessor. Our measurements of the images and DLLs show a decrease in code size, ranging from negligible to almost 9 percent. For the cached TPC-B transactions, the number of instructions executed per transaction decreased from 111,288 to 106,521 (a 4 percent reduction). For the scaled TPC-B transactions, the number of instructions executed per transaction decreased from 115,895 to 107,854 (a 7 percent reduction).

The rest of this paper is divided as follows: we begin with a brief overview of the Alpha Architecture and its introduction of the new byte and word manipulation instructions. Next, we describe the hardware, software, and tools used in our experiments. Lastly, we provide an analysis of the instruction distribution and count.
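The three-variable product stated at the outset can be checked against the cached-transaction counts quoted above. Holding the other two variables fixed, an assumption of ours made purely for illustration, the instruction-count reduction bounds the run-time reduction:

    106,521 / 111,288 = 0.957

that is, roughly a 4.3 percent shorter execution time per transaction. In practice, cycles per instruction and cycle time do not stay fixed as the instruction mix changes, which is consistent with the smaller 3.5 percent gain measured in the Results section.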

Alpha Architecture

The Alpha Architecture is a 64-bit, load and store, reduced instruction set computer (RISC) architecture that was designed with high performance and longevity in mind. Its major areas of concentration are the processor clock speed, the multiple instruction issue, and multiple processor implementations. For a detailed account of the Alpha Architecture, its major design choices, and overall benefits, see the paper by R. Sites.6 The original architecture did not define the capability to manipulate byte- and word-level data with a single instruction. As a result, the first three implementations of the Alpha Architecture, the 21064, the 21064A, and the 21164 microprocessors, were forced to use as many as sixteen additional instructions to accomplish this task. The Alpha Architecture was recently extended to include six new instructions for manipulating data at byte and word boundaries. The second implementation of the 21164 family of microprocessors includes these extensions.

The first implementation of the Alpha Architecture, the 21064 microprocessor, was introduced in November 1992. It was fabricated in a 0.75-micrometer (µm) complementary metal-oxide semiconductor (CMOS) process and operated at speeds up to 200 MHz. It had both an 8-kilobyte (KB), direct-mapped, write-through, 32-byte line instruction cache (I-cache) and data cache (D-cache). The 21064 microprocessor was able to issue two instructions per clock cycle to a 7-stage integer pipeline or a 10-stage floating-point pipeline.7 The second implementation of the 21064 generation was the Alpha 21064A microprocessor, introduced in October 1993. It was manufactured in a 0.5-µm CMOS process and operated at speeds of 233 MHz to 275 MHz. This implementation increased the size of the I-cache and D-cache to 16 KB. Various other differences exist between the two implementations and are outlined in the product data sheet.8

The Alpha 21164 microprocessor was the second-generation implementation of the Alpha Architecture and was introduced in October 1994. It was manufactured in a 0.5-µm CMOS technology and has the ability to issue four instructions per clock cycle. It contains a 64-entry data translation buffer (DTB) and a 48-entry instruction translation buffer (ITB), compared to the 21064A microprocessor's 32-entry DTB and 12-entry ITB. The chip contains three on-chip caches. The level one (L1) caches include an 8-KB, direct-mapped I-cache and an 8-KB, dual-ported, direct-mapped, write-through D-cache. A third on-chip cache is a 96-KB, three-way set-associative, write-back mixed instruction and data cache. The floating-point pipeline was reduced to nine stages, and the CPU has two integer units and two floating-point execution units.9

The Exclusion of Byte and Word Instructions

The original Alpha Architecture intended that operations involved in loading or storing aligned bytes and words would involve sequences as given in Tables 1 and 2.10 As many as 16 additional instructions are required to accomplish these operations on unaligned data. These same operations in the MIPS Architecture involve only a single instruction: LB, LW, SB, and SW.11 The MIPS Architecture also includes single instructions to do the same for unaligned data. Given a situation in which all other factors are consistent, this would appear to give the MIPS Architecture an advantage in its ability to reduce the number of instructions executed per workload.

Table 1
Loading Aligned Bytes and Words on Alpha

Load and Zero Extend a Byte
    LDL   R1, D.lw(Rx)
    EXTBL R1, #D.mod, R1

Load and Sign Extend a Byte
    LDL R1, D.lw(Rx)
    SLL R1, #56-8*D.mod, R1
    SRA R1, #56, R1

Load and Zero Extend a Word
    LDL   R1, D.lw(Rx)
    EXTWL R1, #D.mod, R1

Load and Sign Extend a Word
    LDL R1, D.lw(Rx)
    SLL R1, #48-8*D.mod, R1
    SRA R1, #48, R1
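The sequences in Table 1 fetch the aligned longword that encloses the datum and then extract it in registers. The C sketch below of the two byte cases is our illustration of that idea, not code from the study; it assumes little-endian byte numbering, as on Alpha, and uses a 32-bit container where the table's EXTBL, SLL, and SRA forms operate on 64-bit registers.

    #include <stdint.h>

    /* Zero-extend a byte without a byte load: the LDL/EXTBL idea. */
    static uint32_t load_byte_zero_extend(const uint8_t *p)
    {
        uintptr_t addr = (uintptr_t)p;
        const uint32_t *aligned = (const uint32_t *)(addr & ~(uintptr_t)3);
        unsigned mod = (unsigned)(addr & 3);    /* byte offset, the D.mod of Table 1 */

        return (*aligned >> (8 * mod)) & 0xff;  /* shift and mask, as EXTBL does */
    }

    /* Sign-extend a byte without a byte load: the LDL/SLL/SRA idea. */
    static int32_t load_byte_sign_extend(const uint8_t *p)
    {
        uintptr_t addr = (uintptr_t)p;
        const uint32_t *aligned = (const uint32_t *)(addr & ~(uintptr_t)3);
        unsigned mod = (unsigned)(addr & 3);

        /* Move the byte to the top of the register, then arithmetic-
           shift back down so its sign bit fills the upper bits. */
        return (int32_t)(*aligned << (8 * (3 - mod))) >> 24;
    }

The word cases are identical in structure, with a 16-bit field and the #48 shift counts shown in the table.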

Table 2
Storing Aligned Bytes and Words on Alpha

Store a Byte
    LDL   R1, D.lw(Rx)
    INSBL R5, #D.mod, R3
    MSKBL R1, #D.mod, R1
    BIS   R3, R1, R1
    STL   R1, D.lw(Rx)

Store a Word
    LDL   R1, D.lw(Rx)
    INSWL R5, #D.mod, R3
    MSKWL R1, #D.mod, R1
    BIS   R3, R1, R1
    STL   R1, D.lw(Rx)

Sites has presented several key Alpha Architecture design decisions.6 Among them is the decision not to include byte load and store instructions. Key design assumptions related to the exclusion of these features include the following:

• The majority of operations would involve naturally aligned data elements.
• In the best possible scheme for multiple instruction issue, single byte and word write instructions to memory are not allowed.
• The addition of byte and word write instructions would require an additional byte shifter in the load and store path.

These factors indicated that the exclusion of specific instructions to manipulate bytes and words would be advantageous to the performance of the Alpha Architecture.
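Because there is no byte store, the Table 2 sequences are read-modify-write operations on the enclosing longword, which is exactly why the issue-scheme and byte-shifter assumptions above matter. The C sketch below is our illustration of the sequence, not code from the study; little-endian byte numbering is assumed.

    #include <stdint.h>

    /* Store a byte via the LDL/INSBL/MSKBL/BIS/STL pattern of Table 2. */
    static void store_byte(uint8_t *p, uint8_t value)
    {
        uintptr_t addr = (uintptr_t)p;
        uint32_t *aligned = (uint32_t *)(addr & ~(uintptr_t)3);
        unsigned shift = 8 * (unsigned)(addr & 3);

        uint32_t word = *aligned;               /* LDL: read the longword   */
        word &= ~((uint32_t)0xff << shift);     /* MSKBL: clear target byte */
        word |= (uint32_t)value << shift;       /* INSBL + BIS: merge byte  */
        *aligned = word;                        /* STL: write the longword  */
    }

Note that the entire longword is rewritten, so the neighboring bytes are read and stored again; a single-instruction stb touches only the addressed byte.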

The decision not to include byte and word manipulation instructions is not without precedents. The original MIPS Architecture developed at Stanford University did not have byte instructions.12 Hennessy et al. have discussed a series of hardware and software trade-offs for performance with respect to the MIPS processor.13 Among those trade-offs are reasons for not including the ability to do byte addressing operations. Hennessy et al. argue that the additional cost of including the mechanisms to do byte addressing was not justified. Their studies showed that word references occur more frequently in applications than do byte references. Hennessy et al. conclude that to make a word-addressed machine feasible, special instructions are required for inserting and extracting bytes. These instructions are available in both the MIPS and the Alpha Architectures.

Reversing the Byte and Word Instructions Decision

During the development of the Alpha Architecture, DIGITAL supported two operating systems, OpenVMS and ULTRIX. The developers had as a goal the ability to maintain both customer bases and to facilitate their transitions to the new Alpha microprocessor-based machines. In 1991, Microsoft and DIGITAL began work on porting Microsoft's new operating system, Windows NT, to the Alpha platform. The Windows NT operating system had strong links to the Intel and the MIPS Architectures, both of which included instructions for single byte and word manipulation.14 This strong connection influenced the Microsoft developers and independent software vendors (ISVs) to favor those architectures over the Alpha design.

Another factor contributed to this issue: the majority of code being run on the new operating system came from the Microsoft Windows and MS-DOS environments. In designing software applications for these two environments, the manipulation of data at the byte and word boundary is prevalent. With the Alpha microprocessor's inability to accomplish this manipulation in a single instruction, it suffered an average of 3:1 and 4:1 instructions per workload on load and store operations, respectively, compared to those architectures with single instructions for byte and word manipulation.

To assist in running the ISV applications under the Windows NT operating system, a new technology was needed that would allow 16-bit applications to run as if they were on the older operating system. Microsoft developed the Virtual DOS Machine (VDM) environment for the Intel Architecture and the Windows-on-Windows (WOW) environment to allow 16-bit Windows applications to work. For non-Intel architectures, Insignia developed a VDM environment that emulated an Intel 80286 microprocessor-based computer. Upon examining this emulator more closely, DIGITAL found opportunities for improving performance if the Alpha Architecture had single byte and word instructions.

Based upon this information and other factors, a corporate task force was commissioned in March 1994 to investigate improving the general performance of Windows NT running on Alpha machines. The further DIGITAL studied the issues, the more convincing the argument became to extend the Alpha Architecture to include single byte and word instructions.

This reversal in position on byte and word instructions was also seen in the evolution of the MIPS Architecture. In the original MIPS Architecture developed at Stanford University, there were no load or store byte instructions.12 However, for the first commercially produced chip of the MIPS Architecture, the MIPS R2000 RISC processor, developers added instructions for the loading and storing of bytes.11 One reason for this choice stemmed from the challenges posed by the UNIX operating system. Many implicit byte assumptions inside the UNIX kernel caused performance problems. Since the operating system being implemented was UNIX, it made sense to add the byte instructions to the MIPS Architecture.15

In June 1994, one of the coarchitects of the Alpha Architecture, Richard Sites, submitted an Engineering Change Order (ECO) for the extension of the architecture to include byte and word instructions. It was speculated at the time that an increase of as much as 4 percent in overall performance would be achieved using the new instructions. In June 1995, six new instructions were added to the Alpha Architecture. The new instructions are outlined in Table 3. The first implementation to include support for the new instructions was the second generation of the Alpha 21164 microprocessor series. This reimplementation of the first Alpha 21164 design was manufactured in a 0.35-µm CMOS process and was introduced in October 1995.

Table 3
New Byte and Word Manipulation Instructions

Mnemonic   Opcode    Function
stb        0E        Store byte from register to memory
stw        0D        Store word from register to memory
ldbu       0A        Load zero-extended byte from memory to register
ldwu       0C        Load zero-extended word from memory to register
sextb      1C.0000   Sign extend byte
sextw      1C.0001   Sign extend word
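Table 3's ldbu, ldwu, stb, and stw replace the Table 1 and Table 2 sequences directly; sextb and sextw operate on a value already in a register, replicating bit 7 or bit 15 through the upper bits. The C casts below give their semantics (our illustration, not code from the study):

    #include <stdint.h>

    /* sextb: replicate bit 7 of the low byte through bits 63..8. */
    static int64_t sign_extend_byte(int64_t x) { return (int8_t)x; }

    /* sextw: replicate bit 15 of the low word through bits 63..16. */
    static int64_t sign_extend_word(int64_t x) { return (int16_t)x; }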

Testing Environment

We set up tests to measure the performance of equipment with and without the new instructions. To conduct our experiments, we used prototype hardware that included the second-generation Alpha 21164 microprocessor, and we devised a method to enable and disable the new instructions in hardware. At the same time, we investigated the projected performance of the software emulation mechanism to execute the new instructions on older processors. Finally, we built two separate versions of the Microsoft SQL Server application, one that used the new instructions and one that did not. For the purposes of discussing the different scenarios under study, we summarize the three execution schemes in Table 4. We use the associated nomenclature given there in the rest of this paper. In the remainder of this section, we describe each of the hardware, software, compiler, and analysis tools.

Table 4
Three Methods for Execution of the New Instructions

Nomenclature   Description
Original       Compiled with instructions that can execute on all Alpha implementations
Byte/Word      Compiled using the new instructions that will execute on second-generation 21164 implementations at full speed
Emulation      Compiled with new instructions and emulated through software

Prototype Hardware

As previously mentioned, our machine was capable of operating with and without the new instructions. By using the same machine, we were able to minimize effects that could be introduced from variations in machine designs or processor families that could cause an increase in the executed code path through the operating system. All experiments were run on a prototype of the AlphaStation 500 workstation that was based upon the second-generation 21164 microprocessor operating at 400 MHz. (The AlphaStation 500 is a family of high-performance, mid-range graphics workstations.) The prototype was configured with 128 megabytes (MB) of memory and a single, 4-gigabyte (GB) fast-wide-differential (FWD) small computer systems interface (SCSI-2) disk.

New firmware allowed us to alternate between direct hardware execution and software emulation of the new byte and word instructions. We modified the Advanced RISC Consortium (ARC) code to allow us to switch between the two firmware versions through a simple power-cycle utility called the fail-safe loader.16 When the machine is powered on, it loads code from a serial read-only memory (SROM) storage device. This code then loads the ARC firmware from nonvolatile flash ROM. The fail-safe loader allowed the ARC firmware to be loaded into physical memory and not into the flash ROM. The new firmware was initialized by a reset of the processor and was executed as if it were loaded from the flash ROM. When the machine was turned off and then back on, the version of firmware that was stored in nonvolatile memory was loaded and executed.

Operating System

We used a beta copy of the Microsoft Windows NT version 4.0 operating system. We chose this operating system for its capability to allow us to examine the impact of emulating the new byte and word instructions in the operating system.

By default, version 4.0 of the Windows NT operating system disables the trap and emulation capability for the new instructions. This approach is similar to the one Windows NT provides for the Alpha microprocessor to handle unaligned data references. For testing purposes, we enabled and disabled the trap and emulation capability of the new instructions. When this option is enabled, the operating system treats each new instruction listed in Table 3 as an illegal instruction and emulates the instruction. The trap and emulate strategy takes approximately 5 to 7 microseconds per emulated instruction. When it is disabled or not present, the action taken depends upon the hardware support for the new instructions. If disabled in hardware, the instruction is treated as an illegal instruction; if enabled, it is executed like any other instruction.
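The cost of the trap-and-emulate path can be put in perspective with a rough calculation of our own, assuming for illustration that the 400-MHz prototype described above retires roughly one instruction per 2.5-nanosecond cycle:

    5 microseconds / 2.5 nanoseconds = 2,000 cycles
    7 microseconds / 2.5 nanoseconds = 2,800 cycles

Each trapped byte or word instruction therefore displaces on the order of two to three thousand ordinary instructions, which foreshadows the severe penalty measured for the Emulation scheme in the Results section.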

Microsoft SQL Server

To observe the effects of the new instructions, we chose the Microsoft SQL Server, a relational database management system (RDBMS) for the Windows NT operating system. Microsoft SQL Server was engineered to be a scalable, multiplatform, multithreaded RDBMS, supporting symmetric multiprocessing (SMP) systems. It was designed specifically for distributed client-server computing and data warehousing.

Sites and Perl used an early version of the Microsoft SQL Server version 6.0, in which the fastest network transport available at that time was Named Pipes. In the final release of SQL Server version 6.0 and subsequent versions of the product, the Transmission Control Protocol/Internet Protocol (TCP/IP) replaced Named Pipes in this category. Based upon this, we rebuilt the libraries associated with TCP/IP instead of those associated with Named Pipes. Other networking libraries, such as those for DECnet and Internetwork Packet Exchange/Sequenced Packet Exchange (IPX/SPX), were not rebuilt.


Figure 1
Images/DLLs Involved in a TPC-B Transaction for Microsoft SQL Server Based on Sites and Perl's Analysis
VoL 8 1996 93 Digital Tcdmical Journal No. 4 Compiling Microsoft SQL Server to The application benchmark can be run in t\vo dif Use the New Instructions krent modes: cached 3nd sc: ded. The cached, or in­ memory mode, is used to estimate rhe system 's Our goal was to measure only the dkcrs inrroduced maximum perf(mnance in this benchmark environ­ by using the new instructions ;md not dlccts inrro­ ment. This is :1ccomplished by building 3 small database duced by different versions or generations of compil­ that resides co mpletely in the database cache, which in ers. Therefore, we needed to find a W3)' to use the same turn fits within the system's physical ra ndom-access version of a com piler that difkrcd only in irs usc or memory . Since the entire database resides in ( RAt'vl) nonuse of the new instructions. To do this, we used memory, ljO activity is eliminated with the excep­ <111 a compiler option available on the Microsoft Visual tion of log writes. Consequently, the benchmark on ly C++ compiler. This switch, available on all !U SC: pl:tt­ pcrf(mllS one disk l/0 f( >r each transaction, once the fo rms that support Visual C++, allows the gener3tion cmire database is read off the disk and into the database

of optimized code for a specific processor within �l cKhe. The result is ;1 representation of the maximum processor familywhile maintaining bi n3ry compatibi l­ number of tps th;lt system is c1pable of sustaining. rhc itY with all processors in the processor fa mily. Processor The sc::dcd mode is run using a bigger database with optimizations are accompl ished by 3 combination of a brgcr :�mount of disk 1/0 activity. The increase in specific code-pattern selection and code scheduling. disk 1/0 is result of the need re:�d and write data to a to The default action of the compiler is to usc 3 blended locations that 3re nor within the database cache. These model, resulting in code tbat executes equally well additional reads and writes add extra disk 1/0s. The across all processors within a platf(mn t:m1ily. resu lt is normally charJCterizcd as having to do one Using this compiler option, we built two versions read and one write to the database :m d a single write to of the aforementioned images within the SQL the tr�nsaction log for each transaction. The combina­ Server applicatjon, varying only their usc of the code­ tion of a larger database and additional I/0 activity generation switch. The first version, rdcrred to as the dccrc�1scs the tps nluc from the c3ched \·crsion. Based Original build, was built without specii),ing an argu ­ upon our previous experience running this benchmark, ment for the code-generation swi tch. The second one, the scaled benchmark can be expected to reach approx­ referred to as Byte/Word, set the switch to generate imately 80 percent of the cached pert(mllance. code patterns using the new byte and word manipula­ For the scaled tests, we built a database sized to tion instructions. AJIother required fi les came from the 3ccommodatc 50 tps. This was less than 80 percent SQL Server version 6.5 Beta II distribution of the maximum tps prod uced by the cached results. CD-ROM. 'We chose this size because we were concentrating

Th e Benchmark on isol3ting ;l single scaled transaction under 3 moder­ The benchmark we chose was derived fi·om the TPC:- ate lo;ld and not under the maximum scaled perfor­ B benchmark. As previously mentioned, the mance possi ble. 'ITC-H benchmark is now obsolete; however, it is sti ll useful

fo r stressing a database and its interaction with �l co m­ Image Tra cing and Analysis To ols purer system. The TPC. B benchmark is relatively Collecting only static me3SLII-cments of the executables easy to set up and scales readily. It has been used by and DLL� afkctcd was insufficient to determine the both database vendors and computer m:1nufacrurers applicability of the new instructions. \,Ye collected the to measure the performance of either the computer actual instruction traces of SQL Server while it exe­ system or the actual database. We did not include all cuted the application benchmark. Furthermore, we the required metrics of the TPC-B benchmark; there­ decided that the ;l bility to trace the actual instructions fore, it is not in fu ll compliance with published guide­ being executed was more desirable than developing or lines of the TPC. We refer to it hcncdorth simply :�s extending a simulator. To obtain the traces, we needed the application benchmark. tool that wou ld allow us ;1 to The application benchmark is characterized by sig­ • Collect both system- and user-mode code. nificant disk ljO activity, moderate system and applica­ tion execution time, and transaction integrity. The • Collect fu nction traces, which would allow us to application benchmark exercises and measures the effi­ align the starting and stopping points of different ciency of the processor, I/0 architecture , and RD BMS. benchmark runs.

The resu lts measure performance by indicating how • \.York without modifYing either the 3pplicarion or many sim.ul ated banking transactions can be com­ the operating system. pleted per second . This is defined as trans3ctions per In the p3St, the only tool tb3t wo uld provide second (tps) and is the total number of committed instruction traces under the vV indows NT operating transactions that were started and completed during system was the debugger running in single-step mode. the measurement interval.

Image Tracing and Analysis Tools

Collecting only static measurements of the executables and DLLs affected was insufficient to determine the applicability of the new instructions. We collected the actual instruction traces of SQL Server while it executed the application benchmark. Furthermore, we decided that the ability to trace the actual instructions being executed was more desirable than developing or extending a simulator. To obtain the traces, we needed a tool that would allow us to

• Collect both system- and user-mode code.
• Collect function traces, which would allow us to align the starting and stopping points of different benchmark runs.
• Work without modifying either the application or the operating system.

In the past, the only tool that would provide instruction traces under the Windows NT operating system was the debugger running in single-step mode. Obtaining traces through either the ntsd or the windbg debugger is quite limited due to the following problems:

• The tracing rate is only about 500 instructions per second. This is far too slow to trace anything other than isolated pieces of code.
• The trace fails across system calls.
• The trace loops infinitely in critical section code.
• Register contents are not easily displayed for each instruction.
• Real-time analysis of instruction usage and cache misses is not possible.

Instruction traces can also be obtained using the PatchWrks trace analysis tool. Although this tool operates with near real-time performance and can trace instructions executing in kernel mode, it has the following limitations:

• It operates only on a DIGITAL Alpha AXP personal computer.
• It requires an extra 40 MB of memory.
• All images to be traced must be patched, thus slightly distorting text addresses and function sizes.
• Successive runs of application code are not repeatable due to unpredictable kernel interrupt behavior (the traces are too accurate).

The solution was Ntstep, a tool that can trace user-mode instruction execution of any image in the Windows NT/Alpha environment through an innovative combination of breakpointing and "Alpha-on-Alpha" emulation. It has the ability to trace a program's execution at rates approaching a million instructions per second. Ntstep can trace individual instructions, loads, stores, function calls, I-cache and D-cache misses, unaligned data accesses, and anything else that can be observed when given access to each instruction as it is being executed. It produces summary reports of the instruction distribution, cache line usage, page usage (working set), and cache simulation statistics for a variety of Alpha systems.

Ntstep acts like a debugger that can execute single-step instructions except that it executes instructions using emulation instead of single-step breakpoints whenever possible. In practice, emulation accounts for the majority of instructions executed within Ntstep. Since a single-step execution of an instruction with breakpoints takes approximately 2 milliseconds and emulation of an Alpha instruction requires only 1 or 2 microseconds, Ntstep can trace approximately 1,000 times faster than a debugger. Unlike most emulators, the application executes normally in its own address space and environment.

Results

We collected data on three different experiments. In the first investigation, we looked at the relative performance of the three different versions of the Microsoft SQL Server outlined in Table 4. We compared the three variations using the cached version of the application benchmark.

In the second experiment, we observed how the new instructions affect the instruction distribution in the static images and DLLs that we rebuilt. We compared the Byte/Word versions to the Original versions of the images and DLLs. We also attempted to link the differences in instruction counts to the use of the new instructions.

Lastly, we investigated the variation between the Original and the Byte/Word versions with respect to the instruction distribution on the scaled version of the benchmark. This comparison was based upon the code path executed by a single transaction.

Cached Performance

In the first experiments, we compared the relative performance impact of using the new instructions. We chose to measure performance of only the cached version of the application benchmark because the I/O subsystem available on the prototype of the AlphaStation 500 was not adequate for a full-scaled measurement. We ensured that the database was fully cached by using a ramp-up period of 60 seconds and a ramp-down period of 30 seconds. This was verified as steady state by observing that the SQL Server buffer cache hit ratio remained at or above 95 percent. The measurement period for the benchmark was 60 seconds. We ran the benchmark several times and took the average tps for each of the three variations outlined in Table 4.

The results of the three schemes are as follows: 444 tps for the Original version, 460 tps for the Byte/Word version, and 116 tps for the Emulation version. The new instructions contributed a 3.5 percent gain in performance. The impact of emulating the instructions is a loss of 73.9 percent of the potential performance.

Static Instruction Counts

To analyze the mixture of instructions in the images and DLLs, we disassembled each image and DLL in the Original and Byte/Word versions. We then looked at only those instructions that exhibited a difference between the two versions within the images or DLLs. The variations in instruction counts of these are shown in Table 6.

To examine the images more closely, we disassembled each image and DLL and collected counts of code size, the number of functions, the number and type of new byte and word instructions, and lastly, nop and trapb instructions. The results are presented in Tables 7 through 10.

Table 6
Instruction Deltas (Normal Minus Byte/Word) for the SQL Server Images and DLLs

Instruction   dbmssocn.dll   ntwdblib.dll   opends60.dll   sqlservr.exe   ssmsso60.dll

lda            0     -3    -247    -8524     -4
ldah           0      0     -27    18-18      0
ldl           -9    -11    -597   -13133    -46
ldq            0      0     -29    -2980      0
ldq_l          0      0       0       -9      0
ldq_u        -10     -2    -311    -8529    -18
stl           -5    -11    -278    -7932    -11
stb           +3     +1    +216    +3969     +7
stw           +2     +5     +59    +2798     +3
stq            0      0      -4      -53      0
stq_c          0      0       0       -9      0
beq            0      5      +1    -1236      0
bge            0      0       0       +8      0
bgt            0      0       0       +3      0
blbc           0      0      -1      -19      0
blbs           0      0       0       -4      0
blt            0      0       0        0      0
bne            0      0      +1      +24      0
br             0     -4      +1    -1120      0
bsr            0      0       0       -6      0
ret            0      0      +4      +15      0
cmpeq          0      0       0       +9      0
cmplt          0      0       0      +15      0
cmple          0      0       0       +5      0
cmpult         0      0      -1      1-1      0
cmpule         0     -5      -2    -1183      0
and           -2     -6    -364    -6435     -8
bic           -3    -11    -287    -7242     -8
bis           -4     -7    -208    -7097     -9
ornot          0      0       0       +4      0
xor            0      0      -2      119      0
sll            0      0     .f2    -2359      0
sra            0      0   -3534      -15     -4
srl            0      0       0     -295      0
cmpbge         0      0      -1      -18      0
mskbl         -3     -1    -196    -3647     -8
mskwl          0     -5     -41    -1604      0
zapnot        -5      0    -115    -2135    -33
addl           0      0       0       -8      0
addq           0      0       0       +3      0
s4addl         0      0       0       -4      0
cmovge         0      0       0       +1      0
cmovne         0      0       0       +2      0
cmovlt         0      0       0        0     -1
cmovlbc        0      0       0       -2      0
callsys        0      0       0        0      0
extqh          0      0     -14     -426     -4
ldwu          +4      0    +193    +6320    +35
ldbu          +9     +3    +464   +10231    +18
mull           0      0       0       +1      0
subl           0      0      +1       +6      0
subq           0      0       0       +3      0
insll          0      0       0      1-1      0
inswl         -2     -3     -54    -2647     -3
call_pal      +2     +1      +1     +161      0
extlh          0      0       0      -14      0
insbl         -2   -135   -3163       -6     -1
extll          0      0       0      -20      0
extbl        -10     -6    -367   -10656    -14
extwl         -1      0     -84      324     -1

We expected that the instructions used to manipulate bytes and words in the original Alpha Architecture (Tables 1 and 2) would decrease proportionally to the usage of the new instructions. These assumptions held true for all the images and DLLs that used the new instructions. For example, in the original Alpha Architecture, the instructions MSKBL and MSKWL are used to store a byte and word, respectively. In the sqlservr.exe image, these two instructions showed a decrease of 3,647 and 1,604 instructions, respectively. Compare this with the corresponding addition of 3,969 STB and 2,798 STW instructions in the same image. Looking further into the sqlservr.exe image, we also saw that 10,231 LDBU instructions were used and the usage of the EXTBL instruction was reduced by 10,656. Although these numbers do not correlate on a one-for-one basis, we believe this is due to other usage of these instructions. Other usage might include the compiler scheme for introducing the new instructions in places where it used an LDL or an LDQ in the Original image.

Of the rebuilt images and DLLs, sqlservr.exe and opends60.dll showed the most variations, with the new instructions making up 3.73 percent and 3.9 percent of these files. The most frequently occurring new instruction was ldbu, followed by ldwu. The least-used instructions were sextb and sextw. The size of the images was reduced in three out of five images. The image size reduction ranged from negligible to just over 4 percent. In all cases, the size of the code section was reduced and ranged from insignificant to approximately 8.5 percent. There was no change in the number of functions in any of the files.

Dynamic Instruction Counts

We gathered data from the application benchmark running in both cached and scaled modes. We ran at least one iteration of the benchmark test prior to gathering trace data to allow both the Windows NT operating system and the Microsoft SQL Server database to reach a steady state of operation on the system under test (SUT). Steady state was achieved when the SQL Server cache-hit ratio reached 95 percent or greater, the number of transactions per second was constant, and the CPU utilization was as close to 100 percent as possible. The traces were gathered over a sufficient period of time to ensure that we captured several transactions.

Table 7
Byte/Word Images and DLLs

Image/DLL   Total File Bytes   Total Text Bytes   Total Code Bytes   Number of Functions   Total Byte/Word   %Byte/Word   LDBU Count   LDBU %   LDWU Count   LDWU %   STB Count   STB %   STW Count   STW %   SEXTB Count   SEXTB %   SEXTW Count   SEXTW %   Total NOPs   Total TRAPB

sqlservr.exe 8053624 2981148 2884776 3364 26869 373 10231 38.077 6320 23.5215 3969 14.7717 2798 10.4135 139 0.517325 3412 12.6986 5929 2219 dbmssocn.dll 13824 5884 5520 13 18 1.3 9 50 22.2222 16.6667 11.1111 0 0 0 0 21 0 ntwdblib.dll 318464 2463 16 231688 429 0.02 33.333 1 11.1111 55.5556 767 10 opends60.dll 212992 104204 97240 243 948 3.9 464 48.945 193 20.3586 216 22.7848 59 6.22363 0.949367 0.738397 391 128 ssmsso60.dll 70760 9884 9128 19 67 2.94 18 26.866 35 52.2388 7 10.4478 4.47761 5.97015 0 25

Table 8
Original Build of Images and DLLs

Image/DLL   Total File Bytes   Total Text Bytes   Total Code Bytes   Number of Functions   Total Byte/Word   %Byte/Word   LDBU Count   LDBU %   LDWU Count   LDWU %   STB Count   STB %   STW Count   STW %   SEXTB Count   SEXTB %   SEXTW Count   SEXTW %   Total NOPs   Total TRAPB

s ls r.exe 8337248 3264108 3364 0 0 0 0 0 0 0 6207 2252 q erv 3153480 0 0 0 dbmssocn.dll 13824 6012 5656 13 0 0 0 0 0 0 0 0 0 0 16 0 ntwdblib.dll 318464 246620 231904 429 0 0 0 0 0 0 770 10 opends60.dll 222720 1140 12 105536 243 0 0 405 128 ssmsso60.dll 71284 10300 9424 19 18 0

Table 9
Numerical Differences of Original Minus Byte/Word Images and DLLs

Image/DLL   Total File Bytes   Total Text Bytes   Total Code Bytes   Number of Functions   Total Byte/Word   %Byte/Word   LDBU Count   LDBU %   LDWU Count   LDWU %   STB Count   STB %   STW Count   STW %   SEXTB Count   SEXTB %   SEXTW Count   SEXTW %   Total NOPs   Total TRAPB

lsqlservr.exe -283624 -282960 -268704 0 +26869 + 4 +10231 +6320 +24 +3969 +15 +2798 + 10 +139 +3412 +13 -278 -33 -dB dbmssocn.dll 0 -128 -136 0 -18 +9 +50 1-4 +22 + 3 +17 -2 t- 11 0 0 0 +5 +I ntwdblib.dll 0 -304 -216 0 +9 0 +3 33 0 0 +11 + 56 0 0 0 -3 + +1 �5 opends60.dll -9728 -9808 -8296 0 +948 +4 + 193 +20 "216 +23 ·t· 6 +1 -14 +464 +49 ; 59 +9 +7 ssmsso60.dll -524 -416 -296 0 +67 +18 +27 +52 +3 +4 0 0 +7 +3 -d5 �7 +10 -+4

Table 10
Percentage Variation of Original Minus Byte/Word Images and DLLs

Image/DLL   Total File Bytes   Total Text Bytes   Total Code Bytes   Number of Functions   Total Byte/Word   %Byte/Word   LDBU Count   LDBU %   LDWU Count   LDWU %   STB Count   STB %   STW Count   STW %   SEXTB Count   SEXTB %   SEXTW Count   SEXTW %   Total NOPs   Total TRAPB

sqlservr.exe -3.402% -8.669% - 8.52 1% 0.000% -4.479% - 1.465% N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A dbmssocn.dll 0.000% -2.129% - 2.405% 0.000% + 31.250% oc N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A ntwdblib.dll 0.000% -0.123% -0.093% 0.000% -0.390% 0.000% N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A opends60.dll -4.368% -8.603% -·7.861% 0.000% -3.457% 0.000% N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A ssmsso60.dll -0.735% -4.039% -3.141% 0.000% + 38.889% N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A period of time to ensure that we captured several trace, and Figure 3 shows an example output f()r a transactions. The traces were then edited into separate fu nction trace from Ntstcp. Since Ntstcp can attach to individual transactions. The geometric mean was a running process, we ;: dlowcd the application bench· taken fi·om the resulting traces and used tor all subsc· mark ro achieve steady state prior to data collection. quent analysis. This approach ensured that we did not sec the cfkcts of

We used Ntstcp to gather complete instruction and warming up either the machine caches or the SQL function traces of both versions of the SQL Server data· Server database cache. Each instruction trace consisted base while it executed the applic:�tion benchmark. of approximately one million instructions, which was Figure 2 shows an example output t(Jr an instruction sufficientto cover multiple transactions. The data was

0 ** Breakpoint (pi d 0 X d 1, Tid Oxb2) SQLSERVR .EXE pc 77f39b34 0 ** Trace begins at 242698 opends60 ! FetchNextComma nd 1 00242698 : 23deffb0 Lda sp, -50(sp) I I sp now 72bff00 2 0024269c : b53e0000 stq sO, OCsp) I I @072bff00 = 148440 3 002426a0: b55e0008 stq s 1 , 8Csp) I I @072bff08 = 0 4 002426a4: b57e001 0 stq s2, 10Csp) I I @072bf f10 = 5 5 002426a8: b59e001 8 stq s3, 18Csp) I I @072bff18 = 1476a8 6 002426ac: b5be0020 stq s4, 20 Csp) II @072bff20 = 2c4 7 002426b0 : b5de0028 stq s5, 28 (sp) II @072bf f28 = 41 8 002426b4 : b5fe0030 stq fp, 30 Csp) I I @072bff30 = 0 9 002426b8 : b75e0038 stq ra, 38 Csp) I I @072bff38 = 242398 1 0 002426bc : 47f00409 b is zero, aD, sO II sO now 148440 11 002426c0: 47f1 040a b is zero, a 1 , s 1 I I s 1 now 72bffa0 1 2 002426c4: 47f2040b b is zero, a2, s2 I I s2 now 72bffa8 1 3 002426c8 : d3404e67 bsr r a, 00256068 II ra now 2426cc opends60 1netiOReadData 1 4 002560 68 : 23deffa0 Lda sp, -60Csp) I I sp now 72bfea0 1 5 0025606c : 43f1 0002 add l zero, a 1 , t 1 II t 1 now 72bffa0 1 6 00256070 : b53e0000 stq sO, OCsp) II @072bfea0 = 148440 1 7 00256074 : b55e0008 stq s 1 , 8Csp) II @072bfea8 = 72bffa0 1 8 00256078 : b57e0010 stq s2, 10Csp) II @072bfeb0 = 72bffa8 1 9 0025607c : b59e0018 stq s3, 18Csp) I I @072bfeb8 = 1476a8 20 00256080 : b5be0020 stq s4, 20Csp) I I @072bfec0 = 2c4 21 00256084 : b5de0028 stq s5, 28Csp) I I @072bfec8 = 41 22 00256088 : b5fe0030 stq fp, 30Csp) II @072bfed0 = 0 23 002560 8c : b75e0038 stq ra, 38Csp) I I @072bfed8 = 2426cc 24 00256090 : a1 d01 140 L d L s5, 1140Ca0) I I @001 49580 1479e8 25 002560 94 : 47f00409 b is zero, aD, sO I I sO now 148440 26 002560 98 : a1f001d0 L d L fp, 1d0Ca0) I I @001 48610 dbbaO 27 002560 9c : 47e0340d bis zero, # 1 , s4 I I s4 now 1 28 002560a0 : a0620000 L d L t2, 0 ( t 1 ) I I @072bffa0 155c58 29 002560a4: b23e004c s t L a 1 , 4c(sp) II @072bfeec = 72bffa0 30 002560a8 : b25e0050 s t L a2, 50Csp) II @072bfef0 = 72bffa8 31 002560ac: b27e0054 s t L a3, 54Csp) I I @072bfef4 = 1476a8 32 002560 b0 : e460001d beq t2, 002561 28 I I ( t 2 i s 155c58) 33 002560b4 : 220303e0 Lda aD, 3e0Ct2) I I aD now 156038 34 002560 b8 : 47f00404 b is zero, aD, t3 II t3 now 156038 35 002560 bc : 63ff4000 mb II 36 002560c0: 47e03400 b is zero, # 1 , vO I I vO now 1

37 002560c4: a8240000 L d L - L tO, 0Ct3) I I @001 56038 0 = 38 002560c8 : b8040000 s t L - c vO, 0Ct3) I I @001 56038 1 39 002560cc : e40000b6 beq vO, 002563a8 I I CvO is 1 ) 40 002560d0 : 63ff4000 mb I I 41 002560d4 : e4200001 beq tO, 002560dc I I CtO is 0) opends60 !netiOReadData+Ox74 : 42 002560 dc : a1be004c L d L s4, 4c(sp) I I @072bfeec 72bffa0 43 002560e0 : aOOdOOOO L d L vO, 0Cs4) I I @072bffa0 155c58 44 002560 e4: a04003dc L d L t 1 , 3dc (v0) I I @001 56034 0 45 002560e8 : 20800404 Lda t3, 404 Cv0) I I t3 now 15605c 46 002560ec: 405f05a2 cmpeq t 1 , zero, t 1 I I t1 now 1 47 002560f0: e4400003 beq t 1 , 002561 00 II ( t 1 is 1) 48 002560f4 : a0600404 L d L t2, 404 Cv0) II @001 5605c 15605c 49 002560f8: 406405a3 cmpeq t2, t3, t2 II t2 now 1 50 002560fc: 47e30402 bis zero, t2, t 1 I I t 1 now 1 51 002561 00 : 47e2040d bis zero, t 1 , s4 II s4 now 1

Figure 2
Example of Instruction Trace Output from Ntstep

98 Digital Technical joumal Vo l. 8 No. 4 1996 e44D0005 beq t 1 , DD2561 1c ( t 1 is 1 ) 52 002561 04 : I I aOaDDOOO L d L t4, DCvD) @001 55c58 204200 53 OD2561 08 : I I Ldah t5, 8D(zero) t5 now 800000 54 OD2 561 0c : 24dfDD80 II zapnot t4, #3, t4 t4 now 4200 55 002 561 10: 48a07625 II 4Da60005 add l t4, t5, t4 t4 now 804200 56 002561 14: II 002561 18: bDaDODDD s t L t4, DCvD ) @001 55c58 = 80420D 57 I I 002561 1c: aDfe004 c L d L t6, 4cCsp) @072bfeec 72bffa0 58 I I OD2 561 2D : aDe?DDOD L d L t6, D(t6) @072bffa0 155c58 59 II OD2 561 24 : b3e7D3eD s t L zero, 3eDCt6) @001 56038 = D 60 II 002561 28 : e5aOOD61 beq s4, DD2562bD Cs4 i s 1 ) 61 II 257f0026 Ldah s2, 26C zero) s2 now 260000 62 DD2561 2c: II 216b62f8 Lda s2, 62f8Cs2) s2 now 2662f8 63 002561 3D: II 64 002561 34 : 5fffD41 f cpys f 31 , f31 , f31 II 65 002 561 38 : a21e0054 L d L aD, 54C sp) @072bfef4 1476a8 II 66 002561 3c : 225eDD40 Lda a2, 4D Csp) a2 now 72bfeeD II 67 002561 40 : aDDbDOOD L d L vD, DCs2) @D02662f8 77e985aD I I 68 002561 44: 227e0D48 Lda a3, 48Csp) a3 now 72bfee8 II 69 002561 48: a23e0050 L d L a 1 , 5DCsp) @072bfef0 72bffa8 II 70 OD2561 4c : 47ef041 4 bis zero, fp, a4 a4 now dbbaO II 71 002561 5D : a2100000 L d L aD, DC aD) @001 476a8 2c0 II 72 002561 54 : 6b404000 j s ra, CvO),D ra now 256158 r I I KERNEL32 1 GetQueuedCom pletionStatus : 73 77e985a0 : 23deffc0 Lda sp, -4D Csp) sp now 72bfe60 I I 74 77e985a4: b53eOODO stq sO, OCsp) @072bfe60 = 148440 II 75 77e985a8: b55e0008 stq s 1, 8Csp) 11@072bfe68 = 72bffa0 76 77e985ac: b57e001 D stq s2, 1DCsp) @072bfe70 = 2662f8 II 77 77e985b0 : b59eOD18 stq s3, 18Csp) @072bfe78 = 1476a8 II 78 77e985b4 : b75eD02D stq ra, ZDCsp) @072bfe80 = 2561 58 I I 79 7('e985b8 : 47f00409 b is zero, aO, sO sO now ZeD I I 80 77e985bc : 47f1D4Da bi s zero, a 1, s1 s1 now 72bffa8 I I 81 77e985c0 : 47f2D40b bis zero, a2, s2 s2 now 72bfee0 I I 82 77e985c4: 47f3040c bi s zero, a3, s3 s3 now 72bf ee8 I I 83 77e985c8 : 47f4D 411 bis zero, a4, a 1 a1 now dbbaO I I 84 77e985cc: 221 eD038 Lda aD, 38Csp) aD now 72bfe98 I I 85 77e985d0 : d3405893 bsr ra, 77eae820 ra now 77e985d4 II

Figure 2 (continued)
Example of Instruction Trace Output from Ntstep

The data was then reduced to a series of single transactions and analyzed for instruction distribution. For both the cached- and the scaled-transaction instruction counts, we combined at least three separate transactions and took the geometric mean of the instructions executed, which caused slight variations in the instruction counts. All resulting instruction counts were within an acceptable standard deviation as compared to individual transaction instruction counts.

We collected the function traces in a similar fashion. Once the application benchmark was at a steady state, we began collecting the function call tree. Based on previous work with the SQL Server database and consultation with Microsoft engineers, we could pinpoint the beginning of a single transaction. We then began collecting samples for both traces at the same instant, using an Ntstep feature that allowed us to start or stop sample collection based upon a particular address.

The dynamic instruction counts for both the scaled and the cached transactions are given in Tables 11 and 12. We also show the variation and percentage variation between the Original and the Byte/Word versions of the SQL Server. Two of the six new instructions, sextb and sextw, are not present in the Byte/Word trace. The remaining four instructions combine to make up 2.6 percent and 2.7 percent of the instructions executed per scaled and cached transaction, respectively. Other observations include the following:

• The number of instructions executed decreased 7 percent for scaled and 4 percent for cached transactions.
• The number of ldl_l/stl_c sequences decreased 3 percent for scaled transactions.
• All the instructions that are identified in Tables 1 and 2 show a decrease in usage. Not surprisingly, the instructions mskwl and mskbl completely disappeared. The inswl and insbl instructions decreased by 47 percent and 90 percent, respectively. The sll instruction decreased by 38 percent, and the sra instruction usage decreased by 53 percent. These reductions hold true within 1 to 2 percent for both scaled and cached transactions.
• The instructions ldq_u and lda, which are used in unaligned load and store operations, show a decrease in the range of 20 to 22 percent and 15 to 16 percent, respectively.

Digiral l!:chnical Journal Vol. 8 No. 4 1996 99 0 ** Break p oint (Pid Oxd7, Tid Oxdbl SQLSERVR .EXE pc 77f39b34 0 ** Trace beg ins at 00242698 0 ** opends60 1 FetchNextCommand 1 3 ** opends60 !netiOReadData 72 ** KERNEL32 1 GetQueuedComp letionStatus 85 ** KERNEL32 !BaseForma tTime0ut 99 ** ntdl l!NtRemoveioCompletion 129 ** opends60 1netiOCom pletionRoutine 272 ** opends60 !netiORequestRead 285 ** KERNEL32 1ResetEvent 290 ** ntdl l!NtCLearEvent 318 ** SSNMPN60 1 * 0x06a1 31 f0* 348 ** KERNEL32 1 ReadFi Le 399 ** ntdll !NtReadFi Le 412 ** KERNEL32 1BaseSetlastNTError 41 7 ** ntdll 1RtlNtStatusToDosError 423 ** ntdll1Rt LNtStatusToDosErro rNoTeb 509 ** KERNEL32 1GetlastError 560 ** opends60 1get_ client event 665 ** opends60 1processRPC 682 ** opends60 !unpack_rpc 749 ** opends60 1execute_e vent 762 ** opends60 1execute_sq lserver_e vent 802 ** opends60 1 unpack_rpc 864 ** SQLSERVR !execrpc 91 1 ** KERNEL32 1WaitForSingleObj ectEx 937 ** KERNEL32 ! BaseFormatTime0u t 950 ** ntdll1NtWaitForSing leObj ect 1 0 2 4 ** SQLSERVR 1UserPerfStats 1038 ** KERNEL32 1GetThreadTimes 1055 ** ntdll1Nt QuerylnformationThread 11 7 3 ** SQLSERVR 1ini t_recvbuf 1208 ** SQLSERVR !init sendbuf 1 2 2 7 ** SQLSERVR 1port_ex_hand le 1263 ** SQLSERVR1_0tssetjmp3 131 3 ** SQLSERVR !memalloc 1365 ** SQLSERVR 1 OtsZero 1405 ** SQLSERVR ! recvhost 1437 ** SQLSERVR 1 OtsMove 1500 ** SQLSERVR !mema lloc 1577 ** SQLSERVR ! rn char ** SQLSERVR recvhost 1580 1 1612 ** SQLSERVR 1 OtsMove 1777 ** SQLSERVR !parse_name 1808 ** SQLSERVR 1dbcs strnchr 21 1 5 ** SQLSERVR ! rpcprot 21 31 ** SQLSERVR 1 mema lloc 2183 ** SQLSERVR ! OtsZero 22 52 ** SQLSERVR 1getproc id 2319 ** SQLSERVR 1procrelink+Ox1250 2546 ** SQLSERVR 1_0tsRema inder32 2559 ** SQLSERVR 1 OtsDivide32+0x94 2597 ** SQLSERVR!opentable 2642 ** SQLSERVR 1parse_n ame 2673 ** SQLSERVR 1dbcs strnchr 2979 ** SQLSERVR 1parse_name 301 0 ** SQLSERVR 1dbcs strnchr 3323 ** SQLSERVR 1open tabid 3363 ** SQLSERVR 1getdes 3493 ** SQLSERVR 1GetRunidF romDefid+Ox40 3510 ** SQLSERVR ! OtsZero 3658 ** SQLSERVR 1initarg 3668 ** SQLSERVR 1setarg 3703 ** SQLSERVR! OtsFieldinsert 3764 ** SQLSERVR 1 setarg 3799 ** SQLSERVR1 OtsFieldlnsert 3857 ** SQLSERVR ! startscan 3901 ** SQLSERVR !getindex2 3978 ** SQLSERVR 1 getkeepslot 4064 ** SQLSERVR 1 rowoffset 4109 ** SQLSERVR 1rowof fset 4170 ** SQLSERVR 1 OtsMove 4331 ** SQLSERVR 1 memcmp 5323 ** SQLSERVR 1 bufunhold 5436 ** SQLSERVR 1pr epscan 5550 ** SQLSERVR 1match_s args_to_i ndex

Figure 3
Example of Function Trace Output from Ntstep
Vo l. R 1\:o.4 l9% 5828 ** SQLSERVR 1srchindex 5895 ** SQLSERVR ! getpage 5942 ** SQLSERVR !bufget SQLSERVR ! OtsDivide 5976 ** SQLSERVR ! OtsDivide32+0x94 5985 ** 6090 ** SQLSERVR 1 getkeepslot 6356 ** SQLS ERVR !bufrlockw ait 6539 ** SQLSERVR !srchpage 6720 ** SQLSERVR !nc __ sqlhi lo+Ox8b0 691 2 ** SQLSERVR 1nc __ sqlhi lo+Ox8b0 7309 ** SQLSERVR 1nc __ sqlhi lo+Ox8b0 __ 7728 ** SQLSERVR 1nc sqlhi lo+Ox8b0 __ 81 25 ** SQLSERVR 1nc sqlhi lo+Ox8b0 8522 ** SQLSERVR 1nc __ sq lhi lo+Ox8b0 89 19 ** SQLSERVR 1nc __ sq lhi lo+Ox8b0 9410 ** SQLSERVR !index_be fores leep+Ox1 00 9465 ** SQLSERVR !buf run lock 964 1 ** SQLSERVR !trim_sqoff+OxfO 9661 ** SQLS ERVR ! qua lpage 9809 ** SQLSERVR !nc __sqlhi lo+Ox8b0 10212 ** SQLSERVR !nc __ sqlhi lo+Ox8b0 1061 6 ** SQLSERVR 1rowof fset 10702 ** SQLSERVR 1getnext 10769 ** SQLSERVR 1 OtsFieldlnsert 10822 ** SQLSERVR ! g etrow2 10838 ** SQLSERVR 1getpage 10885 ** SQLSERVR !bufget 10919 ** SQLSERVR !_OtsDivide 10928 ** SQLSERVR ! OtsDivide32+0x94 11033 ** SQLSERVR 1 getkeeps lot 11359 ** SQLSERVR ! OtsMove 11489 ** SQLSERVR!endscan 11557 ** SQLSERVR ! bufunkeep 11675 ** SQLSERVR ! bufunkeep 11853 ** SQLSERVR !closetable 11907 ** SQLSERVR !endscan 12044 ** SQLSERVR 1get_s p inlock 12103 ** SQLSERVR !opentabid 12138 ** SQLSERVR !getdes 12291 ** SQLSERVR ! OtsZero 12464 ** SQLSERVR 1closetable 12524 ** SQLSERVR !endscan 12661 ** SQLS ERVR ! get_s pi nlock 12729 ** SQLSERVR ! prot ect 12756 ** SQLSERVR ! port_ex_hand le 12792 ** SQLSERVR !_Otssetjmp3 12845 ** SQLSERVR !pr ot_sea rch 12887 ** SQLSERVR 1dbtblfind 12958 ** SQLSERVR ! check_p r otect 13025 ** SQLSERVR 1mema lloc 13077 ** SQLSERVR 1 OtsZero 13127 ** SQLSERVR ! memalloc 13179 ** SQLSERVR 1_0tsZero 13263 ** SQLSERVR! rn i 2 13267 ** SQLSERVR 1 recvhost 13299 ** SQLSERVR ! OtsMove 13369 ** SQLSERVR !recvhost 13401 ** SQLSERVR 1 Ot sMove 13477 ** SQLSERVR !recvhost 13509 ** SQLSERVR 1 Ot sMove 13562 ** SQLSERVR !recvhost 13594 ** SQLSERVR 1 OtsMove 13670 ** SQLSERVR 1recvhost 13702 ** SQLSERVR ! OtsMove 13755 ** SQLS ERVR 1recvhost 13787 ** SQLSERVR 1 OtsMove 13847 ** SQLS ERVR !bconst 13895 ** SQLSERVR !mkconstant 13921 ** SQ LSERVR 1mema lloc 14046 ** SQLSERVR 1memalloc 14098 ** SQLSERVR 1 OtsZero 14157 ** SQLSERVR !rn i4 14161 ** SQLSERVR 1recvhost 14193 ** SQLSERVR ! OtsMove

Figure 3 (continued) Example of Function Trace Output fr om Nrsrep

8 1996 101 DigitalTec hnical Journal Vol. No. 4 Ta ble 11 Instruction Count and Va riations for Scaled Transaction

Instruction Original Byte/Word Delta %Delta Instruction Original Byte/Word Delta %Delta

stb 0 174 +174 N/A stt 334 334 0 0% stw 0 219 +219 N/A cmple 368 358 10 -3% ldwu 0 1215 +1215 N/A inswl 390 207 183 -47% ldbu 0 1216 +1216 N/A sri 457 398 59 -13% cmpbge 2 0 -2 -100% extqh 441 317 124 -28% cmovlbs 2 2 0 0% em pule 468 450 18 -4% a ddt 3 3 0 0% cmpult 563 518 45 -8% cmovlbc 5 4 -1 -20% cmplt 565 534 31 -5% cmovle 5 5 0 0% rdteb 604 597 7 -1% insqh 6 6 0 0% extwl 660 345 315 -48% cmovgt 13 13 0 0% stq_u 688 688 0 0% callsys 18 14 -4 -22% bit 784 771 13 -2% mulq 13 13 0 0% bic 771 347 424 -55% s8subq 17 17 0 0% ext II 789 761 28 -4% cmovlt 16 16 0 0% extlh 789 761 28 -4% ldt 25 25 0 0% bge 828 819 9 -1% zap 34 33 -1 -3% mb 961 94 1 20 -2% umulh 35 35 0 0% sll 949 590 359 -38% mull 60 62 +2 +3% sub I 1052 1061 (9) +1% arnot 52 52 0 0% br 1160 1080 80 -7% cmpeq 64 61 -3 -5% sra 1211 562 649 -54% insql 61 61 0 0% bsr 1203 1191 12 -1% bibs 69 69 0 0% s4addl 1176 1166 10 -1% s8addl 71 74 +3 +4% ret 1282 1264 18 -1% mskwl 74 0 -74 -100% zapnot 1262 910 352 -28% jsr 98 89 -9 -9% addq 1704 1685 19 -1% cpys 104 41 -63 -61 % subq 2159 2140 19 -1% mskqh 155 153 -2 - 1% ldah 2793 2746 47 -2% cmovne 147 141 -6 -4% extbl 2902 1668 1234 -43% mskbl 163 0 -163 -100% xor 3426 3380 46 -1% cmoveq 183 173 -10 -5% and 3402 2969 433 -13% insbl 182 19 -163 -90% bne 4537 4440 97 -2% extwh 196 196 0 0% addI 4897 4855 42 -1% trapb 203 215 +12 +6% ldq_u 5046 3933 1113 -22% mskql 204 202 -2 -1% stl 5753 5301 452 -8% jmp 208 200 -8 -4% Ida 6496 5435 1061 -16% cmovge 291 287 -4 -1% stq 6778 6713 65 -1% blbc 249 249 0 0% ldq 7018 6519 -499 +7% bgt 331 328 -3 -1% beq 7607 7455 152 -2% -5 ldl_l 344 335 -9 -3% bis 11284 10707 577 % stl_c 344 335 -9 - 3% ldl 15962 14260 1702 -11% extql 329 327 -2 -1% To tals 115895 107854 8042 - 7 %

For the scaled transaction, a decrease in 58 out of instructions per transaction measured in Table 13. If 81 instructions types occurred. Of the remaining 25 this correlation holds true, we would expect to sec an instructions, 21 had no change and only 4 instructions, increase in pcrri.>rmancc of approximately 7 percent mull, s8addl, trapb, and sub], showed an increase. For t(>r scaled transactions runs. cached transactions, 22 instruction counts decreased, 29 increased, and 22 remained unchanged. Dynamic Instruction Distribution The performance gain of 3.5 percent measured fo r The pcrtcxmancc of the Alpha microprocessor using the cached version of the application benchmark cor­ technical and commercial workloads has been evalu­

relates closely to the decrease in the number of ated.' The commercial worklo::td used WJS debit-

102 Digiral Tcchniol Journal Vo l. H No. 4 1996 Ta ble 12 Instruction Count and Va riations for Cached Transaction

Instruction Original Byte/Word Delta %Delta Instruction Original Byte/Word Delta %Delta stb 0 174 +174 N/A stt 334 334 0 0% stw 0 217 +217 N/A cmple 367 374 +7 +2% ldwu 0 1189 +1189 N/A inswl 381 203 -178 -47% ldbu 0 1333 +1333 N/A sri 433 383 -50 -12% cmpbge 2 0 -2 -100% extqh 434 314 -120 -28% cmovlbs 2 2 0 0% cmpule 450 440 -10 -2% a ddt 3 3 0 0% cmpult 550 572 + 22 +4% cmovlbc 4 5 +1 +25% cmplt 561 585 + 24 +4% cmovle 5 5 0 0% rdteb 587 590 +3 +1% insqh 6 6 0 0% extwl 654 340 -314 -48% cmovgt 13 13 0 0% stq_u 689 687 -2 0% cal lsys 15 16 +1 +7% bit 751 770 +19 +3% mulq 13 13 0 0% bic 759 346 -413 -54% s8subq 13 14 +1 +8% ext II 784 805 +21 +3% cmovlt 16 16 0 0% extlh 784 805 +21 +3% ldt 25 25 0 0% bge 813 831 +18 +2% zap 26 27 +1 +4% mb 883 901 + 18 +2% umulh 32 32 0 0% sll 899 569 -330 -37% mull 46 48 +2 +4% subI 983 995 +12 +1% ornot 46 46 0 0% br 1130 1100 -30 -3% cmpeq 53 53 0 0% sra 1134 528 -606 -53% insql 61 61 0 0% bsr 1158 1165 +7 +1% bibs 63 63 0 0% s4addl 1160 1170 +10 +1% s8addl 69 70 +1 +1% ret 1232 1239 +7 +1% mskwl 73 0 -73 -100% zapnot 1247 91 1 -336 -27% jsr 90 92 +2 +2% addq 1589 1631 +42 +3% cpys 87 41 -46 -53% subq 1994 2046 +52 +3% mskqh 152 157 +5 +3% ldah 2684 269 1 +7 +0% cmovne 160 165 +5 +3% extbl 2921 1682 -1239 -42% mskbl 163 0 -163 -100% xor 3278 3332 +54 +2% cmoveq 182 190 +8 +4% and 3361 2990 -371 -11% insbl 182 19 -163 -90% bne 4328 4376 +48 +1% extwh 195 196 +1 +1% add I 4734 4856 +122 +3% trapb 210 21 1 +1 0% ldq_u 5061 4046 -101 5 -20% mskql 201 203 +2 +1% stl 5418 5052 -366 -7% jmp 209 21 5 +6 +3% Ida 6289 5344 -945 -15% cmovge 226 236 +10 +4% stq 6464 6588 +124 +2% blbc 238 238 0 0% ldq 6685 6359 -326 -5% bgt 292 302 +10 +3% beq 7355 7466 + 111 +2% ldl_l 314 320 +6 +2% bis 10890 10668 -222 -2% stl_c 314 320 +6 +2% ldl 14964 13772 -1192 -8% extql 326 329 +3 +1% Totals 111288 106521 - 4767 -4%

credit, vv hich is similar to the TPC-A benchmark. The instruction makeup ofeach group. Figure 4 shows the TPC-B benchmark is similar to the TPC-A, differing percentage of instructions in each group fo r the tour only in its method of execution. Cvetanovic and alternatives we studied. In all fo ur cases, INTEGER Bhandarkar presented an instruction distribution LOADs make up 32 percent of the instructions exe­ matrix f(x the debit-credit workload. The Alpha cuted. In tbe scaled Byte/Word category, the new instruction type mix is dominated by the integer class, ld bu and ldwu instructions compose l percent of the tollowed by other, load, branch, and store instructions, integer instructions, and the new stb and stw instruc­ in descending order.17 We took a similar approach tions accounted fo r 18 percent of the integer store but divided the instructions into more groups to instructions executed. achieve a ti ner detailed distribmion. Table 13 gives the

Table 13
Instruction Groupings

Instruction Group     Group Members
Integer loads         ldwu, ldbu, ldl_l, ldah, ldq_u, lda, ldq, ldl
Integer stores        stb, stw, stl_c, stq_u, stl, stq
Integer control       blbs, jsr, jmp, blbc, bgt, blt, bge, br, bsr, ret, bne, beq
Integer arithmetic    cmpbge, s8subq, umulh, mull, cmpeq, s8addl, cmple, cmpule, cmpult, cmplt, subl, s4addl, addq, subq, addl
Logical shift         cmovlbs, cmovlbc, cmovle, cmovgt, cmovlt, ornot, cmovne, cmoveq, cmovge, srl, bic, sll, sra, xor, and, bis
Byte manipulation     insll, inslh, mskll, msklh, insqh, zap, insql, mskwl, mskqh, mskbl, insbl, extwh, mskql, extql, inswl, extqh, extwl, extll, extlh, zapnot, extbl
Other                 addt, ldt, stt, mulq, callsys, cpys, trapb, rdteb, mb

During the scaled transactions, each instruction group showed a decrease in the number of instructions executed, ranging from negligible to as much as 54 percent. In addition, the number of byte manipulation and logical shift instructions decreased, because the method of loading or storing bytes and words on the original Alpha Architecture made heavy use of these types of instructions; the sketch following the next paragraph illustrates why.

In our last examination, we looked at the instruction variation between a scaled and a cached transaction. The major difference between the two transactions is the additional I/O required by the scaled version of the benchmark. Table 14 gives the results. The Original version of the SQL Server database executed an extra 4,596 instructions during the cached transaction as compared to the scaled transaction. For the Byte/Word version, only an additional 1,334 instructions were executed.
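The decrease in byte-manipulation and logical-shift counts noted above is easy to see in miniature. The Python function below is an illustrative emulation (ours, not the authors' code) of what a byte store cost on the original architecture: with only aligned quadword loads and stores available, the compiler emitted a read-modify-write built from ldq_u, insbl, mskbl, bis, and stq_u, so every byte store dragged byte-manipulation and logical instructions along with it. With the extension, the same store is a single stb, and a byte load is a single ldbu rather than ldq_u followed by extbl.

    def store_byte_pre_bwx(mem, addr, value):
        """Emulate a byte store on the original Alpha architecture.
        mem maps aligned quadword addresses to 64-bit values."""
        base = addr & ~0x7                  # ldq_u: find enclosing quadword
        quad = mem[base]
        shift = (addr & 0x7) * 8            # byte position (little-endian)
        inserted = (value & 0xFF) << shift  # insbl: position the new byte
        cleared = quad & ~(0xFF << shift)   # mskbl: clear the old byte
        mem[base] = cleared | inserted      # bis, then stq_u writes it back

    mem = {0: 0}
    store_byte_pre_bwx(mem, 3, 0xAB)        # five instructions' work...
    assert mem[0] == 0xAB << 24             # ...replaced by a single stb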

[Figure 4: Instruction Group Distribution. A stacked horizontal bar chart (0 to 100 percent) for the CACHED BYTE/WORD, CACHED ORIGINAL, SCALED BYTE/WORD, and SCALED ORIGINAL runs, showing each run's instruction mix across the groups: integer load, integer store, integer control, integer arithmetic, logical shift, byte manipulation, and other.]

Table 14
Instruction Variations (Scaled Minus Cached Transactions)

Instruction    Original    Byte/Word
stw                   0           -2
ldwu                  0          -26
ldbu                  0         +117
cmovlbc              -1           +1
callsys              -3           +2
s8subq               -4           -3
zap                  -8           -6
umulh                -3           -3
mull                -14          -14
ornot                -6           -6
cmpeq               -11           -8
blbs                 -6           -6
s8addl               -2           -4
mskwl                -1            0
jsr                  -8           +3
cpys                -17            0
mskqh                -3           +4
cmovne              +13          +24
cmoveq               -1          +17
extwh                -1            0
trapb                +7           -4
mskql                -3           +1
jmp                  +1          +15
cmplt                -4          +51
rdteb               -17           -7
extwl                -6           -5
stq_u                +1           -1
blt                 -33           -1
bic                 -12           -1
extll                -5          +44
extlh                -5          +44
bge                 -15          +12
mb                  -78          -40
sll                 -50          -21
cmovge              -65          -51
blbc                -11          -11
bgt                 -39          -26
ldl_l               -30          -15
stl_c               -30          -15
extql                -3           +2
cmple                -1          +16
inswl                -9           -4
srl                 -24          -15
extqh                -7           -3
cmpule              -18          -10
cmpult              -13          +54
subl                -69          -66
br                  -30          +20
sra                 -77          -34
bsr                 -45          -26
s4addl              -16           +4
ret                 -50          -25
zapnot              -15           +1
addq               -115          -54
subq               -165          -94
ldah               -109          -55
extbl               +19          +14
xor                -148          -48
and                 -41          +21
bne                -209          -64
addl               -163           +1
ldq_u               +15         +113
stl                -335         -249
lda                -207          -91
stq                -314         -125
ldq                -333         -160
beq                -252          +11
bis                -394          -39
ldl                -998         -488
Totals            -4596        -1334

Conclusions

The introduction of the new single byte and word manipulation instructions in the Alpha Architecture improved the performance of the Microsoft SQL Server database. We observed a decrease in the number of instructions executed per transaction, the elimination of some instructions in the workload, a redistribution of the instruction mix, and an increase in relative performance. The results are in line with expectations when the addition of the new instructions was proposed.

We limited our investigation to a single commercial workload and operating system. Testing a workload with more I/O, such as the TPC-C benchmark, would produce a different set of results and would merit investigation. The use of another database, such as the Oracle RDBMS, which makes greater use of byte operations, would possibly result in an even greater performance impact. Lastly, rebuilding the entire operating system to use the new instructions would make an interesting and worthwhile study.

Acknowledgments

As with any project, many people were instrumental in this effort. Wim Colgate, Miche Baker-Harvey, and Steve Jenness gave us numerous insights into the Windows NT operating system. Tom Van Baak provided several analysis and tracing/simulation tools for the Windows NT environment. Rich Grove provided access to early builds of the GEM compiler back end that contained byte and word support. Stan Gazaway built the SQL Server application with the modifications. Vehbi Tasar provided encouragement and sanity checking. John Shakshober lent insight into the world of TPC. Peter Bannon provided the early prototype machine. Contributors from Microsoft Corporation included Todd Ragland, who helped rebuild the SQL Server; Rick Vicik, who provided detailed insights into the operation of the SQL Server; and Damien Lindauer, who helped set up and run the TPC benchmark. Finally, we thank Dick Sites for encouraging us to undertake this effort.

References and Notes

1. Z. Cvetanovic and D. Bhandarkar, "Characterization of Alpha AXP Performance Using TP and SPEC Workloads," 21st Annual International Symposium on Computer Architecture, Chicago (1994).

2. W. Kohler et al., "Performance Evaluation of Transaction Processing," Digital Technical Journal, vol. 3, no. 1 (Winter 1991): 45-57.

3. S. Leutenegger and D. Dias, "A Modeling Study of the TPC-C Benchmark," Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD Record 22 (2) (June 1993).

4. R. Sites and E. Perl, PatchWrx: A Dynamic Execution Tracing Tool (Palo Alto, Calif.: Digital Equipment Corporation, Systems Research Center, 1995).

5. W. Kohler, A. Shah, and F. Raab, Overview of TPC Benchmark C: The Order-Entry Benchmark (San Jose, Calif.: Transaction Processing Performance Council Technical Report, 1991).

6. R. Sites, "Alpha AXP Architecture," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 19-34.

7. Alpha AXP Systems Handbook (Maynard, Mass.: Digital Equipment Corporation, 1993).

8. DECchip 21064A-233, -275 Alpha AXP Microprocessor Data Sheet (Maynard, Mass.: Digital Equipment Corporation, 1994).

9. Alpha 21164 Microprocessor Hardware Reference Manual (Maynard, Mass.: Digital Equipment Corporation, 1994).

10. R. Sites and R. Witek, Alpha AXP Architecture Reference Manual, 2d ed. (Newton, Mass.: Digital Press, 1995).

11. G. Kane, MIPS R2000 RISC Architecture (Englewood Cliffs, N.J.: Prentice-Hall, 1987).

12. J. Hennessy, N. Jouppi, F. Baskett, and J. Gill, MIPS: A VLSI Processor Architecture (Stanford, Calif.: Computer Systems Laboratory, Stanford University).

15. C. Cole and L. Crudele, personal correspondence, December 1996.

16. Microsoft Corporation developed the ARC firmware for the MIPS platform. During the early days of the port of Windows NT to Alpha, DIGITAL's engineers ported the ARC firmware to the Alpha platform.

17. The Alpha instruction type mix included PALcode calls, barriers, and other implementation-specific PALcode instructions.

Biographies

David P. Hunter
David Hunter is the engineering manager of the DIGITAL Software Partners Engineering Advanced Development Group, where he has been involved in performance investigations of databases and their interactions with UNIX and Windows NT. Prior to this work, he held positions in the Alpha Migration Organization, the ISV Porting Group, and the Government Group's Technical Program Management Office. He joined DIGITAL in the Laboratory Data Products Group in 1983, where he developed the VAXlab Sensor Management System. He was the project leader of the advanced development project ITS, an executive information system, for which he designed hardware and software components. David has two patent applications pending in the area of software engineering. He holds a degree in electrical and computer engineering from Northeastern University.

Eric B. Betts
Eric Betts is a principal software engineer in the DIGITAL Software Partners Engineering Group, where he has been involved with performance ...

