A Comparison of Scalable Superscalar Processors

Bradley C. Kuszmaul, Dana S. Henry, and Gabriel H. Loh*
Yale University, Departments of Computer Science and Electrical Engineering
{bradley,dana,lohg}@{cs.yale.edu,ee.yale.edu}

Abstract

The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as Θ(n²) where n is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, the Ultrascalar I and the Ultrascalar II, and compares their VLSI complexities (gate delays, wire-length delays, and area.) Both processors are implemented by a large collection of ALUs with controllers (together called execution stations) connected together by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved data cache to the execution stations. These networks provide the full functionality of superscalar processors including renaming, out-of-order execution, and speculative execution. The difference between the processors is in the mechanism used to transmit register values from one execution station to another. Both architectures use a parallel-prefix tree to communicate the register values between the execution stations. The Ultrascalar I transmits an entire copy of the register file to each station, and the station chooses which register values it needs based on the instruction. The Ultrascalar I uses an H-tree layout. The Ultrascalar II uses a mesh-of-trees and carefully sends only the register values that will actually be needed by each subtree to reduce the number of wires required on the chip.

The complexity results are as follows: The complexity is described for a processor which has an instruction-set architecture containing L logical registers and can execute n instructions in parallel. The chip provides enough memory bandwidth to execute up to M(n) memory operations per cycle. (M is assumed to have a certain regularity property.) In all the processors, the VLSI area is the square of the wire delay. The Ultrascalar I has gate delay Θ(log n) and wire delay

  T_wire = Θ(√n · L)            if M(n) is O(n^(1/2−ε)),
           Θ(√n · (L + log n))  if M(n) is Θ(n^(1/2)), and
           Θ(√n · L + M(n))     if M(n) is Ω(n^(1/2+ε))

for ε > 0. The Ultrascalar II has gate delay Θ(log L + log n). The wire delay is Θ(n), which is optimal for n = O(L). Thus, the Ultrascalar II dominates the Ultrascalar I for n = O(L²); otherwise the Ultrascalar I dominates the Ultrascalar II.

We introduce a hybrid ultrascalar that uses a two-level layout scheme: Clusters of execution stations are layed out using the Ultrascalar II mesh-of-trees layout, and then the clusters are connected together using the H-tree layout of the Ultrascalar I. For the hybrid (in which n ≥ L) the wire delay is Θ(√(nL) + M(n)), which is optimal. For n ≥ L the hybrid dominates both the Ultrascalar I and Ultrascalar II.

We also present an empirical comparison of the Ultrascalar I and the hybrid, both layed out using the Magic VLSI editor. For a processor that has 32 32-bit registers and a simple integer ALU, the hybrid requires about 11 times less area.

1 Introduction

Today's superscalar processors must rename registers, bypass registers, checkpoint state so that they can recover from speculative execution, check for dependencies, allocate execution units, and access multi-ported register files. The circuits employed are complex, irregular, and difficult to implement. Furthermore, the delays through many of today's circuits grow quadratically with issue width (the maximum number of simultaneously fetched or issued instructions), and with window size (the maximum number of instructions within the processor core). It seems likely that some of those circuits can be redesigned to have at most linear delays, but all the published circuits are at least quadratic delay [12, 3, 4]. Chips continue to scale in density. For example, Texas Instruments recently announced a 0.07 micrometer process with plans to produce processor chips in volume production in 2001 [18]. With billion-transistor chips on the horizon, this scalability barrier appears to be one of the most serious obstacles for high-performance uniprocessors in the next decade.

This paper compares three different microarchitectures that improve the scalability bounds while extracting the same instruction-level parallelism as today's superscalars. All three processors rename registers, issue instructions out of order, speculate on branches, and effortlessly recover from branch mispredictions. The three processors all implement identical instruction sets, with identical scheduling policies. The only differences between the processors are in their VLSI complexities, which include gate delays, wire delays, and area, and which therefore have implications on clock speeds. The regular structure of the processors not only makes it possible to theoretically analyze the VLSI complexity of the processors, but made it possible for us to quickly implement VLSI layouts to facilitate an empirical comparison.

*This work was partially supported by NSF Career Grants CCR-9702980 (Kuszmaul) and MIP-9702281 (Henry) and by an equipment grant from Intel.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPAA '99 Saint Malo, France. Copyright ACM 1999 1-58113-124-0/99/06...$5.00

We analyze the complexity of each processor as a function of three important parameters.

1. L is the number of logical registers, which is defined by the instruction-set architecture. L is the number of registers that


Figure 1: A linear-gate-delay datapath for the Ultrascalar I processor.

the programmer sees, which in most modern processors is substantially less than the number of real registers that the processor implementor employs. A superscalar processor often uses more real registers than logical registers to exploit more instruction-level parallelism than it could otherwise.

2. n is the issue width of the processor. The issue width determines how many instructions can be executed per clock cycle. This paper does not study the effect of increasing the instruction-fetch width or the window size independently of the issue width. A simplified explanation of these terms follows: The instruction-fetch width is the number of instructions that can enter the processor pipeline on every clock cycle, whereas the issue width is the number that can leave the processor pipeline on every cycle. It is reasonable to assume that the issue width and the instruction-fetch width scale together. The window size is the number of instructions that can be "in flight" between the instruction fetcher and the issue logic. That is, the window size indicates how many instructions have entered the processor and are being considered for execution. In most modern processors the window size is an order of magnitude larger than the issue width. We do not here consider the window size independently because it makes no difference to the theoretical results. From an empirical point of view, it is doubtless worth investigating the impact of changing the window size independently from the issue width. We know how to separate the two parameters by issuing instructions to a smaller pool of shared ALUs. Our ALU scheduling circuitry is described elsewhere [6] and fits within the bounds described here.

3. M is the bandwidth provided to memory. The idea is that some programs need quite a bit of memory bandwidth per instruction, whereas others need less. A processor may be optimized for one case over another. The memory bandwidth is naturally viewed as a function of n, since a small processor that can only execute 1 instruction per cycle may need to provide facilities for one memory operation per cycle to achieve high performance, whereas a larger processor that can execute 100 instructions per cycle may only need 10 memory operations per cycle to keep the same program running quickly. The larger processor may need less bandwidth per instruction because it has a better chance to smooth out the memory accesses over time, and because it may be able to use caching techniques to move data directly from one instruction to another without using up memory bandwidth. Thus, we express the memory bandwidth, M(n), as a function of n. We assume that M(n) is O(n), since it makes no sense to provide more memory bandwidth than the total instruction issue rate.

This paper contributes two new architectures, the Ultrascalar II and a hybrid Ultrascalar, and compares them to each other and to a third, the Ultrascalar I. We described the Ultrascalar I architecture in [7], but did not analyze its complexity in terms of L. For L equal to 64 64-bit values, as is found in today's architectures, the improvement in layout area over the Ultrascalar I is dramatic.

This paper does not evaluate the benefits of larger issue widths, window sizes, or of providing more logical registers. Some work has been done showing the advantages of large-issue-width, large-window-size, and large-register-file processors. Lam and Wilson suggest that ILP of ten to twenty is available with an infinite instruction window and good branch prediction [8]. Patel, Evers and Patt demonstrate significant parallelism for a 16-wide machine given a good trace cache [13]. Patt et al argue that a window size of 1000's is the best way to use large chips [14]. Steenkiste and Hennessy [17] conclude that certain optimizations can significantly improve a program's performance if many logical registers are available. And, although the past cannot guarantee the future, the number of logical registers has been steadily increasing over time. The amount of parallelism available in a thousand-wide instruction window with realistic branch prediction and 128 logical registers is not well understood and is outside the scope of this paper. Thus, we are interested in processors that scale well with the issue width, the window size, the fetch width, and with an increasing number of logical registers in the instruction-set architecture.

The rest of this paper describes and compares our three scalable processor designs. Section 2 functionally describes the Ultrascalar I, and analyzes its gate-level complexity. Section 3 provides the VLSI floorplan for the Ultrascalar I and analyzes its area and wire-length complexity. Section 4 describes the Ultrascalar II at gate-level, whereas Section 5 addresses VLSI layout, area, and wire-lengths. Section 6 shows how to build a hybrid of the Ultrascalar I and Ultrascalar II, and analyzes its complexity. Section 7 compares the three processors both analytically and empirically by comparing our VLSI layouts.

2 Ultrascalar I Design

The Ultrascalar I processor achieves scalability with a completely different mechanism than is used by traditional superscalar processors. Instead of renaming registers and then broadcasting renamed results to all outstanding instructions, as today's superscalars do, the Ultrascalar I passes the entire logical register file, annotated with ready bits, to every outstanding instruction. Later in this section we shall show how to implement the Ultrascalar I with only logarithmic gate delays, but for ease of understanding, we start with an explanation of circuits that have linear gate delay. The description given here stands on its own, but a more in-depth description of the Ultrascalar I can be found in [7].

Figure 1 shows the datapath of an Ultrascalar I processor implemented in linear gate delay. Eight outstanding instructions are shown, analogous to an eight-instruction window in today's superscalars. Each of the eight instructions occupies one execution station. An execution station is responsible for decoding and executing an instruction given the data in its register file. At any given time, one of the stations is marked as being the "oldest", with younger stations found to the right of the oldest, and then still younger stations found by wrapping around and progressing left-to-right from Station 0. In Figure 1, Station 6 is the oldest. Thus the complete sequence of instructions currently in the datapath is:

  Instruction Sequence    Execution Station
  R3 = R1/R2              (6)
  R0 = R0 + R3            (7)
  R1 = R5 + R6            (0)
  R1 = R0 + R1            (1)
  R2 = R5 * R6            (2)
  R2 = R2 + R4            (3)
  R0 = R5 - R6            (4)
  R4 = R0 + R7            (5)

Figure 2: An Ultrascalar I execution station.

Figure 3: Timing diagram showing the relative time during which each instruction in the sequence executes. Arrows indicate true data dependencies.

The execution stations communicate with each other using rings of multiplexers, as illustrated in Figure 1. There are L rings of multiplexers, one for each logical register defined by the instruction set. Each multiplexer ring carries the latest value of its logical register and a ready bit to successive execution stations. Any instruction that modifies the ring's register inserts a new value and a new ready bit into the ring. The ready bit indicates whether or not the instruction has already computed the register's value. The oldest station inserts the initial value of every register into each ring.

The execution stations in Figure 1 continually refill with new instructions. Stations holding finished instructions are reused as soon as all earlier instructions finish. The station holding the oldest unfinished instruction becomes the oldest station on the following clock cycle.

Figure 2 illustrates the internal structure of each execution station. Each station includes its own functional units (ALU), its own register file, instruction decode logic, and control logic which is not shown. The register file holds the values of all the registers and their ready bits as written by preceding instructions. The register file is updated at the beginning of every clock cycle. Each station, other than the oldest, latches all of its incoming values and ready bits into its register file. Next, each station reads its arguments, each of which is a value and a ready bit, from the register file. Based on the arguments, the ALU computes the result's value and ready bit. Then the station inserts the result into the outgoing register datapath. The rest of the outgoing registers are set from the register file. The decode logic generates a modified bit for every logical register, indicating whether the station has modified the register's value and ready bit. The modified bit is used to control the register's multiplexer in the datapath of Figure 1. Typically, only the result register is marked modified. But in the case of the oldest station, all registers are marked modified. Finally, at the end of each clock cycle, newly computed results propagate through the datapath of Figure 1.

Figure 1 shows a snapshot of some of the values passing through the Ultrascalar I datapath. The ring at the forefront of the figure carries the latest values of register R0 and their ready bits. The flow of data through the ring, from writers of R0 to readers of R0, is marked in bold. The current setting of every multiplexer is indicated by dashed lines. The oldest station, Station 6, has set its multiplexer select line high, thus inserting the initial value of R0 into the ring. The initial value, equal to 10, is marked ready. We will indicate values that are not ready with a question mark. The instructions in Stations 7 and 4 modify R0. At the time of the snapshot, the instruction in Station 7 has not yet computed its result. It has inserted a low ready bit into the chain instead, informing the instructions in Stations 0-4 that the value of R0 is not yet ready. The instruction in Station 4, on the other hand, has already computed its result. It has set the value of R0 to 42 and informed Stations 5 and 6 that R0 is ready. Recall that Station 6 ignores the incoming information: Since it is the oldest station, it does not latch incoming values into its register file.

Figure 3 illustrates the dynamic behavior of the Ultrascalar datapath. The figure shows the relative time at which each instruction in our eight-instruction sequence computes. We assume that division takes 10 clock cycles, multiplication 3, and addition 1. Note that the datapath in Figure 1 exploits the same instruction-level parallelism as today's superscalars. Instructions compute out of order using renamed register values. The instruction in Station 4, for example, computes right away since it is not dependent on any of the earlier outstanding instructions. In contrast, the earlier instruction in Station 7, which also modifies R0, does not issue for ten more clock cycles. This timing diagram is exactly what would
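The multiplexer ring's behavior can be captured by a short functional model. This is our own sketch, not circuitry from the paper: the function name, the dictionary encoding of writers, and the '?' marker for not-ready values are all illustrative assumptions.

```python
# Hypothetical model of one Ultrascalar I multiplexer ring (one logical
# register).  Station i receives the (value, ready) pair inserted by the
# nearest older station whose instruction modifies the register; the
# oldest station inserts the committed initial value and ignores its own
# ring input.

def ring_values(n, oldest, writes, initial):
    """writes: dict station -> (value, ready) for stations that modify
    the register.  Returns the (value, ready) seen by each station."""
    incoming = [None] * n
    current = initial                 # the oldest station inserts this
    for k in range(n):                # walk the ring in program order
        s = (oldest + k) % n
        incoming[s] = current         # what station s sees on its input
        if s in writes:               # a writer replaces the ring value
            current = writes[s]
    return incoming

# The R0 ring of Figure 1: Station 6 is oldest and inserts R0 = 10;
# Station 7 writes R0 but is not ready ('?'); Station 4 writes R0 = 42.
seen = ring_values(8, oldest=6,
                   writes={7: ('?', False), 4: (42, True)},
                   initial=(10, True))
assert seen[0] == ('?', False)   # Stations 0-4 wait on Station 7
assert seen[5] == (42, True)     # Station 4's result reaches Station 5
assert seen[6] == (10, True)     # the oldest keeps the committed value
```

Running the model reproduces the snapshot in Figure 1: the unfinished write in Station 7 blocks Stations 0 through 4, while Station 4's completed result is visible to the stations behind it.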


Figure 4: A logarithmic-gate-delay datapath for the Ultrascalar I processor.

be produced in a traditional superscalar processor that has enough functional units to exploit the parallelism of the code sequence.

The Ultrascalar I datapath described in Figure 1 does not scale well. The datapath can propagate a newly computed value through the entire ring of multiplexers in a single clock cycle. Each multiplexer has a constant gate delay and the number of multiplexers in the ring is equal to the number of outstanding instructions, n. As a result, the processor's clock cycle is Θ(n) gate delays.

One way to reduce the processor's asymptotic growth in clock cycle is by replacing each ring of multiplexers with a cyclic, segmented, parallel-prefix (CSPP) circuit [5]. Figure 4 shows the Ultrascalar I datapath redesigned to use one CSPP circuit for each logical register. Each parallel prefix circuit has exactly the same functionality and the same interface as the multiplexer ring that it has replaced. As in the linear implementation, the cyclic, segmented parallel prefix circuit at the forefront of Figure 4 carries the latest values of register R0 and their ready bits. The flow of data through the prefix, from each writer of R0 to each reader of R0, is again shown in bold, and multiplexer settings are indicated by dashed lines. We see the three current settings of R0 (10, ?, and 42) propagating through the tree, reaching the same stations that they reached in Figure 1.

The CSPP circuit can be understood as follows. A (noncyclic) segmented parallel prefix circuit computes, for each node of the tree, the accumulative result of applying an associative operator to all the preceding nodes up to and including the nearest node whose segment bit is high. (See [1] for a discussion of parallel prefix and in particular see Exercise 29.2-8 for segmented parallel prefix.) In this instance, the associative operator (a ⊗ b = a) simply passes earlier values and the segment bit identifies instructions that insert new register values into the parallel prefix. The resulting segmented parallel prefix circuit computes for each station the most recently modified value of the register. In addition, we have turned each segmented parallel prefix circuit into a cyclic, segmented parallel prefix (CSPP) circuit by tying together the data lines at the top of the tree and discarding the top segment bit. In our CSPP circuit the preceding stations wrap around until a station is found that has modified the register. With CSPP circuits implementing the datapath, the circuit has gate delay O(log n).

The Ultrascalar I uses several more CSPP circuits in order to correctly sequence instructions. All of the remaining CSPP circuits have the same structure. They all use the 1-bit-wide associative operator a ⊗ b = a ∧ b. Figure 5 illustrates. By having the oldest station raise its segment bit, this CSPP circuit can compute for each station whether all the earlier stations have met a particular condition. In Figure 5, Station 6 is the oldest and has raised its segment bit. Stations 6, 7, 0, 1, and 3 have met the condition and have raised their inputs to the CSPP. The circuit outputs a high to Stations 7, 0, 1, and 2, informing them that all earlier stations have met the condition.

One instance of the circuit in Figure 5 is used to compute which station is the oldest on every clock cycle. The circuit informs each station whether all preceding stations have finished their instructions. If a station has not yet finished executing and all preceding stations have, it becomes the oldest station on the next clock cycle. If a station has finished executing and so have all the preceding stations, the station becomes deallocated and can be refilled with a new instruction on the next clock cycle.

Two more instances of the circuit in Figure 5 serialize memory accesses. The first circuit informs each station whether all preceding stations have finished storing to memory. The second, whether all preceding loads have finished. A station cannot load from memory until all preceding stores have finished. A station cannot store to memory until all preceding loads and stores have finished.


Figure 5: A 1-bit-wide cyclic, segmented parallel prefix circuit (CSPP) using the associative operator a ⊗ b = a ∧ b.

A third instance of the circuit in Figure 5 helps support branch speculation. The circuit informs each station whether all preceding stations have confirmed their speculative branches and committed to executing. A station cannot modify memory, for example, until all preceding stations have committed.

In addition to connecting to the above CSPP circuits, each station in the Ultrascalar I must somehow connect to instruction memory and data memory. There are many ways to implement such a connection. For one, the Ultrascalar I can use the same interfaces as today's superscalars. To the memory subsystem, each station looks just like an entry in a wrap-around instruction window of one of today's processors. Of course, in order to take advantage of our scalable processor datapath, the memory interface also should scale. We propose to connect the Ultrascalar I datapath to an interleaved data cache [7] and to an instruction trace cache [20, 15] via two fat-tree or butterfly networks [10]. This allows one to choose how much bandwidth to implement by adjusting the fatness of the trees.

The internal structure of the Ultrascalar datapath provides a mechanism for fast and graceful recovery from the speculative execution of a mispredicted branch. The proper state of the register file for execution stations containing mispredicted instructions is provided by previous execution stations and the committed register file. Nothing needs to be done to recover from misprediction except to fetch new instructions from the correct program path. This is true for all the Ultrascalar models presented in this paper.
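The sequencing rules that the "all preceding stations have finished" CSPP output drives — oldest-station selection and deallocation — can be sketched functionally. This encoding (the function name, the boolean list, the return shape) is our own illustration, not the paper's circuit:

```python
# Sketch (ours) of the next-cycle sequencing decisions driven by the
# "all preceding stations have finished" CSPP of Figure 5: a station that
# has not finished, but whose predecessors all have, becomes the oldest;
# a finished station whose predecessors have all finished is deallocated
# and can be refilled with a new instruction.

def next_cycle_roles(finished, oldest):
    """finished[i]: station i's instruction has completed.
    Returns (new_oldest, deallocated) for the next clock cycle."""
    n = len(finished)
    order = [(oldest + k) % n for k in range(n)]   # program order
    deallocated, prior_done, new_oldest = [], True, None
    for s in order:
        if prior_done and not finished[s] and new_oldest is None:
            new_oldest = s                 # oldest unfinished instruction
        if prior_done and finished[s]:
            deallocated.append(s)          # ready for a new instruction
        prior_done = prior_done and finished[s]
    return new_oldest, deallocated

# Stations 6 and 7 (the two oldest) have finished; Station 0 has not.
new_oldest, freed = next_cycle_roles(
    finished=[False, True, False, True, True, False, True, True],
    oldest=6)
assert new_oldest == 0 and freed == [6, 7]
```

Note that finished stations behind an unfinished one (here Stations 1, 3, and 4) stay allocated, exactly as in the wrap-around window the text describes.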

3 Ultrascalar I Floorplan and Analysis

While the previous section has described the Ultrascalar I datapath and analyzed its asymptotic gate delay, it has not addressed the datapath's wire delay or area. This is because, unlike gate delay, wire delay and area depend on a particular VLSI layout. In this section, we describe a 2-dimensional VLSI layout for the Ultrascalar I and analyze its asymptotic wire delay and area.

Figure 6 shows the floorplan of a 16-station Ultrascalar I datapath. The 16 execution stations are arranged in a two-dimensional matrix and connected to each other and to memory exclusively via networks layed out with H-tree layouts. The parallel prefix trees of Figure 4 and Figure 5 form H-trees of constant bandwidth along each link. Fat-trees that interface stations to memory form fat H-trees with bandwidth increasing along each link on the way to the root. Each node P in Figure 6 consists of one parallel prefix node from Figure 4 and propagates the value of one logical register. Each node M consists of one fat-tree node and routes a number of memory accesses. (The three 1-bit-wide parallel prefix trees that sequence instructions (Figure 5) are not shown since their area is only a small constant factor, typically a hundredth or less, of the area taken up by the register-propagating parallel prefixes.)

To determine the area and wire delays, we start by determining the side length of an n-station Ultrascalar I layout, X(n). The side length, X(n), is equal to twice the side length of an n/4-wide Ultrascalar, X(n/4), plus the width of the wires and nodes connecting the four n/4-wide Ultrascalars. Connecting the registers, there are Θ(L) wires and Θ(L) prefix nodes of constant side length. In addition, there are Θ(M(n)) wires and one fat-tree node used to provide memory bandwidth out of a subtree of n execution stations. The fat-tree node's side length is proportional to its bandwidth and therefore Θ(M(n)). In the base case, a 1-station-wide Ultrascalar has width Θ(L). Thus we have the following recurrence:

  X(n) = Θ(L) + Θ(M(n)) + 2X(n/4)   if n > 1,
         Θ(L)                        otherwise.

This recurrence has solution

  X(n) = Θ(√n · L)            if M(n) = O(n^(1/2−ε)) for ε > 0,   [Case 1]
         Θ(√n · (L + log n))  if M(n) = Θ(n^(1/2)),               [Case 2]
         Θ(√n · L + M(n))     if M(n) = Ω(n^(1/2+ε)) for ε > 0.   [Case 3]

(We assume for Case 3 that M meets a certain "regularity" requirement, namely that M(n/4) ≤ cM(n)/2 for some c and for all sufficiently large n. See [1] for techniques to solve these recurrence relations and for a full discussion of the requirements on M.)

Figure 6: A floorplan of a 16-instruction Ultrascalar I with full memory bandwidth.

The chip area is simply the square of X(n). If L is viewed as a constant, then in two-dimensional VLSI technology the bounds for Cases 1 and 3 are optimal; and the bound for Case 2 is optimal to within a factor of log n (i.e., it is near optimal.) Case 1 is optimal because to lay out n execution stations will require a chip that is Ω(√n) on a side. Case 2 is similar to Case 1. Case 3 is optimal because to provide external memory bandwidth of M(n) requires a side length of Ω(M(n)).

Next we compute the wire delays (or speed-of-light delays) of the Ultrascalar I. Wire delay can be made linear in wire length by inserting repeater buffers at appropriate intervals [2]. Thus we use the terms wire delay and wire length interchangeably here. Given the size of the bounding box for an Ultrascalar I, we can compute the longest wire distances and consequently the longest wire delays as follows. We observe that the total length of the wires from the root to an execution station is independent of which execution station we consider. Let W(n) be the wire length from the root to a leaf of an n-wide Ultrascalar I. We observe that every datapath signal goes up the tree, and then down (it does not go up, then down, then up, then down, for example.) Thus, the longest datapath signal is 2W(n).

The wire length, W(n), is the sum of

  - the distance from the edge of the Ultrascalar to the central block containing the parallel prefix (distance X(n/4)), plus
  - the distance through the central block (which is Θ(L) + Θ(M(n)) on a side), plus
  - the distance from the root of an n/2-wide Ultrascalar to its leaves (distance W(n/2)).

Thus we have the following recurrence for W(n):

  W(n) = X(n/4) + Θ(L + M(n)) + W(n/2)   if n > 1,
         Θ(L)                             otherwise.

This recurrence has solution

  W(n) = Θ(X(n)).

That is, the wire lengths are the same as the side lengths of the chip to within a constant factor. Again, if L is viewed as a constant, and if one assumes that on any processor the far corners of the chip must communicate, then this wire length is optimal or near-optimal.

It is better to not view the number of logical registers, L, as a constant, however. Historically, the number of logical registers specified by an instruction set architecture has increased with each new generation of architectures, from instruction sets with a single accumulator register [19] to modern RISC instruction sets with 64 64-bit-wide registers, such as the Digital/Compaq Alpha processor [16]. It is widely believed that to exploit more parallelism requires more logical registers, as well as more physical registers.

If we do treat L as a constant, then it is a huge constant. Today's processors typically have 64 or more 32-bit or 64-bit registers. Even a small processor with 32 32-bit registers would have 1024 wires in each direction along every edge of the tree. For a 64 64-bit register Ultrascalar I, each node of our H-tree floorplan would require area comparable to the entire area of one of today's processors!

4 Ultrascalar II Design

We have designed the Ultrascalar II which avoids many of the inefficiencies associated with passing all L logical registers to every execution station. The Ultrascalar II passes only the argument and result registers to and from each execution station. This section describes and analyzes the Ultrascalar II processor at gate-level. The next section examines the VLSI layout of the Ultrascalar II. We start out by describing a simplified Ultrascalar II with relatively few instructions that does not wrap around. After describing the basic idea, we show how to use a mesh-of-trees to reduce the gate delays from linear-time to log-time.

Figure 7 shows the design of a 4-instruction Ultrascalar II datapath. For simplicity, the design assumes an instruction set architecture with only four logical registers. The design also assumes that each instruction in the instruction set reads at most two registers and writes at most one. The datapath consists of a register file that holds the initial value of each register, a sequence of four instructions stored in four stations from left to right, a grid-like network, and a memory switch. The grid-like network routes arguments to each station and computes the final value of each register. On every clock cycle, stations with ready arguments compute and newly computed results propagate through the network. Eventually, all stations finish computing and the final values of all the registers are ready. At that time, the final values (upper right corner) are latched
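The closed forms for the side-length recurrence can be sanity-checked numerically. The check below is our own, with every hidden Θ-constant taken to be 1; it exercises Case 1 (M(n) constant) and Case 3 (M(n) = n) and confirms that the computed side length stays within a constant factor of the claimed solution.

```python
# Numeric sanity check (ours) of the recurrence
#   X(n) = L + M(n) + 2*X(n//4),  X(1) = L,
# with all hidden constants set to 1.

def X(n, L, M):
    return L if n <= 1 else L + M(n) + 2 * X(n // 4, L, M)

L = 64
n = 4 ** 10                           # about a million stations
sqrt_n = int(n ** 0.5)

case1 = X(n, L, lambda m: 1)          # Case 1: expect Theta(sqrt(n) * L)
case3 = X(n, L, lambda m: m)          # Case 3: expect Theta(sqrt(n)*L + M(n))
assert 1 <= case1 / (sqrt_n * L) <= 3   # bounded ratio -> Theta(sqrt(n)*L)
assert 1 <= case3 / n <= 3              # memory term M(n) = n dominates
```

Unrolling the recurrence explains both ratios: the register term contributes about 2^k(2L) at depth k = log₄ n, i.e. Θ(√n · L), while M(n) = n contributes a geometrically decreasing sum totalling Θ(n).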

Figure 7: The register datapath and floorplan of the Ultrascalar II.


Figure 8: An Ultrascalar II datapath with four instructions per cluster and four logical registers. The datapath has been reimplemented with fan-out trees and parallel prefix circuits in order to reduce the gate delay to O(log(n + L)).

into the register file (lower left corner). The stations refill with new instructions, and computation resumes. The memory switch will be discussed in Section 5.

Each execution station sends the numbers of the two registers it needs downward. Each execution station sends rightward the number of the register to which it writes. At each cross-point, the register numbers are compared to determine if a particular row provides the value being requested by the station in a particular column. The value from the highest such matching row is propagated upward to the requesting station.

Thus, the Ultrascalar II network shown in Figure 7 consists of columns of comparators and multiplexers. Each column searches through rows of register bindings (number, value, and ready bit) for the nearest matching register number. The value and the ready bit of the nearest match propagate back up the column. For example, the left column of Station 3 searches for the argument R2. The path followed by the Register 2 data is shown in bold. The very first comparison in the column succeeds because the instruction in Station 2 writes R2. Since Station 2 has finished computing, the column returns the value 9 and a high ready bit. Note that the column ignores the earlier, unfinished write to R2 by Station 0, allowing Station 3 to issue out of order. Also shown in bold is the path that the Register 1 data follows to get from Station 1 to Station 3. Note that the result from Station 1 also propagates to the R1 column of the outgoing register values (but the figure does not show that path in bold.) Similarly, the data coming out of the right of Station 3 moves to the rightmost column, where it is muxed and produced as an outgoing register value.

The datapath in Figure 7 has linear asymptotic gate delay. During each clock cycle, the column of the last station serially searches for the station's argument among all the preceding instructions' results and all the initial register values. As a result, the clock period grows as O(n + L) (where, again, n is the number of outstanding instructions and L is the number of logical registers.) The gate delay can be reduced from linear to logarithmic by converting the Ultrascalar II network into a mesh-of-trees network. (For a discussion of mesh-of-trees layouts, see [9].) Instead of broadcasting register numbers, values, and ready bits horizontally to as many as 2n + L - 2 columns, we fan them out through a tree of buffers (i.e., one-input gates that compute the identity.) Similarly, instead of broadcasting each argument's register number downward to as many as n + L - 1 comparators in its column, we fan out the request through a tree of buffers. Once all values have fanned out in O(log(L + n)) gate delay, each comparator carries out its comparison in an additional O(log log L) gate delay. Finally, we replace each column of multiplexers by a (noncyclic) segmented parallel prefix circuit with the associative operator a ⊕ b = a, with the result of each comparison serving as the segment bit, and with the fanned-out register value and ready bit serving as the input. Each parallel prefix tree returns the most recent value and ready bit of its column's register at the root in an additional O(log(n + L)) gate delay. (The tree circuits used here are more properly referred to as reduction circuits rather than parallel-prefix circuits.) Figure 8 shows our original linear-gate-delay datapath converted to a logarithmic-gate-delay datapath. The nodes of the prefix trees have been marked with a P, the nodes of the fan-out trees with an F.

While the register datapath of the Ultrascalar II is very different from the Ultrascalar I's, the two processors have some common ground. Just like the Ultrascalar I, the Ultrascalar II connects stations to memory by fat trees and enforces instruction dependencies using several instances of the CSPP circuit from Figure 5. Both processors rename registers, forward new results in one clock cycle, issue instructions out of order, and revert from branch misprediction in one clock cycle.

The Ultrascalar II as described is less efficient than the Ultrascalar I because its datapath does not wrap around. As a result, stations idle waiting for everyone to finish before refilling. The Ultrascalar II can easily be modified to handle wrap-around, but it makes the explanation needlessly complex without improving the asymptotics. Furthermore, it appears to cost nearly a factor of two in area to implement the wrap-around mechanism. In addition, as we shall see, the Ultrascalar II without wrap-around is a useful component in the hybrid processor.

5 Ultrascalar II Floorplan and Analysis

The previous section has described the Ultrascalar II datapath and analyzed its asymptotic gate delay. This section provides a layout and analyzes the wire delay and area.

The side length of the Ultrascalar II is straightforward to compute. Figure 7 shows our floorplan of the linear-gate-delay Ultrascalar II. The execution stations and the register datapath are laid out just as was shown in Figure 7. That is, the execution stations are laid out along a diagonal, with the register datapath laid out in the triangle below the diagonal. The memory switches are placed in the space above the diagonal. Since M(n) = O(n) in all cases, there is always space in the upper triangle for all the memory switches with at worst a constant blowup in area. Thus, the entire Ultrascalar II can be laid out in a box with side length O(n + L).

Note, however, that if the tree-of-meshes implementation is used to reduce the number of gate delays, then the side length increases to O((n + L) log(n + L)). One can use a mixed strategy in which one replaces the part of each tree near the root with a linear-time prefix circuit. This works well in practice because at some point the wire lengths near the root of the tree become so long that the wire delay is comparable to a gate delay. At that point, there is no asymptotic penalty for increasing the number of gate delays. This scheme's asymptotic results are exactly the same as for the linear-time circuit (the wire delays, gate delays, and side length are all Θ(n)) with greatly improved constant factors. (In our VLSI implementation, we found that there was enough space in our Ultrascalar II datapath to implement about three levels of the tree without impacting the total layout area, since the gates were dominating the area.)

6 Ultrascalar Hybrid

Our third processor is a hybrid of the Ultrascalar I and Ultrascalar II. The hybrid obtains the advantages of the Ultrascalar II for small processors and the advantages of the Ultrascalar I for large processors. The processor is divided into clusters, each containing C stations. The cluster is implemented using the Ultrascalar II datapath, and then the clusters are connected together using the Ultrascalar I datapath. Throughout this section, we will use the linear-gate-delay grid-like datapath of Figure 7. The results easily extend to the log-depth tree-of-meshes datapath of Figure 8.

Figure 10 shows the floorplan of a 32-instruction hybrid Ultrascalar (n = 32) with full memory bandwidth (M(n) = Θ(n)). The processor implements an instruction set with eight logical registers (L = 8). The floorplan contains four clusters, each containing 8 stations (C = 8). Thus, each cluster is an 8-instruction Ultrascalar II. As we shall see, it is not a coincidence that C = L. The four clusters are connected together via an Ultrascalar I datapath. In order to present the correct interface to the Ultrascalar I datapath, each Ultrascalar II cluster has been slightly expanded to generate modified bits, as shown in Figure 9. Each cluster now generates a modified bit for each logical register using either a series of OR gates or a tree of OR gates.
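The nearest-match column search described in Section 4, and its replacement by a log-depth segmented reduction with the associative operator a ⊕ b = a, can be sketched in software. This is an illustrative model only; the function name and data layout below are our own, not the paper's implementation:

```python
# Illustrative model of one Ultrascalar II column: find the nearest
# preceding write to a requested register with a log-depth tree
# reduction, mirroring the segmented parallel prefix circuit.

def nearest_match(rows, want):
    """rows: (regnum, value, ready) tuples ordered oldest-first
    (initial register bindings, then earlier stations' results).
    Returns the (value, ready) of the most recent write to `want`,
    or None if no row writes it."""
    # Keep a row only if its register number matches the request
    # (this is the comparator at each cross-point).
    leaves = [(v, rdy) if r == want else None for (r, v, rdy) in rows]

    # "Last non-None" is associative, so a balanced tree of these
    # combines finds the nearest match in O(log(n + L)) levels,
    # replacing the serial down-the-column scan.
    def combine(older, newer):
        return newer if newer is not None else older

    while len(leaves) > 1:
        if len(leaves) % 2:
            leaves.append(None)          # pad to an even length
        leaves = [combine(leaves[i], leaves[i + 1])
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```

For the example in the text, searching Station 3's column for R2 returns Station 2's value 9 with a high ready bit, ignoring Station 0's earlier, unfinished write because Station 2's write is nearer.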


Figure 9: The cluster for the Hybrid Ultrascalar. The Hybrid Ultrascalar cluster includes OR gates to generate modified bits, whereas Figure 7 shows an Ultrascalar II cluster which has no modified bits. This figure shows a cluster of size 4 instead of size 8 so that the circuitry can be more easily seen.

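The modified-bit generation shown in Figure 9 reduces to a simple OR per logical register. A hedged sketch (function name and encoding are ours, not the paper's):

```python
# A cluster's modified bit for logical register r is the OR, across the
# cluster's stations, of "this station writes r" (Figure 9).

def modified_bits(dest_regs, num_logical_regs):
    """dest_regs: each station's destination register number, or None
    if that instruction writes no register."""
    bits = [False] * num_logical_regs
    for r in dest_regs:
        if r is not None:
            bits[r] = True       # one more input to register r's OR
    return bits
```

In hardware the OR over the cluster's stations can be a serial chain of OR gates or a log-depth OR tree; the result is the same.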
Two instructions communicate their results, one to the other, as follows. If the two instructions happen to be placed in the same cluster, then they use the grid-like Ultrascalar II datapath. The two instructions may not always end up in the same cluster. If there are more than 6 instructions in the dynamic sequence between the two communicating instructions, then it must be the case that the two instructions are placed in different clusters. Even if one of the instructions immediately follows the other, they can be placed in different clusters. For example, the first instruction may be placed in Station 7 of Cluster 0, and the next instruction may be placed in Station 0 of Cluster 1. In this case, the data from the first instruction travels to the outgoing register values via the grid-like datapath, then it travels through the tree-like Ultrascalar I datapath to get from Cluster 0 to Cluster 1, and then it travels through Cluster 1's grid-like datapath to get to its destination.

From the viewpoint of the Ultrascalar I part of the datapath, a single cluster behaves just like a subtree of 8 stations in the ordinary Ultrascalar I circuit. The cluster executes up to 8 contiguous instructions in parallel, and produces up to 8 new values with the modified bits set. Alternatively, we can view a cluster as taking on the role of a single "super" execution station that is responsible for running 8 instructions instead of only one. In this view, each cluster behaves just like an execution station in the Ultrascalar I. Just like in the Ultrascalar I, exactly one cluster is the oldest on any clock cycle, and the committed register file is kept in the oldest cluster. This works by having all clusters other than the oldest latch incoming registers into their register files, while the oldest cluster inserts the initial register file into the datapath. As before, newly written results propagate to all readers in one clock cycle.
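The routing decision just described can be sketched in a few lines (the global station numbering and the helper below are our illustration, not the paper's interface):

```python
# How a result travels in the hybrid: within a cluster, the grid-like
# Ultrascalar II datapath alone; across clusters, out through the
# producer's grid, over the tree-like Ultrascalar I datapath, and in
# through the consumer's grid. Station indices are global; C is the
# number of stations per cluster.

def route(producer, consumer, C):
    pc, cc = producer // C, consumer // C    # cluster of each station
    if pc == cc:
        return [f"Ultrascalar II grid (cluster {pc})"]
    return [f"Ultrascalar II grid (cluster {pc})",
            f"Ultrascalar I tree (cluster {pc} -> cluster {cc})",
            f"Ultrascalar II grid (cluster {cc})"]
```

With C = 8, the text's example of Station 7 of Cluster 0 feeding Station 0 of Cluster 1 is route(7, 8, 8), which takes all three hops even though the two instructions are adjacent in the dynamic sequence.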
Figure 10: The floorplan of our Hybrid Ultrascalar with four clusters of eight instructions each, and with full memory bandwidth.

The hybrid can be analyzed as follows. For the hybrid processor, with clusters containing C execution stations and implemented with linear gate delay, it will turn out that the optimum value of C is about equal to L. The side length of an n-wide hybrid processor, U(n), is expressed by a recurrence that is similar to the Ultrascalar I recurrence:

    U(n) = Θ(n + L)                  if n ≤ C,
    U(n) = Θ(L + M(n)) + 2U(n/4)     if n ≥ C.

For n ≥ C, this recurrence has solution

    U(n) = Θ(M(n) + L√(n/C) + √(nC)).

To find the value of C that minimizes U(n), one can differentiate and solve dU/dC = 0, to conclude that the side length is minimized when C = Θ(L). (Thus, our decision to set the cluster size to L in our implementation is close to the optimal cluster size.) When C = Θ(L) we have

    U(n) = Θ(M(n) + √(nL)).

This is optimal as a function of M and existentially tight as a function of n and L. (There are cases of n and L for which it can be proved that the side length must be Ω(√(nL)). It seems likely that this bound is universally tight, but we have not proved it.) It is straightforward to implement and analyze a hybrid using clusters that have only logarithmic delays.

7 Analytical and Empirical Comparison

Here we present both an analysis and an empirical study comparing the complexity of the three processors.

Figure 11 shows a summary of the results for the various processors. The analysis shows that the hybrid dominates the other processors. The Ultrascalar I and Ultrascalar II are incomparable, each beating the other in certain cases. For example, for smaller processors (n < O(L^2)) the Ultrascalar II dominates the Ultrascalar I by a factor of Θ(L/√n), but for larger processors the Ultrascalar I dominates the Ultrascalar II. In fact, for large processors (n = Ω(L)) with low memory bandwidths (M(n) = O(n^(1/2-ε))) the Ultrascalar I wire delays beat the Ultrascalar II by a factor of √n/L, and the hybrid beats the Ultrascalar I by an additional factor of √L.

Our analytical results show that memory bandwidth is the dominating factor in the design of large-scale processors. If processors require memory bandwidth linear in the number of outstanding instructions (M(n) = Θ(n)), the wire delays must also grow linearly. In this case, all three processors are asymptotically the same. Memory bandwidth does not even have to grow linearly to cause problems, however. If it grows as slowly as Ω(n^(1/2+ε)), it can dominate the design. One way to reduce the bandwidth requirements may be to use a cache distributed among the clusters. The memory bandwidth pressure can also be reduced by using memory-renaming hardware, which can be implemented by CSPP circuits. With the right caching and renaming protocols, it is conceivable that a processor could require substantially reduced memory bandwidth, resulting in dramatically reduced chip complexity.

How does the Ultrascalar scale in three dimensions? It depends on what is meant by "three dimensions". For a two-dimensional chip packaging technology in which there are pads spread throughout the chip, but in which the chip is packaged on a two-dimensional board, the bounds are essentially unchanged. In this situation, the pins are more like inter-layer vias than a true three-dimensional technology. In a true three-dimensional packaging technology, the number of wires stacked in the third dimension is comparable to the number of wires in the other dimensions. Three-dimensional technology has been used at the board level (daughterboards are placed on the motherboards of many computer systems).

In a true three-dimensional packaging technology the Ultrascalar bounds do improve because, intuitively, there is more space in three dimensions than in two. The recurrence relations are similar to the ones above, and they can be similarly solved. An Ultrascalar I with small memory bandwidth can be laid out in volume nL^(3/2) with wire lengths n^(1/3)L^(1/2), whereas large memory bandwidth (M(n) = Ω(n^(2/3+ε))) requires an additional volume of Θ((M(n))^(3/2)). This can be understood at an intuitive level by noting that for large memory bandwidth, the surface area of the bounding box for n stations must be at least Ω(M(n)), and hence the side length must be Ω((M(n))^(1/2)). The Ultrascalar II requires volume only O(n^2 + L^2) whether the linear-depth or log-depth circuits are used, whereas in two dimensions an extra log n factor in area is required to achieve log-depth circuits. For the hybrid, in systems with small memory bandwidth, the optimal cluster size is Θ(L^(3/4)), as compared to Θ(L) in two dimensions. The total volume of the hybrid is O(nL^(3/4)), as compared to an area of O(nL) in two dimensions.

To study the empirical complexity of these processors, our research group implemented VLSI layouts of the Ultrascalar I, the Ultrascalar II, and the hybrid register datapaths using the Magic design tools [11]. Our layouts for varying numbers of outstanding instructions confirm the scaling properties of the various processors. Moreover, the layouts demonstrate the practical advantages of the hybrid for foreseeable values of n and L.

We have chosen to implement a very simple RISC instruction set architecture. Our architecture contains 32 32-bit logical registers. It does not implement floating point operations. Each instruction in the architecture reads at most two registers and writes at most one.

In order to generate processor VLSI layouts in an academic setting, we had to optimize for speed of design. Instead of designing at transistor level, we based our designs on a library of CMOS standard cells that we wrote. Thus the transistors have not been ideally sized for their respective jobs. For the most part, we did not worry about the optimal size of our gates or the thickness of our wires since these factors are comparable in the two designs. Our academic design tools also limited us to only three metal layers. As a result, we believe that a team of industrial engineers could significantly reduce the size of our layouts. The relative sizes of the layouts should not change much, however.

Figure 12 shows register datapath layouts of the two processors and their relative size. The layouts implement communication among instructions; they do not implement communication to memory. Figure 12(a) shows a 64-instruction-wide Ultrascalar I register datapath corresponding in functionality to Figure 4. Figure 12(b) shows a 128-instruction-wide 4-cluster hybrid Ultrascalar register datapath. Both processors use the same ALUs. The same H-tree interconnect that connects execution stations in the Ultrascalar I also connects clusters in the hybrid.

Both layouts are implemented in a 0.35 micrometer CMOS technology with three layers of metal. (Today's real 0.35 micrometer technologies typically have more than three layers of metal.) The Ultrascalar I datapath includes 64 processors in an area of 7 cm x 7 cm, which is 13,000 processors per square meter. The hybrid datapath includes 128 processors in an area of 3.2 cm x 2.7 cm, which is 150,000 processors per square meter (about 11.5 times denser.)

The details of the hybrid datapath as implemented with Magic differ somewhat from the floorplan shown in Figure 10. A large fraction of the area of a cluster shown in Figure 10 is used to move the incoming registers to the same edge of the cluster as the outgoing registers.
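The wire-delay crossovers summarized in Figure 11 can be illustrated numerically. This back-of-the-envelope sketch is ours: it drops constant factors and assumes the low-memory-bandwidth regime, so the M(n) terms are omitted:

```python
# Leading wire-delay terms from Figure 11 (low memory bandwidth):
#   Ultrascalar I  ~ sqrt(n) * L
#   Ultrascalar II ~ n + L
#   hybrid         ~ sqrt(n * L)
from math import sqrt

def wire_delays(n, L):
    return {"UI": sqrt(n) * L, "UII": n + L, "hybrid": sqrt(n * L)}

for n in (64, 1024, 65536):      # below, near, and above n = L^2
    print(n, wire_delays(n, L=32))
```

For L = 32 the Ultrascalar II beats the Ultrascalar I below roughly n = L^2 = 1024 and loses above it, while the hybrid's delay is smallest in every regime, matching the dominance claims above.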
                      Ultrascalar I        Ultrascalar II   Ultrascalar II              Hybrid
                      (n = Ω(L))           (linear gates)   (log gates)                 (linear-gate clusters)

M(n) = O(n^(1/2-ε)):
  Gate Delay          Θ(log n)             Θ(n + L)         Θ(log(n + L))               Θ(L + log n)
  Wire Delay          Θ(√n L)              Θ(n + L)         Θ((n + L) log(n + L))       Θ(√(nL))
  Total Delay         Θ(√n L)              Θ(n + L)         Θ((n + L) log(n + L))       Θ(√(nL))
  Area                Θ(nL^2)              Θ(n^2 + L^2)     Θ((n + L)^2 log^2(n + L))   Θ(nL)

M(n) = Θ(n^(1/2)):
  Gate Delay          Θ(log n)             Θ(n + L)         Θ(log(n + L))               Θ(L + log n)
  Wire Delay          Θ(√n (L + log n))    Θ(n + L)         Θ((n + L) log(n + L))       Θ(√(nL))
  Total Delay         Θ(√n (L + log n))    Θ(n + L)         Θ((n + L) log(n + L))       Θ(√(nL))
  Area                Θ(n(L^2 + log^2 n))  Θ(n^2 + L^2)     Θ((n + L)^2 log^2(n + L))   Θ(nL)

M(n) = Ω(n^(1/2+ε)):
  Gate Delay          Θ(log n)             Θ(n + L)         Θ(log(n + L))               Θ(L + log n)
  Wire Delay          Θ(√n L + M(n))       Θ(n + L)         Θ((n + L) log(n + L))       Θ(√(nL) + M(n))
  Total Delay         Θ(√n L + M(n))       Θ(n + L)         Θ((n + L) log(n + L))       Θ(√(nL) + M(n))
  Area                Θ(nL^2 + (M(n))^2)   Θ(n^2 + L^2)     Θ((n + L)^2 log^2(n + L))   Θ(nL + (M(n))^2)

Figure 11: A comparison of the results for the various processors.

In our Magic layout we used additional metal layers to route the wires for the incoming registers over the datapath instead, saving that area. We also did not lay out the memory datapath; however, we left space in the design for a small datapath of size M(n) = Θ(1). In addition, the ALUs do not fit along the diagonal as tightly as shown in the floorplan. We placed the 32 ALUs of each cluster in 4 columns of 8 ALUs each, arrayed off the diagonal.

One way to improve the overall performance of all the processors is to change the timing methodology. We described, in this paper, processors that use a global single-phase clock with all communications between components being completed in one clock cycle. For each of the three processors, it is possible to pipeline the system, however, so that the long communications paths would include latches. Taking this idea to an extreme, all three processors could be operated in a self-timed fashion. Understanding the overall performance improvement of such schemes will require detailed performance simulations, since some operations, but not all, would then run much faster. A back-of-the-envelope calculation is promising, however: half of the communications paths from one station to its successor are completely local. In such a processor, a program could run faster if most of its instructions depend on their immediate predecessors rather than on far-previous instructions.

The Ultrascalar ideas could be realizable in a few years. To make this happen, more effort needs to be spent to reduce the various constant factors in the design. For example, in the designs presented here, the ALU is replicated n times for an n-issue processor. In practice, ALUs can be effectively shared (especially floating-point ALUs), reducing the chip area further. We have shown how to implement efficient scheduling logic for a superscalar processor that shares ALUs [6]. We believe that in a 0.1 micrometer CMOS technology, a hybrid Ultrascalar with a window size of 128 and 16 shared ALUs (with floating point) should fit easily within a chip 1 cm on a side.

Acknowledgments

Dr. Christopher Joerg of Compaq's Cambridge Research Laboratory pointed out the trend of increasing numbers of logical registers and argued that we should treat the number of logical registers as a scaling parameter. Yale graduate student Vinod Viswanath laid out the Ultrascalar I datapath in VLSI.

References

[1] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Electrical Engineering and Computer Science Series. MIT Press, Cambridge, MA, 1990.
[2] William J. Dally and John W. Poulton. Digital Systems Engineering. Cambridge University Press, 1998.
[3] James A. Farrell and Timothy C. Fischer. Issue logic for a 600-MHz out-of-order execution microprocessor. IEEE Journal of Solid-State Circuits, 33(5):707-712, May 1998.
[4] Bruce A. Gieseke et al. A 600MHz superscalar RISC microprocessor with out-of-order execution. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC '97), pages 176-177, February 1997.
[5] Dana S. Henry and Bradley C. Kuszmaul. Cyclic segmented parallel prefix. Ultrascalar Memo 1, Yale University, 51 Prospect Street, New Haven, CT 06525, November 1998. http://ee.yale.edu/papers/usmemo1.ps.gz.
[6] Dana S. Henry and Bradley C. Kuszmaul. An efficient, prioritized scheduler using cyclic prefix. Ultrascalar Memo 2, Yale University, 51 Prospect Street, New Haven, CT 06525, 23 November 1998. http://ee.yale.edu/papers/usmemo2.ps.gz.
[7] Dana S. Henry, Bradley C. Kuszmaul, and Vinod Viswanath. The Ultrascalar processor: an asymptotically scalable superscalar microarchitecture. In The Twentieth Anniversary Conference on Advanced Research in VLSI (ARVLSI '99), pages 256-273, Atlanta, GA, 21-24 March 1999. http://ee.yale.edu/papers/usmemo3.ps.gz.
[8] Monica S. Lam and Robert P. Wilson. Limits of control flow on parallelism. In The 19th Annual International Symposium on Computer Architecture (ISCA '92), pages 46-57, Gold Coast, Australia, May 1992. ACM SIGARCH Computer Architecture News, Volume 20, Number 2.
[9] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, CA, 1992.
[10] Charles E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers, C-34(10):892-901, October 1985.
[11] J. K. Ousterhout, G. T. Hamachi, R. N. Mayo, W. S. Scott, and G. S. Taylor. Magic: A VLSI layout system. In ACM IEEE 21st Design Automation Conference, pages 152-159, Los Angeles, CA, USA, June 1984. IEEE Computer Society Press.

Figure 12: (a) A 64-instruction-wide Ultrascalar I register datapath. (b) A 128-instruction-wide 4-cluster hybrid Ultrascalar register datapath.

[12] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA '97), pages 206-218, Denver, Colorado, 2-4 June 1997. ACM SIGARCH and IEEE Computer Society TCCA. http://www.ece.wisc.edu/~jes/papers/isca.ss.ps.
[13] Sanjay Jeram Patel, Marius Evers, and Yale N. Patt. Improving trace cache effectiveness with branch promotion and trace packing. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 262-271, Barcelona, Spain, 27 June-1 July 1998. IEEE Computer Society TCCA and ACM SIGARCH, IEEE Computer Society, Los Alamitos, CA. Published as Computer Architecture News, 26(3), June 1998. http://www.eecs.umich.edu/HPS/pub/promotion-isca25.ps.
[14] Yale N. Patt, Sanjay J. Patel, Marius Evers, Daniel H. Friendly, and Jared Stark. One billion transistors, one uniprocessor, one chip. Computer, 30(9):51-57, September 1997. http://www.computer.org/computer/co1997/19051abs.htm.
[15] Eric Rotenberg, Steve Bennett, and James E. Smith. Trace cache: A low latency approach to high bandwidth instruction fetching. In Proceedings of the 29th Annual International Symposium on Microarchitecture (MICRO 29), pages 24-34, Paris, France, 2-4 December 1996. http://www.cs.wisc.edu/~ericro/TCmicro29.ps.
[16] Richard L. Sites, Richard T. Witek, et al. Alpha Architecture Reference Manual. Digital Press, Boston, MA, third edition, 1998.
[17] Peter A. Steenkiste and John L. Hennessy. A simple interprocedural register allocation algorithm and its effectiveness for lisp. ACM Transactions on Programming Languages and Systems, 11(1):1-32, January 1989.
[18] TI's 0.07-micron CMOS technology ushers in era of gigahertz DSP and analog performance. http://www.ti.com/sc/docs/news/1998/9807, 26 August 1998. Accessed April 4, 1999.
[19] D. J. Wheeler. Programme organization and initial orders for EDSAC. Proc. of the Royal Soc., 202:573-589, 1950.
[20] T.-Y. Yeh, D. T. Marr, and Y. N. Patt. Increasing the instruction fetch rate via multiple branch prediction and a branch address cache. In Conference Proceedings, 1993 International Conference on Supercomputing, pages 67-76, Tokyo, Japan, 20-22 July 1993. ACM SIGARCH.
