To be published in the Proceedings of the 31st Annual International Symposium on Microarchitecture, December 1998.

Simple Vector Microprocessors for Multimedia Applications

Corinna G. Lee and Mark G. Stoodley
{corinna,stoodla}@eecg.toronto.edu
Department of Electrical and Computer Engineering
University of Toronto

Abstract

In anticipation of the emergence of multimedia applications as an important workload, microprocessor companies have augmented their instruction-set architectures with short vector extensions, thus adding basic vector hardware to state-of-the-art superscalar processors. Although a vector architecture may be a good match for multimedia applications, there is growing evidence that the control logic for increasingly complex superscalar processors is difficult to implement.

Rather than combining a complex superscalar core with short wide vector hardware, we propose using a much simpler processor design that is similar to traditional vector computers with long vectors and simple control logic for instruction issue. Such a design would use the bulk of its transistors and die area for datapath and registers, and thus lessen the time required to design, implement, and verify control.

In this paper, we present data that quantifies this trading of control transistors for datapath and register transistors. We demonstrate that a 2-way, in-order vector processor with a vector length of 64 and a vector width of 8 requires no more die area, and possibly significantly less area, than a 4-way, out-of-order superscalar processor with short vector extensions. Furthermore, we show that the simple long vector processor is, on average, 2.7 times faster executing multimedia applications than the superscalar processor, and 1.6 times faster than one with short vector extensions.

To explain the reasons for the higher performance, we analyze execution time in terms of dynamic operation count and cycles per operation (CPO). A vector processor executes fewer operations by using vector instructions to stripmine a loop. Moreover, a long vector processor achieves a lower CPO by effectively using parallelism at both the operation and the instruction levels. Thus by reducing both terms of the CPO equation, the simple long vector processor achieves greater performance.

1. Introduction

Advances in microprocessor design over the past decade have been primarily driven by two application domains: technical and scientific applications for uniprocessor desktops, and transaction processing and file-system workloads for multiprocessor servers. It is expected, however, that application domains will shift over the next two decades. Although it is difficult to predict what future applications will be, there is a growing consensus that multimedia applications will increase in importance as greater priority is given to more human-friendly interfaces and to personal mobile computing [7, 20].

In anticipation of this emerging applications area, microprocessor companies have augmented their instruction-set architectures with short vector extensions, thus adding basic vector hardware to state-of-the-art superscalar processors. Figure 1 lists the extensions that have been introduced or announced by all major microprocessor companies.

A common aspect of the vector extensions listed in Figure 1 is that all use a wide datapath that is partitioned to execute narrower data types in parallel. These narrower data types are more typical for multimedia applications, which manipulate sound and image data. Almost all use a 64-bit datapath; the HP MAX-1 uses a 32-bit datapath while the PowerPC AltiVec uses a 128-bit datapath.

It is useful to characterize a vector implementation in terms of its vector length and vector width. Vector length is the maximum number of operations that a vector instruction can execute, while vector width refers to the number of operations that are executed in one clock cycle for a vector instruction.

Thus, each 64-bit vector extension can be viewed as an extremely short vector architecture with a vector length of 8 and a vector width of 8 for 8-bit data types. Wider data types are executed with even shorter vector lengths. Such vector configurations are quite different from the more typical vector lengths of 64 or 128 and vector widths of 1 or 2 which appear in vector supercomputers.

Although a vector architecture may be a good match for multimedia applications, there is mounting evidence that combining one with increasingly complex superscalar processors will be difficult to implement. Over the past two years, shipments of superscalar processors have been delayed repeatedly in order to meet target speeds [15, 14, 16]. Late shipments are often attributed to complex out-of-order designs. Promises of a huge transistor budget within a decade offer the possibility of implementing even more aggressive and complex designs [4].

Processor                 | Short Vector Extension                          | Year
Sun UltraSPARC            | VIS: Visual Instruction Set [19]                | 1995 shipped
Hewlett-Packard PA-RISC   | MAX-1: Multimedia Acceleration eXtensions [24]  | 1995 shipped
                          | MAX-2 [25]                                      | 1996 shipped
Silicon Graphics MIPS     | MDMX: MIPS Digital Media eXtension [12]         | 1996 announced
Digital Alpha             | MVI: Motion Video Instructions [5]              | 1996 announced
Intel x86                 | MMX: MultiMedia eXtensions [28]                 | 1996 announced, 1997 shipped
Intel Katmai              | SIMD floating-point extensions [18]             | 1998 beta
Motorola PowerPC          | AltiVec [27]                                    | 1998 announced

Figure 1. Short Vector Extensions in General-Purpose Microprocessors

Rather than combine a complex superscalar core with short wide vector hardware, we propose using a much simpler processor design that is similar to traditional vector computers with long vectors and simple control logic for instruction issue. Such a design would use the bulk of its transistors and die area for datapath and registers, and thus lessen the time required to design, implement, and verify control.

In this paper, we present data that quantifies this trading of control transistors for datapath and register transistors. We demonstrate that a 2-way, in-order vector processor with a vector length of 64 and a vector width of 8 requires no more die area, and possibly significantly less area, than a 4-way, out-of-order superscalar processor with short vector extensions. Furthermore, we show that the simple long vector processor is, on average, 2.7 times faster executing multimedia applications than the superscalar processor, and 1.6 times faster than one with short vector extensions.

For this paper, we focus on the use of vector architectures for multimedia applications because, as mentioned earlier, multimedia applications are growing in importance and because the effectiveness of vector architectures in other application areas has been reported elsewhere. Their effectiveness on scientific and engineering applications has been demonstrated by their historically dominant use in the supercomputing arena, while other researchers are currently investigating the vectorizability of SPECint programs [1].

The remainder of the paper is organized as follows. In the next section, we describe the details of the processors that we study. In Section 3, we give area estimates for the simple long vector processor and compare its area to those of existing OOO superscalar processors. In Section 4, we present CPI and CPO analyses of simulation-based performance data to explain why greater performance is achieved by the long vector processor.

2. Processor Configurations

Figure 2 lists the features of the processors in our study. We include an out-of-order, 4-way superscalar processor for comparative purposes. The OOO superscalar processor is modeled after the MIPS R10000 with a PA8000-sized re-order buffer [30, 21].

The features that are most relevant to this study are issue order, issue width, vector length, and vector width. The last two features are determined by the configuration of the vector register file and the vector datapath, respectively. These features are varied in different combinations for the three processors while other features are not. For example, the ISA, cache-based memory system, and memory bandwidth are the same for all three processors. In this way, the impact on performance and die area of the common features should be approximately the same across all processors. Thus any performance or cost differences that we observe can be attributed to the four features of interest.

The vector processors are based on the Torrent-0 (T0) microprocessor [2, 29]. The T0 is a single-chip vector microprocessor that was implemented by researchers at the University of California at Berkeley. It is fabricated with Hewlett-Packard's CMOS26G process using 1.0 μm scalable CMOS design rules and two metal layers, and was first fully functional in April 1995 at 45MHz.(1) Unlike vector supercomputers, the T0 implementation is inexpensive by virtue of being fabricated as a single VLSI chip [23]. In addition to being inexpensive, T0 is also a "nimble" vector implementation [1]. Much of T0's nimbleness can be attributed to the tight integration of the scalar processor and vector hardware on a single die, thus reducing the scalar overhead of vector execution significantly.

(1) The main reason for the relatively slow clock rate is the coarser process technology. The 45MHz clock rate is actually competitive with full-custom commercial processors implemented in similar processes [1].

FEATURE                | OOO Superscalar                    | OOO Short Vector                                    | Simple Long Vector
ISA                    | 64b MIPS                           | 64b MIPS with vector extensions                     | 64b MIPS with vector extensions
issue order            | out of order                       | out of order                                        | in order
issue width            | 4 instructions                     | 4 instructions                                      | 2 instructions
fetch width            | 4 instructions                     | 4 instructions                                      | 2 instructions
re-order buffer size   | 56 instructions                    | 56 instructions                                     | --
# physical registers   | 64 integer; 64 floating-point      | 64 integer; 64 floating-point; 32 8-element vector  | 32 integer; 32 floating-point; 32 64-element vector
datapath               | 2 integer units; 1 load/store unit | 2 integer units; 1 load/store unit; 1 VU with 8 IUs | 2 integer units; 1 load/store unit; 1 VU with 8 IUs
memory system          | 64-bit data bus, 64-bit address bus; 2-level cache memory based on the R10000 implementation (common to all three processors)
C compiler             | SGI V5.3 -O2                       | SGI V5.3 -O2 and VSUIF V1.1.0                       | SGI V5.3 -O2 and VSUIF V1.1.0

Figure 2. Superscalar and Vector Processors

T0's single-die implementation also allows back-to-back vector instructions to execute in the same vector unit without any intervening recovery cycles. A vector unit in a supercomputer with its board-level implementation must typically wait an additional 4 cycles when executing successive vector instructions. Additional advantages of implementing a vector architecture in microprocessor technology, rather than supercomputer technology, include an area-efficient implementation of a high-bandwidth vector register file, more efficient energy consumption, and microprocessor-like operational latencies.

Although we base the details of our vector implementation on the T0, we use our own vector ISA [22]. We defined a separate vector ISA in order to support floating-point data for graphics applications as well as scientific and engineering applications. Moreover, our vector extension is different from the MIPS MDMX short vector extension [12] because we are also interested in studying the effects of varying vector lengths and vector widths.

Our vector MIPS ISA is defined in the coprocessor 2 opcode space and includes memory and computational instructions that execute in vector mode. The memory instructions include loads and stores of various data widths using unit stride, constant stride, or gather/scatter. The computational instructions include integer arithmetic, logical, shift, and floating-point instructions. The vector ISA is a load/store architecture which defines 32 vector registers, each of which consists of 64 elements. Similar to traditional vector architectures, and unlike the recent short vector extensions, our vector ISA includes a vector length register (VLR) that specifies the number of operations a vector instruction executes.
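The interplay of vector length and vector width can be summarized with a small model. The sketch below is illustrative only and is not the actual ISA definition or simulator code: it shows, in C, that a single vector add performs VLR element operations (up to the 64-element maximum), while the 8-wide vector unit retires at most 8 of them per clock.

    /* Illustrative model only (not the actual ISA): one vector add under a
     * vector length register.  A single instruction specifies vlr element
     * operations; the 8 integer units of the vector unit complete at most
     * 8 of them per clock, so the instruction occupies ceil(vlr/8) cycles. */
    #define MAX_VL        64           /* elements per vector register     */
    #define VECTOR_WIDTH   8           /* integer units in the vector unit */

    typedef struct { long elem[MAX_VL]; } vreg_t;

    static unsigned vector_add(vreg_t *vd, const vreg_t *va, const vreg_t *vb,
                               unsigned vlr)
    {
        for (unsigned i = 0; i < vlr; i++)        /* vlr operations, one instruction */
            vd->elem[i] = va->elem[i] + vb->elem[i];
        return (vlr + VECTOR_WIDTH - 1) / VECTOR_WIDTH;   /* cycles occupied */
    }

With vlr = 64, one instruction encodes 64 operations and occupies 8 cycles of the vector unit; the short vector extensions of Figure 1 correspond to the degenerate case where vlr is at most 8.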
T0-like features in the vector processors we study include chaining of data-dependent vector instructions and zero-cycle recovery time for the vector unit when executing successive vector instructions. As in the T0, both vector memory and vector computational instructions can be chained. In addition to handling RAW data hazards, the vector processors also resolve RAR structural hazards. Finally, operational latencies are the same as in the R10000 [30], although the integer units in the vector datapath include a 16x64 multiplier array to allow vector multiply instructions for narrower data widths to execute at a fully pipelined rate. Vector multiplication of wider data types has a non-unit repeat latency.

The bus interface to the memory system consists of a 64-bit address bus and a 64-bit data bus. At most, only one address can be placed on the 64-bit address bus in one cycle. As a result, only unit-stride vector instructions can proceed at the maximum memory vector width. Non-unit-stride vector memory instructions proceed with a vector width of one.

Because the memory data bus is 64 bits, the memory vector widths for unit-stride vector instructions are determined by the width of the data being accessed. For example, 8 8-bit data or 4 16-bit data could be accessed in a single clock cycle. On the other hand, the vector width for computational vector instructions is fixed at 8 by the number of integer units in a vector unit. Thus it is possible to have a computational vector width of 8 and a memory vector width of 4 when processing halfword data. Chaining hardware automatically provides the necessary dependency checks to execute chained computational and memory instructions of different vector widths.

The memory system itself is a 2-level cache system based on the R10000. The split L1 caches are each 64KB, 2-way set-associative with 32B lines and 4 banks. The unified L2 cache is 512KB, 4-way set-associative with 64B lines and 8 banks. The L1 miss penalty is 6 cycles and the L2 miss penalty is 19 cycles.

We emulate the execution behavior of an OOO short vector extension using our more conventional vector ISA and set the vector length to 8. In addition, rather than using a partitioned 64-bit datapath, we instead use a vector unit containing 8 integer units. How an 8-wide vector datapath is implemented primarily affects the cost of the implementation and has little impact on performance. Thus, when determining the area requirements of the OOO short vector processor we assume the partitioned datapath implementation, but for our performance study we assume the vector-unit implementation to simplify our simulation infrastructure.

By emulating the OOO short vector extensions with a more conventional vector datapath, we will overestimate the performance of programs using data types that are wider than 8 bits. This is because in a 64-bit partitioned vector implementation, a program with 16-bit data would execute with a vector length of 4 and a vector width of 4. On the other hand, the OOO short vector processor in our study executes such a program with a vector length of 8, a vector memory width of 4, and a vector computational width of 8.

3. Processor Die Areas

In this section, we estimate the area required to implement the simple long vector processor and compare its area to those of existing OOO superscalar processors. We focus only on the processor portion of the die since, in our performance study, we assume the same cache-based memory system for all the processors.

3.1. Simple Long Vector Processor

We determine the area required to implement the simple long vector processor by estimating the space requirements for each of the major processor components: vector datapath, vector register file, scalar datapath, scalar register file, and control logic for instruction issue.

Using the Torrent-0 as a realistic VLSI implementation of a vector microprocessor, we conducted a detailed analysis of T0's area requirements in a previous study [23]. Our analysis showed that, of the major processor components, the most area-intensive for a vector processor are the datapath and vector register file, in contrast to aggressive superscalar processors where control for instruction issue dominates.

For this study, we improve upon our previous analysis by using more precise estimates for the component areas. These are listed in Figure 3. We give areas for two implementations of the long vector processor: an area-efficient implementation and a high-performance implementation; the two differ only in the areas of the vector integer units and the vector register file. As in our previous study, these areas are based on actual implementations. Because the implementations on which we base the estimates are implemented in varying process technologies, we scale all the areas to a common line size of 0.25 μm.

                                          AREA IN MM²
PROCESSOR COMPONENT                       | Area Efficient | High Performance
64b Vector Datapath
  8 integer units                         | 24.0           | 36.0
  load/store unit                         | 3.0            | 3.0
64b Vector Register File
  32 64-element vector registers          | 9.5            | 19.0
64b MIPS
  scalar integer and FP datapath          | 10.3           | 10.3
  scalar integer and FP register file     | 0.5            | 0.5
  instruction issue                       | 0.8            | 0.8
Clocking and Overhead                     | 4.0            | 4.0
TOTAL                                     | 52.3           | 73.8

Figure 3. Component Areas for Two Implementations of the Long Vector Processor

The area-efficient 64b integer units are 3mm² each. This area has been extrapolated from T0's 32b MIPS scalar datapath and 16x16 multiplier array as follows: 2 × (area of 32b scalar datapath) + 4 × (area of 16x16 multiplier array) = 2 × 1 mm² + 4 × 0.25 mm² = 3 mm². The 16x64 multiplier array allows vector multiplies of narrow data types (≤ 16 bits) to be fully pipelined.

As independent evidence that 3mm² is a reasonable area estimate for a 64b integer unit, we measured the areas of integer units in several in-order and out-of-order superscalar implementations using annotated photomicrographs and floorplans [11, 3, 13, 10, 26, 9]. The areas of the integer "execution box" for in-order processors such as the MIPS R5000 and the Alpha 21164 range from 3.25mm² to 3.72mm². These execution boxes include iterative multiplier/dividers and two ALUs. The OOO counterparts, the MIPS R10000 and the Alpha 21264, have ALU/multiplier units that measure 4.5-4.65mm². Additional bussing and reservation stations to support out-of-order execution are the reasons for the increase in area.

Because the simple long vector processor uses in-order issue, it would use integer units closer in design to the in-order units. Although the area we choose is slightly smaller than that of the in-order units, the T0 implementation suggests that this is a conservative estimate. The individual integer units of T0's vector datapath are much smaller than the MIPS scalar datapath because they do not require hardware, such as logic for PC and I/O handling, that is included in the scalar core. Nonetheless, we conservatively base our area estimate on the larger datapath.

We estimate the area for a high-performance integer unit to be 4.5mm², which is the area for an OOO unit.

We extrapolate the area for the vector load/store unit from T0's vector memory datapath as follows: area for the 128x256 crossbar + 2 × (area for shifting/aligning 32b data). The load/store unit interfaces with a 64-bit memory data bus on one side and 8 64-bit data busses on the other, thus requiring a 64x512 crossbar to transfer data from the memory system into the vector register file. T0 has a 128x256 crossbar to perform the same functionality, and that crossbar occupies about 1mm². This crossbar can be restructured without any area growth to provide the necessary functionality for the long vector processor. Shifting and aligning 64b data between the memory data bus and the register busses requires twice the area used in the 32b T0 vector memory unit.

Because there is only one address bus, only one address needs to be computed per clock cycle. Moreover, scalar and vector instructions use the same 64b data bus to the memory system. Thus, computing addresses and handling TLB functions is handled by hardware in the scalar portion of the processor, and additional area for address functionality is not needed in the vector load/store unit.

We compute the areas for the two implementations of our vector register file based on layout details given by Asanović in Chapter 5 of his doctoral dissertation [1]. In addition to giving the area for storing data, the details of the register file design also include the area for overhead circuitry such as read sense amplifiers, data latches for writes and reads, multiplexors, and drivers. The areas of all the storage and overhead components are based on cells used in the T0 implementation.

The area-efficient vector register file time-multiplexes the word and bit lines at twice the clock rate to allow both a read and a write access in the same clock period. The high-performance implementation adds additional bussing and ports to avoid time-multiplexing while still allowing simultaneous read and write accesses.

We use the MIPS R5000 [11] as the 64-bit scalar processor, which is responsible for instruction control, scalar execution, and data address processing.

3.2. Comparison of Die Areas

Figure 4 plots the areas of the simple long vector processor and several superscalar processors. Also shown is a breakdown of each processor's area into its major components. "Other" is the area for clocking and overhead. We obtained the areas of the superscalar processors by measuring annotated die photomicrographs and floorplans [10, 26, 9].

[Figure 4. Breakdown of Processor Areas -- stacked bars of processor die area (in mm², scaled to 0.25 μm), broken into Datapath, Registers, Instruction Issue, and Other: MIPS R5000 (2-way superscalar) 12; MIPS R10000 (OOO 4-way superscalar) 67; Alpha 21264 (OOO 4-way superscalar) 70; HP PA-8000 (OOO 4-way short vector) 68; simple long vector, area-efficient 52 and high-performance 73.]

A comparison of die areas of different implementations can be complex. Differences in area may be due to many factors that are orthogonal to how a processor exploits parallelism. These factors include fabrication technology, non-processor components, datapath width, and circuit and layout design. We want to minimize differences due to these factors so that remaining differences will be due to parallel-specific features such as datapath organization, register file implementation, and instruction-issue mechanism.

To this end, all die areas are scaled to a 0.25 μm process to eliminate areal differences due to differences in line size. The areas that we compare are for processor components only. Excluded are cache and TLB structures, external interface logic, and the pad ring. Lastly, all areas are based on actual VLSI implementations.

The MIPS R10000, Alpha 21264, and HP PA-8000 are existing implementations of OOO superscalar and OOO short vector processors. Of the three, only the PA-8000 implements its short vector extensions, MAX-2. The implementation of the MAX-2 instructions occupies less than 0.2% of the area shown in Figure 4 [25]. For comparison purposes, areas for the MIPS R5000 are also included.

The area-efficient implementation of the long vector processor is about 70-75% the size of the OOO implementations, while the high-performance version is 4-7mm² larger. Note that this is only for the processor portion of the die. Typically there are on-chip caches as well as hardware for TLBs and external interface logic. For the MIPS R10000 and Alpha 21264, the processor portion occupies 45% of the die, while 85% of the HP PA-8000 die is dedicated to the processor. The reason a larger portion of the die is used for the PA-8000 processor is that its die does not include any on-chip caches. Thus, while 7mm² is a 10% increase over the processor area of an OOO superscalar implementation, it is at most a 5% growth in total die area after non-processor components such as on-chip caches are considered.
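The normalization to a common 0.25 μm line size used above follows the usual first-order rule that layout area scales with the square of the drawn feature size. The helper below is only a sketch of that normalization under an ideal-shrink assumption; the function name and sample numbers are illustrative, not the authors' measurement procedure.

    #include <stdio.h>

    /* Normalize a measured block area to a reference feature size, assuming
     * area scales as (feature size)^2 under an ideal process shrink.
     * This mirrors the paper's normalization to 0.25 um, but only as a sketch. */
    static double scale_area(double area_mm2, double from_um, double to_um)
    {
        double s = to_um / from_um;
        return area_mm2 * s * s;
    }

    int main(void)
    {
        /* Example: a block measuring 4.0 mm^2 in a 0.35 um process
         * corresponds to roughly 2.0 mm^2 when scaled to 0.25 um.   */
        printf("%.2f mm^2\n", scale_area(4.0, 0.35, 0.25));
        return 0;
    }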

Consistent with the results in our previous study, the area breakdown shows that the simple long vector processor uses its transistors differently from the OOO superscalar processors. For the long vector processor, 70% of the die is dedicated to functional units, while only 1-2% is used for instruction issue. The remaining 20-25% of the die is used for the large vector register file. In contrast, the OOO superscalar processors use 40-65% of the area for hardware structures to support multi-way and out-of-order instruction issue. This is comparable to, or in some cases more than, the area used for datapath.

The reason the long vector processor is able to trade instruction-issue area for datapath and registers is that it uses vector instructions that compactly encode parallel operations. Thus much of the functionality for detecting parallelism is transferred to software. Figure 4 quantifies this tradeoff and shows that using area for datapath and registers rather than control is a reasonable alternative to existing designs. In the next section, we will show that this tradeoff provides significant performance gains as well.

We have been conservative with our estimates of the long vector processor's die area. For example, we use full 64-bit integer units for the vector datapath and 64-bit elements for the vector register file. A more area-conservative design could use a partitioned datapath such as in the HP PA-8000 and implement a correspondingly narrower vector register file. In addition, by restructuring the vector register file, it is possible to reduce its area requirements by almost half without sacrificing bandwidth [1]. In short, it may be possible to implement a simple long vector multimedia processor in one-third to one-half of the area of existing OOO superscalar processors.

4. Performance Results

We now investigate the performance of vector processors executing multimedia applications.

Figure 5 describes the programs and inputs we use as the workload for our performance study. The widths of the dominant data types are also listed. All programs are written in C. Four are image processing applications while the other two programs are versions of the IDEA encryption/decryption algorithm for PGP. Details of the two versions and an analysis of their performance difference are given elsewhere [23]. All these programs contain highly vectorizable functions where much of the execution time is spent.

The same C source code is executed on all three processors. For superscalar-only execution, we compile and link the programs with the SGI C compiler. For vector execution, we compile the vectorizable functions with VSUIF, a vectorizing compiler we developed [6]. Because little effort has been put into optimizing the code generated by VSUIF, non-vectorizable portions are compiled with the SGI compiler to benefit from its optimizing capabilities. The SGI linker is used to combine the separately compiled object files into an executable.
To generate performance statistics, both superscalar and vector executables are executed using Vmable, a MIPS ISA simulator that we modified to emulate the execution of vector instructions. Correctness, especially of the vector executable, is verified by comparing Vmable's output against that of native execution. Vmable also generates instruction and data addresses that are input to Vcello, our vector-enhanced version of Todd Mowry's cello timing simulator. Vcello can be configured to emulate the timing of the processors that we study. Vcello also keeps track of cache misses, branch mispredictions, etc., and produces architectural performance statistics that characterize the execution of a program.

[Figure 6. Processor Performance -- speedup over the OOO superscalar processor for the OOO short vector and simple long vector processors on chroma, colorspace, composite, convolve, decrypt.inter, and decrypt.unroll, together with the arithmetic average and geometric mean.]

Figure 6 shows the performance of the two vector processors relative to the superscalar processor. The graph shows that vector execution is faster than wide-issue, out-of-order scalar execution: the short vector processor is on average 1.7 times faster than the superscalar processor, and the long vector processor is 2.7 times faster.(2) This is not surprising since the vector processors support more parallelism. Recall, however, that the cost of this greater support, as measured by die area, is not significantly larger than that of the OOO-only implementation.

The OOO short vector and simple long vector processors have the same peak computational rate and peak memory bandwidth. Nonetheless, the simple longer vector processor is 1.25 to 2.0 times faster than the complex shorter vector processor. Across all the programs, the long vector processor executes on average 1.6 times faster than the short vector processor.

(2) When reporting averages in the text, we use the geometric mean (GM) because we are averaging ratios (CPI and CPO) and normalized data [8]. Consequently, we can use the GM's multiplicative properties when summarizing our analyses of CPI and CPO data. We also show the arithmetic average, which has a more intuitive interpretation, together with the geometric mean in the graphs. Because both averages are within a few percent of each other for the data we present, the geometric mean is a reasonable approximation of the arithmetic average.

PROGRAM         | DATA WIDTH | INPUT                          | DESCRIPTION
chroma          | 8 bit      | 320x240 24-bit color image(s)  | Merges two images on the basis of a "whiteness" threshold.
colorspace      | 8 bit      | 320x240 24-bit color image(s)  | Converts an image in RGB to YUV values.
composite       | 8 bit      | 320x240 24-bit color image(s)  | Blends two images together by a blend factor α.
convolve        | 8, 16 bit  | 320x240 24-bit color image(s)  | Convolves an image with a 3x3 16-bit kernel.
decrypt.unroll  | 16 bit     | 16,000-byte message            | Unrolled version of IDEA decryption.
decrypt.inter   | 16 bit     | 16,000-byte message            | Loop-interchanged version of IDEA decryption.

Figure 5. Benchmark Programs
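To make the workload concrete, the fragment below shows the kind of loop these programs spend their time in. It is written from the benchmark description in Figure 5, not taken from the benchmark sources: an 8-bit composite that blends two images by a blend factor α. Every iteration is independent, which is what makes such loops highly vectorizable.

    #include <stdint.h>

    /* Representative inner loop in the style of the 'composite' benchmark
     * (reconstructed from the description in Figure 5, not the actual source):
     * blend two 8-bit images a and b with blend factor alpha in [0, 255].   */
    void composite8(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                    int n, uint8_t alpha)
    {
        for (int i = 0; i < n; i++)        /* independent 8-bit operations */
            dst[i] = (uint8_t)((alpha * a[i] + (255 - alpha) * b[i]) / 255);
    }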

In the remainder of this section, we analyze the performance of the three processors to explain why the vector processors are faster than the superscalar-only processor and why the longer vector is faster than the shorter.

4.1. CPI × Dynamic Instruction Count

The classic method for determining why one computer executes faster than another is to deconstruct each computer's execution time into the three components of the CPU performance equation: cycle time, average cycles per instruction (CPI), and dynamic instruction count [17]. In the remainder of this paper, we refer to this equation as the "CPI equation" to distinguish it from the "CPO equation" that we discuss in the next subsection.

For the purposes of this study, we make the reasonable assumption that all three processors can be implemented with similar state-of-the-art clock frequencies. OOO short vector processors have already been implemented at the same clock frequencies as OOO superscalar processors, while a long vector processor with its simpler control logic could be implemented with a much faster clock frequency. Thus this assumption potentially underestimates the performance of the simple long vector processor.

[Figure 7. CPI & Dynamic Instruction Count -- two bar charts per benchmark (chroma, colorspace, composite, convolve, decrypt.inter, decrypt.unroll, plus arithmetic average and geometric mean): cycles per instruction for the three processors, and dynamic instruction count in millions, broken into scalar instructions, vl8 vector instructions, and vl64 vector instructions.]

The graphs in Figure 7 show the data for the other two components of the CPI equation. For the dynamic instruction count, we also show the number of scalar and vector instructions executed for each program.

As expected, the CPI for the superscalar processor is low: between 0.4 and 0.7, or equivalently between 1.4 and 2.5 instructions per cycle.

All instructions in the short vector processor, including the vector ones, still require a minimum of only one clock period to execute. Nonetheless, this processor has a CPI that is, on average, 2.5 times higher than the CPI of the superscalar processor. The higher CPI suggests that less instruction-level parallelism is being exploited. One reason for the reduced ILP is that by using vector instructions, there is less parallelism to be used at the instruction level. A vector instruction executes 8 operations that, in a scalar processor, would be 8 separate instructions that could be executed in parallel.

The CPI for the long vector processor is very high, ranging from 2.4 to 6.4 cycles per instruction. This is 6 to 12 times greater than the superscalar's CPI. The reason for the substantially higher CPI is that a vector instruction for the long vector processor takes 8 clock periods to execute.

Despite the substantially higher CPIs, the vector processors are faster than the superscalar processor. To understand why, it is important to consider the other half of the CPI equation: the dynamic instruction count. The bottom half of Figure 7 shows the number of instructions executed by each processor.

An immediately obvious trend among the processors is that the dynamic instruction count is inversely proportional to the CPI. The superscalar processor, which has the lowest CPI, executes far more instructions than the vector processors. In contrast, the long vector processor, with the highest CPI, executes 12-34 times fewer instructions than the superscalar processor and 3.7-7.4 times fewer instructions than the short vector processor.

The greatly reduced instruction count is the result of using stripmined vector code. A stripmined vector loop that uses vector lengths of 64 is comparable to implicitly unrolling the loop 64 times. The scalar instructions that handle the bookkeeping of the stripmined code correspond to the loop overhead code of the original loop. The vector instructions execute the operations of the loop body itself.

Stripmining a loop is better than explicitly unrolling it for two reasons. First, the stripmined vector loop reduces the dynamic count for both the loop overhead instructions and the loop body instructions, while explicit unrolling reduces only the overhead count. Second, stripmining achieves this reduction without substantially increasing the static instruction count, whereas explicit unrolling requires replicating loop body instructions.

An example of a loop overhead instruction is one that updates the loop index variable (add r1,r1,1). In the rolled version of a loop, the same scalar instruction for carrying out this functionality is executed 64 times. The stripmined and unrolled versions would collapse these into a single instruction (add r1,r1,64) that is executed once for every 64 loop iterations. Similar transformations are applied to other scalar instructions that constitute the loop overhead. Examples include those that update address values. In this way the number of loop overhead instructions executed is reduced by a factor of 64.

An example of an instruction sequence that includes a loop body instruction is one that loads an element from an array and updates the associated address register: load r3,0(r2); add r2,r2,4. In this case, the load instruction is a loop body instruction while the add instruction is a loop overhead instruction. In the rolled version, these two instructions would be executed 64 times, resulting in a dynamic instruction count of 128.

For the version that explicitly unrolls the loop 64 times, the load instruction would be repeated 64 times but with a different offset for each load instruction. The address update instruction, however, would not be repeated and instead would be transformed into add r2,r2,256. Thus the number of static instructions has grown from 2 to 65, while the dynamic instruction count has been reduced from 128 to 65.

The stripmined vector version reduces the dynamic count even further, while keeping the static count comparable to the rolled version. The 64 load instructions are now replaced with a single vector load instruction vload v3,(r2),4 that accesses memory locations that are 4 bytes apart. This results in only 2 instructions being executed to carry out 64 loads and their associated address updates.
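The contrast between the rolled scalar loop and its stripmined vector form can also be sketched in C. The vector helpers below are hypothetical stand-ins for vector instructions, written as plain C so the example is self-contained; they are not the paper's compiler output or a real API.

    #define MAX_VL 64

    typedef struct { long e[MAX_VL]; } vec_reg;   /* models one 64-element vector register */

    /* Hypothetical stand-ins: on the proposed processor each of these would be
     * a single vector instruction operating on up to vl elements.             */
    static int     set_vl(int n) { return n < MAX_VL ? n : MAX_VL; }
    static vec_reg vec_load(const long *p, int vl)
    { vec_reg v; for (int i = 0; i < vl; i++) v.e[i] = p[i]; return v; }
    static vec_reg vec_add_scalar(vec_reg v, long k, int vl)
    { for (int i = 0; i < vl; i++) v.e[i] += k; return v; }
    static void    vec_store(long *p, vec_reg v, int vl)
    { for (int i = 0; i < vl; i++) p[i] = v.e[i]; }

    /* Rolled scalar loop: load/add/store plus loop overhead, executed n times. */
    void add_k_scalar(long *x, long k, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] += k;
    }

    /* Stripmined vector loop: each strip performs up to 64 loads, adds, and
     * stores with a fixed, small number of vector instructions plus one set of
     * loop overhead, so the dynamic instruction count for the loop drops by a
     * factor close to VL.                                                      */
    void add_k_stripmined(long *x, long k, int n)
    {
        for (int i = 0; i < n; ) {
            int vl = set_vl(n - i);             /* stands in for writing the VLR */
            vec_reg v = vec_load(&x[i], vl);    /* one vector load               */
            v = vec_add_scalar(v, k, vl);       /* one vector add                */
            vec_store(&x[i], v, vl);            /* one vector store              */
            i += vl;                            /* stripmine bookkeeping         */
        }
    }

In the rolled form the loop body and its overhead execute n times; in the stripmined form roughly n/64 strips each execute only a handful of instructions.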
The reduction in dynamic instruction count for loop overhead is obtained by solving simple linear recurrences, a technique that is independent of instruction type and thus available to all three processors. On the other hand, the tremendous savings for the loop body is achieved by using vector instructions, a technique obviously not available for non-vector processors. In short, a stripmined vector loop of length VL can reduce the dynamic instruction count for that loop by a factor close to VL while keeping the static count close to that of the original rolled version.

It is not always possible, however, to reduce the overall dynamic instruction count of the scalar version by a factor of VL, for several reasons. The most obvious reason is that vector loops do not always dominate the execution time. While the bulk of time is spent in vector loops for the programs in this study, a small but significant amount is spent in non-vectorizable parts of the program.

Another reason is that there is not always a one-to-one correspondence between instructions in the rolled loop version and instructions in the stripmined version. As a result, the static instruction count for the stripmined version can be slightly higher than the count for the rolled version. For example, instructions to update the vector length register must be introduced into the stripmined version. In addition, depending upon the ISA support for conditional vector execution, more vector instructions may be used to handle conditional statements than in the scalar version. A final example is when the scalar version is unrolled several times. The SGI compiler we use for the superscalar processor will unroll loops 2 to 4 times depending upon the size of the loop body.

Even though it is not always possible to achieve a reduction of VL in the dynamic instruction count, using stripmining to reduce this count is more effective than explicitly unrolling. This is because VL can be much greater than the amount typically used for unrolling. Small values of VL, such as 8, are large for explicit unrolling, while more typical VL values, such as 32 or 64, are ineffective for unrolling due to the tremendous growth in the number of static instructions generated as well as the number of registers required. On the other hand, support for vector execution with length 32 or 64 is not prohibitively expensive, as explained in the previous section.

Figure 8 summarizes the results of our CPI analysis. To explain why the vector processors are faster than the superscalar processor, we express the average speedup of a vector processor as a function of the superscalar processor's average CPI (CPI_SS) and average dynamic instruction count (NI_SS). This is possible because the geometric mean is used to compute the averages, and we use the multiplicative properties of the geometric mean to guarantee equivalency between the two expressions.

From Figure 8, we observe that the vector processors have a significantly higher average CPI than the superscalar processor does: the average CPI of the short vector processor is 2.5 times higher, while the long vector's average CPI is substantially greater at 8.2 times that of the superscalar's CPI. On the other hand, the vector processors execute far fewer instructions: the short vector processor executes, on average, about 4 times fewer instructions while the long vector processor reduces the instruction count by more than a factor of 22.

PROCESSOR          | CYCLES PER INSTRUCTION | DYNAMIC INSTRUCTION COUNT | CYCLE COUNT            | SPEEDUP OVER SUPERSCALAR
OOO superscalar    | CPI_SS                 | NI_SS                     | CPI_SS × NI_SS         | 1.00
OOO short vector   | 2.50 CPI_SS            | 0.24 NI_SS                | 0.60 CPI_SS × NI_SS    | 1.67
simple long vector | 8.19 CPI_SS            | 0.045 NI_SS               | 0.37 CPI_SS × NI_SS    | 2.71

Figure 8. Average Speedup Deconstructed Using CPI Equation
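As a worked check, the speedups in Figure 8 are just the table's entries recombined through the CPI equation: with the clock cycle assumed equal for all three processors, execution time is proportional to the product of dynamic instruction count and CPI, so

    speedup over superscalar = (CPI_SS × NI_SS) / (CPI × NI)

    OOO short vector:    1 / (2.50 × 0.24)  = 1 / 0.60 ≈ 1.67
    simple long vector:  1 / (8.19 × 0.045) ≈ 1 / 0.37 ≈ 2.71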

The tremendous reduction in dynamic instruction count more than compensates for the increase in CPI, resulting in the OOO short vector processor being 1.7 times faster than the OOO superscalar processor and the simple long vector being 2.7 times faster. Moreover, the long vector's reduction in instruction count is sufficiently large that it is 1.6 times faster than the OOO short vector despite having a CPI that is 3.3 times higher.

[Figure 9. CPO & Dynamic Operation Count -- two bar charts per benchmark (chroma, colorspace, composite, convolve, decrypt.inter, decrypt.unroll, plus arithmetic average and geometric mean): cycles per operation for the three processors, and dynamic operation count in millions, broken into scalar and vector operations.]

4.2. CPO × Dynamic Operation Count

Using the CPI equation to deconstruct execution time explains why the vector processors are more effective than the superscalar processor when executing highly vectorizable programs. However, it does not provide much insight into how efficient a vector processor is at executing parallel loops. For example, given that individual instructions can take 1 or 8 cycles to execute in the long vector processor, a CPI less than 8 could be the result of using a significant amount of parallelism effectively, or it could be due to having far more scalar instructions being executed than vector instructions.

The reason for this ambiguity is that an instruction is no longer an appropriate unit of work for comparison purposes. The amount of work carried out by an instruction varies tremendously depending upon both the type of instruction and the vector length. When all instructions take about the same amount of time to execute, a high CPI indicates possible performance inefficiencies in the implementation. On the other hand, it is possible for a long vector processor to use parallelism effectively and still have a high CPI.

Rather than using an instruction as a unit of work, we recommend using an operation, which is the amount of work that is carried out by a functional unit. One scalar instruction executes one operation while one vector instruction executes VL operations, where VL is the value of the vector length register when the instruction is executed.(3)

Figure 9 shows the cycles per operation (CPO) and the dynamic operation count for the three processors. The data for the superscalar processor are the same as in the instruction-centric graphs. For the dynamic operation count, we also show the number of scalar and vector operations executed for each program.

Whereas the CPIs are higher, the CPOs of the vector processors are lower than the superscalar's CPO. Excluding the CPO for decrypt.unroll, the short vector's CPOs are up to 1.7 times lower than the superscalar's CPO while the long vector's CPOs are 1.35-2.8 times lower. For decrypt.unroll, the vector CPOs are more than 4 times smaller.

(3) This mapping is a rough measure of work. A scalar memory instruction that uses an auto-increment addressing mode could be counted as executing two operations: the memory access and the address register update. In a similar fashion, vector memory instructions could also be counted as executing 2VL operations to account for both the memory accesses and the address register updates. However, to keep the mapping simple, we count only one operation per pair of memory access and address register update.

Recall that the higher CPI of the short vector processor indicates that it is finding less ILP than the OOO superscalar processor. Its lower CPO indicates that the short vector processor compensates for the reduced instruction-level parallelism by effectively using operation-level parallelism. More generally, because operations, unlike instructions, carry out approximately equivalent amounts of work in all three processors, the vector processors' lower CPO indicates that they are using the available parallelism in the program more effectively than the superscalar processor.

The CPO of the long vector processor is 1.1-1.7 times lower than the short vector's CPO across the programs. The lower CPO suggests that the long vector processor is able to use parallelism more effectively. By using a vector width of 8, each processor is able to take advantage of the operational parallelism within a vector instruction equally well. It is at the instruction level where the two processors differ in their use of parallelism, as a consequence of their differences in issue mechanism and vector length. Figure 10 summarizes the types of instruction-level parallelism that each can use.

The complex short vector processor uses its out-of-order wide issue to execute vector instructions in parallel with scalar instructions. In addition, scalar instructions also execute in parallel. It is rare, however, that vector instructions would execute in parallel with each other, for several reasons. Vector instructions take only one clock cycle to execute, and there is only one vector memory unit and one vector computational unit. Thus vector-vector ILP requires a vector memory instruction and a vector computational instruction that have no data dependencies.

It is unlikely, however, that such a pair will be found because there are often data dependences among vector instructions in the same vector strip. Vector instructions that can execute in parallel are typically from subsequent vector strips. However, a non-trivial loop usually generates a vector strip that contains more than 10 instructions that often include several memory instructions. Even with complex out-of-order issue hardware, it is unlikely that the hardware would be able to execute vector instructions from different strips in parallel.

The simple long vector processor, on the other hand, is able to execute vector memory instructions in parallel with vector computational instructions. This is because a vector instruction takes 8 cycles to execute and, with chaining, even data-dependent vector instructions can overlap in execution. In addition, scalar instructions can also execute in parallel with already executing vector instructions. With in-order single issue, however, scalar instructions cannot execute in parallel with each other.

For both processors, it is possible to use compiler techniques to overcome the inability of each to exploit a specific type of ILP. For the short vector processor, loop unrolling or the more sophisticated technique of software pipelining can be used to schedule vector instructions from different strips so that they could execute in parallel. For the long vector processor, the use of simple list scheduling within a vector strip can allow most of the scalar instructions to execute in parallel with vector instructions and thus transform potential scalar-scalar ILP into achievable vector-scalar ILP. Neither of these techniques is currently used by our vectorizing compiler.

The long vector processor's lower CPO constitutes only a portion of its speedup over the short vector processor. As before, it is important to consider dynamic operation count, the other half of the CPO equation, to account for the remaining speedup factor. The bottom half of Figure 9 shows the number of operations executed by each processor.

The long vector processor always executes fewer operations than the short vector processor. This is entirely due to the fact that the long vector processor executes about 7 times fewer scalar operations, on average, to accomplish the same task. Both vector processors execute the same number of vector operations for a given program.

The reduction in scalar operation count is a direct result of the long vector processor having a vector length that is 8 times longer than the short vector processor's. Recall that different techniques are used to reduce the dynamic instruction count for the loop overhead and loop body in a stripmined vector loop: loop overhead instructions are eliminated by solving simple linear recurrences, while a loop body instruction is replaced by a single vector instruction. By implicitly unrolling the loop 64 times rather than 8 times, the long vector processor executes the same number of vector operations as the short vector processor but 8 times fewer scalar operations for the vectorized loop.

The overall reduction in scalar operations is less than a factor of 8 because a small but non-trivial number of the scalar instructions are executed for non-vectorizable parts of the program.

Figure 11 summarizes the results of our CPO analysis. We use the same averaging technique as we used to summarize our CPI analysis in Figure 8. Unlike the CPI results, the CPO results show that the vector processors achieve greater performance by reducing both components of the CPO equation. First, the short and long vector processors are more effective at using parallelism, achieving a CPO that is, on average, 1.5 and 2.1 times lower than the superscalar's CPO, respectively. Second, the vector processors execute fewer operations by stripmining vector loops: the short vector processor executes, on average, 1.1 times fewer operations while the long vector processor executes 1.3 times fewer.

The CPO analysis also provides more insight into why the simple long vector processor is faster than the complex short one. The bulk of the speedup is the result of a CPO that is, on average, 1.4 times lower, with an average 1.15 times lower operation count providing the rest of the speedup. By using instruction-level parallelism more effectively and executing fewer scalar operations, the simple long vector processor is on average 1.6 times faster than the complex short vector processor.
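Plugging the Figure 11 entries into the CPO equation makes this decomposition explicit; the lines below are simply the table's entries recombined:

    speedup of long vector over short vector
        = (0.67 CPO_SS × 0.88 NO_SS) / (0.48 CPO_SS × 0.77 NO_SS)
        = 0.59 / 0.37 ≈ 1.6
        = (0.67 / 0.48) × (0.88 / 0.77) ≈ 1.4 × 1.15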

                              VECTOR PROCESSORS
ILP TYPE        | OOO short | simple long | COMPILER ASSISTANCE
scalar-scalar   | yes       | --          | use simple list scheduling to enable vector-scalar ILP instead
vector-scalar   | yes       | yes         |
vector-vector   | --        | yes         | use loop unrolling or software pipelining to enable

Figure 10. Effective ILP in Vector Processors

PROCESSOR          | CYCLES PER OPERATION | DYNAMIC OPERATION COUNT | CYCLE COUNT            | SPEEDUP OVER SUPERSCALAR
OOO superscalar    | CPO_SS               | NO_SS                   | CPO_SS × NO_SS         | 1.00
OOO short vector   | 0.67 CPO_SS          | 0.88 NO_SS              | 0.59 CPO_SS × NO_SS    | 1.70
simple long vector | 0.48 CPO_SS          | 0.77 NO_SS              | 0.37 CPO_SS × NO_SS    | 2.70

Figure 11. Average Speedup Deconstructed Using CPO Equation

5. Summary and Conclusions

We have presented die area measurements and simulation-based performance data to demonstrate the cost/performance benefits of using a 2-way, in-order vector microprocessor with a vector length of 64 and a vector width of 8 to execute multimedia applications. Unlike a 4-way OOO superscalar processor with short vector extensions, this simple long vector processor is configured more like traditional vector architectures but with two major enhancements: much wider vectors and slightly wider instruction issue.

The cost benefits of the long vector processor are made possible by implementing it in VLSI technology. Using components from actual VLSI implementations, we showed that a conservative estimate of a high-performance implementation of this processor would be comparable in area to existing OOO 4-way superscalar processors. An area-efficient implementation would be 70-75% the size of an OOO superscalar processor, while a more aggressive implementation could possibly be one-third to one-half the size. Furthermore, the simple long vector processor should also be easier to design, implement, and verify, because the bulk of its die area is used for datapath and registers rather than complex instruction-issue logic.

A VLSI implementation of a vector processor also provides performance benefits. Examples include vector units with zero-cycle recovery time, short operational latencies, and a high-bandwidth, area-efficient vector register file. Latencies are comparable to those in other microprocessor implementations, rather than to latencies in supercomputer implementations, which are considered the quintessential vector technology.

Additional performance gains are achieved from the vector ISA. Although a long vector length results in a much higher CPI, it also reduces the dynamic instruction count substantially without significantly increasing the static instruction count. By using cycles per operation as a metric, we showed that the simple long vector processor is more effective at using vectorizable parallelism than the superscalar processors despite having a significantly higher CPI. Thus by executing more operations per cycle and fewer operations in total, the simple long vector processor is, on average, 2.7 times faster than a 4-way OOO superscalar processor and 1.6 times faster than a superscalar processor with short vector extensions.

These performance gains are obtained with an R10000-like cache-based memory system used for all three processors. Thus a vector implementation does not require a costly, high-bandwidth memory system to achieve high performance. In fact, a preliminary analysis of the cache performance statistics suggests that all three processors could benefit from a more sophisticated cache system.

In summary, the complexity and area cost for providing highly parallel vector support can be substantially less than the complexity and area cost for OOO superscalar support. Moreover, these savings in cost are coupled with a significant increase in performance. Thus a simple, highly parallel, long vector microprocessor represents a cost-effective alternative for executing multimedia applications.

Acknowledgements

We thank David Patterson, Krste Asanović, and the IRAM research group for their helpful comments and suggestions. We also thank Todd Mowry for his expertise in constructing simulators. This work was supported by NSERC, CITO, DARPA (DABT63-96-C-0056), NSF (CDA 94-01156), and the California State MICRO Program.

References

[1] Krste Asanović. Vector Microprocessors. PhD thesis, University of California at Berkeley, May 1998.
[2] Krste Asanović, James Beck, Bertrand Irissou, Brian Kingsbury, Nelson Morgan, and John Wawrzynek. The T0 Vector Microprocessor. In Proceedings of Hot Chips VII, pages 187-196, Stanford, CA, August 1995.
[3] William J. Bowhill et al. A 300MHz 64b Quad-Issue CMOS RISC Microprocessor. In Digest of Technical Papers for the International Solid-State Circuits Conference, pages 182-183, 362, San Francisco, CA, February 1995.
[4] Doug Burger and James R. Goodman (Guest Editors). Billion-Transistor Architectures. IEEE Computer, 30(9):46-48, September 1997. Special issue on The Future of Microprocessors.
[5] Digital Equipment Corporation. Advanced Technology for Visual Computing: Alpha Architecture with MVI. http://www.digital.com/semiconductor/mvi-backgrounder.htm, March 1997.
[6] Derek J. DeVries. A Vectorizing SUIF Compiler: Implementation and Performance. Master's thesis, University of Toronto, June 1997.
[7] Keith Diefendorff and Pradeep K. Dubey. How Multimedia Workloads Will Change Processor Design. IEEE Computer, 30(9):43-45, September 1997.
[8] P. J. Fleming and J. J. Wallace. How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results. Communications of the ACM, 29(3):218-221, March 1986.
[9] Bruce A. Gieseke et al. A 600MHz Superscalar RISC Microprocessor with Out-Of-Order Execution. In Digest of Technical Papers for the International Solid-State Circuits Conference, pages 176-177, 451, San Francisco, CA, February 1997.
[10] Silicon Graphics. MIPS R10000 Microprocessor. http://www.sgi.com/MIPS/products/r10k/T5_Die.html, 1996.
[11] Silicon Graphics. MIPS RISC R5000 Microprocessor. http://www.sgi.com/MIPS/products/r5000, 1996.
[12] Silicon Graphics. MIPS Extension for Digital Media with 3D. http://www.sgi.com/MIPS/arch/ISA5/index.html#MIPSVindx, March 1997.
[13] Paul E. Gronowski et al. A 433MHz 64b Quad-Issue RISC Microprocessor. In Digest of Technical Papers for the International Solid-State Circuits Conference, pages 222-223, 449, San Francisco, CA, February 1996.
[14] Linley Gwennap. Class of '94 Has Mixed Success. Microprocessor Report, 11(14), October 27, 1997.
[15] Linley Gwennap. Is It Soup Yet? Microprocessor Report, 11(2), February 19, 1997.
[16] Linley Gwennap. RISC Disappointments Mount. Microprocessor Report, 11(17), December 29, 1997.
[17] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[18] Intel. Software Benefits of Katmai New Instructions. http://developer.intel.com/drg/news/katmai.htm, 1998.
[19] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, and G. Zyner. The Visual Instruction Set (VIS) in UltraSPARC. In Proceedings of Compcon-95, pages 462-469, March 1995.
[20] Christoforos E. Kozyrakis and David A. Patterson. A New Direction for Computer Architecture Research. To be published in IEEE Computer, May 1998.
[21] Ashok Kumar. The HP PA-8000 RISC CPU. IEEE Micro, 17(2):27-32, March-April 1997.
[22] Corinna G. Lee. MIPS Vector Architecture Manual. In preparation, June 1997.
[23] Corinna G. Lee and Derek J. DeVries. Initial Results on the Performance and Cost of Vector Microprocessors. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 171-182, December 1997.
[24] Ruby B. Lee. Accelerating Multimedia with Enhanced Microprocessors. IEEE Micro, 15(2):22-32, April 1995.
[25] Ruby B. Lee. Subword Parallelism with MAX-2. IEEE Micro, 16(4):52-59, August 1996.
[26] Jon Lotz et al. A Quad-Issue Out-of-Order RISC CPU. In Digest of Technical Papers for the International Solid-State Circuits Conference, pages 210-211, 446, San Francisco, CA, February 1996.
[27] Motorola. Motorola AltiVec Technology: Home Page. http://www.mot.com/SPS/PowerPC/AltiVec/, 1998.
[28] Alex Peleg and Uri Weiser. MMX Technology Extensions to the Intel Architecture. IEEE Micro, 16(4):10-20, August 1996.
[29] John Wawrzynek, Krste Asanović, Brian Kingsbury, James Beck, David Johnson, and Nelson Morgan. SPERT-II: A Vector Microprocessor System. IEEE Computer, 29(3):79-86, March 1996.
[30] Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28-40, April 1996.
