To be published in the Proceedings of the 31st Annual International Symposium on Microarchitecture, December 1998.
Simple Vector Microprocessors for Multimedia Applications
Corinna G. Lee and Mark G. Stoodley
{corinna,stoodla}@eecg.toronto.edu
Department of Electrical and Computer Engineering, University of Toronto
Abstract

In anticipation of the emergence of multimedia applications as an important workload, microprocessor companies have augmented their instruction-set architectures with short vector extensions, thus adding basic vector hardware to state-of-the-art superscalar processors. Although a vector architecture may be a good match for multimedia applications, there is growing evidence that the control logic for increasingly complex superscalar processors is difficult to implement.

Rather than combining a complex superscalar core with short wide vector hardware, we propose using a much simpler processor design that is similar to traditional vector computers with long vectors and simple control logic for instruction issue. Such a design would use the bulk of its transistors and die area for datapath and registers, and thus lessen the time required to design, implement, and verify control.

In this paper, we present data that quantifies this trading of control transistors for datapath and register transistors. We demonstrate that a 2-way, in-order vector processor with a vector length of 64 and a vector width of 8 requires no more die area, and possibly significantly less area, than a 4-way, out-of-order superscalar processor with short vector extensions. Furthermore, we show that the simple long vector processor is, on average, 2.7 times faster executing multimedia applications than the superscalar processor, and 1.6 times faster than one with short vector extensions.

To explain the reasons for the higher performance, we analyze execution time in terms of dynamic operation count and cycles per operation (CPO). A vector processor executes fewer operations by using vector instructions to stripmine a loop. Moreover, a long vector processor achieves a lower CPO by effectively using parallelism at both the operation and the instruction levels. Thus by reducing both terms of the CPO equation, the simple long vector processor achieves greater performance.

1. Introduction

Advances in microprocessor design over the past decade have been primarily driven by two application domains: technical and scientific applications for uniprocessor desktops, and transaction processing and file-system workloads for multiprocessor servers. It is expected, however, that application domains will shift over the next two decades. Although it is difficult to predict what future applications will be, there is a growing consensus that multimedia applications will increase in importance as greater priority is given to more human-friendly interfaces and to personal mobile computing [7, 20].

In anticipation of this emerging applications area, microprocessor companies have augmented their instruction-set architectures with short vector extensions, thus adding basic vector hardware to state-of-the-art superscalar processors. Figure 1 lists the extensions that have been introduced or announced by all major microprocessor companies.

Processor                  Short Vector Extension                            Year
Sun UltraSPARC             VIS: Visual Instruction Set [19]                  1995 shipped
Hewlett-Packard PA-RISC    MAX-1: Multimedia Acceleration eXtensions [24]    1995 shipped
                           MAX-2 [25]                                        1996 shipped
Silicon Graphics MIPS      MDMX: MIPS Digital Media eXtension [12]           1996 announced
Digital Alpha              MVI: Motion Video Instructions [5]                1996 announced
Intel Pentium              MMX: MultiMedia eXtensions [28]                   1996 announced, 1997 shipped
Intel Katmai               SIMD floating-point extensions [18]               1998 beta
Motorola PowerPC           AltiVec [27]                                      1998 announced

Figure 1. Short Vector Extensions in General-Purpose Microprocessors

A common aspect of the vector extensions listed in Figure 1 is that all use a wide datapath that is partitioned to execute narrower data types in parallel. These narrower data types are more typical for multimedia applications, which manipulate sound and image data. Almost all use a 64-bit datapath; the HP MAX-1 uses a 32-bit datapath, while the PowerPC AltiVec uses a 128-bit datapath.

It is useful to characterize a vector implementation in terms of its vector length and vector width. Vector length is the maximum number of operations that a vector instruction can execute, while vector width refers to the number of operations that are executed in one clock cycle for a vector instruction. Thus, each 64-bit vector extension can be viewed as an extremely short vector architecture with a vector length of 8 and a vector width of 8 for 8-bit data types. Wider data types are executed with even shorter vector lengths. Such vector configurations are quite different from the more typical vector lengths of 64 or 128 and vector widths of 1 or 2 which appear in vector supercomputers.

Although a vector architecture may be a good match for multimedia applications, there is mounting evidence that combining one with increasingly complex superscalar processors will be difficult to implement. Over the past 2 years, shipments of superscalar processors have been delayed repeatedly in order to meet target speeds [15, 14, 16]. Late shipments are often attributed to complex out-of-order designs. Promises of a huge transistor budget within a decade offer the possibility of implementing even more aggressive and complex designs [4].

Rather than combine a complex superscalar core with short wide vector hardware, we propose using a much simpler processor design that is similar to traditional vector computers with long vectors and simple control logic for instruction issue. Such a design would use the bulk of its transistors and die area for datapath and registers, and thus lessen the time required to design, implement, and verify control.

In this paper, we present data that quantifies this trading of control transistors for datapath and register transistors. We demonstrate that a 2-way, in-order vector processor with a vector length of 64 and a vector width of 8 requires no more die area, and possibly significantly less area, than a 4-way, out-of-order superscalar processor with short vector extensions. Furthermore, we show that the simple long vector processor is, on average, 2.7 times faster executing multimedia applications than the superscalar processor, and 1.6 times faster than one with short vector extensions.

For this paper, we focus on the use of vector architectures for multimedia applications because, as mentioned earlier, multimedia applications are growing in importance, and because the effectiveness of vector architectures in other application areas has been reported elsewhere. Their effectiveness on scientific and engineering applications has been demonstrated by their historically dominant use in the supercomputing arena, while other researchers are currently investigating the vectorizability of SPECint programs [1].

The remainder of the paper is organized as follows. In the next section, we describe the details of the processors that we study. In Section 3, we give area estimates for the simple long vector processor and compare its area to those of existing OOO superscalar processors. In Section 4, we present CPI and CPO analyses of simulation-based performance data to explain why greater performance is achieved by the long vector processor.

2. Processor Configurations

Figure 2 lists the features of the processors in our study. We include an out-of-order, 4-way superscalar processor for comparative purposes. The OOO superscalar processor is modeled after the MIPS R10000 with a PA8000-sized re-order buffer [30, 21].

The features that are most relevant to this study are highlighted in bold: issue order, issue width, vector length, and vector width. The last two features are determined by the configuration of the vector register file and the vector datapath, respectively. These features are varied in different combinations for the three processors while other features are not. For example, the ISA, cache-based memory system, and memory bandwidth are the same for all three processors. In this way, the impact on performance and die area of the common features should be approximately the same across all processors. Thus any performance or cost differences that we observe can be attributed to the four features of interest.

The vector processors are based on the Torrent-0 (T0) microprocessor [2, 29]. The T0 is a single-chip vector microprocessor that was implemented by researchers at the University of California at Berkeley. It is fabricated with Hewlett-Packard's CMOS26G process using 1.0µm scalable CMOS design rules and two metal layers, and was first fully functional in April 1995 at 45MHz (the main reason for the relatively slow clock rate is the coarser process technology; the 45MHz clock rate is actually competitive with full-custom commercial processors implemented in similar processes [1]). Unlike vector supercomputers, the T0 implementation is inexpensive by virtue of being fabricated as a single VLSI chip [23]. In addition to being inexpensive, T0 is also a "nimble" vector implementation [1]. Much of T0's nimbleness can be attributed to the tight integration of the scalar processor and vector hardware on a single die, thus reducing the scalar overhead of vector execution significantly. T0's single-die implementation also allows back-to-back vector instructions to execute in the same vector
FEATURE                 OOO SUPERSCALAR      OOO SHORT VECTOR       SIMPLE LONG VECTOR
ISA                     64b MIPS             64b MIPS with vector extensions
issue order             out of order         out of order           in order
issue width             4 instructions       4 instructions         2 instructions
fetch width             4 instructions       4 instructions         2 instructions
re-order buffer size    56 instructions      56 instructions        none
#physical registers     64 integer           64 integer             32 integer
                        64 floating-point    64 floating-point      32 floating-point
                                             32 8-element vector    32 64-element vector
datapath                2 integer units      2 integer units        2 integer units
                        1 load/store unit    1 load/store unit      1 load/store unit
                                             1 VU with 8 IUs        1 VU with 8 IUs
memory system           64-bit data bus, 64-bit address bus;
                        2-level cache memory based on the R10000 implementation
C compiler              SGI V5.3 -O2         SGI V5.3 -O2 and VSUIF V1.1.0
Figure 2. Superscalar and Vector Processors
unit without any intervening recovery cycles. A vector unit in a supercomputer, with its board-level implementation, must typically wait an additional 4 cycles when executing successive vector instructions. Additional advantages of implementing a vector architecture in microprocessor technology, rather than supercomputer technology, include an area-efficient implementation of a high-bandwidth vector register file, more efficient energy consumption, and microprocessor-like operational latencies.

Although we base the details of our vector implementation on the T0, we use our own vector ISA [22]. We defined a separate vector ISA in order to support floating-point data for graphics applications as well as scientific and engineering applications. Moreover, our vector extension is different from the MIPS MDMX short vector extension [12] because we are also interested in studying the effects of varying vector lengths and vector widths.

Our vector MIPS ISA is defined in the coprocessor 2 opcode space and includes memory and computational instructions that execute in vector mode. The memory instructions include loads and stores of various data widths using unit stride, constant stride, or gather/scatter. The computational instructions include integer arithmetic, logical, shift, and floating-point instructions. The vector ISA is a load/store architecture which defines 32 vector registers, each of which consists of 64 elements. Similar to traditional vector architectures, and unlike the recent short vector extensions, our vector ISA includes a vector length register (VLR) that specifies the number of operations a vector instruction executes.

T0-like features in the vector processors we study include chaining of data-dependent vector instructions and zero-cycle recovery time for the vector unit when executing successive vector instructions. As in the T0, both vector memory and vector computational instructions can be chained. In addition to handling RAW data hazards, the vector processors also resolve RAR structural hazards. Finally, operational latencies are the same as in the R10000 [30], although the integer units in the vector datapath include a 16x64 multiplier array to allow vector multiply instructions for narrower data widths to execute at a fully pipelined rate. Vector multiplication of wider data types has a non-unit repeat latency.

The bus interface to the memory system consists of a 64-bit address bus and a 64-bit data bus. At most one address can be placed on the 64-bit address bus in one cycle. As a result, only unit-stride vector instructions can proceed at the maximum memory vector width. Non-unit-stride vector memory instructions proceed with a vector width of one.

Because the memory data bus is 64 bits, the memory vector widths for unit-stride vector instructions are determined by the width of the data being accessed. For example, 8 8-bit data or 4 16-bit data could be accessed in a single clock cycle. On the other hand, the vector width for computational vector instructions is fixed at 8 by the number of integer units in a vector unit. Thus it is possible to have a computational vector width of 8 and a memory vector width of 4 when processing halfword data. Chaining hardware automatically provides the necessary dependency checks to execute chained computational and memory instructions of different vector widths.

The memory system itself is a 2-level cache system based on the R10000. The split L1 caches are each 64KB, 2-way set-associative with 32B lines and 4 banks. The unified L2 cache is 512KB, 4-way set-associative with 64B lines and 8 banks. The L1 miss penalty is 6 cycles
and the L2 miss penalty is 19 cycles.

We emulate the execution behavior of an OOO short vector extension using our more conventional vector ISA and set the vector length to 8. In addition, rather than using a partitioned 64-bit datapath, we instead use a vector unit containing 8 integer units. How an 8-wide vector datapath is implemented primarily affects the cost of the implementation and has little impact on performance. Thus, when determining the area requirements of the OOO short vector processor we assume the partitioned datapath implementation, but for our performance study we assume the vector-unit implementation to simplify our simulation infrastructure.

By emulating the OOO short vector extensions with a more conventional vector datapath, we will overestimate the performance of programs using data types that are wider than 8 bits. This is because in a 64-bit partitioned vector implementation, a program with 16-bit data would execute with a vector length of 4 and a vector width of 4. On the other hand, the OOO short vector processor in our study executes such a program with a vector length of 8, a vector memory width of 4, and a vector computational width of 8.

3. Processor Die Areas

In this section, we estimate the area required to implement the simple long vector processor and compare its area to those of existing OOO superscalar processors. We focus only on the processor portion of the die since, in our performance study, we assume the same cache-based memory system for all the processors.

3.1. Simple Long Vector Processor

We determine the area required to implement the simple long vector processor by estimating the space requirements for each of the major processor components: vector datapath, vector register file, scalar datapath, scalar register file, and control logic for instruction issue.

Using the Torrent-0 as a realistic VLSI implementation of a vector microprocessor, we conducted a detailed analysis of T0's area requirements in a previous study [23]. Our analysis showed that, of the major processor components, the most area-intensive for a vector processor are the datapath and vector register file, in contrast to aggressive superscalar processors where control for instruction issue dominates.

For this study, we improve upon our previous analysis by using more precise estimates for the component areas. These are listed in Figure 3. We give areas for two implementations of the long vector processor: an area-efficient implementation and a high-performance implementation. Differences in area between the two are highlighted in bold font. As in our previous study, these areas are based on actual implementations. Because the implementations on which we base the estimates are implemented in varying process technologies, we scale all the areas to a common line size of 0.25µm.

                                             AREA IN MM²
PROCESSOR COMPONENT                      Area Efficient    High Performance
64b Vector Datapath
  8 integer units                            24.0               36.0
  load/store unit                             3.0                3.0
64b Vector Register File
  32 64-element vector registers              9.5               19.0
64b MIPS R5000
  scalar integer and FP datapath             10.3               10.3
  scalar integer and FP register file         0.5                0.5
  instruction issue                           0.8                0.8
Clocking and Overhead                         4.0                4.0
TOTAL                                        52.3               73.8

Figure 3. Component Areas for Two Implementations of the Long Vector Processor

The area-efficient 64b integer units are 3mm² each. This area has been extrapolated from T0's 32b MIPS scalar datapath and 16x16 multiplier array as follows: 2 x (area of the 32b scalar datapath) + 4 x (area of the 16x16 multiplier array) = (2 x 1 + 4 x 0.25) mm² = 3mm². The 16x64 multiplier array allows vector multiplies of narrow data types (16 bits or narrower) to be fully pipelined.

As independent evidence that 3mm² is a reasonable area estimate for a 64b integer unit, we measured the areas of integer units in several in-order and out-of-order superscalar implementations using annotated photomicrographs and floorplans [11, 3, 13, 10, 26, 9]. The areas of the integer "execution box" for in-order processors such as the MIPS R5000 and Alpha 21164 range from 3.25mm² to 3.72mm². These execution boxes include iterative multiplier/dividers and two ALUs. The OOO counterparts, MIPS R10000 and Alpha 21264, have adder/multiplier units that measure 4.5-4.65mm². Additional bussing and reservation stations to support out-of-order execution are the reasons for the increase in area.

Because the simple long vector processor uses in-order issue, it would use integer units closer in design to the in-order units. Although the area we choose is slightly smaller than that of the in-order units, the T0 implementation suggests that this is a conservative estimate. The individual integer units of T0's vector datapath are much smaller than the MIPS scalar datapath because they do not require hardware, such as logic for PC handling, that is included in the scalar core. Nonetheless, we conservatively base our area estimate on the larger datapath.

We estimate the area for a high-performance integer unit to be 4.5mm², which is the area for an OOO unit.

We extrapolate the area for the load/store unit from T0's vector memory datapath as follows: (area of the 128x256 crossbar) + 2 x (area for shifting/aligning 32b data). The load/store unit interfaces with a 64-bit memory data bus on one side and 8 64-bit data busses on the other, thus requiring a 64x512 crossbar to transfer data from the memory system into the vector register file. T0 has a 128x256 crossbar to perform the same functionality, and that crossbar occupies about 1mm². This crossbar can be restructured without any area growth to provide the necessary functionality for the long vector processor. Shifting and aligning 64b data between the memory data bus and register busses requires twice the area used in the 32b T0 vector memory unit.

Because there is only one address bus, only one address needs to be computed per clock cycle. Moreover, scalar and vector instructions use the same 64b data bus to the memory system. Thus, computing addresses and handling TLB functions is handled by hardware in the scalar portion of the processor, and additional area for address functionality is not needed in the vector load/store unit.

We compute the areas for the two implementations of our vector register file based on layout details given by Asanović in Chapter 5 of his doctoral dissertation [1]. In addition to giving the area for storing data, the details of the register file design also include the area for overhead circuitry such as read sense amplifiers, data latches for writes and reads, multiplexors, and drivers. The areas of all the storage and overhead components are based on cells used in the T0 implementation.

The area-efficient vector register file time-multiplexes the word and bit lines at twice the clock rate to allow both a read and a write access in the same clock period. The high-performance implementation adds additional bussing and ports to avoid time-multiplexing while still allowing simultaneous read and write accesses.

We use the MIPS R5000 [11] as the 64-bit scalar processor, which is responsible for instruction control, scalar execution, and data address processing.

3.2. Comparison of Die Areas

Figure 4 plots the areas of the simple long vector processor and several superscalar processors. Also shown is a breakdown of each processor's area into its major components ("Other" is the area for clocking and overhead). We obtained the areas of the superscalar processors by measuring annotated die photomicrographs and floorplans [10, 26, 9].

[Figure 4 (bar chart): processor die area in mm², scaled to 0.25µm, broken down into Datapath, Registers, Instruction Issue, and Other. Total processor areas: MIPS R5000 (in-order 2-way superscalar) 12; MIPS R10000 (OOO 4-way superscalar) 67; Alpha 21264 (OOO 4-way superscalar) 70; HP PA-8000 (OOO 4-way short vector) 68; simple long vector, area-efficient 52; simple long vector, high-performance 73.]

Figure 4. Breakdown of Processor Areas

A comparison of die areas of different implementations can be complex. Differences in area may be due to many factors that are orthogonal to how a processor exploits parallelism. These factors include fabrication technology, non-processor components, datapath width, and circuit and layout design. We want to minimize differences due to these factors so that remaining differences will be due to parallel-specific features such as datapath organization, register file implementation, and instruction-issue mechanism.

To this end, all die areas are scaled to a 0.25µm process to eliminate areal differences due to differences in line size. The areas that we compare are for processor components only. Excluded are cache and TLB structures, external interface logic, and the pad ring. Lastly, all areas are based on actual VLSI implementations.

The MIPS R10000, Alpha 21264, and HP PA-8000 are existing implementations of OOO superscalar and OOO short vector processors. Of the three OOO superscalar processors, only the PA-8000 implements its short vector extensions, MAX-2. The implementation for the MAX-2 instructions occupies less than 0.2% of the area shown in Figure 4 [25]. For comparison purposes, areas for the MIPS R5000 are also included.

The area-efficient implementation of the long vector processor is about 70-75% the size of the OOO implementations, while the high-performance version is 4-7mm² larger. Note that this is only for the processor portion of the die. Typically there are on-chip caches as well as hardware for TLBs and external interface logic. For the MIPS R10000 and Alpha 21264, the processor portion occupies 45% of the die, while 85% of the HP PA-8000 die is dedicated to the processor. The reason a larger portion of the die is used for the PA-8000 processor is that its die does not include any on-chip caches. Thus, while 7mm² is a 10% increase over the
processor area of an OOO superscalar implementation, it is at most a 5% growth in total die area after non-processor components such as on-chip caches are considered.

Consistent with the results in our previous study, the area breakdown shows that the simple long vector processor uses its transistors differently from the OOO superscalar processors. For the long vector processor, 70% of the die is dedicated to functional units, while only 1-2% is used for instruction issue. The remaining 20-25% of the die is used for the large vector register file. In contrast, the OOO superscalar processors use 40-65% of the area for hardware structures to support multi-way and out-of-order instruction issue. This is comparable to or, in some cases, more than the area used for datapath.

The reason the long vector processor is able to trade instruction-issue area for datapath and registers is that it uses vector instructions that compactly encode parallel operations. Thus much of the functionality for detecting parallelism is transferred to software. Figure 4 quantifies this tradeoff and shows that using area for datapath and registers rather than control is a reasonable alternative to existing designs. In the next section, we will show that this tradeoff also provides significant performance gains.

We have been conservative with our estimates of the long vector processor's die area. For example, we use full 64-bit integer units for the vector datapath and 64-bit elements for the vector register file. A more area-conservative design could use a partitioned datapath such as in the HP PA-8000 and implement a correspondingly narrower vector register file. In addition, by restructuring the vector register file, it is possible to reduce its area requirements by almost half without sacrificing bandwidth [1]. In short, it may be possible to implement a simple long vector multimedia processor in one-third to one-half of the area of existing OOO superscalar processors.

4. Performance Results

We now investigate the performance of vector processors executing multimedia applications.

Figure 5 describes the programs and inputs we use as the workload for our performance study. The widths of the dominant data types are also listed. All programs are written in C. Four are image processing applications while the other two programs are versions of the IDEA encryption/decryption algorithm for PGP. Details of the two versions and an analysis of their performance difference are given elsewhere [23]. All these programs contain highly vectorizable functions where much of the execution time is spent.

The same C source code is executed on all three processors. For superscalar-only execution, we compile and link the programs with the SGI C compiler. For vector execution, we compile the vectorizable functions with VSUIF, a vectorizing compiler we developed [6]. Because little effort has been put into optimizing the code generated by VSUIF, non-vectorizable portions are compiled with the SGI compiler to benefit from its optimizing capabilities. The SGI linker is used to combine the separately compiled object files into an executable.

To generate performance statistics, both superscalar and vector executables are executed using Vmable, a MIPS ISA simulator that we modified to emulate the execution of vector instructions. Correctness, especially of the vector executable, is verified by comparing Vmable's output against that of native execution. Vmable also generates instruction and data addresses that are input to Vcello, our vector-enhanced version of Todd Mowry's cello timing simulator. Vcello can be configured to emulate the timing of the processors that we study. Vcello also keeps track of cache misses, branch mispredictions, etc., and produces architectural performance statistics that characterize the execution of a program.

[Figure 6 (bar chart): speedup over the OOO superscalar processor for the OOO short vector and simple long vector processors on each benchmark (chroma, colorspace, composite, convolve, decrypt.inter, decrypt.unroll), together with the arithmetic average and geometric mean.]

Figure 6. Processor Performance

Figure 6 shows the performance of the two vector processors relative to the superscalar processor. The speedup graph shows that vector execution is faster than wide-issue, out-of-order scalar execution: the short vector processor is on average 1.7 times faster than the superscalar processor, and the long vector processor is 2.7 times faster.² This is not surprising since the vector processors support more parallelism. Recall, however, that the cost of this greater support, as measured by die area, is not significantly larger than that of the OOO-only implementation.

The OOO short vector and simple long vector processors have the same peak computational rate and peak memory bandwidth. Nonetheless, the simple longer vector processor is 1.25 to 2.0 times faster than the complex shorter vector processor. Across all the programs,

²When reporting averages in the text, we use the geometric mean (GM) because we are averaging ratios (CPI and CPO) and normalized data [8]. Consequently, we can use the GM's multiplicative properties when summarizing our analyses of CPI and CPO data. We also show the arithmetic average, which has a more intuitive interpretation, together with the geometric mean in the graphs. Because both averages are within a few percent of each other for the data we present, the geometric mean is a reasonable approximation of the arithmetic average.
BENCHMARK        DATA WIDTH   INPUT                 DESCRIPTION
chroma           8 bit        320x240 24-bit        Merges two images on the basis of a "whiteness" threshold.
colorspace       8 bit        color image(s)        Converts an image in RGB to YUV values.
composite        8 bit                              Blends two images together by a blend factor alpha.
convolve         8,16 bit                           Convolves an image with a 3x3 16-bit kernel.
decrypt.unroll   16 bit       16,000-byte           Unrolled version of IDEA decryption.
decrypt.inter    16 bit       message               Loop-interchanged version of IDEA decryption.

Figure 5. Benchmark Programs
the long vector processor executes on average 1.6 times faster than the short vector processor.

In the remainder of this section, we analyze the performance of the three processors to explain why the vector processors are faster than the superscalar-only processor and why the longer vector is faster than the shorter.

4.1. CPI

The classic method for determining why one computer executes faster than another is to deconstruct each computer's execution time into the three components of the CPU performance equation: cycle time, average cycles per instruction (CPI), and dynamic instruction count [17]. In the remainder of this paper, we refer to this equation as the "CPI equation" to distinguish it from the "CPO equation" that we discuss in the next subsection.

For the purposes of this study, we make the reasonable assumption that all three processors can be implemented with similar state-of-the-art clock frequencies. OOO short vector processors have already been implemented at the same clock frequencies as OOO superscalar processors, while a long vector processor with its simpler control logic could be implemented with a much faster clock frequency. Thus this assumption potentially underestimates the performance of the simple long vector processor.

[Figure 7 (bar charts): top, cycles per instruction for each processor on each benchmark; bottom, dynamic instruction count in millions, broken down into scalar instructions, vl8 vector instructions, and vl64 vector instructions.]

Figure 7. CPI & Dynamic Instruction Count

The graphs in Figure 7 show the data for the other two components of the CPI equation. For the dynamic instruction count, we also show the number of scalar and vector instructions executed for each program.

As expected, the CPI for the superscalar processor is low: between 0.4 and 0.7, or equivalently between 1.4 and 2.5 instructions per cycle.

All instructions in the short vector processor, including the vector ones, still require a minimum of only one clock period to execute. Nonetheless, this processor has a CPI that is, on average, 2.5 times higher than the CPI of the superscalar processor. The higher CPI suggests that less instruction-level parallelism is being exploited. One reason for the reduced ILP is that by using vector instructions, there is less parallelism to be used at the instruction level. A vector instruction executes 8 operations that, in a scalar processor, would be 8 separate instructions that could be executed in parallel.

The CPI for the long vector processor is very high, ranging from 2.4 to 6.4 cycles per instruction. This is 6 to 12 times greater than the superscalar's CPI. The reason for the substantially higher CPI is that a vector instruction for the long vector processor takes 8 clock periods to execute.

Despite the substantially higher CPIs, the vector processors are faster than the superscalar processor. To understand why, it is important to consider the other half of the CPI equation: the dynamic instruction count. The bottom half of Figure 7 shows the number of instructions executed by each processor.

An immediately obvious trend among the processors is that the dynamic instruction count is inversely proportional to the CPI. The superscalar processor, which has the lowest CPI, executes far more instructions than the vector processors. In contrast, the long vector processor with the highest CPI executes 12-34 times fewer instructions than the superscalar processor and 3.7-7.4 times fewer instructions than the short vector processor.

The greatly reduced instruction count is the result of using stripmined vector code. A stripmined vector loop that uses vector lengths of 64 is comparable to implicitly unrolling the loop 64 times. The scalar instructions that handle the bookkeeping of the stripmined code correspond to the loop overhead code of the original loop. The vector instructions execute the operations of the loop body itself.

Stripmining a loop is better than explicitly unrolling it for two reasons. First, the stripmined vector loop reduces the dynamic count for both the loop overhead instructions and the loop body instructions, while explicit unrolling reduces only the overhead count. Second, stripmining achieves this reduction without substantially increasing the static instruction count, whereas explicit unrolling requires replicating loop body instructions.

An example of a loop overhead instruction is one that updates the loop index variable (add r1,r1,1). In the rolled version of a loop, the same scalar instruction for carrying out this functionality is executed 64 times. [...] achieved by using vector instructions, a technique obviously not available for non-vector processors. In short, a stripmined vector loop of length VL can reduce the dynamic instruction count for that loop by a factor close to VL while keeping the static count close to that of the original rolled version.

It is not always possible, however, to reduce the overall dynamic instruction count of the scalar version by a factor of VL, for several reasons. The most obvious reason is that vector loops do not always dominate the execution time. While the bulk of time is spent in vector loops for the programs in this study, a small but significant amount is spent in non-vectorizable parts of the program.

Another reason is that there is not always a one-to-one correspondence between instructions in the rolled loop version and instructions in the stripmined version.
As a The stripmined and unrolled versions would collapse result, the static instruction count for the stripmined ver- these into a single instruction(add r1,r1,64) that is sion can be slightly higher than the count for the rolled executed once for every 64 loop iterations. Similar trans- version. For example, instructions to update the vector formations are applied to other scalar instructions that length register must be introduced into the stripmined constitute the loop overhead. Examples include those version. In addition, depending upon the ISA support that update address values. In this way the number of for conditional vector execution, more vector instruc- loop overhead instructions executed is reduced by a fac- tions may be used to handle conditional statments than tor of 64. in the scalar version. A ®nal example is when the scalar An example of an instruction sequence that includes version is unrolled several times. The SGI compiler we a loop bodyinstructionis one that loads an element from use for the superscalar processor will unroll loops 2 to 4 an array and updates the associated address register: times depending upon the size of the loop body. load r3,0(r2); add r2,r2,4. In this case, the Even though it is not always possible to achieve a re- load instruction is a loop body instruction while the duction of VL in the dynamic instruction count, using add instruction is a loop overhead instruction. In the stripmining to reduce this count is more effective than rolled version, these two instructions would be executed explicity unrolling. This is because VL can be much 64 times resulting in a dynamic instructioncount of 128. greater than the amount typically used for unrolling. For the version that explicitly unrolls the loop 64 Small values of VL such as 8, are large for explicit un- times, the load instruction would be repeated 64 times rolling, while more typical VL values, such as 32 or but with a different offset for each load instruction. 
64, are ineffective for unrolling due to the tremendous The address update instruction, however, would not be growth in the number of static instructions generated as repeated and instead would be transformed into add well as the number of registers required. On the other r2,r2,256. Thusthenumberof static instructions has hand, support for vector execution with length 32 or 64 grown from 2 to 65, whilethe dynamic instruction count is not prohibitivelyexpensive as explained in the previ- has been reduced from 128 to 65. ous section. The stripmined vector version reduces the dynamic Figure 8 summarizes the results of our CPI analysis. count even further, while keeping the static count com- To explain why the vector processors are faster than the parable to the rolled version. The 64 load instructions superscalar processor, we express the average speedup of are now replaced with a single vector load instruction a vector processor as a function of the superscalar pro- vload v3,(r2),4 that accesses memory locations cessor's average CPI (CPISS) and average dynamic in- that are 4 bytes apart. This results in only 2 instructions struction count (NISS). This is possible because the ge- being executed to carry out 64 loads and their associated ometric mean is used to compute the averages and we address updates. use the multiplicative properties of the geometric mean The reduction in dynamic instruction count for loop to guarantee equivalency between the two expressions. overhead is obtained by solving simple linear recur- From Figure 8, we observe that vector processors rences, a technique that is independent of instruction have a signi®cantly higher average CPI than the super- type and thus available to all three processors. On the scalar processor does: the average CPI of the short vec- other hand, the tremendous savings for the loop body is tor processor is 2.5 times higher while the long vector's
Figure 8. Summary of the CPI analysis.

PROCESSOR        | CYCLES PER  | DYNAMIC     | CYCLE         | SPEEDUP OVER
                 | INSTRUCTION | INSTRUCTION |               | SUPERSCALAR
                 |             | COUNT       |               |
-----------------|-------------|-------------|---------------|-------------
OOO superscalar  | CPI_SS      | N_SS        | CPI_SS x N_SS | 1.00