Scalable Vector Processors for Embedded Systems

For embedded applications with data-level parallelism, a vector processor offers high performance at low power consumption and low design complexity. Unlike superscalar and VLIW designs, a vector processor is scalable and can optimally match specific application requirements.

Christoforos E. Kozyrakis, Stanford University
David A. Patterson, University of California at Berkeley

Designers of embedded processors have typically optimized for low power consumption and low design complexity to minimize cost. Performance was a secondary consideration. Nowadays, many embedded systems (set-top boxes, game consoles, personal digital assistants, and cell phones) commonly perform computation-intensive media tasks such as video processing, speech transcoding, graphics, and high-bandwidth telecommunications. Consequently, modern embedded processors must provide high performance in addition to low cost. They must also be easy to scale and customize to meet the rigorous time-to-market requirements for consumer electronic products.

The conventional wisdom for high-performance embedded processors is to use the superscalar or very long instruction word (VLIW) paradigms developed for desktop computing. Both approaches exploit instruction-level parallelism (ILP) in applications in order to execute a few operations per cycle in parallel. Superscalar processors detect ILP dynamically with hardware, which leads to increased power consumption and complexity. VLIW processors rely on the compiler to detect ILP, which leads to increased code size. Both approaches are difficult to scale because they require either significant hardware redesign (superscalar) or instruction-set redefinition (VLIW). Furthermore, scaling up either of the two exacerbates their initial disadvantages.

This article advocates an alternative approach to embedded processors that provides high performance for critical tasks without sacrificing power efficiency or design simplicity. The key observation is that multimedia and telecommunications tasks contain large amounts of data-level parallelism (DLP). Hence, it's not surprising that we revisit vector architectures, the paradigm developed for high performance with the large-scale DLP available in scientific computations.1 Just as superscalar and VLIW processors for desktop systems adjusted to accommodate embedded designs, we can revise the vector architectures of supercomputers to serve in embedded applications.

To demonstrate that vector architectures meet the requirements of embedded media processing, we evaluate the Vector IRAM, or VIRAM (pronounced "V-IRAM"), architecture developed at UC Berkeley, using benchmarks from the Embedded Microprocessor Benchmark Consortium (EEMBC).

Published by the IEEE Computer Society, 0272-1732/03/$17.00 © 2003 IEEE

Our evaluation covers all three components of the VIRAM architecture: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that a compiler can vectorize embedded tasks automatically without compromising code density. We also describe a prototype that outperforms high-end superscalar and VLIW designs by 1.5× to 100× for media tasks, without compromising power consumption. Finally, we demonstrate that clustering and modular design techniques let a vector processor scale to tens of arithmetic data paths before wide instruction-issue capabilities become necessary.

Vector instruction set for multimedia

The VIRAM architecture is a complete load-store vector instruction set defined as a coprocessor extension to the MIPS architecture. VIRAM's core features resemble those of vector architectures for supercomputing. The vector state includes a vector register file (VRF) with 32 registers that can store integer or floating-point elements, a 16-entry flag register file that contains vectors with single-bit elements, and a few scalar registers for control values and base memory addresses. The number of elements per vector or flag register is implementation dependent. The instruction set contains integer and floating-point arithmetic instructions that operate on vectors stored in the VRF, as well as logical functions and operations such as population count that use the flag registers. Vector load and store instructions support the three common access patterns: unit stride, strided, and indexed (scatter/gather).

To enable vectorization of multimedia tasks, VIRAM includes several media-specific enhancements. The elements in the vector registers can be 64, 32, or 16 bits wide; thus, multiple narrow elements can occupy the storage location for one wide element. Similarly, we partitioned each 64-bit data path in the processor implementation so that it can execute multiple narrow-element operations in parallel. Instead of specifying the element and operation width in the instruction opcode, we use a control register typically set once per group of nested loops. Integer instructions support saturated and fixed-point arithmetic. Specifically, VIRAM includes a flexible multiply-add model that supports arbitrary fixed-point formats without using accumulators or extended-precision registers. Three vector instructions implement element permutations within vector registers. We limit their scope to the vectorization of dot products (reductions) and fast Fourier transforms, which makes them regular and simple to implement. Finally, VIRAM uses the flag registers to support conditional (predicated) execution of element operations for virtually all vector instructions2 and to support speculative vectorization of loops with data-dependent exit points.

The VIRAM architecture includes several features that help in developing general-purpose systems: features untypical of traditional vector designs. It provides full support for paged virtual addressing, using a separate translation look-aside buffer (TLB) for vector memory accesses. It also provides a mechanism that lets the operating system defer the saving and restoring of vector state during context switches until it knows that the new process uses vector instructions.

Vectorizing compiler

The VIRAM compiler can automatically vectorize C, C++, and Fortran 90 programs. It's based on the PDGCS compilation system for Cray supercomputers such as the C90-YMP, the T3E, and the X1. The optimizer has extensive capabilities for automatic vectorization, including outer-loop vectorization and the handling of partially vectorizable language constructs. For many multimedia codes, including all those considered in this article, vectorization is possible without modifications to the original code; special function intrinsics or custom libraries are not necessary for vectorization. However, for certain cases with irregular scatter/gather patterns, the programmer must use a small set of pragmas to help the compiler with data-dependence analysis.

The two main challenges in adapting a supercomputing compiler to a multimedia architecture were support for narrow data types and the vectorization of dot products. We modified the compiler to select the vector element and operation width for each group of nested loops in two passes. The compiler also recognizes reductions with common arithmetic, logical, and comparison operations, and it uses the permutation instructions to vectorize them.
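VIRAM's width-agnostic, saturating fixed-point multiply-add can be illustrated with a small behavioral model. This is an illustrative sketch in ordinary Python, not VIRAM code; the function names and the fixed-point format parameter are assumptions made for the example.

```python
# Behavioral sketch (not VIRAM code): a fixed-point multiply-add over a
# vector of narrow elements. The element width comes from a control
# value rather than the opcode, mirroring VIRAM's width control register.

def saturate(x, bits):
    """Clamp a signed result to the range of a `bits`-wide element."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def vmuladd_fixed(va, vb, vc, width_bits, frac_bits):
    """result[i] = saturate(((a[i]*b[i]) >> frac_bits) + c[i]).

    `frac_bits` selects the fixed-point format, so no accumulator or
    extended-precision register is needed for intermediate products.
    """
    return [saturate((a * b >> frac_bits) + c, width_bits)
            for a, b, c in zip(va, vb, vc)]

# 16-bit elements: the first product overflows and saturates to 32767.
print(vmuladd_fixed([30000, -100], [2, 4], [10000, 0],
                    width_bits=16, frac_bits=0))  # -> [32767, -400]
```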

NOVEMBER–DECEMBER 2003 37 MICRO TOP PICKS

Table 1. Vectorization and code size statistics for the EEMBC benchmarks.

                  Vector ops   Avg. vector length   VIRAM code     Code size ratio vs. VIRAM
Benchmark         (%)          (elements, width)    size (bytes)   MIPS   x86   VLIW   VLIW-opt
Consumer
  Rgb2cmyk        99.6         128 (16b)            672            2.7    1.1   3.8    9.1
  Rgb2yiq         98.9         64 (32b)             528            3.0    1.7   8.2    65.5
  Filter          99.2         106 (16b)            1,328          1.5    0.7   3.5    2.7
  JPEG            66.0         70 (16b), 13 (32b)   65,280         0.9    0.5   1.8    2.6
Telecom
  Autocor         94.7         43 (32b)             1,072          1.1    0.5   0.9    1.4
  Convenc         97.1         128 (16b)            704            2.3    1.1   1.6    2.9
  Bital           95.7         38 (32b)             1,024          1.5    0.7   2.3    1.3
  FFT             98.9         64 (32b)             3,312          1.7    4.7   0.9    1.1
  Viterbi         92.1         18 (16b)             2,592          1.5    0.5   0.7    1.0
Average           93.8         86 (16b), 39 (32b)   NA             1.8    1.3   2.6    9.7

Table 1 shows the results from using the VIRAM compiler with the telecommunications and consumer benchmarks in the EEMBC suite. Three benchmarks (Autocor, Bital, and FFT) use permutation instructions, and three other benchmarks (JPEG, Bital, and Viterbi) require conditional vector execution. The degree of vectorization (the percentage of all operations specified by vector instructions) is greater than 90 percent for nine out of 10 benchmarks; the 10th benchmark's degree of vectorization is 66 percent. The overall high degree of vectorization suggests that the compiler is effective at discovering the DLP in each benchmark and expressing it with vector instructions. It also indicates that vector hardware can accelerate a significant portion of the execution time.

Table 1 also shows the average vector length for each benchmark. The maximum vector length for this experiment is 64 elements for 32-bit operations and 128 elements for 16-bit operations. Very long vectors, typical in scientific computing, significantly simplify the processor implementation. Embedded tasks, on the other hand, involve an interesting mix of long and short vectors. Hence, the vector processor must provide high performance for both cases. Note that even for the benchmark with the shortest vectors (Viterbi), vectors are significantly longer than those supported by the 64- and 128-bit single-instruction, multiple-data (SIMD) extensions for superscalar and VLIW processors (MMX, SSE-2, Altivec, or 3DNow).

Finally, Table 1 shows the static code size for the VIRAM instruction set and compares it with the published code sizes for alternative instruction-set architectures such as the MIPS RISC, x86 CISC, Trimedia VLIW (consumer benchmarks only), and TI TMS320C6x VLIW (telecom benchmarks only). The table presents the comparison as the ratio of code size for each architecture to that for VIRAM; a ratio larger than one suggests that VIRAM has higher code density. With the two VLIW architectures, we compare code size after direct compilation of the original source code in addition to the code size after manual optimization for performance (VLIW-opt), using SIMD and DSP instructions when possible.

VIRAM's average code density is similar to that of the x86 and 80 percent better than that of the MIPS ISA, which VIRAM is based on. Vector instructions function as useful macroinstructions, eliminating the loop overhead for small, fixed-size loops.
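The interplay between application vector lengths and the hardware's maximum vector length is handled by the classic strip-mining transformation. The sketch below is illustrative (the names are invented; the maximum-vector-length values follow the figures quoted above: 64 elements at 32 bits, 128 at 16 bits) and shows how a compiler-generated loop covers an arbitrary trip count.

```python
# Sketch of strip-mining: a loop of any trip count is split into strips
# no longer than the hardware's maximum vector length (MVL).

MVL = {32: 64, 16: 128}  # max vector length per element width, in elements

def strip_mine(n, elem_bits):
    """Yield (start, length) strips covering n iterations."""
    mvl = MVL[elem_bits]
    start = 0
    while start < n:
        yield start, min(mvl, n - start)
        start += mvl

def vector_add(dst, a, b, elem_bits=16):
    for start, vl in strip_mine(len(a), elem_bits):
        # One vector instruction processes `vl` elements.
        dst[start:start + vl] = [x + y for x, y in
                                 zip(a[start:start + vl], b[start:start + vl])]

n = 300                       # 300 16-bit elements -> strips of 128, 128, 44
out = [0] * n
vector_add(out, list(range(n)), [1] * n)
print(out[:3], len(list(strip_mine(n, 16))))  # -> [1, 2, 3] 3
```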

[Figure 1a block diagram: a MIPS scalar core with instruction ($I) and data ($D) caches and vector control feeds four vector lanes (Lane 0 through Lane 3); each lane holds flag registers, vector register elements, two ALUs (ALU0, ALU1), and a load-store unit; a 256-byte-wide memory crossbar connects the lanes and I/O to eight DRAM banks (bank 0 through bank 7).]

Figure 1. VIRAM prototype processor: block diagram (a), die photo (b).

A single vector load-store instruction captures the functionality of multiple RISC or CISC instructions for address generation, striding, and indexing. VIRAM's code is 2.5 to 10 times smaller than that of the VLIW architectures. VIRAM's main advantage is that the compiler needn't use loop unrolling or software pipelining for performance reasons, as it must with VLIW processors. VLIW architectures incur the additional penalty of empty slots in their wide-instruction format.

Prototype vector processor

We implemented a prototype vector processor to validate the VIRAM architecture's potential for embedded media processing and as a fast platform for software experimentation. The processor attaches a vector unit as a coprocessor to a 64-bit MIPS core. The VRF has a capacity of 8 Kbytes (32 64-bit, 64 32-bit, or 128 16-bit elements per vector register). There are two vector arithmetic functional units, each with four 64-bit ALUs for parallel element execution, and there's a vector load-store unit with four address generators and a hierarchical TLB. The prototype is actually a system-on-a-chip because it integrates 13 Mbytes of DRAM organized in eight independent banks.

A block diagram of the chip, shown in Figure 1, focuses on the vector hardware and the memory system.3 This design partitions the register and data path resources in the vector coprocessor vertically into four identical vector lanes. Each lane contains several elements from each vector and flag register, and a 64-bit data path from each functional unit. The four lanes receive identical control signals on each clock cycle. The concept of parallel lanes is fundamental to the vector microarchitecture, as it leads to advantages in performance, complexity, and scalability. VIRAM achieves high performance by using the parallel lanes to execute multiple element operations for each pending vector instruction. Having four identical lanes significantly reduces design and verification time. Lanes also eliminate most of the long communication wires that complicate scaling in CMOS technology.4 Executing an element operation for any vector instruction, excluding memory references and permutations, involves register and data path resources within a single lane.

The scalar core in VIRAM is a single-issue, in-order MIPS core with first-level caches of 8 Kbytes each. The MIPS core supplies the vector instructions to the vector lanes for in-order execution.
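The lane partitioning described above can be modeled in a few lines. This is a behavioral sketch, not RTL, and the interleaving rule (element i lives in lane i mod 4) is the standard multi-lane vector organization rather than a statement about VIRAM's exact layout.

```python
# Sketch of lane partitioning: element i of every vector register lives
# in lane i mod NLANES, so an element operation needs only the register
# slice and data path of one lane, and all lanes share the same control.

NLANES = 4

def split_lanes(vreg):
    """Distribute a vector register's elements across the lanes."""
    return [vreg[lane::NLANES] for lane in range(NLANES)]

def lane_parallel_op(va, vb, op):
    """Apply an element operation; each lane touches only its own slice."""
    lanes_a, lanes_b = split_lanes(va), split_lanes(vb)
    result = [0] * len(va)
    for lane in range(NLANES):
        for idx, (a, b) in enumerate(zip(lanes_a[lane], lanes_b[lane])):
            result[lane + idx * NLANES] = op(a, b)  # no cross-lane wires
    return result

print(lane_parallel_op([1, 2, 3, 4, 5, 6, 7, 8],
                       [10] * 8, lambda a, b: a + b))
# -> [11, 12, 13, 14, 15, 16, 17, 18]
```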


Table 2. Characteristics of the embedded processors compared in this study.

                                     Instruction   Execution     Cache size (Kbytes)   Clock freq.   Power   Technology
Processor          Architecture type issue rate    style         L1I   L1D   L2        (MHz)         (W)     (µm)
VIRAM              Vector            1             In order      8     NA    NA        200           2.0     0.18
PowerPC MPC7455    Superscalar       4             Out of order  32    32    256       1,000         21.3    0.18
Trimedia TM1300    VLIW with SIMD    5             In order      32    16    NA        166           2.7     0.25
TI TMS320C6203     VLIW plus DSP     8             In order      96    512   NA        300           1.7     0.13

Vector load and store instructions access DRAM-based main memory directly without using SRAM caches. Embedded DRAM is faster than commodity DRAM but still significantly slower than a first-level SRAM cache.

The processor operates at a modest 200 MHz. The intentionally low clock frequency, along with the in-order, cacheless organization of the vector unit, results in a power consumption of only 2 W. These features also contribute to reduced design complexity: three full-time and three part-time graduate students designed the 125-million-transistor chip over three years. To hide DRAM latency, both load-store and arithmetic vector instructions are deeply pipelined (15 stages).

Prototype processor evaluation

Table 2 compares the VIRAM prototype with three embedded processors whose features represent the great variety in high-performance embedded designs: two basic architectures (out-of-order superscalar and in-order VLIW), a range of instruction issue rates (4 to 8 instructions per clock cycle), varying clock rates (166 MHz to 1 GHz), and differences in power dissipation (1.7 W to 21.3 W). The only common feature of the three embedded processors is their use of SRAM caches for low memory latency.

Before comparing performance, let's look at power consumption and design complexity. The typical power dissipation numbers in Table 2 are not directly comparable because these processors represent different CMOS manufacturing technologies. The VIRAM power figure includes the power for main memory accesses, which is not the case for any other design. In addition, the VIRAM circuits were generated with a standard synthesis flow and were not optimized for low power consumption in any way. The VIRAM chip's power characteristics are entirely due to its microarchitecture.

Nevertheless, we can draw some general conclusions about power consumption. Superscalar, out-of-order processors like the MPC7455 consume the most power because of their high clock rates and the complexity of their control logic. VLIW processors' simplicity and lower clock frequencies result in reduced power consumption despite their high instruction issue rates. The VIRAM processor's microarchitecture permits additional power savings. Low clock frequency and the use of exclusively static circuits make the parallel vector lanes power efficient. The simple single-issue, in-order control logic must fetch and decode only one vector instruction to execute tens of element operations, so it dissipates a minimal amount of power. Finally, the on-chip main memory provides high bandwidth without wasting power on driving off-chip interfaces or on accessing caches for applications with limited temporal locality.

Design complexity is also difficult to compare unless the same group implements all processors in the same technology with the same tools. However, we believe the VIRAM microarchitecture is significantly less complex than superscalar processors. Both the vector coprocessor and the main-memory system are modular, the control logic is simple, and there is no need for caches or for circuit design targeting a high clock frequency. Superscalar processors, on the other hand, include complicated control logic for out-of-order execution, and this logic is difficult to design at high clock rates. Although VLIW processors for embedded applications are simpler than superscalar designs, the high instruction issue rate makes them more complicated than single-issue vector processors. In addition, VLIW architectures introduce significant complexity into compiler development.

[Figure 2: bar charts of performance per MHz for the superscalar, VLIW, VLIW-opt, and VIRAM (1, 2, 4, and 8 lanes) configurations; panel (a) covers Rgb2cmyk, Rgb2yiq, JPEG, Autocor, Bital, and Viterbi, and panel (b) covers Filter, FFT, and Convenc.]

Figure 2. Performance-per-MHz comparison, normalized to the performance of the MPC7455 superscalar processor. Note the change of scale between the benchmarks in (a) and those in (b).

Figure 2 compares performance per MHz,3 where performance is inversely proportional to the execution time for each benchmark. We normalize the performance of each processor to that of the MPC7455 superscalar processor. We compare performance per MHz because we have no fundamental reason to believe that these processors couldn't be clocked at the same frequency, given similar design efforts. If raw performance is of interest, it's easy to scale the results in Figure 2 by the corresponding clock frequency for each design. Interestingly, Figure 2 would look nearly identical if we plotted performance per watt.

Along with showing the performance of the actual VIRAM chip with four lanes (VIRAM-4L), Figure 2 presents simulated performance for two scaled-down variations of the design, with one and two lanes (VIRAM-1L and VIRAM-2L), and a scaled-up version with eight lanes (VIRAM-8L). Excluding the number of lanes, all parameters are identical. We used the same VIRAM executables with all versions of the VIRAM processor; there was no lane-specific recompilation.

For highly vectorizable benchmarks with long vectors, such as Filter and Convenc, VIRAM-4L is up to 100 times faster than the MPC7455. Each vector instruction defines tens of independent element operations, which can use the parallel data paths in the vector lanes for several clock cycles. A superscalar processor, on the other hand, can extract a much smaller amount of ILP from its sequential instruction stream. For benchmarks with strided and indexed memory accesses, such as Rgb2cmyk and Rgb2yiq, VIRAM-4L outperforms the MPC7455 by a factor of 10. In this case, VIRAM-4L's restriction is its address generation throughput for strided and indexed accesses to narrow data types. For partially vectorizable benchmarks with short vectors, such as JPEG and Viterbi, VIRAM-4L maintains a 3× to 5× performance advantage over the MPC7455. Even with just 13 elements per vector, simple vector hardware is more efficient with DLP than aggressive, wide-issue superscalar organizations.

With the original benchmark code, the two VLIW processors' performance is similar to or worse than the MPC7455's. Source code optimizations and the manual insertion of SIMD and DSP instructions lead to significant improvements and bring the VLIW designs with optimized code to within 50 percent of VIRAM's performance. The TM1300 outperforms VIRAM-4L only for JPEG.


The TM1300's performance for JPEG improves considerably because its designers restructured the benchmark code to eliminate the function call within the nested loop for the discrete cosine transform. The same manual transformation would have benefited VIRAM-4L because it would allow outer-loop vectorization and result in long vectors.

Scaling for data-level parallelism

Organizing the vector hardware in lanes provides a simple mechanism for scaling performance, power consumption, and area (cost). With each extra lane, we can execute more element operations per cycle for each vector instruction. This is a balanced scaling method: each lane includes both register and data path resources. Figure 2 shows the VIRAM processor's performance as we scale the number of lanes from one to eight, without modifying the scalar core or the memory system. Except for JPEG, performance scales linearly for up to four lanes. Eight lanes are useful for benchmarks with long vectors (Rgb2cmyk, Convenc, and FFT). For benchmarks with medium-length or short vectors (Viterbi and Bital), eight lanes execute vector instructions so quickly that we can no longer hide the overhead of scalar operations and vector-scalar communication. In other words, VIRAM-8L, a vector configuration with 16 64-bit integer ALUs, is often limited by instruction-issue bandwidth.

Comparable scaling results are difficult to achieve with other architectures. Scaling a four-way superscalar processor to eight-way, for example, would probably lead to a small overall performance improvement. It's difficult to extract such a high amount of ILP, and the wide out-of-order logic's complexity would slow the clock frequency.5 A VLIW processor could achieve similar scaling after recompilation and manual code reoptimization. The cost in this case would be even larger code and increased power consumption for the wide instruction fetch.

Clustered vector processor

Although the VIRAM processor demonstrates vector architectures' great potential for embedded media processing, VIRAM is still susceptible to the basic limitations of traditional vector designs. The complexity of the VRF partition within each lane is the most significant limitation.6 To support the one load-store and two arithmetic vector units, the 2-Kbyte partition requires nine access ports. In general, with N functional units, the VRF partition requires approximately 3N ports. Its area, power consumption, and access latency are roughly proportional to N², log N, and N.7 To avoid this complexity, most vector processors include up to four functional units. However, a larger number of functional units is desirable for applications with short and medium-length vectors, for which multiple lanes do not help with performance. With more functional units, multiple vector instructions could execute in a parallel or overlapped manner, even if their corresponding vectors were relatively short.

To alleviate the problem of VRF complexity, we propose replacing the centralized vector-lane organization in the VIRAM processor with a clustered arrangement. Figure 3a shows the centralized organization. In the clustered organization (Figure 3b), each cluster contains a data path for a single vector functional unit and a few vector registers. A separate intercluster network moves vectors between functional units when needed. The local register file block within each cluster provides operands to one data path and one network interface. Hence, the number of access ports for each block is small (four to five) and independent of the number of vector functional units in the system. In other words, the area, power consumption, access latency, and design complexity of each VRF block are constant.

To use the clustered organization without modifying the VIRAM instruction set, we employ renaming in the vector hardware's control logic. A renaming table identifies the cluster that stores the source and destination registers for a vector instruction and indicates whether an intercluster transfer is necessary, given the cluster that will execute the instruction. Renaming also lets us implement more than the 32 vector registers defined in the instruction set architecture. Extra registers make it possible to tolerate long memory latencies or to implement precise exceptions.

The vector unit's control logic must also select the cluster to execute each vector instruction. If multiple cluster assignments are possible, the control logic tries to minimize the number of transfers without creating a significant load imbalance across clusters.
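A minimal sketch of such a cluster-selection policy follows, assuming a simple cost function (transfer count first, then cluster load). The class and method names are hypothetical illustrations, not the actual control logic.

```python
# Illustrative model of renaming plus cluster selection: a renaming
# table records which cluster holds each architectural vector register,
# and issue logic picks the execution cluster that needs the fewest
# intercluster transfers, breaking ties toward the least-loaded cluster.

class ClusterIssue:
    def __init__(self, n_clusters=4):
        self.home = {}                   # renaming table: vreg -> cluster
        self.load = [0] * n_clusters     # instructions issued per cluster
        self.n = n_clusters

    def issue(self, dst, srcs):
        # Transfers needed if the instruction ran on cluster c:
        # one per source register homed in a different cluster.
        def transfers(c):
            return sum(1 for s in srcs if self.home.get(s, c) != c)
        best = min(range(self.n), key=lambda c: (transfers(c), self.load[c]))
        self.load[best] += 1
        self.home[dst] = best            # rename dst into that cluster
        return best, transfers(best)

iq = ClusterIssue()
print(iq.issue("v1", ["v2", "v3"]))  # no homes yet: cluster 0, 0 transfers
print(iq.issue("v4", ["v1"]))        # v1 lives in cluster 0 -> stay there
```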

[Figure 3: (a) the centralized organization, in which a 32-register vector register file feeds the ALUs and the load-store unit directly; (b) the clustered organization, in which each cluster pairs eight vector registers with one functional unit and an intercluster network moves vectors between clusters; both organizations receive instructions from the issue logic and connect to memory.]

Figure 3. Vector lane organization: centralized (a) and clustered (b).

Overall, the new control logic's functionality resembles that of multicluster superscalar processors.8 It is much simpler, however, because maintaining an issue rate of one vector instruction per cycle is sufficient; a superscalar processor, on the other hand, must issue one instruction per functional unit per cycle.

The clustered organization in Figure 3b associates a small instruction queue with each cluster. Instruction queues permit decoupled execution.9 As long as there are no data dependencies among the instructions assigned to different clusters, the clusters can execute instructions independently of stalls in other clusters. Decoupled execution can hide memory stalls and the latency of intercluster transfers. Data queues for decoupled execution are not necessary in this case because the vector registers within each cluster provide similar functionality.

Evaluation of clustered vector processor

Figure 4 shows the simulated performance improvement when we replace the centralized lane in the VIRAM with the clustered-lane organization. Each lane has four clusters: two for the arithmetic units, one for the load-store unit, and one that provides additional register storage. The intercluster network's bandwidth is two words per cycle; its latency is two cycles. We assume the new design has the same clock frequency as the original, despite the much simpler VRF.

In theory, the original VIRAM design should always be faster: it uses a centralized VRF and communicates vector results between functional units without incurring a penalty. The intercluster network, on the other hand, can become a latency or bandwidth bottleneck for a clustered design.10 Nevertheless, Figure 4 shows that the clustered organization is actually faster than the centralized one for many benchmarks. Decoupled execution hides most intercluster transfer latency. It also hides a significant portion of the memory latency for vector accesses that run into memory bank conflicts or address generation limitations (as in Rgb2cmyk, for example). On average, the clustered organization outperforms the centralized one by 21 percent to 42 percent, depending on the number of lanes in the system.

The clustered vector organization provides a second scaling dimension for the vector processor. We can scale both the number of lanes and the number of vector functional units (clusters) without complicating the VRF design. Figure 5 presents the benefits from these two scaling techniques. We measure speedup over a clustered vector design with one lane and four clusters (Figure 3b). When increasing the number of functional units, we maintain approximately a 2:1 ratio between arithmetic and load-store vector functional units. Every cluster includes eight vector registers; hence, all configurations with more than four clusters implement more than 32 physical vector registers.

Increasing the number of lanes enables the execution of multiple element operations per cycle for each vector instruction in progress.
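The effect of the per-cluster instruction queues described in this section can be shown with a toy timing model. Everything here (register names, latencies, the one-instruction-per-cluster-per-cycle rule) is a hypothetical simplification; the point is only that an independent cluster keeps executing while another waits on a slow result.

```python
# Toy model of decoupled execution: each cluster has its own instruction
# queue and advances unless its next instruction depends on a result
# another instruction has not yet produced.

from collections import deque

def run_decoupled(queues, latency):
    """queues: per-cluster lists of (dst, srcs); latency: dst -> cycles.
    Returns the cycle at which each result becomes available. Assumes
    every source register is eventually produced by some instruction."""
    done = {}                           # register -> cycle value is ready
    qs = [deque(q) for q in queues]
    cycle = 0
    while any(qs):
        cycle += 1
        for q in qs:                    # at most one issue per cluster/cycle
            if q:
                dst, srcs = q[0]
                if all(done.get(s, 0) <= cycle for s in srcs):
                    q.popleft()
                    done[dst] = cycle + latency.get(dst, 1)
    return done

# Cluster 0 stalls behind a slow load "vA" until cycle 6; cluster 1 is
# independent and finishes "vC" and "vD" at cycles 2 and 3 meanwhile.
qs = [[("vA", []), ("vB", ["vA"])], [("vC", []), ("vD", ["vC"])]]
print(run_decoupled(qs, {"vA": 5}))
```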


[Figure 4: bar chart of performance improvement (percent) for the clustered over the centralized organization with 1, 2, 4, and 8 lanes, per benchmark (Rgb2cmyk, Rgb2yiq, Filter, JPEG, Autocor, Convenc, Bital, FFT, Viterbi); the y-axis runs from −50 to 300 percent.]

Figure 4. Performance improvement with the clustered vector organization. The FFT benchmark shows a 0 percent performance degradation.

[Figure 5: line plot of average speedup (1 to 8) versus number of clusters (4, 6, 8, 12, 16) for 1, 2, 4, and 8 lanes.]

Figure 5. Average speedup for the EEMBC benchmarks with scaling of the clustered vector processor.

Two lanes lead to a nearly perfect speedup of 2; four and eight lanes lead to speedups of 3.1 and 4.8. In general, efficiency drops significantly as the number of lanes grows. Applications with very short vectors (for example, Viterbi) cannot efficiently use a large number of data paths. For programs with longer vectors, multiple lanes decrease the execution time of each vector instruction, so hiding the overhead of scalar instructions, intercluster communication, and memory latency becomes more difficult.

Increasing the number of clusters permits parallel execution of multiple vector instructions. Using eight clusters provides a 50 percent to 75 percent performance improvement over the four-cluster case, depending on the number of lanes. Yet performance improves little with more than eight clusters, for two reasons. Primarily, it is not possible to use more than eight clusters efficiently with single-instruction issue, especially in the case of multiple lanes. Furthermore, data dependencies and an increased number of memory system conflicts limit the possibilities for concurrent instruction execution. Overall, combining the two scaling techniques improves performance by a factor of approximately 7, using eight clusters and eight lanes (or 42 arithmetic data paths), before it becomes instruction bandwidth limited.

The clustered organization also allows for precise exceptions in the vector processor at a negligible performance cost. Furthermore, clustering permits high performance with main memory systems that exhibit higher access latencies than the embedded DRAM used in the VIRAM prototype. We earlier presented these experimental results in detail.6

The VIRAM architecture demonstrates the great potential of vector processors for high-end embedded systems. It exploits DLP to provide high performance with simple hardware and low power consumption. Our current work focuses on architectures that combine vector techniques for DLP and multithreading techniques for task-level parallelism, the other major type of parallelism in embedded applications. Hardware and software development for such hybrid architectures poses a number of exciting research challenges. MICRO

Acknowledgments

We thank all the members of the IRAM research group at the University of California at Berkeley. IBM, MIPS Technologies, Cray, and Avanti made significant hardware and software contributions to the IRAM project. This work was supported by DARPA (DABT63-96-C-0056) and by an IBM PhD fellowship.

References

1. R. Espasa, M. Valero, and J.E. Smith, "Vector Architectures: Past, Present and Future," Proc. 12th Int'l Conf. Supercomputing, ACM Press, 1998, pp. 425-432.
2. J. Smith, G. Faanes, and R. Sugumar, "Vector Instruction Set Support for Conditional Operations," Proc. 27th Int'l Symp. Computer Architecture (ISCA 2000), IEEE CS Press, 2000, pp. 260-269.
3. C. Kozyrakis and D. Patterson, "Vector versus Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks," Proc. 35th Ann. Int'l Symp. Microarchitecture (Micro-35), ACM Press, 2002, pp. 283-293.
4. R. Ho, K. Mai, and M. Horowitz, "The Future of Wires," Proc. IEEE, vol. 89, no. 4, Apr. 2001, pp. 490-504.
5. V. Agarwal et al., "Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures," Proc. 27th Int'l Symp. Computer Architecture (ISCA 2000), IEEE CS Press, 2000, pp. 248-259.
6. C. Kozyrakis and D. Patterson, "Overcoming the Limitations of Conventional Vector Processors," Proc. 30th Int'l Symp. Computer Architecture (ISCA 2003), ACM Press, 2003, pp. 283-293.
7. S. Rixner et al., "Register Organization for Media Processing," Proc. 6th Int'l Conf. High-Performance Computer Architecture (HPCA 6), IEEE CS Press, 2000, pp. 375-386.
8. K. Farkas et al., "The Multicluster Architecture: Reducing Processor Cycle Time Through Partitioning," Proc. 30th Ann. Int'l Symp. Microarchitecture (Micro-30), IEEE CS Press, 1997, pp. 327-356.
9. J. Smith, "Decoupled Access/Execute Computer Architecture," ACM Trans. Computer Systems, vol. 2, no. 4, Nov. 1984, pp. 289-308.
10. J. Fisher et al., Clustered Instruction-Level Parallel Processors, tech. report HPL-98-204, HP Labs, Dec. 1998.

Christoforos E. Kozyrakis is an assistant professor of electrical engineering and computer science at Stanford University. His research interests include parallel architectures and compilation techniques for systems ranging from supercomputers to deeply embedded devices. Kozyrakis has a PhD in computer science from the University of California at Berkeley, where he was the architect of the VIRAM processor. He is a member of the IEEE and the ACM.

David A. Patterson teaches computer architecture at the University of California at Berkeley, and holds the Pardee Chair of Computer Science. His work has included the design and implementation of RISC I and the redundant arrays of inexpensive disks (RAID) project. Patterson has a PhD in computer science from the University of California at Los Angeles. He is a fellow of the IEEE Computer Society and the ACM, and a member of the National Academy of Engineering.

Direct questions and comments about this article to Christoforos Kozyrakis, Stanford University, Electrical Engineering Department, Stanford, CA 94304-9030; christos@ee.stanford.edu.
