
Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks

Christoforos Kozyrakis, Electrical Engineering Department, Stanford University, christos@ee.stanford.edu
David Patterson, Computer Science Division, University of California at Berkeley, pattrsn@cs.berkeley.edu

Abstract

Multimedia processing on embedded devices requires an architecture that leads to high performance, low power consumption, reduced design complexity, and small code size. In this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW processors for embedded multimedia applications. The comparison covers the VIRAM instruction set, the vectorizing compiler, and the prototype chip that integrates a vector processor with DRAM main memory. We demonstrate that executable code for VIRAM is up to 10 times smaller than VLIW code and comparable to x86 CISC code. The simple, cache-less VIRAM chip is 2 times faster than a 4-way superscalar RISC processor that uses a 5 times faster clock frequency and consumes 10 times more power. VIRAM is also 10 times faster than cache-based VLIW processors. Even after manual optimization of the VLIW code and insertion of SIMD and DSP instructions, the single-issue VIRAM processor is 60% faster than 5-way to 8-way VLIW designs.

1 Introduction

The exponentially increasing performance and generality of superscalar processors has led many to believe that vector architectures are doomed to extinction. Even in the supercomputing domain, the traditional application of vector processors, it is widely considered that interconnecting superscalar processors into large-scale MPP systems is the most promising approach [4]. Nevertheless, vector architectures provide us with frequent reminders of their capabilities. The recently announced Japanese Earth Simulator, a supercomputer based on NEC SX-6 vector processors, provides 5 times the performance with half the number of nodes of ASCI White, the most powerful supercomputer based on superscalar technology. Vector processors remain the most effective way to exploit data-parallel applications [20].

This paper studies the efficiency of vector architectures for the emerging computing domain of multimedia programs running on embedded systems. Multimedia programs, such as video, speech recognition, and 3D graphics, constitute the fastest growing class of applications [5]. They require real-time performance guarantees for data-parallel tasks that operate on narrow numbers with limited temporal locality [6]. Embedded systems include entertainment devices, such as set-top boxes and game consoles, and portable electronics, such as PDAs and cellular phones. They call for low power consumption, small code size, and reduced design and programming complexity in order to meet the cost and time-to-market requirements of consumer electronics. The complexity, power consumption, and lack of explicit support for data-level parallelism suggest that superscalar processors are not necessarily a suitable approach for embedded multimedia processing.

To prove that vector architectures meet the requirements of embedded media processing, we evaluate the VIRAM vector architecture with the EEMBC benchmarks, an industrial suite for embedded systems. Our evaluation covers all three components of VIRAM: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that the compiler can extract a high degree of data-level parallelism from media tasks described in C and can express it with vector instructions. The VIRAM code is significantly smaller than code for RISC and VLIW architectures and is comparable to that for x86 CISC processors. We describe a simple, low power, prototype chip that integrates the VIRAM architecture with embedded DRAM. The cache-less chip is 2 times faster than a 4-way superscalar processor running at a 5 times higher clock frequency. Despite issuing a single instruction per cycle, it is also 10 times faster than 5-way to 8-way VLIW designs. We demonstrate that the vector processor provides performance advantages for both highly vectorizable benchmarks and partially vectorizable tasks with short vectors.

The rest of this paper is structured as follows.

Section 2 summarizes the basic features of the VIRAM architecture. Section 3 describes the EEMBC embedded benchmarks. Section 4 evaluates the vectorizing compiler and the use of the vector instruction set. It also presents a code size comparison between RISC, CISC, VLIW, and vector architectures. Section 5 proceeds with a microarchitecture evaluation in terms of performance, power consumption, design complexity, and scalability. Section 6 presents related work and Section 7 concludes the paper.

  Scalar Core:         Single-issue 64-bit MIPS core;
                       8K/8K direct-mapped L1 I/D caches
  Vector Coprocessor:  8KByte vector register file (32 registers);
                       2 pipelined arithmetic units;
                       4 64-bit datapaths per arithmetic unit;
                       1 load-store unit (4 address generators);
                       256-bit memory interface
  Memory System:       13 MBytes in 8 DRAM banks;
                       25ns random access latency;
                       256-bit crossbar interconnect
  Technology:          0.18μm CMOS (IBM), 6 layers of copper interconnect
  Transistors:         120M (7.5M logic, 112.5M DRAM)
  Clock Frequency:     200 MHz
  Power Dissipation:   2 Watts
  Peak Performance:    Int: 1.6/3.2/6.4 Gop/s (64b/32b/16b);
                       FP: 1.6 Gflop/s (32b)

Table 1. The characteristics of the VIRAM vector processor chip.

2 Vector Architecture for Multimedia

In this section, we provide an overview of the three components of the VIRAM architecture: the instruction set, the prototype processor chip, and the vectorizing compiler.

2.1 Instruction Set Overview

VIRAM is a complete, load-store, vector instruction set defined as a coprocessor extension to the MIPS architecture. The vector architecture state includes a vector register file with 32 entries that can store integer or floating-point elements, a 16-entry flag register file that contains vectors with single-bit elements, and a few scalar registers for control values and memory addresses. The instruction set contains integer and floating-point arithmetic instructions that operate on vectors stored in the register file, as well as logical functions and operations such as population count that use the flag registers. Vector load and store instructions support the three common access patterns: unit stride, strided, and indexed. Overall, VIRAM introduces 90 unique instructions, which, due to variations, consume 660 opcodes in the coprocessor 2 space of the MIPS architecture.

To enable the vectorization of multimedia applications, VIRAM includes a number of media-specific enhancements. The elements in the vector registers can be 64, 32, or 16 bits wide. Multiple narrow elements are placed in the storage location for one wide element. Similarly, each 64-bit datapath is partitioned in order to execute multiple narrower element operations in parallel. Instead of specifying the element and operation width in the instruction opcode, we use a control register which is typically set once per group of nested loops. Integer instructions support saturated and fixed-point arithmetic. Specifically, VIRAM includes a flexible multiply-add model that supports arbitrary fixed-point formats without using accumulators or extended-precision registers. Three vector instructions implement element permutations within vector registers. Their scope is limited to the vectorization of dot-products (reductions) and FFTs, which makes them regular and simple to implement. Finally, VIRAM supports conditional execution of element operations for virtually all vector instructions using the flag registers as sources of element masks [21].

The VIRAM architecture also includes several features that help with the development of general-purpose systems, which are not typical in traditional vector supercomputers. It provides full support for paged virtual addressing using a separate TLB for vector memory accesses. It also provides a mechanism that allows the operating system to defer the saving and restoring of vector state during context switches, until it is known that the new process uses vector instructions. In addition, the architecture defines valid and dirty bits for all vector registers that are used to minimize the amount of vector state involved in a context switch. Detailed descriptions of the features and instructions in the VIRAM architecture are available in [13].
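To make the media-specific features concrete, the following sketch (ours, not from the paper) shows a Q15 fixed-point scale-and-bias loop of the kind these enhancements target; the helper name sat16 is hypothetical. The narrow data type, the rounding shift, and the saturation map directly onto 16-bit vector elements and the flexible multiply-add model, with the operation width supplied by the control register rather than the opcode.

    #include <stdint.h>

    /* Saturate a 32-bit intermediate result to the 16-bit range,
       instead of letting it wrap around. */
    static inline int16_t sat16(int32_t x) {
        if (x >  32767) return  32767;
        if (x < -32768) return -32768;
        return (int16_t)x;
    }

    /* Q15 scale-and-bias over a 16-bit array: every operation below has
       a direct vector equivalent under a 16-bit operation width. */
    void scale_bias_q15(int16_t *dst, const int16_t *src,
                        int16_t gain, int16_t bias, int n) {
        for (int i = 0; i < n; i++) {
            /* Q15 multiply with rounding, then a saturating add; the
               fixed-point multiply-add model avoids a separate
               extended-precision accumulator. */
            int32_t prod = ((int32_t)src[i] * gain + (1 << 14)) >> 15;
            dst[i] = sat16(prod + bias);
        }
    }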
2.2 Microarchitecture Overview

The VIRAM prototype processor is a simple implementation of the VIRAM architecture. It includes an on-chip main memory system based on embedded DRAM technology that provides the high bandwidth necessary for a vector processor at moderate latency. (The processor can also address additional, off-chip main memory; data transfers between on-chip and off-chip memory are under software control.) Table 1 summarizes the basic features of the chip.

Figure 1 presents the chip microarchitecture, focusing on the vector hardware and the memory system. The register and datapath resources in the vector coprocessor are partitioned vertically into four identical vector lanes. Each lane contains a number of elements from each vector and flag register and a 64-bit datapath from each functional unit. The four lanes receive identical control signals on each clock cycle. The use of parallel lanes is a fundamental concept in the microarchitecture that leads to advantages in performance, design complexity, and scalability. Assuming sufficiently long vectors, VIRAM achieves high performance by executing on the parallel lanes multiple element operations for each pending vector instruction. Since the four lanes are identical, design and verification time is reduced significantly. Lanes also eliminate most long communication wires that complicate scaling in CMOS technology [12]. The execution of an element operation for all vector instructions, excluding memory references and permutations, involves register and datapath resources within a single lane. Finally, the modular implementation provides a simple way for scaling up or down the performance, power consumption, and area (cost) of the vector processor by allocating the proper number of vector lanes.

The VIRAM pipeline is single-issue and in-order. Vector load and store instructions access DRAM main memory directly without using any SRAM caches. In order to hide the latency of random accesses to DRAM, both load-store and arithmetic vector instructions are deeply pipelined (15 stages). The processor operates at just 200 MHz. VIRAM achieves high performance by executing multiple element operations per cycle on the parallel vector lanes. The intentionally low clock frequency, along with the in-order, cache-less organization, allows for low power consumption. These choices also contribute to reduced design complexity. The 120-million transistor chip was designed by 3 full-time and 3 part-time graduate students over a period of 3 years. The microarchitecture and design of the VIRAM chip are described in detail in [14].

Figure 1. The microarchitecture of the VIRAM vector processor chip. [Block diagram: four identical vector lanes under common vector control, connected to the DRAM banks through the memory crossbar.]

2.3 Vectorizing Compiler

The VIRAM compiler is based on the PDGCS compilation system for Cray supercomputers such as the C90-YMP, T3E, and SV2. The front-end allows the compilation of programs written in C, C++, and Fortran90. The PDGCS optimizer has extensive capabilities for automatic vectorization, including outer-loop vectorization and handling of partially vectorizable language constructs.

The two main challenges in adapting a supercomputing compiler to a multimedia architecture were the support for narrow data types and the vectorization of dot-products. We modified the compiler to select the vector element and operation width for each group of nested loops using two passes. During the first pass, it records the minimum data width that satisfies the accuracy requirements of the arithmetic operations in the loop nest. In the second pass, it performs the actual vectorization at the selected width. The compiler also recognizes linear recurrences on common arithmetic, logical, and comparison operations for reductions, and uses the permutation instructions in order to vectorize them.

The code generator in the VIRAM compiler produces correct scalar code for MIPS and vector code for VIRAM. Due to time constraints, it does not include some basic back-end optimizations. It does not move the code for generating constants outside of the loop body (code motion). Due to some legacy code from early Cray machines, it occasionally attempts to use only 8 of the 32 vector registers, which leads to unnecessary spill code. Finally, it does not perform basic block scheduling for the static pipeline of the prototype chip. In other words, it does not take into account the functional unit mix and latencies in the VIRAM processor while scheduling the instructions within each basic block. The missing back-end optimizations can have a significant effect on code size and performance (see Sections 4 and 5 respectively). However, these optimizations are straightforward to add in the future and have been available for years in all commercial compilers. None of the missing optimizations affects automatic vectorization, which is by far the most crucial compiler task for a vector architecture.

Further details on the algorithms used in the VIRAM compiler are available in [15].
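As an illustration of the reduction support described above, consider a fixed-point dot-product. The C sketch below is ours: the compiler recognizes the linear recurrence on acc as a reduction, and the second function models, in scalar code, the halving pattern that the permutation instructions implement on a vector register of partial sums (the function names and the power-of-two vector length are assumptions).

    #include <stdint.h>

    /* A dot-product in the form the PDGCS-based compiler can vectorize:
       the accumulation is a linear recurrence, i.e. a reduction. */
    int32_t dot_q15(const int16_t *a, const int16_t *b, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * b[i];
        return acc;
    }

    /* Scalar model of the final reduction step: the vector of partial
       sums is halved log2(vl) times, adding the upper half onto the
       lower half, which is the pattern the three VIRAM permutation
       instructions are scoped for (vl assumed to be a power of two). */
    void reduce_halves(int32_t *partial, int vl) {
        for (int w = vl / 2; w >= 1; w /= 2)
            for (int i = 0; i < w; i++)
                partial[i] += partial[i + w];
    }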
3 Embedded Multimedia Benchmarks

To evaluate the efficiency of the VIRAM instruction set, microarchitecture, and compiler, we used the EEMBC benchmarks [16]. The EEMBC suite has become the de facto industrial standard for comparing embedded processors. It includes five benchmark categories that represent a wide range of embedded tasks. In this work, we focused on the consumer and telecommunications categories, which represent the typical workload of consumer devices that combine multimedia applications with high-bandwidth, wired or wireless connectivity.

Consumer Category
  Rgb2cmyk   Converts an RGB image to the CMYK format
  Rgb2yiq    Converts an RGB image to the YIQ format
  Filter     High-pass gray-scale image filter
  Cjpeg      JPEG image compression
  Djpeg      JPEG image decompression

Telecommunications Category
  Autocor    Voice compression using autocorrelation
  Convenc    Convolutional encoder for modems
  Bital      Bit allocation to frequency bins for ADSL
  Fft        256-point fixed-point fast Fourier transform
  Viterbi    Viterbi decoding for wireless applications

Table 2. The benchmarks in the consumer and telecommunications categories of the EEMBC suite.

Figure 2. The distribution of vector and scalar operations. [Stacked bars per benchmark: percentage of operations in vector versus scalar instructions.]

Table 2 presents the ten benchmarks in the two categories. The benchmarks are written in C and use integer and fixed-point arithmetic. In all cases, we used the reference datasets provided by EEMBC.

Figure 3. The average vector length in elements. [Bars show the average vector length per benchmark against the maximum length supported for its data type.]

The comparison metrics for EEMBC are performance and static code size. Performance for individual benchmarks is reported in iterations (repeats) per second. A composite score that is proportional to the geometric mean of the individual benchmark scores summarizes each category. With both individual and composite scores, a higher score indicates higher performance. To measure performance reliably, EEMBC recommends running each benchmark tens of times. However, this provides an unfair advantage to cache-based architectures, because it creates artificial temporal locality. After a couple of iterations, even an 8-KByte cache can effectively capture the small datasets for these benchmarks, which allows a cache-based processor to operate as if it had a single-cycle memory latency. The input data for real multimedia applications are streaming, which means that kernels are repeated numerous times, but always with new inputs coming from main memory or other IO devices.

EEMBC allows for two modes of measurement. The out-of-the-box mode uses the binaries produced with direct compilation of the original benchmark code. The optimized mode allows for extensive optimizations of the C or assembly code, but prohibits any algorithmic changes. We used the optimized mode to perform the basic back-end optimizations missing from the code generator in the VIRAM compiler. However, we did not apply any other optimizations, such as loop unrolling or software pipelining. We did not restructure the original benchmark code either, even though it would be very beneficial for Cjpeg and Djpeg.
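For concreteness, the sketch below (ours, with made-up numbers) shows how a composite score of this form relates to the individual results; the actual EEMBC scores also include a proportionality constant, which we omit.

    #include <math.h>
    #include <stdio.h>

    /* Composite category score as the geometric mean of per-benchmark
       results in iterations per second (illustrative values only). */
    int main(void) {
        double iter_per_sec[] = { 120.0, 80.0, 2400.0, 15.0, 9.5 };
        int n = sizeof iter_per_sec / sizeof iter_per_sec[0];

        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(iter_per_sec[i]);
        double geo_mean = exp(log_sum / n);   /* higher is faster */

        printf("composite ~ %.2f\n", geo_mean);
        return 0;
    }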

4 Instruction Set Evaluation

This section presents an evaluation of the ability of the VIRAM instruction set and the compiler to express the data-level parallelism in the EEMBC benchmarks. Unless otherwise stated, the results represent dynamic measurements using the code generated by the compiler.

4.1 Vectorization

Figure 2 presents the percentage of operations specified by vector instructions, also known as the degree of vectorization. A high degree of vectorization suggests that the compiler is effective at discovering the data-level parallelism in each benchmark and expressing it with vector instructions. It also indicates that a significant portion of the execution time can be accelerated using vector hardware.

Figure 4. The distribution of vector operations into instruction categories. [Stacked bars per benchmark over the categories: integer simple, integer multiply, flag, permutations, unit-stride loads and stores, and strided/indexed loads and stores.]

The degree of vectorization is higher than 90% for 8 out of 10 benchmarks. This is the case even for Viterbi, which is considered difficult to parallelize and is only partially vectorized by the compiler. The remaining 10% includes the operations in scalar instructions and the overhead of vector-scalar communication over the coprocessor interface. Cjpeg and Djpeg have the lowest degree of vectorization, approximately 65%. They include functions for Huffman encoding which are difficult to vectorize. Partial vectorization of these functions would only be possible after significant restructuring of the C code.

Figure 3 presents the average vector length, in other words the average number of data-parallel element operations specified by a vector instruction. Most benchmarks use either 16-bit or 32-bit arithmetic operations, for which the maximum supported vector length in the VIRAM processor is 128 and 64 elements respectively. Since Cjpeg and Djpeg use both data types at different points, we report two separate averages for them. Long vectors are not necessary, but they are highly desirable. Operations on long vectors can exploit multiple datapaths in parallel vector lanes for several clock cycles. They also lead to power efficiency, as a single instruction fetch and decode defines a large number of independent operations.

For five benchmarks, the vector length is close to the maximum supported for the data type they use. The average vector length is slightly lower for Autocor and Bital due to the use of reduction operations, which progressively reduce a long vector down to a single element sum. For Cjpeg, Djpeg, and Viterbi, a vector instruction defines 10 to 20 element operations on the average. In the first two, the inner loop for the DCT transformation on 8x8 pixel blocks is in a separate function from the outer loop. If we were to inline the inner loop function by modifying the benchmark code, the compiler would be able to perform outer-loop vectorization, which would lead to long vectors. With Viterbi, the short vectors are fundamental to the nature of its algorithm.

4.2 Instruction Set Use

Figure 4 shows the distribution of vector operations for the ten benchmarks. Simple integer operations, such as add, shift, and compare, account for 42% of all vector operations on the average. The second most frequent category is unit stride loads (17%). The average ratio of arithmetic operations to memory accesses is approximately 2:1.

In general, every vector instruction type is used in at least a couple of benchmarks, which indicates a balanced instruction set design. The permutation instructions account for only 2% of the dynamic operations count. However, they are crucial for vectorizing Autocor, Bital, and Fft, which include reduction or butterfly patterns. Similarly, conditional execution of element operations is essential for vectorizing portions of Cjpeg, Bital, and Viterbi.
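The following if-converted loop sketches what conditional execution buys (our example, not benchmark code): the comparison produces a flag-register element mask, and the masked assignment replaces a per-element branch that would otherwise prevent vectorization.

    #include <stdint.h>

    /* Clip all elements above a limit. On VIRAM, the comparison writes
       a mask of single-bit flag elements, and the store executes only
       where the mask is set, with no per-element branching. */
    void clip_above(int16_t *x, int16_t limit, int n) {
        for (int i = 0; i < n; i++) {
            if (x[i] > limit)   /* becomes a flag-register element mask */
                x[i] = limit;   /* executed only under the mask */
        }
    }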

4.3 Code Size Comparison

Static code size is a secondary issue for desktop and server systems, where high-capacity hard disks store the application executables. Most embedded systems, however, store executables in non-volatile memory such as ROM or Flash. Small code size is important in order to reduce the capacity and, consequently, the cost of the code storage. It also helps with reducing the power consumed for fetching and decoding instructions.

  Architecture  Style      Processor    Issue  Execution     L1I  L1D   L2    Clock     Power
                                        Width  Style
  VIRAM         Vector     VIRAM        1      in order      8K   8K    -     200 MHz   2.0 W
  x86           CISC       K6-IIIE+     3      out of order  32K  32K   256K  550 MHz   21.6 W
  PowerPC       RISC       MPC7455      4      out of order  32K  32K   256K  1000 MHz  21.3 W
  MIPS          RISC       VR5000       2      in order      32K  32K   -     250 MHz   5.0 W
  Trimedia      VLIW+SIMD  TM1300       5      in order      32K  16K   -     166 MHz   2.7 W
  VelociTI      VLIW+DSP   TMS320C6203  8      in order      96K  512K  -     300 MHz   1.7 W

Table 3. The characteristics of the embedded architectures and processors we compare in this paper.

Consumer category. Columns: VIRAM (Vector, cc/opt), x86 (CISC), PowerPC (RISC), MIPS (RISC), Trimedia VLIW+SIMD (cc/opt).

            VIRAM (cc)    VIRAM (opt)   x86           PowerPC       MIPS          VLIW (cc)      VLIW (opt)
  Rgb2cmyk  672 (0.9)     272 (0.4)     720 (1.0)     484 (0.7)     1,782 (2.5)   2,560 (3.6)    6,144 (8.5)
  Rgb2yiq   528 (0.6)     416 (0.5)     896 (1.0)     492 (0.5)     1,577 (1.8)   4,352 (4.8)    34,560 (38.6)
  Filter    1,328 (1.4)   708 (0.7)     944 (1.0)     1,188 (1.3)   1,997 (2.1)   4,672 (4.9)    3,584 (3.8)
  Cjpeg     60,256 (2.0)  58,880 (1.9)  30,010 (1.0)  48,440 (1.6)  58,655 (1.9)  114,944 (3.8)  180,032 (6.0)
  Djpeg     70,304 (2.0)  68,448 (1.9)  35,962 (1.0)  47,672 (1.3)  58,173 (1.6)  117,440 (3.3)  163,008 (4.5)
  Average   (1.4)         (1.1)         (1.0)         (1.1)         (2.0)         (4.1)          (12.3)

Telecommunications category. Columns: VIRAM (Vector, cc/opt), x86 (CISC), PowerPC (RISC), MIPS (RISC), VelociTI VLIW+DSP (cc/opt).

            VIRAM (cc)    VIRAM (opt)   x86           PowerPC       MIPS          VLIW (cc)      VLIW (opt)
  Autocor   1,072 (1.9)   328 (0.6)     544 (1.0)     1,276 (2.3)   1,137 (2.1)   992 (1.8)      1,472 (2.7)
  Convenc   704 (0.9)     352 (0.4)     784 (1.0)     1,788 (2.3)   1,618 (2.1)   1,120 (1.4)    2,016 (2.6)
  Bital     1,024 (1.5)   592 (0.9)     672 (1.0)     1,820 (2.7)   1,495 (2.2)   2,304 (3.4)    1,376 (2.0)
  Fft       3,312 (0.2)   720 (0.1)     15,670 (1.0)  5,868 (0.4)   5,468 (0.3)   2,944 (0.2)    3,552 (0.2)
  Viterbi   2,592 (1.9)   1,152 (0.9)   1,344 (1.0)   6,648 (4.9)   3,799 (2.8)   1,920 (1.4)    2,560 (1.9)
  Average   (1.3)         (0.6)         (1.0)         (2.5)         (1.9)         (1.6)          (1.9)

Table 4. The code size comparison between vector, RISC, CISC, and VLIW architectures. The numbers in parentheses represent the ratio to the x86 code size.

We compared the density of the VIRAM executables to that for a set of CISC, RISC, and VLIW architectures for high-performance embedded processors. Table 3 presents the five alternative architectures, as well as the characteristics of the specific processors we studied. We retrieved their code size and all performance scores for the benchmarks from official EEMBC reports submitted by the corresponding vendors [7]. Since no VLIW organization has been evaluated for both categories, we refer to the Trimedia VLIW architecture [19] for the consumer benchmarks and the VelociTI VLIW architecture [24] for the telecommunications benchmarks.

Table 4 presents the code size comparison. We report the size of executables in bytes, as well as the ratio to the x86 code size in parentheses. The x86 CISC architecture includes variable-length instructions and typically leads to high code density. For VIRAM, we report two numbers for each benchmark. The first one, specified as cc, is the size of the executable produced by the VIRAM compiler. The second one, specified as opt, is the size of the executable after post-processing the compiler output in order to apply the optimizations missing from the code generator. Even though all optimizations target performance improvements, the elimination of unnecessary spill code and redundant calculations of constants also leads to smaller code size. Similarly, we report two sets of numbers for the two VLIW architectures. The first set (cc) represents the size of executables generated by compiling the original benchmark code. The second set (opt) is the result of aggressive reorganization of the C code in order to maximize the amount of instruction-level parallelism that the compiler can extract. The optimizations include function inlining, loop restructuring, and manual insertion of SIMD or DSP instructions. Unlike the back-end optimizations for VIRAM, the optimizations for the Trimedia and VelociTI architectures are benchmark-specific and difficult to incorporate into VLIW compilers.

For the consumer benchmarks, the original and optimized executables for VIRAM are 40% and 10% larger than the x86 code on the average. However, the VIRAM code is actually denser for Rgb2cmyk, Rgb2yiq, and Filter. Cjpeg and Djpeg contain large portions of rarely used code for error handling, for which x86 leads to denser scalar code than MIPS. Optimized VIRAM code has the same code density as PowerPC. However, it is 2 times smaller than the MIPS code, 4 times smaller than the original Trimedia code, and more than 10 times smaller than the optimized Trimedia code. The fundamental reason for the good code density of VIRAM is that the compiler does not use loop unrolling or software pipelining. As we present in Section 5, the VIRAM processor achieves high performance without using the two techniques that lead to bloated code size. On the other hand, VLIW architectures depend on them heavily, especially when targeting maximum performance. VLIW architectures incur the additional penalty of empty slots in their wide instruction format. RISC architectures rely less on loop unrolling, especially if the processor implements out-of-order execution, as with the MPC7455 PowerPC chip. Despite the overhead of coprocessor move instructions, VIRAM compares favorably to the RISC architectures. Vector instructions eliminate the loop indexing overhead for small, fixed-sized loops. In addition, a single vector load-store instruction captures the functionality of multiple RISC instructions for address generation, striding, and indexing.

Table 4 shows that the comparison is similar for the telecommunications benchmarks. The optimized code for VIRAM is smaller than the x86 code for all five benchmarks, as the vector instructions for permutations eliminate the need for complicated loops to express reductions or butterflies. The VelociTI VLIW architecture produces executables as large as those for the two RISC architectures. VelociTI is built on top of a DSP instruction set, with features such as zero-overhead loops and special addressing modes for FFT.

If we compare VIRAM separately to MIPS, the RISC architecture it is based on, we conclude that adding vector instructions to a RISC architecture leads to significant code size reductions. Vector instructions behave as "useful macro-instructions" for the MIPS architecture and allow for code density similar to or better than that of the x86 architecture.
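A small example, ours rather than EEMBC code, of the last point: extracting one color plane from interleaved RGB pixels. The scalar MIPS version spends instructions on address arithmetic and loop bookkeeping for every element, while a strided vector load (stride 3) expresses the entire access pattern in one instruction per vector's worth of pixels.

    #include <stdint.h>

    /* Gather the R component of interleaved R,G,B byte triples. The
       whole loop body maps to one strided vector load and one
       unit-stride vector store, replacing the per-element address
       increment, load, store, and branch of the scalar version. */
    void extract_red(uint8_t *r, const uint8_t *rgb, int npixels) {
        for (int i = 0; i < npixels; i++)
            r[i] = rgb[3 * i];
    }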
5 Processor Evaluation

This section compares the VIRAM chip to the embedded processors in Table 3. Their features are representative of the great variety in high-performance embedded designs. The group includes representatives from three basic architectures (CISC, RISC, VLIW), two execution techniques (in-order, out-of-order), and a range of instruction issue rates (2 to 8), clock rates (166 MHz to 1 GHz), and power dissipations (1.7 W to 21.6 W). The only common feature among all five processors we compare to VIRAM is that they use SRAM caches for low memory latency.

Before we proceed with the performance comparison, it is interesting to discuss power consumption and design complexity. The typical power dissipation numbers in Table 3 are not directly comparable. The K6-III+, VR5000, and TM1300 were designed in 0.25μm CMOS technology, VIRAM and the MPC7455 in 0.18μm, and the TMS320C6203 in 0.13μm. The VIRAM power figure includes the power for main memory accesses, which is not the case for any other design. In addition, the VIRAM circuits were generated with a standard synthesis flow and were not optimized for low power consumption in any way. The power characteristics of the VIRAM chip are entirely due to its microarchitecture.

Nevertheless, we can draw the following general conclusions about power consumption. Superscalar, out-of-order processors like the K6-III+ and MPC7455 have the highest power consumption because of their high clock rates and the complexity of their control logic. The simplicity and lower clock frequency of VLIW processors allow for reduced power consumption despite their high instruction issue rates. The microarchitecture of the VIRAM processor allows for additional power savings. The parallel vector lanes are power efficient because they operate at a low clock frequency and use exclusively static circuits. The control logic dissipates a minimum amount of power because of its simplicity (single-issue, in-order) and because it needs to fetch and decode merely one vector instruction in order to execute tens of element operations. Finally, the on-chip main memory provides high bandwidth without wasting power for driving off-chip interfaces or for accessing caches for applications with limited temporal locality.

Design complexity is also difficult to compare unless the same group of people implements all processors in the same technology with the same tools. However, we believe that the VIRAM microarchitecture has significantly lower complexity than superscalar processors. Both the vector coprocessor and the main memory system are modular, the control logic is simple, and there is no need for caches or circuit design for high clock frequency. These properties allowed 6 graduate students to implement the prototype chip while carrying a full-time course load. On the other hand, superscalar processors include complicated control logic for out-of-order execution, which is difficult to design at high clock rates. The development of a new superscalar microarchitecture typically requires hundreds of man-years [2]. Even though VLIW processors for embedded applications are simpler than superscalar designs, the high instruction issue rate makes them more complicated than single-issue vector processors. In addition, VLIW architectures introduce significant complexity to the compiler development.


Figure 5. The performance of the embedded processors for the EEMBC benchmarks. [Grouped bars per benchmark and for the geometric mean: VIRAM (cc), VIRAM (opt), x86, PowerPC, MIPS, VLIW (cc), and VLIW (opt); off-scale bars are labeled with their values.]

5.1 Performance Comparison

Figure 5 presents the performance comparison for the EEMBC benchmarks. Performance is reported in iterations per second and is normalized by the K6-III+ x86 processor. Even with unoptimized code, VIRAM outperforms the x86, MIPS, and the two VLIW processors (TM1300 and TMS320C6203) running original code by factors of 2.5x to 6.1x for the geometric mean. It is 30% and 45% slower than the 1 GHz MPC7455 PowerPC and the two VLIW processors running optimized code respectively. With optimized (scheduled) code, on the other hand, the VIRAM chip is 1.6 to 18.5 times faster than all other processors. To grasp the significance of the performance results, one should also consider that VIRAM is the only single-issue design in the processor set, it is the only one that does not use SRAM caches, and its clock frequency is the second slowest, just one fifth of that for the 4-way issue MPC7455. In addition, the VLIW performance with optimized code is the result of program-specific optimizations, which are currently not available in research or commercial compilers.

It is also interesting to separate the contribution of the microarchitecture to performance from the contribution of the clock frequency. VIRAM and the two VLIW designs include simple control circuits and could be clocked as fast as the superscalar processors, if not faster. Nevertheless, they use modest clock frequencies (166 MHz to 300 MHz) in order to reduce power consumption. Figure 6 presents the performance results normalized to the clock frequency of each processor. (Due to the measurement methodology for EEMBC, even cache-based designs perform almost no off-chip memory accesses (see Section 3); hence, we can ignore the clock frequency of the off-chip memory system during normalization.)

For highly vectorizable benchmarks with long vectors, such as Filter and Convenc, VIRAM is up to 100 times faster than the MPC7455 PowerPC, the best superscalar design. Each vector instruction defines tens of independent element operations, which can utilize the parallel execution datapaths in the vector lanes for several clock cycles. On the other hand, the superscalar processors can extract a much smaller amount of instruction-level parallelism from their sequential instruction streams. For benchmarks with strided and indexed memory accesses, such as Rgb2cmyk and Rgb2yiq, VIRAM outperforms the MPC7455 by a factor of 10. In this case, the limiting factor for VIRAM is the address translation throughput in the vector load-store unit, which is one half or one quarter of the throughput of each arithmetic unit for 32-bit or 16-bit operations respectively. For partially vectorizable benchmarks with short vectors, such as Cjpeg, Djpeg, and Viterbi, VIRAM maintains a 3x to 5x performance advantage over the MPC7455. Even with just ten elements per vector, simple vector hardware is more efficient with data-level parallelism than aggressive, wide-issue, superscalar organizations.

With the original benchmark code, the two VLIW processors perform similarly to the superscalar designs. Source code optimizations and the manual insertion of SIMD and DSP instructions lead to significant improvements and allow the VLIW designs to be within 50% of VIRAM with optimized code. For the consumer benchmarks, this is due to Cjpeg and Djpeg, for which the performance of the TM1300 VLIW processor improves considerably once the benchmark code is restructured in order to eliminate the function call within the nested loop for DCT.
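The one-half and one-quarter ratios quoted above follow from the Table 1 resources; the model below (a back-of-the-envelope sketch, not a simulation) recomputes them from the four 64-bit lane datapaths per arithmetic unit and the four address generators in the load-store unit.

    #include <stdio.h>

    /* Element operations per cycle per arithmetic unit versus addresses
       generated per cycle for strided/indexed accesses. */
    int main(void) {
        int lanes = 4, addr_gens = 4;
        int widths[] = { 64, 32, 16 };
        for (int i = 0; i < 3; i++) {
            int ops_per_cycle = lanes * (64 / widths[i]);
            printf("%2d-bit: %2d ops/cycle vs %d addresses/cycle (ratio %.2f)\n",
                   widths[i], ops_per_cycle, addr_gens,
                   (double)addr_gens / ops_per_cycle);
        }
        return 0;   /* prints ratios 1.00, 0.50, and 0.25 */
    }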


Figure 6. The performance of the embedded processors for the EEMBC benchmarks normalized by their clock frequency. [Same processor set and layout as Figure 5; off-scale bars are labeled with their values.]

The same transformation would have been beneficial for VIRAM, as it would allow outer-loop vectorization and would lead to long vectors. For the telecommunications benchmarks, the DSP features of the TMS320C6203 VLIW design lead to enhanced performance. However, it is very difficult for a compiler to achieve similar performance levels for an architecture that combines VLIW and DSP characteristics.

The results in Figure 6 would be similar if we normalized performance by the typical power consumption instead of clock frequency. The microarchitecture of VIRAM allows for both higher performance and lower power consumption than superscalar and VLIW processors for the EEMBC benchmarks. We also believe that Figure 6 would be similar if we could normalize performance by some acceptable metric of design complexity.

For reference, the EEMBC scores of the VIRAM processor for the consumer and telecommunications benchmarks are 81.2 and 12.4 respectively with unscheduled code. With optimized code, the scores are 201.4 and 61.7 respectively. (The benchmark scores for the two categories are not directly comparable to each other.) In addition, VIRAM running optimized vector code is over 20 times faster for all benchmarks than VIRAM running scalar code on the simple MIPS core it includes.

5.2 Scalability

One of the most interesting advantages of organizing the vector hardware in lanes is that we can easily scale the performance, power consumption, and area (cost) by allocating the proper number of vector lanes. This is a balanced scaling approach, as each lane includes both register and datapath resources. Figure 7 presents the performance of the VIRAM microarchitecture as we scale the number of lanes from 1 to 8. We normalize performance to that with a single lane. In all cases, we use the same executables and we assume the same clock frequency (200 MHz) and on-chip memory system.

With each extra lane, we can execute more element operations per cycle for each vector instruction. Hence, benchmarks with long vectors benefit the most from additional lanes. For Rgb2cmyk, Convenc, and Fft, performance scales almost ideally with the number of lanes. For Rgb2yiq and Filter, the fast execution of vector instructions with 8 lanes reveals the overhead of instruction dependencies, scalar operations, and vector-scalar communication over the coprocessor interface. Autocor and Bital include reductions, which operate on progressively shorter vectors and cannot always exploit a large number of lanes. Finally, for the benchmarks with short vectors, such as Cjpeg, Djpeg, and Viterbi, additional lanes lead to no significant performance improvements.

Overall, the geometric mean in Figure 7 shows that performance scales well with the number of lanes. Compared to the single-lane case, two, four, and eight lanes lead to approximately 1.7x, 2.5x, and 3.5x performance improvements respectively. Comparable scaling results are difficult to achieve with other architectures. Scaling a 4-way superscalar processor to 8-way, for example, would probably lead to a small overall performance improvement. It is difficult to extract such a high amount of instruction-level parallelism [25], and the complexity of the wide out-of-order logic would slow down the clock frequency [1]. A VLIW processor could achieve similar scaling with highly optimized code, at the cost of even larger code size and increased power consumption for the wide instruction fetch.
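A rough model of this lane-scaling behavior, with an illustrative (not measured) fixed overhead per vector instruction, shows why long vectors scale almost ideally while short vectors see diminishing returns:

    #include <stdio.h>

    /* Cycles per vector instruction: roughly ceil(VL / lanes) element
       groups, plus a fixed startup overhead that short vectors cannot
       amortize. The overhead of 8 cycles is illustrative only. */
    static double speedup(int vl, int lanes, int overhead) {
        double t1 = vl + overhead;                        /* one lane  */
        double tn = (vl + lanes - 1) / lanes + overhead;  /* ceil      */
        return t1 / tn;
    }

    int main(void) {
        int vls[] = { 128, 16 };   /* a long-vector and a short-vector case */
        for (int i = 0; i < 2; i++)
            for (int lanes = 2; lanes <= 8; lanes *= 2)
                printf("VL=%3d, %d lanes: %.2fx\n",
                       vls[i], lanes, speedup(vls[i], lanes, 8));
        return 0;
    }

With VL=128 the model predicts near-linear gains up to 8 lanes, while with VL=16 the fixed overhead dominates, mirroring the contrast between Fft and Viterbi in Figure 7.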


Figure 7. The performance of the VIRAM processor as a function of the number of vector lanes. [Bars per benchmark and for the geometric mean: speedup over a single lane for 1, 2, 4, and 8 lanes.]

6 Related Work

The popular approach to efficient media processing on superscalar and VLIW processors is the use of SIMD extensions, such as SSE [23] and Altivec [17]. SIMD instructions define arithmetic operations on short, fixed-length vectors of 32-bit or 16-bit elements, stored in 64-bit or 128-bit registers. In Section 4 (Figure 3), we showed that all EEMBC benchmarks include vectors with more than 4 to 8 elements. Hence, a superscalar processor must issue multiple SIMD instructions per cycle in order to express the available data-level parallelism and use parallel datapaths. In addition, SIMD extensions do not support vector memory accesses. The overhead of the sequence of load, unpack, rotate, and merge instructions necessary to emulate a strided or indexed vector access often cancels the benefits of SIMD arithmetic and complicates automatic compilation.
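The sketch below (ours) shows the strip-mining that a fixed-width SIMD extension forces on even the simplest loop; a vector ISA expresses the same computation without a separate cleanup loop, because the vector length register absorbs the remainder.

    #include <stdint.h>

    /* Element-wise add, hand strip-mined for a hypothetical 4-wide SIMD
       unit (128-bit registers holding 32-bit elements). The unrolled
       body models one SIMD instruction; the tail loop is the scalar
       cleanup code that a vector length register makes unnecessary. */
    void add_arrays(int32_t *c, const int32_t *a, const int32_t *b, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {     /* one 4-wide SIMD op per pass */
            c[i+0] = a[i+0] + b[i+0];
            c[i+1] = a[i+1] + b[i+1];
            c[i+2] = a[i+2] + b[i+2];
            c[i+3] = a[i+3] + b[i+3];
        }
        for (; i < n; i++)               /* scalar cleanup loop */
            c[i] = a[i] + b[i];
    }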
The majority of recent work in vector architectures has focused on adopting techniques from superscalar designs (decoupling [9], out-of-order execution [11], simultaneous multithreading [10]) in order to accelerate scientific applications. In [8], Espasa showed that Tarantula, an 8-lane vector extension to the EV8 Alpha processor, achieves a 5x speedup over a dual-processor EV8 chip of similar complexity and power consumption. Nevertheless, a few academic groups have studied vector designs with multimedia. T0, an SRAM-based vector processor, was the precursor to the VIRAM processor [3]. For speech recognition software, it was 10 times faster than its contemporary superscalar designs. In [22], Stoodley and Lee showed that simple vector processors outperform superscalar designs with SIMD extensions by up to a factor of 3x for a variety of media tasks. The two main problems with the previous studies of vector processors have been the performance of the vector memory system and automatic vectorization. This paper addresses the former with the embedded DRAM main memory system in the VIRAM chip, and the latter with the Cray-based vectorizing compiler for the VIRAM architecture.

The Imagine stream processor introduces an alternative architecture for multimedia processing [18]. It is based on a set of clusters that include a datapath and a local register file. It implements computation by streaming data through the clusters under microcoded control. Imagine exposes its computation and communication resources to software, which makes it possible to exploit data-level parallelism in the most flexible way. The price for the flexibility is a complicated programming model based on a special dialect of C. In contrast, the VIRAM programming model is based on standard C or C++ with automatic vectorization.

7 Conclusions

Multimedia processing on embedded systems is an emerging computing area with significantly different needs from the desktop domain. It requires high performance for data-parallel tasks, low power consumption, reduced design complexity, and compact code size. Despite their theoretical generality, superscalar processors cannot exploit the data-level parallelism in order to improve performance or reduce power dissipation and design complexity.

In this paper, we have demonstrated the efficiency of the VIRAM vector architecture for the EEMBC embedded benchmarks. The VIRAM code is up to 10 times smaller than VLIW code and comparable to the x86 CISC code. A simple, modular implementation of the VIRAM architecture is 2 times faster than a 4-way superscalar processor with a 5 times faster clock frequency and a 10 times higher power consumption. Despite its lack of SRAM caches and the fact that the EEMBC measurement methodology favors cache-based designs, the VIRAM chip outperforms VLIW processors by a factor of 10. Even after manual optimization of the VLIW code and insertion of SIMD and DSP instructions, the single-issue VIRAM processor is 60% faster than 5-way to 8-way VLIW designs.

The overall effectiveness of VIRAM suggests that vector architectures can play a leading role in embedded multimedia systems.

8 Acknowledgments

We would like to acknowledge all the members of the IRAM research group at U.C. Berkeley, and in particular Sam Williams, Joe Gebis, Kathy Yelick, David Judd, and David Martin. This work was supported by DARPA (DABT63-96-C-0056), the California State MICRO Program, and an IBM Ph.D. fellowship. IBM, MIPS Technologies, Cray, and Avanti have made significant hardware and software contributions to the IRAM project.

References

[1] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger. Clock Rate vs IPC: The End of the Road for Conventional Microarchitectures. In the Proceedings of the 27th Intl. Symposium on Computer Architecture, pages 248-259, Vancouver, Canada, June 2000.
[2] A. Allan, D. Edenfeld, W. Joyner, A. Kahng, M. Rodgers, and Y. Zorian. 2001 Technology Roadmap for Semiconductors. IEEE Computer, 35(1):42-53, Jan. 2002.
[3] K. Asanović, J. Beck, B. Irissou, B. Kingsbury, and J. Wawrzynek. T0: A Single-Chip Vector Microprocessor with Reconfigurable Pipelines. In the Proceedings of the 22nd European Solid-State Circuits Conference, Sept. 1996.
[4] G. Bell and J. Gray. What's Next in High Performance Computing? Communications of the ACM, 45(2):91-95, Feb. 2002.
[5] W. Dally. Tomorrow's Computing Engines. Keynote Speech, the 4th Intl. Symposium on High-Performance Computer Architecture, Las Vegas, NV, Feb. 1998.
[6] K. Diefendorff and P. Dubey. How Multimedia Workloads Will Change Processor Design. IEEE Computer, 30(9):43-45, Sept. 1997.
[7] EEMBC Benchmark Scores. http://www.eembc.org.
[8] R. Espasa, J. Emer, et al. Tarantula: a Vector Extension to the Alpha Architecture. In the Proceedings of the 29th Intl. Symposium on Computer Architecture, Anchorage, AK, May 2002.
[9] R. Espasa and M. Valero. Decoupled Vector Architecture. In the Proceedings of the 2nd Intl. Symposium on High-Performance Computer Architecture, pages 281-290, San Jose, CA, Feb. 1996.
[10] R. Espasa and M. Valero. Simultaneous Multithreaded Vector Architecture. In the Proceedings of the 4th Intl. Conference on High-Performance Computing, pages 350-357, Bangalore, India, Dec. 1997.
[11] R. Espasa, M. Valero, and J. Smith. Out-of-order Vector Architectures. In the Proceedings of the 30th Intl. Symposium on Microarchitecture, pages 160-170, Research Triangle Park, NC, Dec. 1997.
[12] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, 89(4):490-504, Apr. 2001.
[13] C. Kozyrakis. Scalable Vector Media-processors for Embedded Systems. PhD thesis, Computer Science Division, University of California at Berkeley, 2002.
[14] C. Kozyrakis, J. Gebis, et al. VIRAM: A Media-oriented Vector Processor with Embedded DRAM. In the Conference Record of the Hot Chips XII Symposium, Palo Alto, CA, Aug. 2000.
[15] C. Kozyrakis, D. Judd, et al. Hardware/Compiler Codevelopment for an Embedded Media Processor. Proceedings of the IEEE, 89(11):1694-1709, Nov. 2001.
[16] M. Levy. EEMBC 1.0 Scores, Part 1: Observations. Microprocessor Report, pages 1-7, Aug. 2000.
[17] M. Phillip. A Second Generation SIMD Microprocessor Architecture. In the Conference Record of the Hot Chips X Symposium, Palo Alto, CA, Aug. 1998.
[18] S. Rixner, W. Dally, et al. A Bandwidth-Efficient Architecture for Media Processing. In the Proceedings of the 31st Intl. Symposium on Microarchitecture, pages 3-13, Dallas, TX, Nov. 1998.
[19] G. Slavenburg, S. Rathnam, and H. Dijkstra. The Trimedia TM-1 PCI VLIW Media Processor. In the Conference Record of the Hot Chips VIII Symposium, Palo Alto, CA, Aug. 1996.
[20] J. Smith. The Best Way to Achieve Vector-Like Performance? Keynote Speech, the 21st Intl. Symposium on Computer Architecture, Chicago, IL, April 1994.
[21] J. Smith, G. Faanes, and R. Sugumar. Vector Instruction Set Support for Conditional Operations. In the Proceedings of the 27th Intl. Symposium on Computer Architecture, pages 260-269, Vancouver, BC, Canada, June 2000.
[22] M. Stoodley and C. Lee. Vector Microprocessors for Desktop Computing. In the Proceedings of the 32nd Intl. Symposium on Microarchitecture, Haifa, Israel, Nov. 1999.
[23] A. Strey and M. Bange. Performance Analysis of Intel's MMX and SSE. In the Proceedings of the 7th EuroPAR Conference, pages 142-147, Manchester, UK, Aug. 2001.
[24] C. Truong. The VelociTI Architecture of the TMS320C6x. In the Conference Record of the Hot Chips IX Symposium, Palo Alto, CA, Aug. 1997.
[25] D. Wall. Limits of Instruction-level Parallelism. In the Proceedings of the 4th Intl. Conference on Architectural Support for Programming Languages and Operating Systems, pages 176-188, Santa Clara, CA, April 1991.
