Dynamic Vectorization in the E2 Dynamic Multicore Architecture

To appear in the proceedings of HEART 2010

Andrew Putnam, Aaron Smith, Doug Burger
Microsoft Research
[email protected] [email protected] [email protected]

ABSTRACT

Previous research has shown that Explicit Data Graph Execution (EDGE) instruction set architectures (ISAs) allow for power efficient performance scaling. In this paper we describe the preliminary design of a new dynamic multicore processor called E2 that utilizes an EDGE ISA to allow for the dynamic composition of physical cores into logical processors. We provide details of E2's support for dynamic reconfigurability and show how the EDGE ISA facilitates out-of-order vector execution.

Categories and Subject Descriptors

C.1.2 [Computer Systems Organization]: Multiple Data Stream Architectures—single-instruction-stream, multiple-data-stream processors (SIMD), array and vector processors; C.1.3 [Computer Systems Organization]: Other Architecture Styles—adaptable architectures, data-flow architectures

General Terms

Design, Performance

Keywords

Explicit Data Graph Execution (EDGE)

1. INTRODUCTION

Chip designers have long relied on dynamic voltage and frequency scaling (DVFS) to trade off power for performance. However, voltage scaling no longer works as processors approach the minimum threshold voltage (Vmin): frequency scaling at Vmin reduces both power and performance linearly, achieving no reduction in energy. Power and performance trade-offs are thus left to either the microarchitecture or the system software.

When designing an architecture with little (if any) DVFS, designers must choose how to spend the silicon resources. Hill and Marty [6] described four ways that designers could use these resources: (1) many small, low performance, power efficient cores; (2) a few large, power inefficient, high performance cores; (3) a heterogeneous mix of both small and large cores; and (4) a dynamic architecture capable of combining or splitting cores to adapt to a given workload. Of these alternatives, the highest performance and most energy efficient design is the dynamic architecture. Hill and Marty characterized what such a dynamic processor could do, but did not describe the details of such an architecture.

TFlex [9] is one proposed architecture that demonstrated a large dynamic range of power and performance by combining power efficient, lightweight cores into larger, more powerful cores through the use of an Explicit Data Graph Execution (EDGE) instruction set architecture (ISA). TFlex is dynamically configurable to provide the same performance and energy efficiency as a small embedded processor, or to provide the higher performance of an out-of-order superscalar on single-threaded applications.

Motivated by these promising results, we are currently designing a new dynamic architecture called E2 that utilizes an EDGE ISA to achieve high performance power efficiently [3]. The EDGE model divides a program into blocks of instructions that execute atomically. Blocks consist of a sequence of dataflow instructions that explicitly encode the relationships between producer and consumer instructions, rather than communicating through registers as done in a conventional ISA. These explicit encodings are used to route operands to private reservation stations (called operand buffers) for each instruction. Registers and memory are used only for handling less-frequent inter-block communication.

Prior dynamic architectures [7, 9] have demonstrated the ability to take advantage of task- and thread-level parallelism, but handling data-level parallelism requires dividing data into independent sets and using thread-level parallelism. In this paper we focus on efficiently exploiting data-level parallelism, even without threading, and present our preliminary vector unit design for E2. Unlike previous in-order vector machines, E2 allows for out-of-order execution of both vectors and scalars.
The E2 instruction set and execution model support three new capabilities that enable efficient vectorization across a broad range of codes. First, by slicing the statically programmed issue window into vector lanes, highly concurrent, out-of-order issue of mixed scalar and vector operations can be achieved with lower energy overhead than scalar mode. Second, the statically allocated reservation stations permit the issue window to be treated as a vector register file, with wide fetches to memory and limited copying between a vector load and the vector operations. Third, the atomic block-based model in E2 permits refreshing of vector (and scalar) instruction blocks mapped to reservation stations, enabling repeated vector operations to issue with no fetch or decode energy overhead after the first loop iteration. Taken together, these optimizations will reduce the energy associated with finding and executing many sizes of vectors across a wide range of codes.

2. THE E2 ARCHITECTURE

E2 is a tiled architecture that consists of low power, high performance, decentralized processing cores connected by an on-chip network. This design provides E2 with the benefits of other tiled architectures, namely simplicity, scalability, and fault tolerance. Figure 1 shows the basic architecture of an E2 processor containing 32 cores, and a block diagram of the internal structure of one physical core.

A core contains N lanes (in this paper we choose four), with each lane consisting of one bank of the instruction window, operand buffers, a 64-bit ALU, and one bank of the register file. The ALUs support both integer and floating point operations, as well as fine-grained SIMD execution (eight 8-bit, four 16-bit, or two 32-bit integer operations per cycle, or two single-precision floating point calculations per cycle). This innovation of breaking the window into lanes allows for high vector throughput with little additional hardware complexity.

E2's EDGE ISA restricts blocks in several ways to simplify the hardware that maps blocks to the execution substrate and detects when blocks are finished executing. Blocks are variable-size: they may contain between 4 and 128 instructions and execute at most 32 loads and stores. The hardware relies on the compiler to break programs into blocks of dataflow instructions and to assign load and store identifiers to enforce sequential memory semantics [12]. To improve performance, the compiler uses predication to form large blocks filled with useful instructions. To simplify commit, the architecture relies on the compiler to ensure that a single branch is produced from every block, and that the block encodes the set of register writes and store identifiers used.
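These block restrictions are mechanical enough to check directly. The following C sketch (ours, not from the E2 toolchain; the struct and field names are illustrative) captures the invariants the compiler must guarantee for every block:

    #include <stdbool.h>

    /* Illustrative per-block summary; names are hypothetical. */
    struct block_info {
        int num_instructions;  /* blocks contain 4 to 128 instructions */
        int num_loads_stores;  /* at most 32 load-store identifiers */
        int num_branches;      /* exactly one branch produced per block */
    };

    /* Check the EDGE ISA block restrictions described above. */
    static bool block_is_legal(const struct block_info *b)
    {
        return b->num_instructions >= 4 && b->num_instructions <= 128
            && b->num_loads_stores <= 32
            && b->num_branches == 1;
    }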
E2 cores operate in two execution modes: scalar mode and vector mode. In scalar mode, any instruction can send operands to any other instruction in the block, but two of the ALUs are turned off to conserve power. In vector mode, all N ALUs are turned on, but instructions can send operands only to instructions in the same vector lane. The mode is determined on a per-block basis by a bit in the block header. This allows each core to quickly adapt to different application phases on a block-by-block basis.

[Figure 1: E2 microarchitecture block diagram. In vector mode, each core is composed of four independent vector lanes, each with a 32-instruction window, two 64-bit operand buffers, 16 registers, and an ALU. In scalar mode, the ALUs in lanes 3 and 4 are powered down, and the instruction windows, operand buffers, and registers are made available to the other two lanes.]

2.1 Composing Cores

One key characteristic that distinguishes E2 from other architectures is the ability to dynamically adapt the architecture to a given workload. Rather than fixing the size and number of cores at design time, E2 allows one or more physical cores to be merged at runtime to form larger, more powerful logical processors. For example, composing every physical core into one large logical processor performs like an aggressive superscalar on the serial portions of a workload. Or, when ample thread-level parallelism is available, the same processor can be split so that each physical core works independently and executes instruction blocks from independent threads. Merging cores together is called composing cores, while splitting cores is called decomposing cores.

Logical cores interleave accesses to registers and memory among the physical cores to give the logical core the combined computational resources of all the composed physical cores. For example, a logical core composed of two physical cores uses an additional bit of the physical address to choose between the two caches, effectively doubling the cache capacity. The register files are similarly interleaved, but since only 64 registers are supported by the ISA, the additional register file capacity is power gated to reduce power consumption.
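To make the interleaving concrete, the sketch below shows one plausible address-to-core mapping; this is our assumption for illustration, as the paper specifies only that an additional physical-address bit (or, for full composition, a simple hash) selects among the composed caches:

    #include <stdint.h>

    #define LINE_BITS 6  /* assumed 64-byte cache lines */

    /* For a logical core built from 2^k physical cores, select the
       physical core whose L1 slice owns this address. With k = 1,
       this is the single additional address bit described above. */
    static unsigned owner_core(uint64_t paddr, unsigned k)
    {
        return (unsigned)((paddr >> LINE_BITS) & ((1u << k) - 1));
    }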

"7$>9(9$89+:; $ "9&;$5 $ "9&;$1 $ "9&;$2 $ "9&;$7 12$<= decomposing $ logical $ "-9/PQ(-);$ R*;*; $ $ oe.For cores. compos- cores. cessor. When composed, the architecture uses additional Component Parameters Area (mm2) % Area cores to execute speculative instruction blocks. When the Instruction Window 32x54b 0.08 2% non-speculative block commits, it sends the commit signal Branch Predictor 0.12 3% along with the exit branch address to all other cores in the Operand Buffers 32x64b 0.19 5% logical processor. Speculative blocks on the correct path ALUs 4 SIMD, Int+FP 0.77 20% continue to execute, while blocks on non-taken paths are Register File 64 x 64b 0.08 2% squashed. Details of this process are discussed further in Load-Store Queue 0.19 5% section 2.2.1. L1 I-Cache 32kB 1.08 28% Core composition is done only when the overhead of chang- L1 D-Cache 32kB 1.08 28% ing configurations is outweighed by the performance gains Control 0.19 5% of a more efficient configuration. Composition is always Core 3.87 100% done at block boundaries and is initiated by the runtime sys- L2 Cache 4MB 100 tem. To increase the number of scenarios in which composi- tion is beneficial, E2 provides two different ways to compose Table 1: E2 core components, design parameters, and cores, each offering a different trade-off in overhead and ef- area. ficiency. Full composition changes the number of physical cores in a logical core, and changes the register file and cache The oldest instruction block then becomes the non-speculative mappings. Dirty cache lines are written out to main mem- block. Any blocks that were not correctly speculated are ory lazily. Logical registers and cache locations are dis- squashed. This taken branch signal differs from the commit tributed evenly throughout the physical cores. Cache lines signal. The taken branch allows the next block to continue are mapped via a simple hash function, leading to a larger speculation and begin fetching new instruction blocks. How- logical cache that is the sum of the cache capacities of all ever, register and memory values are not valid until after the physical cores. commit signal. Quick composition adds additional cores to a logical pro- 2.2.2 Speculation Within a Block cessor, but retains the same L1 data cache and register map- There are three types of speculation within an instruction pings, and does not write dirty cache lines out to main mem- block. Predicate speculation uses the combined predicate- ory. This leaves the logical processor with a smaller data branch predictor to predict the value of predicates. Mem- cache than possible with full composition, but ensures that ory speculation occurs in speculative blocks when the spec- accesses to data already in the cache will still hit after com- ulative block loads values from the L1 cache that may be posing. Quick composition is the most useful for short-lived changed by less-speculative blocks. Load speculation occurs bursts of activity where additional execution units are use- when the load-store queue (LSQ) allows loads to execute be- ful, but where the overhead of reconfiguring the caches is fore stores with lower load-store identifiers have executed. greater than the savings from a larger, more efficient cache In all three cases, mis-speculation requires re-execution of configuration. the entire instruction block. 
2.2 Speculation

It has long been recognized that speculation is an essential piece of achieving good performance on serial workloads. E2 makes aggressive use of speculation to improve performance. A combined predicate-branch predictor [5] speculates at two levels. First, it predicts the branch exit address of each block for speculation across blocks. Second, it predicts the control flow path within blocks by predicting the predicate values.

2.2.1 Speculation Across Blocks

Predicting the branch exit address allows instruction blocks to be fetched and begin executing before the current block has completed. The oldest instruction block is marked as non-speculative and predicts a branch exit address. This address is fetched and begins executing on another physical core in the logical processor, or on the same physical core if there is available space in the instruction window.

The taken branch address often resolves before the block completes. In this case, the non-speculative block notifies the other cores in the logical processor of the taken address. The oldest remaining instruction block then becomes the non-speculative block, and any blocks that were not correctly speculated are squashed. This taken branch signal differs from the commit signal: the taken branch allows the next block to continue speculation and begin fetching new instruction blocks, but register and memory values are not valid until after the commit signal.

2.2.2 Speculation Within a Block

There are three types of speculation within an instruction block. Predicate speculation uses the combined predicate-branch predictor to predict the value of predicates. Memory speculation occurs when a speculative block loads values from the L1 cache that may be changed by less-speculative blocks. Load speculation occurs when the load-store queue (LSQ) allows loads to execute before stores with lower load-store identifiers have executed. In all three cases, mis-speculation requires re-execution of the entire instruction block. Re-execution is relatively lightweight, requiring only that the valid bits in all of the operand buffers be invalidated and the zero-operand instructions be re-loaded.

2.3 Area and Frequency

We developed an area model for the E2 processor using ChipEstimate InCyte [4] and an industry-average 65nm process technology library. The design parameters and component areas are shown in Table 1. Each E2 core requires 3.87 mm2, including L1 caches.

    Component           Parameters       Area (mm2)   % Area
    Instruction Window  32 x 54b         0.08           2%
    Branch Predictor                     0.12           3%
    Operand Buffers     32 x 64b         0.19           5%
    ALUs                4 SIMD, Int+FP   0.77          20%
    Register File       64 x 64b         0.08           2%
    Load-Store Queue                     0.19           5%
    L1 I-Cache          32kB             1.08          28%
    L1 D-Cache          32kB             1.08          28%
    Control                              0.19           5%
    Core                                 3.87         100%
    L2 Cache            4MB              100

Table 1: E2 core components, design parameters, and area.

Frequency estimates are not available using our version of InCyte. However, the microarchitecture does not have any large, global structures, and it uses distributed control throughout the chip. Because of this, we expect that E2 will achieve a frequency comparable to standard cell ARM multi-core processor designs, which range from 600 to 1000 MHz in 65nm [2].

3. EXECUTION PIPELINE

E2's execution is broken into three primary stages: instruction fetch, execute, and commit. This section first describes the behavior of each stage when operating in scalar mode, then describes the differences between scalar and vector mode.

3.1 Fetch

One of the primary differences between E2 and conventional architectures is that E2 fetches many instructions at once, rather than continually fetching single instructions. Blocks of up to 128 instructions are fetched from the L1 instruction cache at a time and loaded into the instruction window. Instructions remain resident in the window until block commit (or possibly longer, as described in section 3.3.1). Physical cores support one 128-instruction block, two 64-instruction blocks, or four 32-instruction blocks in the window at the same time.
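The window can therefore be viewed as 128 entries carved into one, two, or four block slots. A small sketch of that bookkeeping (our illustration; the slot-allocation logic is not detailed in the paper):

    /* Blocks resident at once for a given block size: one 128-instruction
       block, two 64-instruction blocks, or four 32-instruction blocks.
       Smaller blocks would presumably occupy the next larger slot. */
    static int resident_blocks(int block_size)
    {
        if (block_size <= 32)  return 4;
        if (block_size <= 64)  return 2;
        if (block_size <= 128) return 1;
        return 0;  /* illegal: blocks never exceed 128 instructions */
    }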
Instruction blocks begin with a 128-bit block header containing: the number of instructions in the block, flags for special block behavior, and bit vectors that encode the global registers written by the block and the store identifiers used. Instructions are 32 bits wide and generally contain at least four fields:

• Opcode [9 bits]: The instruction to execute, along with the number of input operands to receive.
• Predicate [2 bits]: Indicates whether the instruction must wait on a predicate bit, and whether to execute if that bit is true or false.
• Target 1 [9 bits]: The identifier of the consumer of the instruction's result. If the consumer is a register, this field is the register number. If the consumer is another instruction, this field contains the consumer's instruction number (used as an index into the operand buffer) and whether the result is used as operand 0, operand 1, or as a predicate.
• Target 2 / Immediate [9 bits]: Either a second instruction target, or a constant value for immediate instructions.
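These four fields account for 29 of the 32 bits (9 + 2 + 9 + 9); the paper does not say how the remaining three bits are used. A C bitfield sketch of one possible packing (the layout order and the reserved field are our assumptions):

    #include <stdint.h>

    /* One possible packing of a 32-bit E2 instruction word. */
    struct e2_insn {
        uint32_t opcode    : 9;  /* operation + number of input operands */
        uint32_t predicate : 2;  /* predicated? execute on true or false? */
        uint32_t target1   : 9;  /* register number, or consumer instruction
                                    number plus operand/predicate slot */
        uint32_t target2   : 9;  /* second target, or immediate constant */
        uint32_t reserved  : 3;  /* unspecified in the paper */
    };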
The instruction window is divided into four equal banks, with each bank loading two instructions per cycle. Instructions that do not require input operands, such as constant generation instructions, are scheduled to execute immediately by pushing the instruction number onto the ready queue.

3.2 Execute

Execution starts by reading ready instruction numbers from the ready queues. The operands, the opcode, and the instruction target fields are forwarded to the ALUs, the register file (for read instructions), or the load-store queue (for loads and stores). The target field is used to route the results (if any) back to the appropriate operand buffer (or to the register file, in the case of writes).

When results are forwarded back to the operand buffers, the targeted instruction is checked to see which inputs are required and which operands have already arrived. If all operands for the instruction have arrived, the instruction number is added to the ready queue. Execution continues in this data-driven manner until the block is complete.

Like other EDGE and dataflow architectures, special handling is required for loads and stores to ensure that memory operations follow the program order semantics of imperative language programs. E2 uses the approach described in [10], where the compiler encodes each memory operation with a sequence identifier to denote program order, which the microarchitecture uses to enforce sequential memory semantics.

Not all instructions in a block necessarily execute because of predication, so the microarchitecture must detect block completion. Blocks are considered complete when (1) one (and only one) branch has executed, and (2) all instructions that modify external state (register writes and stores) have executed. The compiler encodes the register writes and store identifiers in the instruction block header so that the microarchitecture can identify when criterion (2) is satisfied.

3.3 Commit

During execution, instructions do not modify the architectural state. Instead, all changes are buffered and commit together at block completion. Once the core enters the commit phase, the register file is updated with all register writes, and all stores in the load-store queue are sent to the L1 cache, beginning with the lowest sequence identifier. Once all register writes and stores have committed, the core sends a commit signal to all other cores in the same logical processor.

3.3.1 Refresh

One important commit optimization, called refresh, occurs when the instruction block branches back to itself. Rather than loading the instructions again from the L1 instruction cache, the instructions are left in place and only the valid bits in the operand buffers and load-store queues are cleared. This allows the instruction fetch phase to be bypassed entirely. Instructions that generate constants can also pin values in the operand buffers so that they remain valid after refresh and are not regenerated each time the instruction block executes.
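A sketch of the refresh path (our naming; the paper describes the mechanism only at the level above): on a block-to-self branch, the instructions stay resident and only validity state is cleared, skipping entries pinned by constant-generating instructions.

    #include <stdbool.h>

    #define WINDOW_SIZE 128

    struct operand_state {
        bool valid[WINDOW_SIZE];   /* operand has arrived */
        bool pinned[WINDOW_SIZE];  /* constant pinned across refresh */
    };

    /* Refresh: reuse the resident block instead of re-fetching it from
       the L1 instruction cache. Pinned constants stay valid, so constant
       generators do not re-execute on every loop iteration. */
    static void refresh_block(struct operand_state *ob)
    {
        for (int i = 0; i < WINDOW_SIZE; i++)
            if (!ob->pinned[i])
                ob->valid[i] = false;
    }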
3.4 Vector Mode

Vector mode execution divides each processor core into N (in this paper, four) independent vector lanes. When operating in vector mode, instructions can only target other instructions in the same vector lane, eliminating the need for a full crossbar between operand buffers and ALUs. Each lane consists of a 32-entry instruction window, two 64-bit operand buffers, 16 registers, and one ALU.

E2 supports vector operations on 64-bit, 128-bit (padded to 256 bits), and 256-bit wide vectors. Each ALU supports eight 8-bit, four 16-bit, or two 32-bit vector operations. Four ALUs enable E2 to support up to 32 vector operations per cycle per core. 64-bit vector operations utilize a single ALU, whereas 128- and 256-bit vector operations utilize all four ALUs. Table 2 lists the number of parallel vector operations supported for each vector length and data element size.

    Element Size   Minimum (1 ALU)   Maximum (4 ALUs)
    8-bit          8                 32
    16-bit         4                 16
    32-bit         2 (1 single fp)   8 (4 single fp)
    64-bit         1                 4

Table 2: Supported vector operations.

Instruction blocks containing vector instructions are limited to 32 instructions, which is the size of the instruction window for each vector lane. Vector instructions issuing in lane 1 are automatically issued in the other three lanes, and scalar instructions are always assigned to lane 1.

In vector mode, the sixty-four 64-bit physical registers (R0-R63) are aliased to form sixteen 256-bit vector registers (V0-V15). We divide the physical register file into four banks to support single cycle access for vectors.

3.4.1 Memory Access in Vector Mode

E2 cores operate on vectors in 256-bit chunks, which enables efficient exploitation of data-level parallelism on small and medium-length vectors. Operating on larger vectors is done using multiple instruction blocks in a loop, using the efficient refresh mode to bypass instruction fetch and the generation of constants (section 3.3).

Splitting larger vectors among multiple instruction blocks could introduce a delay between loads of adjacent chunks of the same vector, as those loads are split among multiple instruction blocks. To mitigate this delay, E2 employs a specialized unit called the memory interface controller (MIC). The MIC takes over control of the L1 data cache, changing part of the cache into a prefetching stream buffer [8, 11]. Stream buffers predict the address of the next vector load and bring that data into the cache early. This ensures that the vector loads in subsequent instruction blocks always hit in the L1 cache.

Since vector and scalar operations are mixed in instruction blocks, part of the cache still needs to operate as a traditional cache. Rather than halve the size of the cache, the set associativity of the cache is cut in half, converting those ways into the memory for a stream buffer. On vector loads, the cache checks the stream buffer. On scalar loads and stores, the cache is checked in the same manner as before, albeit with a reduced number of ways to check.

Vector store instructions are buffered in the stream buffer until block commit, at which point they are written directly out to main memory.
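The paper does not detail the MIC's prediction mechanism; the sketch below shows a minimal sequential stream buffer in the spirit of [8, 11], with all names and the single-entry simplification our own:

    #include <stdint.h>
    #include <string.h>

    #define VEC_BYTES 32  /* one 256-bit vector */

    struct stream_buffer {
        uint64_t next_addr;        /* predicted address of the next vector load */
        uint8_t  data[VEC_BYTES];  /* prefetched data (one entry for brevity) */
        int      valid;
    };

    /* Serve a vector load, then prefetch the next sequential vector so
       that the loads in subsequent instruction blocks hit in the L1. */
    static void vector_load(struct stream_buffer *sb, uint64_t addr,
                            uint8_t out[VEC_BYTES],
                            void (*fetch)(uint64_t addr, uint8_t *dst))
    {
        if (!sb->valid || sb->next_addr != addr)
            fetch(addr, sb->data);          /* mispredicted: fetch on demand */
        memcpy(out, sb->data, VEC_BYTES);
        sb->next_addr = addr + VEC_BYTES;   /* predict a sequential stream */
        fetch(sb->next_addr, sb->data);     /* prefetch the next vector */
        sb->valid = 1;
    }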

4. EXAMPLE: RGB TO Y CONVERSION

In this section we give an example to explain how a program is vectorized on E2. Figure 2 shows the C code and corresponding vectorized assembly for an RGB to Y brightness conversion, which is commonly used to convert color images to grayscale. Each pixel in an image has a triple corresponding to the red, green, and blue color components. Brightness (Y) is computed by multiplying each RGB value by a constant and summing the three results. This program can be trivially parallelized to perform multiple conversions in parallel, since each conversion is independent.

4.1 C Source

Each RGB component is represented by a vector, and pointers to these vectors, along with a pointer to the preallocated result vector Y and the number of vectors to convert, are passed to the function as arguments (lines 4-5). The constants for the conversion are also stored in vectors (lines 7-12). Each vector is 256 bits wide and the individual data elements are padded to 64 bits, since their type is a 32-bit single precision float. The conversion is done using a simple for loop (lines 14-16). To simplify the example we do not unroll the loop to fill the block.

     1  // numVectors > 0
     2  // y = r * .299 + g * .587 + b * .114;
     3  void rgb2y(int numVectors,
     4             __vector float *r, __vector float *g,
     5             __vector float *b, __vector float *y)
     6  {
     7      __vector float yr = { 0.299f, 0.299f,
     8                            0.299f, 0.299f };
     9      __vector float yg = { 0.587f, 0.587f,
    10                            0.587f, 0.587f };
    11      __vector float yb = { 0.114f, 0.114f,
    12                            0.114f, 0.114f };
    13
    14      for (int i = 0; i < numVectors; i++)
    15          y[i] = r[i] * yr + g[i] * yg + b[i] * yb;
    16  }
    17
    18  _rgb2y:
    19      read  t30, r3         // numVectors
    20      read  t20, r4         // address of next r
    21      read  t21, r5         // address of next g
    22      read  t22, r6         // address of next b
    23      read  t32, r7         // address of y
    24      read  t31, r8         // i
    25      read  t1,  v0         // vector yr
    26      read  t3,  v1         // vector yg
    27      read  t5,  v2         // vector yb
    28
    29      // RGB to Y conversion
    30      vl    t0, t20 [0]     // vector load
    31      vl    t2, t21 [1]
    32      vl    t4, t22 [2]
    33      vfmul t6, t0, t1      // vector fp multiply
    34      vfmul t7, t2, t3
    35      vfmul t8, t4, t5
    36      vfadd t9, t6, t7      // vector fp add
    37      vfadd t10, t8, t9
    38
    39      // store result in Y
    40      multi t40, t31, #32
    41      add   t41, t32, t40
    42      vs    0(t41), t10 [3] // vector store
    43
    44      // loop test
    45      tlt   t14, t31, t30
    46      ret_t
    47      br_f  _rgb2y
    48      addi  r8, t31, #1
    49      addi  r4, t20, #32
    50      addi  r5, t21, #32
    51      addi  r6, t22, #32

Figure 2: C source code and E2 assembly listing for a vectorized RGB to Y brightness conversion.

4.2 Assembly

The assembly listing is given in lines 18-51. Instructions are grouped into blocks by the compiler (one block for this example) and a new block begins at every label (line 18). The architecture fetches, executes, and commits blocks atomically. By convention we use Rn to denote scalar registers, Vn to denote vector registers, and Tn to denote temporary operands. The scalar and vector registers are part of the global state visible to every block. The temporary operands, however, are only visible within the block in which they are defined. The only instructions that can read from the global register file are register READ instructions (lines 19-27). However, most instructions can write to the global register file.

Vector instructions begin with 'v' (lines 30-37 and 42). All load and store instructions are assigned load-store identifiers to ensure sequential memory semantics (lines 30-32 and 42). That is, a load assigned ID 0 must complete before a store assigned ID 1.

Most instructions can be predicated, and predicates are only visible within the block in which they are defined. Predicated instructions take an operand representing true or false that is compared against the polarity encoded into the predicated instruction (denoted by _t and _f). The test instruction in line 45 creates a predicate that the receiving instructions (lines 46-47) compare against their own encoded predicate. Only instructions with matching predicates execute.

Blocks are limited to a maximum of 128 scalar instructions. When using vector instructions, blocks are limited to a total of 32 scalar and vector instructions. Block _rgb2y contains a mix of 27 scalar and vector instructions.

4.3 Instruction Schedule

Figure 3 gives one possible schedule for the example in Figure 2. We assume a three cycle 32-bit floating point multiply, and that all loads hit in the L1 cache and require three cycles. The architecture is capable of fetching eight instructions per cycle and thus requires four cycles to fetch the 27-instruction block. In cycle one, eight register read instructions are fetched, all of which are available for execution in the following cycle since they have no dependencies. Two register reads can execute per cycle, requiring five cycles to read all the global registers. In cycle two, registers R4 (line 20) and R8 (line 24) are read and the resulting data is sent to the vector load (line 30), multiply immediate (line 40), and add immediate (line 48) instructions. Since each of these instructions is waiting on a single operand, they are now all ready and begin executing in cycle three. Execution continues in this dataflow fashion until the block is ready to commit in cycle 17.

[Figure 3: One possible schedule for Figure 2. Rows show per-cycle pipeline activity over cycles 1-17: FETCH (four IF cycles), two READ rows (R), a MEM row (three loads L and one store S), and four EX rows mixing multiplies (M), adds (A), the test (T), and the branch (B).]

5. CONCLUSION

In this paper we described the E2 architecture, a new dynamic multicore utilizing an Explicit Data Graph Execution (EDGE) ISA, designed to achieve high performance power efficiently. As an EDGE architecture, E2 efficiently exploits instruction-level parallelism through dataflow execution and aggressive speculation. In addition, we have described how the architecture adapts to handle data-level parallelism through vector and SIMD support. This vector support can be interspersed with scalar instructions, making E2 more flexible than traditional vector processors, and more capable than traditional scalar architectures.

We have developed an architectural simulator for E2 using SystemC and a new compiler backend with the Microsoft Phoenix software optimization and analysis framework [1]. We are currently developing a cycle accurate FPGA implementation that, when combined with our industrial strength compiler, will allow us to perform a detailed exploration and evaluation of the architecture.

Many challenges lie ahead. To be compelling as an accelerator, we must demonstrate that E2 provides better performance, power efficiency, and programmability than specialized accelerators such as GPUs and dedicated vector processors. E2 may also excel as a general-purpose processor, in which case we must show that it provides compelling enough power/performance gains over current static multicore architectures to justify a transition to a new ISA.

E2's performance and power efficiency are built on its ability to compose and decompose cores dynamically, so the correct policies and mechanisms for managing dynamic composition will require careful consideration. Ideally we would like to leave all decisions about composition to the runtime system, freeing the programmer completely from reasoning about the underlying hardware.

Finally, there is a wide variety of application domains where E2's ability to trade off power and performance could be useful, ranging from embedded devices to the data center. In the months ahead we plan to examine a diverse set of workloads and investigate how broadly an E2 processor can span the power-performance spectrum.

6. REFERENCES

[1] Microsoft Phoenix. http://research.microsoft.com/phoenix/.
[2] ARM. Cortex-A9 MPCore technical reference manual, November 2009.
[3] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and the TRIPS Team. Scaling to the end of silicon with EDGE architectures. IEEE Computer, 37(7):44–55, July 2004.
[4] Cadence. Cadence InCyte Chip Estimator, September 2009.
[5] H. Esmaeilzadeh and D. Burger. Hierarchical control prediction: Support for aggressive predication. In Proceedings of the 2009 Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures, 2009.
[6] M. D. Hill and M. R. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
[7] E. İpek, M. Kırman, N. Kırman, and J. F. Martínez. Core Fusion: Accommodating software diversity in chip multiprocessors. In International Symposium on Computer Architecture (ISCA), San Diego, CA, June 2007.
[8] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Computer Architecture News, 18(3a), 1990.
[9] C. Kim, S. Sethumadhavan, D. Gulati, D. Burger, M. Govindan, N. Ranganathan, and S. Keckler. Composable lightweight processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007.
[10] S. Sethumadhavan, F. Roesner, J. S. Emer, D. Burger, and S. W. Keckler. Late-binding: Enabling unordered load-store queues. In Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 347–357, New York, NY, USA, 2007. ACM.
[11] T. Sherwood, S. Sair, and B. Calder. Predictor-directed stream buffers. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2000.
[12] A. Smith. Explicit Data Graph Compilation. PhD thesis, The University of Texas at Austin, 2009.