CG-Ooo Energy-Efficient Coarse-Grain Out-Of-Order Execution

CG-OoO Energy-Efficient Coarse-Grain Out-of-Order Execution Milad Mohammadi⋆, Tor M. Aamodt†, William J. Dally⋆‡ ⋆Stanford University, †University of British Columbia, ‡NVIDIA Research [email protected], [email protected], [email protected] ABSTRACT CG-OoO model to make it even more energy efficient. We introduce the Coarse-Grain Out-of-Order (CG- Despite the significant achievements in improving en- OoO) general purpose processor designed to achieve ergy and performance properties of the OoO proces- close to In-Order processor energy while maintaining sor in the recent years [2], studies show the energy Out-of-Order (OoO) performance. CG-OoO is an and performance attributes of the OoO execution model energy-performance proportional general purpose remain superlinearly proportional [3, 4]. Studies indi- architecture that scales according to the program cate control speculation and dynamic scheduling tech- load1. Block-level code processing is at the heart of nique amount to 88% and 10% of the OoO superior the this architecture; CG-OoO speculates, fetches, performance compared to the In-Order (InO) proces- schedules, and commits code at block-level granu- sor [5]. Scheduling and speculation in OoO is performed larity. It eliminates unnecessary accesses to energy at instruction granularity regardless of the instruction consuming tables, and turns large tables into smaller type even though they are mainly effective during un- and distributed tables that are cheaper to access. predictable dynamic events (e.g. unpredictable cache CG-OoO leverages compiler-level code optimizations misses) [5]. Furthermore, our studies show speculation to deliver efficient static code, and exploits dynamic and dynamic scheduling amount to 67% and 51% of instruction-level parallelism and block-level parallelism. the OoO excess energy compared to the InO processor. CG-OoO introduces Skipahead issue, a complexity These observations suggest any general purpose proces- effective, limited out-of-order instruction scheduling sor architecture that aims to maintain the superior per- model. Through the energy efficiency techniques formance of OoO while closing the energy efficiency gap applied to the compiler and processor pipeline stages, between InO and OoO ought to implement architectural CG-OoO closes 64% of the average energy gap between solutions in which low energy program speculation and the In-Order and Out-of-Order baseline processors at dynamic scheduling are central. the performance of the OoO baseline. This makes Our study provide four high level observations. CG-OoO 1.9× more efficient than the OoO on the First, OoO excess energy is well distributed across all energy-delay product inverse metric. pipeline stages. Thus, an energy efficient architecture should reduce energy of each stage. Second, OoO execution model imposes tight functional dependencies 1. INTRODUCTION between stages requiring a solution to enable energy arXiv:1606.01607v1 [cs.AR] 6 Jun 2016 This paper revisits the Out-of-Order (OoO) execution efficiency across all stages. Third, as mentioned by model and devises an alternative model that achieves others, complexity effective micro-architectures such the performance of the OoO at over 50% lower en- as ILDP [6] and Palachara, et al. [7] enable simpler ergy cost. Czechowski et al. [2] discusses the energy hardware, such as local and global register files that efficiency techniques used in the recent generations of improve energy efficiency. A block-level execution the Intel CPU architectures (e.g. Core i7, Haswell) model, like CG-OoO, enables energy efficiency by including Micro-op cache, Loop cache, and Single In- simplifying complex, energy consuming modules struction Multiple Data (SIMD) instruction set archi- throughout the pipeline stages. Fourth, since dynamic tecture (ISA). This paper questions the inherent en- scheduling and speculation techniques mainly benefit ergy efficiency attributes of the OoO execution model unpredictable dynamic events, they should be applied and provides a solution that is over 50% more energy to instructions selectively. Unpredictable events are efficient than the baseline OoO. The energy efficiency hard to detect and design for; however, we show a techniques discussed in [2] can also be applied to the hierarchy of scheduling techniques can adjust the 1 processing energy according to the program runtime Not to be confused with energy-proportional designs [1]. Energy-performance proportional scaling refers to linear state. change in energy as the processor configuration allows higher CG-OoO contributes a hierarchy of scheduling peak performance (Figure 26). techniques centered around clustering instructions; structions. In contract to these proposals, the CG-OoO static instruction scheduling organizes instructions clusters basic-block instructions statically such that, at at basic-block level granularity to reduce stalls. The runtime, control speculation, fetch, commit, and squash CG-OoO dynamic block scheduler dispatches multiple are done at block granularity. Furthermore, CG-OoO code blocks concurrently. Blocks issue instructions leverages energy efficient, limited out-of-order schedul- in-order when possible. In case of an unpredictable ing from each code block (col. 8). stall, each block allows limited out-of-order instruction issue using a complexity effective structure named Skipahead; Skipahead accomplishes this by perform- ing dynamic dependency checking between a very small collection of instructions at the head of each code block. Section 4.4.1 discusses the Skipahead micro-architecture. CG-OoO contributes a complexity effective block- level control speculation model that saves speculation energy throughout the entire pipeline by allowing -architecture/Coarse-Grain block-level control speculation, fetch, register renaming µ bypass, dispatch, and commit. Several front-end architectures have shown block-level speculation can be done with high accuracy and low energy cost [8, 9, 10]. CG-OoO uses a distributed register file hierarchy to DESIGN Distributed Energy Modeling Complexity Effective Design Profiling NOT done Pipeline Clustering Static & Dynamic Scheduling Hybrid Block-level Out-of-Order Scheduling Register File Hierarchy allow static allocation of block-level, short-living regis- ! ! ! ! ! ! ! CG-OoO ! ters, and dynamic allocation of long-living registers. ! ! ! Braid [11] ! The rest of this paper is organized as follows. Sec- ! ! ! ! WiDGET [3] ! tion 2 presents the related work, Section 3 describes ! ! ! ! TRIPS [12, 13] ! the CG-OoO execution model, Section 4 discusses the ! ! ! Multiscalar [14] ! processor architecture, Section 5 presents the evaluation ! ! ! CE [7] ! methodology, Section 6 provides the evaluation results, ! ! ! TP [15] ! and Section 7 concludes the paper. ! ! MorphCore [16] ! ! ! BOLT [17] ! ! 2. OVERVIEW & RELATED WORK iCFP [18] ! ! ! ! CG-OoO aims to design an energy efficient, high- ILDP [6] ! ! ! ! ! performance, single-threaded, processor through target- WaveScalar [19] ! ing a design point where the complexity is nearly as Table 1: Eight high level design features of the CG-OoO architecture compared to the previous literature. simple as an in-order and instruction-level parallelism (ILP) is paramount. Table 1 compares several high-level design features that distinguish the CG-OoO processor Multiscalar [14] evaluates a multi-processing unit from the previous literature. Unlike others, CG-OoO’s capable of steering coarse grain code segments, often primary objective is energy efficient computing (column larger than a basic-block, to its processing units. It 3), thereby designing several complexity effective (col. replicates register context for each computation unit, 4), energy-aware techniques including: an efficient reg- increasing the data communication across its register ister file hierarchy (col. 9), a block-level control specu- files. TRIPS and EDGE [12,20] are high-performance, lation, and a static and dynamic block-level instruction grid-processing architectures that uses static instruc- scheduler (col. 7, 8) coupled with a complexity effective tion scheduling in space and dynamic scheduling in out-of-order issue model named Skipahead. CG-OoO is time. It uses Hyperblocks [21] to map instructions a distributed instruction-queue model (col. 2) that clus- to the grid of processors. Hyperblocks use branch ters execution units with instruction queues to achieve predication to group basic-blocks that are connected an energy-performance proportional solution (col. 6). together through weakly biased branches. To construct Braid [11] clusters static instructions at sub-basic- Hyperblocks, the TRIPS compiler uses program block granularity. Each braid runs in-order as a block profiling. While effective for improving instruction of code. Clustering static instructions at this granular- parallelism, Hyperblocks lead to energy inefficient ity requires additional control instructions to guarantee mis-speculation recovery events. Palachara, et al. [7] program execution correctness. Injecting instructions supports a distributed instruction window model that increases instruction cache pressure and processor en- simplifies the wake-up logic, issue window, and the ergy overhead. Braid performs instruction-level, branch forwarding logic. In this paper, instruction scheduling prediction, issue and commit. WiDGET [3] is a power- and steering is done at instruction granularity. Trace proportional grid execution design consisting of a de- Processors [15] is an instruction flow design based coupled thread context management

CG-Ooo Energy-Efficient Coarse-Grain Out-Of-Order Execution

Performance and Energy Efficient Network-On-Chip Architectures

Computer Architecture: Dataflow (Part I)

Configurable Fine-Grain Protection for Multicore Processor Virtualization 1

Parallel Computer Architecture III

Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

An Evaluation of the TRIPS Computer System

A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective

Designing Heterogeneous Many-Core Processors to Provide High Performance Under Limited Chip Power Budget

Modeling Instruction Placement on a Spatial Architecture

Compiling for EDGE Architectures

Scatter-Add in Data Parallel Architectures

Universal Mechanisms for Data-Parallel Architectures