Improving Processor Efficiency by Statically Pipelining Instructions
Ian Finlayson (University of Mary Washington, Fredericksburg, Virginia, USA) fi[email protected]
Brandon Davis, Peter Gavin (Florida State University, Tallahassee, Florida, USA) fbdavis,[email protected]
Gang-Ryung Uh (Boise State University, Boise, Idaho, USA) [email protected]
David Whalley, Magnus Själander, Gary Tyson (Florida State University, Tallahassee, Florida, USA) fwhalley,sjalande,[email protected]

Abstract

A new generation of applications requires reduced power consumption without sacrificing performance. Instruction pipelining is commonly used to meet application performance requirements, but some implementation aspects of pipelining are inefficient with respect to energy usage. We propose static pipelining as a new instruction set architecture to enable more efficient instruction flow through the pipeline, which is accomplished by exposing the pipeline structure to the compiler. While this approach simplifies hardware pipeline requirements, significant modifications to the compiler are required. This paper describes the code generation and compiler optimizations we implemented to exploit the features of this architecture. We show that we can achieve performance and code size improvements despite a very low-level instruction representation. We also demonstrate that static pipelining of instructions reduces energy usage by simplifying hardware, avoiding many unnecessary operations, and allowing the compiler to perform optimizations that are not possible on traditional architectures.

1. Introduction

Energy expenditure is clearly a primary design constraint, especially for embedded processors where battery life is directly related to the usefulness of the product. As these devices become more sophisticated, the execution performance requirements increase. This trend has led to new generations of efficient processors that seek the best solution to the often conflicting requirements of low-energy design and high-performance execution. Many of the micro-architectural techniques to improve performance were developed when efficiency was not as important. For instance, speculation is a direct tradeoff between power and performance, but many other techniques are assumed to be efficient.

Instruction pipelining is one of the most common techniques for improving performance of general-purpose processors. Pipelining is generally considered very efficient when speculation costs and scheduling complexity are minimized. While it is true that speculation, dynamic scheduling policies, and superscalar execution have the largest impact on efficiency, even simple, in-order, scalar pipelined architectures have inefficiencies that lead to less optimal implementations of the processor architecture. For instance, hazard detection and data forwarding not only require evaluation of register dependencies each cycle of execution, but successful forwarding does not prevent register file accesses to stale values, nor does it eliminate unnecessary pipeline register writes of those stale values, which are propagated for all instructions.
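To make this per-cycle overhead concrete, the following C sketch models the forwarding comparisons a generic in-order five-stage pipeline performs for every source operand of every instruction. It is purely illustrative; the latch layout and names are assumptions and are not taken from this paper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pipeline latch written by the execute stage in a generic five-stage design. */
typedef struct {
    uint8_t  rd;         /* destination register number                */
    uint32_t result;     /* value that will eventually be written back */
    bool     writes_rd;
} ExMemLatch;

/* Latch written by the memory stage. */
typedef struct {
    uint8_t  rd;
    uint32_t result;
    bool     writes_rd;
} MemWbLatch;

/*
 * Forwarding check performed for each source operand, every cycle: compare
 * the operand's register number against the destinations of the younger
 * in-flight instructions.  Note what a successful forward does NOT avoid:
 * the register file was already read in the decode stage (possibly returning
 * a stale value that is simply discarded here), and that stale value was
 * still written into the decode/execute pipeline latch.
 */
static uint32_t forward_operand(uint8_t src, uint32_t stale_regfile_value,
                                const ExMemLatch *exmem, const MemWbLatch *memwb)
{
    if (exmem->writes_rd && exmem->rd == src)   /* comparator against EX/MEM */
        return exmem->result;
    if (memwb->writes_rd && memwb->rd == src)   /* comparator against MEM/WB */
        return memwb->result;
    return stale_regfile_value;                 /* no hazard: use register file value */
}
```

Even when no hazard exists, the comparators are exercised every cycle, and the values read from the register file continue to ripple through the remaining pipeline latches whether or not they were superseded.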
The goal of this paper is to restructure the organization of a pipelined processor implementation in order to remove as many redundant or unnecessary operations as possible. This goal is achieved by making the pipelined structures architecturally visible and relying on the compiler to optimize resource usage. While techniques like VLIW [8] have concentrated on compile time instruction scheduling and hazard avoidance, we seek to bring pipeline control further into the realm of compiler optimization. When pipeline registers become architecturally visible, the compiler can directly manage tasks like forwarding, branch prediction, and register access. This mitigates some of the inefficiencies found in more conventional designs, and provides new optimization opportunities to improve the efficiency of the pipeline.

Figure 1 illustrates the basic idea of the static pipeline (SP) approach. With traditional pipelining, each instruction spends several cycles in the pipeline. For example, the load instruction in Figure 1(b) requires one cycle for each stage and remains in the pipeline from cycles five through eight. Each instruction is fetched and decoded and information about the instruction flows through the pipeline, via pipeline registers, to control each portion of the processor that takes a specific action during each cycle. The load instruction shares the pipelined datapath with other instructions that are placed into the pipeline in adjacent cycles.

Figure 1. Traditionally Pipelined vs. Statically Pipelined Instructions: (a) Traditional Insts, (b) Traditional Pipelining, (c) Static Pipelining.

Figure 2. Classical Five-Stage Pipeline.

Figure 1(c) illustrates how an SP processor operates. Operations associated with conventional instructions still require multiple cycles to complete execution; however, the method used to encode how the operation is processed while executing differs. In SP scheduled code, the execution of the load operation is not specified by a single instruction. Instead each SP instruction specifies how all portions of the processor are controlled during the cycle it is executed. Initially encoding any conventional instruction may take as many SP instructions as the number of pipeline stages in a conventional processor. While this approach may seem inefficient in specifying the functionality of any single conventional instruction, the cost is offset by the fact that multiple SP effects can be scheduled for the same SP instruction. The SP instruction set architecture (ISA) results in more control given to the compiler to optimize instruction flow through the processor, while simplifying the hardware required to support hazard detection, data forwarding, and control flow. By relying on the compiler to do low-level processor resource scheduling, it is possible to eliminate some structures (e.g., the branch target buffer), avoid some repetitive computation (e.g., sign extensions and branch target address calculations), and greatly reduce accesses to both the register file and internal registers. This strategy results in improved energy efficiency through simpler hardware, while providing new code optimization opportunities for achieving performance improvement. The cost of this approach is the additional complexity of code generation and compiler optimizations targeting an SP architecture. The novelty of this paper is not strictly in the SP architectural features or compiler optimizations when either is viewed in isolation, but also in the SP ISA that enables the compiler to more finely control the processor to produce a more energy efficient system.
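To give a feel for what it means for each SP instruction to specify how all portions of the processor are controlled during a single cycle, the C sketch below models an instruction as a bundle of per-cycle effects, one field per datapath resource. The field names, the particular effects, and the four-bundle expansion of a load are illustrative assumptions and not the actual SP ISA encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-cycle "effect bundle" for a statically pipelined machine.
 * Each field controls one portion of the datapath for exactly one cycle.
 */
typedef struct {
    bool    read_rs1;      /* drive a register file read port         */
    uint8_t rs1;
    bool    alu_add;       /* use the adder, e.g. for address generation */
    int32_t imm;           /* immediate operand for the adder         */
    bool    mem_load;      /* access the data cache this cycle        */
    bool    write_rd;      /* drive the register file write port      */
    uint8_t rd;
} SPEffects;

/*
 * A conventional "load r3, 8(r1)" might initially expand into one bundle per
 * former pipeline stage: read the base register, add the offset, access
 * memory, and write the result back.
 */
static const SPEffects load_expansion[] = {
    { .read_rs1 = true, .rs1 = 1 },          /* read base register r1     */
    { .alu_add  = true, .imm = 8 },          /* compute effective address */
    { .mem_load = true },                    /* data cache access         */
    { .write_rd = true, .rd = 3 },           /* write loaded value to r3  */
};
```

Because effects belonging to different conventional instructions can be packed into the same bundle, an expanded sequence like the one above is only a starting point for the compiler rather than the final schedule.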
This paper makes the following contributions. (1) We show that the use of static pipelining can expose many new opportunities for compiler optimizations to avoid redundant or unnecessary operations performed in conventional pipelines. (2) We establish that a very low-level ISA can be used and still achieve performance and code size improvements. (3) We demonstrate that static pipelining can reduce energy usage by significantly decreasing the number of register file accesses, internal register writes, branch predictions, and branch target address calculations, as well as completely eliminating the need for a branch target buffer. In summary, we provide a novel method for pipelining that exposes more control of processor resources to the compiler to significantly improve energy efficiency while obtaining counter-intuitive improvements in performance and code size.

2. Architecture

In this section we discuss the SP architecture design including both the micro-architecture and the instruction set. While static pipelining refers to a class of architectures, we describe one design in detail that is used as the basis for the remainder of this paper.

Figure 3. Datapath of a Statically Pipelined Processor.

2.1 Micro-Architecture

The SP micro-architecture evaluated in this paper is designed to be similar to a classical five-stage pipeline in terms of available hardware and operation. The motivation for this design is to minimize the required hardware resource differences between the classical baseline design and our SP design in order to evaluate the benefits of the SP technique.

Figure 2 depicts a classical five-stage [...]

[...] target addresses and adding an offset to a register to form a memory address even when that offset is zero.

Figure 3 depicts one possible datapath of an SP processor. The fetch portion of the processor is mostly unchanged from the conventional processor. Instructions are fetched from the instruction cache and branches are predicted by a BPB. One difference is that there is no longer any need for a BTB. This structure is used to store
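For reference, the BTB referred to above, which the SP design makes unnecessary, is conventionally a small cache of branch target addresses consulted on every fetch. The C sketch below shows a generic direct-mapped lookup of that kind; the sizing, indexing, and field names are illustrative assumptions and do not describe this paper's baseline implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 256u   /* size chosen only for illustration */

/* One entry of a generic direct-mapped branch target buffer. */
typedef struct {
    bool     valid;
    uint32_t tag;      /* identifies which branch PC owns the entry */
    uint32_t target;   /* stored branch target address              */
} BTBEntry;

static BTBEntry btb[BTB_ENTRIES];

/*
 * A conventional fetch stage consults the BTB every cycle so that a
 * predicted-taken branch can redirect fetch before its target has been
 * computed.
 */
static bool btb_lookup(uint32_t pc, uint32_t *predicted_target)
{
    BTBEntry *entry = &btb[(pc >> 2) % BTB_ENTRIES];
    if (entry->valid && entry->tag == pc) {
        *predicted_target = entry->target;   /* redirect fetch to this address */
        return true;
    }
    return false;                            /* fall through to pc + 4 */
}
```

Removing the structure eliminates both the target storage and this per-fetch tag comparison from the front end.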