Pipelining a Triggered Processing Element

Thomas J. Repetti (Dept. of Computer Science, Columbia University, [email protected])
João P. Cerqueira (Dept. of Electrical Engineering, Columbia University, [email protected])
Martha A. Kim (Dept. of Computer Science, Columbia University, [email protected])
Mingoo Seok (Dept. of Electrical Engineering, Columbia University, [email protected])

ABSTRACT
Programmable spatial architectures composed of ensembles of autonomous fixed-ISA processing elements offer a compelling design point between the flexibility of an FPGA and the compute density of a GPU or shared-memory many-core. The design regularity of spatial architectures demands examination of the processing element early in the design to optimize overall efficiency.

This paper considers the microarchitectural issues surrounding pipelining a spatial processing element with triggered-instruction control. We propose two new techniques to mitigate pipeline hazards particular to spatial accelerators and non-program-counter architectures, evaluating them using in-vivo performance counters from an FPGA prototype coupled with a rigorous VLSI power and timing estimation methodology. We consider the effect of modern, post-Dennard-scaling CMOS technology on the energy-delay tradeoffs and identify a set of optimal designs for both high-performance and low-power application settings. Our analysis reveals the effectiveness of our hazard mitigation techniques as well as the range of microarchitectures designers might consider when selecting a processing element for triggered spatial accelerators.

CCS CONCEPTS
• Computer systems organization → Pipeline computing; Reduced instruction set computing; Multiple instruction, multiple data; Multicore architectures; Interconnection architectures; • Hardware → Power and energy;

KEYWORDS
Spatial architectures, pipeline hazards, microarchitecture, design-space exploration, low-power design

ACM Reference format:
Thomas J. Repetti, João P. Cerqueira, Martha A. Kim, and Mingoo Seok. 2017. Pipelining a Triggered Processing Element. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14–18, 2017, 13 pages. https://doi.org/10.1145/3123939.3124551

1 INTRODUCTION
Spatial accelerators support important workloads such as information retrieval [22], databases [33–35], string processing [26], and neural networks [8, 9]. A general-purpose spatial array of programmable processing elements can serve these and other applications with spatial parallelism and direct inter-processing-element communication. In contrast to a fixed-function accelerator, a programmable accelerator accommodates new workloads and optimization of existing ones. In such designs, triggered control [19, 20, 32] has demonstrable architectural benefits, reducing both dynamic and static instruction counts relative to program-counter-based control. This is an important element of reducing energy and delay, but it is only part of the story.

Total energy consumption is a product of dynamic instruction count and the energy expended per instruction:

    Energy/Program = (Instructions/Program) × (Energy/Instruction).

While the architecture and workload determine Instructions/Program, the microarchitecture and circuit determine Energy/Instruction. Total delay is likewise determined by the architectural constraint of dynamic instruction count, but also by cycles per instruction (CPI) and cycles per unit time (frequency):

    Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle).

Cycles/Instruction and Time/Cycle are also properties of the microarchitecture and underlying circuit technology. Exploring these elements of energy and delay below the architectural level is the focus of this paper.

This work investigates the microarchitectural and circuit design space of triggered processing elements. The replication factor in a tiled architecture demands deep analysis and optimization of the central building block: the processing element (or PE). We focus on the interplay between the instruction pipeline and supply voltage scaling, as it has long been established that pipelining can improve instruction-level parallelism, timing closure, and power efficiency through voltage scaling [2, 3, 16, 27]. Once a pipeline has reduced the critical path of a circuit, additional opportunity to trade energy and delay appears. One could maintain nominal supply voltage and increase clock frequency, maintain the original clock frequency and reduce supply voltage, or apply some combination in the middle. Our results reveal the importance of such design choices. Having explored over 4,000 unique designs in this space, the energy-delay tradeoff curve spans 71x in energy – from 0.67 to 47.59 pJ/instruction – and 225x in delay – from 1.37 to 309.03 ns/instruction.

Triggered control poses unique sorts of hazards for an instruction pipeline. To launch an instruction, the front end must compare the predicate and communication queue state to a programmed set of trigger conditions, as opposed to just calculating the next address for the program counter. We present two new hazard mitigation techniques that help keep the pipeline full: speculation on upcoming predicate state and accurate queue status accounting given the current contents of the pipeline. We find that these techniques reduce the increases in CPI that otherwise accompany deep pipelines, together reducing CPI in a 4-stage pipeline by 35%. They incur some overheads – in the worst case 1.4% area, 8% power, 20% critical path – but ultimately improve the optimal design frontier by 20-25% in both energy and delay. While predicate prediction is applicable specifically to triggered instruction architectures, the method of determining effective queue status benefits any spatial architecture with pipelined processing elements and register queues.

We have released an open-source repository to support further investigation in this area. It includes SystemVerilog implementations of both the single-cycle and pipelined microarchitectures presented here, which in turn can be used in synthesizable spatial arrays. It is supported by a toolchain that includes an assembler, functional ISA simulator, Linux driver and userspace library. All of these are governed by a single parameter file that configures the architecture (e.g., queue counts or instructions per processing element) and microarchitecture (e.g., turning on/off the aforementioned hazard mitigation techniques). Lastly, we include a set of ten triggered-instruction microbenchmarks that exhibit a range of intra-PE behaviors.

In the following section we provide some more background on triggered architectures in general and the specific triggered ISA we have designed. Section 3 describes our FPGA prototype and VLSI power and timing estimation methodology. Section 4 and Section 5 present and characterize the single-cycle and pipelined designs respectively. Before closing, we discuss some limitations and possible extensions of this work in Section 6 and related work in Section 7.

2 TRIGGERED ARCHITECTURE
All of the microarchitectures we will examine are implementations of a triggered ISA of our own design. Here, we provide some background on triggered instruction control (Section 2.1), a description of our triggered ISA (Section 2.2), and the supporting toolchain we have released (Section 2.3).

2.1 Background
Triggered control was proposed by Parashar et al. in 2013 [19] as an alternative to program-counter-based control for spatial arrays of autonomous PEs. In the triggered scheme, each PE is programmed with a priority-ordered list of guarded atomic actions. This list represents a finite, statically configured local pool of instructions, whose eligibility for issue in any given cycle is determined by a corresponding "trigger" condition (i.e., the guard). Each cycle, all of the triggers are compared to designated architectural state – predicate and queue status, described shortly – to determine whether the corresponding instruction has been "triggered". Instructions are ordered by priority rather than sequence, with the highest-priority triggered instruction issued for execution (Figure 2).

Each trigger-controlled PE is connected to neighboring PEs by a set of incoming and outgoing tagged data queues over an interconnect fabric. Tags encode programmable semantic information that accompanies the data communicated over these queues. For example, a tag might be used to indicate the datatype of the accompanying data word or a message to effect control flow like a termination condition. Tag values at the head of the input queues determine, in part, whether an instruction can be fired. The PE also contains a set of single-bit predicate registers, which can be updated immediately upon triggering an instruction, or as the result of a datapath operation. Each trigger's validity is determined by the state of the predicate registers, the availability of tagged input operands on the incoming queues, and capacity on the output queues for any instructions that write there. Comparison or logic instructions whose destination is a predicate register provide control flow equivalent to branching in program-counter-based ISAs.

By eliminating explicit branches, triggered control reduces the dynamic instruction count of a given task. Moreover, in a spatial context, it allows PEs to react quickly to incoming data. Together these two features help multiple PEs to work together in an efficient processing chain: each PE in the chain works on the current data item, and then efficiently hands it off to the next PE.

Parashar et al. evaluated a number of control idioms and laid out these architectural benefits in their original work [19]. The focus was primarily above the microarchitectural level, with mention of a two-stage pipeline not otherwise specified. The original paper examined the scheduler, placing it at 2% of PE area, but there was no other area or power breakdown or characterization. Subsequent work [32] mentions a CACTI- and Verilog-based power model, suggesting it closed at 3.4 GHz in an unspecified commercial CMOS process, but that was the extent of the commentary. To date there has been little public study of the microarchitectural choices and challenges of implementing a triggered PE. Such study complements architectural ones and, as this work shows, has a substantial impact on the ultimate energy efficiency and performance that can be achieved.
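To make the scheduling step concrete, the following SystemVerilog sketch shows one way a guard could be evaluated against the predicate state and a ready instruction chosen by priority. The module and signal names are ours, for illustration only; queue and tag readiness is abstracted into a single precomputed bit per instruction slot, and this is not the released implementation.

```systemverilog
// Illustrative trigger resolution (names hypothetical, not the released RTL).
// Queue/tag readiness is assumed precomputed into one bit per slot.
module trigger_resolve #(
    parameter int NPREDS = 8,
    parameter int NINS   = 16
) (
    input  logic [NPREDS-1:0]       preds,           // predicate register state
    input  logic [NINS-1:0]         valid,           // instruction valid bits
    input  logic [NINS-1:0]         queue_ok,        // tags match, operands/space available
    input  logic [NPREDS-1:0]       on_set  [NINS],  // predicates that must be 1 to fire
    input  logic [NPREDS-1:0]       off_set [NINS],  // predicates that must be 0 to fire
    output logic                    any_triggered,
    output logic [$clog2(NINS)-1:0] issue_index      // highest-priority triggered slot
);
    logic [NINS-1:0] triggered;

    always_comb begin
        for (int i = 0; i < NINS; i++) begin
            // A slot fires when every on-set predicate is 1, every off-set
            // predicate is 0, and its queue conditions are satisfied.
            triggered[i] = valid[i]
                         && (&(preds | ~on_set[i]))
                         && (&(~preds | ~off_set[i]))
                         && queue_ok[i];
        end
        any_triggered = |triggered;
        // Priority encoder: the lowest index wins, so scan from high to low.
        issue_index = '0;
        for (int i = NINS - 1; i >= 0; i--) begin
            if (triggered[i]) issue_index = i[$clog2(NINS)-1:0];
        end
    end
endmodule
```

Note that all guard fields must feed this logic combinationally every cycle, a property that shapes the instruction storage choices discussed in Section 4.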


Reducing the latency from when a PE receives a data token to when it emits a result token is an intra-PE optimization. However, in a spatial context where PEs operate together as a pipeline, the throughput of the pipeline is limited by this single-PE latency. Improving the PE microarchitecture can thus have a system-level effect on the behavior of the entire fabric.

2.2 A Generic, Integer ISA
We have designed a triggered, general-purpose, RISC-style, integer ISA that supports a full complement of arithmetic and logical operations. In our assembly, an instruction looks like the following. Below, a merge sort worker looks for two available input queue operands with tag 0 and performs an unsigned less-than comparison on them, the result of which is written into the seventh predicate register.

    when %p == XXXX0000 with %i0.0, %i3.0:
        ult %p7, %i3, %i0; set %p = ZZZZ0001;

The first line is the guard and the second line is the datapath operation and predicate update. In this assembly, %p refers to the predicate register, which is part of the trigger condition and is optionally updatable by an assignment in the datapath operation segment, as shown in this example. Individual bits of the predicate can be pattern-matched in the trigger condition and selectively assigned to with don't-care/high-impedance notation. The indexed %r*, %i*, %o* and %p* operands refer to general-purpose data registers, input channels, output channels and predicate bits, respectively. Tag information is referenced on input queues (in tag matching) and output queues (in datapath operations) with the . operator. In this snippet, the trigger is checking for tag values of 0 on input queues %i0 and %i3. If any result is enqueued on an output queue, the tag must also be specified so downstream PEs can use the tag's semantic information.

In our ISA, we have restricted operations to fairly low-latency integer operations, with the lengthiest of these being two-word-product integer multiplication. Although the current version of this architecture does not support floating point instructions, the need to pipeline such functional units would likely increase the need for techniques to fill deep triggered pipelines like those we present in Section 5. We do, however, offer a wide range of comparison operations and logical operators intended primarily for predicate writes to support expressive control flow. There is also a rich set of bit manipulation instructions, such as clz and ctz (count leading/trailing zeros), which support software implementation of routines such as integer division or floating point operations.

The ISA also offers loads and stores from a small, PE-local scratchpad memory. Operations involving main memory are currently carried out explicitly via the queues, using read and write ports as endpoints for designated channels as described in prior work [1].

The architecture exposes a number of parameters which govern the binary instruction encoding, listed in Table 1. These include standard ones such as register count and word size, as well as more unique ones such as the tag width for queue data and the maximum number of input queues that can be queried per trigger. Because our focus is on the microarchitecture, we have fixed all of the architectural parameters as listed in the table: a 32-bit data word size, 8 predicate registers, 8 data registers, 4 input channels, 4 output channels, a maximum of two input channel tag conditions per trigger, and 16 instructions per PE.

Parameter   Description                       Value
NRegs       Number of registers               8
NIQueues    Number of input queues            4
NOQueues    Number of output queues           4
MaxCheck    Max queues checked per trigger    2
MaxDeq      Max dequeues allowed / ins        2
NPreds      Number of predicates              8
Word        Word width                        32
TagWidth    Queue tag width                   2
NIns        Number of instructions per PE     16
NOps*       Number of operations              42
NSrcs*      Number of source operands / ins   2
NDsts*      Number of destinations / ins      1

Table 1: Architectural and microarchitectural parameters. All of them, except those marked with *, are recognized by the toolchain described in Section 2.3.

Figure 1: The toolchain is centered around a parameters file (params.yaml) which can completely specify the target architecture and underlying microarchitecture to build a fully functional hardware/software system. An assembler and parameter generator (Python) produce program.bin and params.sv for the functional simulator (Python), the Linux driver and userspace library (C) running on the Zynq board's ARM core, and the array of triggered PEs (SystemVerilog) on the FPGA fabric.

Table 2 describes each instruction field and its width, both in general and specifically as it results from the parameter assignment in Table 1. These binary fields can be loosely divided between the trigger and datapath segments of the instruction.


Field          Description                                                 Width                                           Bits
Val            Valid bit                                                   1                                               1
PredMask       Required on-set and off-set of predicates for trigger       2 × NPreds                                      16
QueueIndices   Input queues to check                                       MaxCheck × ⌈log2(NIQueues + 1)⌉                 6
NotTags        Which queues to check for absence of given tag              MaxCheck                                        2
TagVals        Vector of tags to seek on input queues                      MaxCheck × TagWidth                             4
Op             Opcode                                                      ⌈log2(NOps)⌉                                    6
SrcTypes       Source types (register, input queue, immediate, or none)    NSrcs × 2                                       4
SrcIDs         Source indices                                              NSrcs × ⌈log2(max(NRegs, NIQueues))⌉            6
DstTypes       Destination types (register, output queue, or predicate)    NDsts × 2                                       2
DstIDs         Destination indices                                         NDsts × ⌈log2(max(NRegs, NOQueues, NPreds))⌉    3
OutTag         Tag with which to enqueue the result                        TagWidth                                        2
IQueueDeq      Input queues to dequeue                                     MaxDeq × ⌈log2(NIQueues + 1)⌉                   6
PredUpdate     Masks of which predicates to force high or low              2 × NPreds                                      16
Imm            Immediate value                                             WordWidth                                       32

Table 2: Instruction fields for our ISA encoding. The sizing of many fields in the machine code layout is dependent on the parametrization chosen in Table 1. Unsynthesized padding bits to round to the memory-mapped width are omitted.
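Under the parametrization of Table 1, these widths can be reproduced mechanically. The following sketch shows the arithmetic in the spirit of the generated params.sv; the package and constant names are hypothetical, not the released file.

```systemverilog
// Field-width arithmetic under Table 1's parametrization (names illustrative).
package tia_params;
    localparam int NREGS = 8, NIQUEUES = 4, NOQUEUES = 4, NPREDS = 8;
    localparam int MAXCHECK = 2, MAXDEQ = 2, TAGWIDTH = 2, NOPS = 42;
    localparam int NSRCS = 2, NDSTS = 1, WORDWIDTH = 32;

    localparam int PREDMASK_W = 2 * NPREDS;                       // 16
    localparam int QIDX_W     = MAXCHECK * $clog2(NIQUEUES + 1);  // 2 * 3 = 6
    localparam int NOTTAGS_W  = MAXCHECK;                         // 2
    localparam int TAGVALS_W  = MAXCHECK * TAGWIDTH;              // 4
    localparam int OP_W       = $clog2(NOPS);                     // ceil(log2(42)) = 6
    localparam int SRCTYPES_W = NSRCS * 2;                        // 4
    localparam int SRCIDS_W   = NSRCS * $clog2(NREGS);            // max(NRegs, NIQueues) = 8 -> 6
    localparam int DSTTYPES_W = NDSTS * 2;                        // 2
    localparam int DSTIDS_W   = NDSTS * $clog2(NREGS);            // max(NRegs, NOQueues, NPreds) = 8 -> 3
    localparam int OUTTAG_W   = TAGWIDTH;                         // 2
    localparam int IQDEQ_W    = MAXDEQ * $clog2(NIQUEUES + 1);    // 6
    localparam int PREDUPD_W  = 2 * NPREDS;                       // 16
    localparam int IMM_W      = WORDWIDTH;                        // 32

    // One valid bit plus the fields above: 106 bits per instruction.
    localparam int INS_W = 1 + PREDMASK_W + QIDX_W + NOTTAGS_W + TAGVALS_W
                         + OP_W + SRCTYPES_W + SRCIDS_W + DSTTYPES_W + DSTIDS_W
                         + OUTTAG_W + IQDEQ_W + PREDUPD_W + IMM_W; // = 106
endpackage
```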

Fields such as PredMask, which encode the requisite on-set and off-set of predicate register state necessary for an instruction to fire, clearly belong to the guard of the atomic action. While other fields, like reference tags for comparison on input channels, are also only needed for instruction scheduling, some others do not fit squarely into one half of the instruction or the other and must be available to the scheduler at all times.

For example, the destination information (DstTypes and DstIDs) is required because if an operation is to write to an output queue, the instruction may trigger only if that queue has space. For instructions that do not enqueue information, these fields specify the destination data or predicate register and so are also required by the datapath. These two fields are then important to both the trigger resolution front end of the PE and the datapath operation. A consequence of the parallel nature of trigger resolution is the need for all trigger fields to be exposed combinationally to the scheduler in order to sustain execution at a rate of one instruction per cycle, which has area and energy ramifications for instruction memory discussed at length in Section 4 and Section 5.4.

Another important component of the instruction is the PredUpdate field. These two bitvectors, with one entry per predicate, indicate which predicate bits to force high or low to update the predicate state. Updating the predicate state rapidly is important to keep the PE making forward progress. If any datapath instruction has a predicate as a destination, we assume that this predicate update mask will not conflict with it. This is guaranteed by the assembler provided in our toolchain. Since the predicate update is roughly equivalent to the default PC = PC + 4 update in an equivalent traditional machine, this field must update architectural state within a cycle of the instruction trigger in order to maintain an upper-bound CPI of one.

We have opted to support full word-length immediate fields. Due to the small, fixed handful of instructions each PE can hold because of instruction storage medium constraints, each instruction represents a scarce resource, and wasting two slots, and two cycles, to fill a register with a 32-bit physical address comprised of two ORed 16-bit immediates was undesirable. Moreover, the immediate is one of the few fields of the instruction that wholly belongs to the datapath, and it can therefore reside in inexpensive RAM-based storage that is indexed after the triggers have been resolved.

2.3 Public Release
To foster research in this area, we have released a software toolchain for this instruction set, workloads, and the single-cycle and pipelined microarchitectures (Section 4 and Section 5). All of these will be available in a public repository at http://github.com/arcade-lab/tia-infrastructure.

Toolchain. The software toolchain is depicted in Figure 1. At its root is a single parameter file, recognized by the assembler, functional simulator, and userspace library, and used to parametrize the microarchitecture itself. This flow recognizes all of the parameters listed in Table 1 with the exception of the starred ones. The assembler produces a binary that can be executed on either the functional simulator or an RTL prototype. The RTL prototypes – single-cycle PEs described in Section 4 and pipelined PEs as described in Section 5, implemented in SystemVerilog – are arranged in small-scale spatial arrays (maximum 4×4 to fit on a Zynq SoC-FPGA [11]) accompanied by a Linux-based driver and userspace library that accepts the same ISA parameter file as the assembler. The parameter file also supports a handful of on/off settings, making it easy to selectively enable features such as wide multiplication and scratchpad use.

The userspace library is responsible for interacting with the Linux driver through a memory-mapped interface, through which it can manipulate control registers, program instruction memories, preload scratchpad memories, and read state from per-PE debug monitors and performance counters. In addition, it is responsible for performing all data I/O and setting up data buffers for program execution.

For simplicity of this interface we have padded each 106-bit instruction to a round 128 bits. This padding is never stored in the write-only instruction memory and exists solely to simplify manipulation of the instructions themselves by the host processor.
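As an illustration of that padding, a host-side view of one instruction slot might look like the following packed layout. This is only a sketch: the field ordering and names are ours, and the released repository's actual layout may differ.

```systemverilog
// Host-visible view of one instruction slot: a 106-bit instruction padded
// to 128 bits for convenient word-aligned access. Layout is illustrative.
typedef struct packed {
    logic [21:0] pad;           // 22 unsynthesized padding bits (128 - 106)
    logic [31:0] imm;           // datapath-only field; candidate for RAM storage
    logic [15:0] pred_update;   // force-high / force-low predicate masks
    logic [5:0]  iqueue_deq;    // input queues to dequeue
    logic [1:0]  out_tag;       // tag to enqueue with a result
    logic [2:0]  dst_ids;
    logic [1:0]  dst_types;
    logic [5:0]  src_ids;
    logic [3:0]  src_types;
    logic [5:0]  op;
    logic [3:0]  tag_vals;      // trigger segment begins here
    logic [1:0]  not_tags;
    logic [5:0]  queue_indices;
    logic [15:0] pred_mask;     // on-set / off-set of predicates
    logic        val;
} padded_instruction_t;         // $bits(padded_instruction_t) == 128
```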
Workloads. The release includes a suite of hand-written and optimized assembly programs designed to exhibit a range of behaviors within the PE. These programs – three of them running on a single PE, seven running on a two-by-two PE array – are summarized in Table 3. Some, such as the binary search tree, are memory-access intensive, while others, like the dot product computation, are compute heavy. Others, such as the merge worker and string search, are branchy in data-dependent ways. Because we intentionally omitted a division instruction from the ISA, we have included a software division algorithm among the benchmarks, indicative of the software support for otherwise omitted operations in a RISC-style ISA.

bst (branch; predicate datapath write): A single PE accesses memory to traverse a binary search tree with nodes generated with random numbers to increase entropy. The PE then stores the Boolean result of this search in the same data memory.

gcd (register-register operations): A single PE reads two numbers for which to calculate the GCD (chosen intentionally for long runtime), performs a register-register operation workload to calculate the GCD, and stores it back to memory.

mean: A single PE reads an array of numbers from memory and accumulates them before calculating their average and storing it back to memory.

arg max: One PE streams an array of integers from memory to another, which determines the index of the highest of these values. The second PE (the worker) then stores the result back to data memory.

dot product: Two PEs stream two integer arrays to a third PE (the worker) which calculates the dot product. Upon receiving end-of-program tags from both stream PEs, the multiply-accumulate PE saves its accumulator to memory before halting.

filter: One PE streams a list of integers to a second, which determines whether they are above a threshold and in turn emits a zero or a one accordingly to a third PE. This third PE (the worker) uses this Boolean input stream to determine whether to save the corresponding value from a second stream of integers to memory.

merge: Simulates the conditions for a PE in a high-radix spatial merge sort using a 2x2 array of PEs. Two PEs stream sorted lists to a merge PE (the worker), which must produce a sorted list combining them.

stream: One PE (the worker) generates a stream of data to store (increasing integers from zero to a maximum value) while a second produces an identical stream which is used as store indices. The goal of the benchmark is to determine the maximum throughput for a sequential loop within a PE program.

string search (string matching): One PE reads four-byte words from memory and forwards them to a second PE, which breaks these words into bytes. This second PE forwards those bytes to a third PE (the worker) which interprets each as an ASCII character and scans the stream for the string "MICRO" using a small DFA hard-coded in TI assembly. This PE emits zeros in all states except the match state, in which it emits a one, resulting in an output array in memory which gives the indices of the occurrences of "MICRO".

udiv: This benchmark implements an unsigned integer division TI assembly macro in a single PE (the worker), which is fed numerators and denominators by another PE streaming them from memory. The divider PE saves the resulting quotients back to memory to validate the result.

Table 3: PE-centric benchmarks designed to display a range of anticipated behaviors within a PE. Publicly released. All reported performance counter figures from multi-PE workloads come from the designated "worker" PE.

3 METHODOLOGY
The following sections describe and characterize the performance, area, and power consumption of a number of implementations of the ISA described in Section 2.2.

Using the toolchain described in Figure 1 and a Zynq SoC-FPGA test board, we gather event and cycle counts from performance counters embedded in each PE. All of the data in these runs is supplied from on-chip memory, which on this system has a load latency of four cycles. For multi-PE programs, performance counters are taken from the designated "worker" PE (described in Table 3) which performs the representative part of the workload.

To assess the area, power, and clock frequency of each processing element, we synthesized netlists, minus performance counters, using Synopsys Design Compiler (2013.12-SP1) and TSMC 65 nm general-purpose CMOS standard cells. We used moderate synthesis optimization and allowed for retiming only within the multi-stage ALU and multiplier functional units in order to optimally pipeline arithmetic. We also allowed for clock gating throughout the design since this is a simple dynamic power optimization that would be expected in an end product. We characterized a subset of cells within the nominal 1.0 V standard cell library at a variety of supply voltages in the range of 0.4 V to 1.0 V to explore the interaction of pipelining and VDD scaling.

For the design space exploration, we characterized this subset of the standard-VT standard cells at 0.6 V, 0.7 V, 0.8 V, 0.9 V, and 1.0 V, and target frequencies of 100 MHz to 1.5 GHz at 100 MHz granularity. We also selectively refined this frequency granularity in near-threshold regimes to a granularity of 50 MHz up through 500 MHz. For non-standard threshold voltage standard cells in the low- and high-VT libraries, we performed characterization at 0.4 V, 0.6 V, 0.8 V, and 1.0 V, and again used selective refinement in the near- and subthreshold voltage regimes. Here we additionally refined the subthreshold high-VT target frequency sweeps in increments of 10 MHz through 100 MHz. Eight pipeline microarchitectures with and without the two optional pipeline optimizations discussed in Section 5.2 and Section 5.3 came to a total of 32 different microarchitectures. The resulting design space spans over 4,000 different design points.

For each of them, we extracted gate-level activity factors from a run of the binary search tree program, whose behavior we compared against known FPGA performance figures for automated post-synthesis validation. Among the single-PE workloads, binary search tree had the most balanced combination of I/O channel use, computation and memory access delay. The resulting execution ranged from approximately 90,000 to 160,000 cycles depending on the microarchitecture being analyzed. The resulting fully annotated netlist-level VCD was used as a dynamic activity input into the Synopsys PrimeTime (2013.12-SP1) fine-grain power and timing modeling software. We report pre-extraction results, but use a uniform wire-load model of 2 fF for all local nets. Due to the size of the design space, it was impractical to perform place and route on all the designs, which is why we opted for this uniform capacitive wire load approximation, but we have also successfully pushed several of our designs through automatic place and route to demonstrate feasibility.

For the other microbenchmarks, dynamic instruction counts vary from 20,003 for dot product to 411,540 for gcd. The total number of cycles executed for any one benchmark on any given microarchitecture maxes out at approximately 700,000 cycles.

4 SINGLE-CYCLE BASELINE
Figure 2 depicts our single-cycle microarchitecture, which serves as a qualitative baseline on which we build the pipelined implementations. While the primary design objective was simplicity, it includes a handful of features aimed at efficiency. The front end of the processor centers around the scheduler which, as described in the original triggered architecture paper [19], is responsible for comparing the instruction triggers with the current predicate and queue state to determine which instructions are eligible to execute. A priority encoder selects one, which is then fetched from instruction memory and executed on a fairly standard RISC datapath. Predicate updates from the triggered instruction are applied, with potentially additional conditional predicate updates originating from ALU comparison or logic operations.

Figure 2: Block diagram of our single-cycle triggered PE. In order for programs to make forward progress using this control mechanism, predicate updates encoded in PredMask, any input channel dequeues in IQueueDeq and datapath predicate writes must be atomic.

As mentioned in our discussion of binary instruction encoding in Section 2.2, a peculiar aspect of triggered-instruction-based control is the need for the portion of the instruction that includes trigger information to be accessed in parallel. This precludes all synchronous RAM-based storage media for storing trigger information, making it possible for the instruction storage to become large and power hungry relative to other components. Since some instruction fields are not required by the scheduler (e.g., Imm), it is possible to store those in a traditional indexed store such as SRAM, so long as the design is pipelined such that the stage in which instructions are triggered is separate from the stage in which those fields are decoded. We have validated that this mixed-storage-medium microarchitecture is possible on an FPGA prototype. In our preliminary CACTI-based [29] analysis of mixed register/latch-SRAM, we find that we can reduce instruction memory area and power usage by 16% and 24%, respectively, over register-only instruction memory, and by 9% and 19%, respectively, over latch-only instruction memory. Because of the restrictions such a design would impose on the pipelines we wish to study, however, we have opted to forgo SRAM and place the entire instruction in the "trigger" storage.

Using clock-gated registers, the instruction storage accounts for 25% of PE area and 41% of PE power. The high area and power use are at least partially attributable to sized-up transistors, which are both timing critical and relatively high-fan-out. Latches reduce the area by just over 30% and power by 75% thanks to the removal of clock tree capacitance and smaller cells. In the standard cell technology we use, however, we found that latches increased the critical path of the trigger resolver and the rate of failure in our gate-level post-synthesis validation. Thus, we settled on the clock-gated register-based instruction memory for both the baseline and all upcoming pipelined microarchitectures. We use a detailed power and timing model through PrimeTime for sequential components and do not have an outside SRAM model for a scratchpad memory.
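The mixed-storage idea just described — trigger fields readable in parallel every cycle, datapath fields needed only after issue — could be expressed as follows. This is a sketch of the concept only, not the design we synthesized (we ultimately kept the whole instruction in registers), and the width split is illustrative.

```systemverilog
// Sketch of mixed instruction storage: trigger segments live in registers and
// are exposed combinationally to the scheduler; datapath segments live in a
// synchronous (SRAM-like) store indexed after triggering. Widths illustrative;
// configuration write ports omitted for brevity.
module mixed_ins_store #(
    parameter int NINS = 16, TRIG_W = 30, DP_W = 76   // 30 + 76 = 106 bits
) (
    input  logic                    clk,
    input  logic [$clog2(NINS)-1:0] issue_index,        // from the scheduler
    output logic [TRIG_W-1:0]       trig_fields [NINS], // all visible at once
    output logic [DP_W-1:0]         dp_fields           // one, a cycle later
);
    logic [TRIG_W-1:0] trig_mem [NINS];  // registers: parallel read
    logic [DP_W-1:0]   dp_mem   [NINS];  // could map to SRAM: indexed read

    assign trig_fields = trig_mem;       // combinational exposure to scheduler
    always_ff @(posedge clk)
        dp_fields <= dp_mem[issue_index]; // synchronous datapath-field fetch
endmodule
```

The catch, as noted above, is that the indexed read only works once triggering and decode occupy separate pipeline stages.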

Our benchmarks also do not use local scratchpad memory, and so we omitted it from the single-cycle and later pipelined designs, although this feature is also functional in the FPGA prototype.

Figure 3: The baseline, single-cycle PE implementation consumes 64,435 µm² and 1.95 mW. The back end dominates area, while the distribution is evenly split in terms of power.

Area is dominated by the ALU, followed by instruction memory. Power is mostly proportionate to area, excepting the instruction memory, which has relatively higher power, and the ALU, relatively lower power. This is at least partially due to the bst workload not exercising many of the larger functional components of the ALU like multipliers, but also due to the capacitance of the clock tree of the large sequential instruction memory. Our scheduler accounts for 6% area and 5% power, slightly higher than the 2% area consumption noted in Parashar's paper.

If you consider the queues, which account for 18% area and 22% power, to be neutral and break the remaining components down into front end (predicate unit, instruction memory, scheduler) and back end (register file and ALU), the area split is 32% front end v. 46% back end. For power this is reversed and slightly more skewed, at 48% front end and 23% back end. This is a side effect of the simplicity of our datapath and instruction set. If it included more complicated operations and functional units, the relative power consumption of the front end would shrink. And the power would skew even further towards the back end if the alternative instruction stores described above were used and scratchpad memories were included.

5 PIPELINED MICROARCHITECTURE
The benefits to pipelining in terms of instruction-level parallelism, timing closure and increased power efficiency through voltage scaling are well understood and thus desirable in most microarchitectures. It is also well known that pipelines present a wide variety of hazards necessitating control logic that can help mitigate them. Here we analyze the challenges that arise in triggered pipelines (Section 5.1), present two hazard mitigation strategies, predicate prediction (Section 5.2) and queue status accounting (Section 5.3), and evaluate them in the context of seven pipelines operating at a range of supply voltages (Section 5.4).

5.1 Control Challenges
A pipelined, triggered microarchitecture is susceptible to the same data hazards as a conventional pipeline, be they register operand dependences or dependences across stages of a pipelined functional unit. On top of this, triggered control introduces unconventional control hazards. With a PC, a processor must compute the next PC in order to fetch and begin executing the next instruction. The next PC is either the result of a predictable update function (e.g., PC = PC + 4) or a more complicated branch condition and target resolution calculation. By contrast, in a triggered architecture, program control advances via updates to the predicate and queue states.

To trigger the next instruction, one has two options: wait for the preceding instruction to complete its updates to the predicate and queue state, or speculate on the upcoming predicate and queue state and trigger a speculative instruction. The deeper the pipeline, the more critical it becomes to resolve or predict the upcoming states to keep the pipeline full. Interestingly, the dynamic rate of instructions that write to predicates in our workloads is 20%, almost exactly the rate of dynamic branches found in standard single-threaded workloads such as SPEC. In the next two sections, we present two techniques to anticipate the upcoming predicate and queue state to more quickly identify and begin executing the next instruction.

5.2 Predicate Prediction
The first technique is a speculative predicate unit that is a drop-in replacement for the predicate update unit depicted in Figure 2. The speculative version contains a two-bit saturating predictor for each predicate. For each instruction with a predicate destination, the speculative unit offers a predicted predicate value for that bit. When it does so, it also saves the original predicate state in the event of misspeculation and rollback.

Our scheme does not currently allow nested speculation, so predictions are made only if the system is not already speculating, or if the current speculation has been confirmed in the current cycle. During speculation, we restrict the instructions that can be issued. As in traditional processors, any instructions with side effects prior to retirement cannot be issued speculatively, as they could permanently alter architectural state. For this reason, dequeues are also forbidden, since the dequeue of an input queue will take effect early during the execution of the associated instruction. Both of these restrictions lift, however, as soon as successful speculation is confirmed.

In the event of misspeculation, the pipeline is flushed and the predicate state visible outside of the speculative unit is returned to the fallback state. Because we have forbidden speculative dequeue operations, this simple predicate rollback is sufficient to restore the pre-speculation architectural state.

The term "predicate prediction" has been used in the context of out-of-order superscalar processors, describing a technique to improve the performance of cmov-like instructions [10]. However, in that context, predicated instructions are always fetched and issued, whereas the predicates in this context are central to the control mechanism, determining every instruction's execution eligibility.
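A minimal sketch of one such per-predicate two-bit saturating counter is shown below; the interface and names are ours for illustration, not the released unit.

```systemverilog
// Two-bit saturating predictor for one predicate bit (names illustrative).
// States 00/01 predict "low", 10/11 predict "high"; the counter trains on
// the resolved datapath predicate write.
module predicate_predictor (
    input  logic clk,
    input  logic rst_n,
    input  logic train,       // a predicate-writing instruction resolved
    input  logic outcome,     // resolved predicate value
    output logic prediction
);
    logic [1:0] ctr;
    assign prediction = ctr[1];
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            ctr <= 2'b01;     // reset to weakly "low"
        else if (train)
            ctr <= outcome ? (ctr == 2'b11 ? ctr : ctr + 2'b01)   // saturate up
                           : (ctr == 2'b00 ? ctr : ctr - 2'b01);  // saturate down
    end
endmodule
```

A speculative predicate unit would instantiate one such counter per predicate and checkpoint the architectural predicate vector whenever it issues a prediction, restoring that checkpoint on a flush.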
5.3 Effective Queue Status
Queue hazards arise when an in-flight instruction might alter the state of an input or output queue, thus preventing the scheduler from identifying and launching the next instruction. Because queue state has a direct impact on both control flow and data movement, queue hazards straddle the traditional control/data hazard dichotomy. They resemble control hazards in that they affect instruction execution eligibility, and data hazards in that the processor must prevent dequeues from empty inputs or enqueues to full outputs.


Such hazards can occur on any spatial architecture with operand queues, whether they have triggered control, PC-based control, or otherwise. Some designs, such as MIT RAW, have dealt with the queues going to and from the router in a binary full/empty fashion, conservatively treating all channels with pending enqueues as full or pending dequeues as empty [6, 30, 31]. Others, like WaveScalar, have padded the output queues with as many extra slots as the pipeline is deep, thereby guaranteeing queue capacity for in-flight instructions, calling this the "reject buffer" [23, 28].

We argue that such hazards may be dealt with more effectively and efficiently by pipeline register inspection and by augmenting the instruction dispatch with some simple accounting. Accurate input queue status can be determined by exposing the current queue occupancy and calculating whether this value, less the number of in-flight dequeues, exceeds zero. Assuming that dequeues take effect within the first N pipe stages, this requires only a log2(N)-bit adder. In all of the pipelines we examine here, N never exceeds 2, so this adder is small indeed. When queue tags are also scheduler inputs, it is further necessary to peek at the tag not just on the head of the input queue, but deeper in, again according to the number of in-flight dequeues. More generally, the first N tags on the input queue must be exposed, which for our pipelines is just the "head" and "neck".

Accurate output queue status can be computed similarly and more simply, because the tag information of in-flight enqueues is irrelevant for scheduling. We still compute whether the current output queue occupancy, plus the number of in-flight enqueues for that queue, will fill the queue. For a pipeline of depth D, serving N output queues, this will require N log2(D)-bit adders plus a log2(D)-bit comparator. In contrast, padding the output queues would require D × N additional queue entries, each one word wide plus tags.

Strictly speaking, this output accounting is still somewhat conservative, as neighboring processing elements may dequeue from an output queue before the in-flight enqueues land. Even so, it is less conservative than the alternate approaches. Moreover, addressing the remaining conservative margin requires cross-PE coordination – between producer and consumer scheduler – which is anathema in spatial design and would scale terribly, as each scheduler must now coordinate with as many neighboring schedulers as there are output queues.

An efficient interface to determine the state of input and output queues is especially important here because of how spatial architectures encourage producer-consumer relationships between PEs. In the following section, our evaluation finds that the benefit of effective queue status accounting is substantial, even on our small-scale spatial workloads.
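A sketch of this accounting for one input and one output queue follows; the module and signal names are illustrative, and the in-flight counts are assumed to be gathered by inspecting the pipeline registers as described above.

```systemverilog
// Effective queue status for a pipeline where dequeues/enqueues may be in
// flight in up to N = 2 stages, so only small adders are needed.
module effective_queue_status #(
    parameter int DEPTH_BITS = 3    // queue occupancy counter width
) (
    input  logic [DEPTH_BITS-1:0] in_occupancy,    // current input queue count
    input  logic [1:0]            inflight_deqs,   // dequeues in the pipe
    input  logic [DEPTH_BITS-1:0] out_occupancy,   // current output queue count
    input  logic [1:0]            inflight_enqs,   // enqueues in the pipe
    input  logic [DEPTH_BITS-1:0] out_capacity,
    output logic                  in_available,    // safe to consume an operand
    output logic                  out_has_space    // safe to schedule an enqueue
);
    // Input side: occupancy less in-flight dequeues must still exceed zero.
    assign in_available  = (in_occupancy > inflight_deqs);
    // Output side: occupancy plus in-flight enqueues must not fill the queue.
    assign out_has_space = (out_occupancy + inflight_enqs) < out_capacity;
endmodule
```

The same in-flight dequeue count selects how deep the scheduler must peek at input tags — for the pipelines studied here, just the "head" and "neck" entries.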


5.4 Evaluation
We have implemented these optimizations on seven different pipelines, ranging from 2 to 4 stages deep. Given the simplicity of the operations in our ISA, we found little value in deeper pipelines. Moreover, the critical path of these designs, ranging from 50 to 60 fan-out-of-4 inverter delays (FO4), is in line with modern standards [12].

To describe the various pipelines, we have divided the work of a processing element into three conceptual stages:

• trigger (T) selects and fetches an instruction from memory,
• decode (D) retrieves any operands from the register files and input queues and performs the necessary decoding, and
• execute (X), optionally split into X1 and X2, performs any arithmetic or logical operations and writes results to the register file or output queues.

We consider all possible pipelines that result from introducing pipeline registers between these stages: T|D|X, T|DX, etc. Including the single-cycle design, which we now refer to as TDX, this results in eight distinct microarchitectures.

Dequeueing from the inputs in the same cycle as the trigger resolution proved to be a long critical path, so we moved the dequeue operation to decode. In designs where the T and D stages are coalesced (TD*), this makes no difference, but it also opens up the possibility that they be split (T|D*). When they are split, and when the queue status optimization from Section 5.3 is enabled, the scheduler will inspect both the "head" and "neck" of the input queues.

Area and Power Overheads. The speculative predicate unit and effective queue status accounting are likely to incur their largest area and power overheads on deep pipelines. For these calculations we use the deepest pipeline (T|D|X1|X2) synthesized at nominal supply voltage using a standard VT and a conservative target frequency of 500 MHz. While the pipeline can operate at higher frequency, the push for timing would inflate the resulting design. This baseline consumes 63,991.4 µm² and 2.852 mW. Adding a speculative predicate unit increases the area 0.5% to 64,278.4 µm² and the power by 7% to 3.048 mW. Adding the queue status accounting increases the baseline area 0.2% to 64,131.8 µm², with no measurable difference in power consumption. For comparison, padding the output queues instead would have increased total device area 13% to 72,439.4 µm² and total power draw 12% to 3.194 mW. When combined, our two features increase the total PE area 1.4% to 64,895.4 µm² and total power consumption 8% to 3.077 mW.

As for the cost of pipelining itself, iso-frequency and iso-VDD, the power increases linearly with the addition of each pipeline register. Again at nominal voltage in a standard-VT standard cell technology at 500 MHz, we see an addition of 0.301 mW per pipeline register added. Because transistors size up to meet timing in unpipelined (or under-pipelined) designs, any effect from the pipeline registers on area is negligible and lost in the noise.

Timing Overhead. We found that the simple circuitry to calculate effective queue status had no impact on timing closure. On the other hand, predicate speculation did slightly increase trigger stage delay. In a four-stage T|D|X1|X2 design at nominal voltage that closed at 1184 MHz without any speculation, the critical path was the trigger stage, with a delay of 53.6 FO4. Once speculation was enabled, this increased to 64.3 FO4.

This suggests the trigger stage largely sets the pipeline balance for any pipeline breakdown of this ISA, placing the balanced pipeline delay in the 50-60 FO4 range. This number is high relative to the optimal delays identified in the early 2000s, ranging from 6-8 FO4 [17] to 18 FO4 [27]. However, CPUDB indicates a considerable upswing in average FO4 delay since then, with 2010-era delays ranging from 25 to 85 FO4 and an average of approximately 45 FO4 [12]. One consequence of this is that for simple integer functional units, as we shall see in our upcoming discussion of Pareto-optimal designs, moderately deep pipelines may be sufficient in modern CMOS technology. This is a result that our early exploration of timing breakdown and pipeline balance using an FPGA failed to predict.

Figure 4: Datapath predicate write frequency and prediction accuracy by benchmark workload. Note that the worker PE in dot product does not rely on predicates for control flow, just the semantic information encoded in operand tags.

Figure 5: CPI stacks (retired, quashed, and forbidden instructions; predicate hazards; data hazards; cycles with no triggered instruction) of the various pipelined microarchitectures with the predicate prediction (+P) and effective queue status (+Q) pipeline optimizations selectively enabled. The stacks represent the average behavior over the ten workloads.

Predicate Prediction Accuracy. With the exception of the dot product benchmark, all of our test programs have datapath predicate writes that activate the predictor. filter and merge both have data-dependent control flow based on high-entropy input data generated prior to the test with a PRNG. These unpredictable control flow patterns represent our worst-case branch predictor accuracy of around 50%, seen in Figure 4. On the other hand, others such as gcd, stream, and mean all feature long-running and thus predictable loops as their main control structure and represent the best-case scenario of near-perfect accuracy.

Still other benchmarks, such as bst and udiv, have unpredictable branches nested within predictable loop-like branches. In the case of bst, the predictable loop is the while (next != NULL) {} loop, which is by definition always taken until the program exits with a result, and the unpredictable predicate write is from the result of the less-than comparison that determines which child to dereference. Similarly, in udiv the predictable predicate write is an iteration shifting through all the bits of the dividend, while the less predictable branch is whether the bit in question is one or zero.

A unique aspect of this scheme is that when a program assigns certain semantic significance to particular predicates, writing to them only to represent certain binary decisions, this bank of predictors becomes a per-branch predictor without the traditional overhead of indexing a bank of predictors via the instruction pointer. Our benchmarks were hand written with an awareness of this and generally assign a unique predicate for each different datapath predicate write.

Impact on CPI. Turning now to the microarchitectural dynamics of the pipelines, Figure 5 plots the CPI for each of the seven designs with no optimization, with predicate prediction turned on (+P), and with both predicate prediction and queue accounting turned on (+P+Q).

In the base pipelines, we observe that, rather intuitively, the rate of predicate hazards increases as pipeline depth increases. In fact, the impact of predicate hazards is the same for all pipelines of a specific depth. Moreover, their impact is superlinear in the length of the pipe, adding 0.18 CPI at 2-stage pipes, 0.24 CPI at 3 stages, and 0.27 at 4 stages.

When the speculative predicate unit is introduced, the predicate hazards are eliminated almost entirely, with virtually no quashed instructions. Considering the accuracies we found in Figure 4 for the overwhelming majority of the benchmarks averaged in the CPI stacks, this is unsurprising.


However, with speculation we also see an uptick in forbidden instructions, namely those with pre-retirement side effects described in Section 5.2 that are forbidden during speculation. In this situation, the scheduler recognizes them as ready to execute, but they are prevented from issue because the current speculation has not yet been confirmed. As with the predicate hazards, we observe that the rate of forbidden instructions increases with pipeline depth and the correspondingly lengthened speculative windows. This is exacerbated by the fact that many of our benchmarks feature nested loop control and dequeues embedded within loops.

When the queue status accounting is added, we see a drop in cycles with no triggered instruction. This is the expected effect of a mechanism that is intended to make the scheduler trigger instructions less conservatively. Interestingly, however, the number of cycles with no triggered instruction drops to a pipeline-depth-agnostic constant equal to that found in the single-cycle design. Whereas this component of CPI had been increasing with pipeline depth, the queue accounting erases this unpleasant side effect of pipelining entirely.

Energy-Delay Analysis. To achieve optimal results in the face of energy-delay tradeoffs, we explored supply voltage as a first-class design parameter. As opposed to post-synthesis exploration looking at a design's behavior under a DVFS scheme, here we can take advantage of having a specific target frequency and voltage in mind when pushing our design through the VLSI flow, which results in optimally sized cells for the intended use case.

We scaled voltage and frequency considerably, and further considered the tradeoff between leakage power dissipation and gate switching delay that can be tuned by selecting an appropriate threshold voltage for the underlying circuit technology. Many of the optimal points we see on the tradeoff curve in Figure 6 can only be realized through this form of exploration conducted at this scale. As expected, the upper end of the performance spectrum is dominated by low-VT standard-cell designs, the middle by standard VT, and the low-power and ultra-low-power domains by high-VT designs. Despite all of this, the energy-delay span of a single architectural design point was remarkably broad, stretching 71x in energy and 225x in delay.

It is over this range that we see the benefits of our proposed pipeline optimizations. As indicated in Figure 7, any negative effects on timing closure brought about by using the speculative approach are largely erased by improvements to CPI. Near the balanced region closest to the origin of the Pareto curve, the addition of both the effective queue status accounting and predicate speculation improves the frontier by 20-25% in both energy and delay.

The benefit of combining speculation with queue accounting is not uniform, however. Though it has a particularly strong benefit to pipelined designs and dominates virtually all other pipelines in the balanced to moderately low-power regions of the frontier, towards the high-performance extreme of the Pareto set it appears that determining effective queue status alone is optimal. It may be that in those conditions the critical path overhead incurred by speculating in the trigger stage is not worth the cost of losing on timing closure in high-performance designs. That said, the two microarchitectural knobs offer clear benefits – together in ultra-low-power and moderate cases, and in queue status alone in high performance.

Pareto-Optimal Designs. Figure 8 details the designs on the Pareto frontier. Through many of the low-power regimes into the energy-delay-balanced region of the frontier, the single-cycle TDX baseline remains surprisingly competitive, due at least partially to its superior CPI. The curve traced out by T|DX designs with prediction and queue status accounting enabled, however, narrowly dominates most of them.

Another surprising observation is that the exact same microarchitecture provides the best global performance and best absolute energy efficiency. At the high-performance extreme, a two-stage pipeline with a split-stage ALU, TDX1|X2, synthesized in a low-VT standard cell library using the effective channel status optimization and running at 1157 MHz, achieves the highest throughput with an instruction latency of 1.37 ns, but does so at a cost of 21.42 pJ/instruction. At the low-power extreme, this exact same microarchitecture synthesized in a high-VT standard cell library also proves globally optimal, achieving 0.89 pJ/instruction. This correlation between high-performance and low-power microarchitectures is not without precedent, though, as it has been reported in other literature that high-performance designs can counter-intuitively share characteristics with ultra-low-power microarchitectures [2].

Only a single three-stage pipeline appears on the frontier, here as the second most performant design. It in fact has nearly identical throughput compared to the highest-performing design, with an instruction latency of 1.43 ns, but consumes nearly half the energy per instruction of the highest-performance design, at 11.91 pJ/instruction. This T|D|X1|X2 pipeline also uses both of the proposed optimizations, with prediction and queue status accounting enabled, in order to achieve this result.

Power Density. Power density is of central importance in the design of massively parallel architectures. The density of power dissipation, and thus the thermal design power (TDP), of GPUs is one of the primary factors that keeps their clock frequencies often twice as slow as CPUs in comparable technologies. When selecting a processing element design, power density will likely be one factor that limits the scalability of the spatial architecture.

Our Pareto-optimal designs show remarkably little variance in area, but as discussed above show considerable variance in power characteristics. With this in mind, it is possible to see where these designs fit in the context of conventional mainstream architectures like CPUs and GPUs in terms of their power density.

In the 65 nm era, CPUDB shows a mean power density of approximately 500 mW/mm² for CPUs, with a maximum of 1000 mW/mm² and a minimum of 50 mW/mm² [12]. At the same node, GPUs had a maximum power density of approximately 300 mW/mm² [7].


5 100 None 1.0V 0.9V +P 4 0.8V +Q 0.7V 0.6V +P+Q 10 0.4V 3

pJ/Instruction 2 pJ/Instruction 1 1

0 0.1 0 2 4 6 8 10 1 10 100 1000 ns/Instruction ns/Instruction Figure 7: The benefit of adding predicate prediction Figure 6: Frontiers for each supply voltage considered (+P) and queue status accounting (+Q) in balanced within the design space. designs at the Pareto frontier of the main energy delay tradeo↵. the same node, GPUs had a maximum power density of 10000 2 Vdd MHz approximately 300 mW/mm [7]. ns / instruction pJ / instruction All of the PEs on the Pareto frontier fall below these mW mm2 mW/mm2 ED CPU and GPU densities with the highest power density 1000 design consuming 167.6mW/mm2,althoughthatislikelyto increase substantially post-extraction, despite our uniform wire load model. It would also likely increase had the test workload post synthesis been one that used more power- 100 hungry arithmetic like multiplication instead of primarily comparisons and additions. 10 6 LIMITATIONS AND EXTENSIONS Since this is a study of pipelined microarchitectures, it would have been interesting to add functional units like floating 1 point that demand deeper pipelines, as much of our timing analysis indicates that for simple integer functional units like the one examined here, moderate pipelines in the range of two 0.1 to three stages are optimal. An analysis of such a system, we believe, may have also exposed limitations in our speculation scheme. The fact that we do not support nested speculation – which already showed evidence of hurting CPI in Figure 5 0.01

6 LIMITATIONS AND EXTENSIONS
Since this is a study of pipelined microarchitectures, it would have been interesting to add functional units, such as floating point, that demand deeper pipelines; much of our timing analysis indicates that for simple integer functional units like the one examined here, moderate pipelines of two to three stages are optimal. An analysis of such a system, we believe, may also have exposed limitations in our speculation scheme. The fact that we do not support nested speculation – which already showed evidence of hurting CPI in Figure 5 – would likely have hurt even more in deeper pipelines, with more instructions forbidden from entering the pipeline due to the nesting restriction. Our initial exploration suggests that it would not be terribly expensive to support nested speculation, and we would like to examine the effect of this addition on decreasing the number of forbidden instructions in deep pipelines.
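To make the nesting restriction concrete, the following is a minimal issue-gate model of single-level predicate speculation, a sketch of the restriction as described above rather than the evaluated RTL: while one predicate-speculative instruction is unresolved, any further instruction that would also need to speculate is forbidden from issuing.

    # Minimal model of the single-level speculation restriction (hypothetical,
    # for illustration; not the RTL evaluated in this study).
    class IssueGate:
        def __init__(self):
            self.speculative_in_flight = 0  # unresolved predicate-speculative ops

        def may_issue(self, needs_speculation: bool) -> bool:
            # Without nested speculation, a second speculative instruction
            # is forbidden until the first resolves.
            return not (needs_speculation and self.speculative_in_flight > 0)

        def issue(self, needs_speculation: bool) -> None:
            assert self.may_issue(needs_speculation)
            if needs_speculation:
                self.speculative_in_flight += 1

        def resolve(self) -> None:
            # Called at predicate writeback, whether the guess was right or not.
            self.speculative_in_flight -= 1

    gate = IssueGate()
    gate.issue(needs_speculation=True)
    assert not gate.may_issue(True)   # nesting forbidden: must stall
    gate.resolve()
    assert gate.may_issue(True)       # pipeline may speculate again

Supporting nested speculation would then amount to replacing this effective depth-one limit with a small configurable bound, plus logic to squash the correct subset of in-flight results on a misprediction.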

This study focused squarely on intra-PE microarchitecture and therefore did not consider the effect of memory access in great detail. Since the FPGA testbed and the VLSI testbench, with fixed memory access latency via an on-chip local memory, effectively emulate perfect caching, the interplay between a spatial architecture that can benefit from memory-level parallelism and a memory hierarchy was not considered. It remains to be seen whether performing loads and stores over the mesh interconnect, as in prior art [1], is optimal. We plan a future version of the ISA and system, not considered here, that will enable main memory access through per-PE load-store queues using the decoupled load access paradigm [18], as opposed to generating interconnect traffic. The effects of this addition on the PE pipelines are as yet unknown.

As mentioned in Section 4, we did not use an SRAM library in our dynamic power analysis, and therefore omitted the intended feature of per-PE scratchpad memories. The remaining control and datapath structures were all small and would have been placed in registers even had we used SRAMs elsewhere. An extended version of this work would include an analysis of using synchronous indexed storage for immediates or other datapath portions of the instructions. While it is unclear what the performance impact would be, our hypothesis is that area and power would increase due to the new SRAM scratchpad, but that power density would decrease overall due to its relatively lower activity level.

Lastly, our study of area and power shows that instruction memory is significant in the overall design budget. This can be addressed not only through the storage used for the instruction memory itself but also through better encoding techniques. We believe it should be possible to arrive at more efficient trigger and instruction encodings at the architectural level, alleviating some of the area and power pressure imposed by the architecture's unique parallel instruction memory access paradigm.

7 RELATED WORK
General purpose programmable spatial architectures have been an area of active research for decades. Only a subset of these, however, have processing elements with sufficient execution control to fall into the category of locally autonomous spatial architectures. The MIT RAW architecture takes an in-order MIPS pipeline and extends it with input and output register queues and a set of programmable routers in an on-chip network for spatial coordination between autonomous processing elements; the microarchitectural implications of a MIPS pipeline in a spatial context are explored in [30, 31]. RAW, however, uses a single Boolean signal for backpressure propagation on output ports and does not inspect its pipeline to determine effective queue occupancy. As a PC-based system with operand queues, it is one architecture where our queue management approach would also apply.
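For illustration, effective-occupancy accounting of the kind just described can be pictured as crediting a destination queue for enqueues that have issued but not yet written back. The sketch below is a hypothetical behavioral model of that bookkeeping, not the RTL evaluated in this work:

    # Behavioral model: effective occupancy counts both entries physically in
    # the queue and enqueues still in flight in the pipeline (hypothetical).
    class OutputQueueAccounting:
        def __init__(self, capacity: int):
            self.capacity = capacity
            self.occupancy = 0   # entries physically in the queue
            self.in_flight = 0   # issued enqueues not yet written back

        def effective_occupancy(self) -> int:
            return self.occupancy + self.in_flight

        def may_issue_enqueue(self) -> bool:
            # Treat in-flight enqueues as already occupying slots so the
            # queue can never overflow.
            return self.effective_occupancy() < self.capacity

        def issue_enqueue(self) -> None:
            assert self.may_issue_enqueue()
            self.in_flight += 1

        def writeback_enqueue(self) -> None:
            self.in_flight -= 1
            self.occupancy += 1

        def dequeue(self) -> None:
            assert self.occupancy > 0
            self.occupancy -= 1

A padding approach, by contrast, would simply expose the capacity minus the pipeline depth as the usable queue size, requiring no pipeline inspection at the cost of permanently sacrificed entries.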
The UCD AsAP DSP system [4] and the more recent KiloCore chip [5] are both examples of large-scale PC-based spatial architectures. Due to the deep SRAM I/O queues in AsAP, it may be using a padding approach to deal with the in-flight enqueue problem in its PE pipelines. The GreenArrays GA144 chip [14, 15, 21] features a set of small PC-based MISC processing elements between which data is moved only from one immediate neighbor to another. It is notable among these because, while the AsAP chip is a globally asynchronous, locally synchronous device, GreenArrays is an entirely asynchronous system. In addition to communicating data via these spatial ports, the GA144 can also move instructions. While existing documentation discusses the uses of these input and output ports, the mechanisms by which backpressure is exerted and operand availability is established at the microarchitecture level are not detailed.

The WaveScalar and TRIPS architectures both feature processing elements with limited degrees of local autonomy and predication. TRIPS instruction memories are filled with windows of operand-availability-triggered instructions that are in turn fetched via a global program counter [13, 25]. WaveScalar uses dataflow, rather than a PC, for control. While WaveScalar explicitly describes its five-stage pipeline in detail, as well as its queue management system as discussed in Section 5.3 [23], the TRIPS pipeline is not laid out in detail [25]. Although both WaveScalar and TRIPS use predication within their processing elements, neither considered the possibility of predicting predicates as a pipeline optimization. A derivative of the TRIPS project, however, did use predicates for a different purpose: as a form of branch history when predicting the next hyperblock [24].

8 CONCLUSION
We have examined several microarchitectural concerns unique to pipelines within the processing elements of a triggered-instruction-based spatial architecture. Our two proposed hazard mitigation techniques prove successful across a variety of workloads in controlling CPI and show marked improvements to the energy-delay curve of a processing element. This enables higher operating frequencies at iso-VDD, lower supply voltages at iso-frequency, or deeply pipelined operations such as floating point or custom CISC-like ISA extensions.

We demonstrate the vast design space of combined circuit and microarchitectural techniques, examining the subtle interplay between microarchitecture, circuit optimization, and application domain in choosing optimal PE designs. We find that the optimal designs for our integer system are two-stage pipelines with both of our pipeline optimization features enabled. Taken together, this work demonstrates the flexibility and criticality of intra-PE microarchitecture for spatial architectures.

9 ACKNOWLEDGEMENTS
This work is supported by C-FAR, one of the six SRC STARnet Centers sponsored by MARCO and DARPA, by a grant from the National Science Foundation under CCF-1065338, and through funding from the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES) under grant 13289-13-6.


REFERENCES
[1] Bushra Ahsan, Michael C. Adler, Neal C. Crago, Joel S. Emer, Aamer Jaleel, Angshuman Parashar, and Michael I. Pellauer. 2013. Distributed Memory Operations. United States Patent App. 14/037,468 (Sept. 26, 2013).
[2] Massimo Alioto. 2012. Ultra-Low Power VLSI Circuit Design Demystified and Explained: A Tutorial. IEEE Transactions on Circuits and Systems I: Regular Papers 59, 1 (2012), 3–29.
[3] Omid Azizi, Aqeel Mahesri, Benjamin C. Lee, Sanjay J. Patel, and Mark Horowitz. 2010. Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis. In ACM SIGARCH Computer Architecture News, Vol. 38. ACM, 26–36.
[4] Bevan Baas, Zhiyi Yu, Michael Meeuwsen, Omar Sattari, Ryan Apperson, Eric Work, Jeremy Webb, Michael Lai, Tinoosh Mohsenin, Dean Truong, et al. 2007. AsAP: A Fine-Grained Many-Core Platform for DSP Applications. IEEE Micro 27, 2 (2007).
[5] Brent Bohnenstiehl, Aaron Stillmaker, Jon J. Pimentel, Timothy Andreas, Bin Liu, Anh T. Tran, Emmanuel Adeagbo, and Bevan M. Baas. 2017. KiloCore: A 32-nm 1000-Processor Computational Array. IEEE Journal of Solid-State Circuits (2017).
[6] Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, Mike Dahlin, Lizy K. John, Calvin Lin, Charles R. Moore, James Burrill, Robert G. McDonald, and William Yoder. 2004. Scaling to the End of Silicon with EDGE Architectures. Computer 37, 7 (2004), 44–55.
[7] John Y. Chen. 2009. GPU Technology Trends and Future Requirements. In Electron Devices Meeting (IEDM), 2009 IEEE International. IEEE, 1–6.
[8] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 367–379.
[9] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[10] Weihaw Chuang and Brad Calder. 2003. Predicate Prediction for Efficient Out-of-Order Execution. In Proceedings of the 17th Annual International Conference on Supercomputing. ACM, 183–192.
[11] Louise H. Crockett, Ross A. Elliot, Martin A. Enderwitz, and Robert W. Stewart. 2014. The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC. Strathclyde Academic Media.
[12] Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, and Mark Horowitz. 2012. CPU DB: Recording Microprocessor History. Commun. ACM 55, 4 (2012), 55–63.
[13] Paul Gratz, Changkyu Kim, Karthikeyan Sankaralingam, Heather Hanson, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. 2007. On-chip Interconnection Networks of the TRIPS Chip. IEEE Micro 27, 5 (2007), 41–50.
[14] GreenArrays. 2010. GreenArrays F18A 18-bit Computer. (2010).
[15] GreenArrays. 2010. Product Brief: GreenArrays. (2010).
[16] Seongmoo Heo and Krste Asanovic. 2004. Power-Optimal Pipelining in Deep Submicron Technology. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design. ACM, 218–223.
[17] M. S. Hrishikesh, Doug Burger, Norman P. Jouppi, Stephen W. Keckler, Keith I. Farkas, and Premkishore Shivakumar. 2002. The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays. In ACM SIGARCH Computer Architecture News, Vol. 30. IEEE Computer Society, 14–24.
[18] Ziqiang Huang, Andrew D. Hilton, and Benjamin C. Lee. 2016. Decoupling Loads for Nano-Instruction Set Computers. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 406–417.
[19] Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer. 2013. Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures. In ACM SIGARCH Computer Architecture News, Vol. 41. ACM, 142–153.
[20] Michael Pellauer, Angshuman Parashar, Michael Adler, Bushra Ahsan, Randy Allmon, Neal Crago, Kermin Fleming, Mohit Gambhir, Aamer Jaleel, Tushar Krishna, Daniel Lustig, Stephen Maresh, Vladimir Pavlov, Rachid Rayess, Antonia Zhai, and Joel Emer. 2015. Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures. ACM Transactions on Computer Systems (TOCS) 33, 3 (2015), 10.
[21] Phitchaya Mangpo Phothilimthana, Tikhon Jelvis, Rohin Shah, Nishant Totla, Sarah Chasins, and Rastislav Bodik. 2014. Chlorophyll: Synthesis-Aided Compiler for Low-Power Spatial Architectures. In ACM SIGPLAN Notices, Vol. 49. ACM, 396–407.
[22] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. IEEE, 13–24.
[23] Andrew Putnam, Steve Swanson, Martha Mercaldi, Ken Michelson, Andrew Petersen, Andrew Schwerin, Mark Oskin, and Susan Eggers. 2005. The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-Based Study. Tech. Rep. TR-2005-11-02 (2005).
[24] Behnam Robatmili, Dong Li, Hadi Esmaeilzadeh, Sibi Govindan, Aaron Smith, Andrew Putnam, Doug Burger, and Stephen W. Keckler. 2013. How to Implement Effective Prediction and Forwarding for Fusable Dynamic Multicore Architectures. In High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on. IEEE, 460–471.
[25] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, and Charles R. Moore. 2003. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture. In ACM SIGARCH Computer Architecture News, Vol. 31. ACM, 422–433.
[26] Reetinder Sidhu and Viktor K. Prasanna. 2001. Fast Regular Expression Matching using FPGAs. In Field-Programmable Custom Computing Machines, 2001. FCCM'01. The 9th Annual IEEE Symposium on. IEEE, 227–238.
[27] Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor Zyuban, Philip N. Strenski, and Philip G. Emma. 2002. Optimizing Pipelines for Power and Performance. In Microarchitecture, 2002. (MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium on. IEEE, 333–344.
[28] Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson, Mark Oskin, and Susan J. Eggers. 2007. The WaveScalar Architecture. ACM Transactions on Computer Systems (TOCS) 25, 2 (2007), 4.
[29] David Tarjan, Shyamkumar Thoziyoor, and Norman P. Jouppi. 2006. CACTI 4.0. Technical Report HPL-2006-86, HP Laboratories Palo Alto.
[30] Michael Bedford Taylor. 1999. Design Decisions in the Implementation of a RAW Architecture Workstation. Ph.D. Dissertation. Massachusetts Institute of Technology.
[31] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. 2002. The RAW Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro 22, 2 (2002), 25–35.
[32] Jesmin Jahan Tithi, Neal C. Crago, and Joel S. Emer. 2014. Exploiting Spatial Architectures for Edit Distance Algorithms. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on. IEEE, 23–34.
[33] Louis Woods, Gustavo Alonso, and Jens Teubner. 2013. Parallel Computation of Skyline Queries. In Field-Programmable Custom Computing Machines (FCCM), 2013 IEEE 21st Annual International Symposium on. IEEE, 1–8.
[34] Louis Woods, Gustavo Alonso, and Jens Teubner. 2015. Parallelizing Data Processing on FPGAs with Shifter Lists. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 8, 2 (2015), 7.
[35] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. ACM SIGPLAN Notices 49, 4 (2014), 255–268.
