Application-Specific Customization of Soft Processor Microarchitecture
Total Page:16
File Type:pdf, Size:1020Kb
Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose The Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto {yiannac,steffan,jayar}@eecg.utoronto.ca ABSTRACT processors on their FPGA die, there has been significant up- A key advantage of soft processors (processors built on an take [15,16] of soft processors [3,4] which are constructed in FPGA programmable fabric) over hard processors is that the programmable fabric itself. While soft processors can- they can be customized to suit an application program’s not match the performance, area, or power consumption of specific software. This notion has been exploited in the past a hard processor, their key advantage is the flexibility they principally through the use of application-specific instruc- present in allowing different numbers of processors and spe- tions. While commercial soft processors are now widely de- cialization to the application through special instructions. ployed, they are available in only a few microarchitectural Specialization can also occur by selecting a different micro- variations. In this work we explore the advantage of tuning architecture for a specific application, and that is the focus the processor’s microarchitecture to specific software appli- of this paper. We suggest that this kind of specialization cations, and show that there are significant advantages in can be applied in two different contexts: doing so. 1. When an embedded processor is principally running Using an infrastructure for automatically generating soft a single application for which both performance and processors that span the area/speed design space (while re- area are important factors—if a particular microarchi- maining competitive with Altera’s Nios II variations), we tecture can be shown to be advantageous over what explore the impact of tuning several aspects of microarchi- would otherwise be a single, general-purpose microar- tecture including: (i) hardware vs software multiplication chitecture, there is a win. Moreover, this win is “free” support; (ii) shifter implementation; and (iii) pipeline depth, in a reconfigurable FPGA environment except for the organization, and forwarding. We find that the processor de- exploration time spent finding these customizations at sign that is fastest overall (on average across our embedded compile time. benchmark applications) is often also the fastest design for an individual application. However, in terms of area effi- 2. When an embedded processor executes many applica- ciency (i.e., performance-per-area), we demonstrate that a tions on a reconfigurable FPGA, there may be value tuned microarchitecture can offer up to 30% improvement in reducing the area consumed by the processor, either for three of the benchmarks and on average 11.4% improve- to make room for something else that is dynamically ment over the fastest-on-average design. We also show that required or to reduce its power consumption, or pos- our benchmark applications use only 50% of the available sibly to increase the performance of the processor on instructions on average, and that a processor customized to a per-program basis. Of course, these circumstances support only that subset of the ISA for a specific application have to be such that the reconfiguration itself does not can on average offer 25% savings in both area and energy. Fi- obviate the gain in benefit. nally, when both techniques for customization are combined In this paper we show that the impact of varying several we obtain an average improvement in performance-per-area microarchitectural axes varies widely across different embed- of 25%. ded applications. We then demonstrate that a soft processor microarchitecture that has been tuned to a specific applica- 1. INTRODUCTION tion can be significantly more area-efficient than the best-on- Instruction set processors are an integral part of all mod- average processor, and that typical embedded applications ern digital systems, including systems now fully implemented use less than half of the available Instruction Set Architec- with FPGAs. While some vendors have included fixed, hard ture (ISA), despite the use of an unmodified version of gcc as a compiler. By removing the hardware associated only with these unused instructions (what we call ISA subsetting) we can obtain significant area and power savings. Finally, we show that microarchitectural tuning and ISA subsetting Permission to make digital or hard copies of all or part of this work for are complementary. personal or classroom use is granted without fee provided that copies are This paper is organized as follows: in the following sub- not made or distributed for profit or commercial advantage and that copies section we review related work. Section 2 briefly describes bear this notice and the full citation on the first page. To copy otherwise, to our system for automatically generating soft processors with republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. different micro-architecture. Section 3 describes our method Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00. of measuring the area, speed and power consumption of the generated processors. Section 4 presents the impact of spe- Architecture cialization for these metrics, and Section 5 concludes. Description 1.1 Related Work SPREE Commercial customizable processors for inclusion in ASICs are available from Tensilica [7] and ARC [1] which allow de- signers to tune the processor with additional hardware in- Synthesizable structions to better match their application requirements. RTL Altera Nios [4] and Xilinx Microblaze [3] market soft pro- cessors that also allow customized instructions or hardware, Benchmarks and are typically available in only a few microarchitectural variants. RTL Simulator RTL CAD Flow The CUSTARD [9] customizable threaded soft processor is an FPGA implementation of a parameterizable core sup- 1. Correctness 3. Area 2. Cycle count 4. Clock frequency porting the following options: different number of hardware 5. Power threads and types, custom instructions, branch delay slot, load delay slot, forwarding, and register file size. While the Figure 1: Overview of the SPREE system. available architectural axes seem interesting the results show large overheads in the processor design: clock speed varied only between 20 and 30 MHz on the 0.15 micron XC2V2000, target ISA and datapath, and generates an RTL description and the single-threaded base processor consumed 1800 slices of a working soft processor (SPREE is described in more de- while the commercial Microblaze typically consumes less tail later in Section 2). SPREE was initially used to explore than 1000 slices on the same device. the microarchitecture of soft processors [19], and allowed The PEAS-III project [11] focuses on ISA design and hard- us to study the impact on area, performance, and energy ware software co-design, and proposes a system which gen- of a given microarchitectural feature or trade-off across a erates a synthesizable RTL description of a processor from a set of benchmark applications. For example, we are able clock-based micro-operation description of each instruction. to vary the shifter implementation, hardware versus soft- Although PEAS-III enables a broad range of exploration, ware support for multiplication, and the pipeline depth and it requires changes to the description of many instructions forwarding capabilities. In this work we explore microarchi- to produce a small structural change to the architecture. tectural “wins” on a per-application basis by customizing PEAS-III was used [11] to conduct a synthesis-driven explo- the processor to the given application. ration which explored changing the multiply/divide unit to Figure 1 shows an overview of the SPREE system. Taking sequential (34-cycles), and then adding a MAC (multiply- a high-level description of an architecture as input, SPREE accumulate) instruction. The results compared the area and automatically generates synthesizable RTL (in Verilog). We clock frequency reported by the synthesis tool. then simulate the RTL description on benchmark applica- Mishra and Kejariwal augmented the EXPRESSION Ar- tions to both ensure correctness of the processor, and to chitecture Description Language (ADL) to include RTL gen- measure the total number of cycles required to execute each eration enabling synthesis-driven architectural exploration [14]. application. The RTL is also processed by CAD tools which The generated processors were shown to have significant accurately measure the area, clock frequency, and power overhead versus an already poorly designed academic pro- of the generated soft processor. The following discussion cessor meant only for simulation not synthesis. A small describes how SPREE generates a soft processor in more exploration was then conducted [14] for an FFT benchmark detail—complete descriptions of SPREE are available in pre- where the number of functional units were increased from vious publications [17–19]. 1 to 4, the number of stages in the multiplier unit were in- creased from 1 to 4, and sin/cos instructions were added to 2.1 Input: The Architecture Description the instruction set. These customizations seemed promising The input to SPREE is a description of the desired pro- but the analysis did not consider the entire processor (mea- cessor, composed of textual descriptions of both the tar- surements were only made for the execution stage). More- get ISA and the processor datapath. Each instruction in over, this exploration was performed for traditional hard the ISA is described as a directed graph of generic oper- processors, with no reference to FPGA-based processors. ations (GENOPs), such as ADD, XOR, PCWRITE, LOADBYTE, and REGREAD. The graph indicates the flow of data from one 2. GENERATING SOFT PROCESSORS WITH GENOP to another required by that instruction. SPREE provides a library of basic components (e.g., a register file, SPREE adder, sign-extender, instruction fetch unit, forwarding line, Our long-term research agenda is to be able to automati- and more).