Analysis and Optimization of a Deeply Pipelined FPGA Soft Processor

Hui Yan Cheah, Suhaib A. Fahmy, Nachiket Kapre School of Computer Engineering Nanyang Technological University, Singapore Email: [email protected]

Abstract—FPGA soft processors have been shown to achieve Octavo [7] builds around an RAM Block to develop high frequency when designed around the specific capabilities a soft processor that can run at close to the maximum RAM of heterogenous resources on modern FPGAs. However, such Block frequency. It is a multi-threaded 10-cycle processor that performance comes at a cost of deep pipelines, which can result can run at 550 MHz on a IV, representing the maximum in a larger number of idle cycles when executing programs frequency supported by Block RAMs in that architecture. with long dependency chains in the instruction sequence. We iDEA [8], [9] makes use of the dynamic programmability perform a full design-space exploration of a DSP block based soft processor to examine the effect of pipeline depth on frequency, the DSP48E1 primitive to build a lean soft processor area, and program runtime, noting the significant number of which achieves a frequency close to the fabric limits. In the NOPs required to resolve dependencies. We then explore the case of iDEA, the long pipeline was shown to result in a potential of a restricted data forwarding approach in improving large number of NOPs being required between dependent runtime by significantly reducing NOP padding. The result is a instructions. Octavo was designed as a multi-issue processor to processor that runs close to the fabric limit of 500MHz with a overcome this issue, but as a result, is only suitable for highly case for simple data forwarding. parallel code. In this paper, we conduct a detailed design space explo- I. INTRODUCTION AND RELATED WORK ration to determine the optimal pipeline depth for the iDEA Processors are widely used within FPGA systems, from DSP Block based soft-processor [8]. We also make a case for management of execution and interfacing, to implementation a restricted forwarding scheme to overcome the dependency of iterative algorithms outside of the performance-critical overhead due to the long pipeline. datapath. When a soft processor is used in a sequential part of a computation [1], a low frequency would fail to deliver II. DEEP PIPELINING IN SOFT-PROCESSORS the high performance required to avoid the limits stated by Amdahl’s law. Maximizing the performance of such soft iDEA is based on a classic 32-bit 5-stage load-store RISC processors requires us to consider the architecture of the FPGA architecture with instruction fetch, decode, execute, and mem- in the design, and to leverage unique architectural capabilities ory stages followed by write-back to the register file. We tweak wherever possible. Soft processors can be designed to be the pipeline by placing the memory stage in parallel with independent of architecture, and hence, portable, or to leverage the execute stage to lower latency, effectively making this a architectural features of the underlying FPGA. 4-stage processor. Each stage of our processor can support a configurable number of pipeline registers. The minimum Commercial soft processors include the Xilinx MicroB- pipeline depth for each stage is one. As we increase the laze [2], Altera Nios II [3], ARM Cortex-M1 [4], and Lat- number of pipeline stages, clock frequency increases, until it ticeMico32 [5], in addition to the open-source LEON3. All of plateaus, as later shown in Fig. 3. We explore all possible these processors have been designed to be flexible, extensible, combinations of depths for the different stages, and pick, for and general, but suffer from not being fundamentally built each overall processor depth, the configuration that gives the around the FPGA architecture. The more generalised a core highest frequency. To achieve maximum frequency using a is, the less closely it fits the low-level target architecture, and primitive like the DSP block, it must have its multiple pipeline hence, the less efficient its implementation in terms of area and stages enabled. iDEA uses the DSP block as its execution unit speed. Consider the LEON 3 soft processor: implemented on a and a Block RAM as the instruction and data memory, and Virtex 6 FPGA with a fabric that can support operation at over as a result, we expect a long pipeline to be required to reach 400MHz, it barely achieves a clock frequency of 100MHz [6]. fabric frequency limits. By taking a fine-grained approach to pipelining the remaining logic, we can ensure that we balance Recent work on architecture-focused soft processors has delays to achieve high frequency. Since the pipeline stages in resulted in a number of more promising alternatives. These the DSP block are fixed, arranging registers in different parts of processors are designed considering the core capabilities of the pipeline can have a more pronounced impact on frequency. the FPGA architecture and benefit from the performance and efficiency advantages of the hard macro blocks present in While deep pipelining of the processor results in higher modern devices. These processors typically have long pipelines frequency, it also increases the dependency window for data to achieve high frequency. Such long pipelines suffer from the hazards, hence requiring more NOPs for dependent instruc- need to pad dependent instructions to overcome data hazards tions, as shown in Fig. 1 for a set of benchmarks. Fig. 2 as a result of the long pipeline latency. shows pipeline depths of 7, 8 and 9 cycles, respectively, with

978-1-4799-6245-7/14/$31.00 ©2014 IEEE crc fir qsort 500 fib median 450 70000 400 60000 350 50000 40000 300 Frequency 30000 250

NOP counts 20000 200 10000 150 4 6 8 10 12 14 0 4 6 8 10 12 14 Pipeline Depth Pipeline Depth Fig. 3: Frequency of different pipeline combinations. 1000 Fig. 1: NOP counts with increasing pipeline depth. LUTs 900 Registers 7-Stage 800 700 IF IF ID EX EX EX WB 600 4 nops IF IF ID EX EX EX WB 500 400

8-Stage Area Unit 300 IF IF ID EX EX EX EX WB 200 5 nops 150 200 250 300 350 400 450 500 IF IF ID EX EX EX EX WB Frequency 9-Stage Fig. 4: Resource utilization of all pipeline combinations.

IF IF IF ID ID EX EX EX WB

5 nops support pipeline depths from 4–15. The pipeline depth is made IF IF IF ID ID EX EX EX WB variable through a parameterizable shift register at the output Fig. 2: Dependencies for pipeline depths of 7, 8 and 9 stages. of each processor stage. During automated implementation runs in ISE, the shift register count is adjusted to increase the pipeline depth. We enable retiming and register balancing to exploit the extra registers in the datapath. Each pipeline fetch, decode, execute and write back stages in each instruction configuration is synthesised, mapped, and placed and routed pipeline. to obtain final area and speed results. Backend CAD imple- mentation options are consistent throughout all experimental To prevent a data hazard, an instruction dependent on runs. We are able to test our processor on the FPGA using the the result of a previous instruction must wait until the com- PCI-based driver in [10]. puted data is written back to the register file before fetching operands. The second instruction can be fetched, but cannot We benchmark the instruction count performance of our move to the decode stage (in which operands are fetched), processor using embedded C benchmarks. These benchmarks until the instruction on which it is dependent has written back are compiled using the LLVM-MIPS compiler, and emitted its results. In the case of a 7-stage pipeline with the pipeline assembly is then analysed for dependencies. A sliding window configuration shown, 4 NOPs are required between dependent is used to ensure all dependencies are suitably addressed by instructions. Since there are many ways we can distribute pro- inserting sufficiency NOPs as required for the specific pipeline cessor pipeline cycles between the different stages, an increase configuration being tested. in processor pipeline depth does not always mean more NOPs are needed. Consider the 8 and 9-stage configurations in Fig. 2. A. Area and Frequency Analysis Since the extra stage in the 9 cycle configuration is an IF stage, that can be overlapped with a dependent instruction, Since the broad goal of our design is to maximize soft no additional NOPs are required than for the given 8 cycle processor frequency while keeping the processor small, we configuration. This explains why the lines in Fig. 1 do not perform a design space exploration to help pick the optimal increase uniformly. However, due to the longer dependency combination of pipeline depths for the different stages. We window, a longer pipeline depth with the same number of vary the number of pipeline cycles from 1–5 for each stage: NOPs between consecutive dependent instructions may still fetch, decode, and execute, and the resulting overall pipeline have a slightly higher total instruction count. depth is 4–15 cycles, with writeback fixed at 1 cycle. Fig. 3 shows the frequency achieved for varying pipeline III. EXPERIMENTS depths between 4–15. Each overall depth can be built using We implement the soft processor on a Xilinx Virtex-6 varying combinations of stage depths, as we can distribute XC6VLX240T-2 FPGA (ML605 platform) using the Xilinx these registers in different parts of the 4-stage soft processor. ISE 14.5 tools. We generate various processor combinations to The line shows the trend for the maximum frequency achieved

ubro Osbtensqeta eedn ntutosto instructions hazards. dependent data required sequential avoid the between show value we NOPs lower I, of Table a In number hence (IPC). and cycle offset per code, the instructions be executed of to in that padding pipelines NOPs expect long increased increased very by we of require benefits Hence, frequency also marginal instructions. 1, depth Fig. dependent in showed pipeline between we As in equation. performance increases the of part one Analysis Execution B. 500MHz. around approach starts limits we and as fabric expected 12, raw is frequency at This the that that. peaks beyond see stages, slightly decline 11 we to to 3, the enabled up in Fig. be considerably is From to increases it stages performance. where pipeline increase threshold further depth to requires configurations a and represent at illustrated path These is critical MHz, block reach plot. DSP 310 the (58%) the and in where designs regions MHz the denser 250 by of between majority of frequencies higher A the in frequency designs. pronounced LUT frequency more by than becomes higher this combinations and generally consumption, is implemented consumption Register all achieved. for sumption of I. Table combination in optimal presented is The depth depth. each for pipeline stages overall each at 1) = (WB depth NOPs pipeline associated and each stages at of combination Optimal I: TABLE Fig. nraigpoesrfeunytruhpplnn sjust is pipelining through frequency processor Increasing con- register and LUT of distribution the shows 4 Fig. :Rsuc tlzto fhgetfeunyconfiguration. frequency highest of utilization Resource 5: Area Unit 1,000 200 400 600 800 0

5 281

55549 9 8 7 4 6 4 5 6 4 5 4 5 4 4 4 3 4 4 2 4 2 15 4 14 3 13 12 11 10 923368223572224622135121341112Pipeline 371 Depth

6 283 370

7 320 462 FI EX ID IF

8 336 479 Pipeline

9 365 613 LUTs 10 362

Depth 542

11 363

#NOPs 529 Re Req’d 12

372 gs 676

13 422 790

14 425 928

15 423 928 rm1-tgsowrs Osbgnt nrae hl the benefit. while runtime increase, a to begin in NOPs result onwards, limited, to is 11-stages gains NOPs From in frequency increasing increase the an the as 9–11, by allowing time From offset wall-clock NOPs. are in of increases increase number frequency slight of a pipeline benefits For is point. the there certain 5–9, a to of up lengths depth decreases pipeline time increase wall-clock we expected, as As tested. we benchmarks various at time wall-clock geomean and depths. pipeline Frequency 6: Fig. rt ttcfradn ol etocmlx restricted a complex, too be would forwarding static orate increases a overhead pipelines. for this excessive deeper and be for processor, in would soft that resulting FPGA-based overhead hardware, small frequency processor in and considering soft area additions order an a algorithm, significant of in Tomasulo’s require out it would fly. executed implementing the However, be dependencies. on to decisions instructions allows these processor the make the in allow exploited to approaches be dynamic be- schemes then other Some while approach must processor. assembly, which soft an paths fast pipeline, forwarding such lean provide a longer paths, are for a forwarding paths infeasible With comes possible multiplexed flexibility. additional more this pipeline, since and facilitate the costly to of be required can stages typically so different scheme and between forwarding full forwarding mod- a requirements allows in However, present padding are processors. the these ern and reduce instructions, help dependent Data between hardware. can in paths dynamically or forwarding software in statically for solved study similar [12]. A in frequency, statistics. presented achieved trace is program and processors derived of superscalar is depth help depth pipeline the has pipeline with balancing pipelines optimal data on An in-order of [11]. based effect of in the performance discussed of been the Analysis IPC. on instructions, effective between low dependencies the seen, cycles a have determining idle hence we more and in As mean processors. essential pipelines such longer for is depth pipelines pipeline optimal deeper of penalty benchmarks. of set gives this configuration for pipeline time 11-cycle execution increasing the lowest time 6, the execution Fig. in From resulting again. slows, trend frequency Time (us) 45 50 55 60 65 70 75 80 i.6sostenraie alcoktmsfrthe for times wall-clock normalised the shows 6 Fig. norcs,wiednmcfradn,o vnelab- even or forwarding, dynamic while case, our In re- be can instructions sequential of dependency Data NOP the with frequency increased of benefits the Balancing 4 V C A IV. 6 Frequency S FOR ASE 8 Pipeline Depth S IMPLE 10 Execution Time D ATA F 12 ORWARDING 14 200 250 300 350 400 450 500

Frequency (MHz) crc fir qsort TABLE II: Dynamic cycle counts with 11-stage pipeline with fib median % of NOPs savings. 40 35 Reduced Consecutive Reduced Total Consecutive 30 Benchmark Dependant Total NOPs Dependant NOPs NOPs 25 NOPs 20 crc 22,808 7,200 (32%) 2,400 18,008 ( 21%) fib 4,144 816 (20%) 272 3,600 ( 13%) 15 fir 46,416 5,400 (12%) 1,800 42,816 ( 8%) 10 median 13,390 1,212 (9%) 404 12,582 ( 6%) 5 qsort 28,443 1,272 (4%) 424 27,595 ( 3%) Instr. Count Reduction (%) 0 4 6 8 10 12 14 9-Stage Pipeline Depth IF IF IF ID ID EX EX EX WB Fig. 8: Reduced instruction count with data forwarding. 5 nops IF IF IF ID ID EX EX EX WB 9-Stage with Forwarding configurations for each processor stage, determining the opti- mal configuration for each overall depth. We then investigated IF IF IF ID ID EX EX EX WB the resulting runtime for a set of benchmark programs, and 2 nops IF IF IF ID ID EX EX EX WB determined that an 11-cycle configuration leads to the lowest execution time. We also conducted an initial study to explore Fig. 7: Forwarding configurations, showing how subsequent the potential of a simple data forwarding approach for such instruction can commence earlier in the pipeline. lean soft processors, and determined that a fixed forwarding path can reduce instruction count by 4–30%. We aim to explore the effect of this path on the processor area and frequency and investigate other opportunities for forwarding in future work. forwarding approach may be possible and could result in a significant overall performance improvement. Rather than add a forwarding path from every stage after the decode stage back REFERENCES to the execute stage inputs, we can consider just a single path. [1] N. Kapre and A. DeHon, “VLIW-SCORE: Beyond C for sequential In Table II, we analyse the NOPs inserted in more detail. Out control of SPICE FPGA acceleration,” in Proceedings of the Interna- of all the NOPs, we can see that a significant proportion are tional Conference on Field Programmable Technology (FPT), 2011, pp. 1–9. between consecutive instructions with dependencies (4–30%). These could be overcome by adding a single path allowing [2] UG081: MicroBlaze Processor Reference Guide, Xilinx Inc., 2011. the result of an instruction to be used as an operand in a [3] Nios II Processor Design, Altera Corpration, 2011. subsequent instruction, avoiding the need for a writeback. We [4] Cortex-M1 Processor, ARM Ltd., 2011. [Online]. Available: http://www.arm.com/products/processors/cortex-m/cortex-m1.php propose adding a single forwarding path between the output [5] LatticeMico32 Processor Reference Manual, of the execute stage, and its input to allow this. Fig. 7 shows Corp., 2009. how the addition of this path in a 9-stage configuration would [6] GRLIB IP Library User’s Manual, Aeroflex Gaisler, 2012. reduce the number of NOPs required before a subsequent [7] C. E. LaForest and J. G. Steffan, “Octavo: an FPGA-centric processor dependent instruction to just 2, compared to 5 in the case of family,” in Proceedings of the ACM/SIGDA International Symposium no forwarding. on Field Programmable Gate Arrays (FPGA), Feb. 2012, pp. 219–228. [8] H. Y. Cheah, S. A. Fahmy, and D. L. Maskell, “iDEA: A DSP In Table II, we show how the addition of this path reduces block based FPGA soft processor,” in Proceedings of the International the number of NOPs required to resolve such consecutive de- Conference on Field Programmable Technology (FPT), Dec. 2012, pp. pendencies, and hence the reduction in overall NOPs required. 151–158. As this fixed forwarding path is only valid for subsequent [9] H. Y. Cheah, F. Brosser, S. A. Fahmy, and D. L. Maskell, “The iDEA dependencies, it does not eliminate NOPs entirely, and non- DSP Block Based Soft Processor for FPGAs,” ACM Transactions on adjacent dependencies are still subject to the same window. Reconfigurable Technology and Systems, vol. 7, no. 3, pp. 19:1–19:23, However, we can see a significant reduction in the overall 2014. [10] K. Vipin, S. Shreejith, D. Gunasekara, S. A. Fahmy, and N. Kapre, number of NOPs and hence, cycle count for execution of our “System-level FPGA device driver with high-level synthesis support,” benchmarks across a range of pipeline depths. These savings in Proceedings of the International Conference on Field Programmable are shown in Fig. 8. We can see significant savings of between Technology (FPT), Dec. 2013, pp. 128–135. 4 and 30% for the different benchmarks. This depends on [11] P. G. Emma and E. S. Davidson, “Characterization of Branch and Data how often such chains of dependent instructions occur in the Dependencies in Programs for Evaluating Pipeline Performance,” IEEE assembly and how often they are executed. Transactions on Computers, vol. 36, pp. 859–875, 1987. [12] A. Hartstein and T. R. Puzak, “The optimum pipeline depth for a ,” ACM Sigarch News, vol. 30, V. C ONCLUSIONS AND FUTURE WORK pp. 7–13, 2002. We have demonstrated how to achieve an optimal pipeline depth for a DSP-based soft-processor. We have completed a full design space exploration in which we varied the overall pipeline depth, allowing for a combination of different depth