A HIGH PERFORMANCE MICROPROCESSOR WITH DSP EXTENSIONS OPTIMIZED FOR THE VIRTEX-4 FPGA

Andreas Ehliar∗, Per Karlström†, Dake Liu

Department of Electrical Engineering, Linköping University, Sweden
email: [email protected], [email protected], [email protected]

∗Funded by Stringent of SSF
†Funded by Stringent of SSF

978-1-4244-1961-6/08/$25.00 ©2008 IEEE.

ABSTRACT

As the use of FPGAs increases, the importance of highly optimized processors for FPGAs will increase. In this paper we present the microarchitecture of a soft microprocessor core optimized for the Virtex-4 architecture. The core can operate at 357 MHz, which is significantly faster than Xilinx’ Microblaze architecture on the same FPGA. At this frequency it is necessary to keep the logic complexity down and this paper shows how this can be done while retaining sufficient functionality for a high performance processor.

1. INTRODUCTION

The use of FPGAs has increased steadily since their introduction. The first FPGAs were limited devices, usable mainly for glue logic, whereas the capabilities of modern FPGAs allow for extremely varied use cases in everything from high end communication and networking equipment to consumer devices like flat screen televisions. In many cases, a soft processor core is an important part of the design.

The main players in this market are Altera’s Nios, Xilinx’ Microblaze and Lattice’ Mico32. All are capable microcontrollers based on a traditional RISC pipeline. However, there is little choice available if a soft DSP processor core is needed. Some might argue that a DSP processor core is unnecessary in an FPGA as DSP computations can instead be handled by custom designed IP blocks. For example, a radar processing core can easily fill an entire high end FPGA with a high utilisation rate of all functional units. On the other hand, it is harder to design a system which will use a wide variety of different DSP algorithms if custom IP blocks are used for each algorithm. As an example, a video conference system might use a hardware accelerated video encoder and a software based video decoder and audio codec. There are many reasons for partitioning the design like this, including better hardware utilization and shorter development time due to software reuse and simplified debugging.

In this paper we will present a high speed soft microprocessor core with DSP extensions optimized for the Virtex-4 FPGA family. The microarchitecture of the processor is carefully designed to allow for high speed operation.

2. RELATED WORK

There are many soft processor cores available for FPGA usage, although Nios II, Mico32, and Microblaze are common choices thanks to the support from their vendors.

Altera’s Nios II [2] is a 32-bit RISC processor that comes in three flavors, e, s, and f, with one, five, or six pipeline stages respectively.

Xilinx’ Microblaze is a 32-bit RISC processor [10] optimized for Xilinx FPGAs.

Lattice’ Mico32 is a 32 bit RISC processor [7] with a six stage pipeline. The source code of Mico32 is also available under an open source license.

Besides the vendor supported processors there is a wide variety of processor cores available, both commercial and open source. Notable cores include OR1200 [6], Leon [8], and OpenSparc [9]. These processors are targeted at ASICs but have found a use on FPGAs as well.

3. OVERVIEW

Our main design goal was to create a high speed soft processor core with support for common DSP operations. In addition, the processor should be reasonably easy to program without intimate knowledge of the pipeline. It should also be possible to write a decent compiler backend for the processor. Finally, the processor footprint should not be excessive.

3.1. Tradeoffs

It is hard to create a processor which is both fast and easy to program. A fast processor will have a deep pipeline, forcing the programmer (or compiler) to think hard about instruction scheduling and branch penalties. On the other hand, a programmer friendly processor presents an architecture with few surprises, trading either speed or hardware complexity for ease of use.

Our goal was to create a high speed processor which is still relatively easy to program. For example, in order to increase the maximum clock frequency our processor only has partial support for register forwarding. The result of some instructions cannot be forwarded directly to other execution units. Typically one or two other instructions have to be issued before the result of an operation can be reused on another execution unit. We feel that this is an acceptable tradeoff, based on our experience with other processors without any forwarding at all [4].

Fig. 1. Forwarding architecture
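
The scheduling cost of partial forwarding can be illustrated with a small model. The sketch below is not taken from the paper: it assumes, as a simplification, that a result needs two intervening instructions before a different execution unit can read it and none when the same unit consumes it. The function name nops_needed, the constant CROSS_UNIT_GAP, and the instruction tuple layout are all hypothetical.

```python
# Assumed value: the text only says "typically one or two" instructions.
CROSS_UNIT_GAP = 2


def nops_needed(program):
    """Count the NOP slots a naive scheduler would have to insert.

    program is a list of (unit, dest, sources) tuples, for example
    ("AU", "r1", ["r2", "r3"]). This is an illustrative model only.
    """
    issue_slot = 0
    last_write = {}  # register name -> (issue slot, producing unit)
    nops = 0
    for unit, dest, sources in program:
        earliest = issue_slot
        for reg in sources:
            if reg in last_write:
                slot, producer = last_write[reg]
                # Same-unit results are assumed to be usable immediately.
                gap = 0 if producer == unit else CROSS_UNIT_GAP
                earliest = max(earliest, slot + 1 + gap)
        nops += earliest - issue_slot
        last_write[dest] = (earliest, unit)
        issue_slot = earliest + 1
    return nops


# An AU result consumed by the logic unit right away costs two slots.
print(nops_needed([("AU", "r1", ["r2", "r3"]),
                   ("LU", "r4", ["r1", "r5"])]))  # prints 2
```

Under these assumptions, placing one independent instruction between the producer and the consumer reduces the cost to a single slot, which is the kind of rescheduling a programmer or compiler would be expected to do.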

4. ARCHITECTURE

The architecture is RISC based with six pipeline stages: fetch, decode / read operands, register forwarding, execute1, execute2, and writeback. The processor is a 32-bit microprocessor with 16 general purpose registers. The address space is limited to 16 bits. The instruction set contains a fairly standard set of RISC instructions. The instruction words are 27 bits wide and up to 7 bits can be used for immediates. Longer immediates can be handled either by a 128 entry lookup-table or by using an extra SETHI instruction. Special purpose registers are used for I/O and processor configuration.
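
The three ways of supplying a constant can be sketched as a selection routine of the kind an assembler might use. This is only an illustration under stated assumptions: the paper does not say whether short immediates are signed or how SETHI splits a value, so the sketch assumes unsigned 7 bit immediates and a SETHI that carries the upper 25 bits. The function name materialize and the returned tuples are hypothetical.

```python
IMM_BITS = 7          # from the text: up to 7 immediate bits per instruction
TABLE_ENTRIES = 128   # from the text: 128 entry lookup-table


def materialize(value, const_table):
    """Choose how a 32 bit constant could be supplied to an instruction."""
    if len(const_table) > TABLE_ENTRIES:
        raise ValueError("constant table holds at most 128 entries")
    value &= 0xFFFFFFFF
    # Assumption: short immediates are unsigned.
    if value < (1 << IMM_BITS):
        return ("imm", value)
    # Otherwise try the lookup table; its index fits in the 7 immediate bits.
    if value in const_table:
        return ("table", const_table.index(value))
    # Assumption: SETHI supplies the upper 25 bits, the instruction the low 7.
    return ("sethi", value >> IMM_BITS, value & ((1 << IMM_BITS) - 1))


print(materialize(5, []))                    # ('imm', 5)
print(materialize(1000, [42, 1000]))         # ('table', 1)
```

A 128 entry table matches the 7 available immediate bits, since the index of any entry can be encoded in the same field that would otherwise hold a short immediate.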

4.1. Register Forwarding

Register forwarding is implemented as a separate pipeline stage. This means that in general, the result of one operation cannot be forwarded directly to the next instruction. To mitigate this, the arithmetic unit can forward a result of an arithmetic operation directly to itself. Similarly, the result of a logic unit operation can be forwarded directly to the logic unit.

The principle of the forwarding unit is shown in Fig. 1. To reduce the size of the mux, the pipeline is constructed so that signals from one pipeline stage can be ORed together. This is accomplished by utilizing the reset input of the flip-flops after each execution unit to set all non-active execution unit outputs in a certain pipeline stage to zero.

4.2. Arithmetic Unit

The arithmetic unit (AU), shown in Fig. 2, is one of the most critical parts of the entire processor. As mentioned earlier we could not afford to have full register forwarding in this processor. The main reason for this is the 32 bit adder in the AU. If a large mux is inserted before the inputs to the adder, the critical path would be too long (e.g. a 32 bit adder with 4-to-1 muxes in front of each operand can be synthesized to only 290 MHz in a Virtex-4 speedgrade 12).

Fig. 2. The architecture of the arithmetic unit and the principles of the final part of the register forwarding unit

However, the processor is able to forward results from the adder back to the adder without any penalty to support a sequence of AU instructions. As can be seen in Fig. 2, the result of the addition can only be forwarded to one of the inputs of the adder. Due to the design of a slice in a Virtex-4 it is not possible to put a mux in front of the other operand when only one LUT is used per bit in the adder. This complicates forwarding since either operand has to be able to be forwarded to any input of the adder. To solve this problem the previous pipeline stage is responsible for ensuring that the correct operand appears on the inputs. An example is shown in Table 1. It should also be noted that only the principles of the register forwarding pipeline stage are shown in the figure.

Table 1. Forwarding operands to the arithmetic unit

Instruction sequence   Forwarded operand   Control signals
add r2,r1,r0           -                   -
add r2,r1,r2           OpB                 Force0=1 Select=1
add r2,r2,r1           OpA                 Swap=1 Force0=1 Select=1
sub r2,r1,r2           OpB                 Force1=1 Select=1
sub r2,r2,r1           OpA                 Swap=1 InvA=1 Select=1
sub r2,r2,r2           Both                Replace with set r2,#0
add r2,r2,r2           Both                Cannot forward directly
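
The case analysis behind Table 1 can be written down as a small lookup. The sketch below only reproduces the rows of the table for add and sub; it does not model what the control signals do in hardware, and the function name classify_forwarding and its argument layout are hypothetical.

```python
# Control-signal settings copied verbatim from Table 1.
# Their hardware meaning is not modeled here.
CONTROL_SIGNALS = {
    ("add", "OpB"): "Force0=1 Select=1",
    ("add", "OpA"): "Swap=1 Force0=1 Select=1",
    ("sub", "OpB"): "Force1=1 Select=1",
    ("sub", "OpA"): "Swap=1 InvA=1 Select=1",
}


def classify_forwarding(prev_dest, op, src_a, src_b):
    """Classify an AU instruction that directly follows an AU instruction
    writing prev_dest. Returns (forwarded operand, action) as in Table 1.
    Only "add" and "sub" are covered, since only those appear in the table.
    """
    uses_a = src_a == prev_dest
    uses_b = src_b == prev_dest
    if uses_a and uses_b:
        if op == "sub":
            # x - x is always zero, so the table replaces it with a set.
            return ("Both", "Replace with set #0")
        return ("Both", "Cannot forward directly")
    if uses_a:
        return ("OpA", CONTROL_SIGNALS[(op, "OpA")])
    if uses_b:
        return ("OpB", CONTROL_SIGNALS[(op, "OpB")])
    return ("-", "-")


# Second row of Table 1: add r2,r1,r2 after an instruction that wrote r2.
print(classify_forwarding("r2", "add", "r1", "r2"))
# ('OpB', 'Force0=1 Select=1')
```

In the processor this decision is made by the pipeline stage before the adder, so that the adder itself needs a forwarding path on only one of its inputs.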

In our implementation, the forwarding logic shown in Fig. 2 has been merged into the same LUTs that are used to implement the forwarding shown in Fig. 1, which reduces the number of logic levels.

4.3. Branching

Branches always have one delay slot. If absolute addressing is used for the jump address, the processor can immediately start executing the target instruction after the delay slot.

The processor has 4 status flags: Z (zero), V (overflow), N (negative), C (carry). An arithmetic or logic instruction will change these flags. The critical path of this unit is the Z flag generation. This is performed partly in the AU and LU units. In the AU unit, the 20 lower bits are preprocessed in groups of four bits using five 4-input or-gates. In the LU unit, the entire 32 bit result is preprocessed in the same way using eight 4-input or-gates. Thanks to this preprocessing of the Z flag it is possible to start branch condition computation one pipeline stage earlier.

Conditional jumps are statically predicted using a bit in the instruction word. A correctly predicted conditional jump has no penalty cycles. A mispredicted jump has a penalty of either three or four cycles. A register indirect jump always has a penalty of four cycles. If the branch prediction was wrong, the speculatively fetched instructions are invalidated before entering the execute1 stage.

4.4. Memory Architecture

There are three memories in the system: program, data, and constant memory. The program memory is 27 bits wide, and the data and constant memories are 32 bits wide. Both the constant and data memory can be addressed using address generator units described in the next section. The constant memory is also used as a lookup table for the 128 constants described in Section 4.

The data memory can be addressed using a value from the register file plus an 8 bit offset in the instruction word. The adder is located in the same pipeline stage as register forwarding. This is done to minimize the complexity before the memory. This also means that the register used must be written to the register file before being used for addressing memory. We believe that this is an acceptable tradeoff as one very common usage for this is accessing variables on the stack, and the stack pointer is unlikely to change very often.

The data memory is byte addressable, which is important if high level languages like C and C++ are used to write programs for the processor.

4.5. DSP Extensions

A few architectural features can greatly improve the performance of DSP applications. These are the multiply-and-accumulate (MAC) unit, the circular buffer and zero overhead loop support.

The MAC operation is ubiquitous in DSP applications and fairly easy to implement in hardware. Four DSP48 blocks were used to implement a 32 × 32 bit multiplication followed by a 64 bit accumulation unit. In total, the MAC unit has six pipeline stages. Due to the long latency of this unit the results are written to a special accumulation register instead of the normal register file. The MAC unit is also used for multiplication without accumulation.

The operands of the MAC instruction can either be fetched from the register file or from the data and constant memory. Special address generator units (AGU) are connected to the data and constant memory to allow for a steady stream of data from the memories to the MAC unit. The AGUs support linear and circular addressing.

For each memory access, the AGU increases the current address with a configurable stepsize. In circular addressing, a start and end address constrain the range of valid addresses. If the next address is located beyond the end address, the next address is set to CURRENT_ADDRESS + STEPSIZE - BUFFER_SIZE, where BUFFER_SIZE is the size of the circular buffer.

A straightforward hardware implementation of this calculation could be synthesized to 209 MHz. The next address is compared to END_ADDRESS and, if it is too large, adjusted as described above. Pipelining is used to improve the performance of the address generator. Due to the pipelining, the address must be compared to END_ADDRESS - 2*STEPSIZE (END_ADDRESS - STEPSIZE for the first iteration) instead of END_ADDRESS. The pipelined address generator is shown in Fig. 3.

To improve the performance of small loops typical for DSP kernels there is also a loop instruction available which allows for up to 65535 loop iterations.

5. RESULTS

A floorplanned version of the processor can operate at 357 MHz in a Virtex-4 LX80, speedgrade 12, according to static timing analysis. Without floorplanning, 334 MHz is the maximum frequency. The processor uses 1197 slices, 1716 LUTs, and 1301 flip-flops. The largest parts of the processor are the shifter (405 LUTs, 131 flip-flops) and the register forwarding pipeline stage (264 LUTs, 64 flip-flops).

6. DISCUSSION

In order to reach a clock frequency of 357 MHz in a Virtex-4 FPGA, a number of compromises had to be made. This means that the processor will have a few quirks not found in more general processors. The most important impact of this is that the pipeline is partly visible to the programmer.

According to [5] on Xilinx’ homepage, the Microblaze processor can run at 160 MHz in a Virtex-4. We have, however, seen figures of up to 200 MHz reported for the Microblaze on Virtex-4 [3]. Even so, our processor has a maximum clock frequency which is almost 80% faster than Microblaze. In addition, it is also operating at a significantly higher frequency than the Microblaze on a Virtex-5. This does not mean that all applications will be 80% faster when running on our processor. Some programs will require more clock cycles to run on our processor, due to the incomplete register forwarding. However, DSP applications can typically be rewritten to compensate for the lack of register forwarding by proper instruction scheduling and algorithm selection. For example, in [4], only 10% of the cycles were wasted on NOP instructions and that processor has no support for register forwarding at all. A more thorough examination of the results of incomplete forwarding can be found in [1].

We also acknowledge that standardized benchmarks are required to fully evaluate our processor.

6.1. Future Work

Our final goal is a soft processor core optimized for DSP computations on FPGAs. To reach this goal it is necessary to benchmark the processor using a number of realistic DSP applications. Unfortunately such benchmarks are not easily performed as there is not yet a compiler available for this processor.

As already explained in Section 4.2 there is not enough time available to have full forwarding in front of the arithmetic unit, but it might be possible to forward operands from the adder directly to the logic unit, shifter, and memory unit. This should be evaluated with the help of benchmarks.

Other possible improvements include caches, interrupts, and floating point instructions.

7. CONCLUSION

It is not possible to design a really fast processor in an FPGA without some quirks. It is however possible to design a processor where the impact of these quirks is reduced.

Like all high speed designs, a high speed microprocessor has to keep the logic complexity between flip-flops at a minimum. Unlike many other high speed designs, the pipeline also has to be short.

This paper has demonstrated a number of ways to deal with these issues, resulting in a processor which can operate at a much higher clock frequency than Xilinx’ Microblaze. The architectural details and tradeoffs presented here should be of interest to anyone who is interested in processor design for FPGAs.

Acknowledgments

Thanks to Prof. Lars Svensson for an interesting discussion regarding the processor described in this chapter.

8. REFERENCES

[1] P.S. Ahuja, D.W. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor pipelines. Microarchitecture, 1995. Proceedings of the 28th Annual International Symposium on, pages 36–45, Nov-1 Dec 1995.
[2] Altera. Nios II Processor Reference Handbook, 2007.
[3] Peter Clarke. Xilinx raises soft processor clock frequency 25%. http://www.eetimes.com/, 2005.
[4] J. Eilert, A. Ehliar, and Dake Liu. Using low precision floating point numbers to reduce memory cost for mp3 decoding. Multimedia Signal Processing, 2004 IEEE 6th Workshop on, pages 119–122, Sept.-1 Oct. 2004.
[5] Xilinx Inc. Microblaze - the industry’s most flexible embedded processing solution. http://www.xilinx.com/publications/prod_mktg/MicroBlaze_Sell_Sheet.pdf, 2006.
[6] Damjan Lampret. OpenRISC 1200 IP Core Specification, 2001.
[7] Lattice. LatticeMico32 Processor Reference Manual, 2007.
[8] Gaisler Research. The LEON processor users manual, 2001.
[9] Sun. OpenSPARC T2 Core Microarchitecture Specification, a edition, December 2007.
[10] Xilinx. MicroBlaze Processor Reference Guide, 2004.
