A High Performance Microprocessor with DSP Extensions Optimized for the Virtex-4 FPGA
Andreas Ehliar∗, Per Karlström†, Dake Liu
Department of Electrical Engineering
Linköping University
Sweden

∗Funded by Stringent of SSF
†Funded by Stringent of SSF

978-1-4244-1961-6/08/$25.00 ©2008 IEEE.

ABSTRACT

As the use of FPGAs increases, the importance of highly optimized processors for FPGAs will increase. In this paper we present the microarchitecture of a soft microprocessor core optimized for the Virtex-4 architecture. The core can operate at 357 MHz, which is significantly faster than Xilinx's Microblaze architecture on the same FPGA. At this frequency it is necessary to keep the logic complexity down, and this paper shows how this can be done while retaining sufficient functionality for a high performance processor.

1. INTRODUCTION

The use of FPGAs has increased steadily since their introduction. The first FPGAs were limited devices, usable mainly for glue logic, whereas the capabilities of modern FPGAs allow for extremely varied use cases in everything from high end communication and networking equipment to consumer devices like flat screen televisions. In many cases, a soft processor core is an important part of the design.

The main players in this market are Altera's Nios, Xilinx's Microblaze and Lattice's Mico32. All are capable microcontrollers based on a traditional RISC pipeline. However, there is little choice available if a soft DSP processor core is needed. Some might argue that a DSP processor core is unnecessary in an FPGA, as DSP computations can instead be handled by custom designed IP blocks. For example, a radar processing core can easily fill an entire high end FPGA with a high utilisation rate of all functional units. On the other hand, it is harder to design a system that uses a wide variety of different DSP algorithms if custom IP blocks are used for each algorithm. As an example, a video conference system might use a hardware accelerated video encoder and a software based video decoder and audio codec. There are many reasons for partitioning the design like this, including better hardware utilization and shorter development time due to software reuse and simplified debugging.

In this paper we will present a high speed soft microprocessor core with DSP extensions optimized for the Virtex-4 FPGA family. The microarchitecture of the processor is carefully designed to allow for high speed operation.

2. RELATED WORK

There are many soft processor cores available for FPGA usage, although Nios II, Mico32, and Microblaze are common choices thanks to the support from their vendors.

Altera's Nios II [2] is a 32-bit RISC processor that comes in three flavors, e, s, and f, with one, five, or six pipeline stages respectively.

Xilinx's Microblaze is a 32-bit RISC processor [10] optimized for Xilinx FPGAs.

Lattice's Mico32 is a 32-bit RISC processor [7] with a six stage pipeline. The source code of Mico32 is also available under an open source license.

Besides the vendor supported processors there is a wide variety of processor cores available, both commercial and open source. Notable cores include OR1200 [6], Leon [8], and OpenSparc [9]. These processors are targeted at ASICs but have found a use on FPGAs as well.

3. OVERVIEW

Our main design goal was to create a high speed soft processor core with support for common DSP operations. In addition, the processor should be reasonably easy to program without intimate knowledge of the pipeline. It should also be possible to write a decent compiler backend for the processor. Finally, the processor footprint should not be excessive.

3.1. Tradeoffs

It is hard to create a processor which is both fast and easy to program. A fast processor will have a deep pipeline, forcing the programmer (or compiler) to think hard about instruction scheduling and branch penalties. On the other hand, a programmer friendly processor presents an architecture with few surprises, trading either execution speed or hardware complexity for ease of use.

Our goal was to create a high speed processor which is still relatively easy to program. For example, in order to increase the maximum clock frequency, our processor only has partial support for register forwarding. The result of some instructions cannot be forwarded directly to other execution units. Typically one or two other instructions have to be issued before the result of an operation can be reused on another execution unit. We feel that this is an acceptable tradeoff, based on our experience with other processors without any forwarding at all [4].

4. ARCHITECTURE

The architecture is RISC based with six pipeline stages: fetch, decode / read operands, register forwarding, execute1, execute2, and writeback. The processor is a 32-bit microprocessor with 16 general purpose registers. The address space is limited to 16 bits.

The instruction set contains a fairly standard set of RISC instructions. The instruction words are 27 bits wide and up to 7 bits can be used for immediates. Longer immediates can be handled either by a 128 entry lookup-table or by using an extra SETHI instruction. Special purpose registers are used for I/O and processor configuration.

4.1. Register Forwarding

Register forwarding is implemented as a separate pipeline stage. This means that, in general, the result of one operation cannot be forwarded directly to the next instruction. To mitigate this, the arithmetic unit can forward the result of an arithmetic operation directly to itself. Similarly, the result of a logic unit operation can be forwarded directly to the logic unit.

The principle of the forwarding unit is shown in Fig. 1. To reduce the size of the mux, the pipeline is constructed so that signals from one pipeline stage can be ORed together. This is accomplished by utilizing the reset input of the flip-flops after each execution unit to set all non-active execution unit outputs in a certain pipeline stage to zero.

Fig. 1. Forwarding architecture

4.2. Arithmetic Unit

The arithmetic unit (AU), shown in Fig. 2, is one of the most critical parts of the entire processor. As mentioned earlier, we could not afford to have full register forwarding in this processor. The main reason for this is the 32-bit adder in the AU. If a large mux is inserted before the inputs to the adder, the critical path would be too long (e.g. a 32-bit adder with 4-to-1 muxes in front of each operand can be synthesized to only 290 MHz in a Virtex-4, speed grade 12).

Fig. 2. The architecture of the arithmetic unit and the principles of the final part of the register forwarding unit

However, the processor is able to forward results from the adder back to the adder without any penalty, to support a sequence of AU instructions. As can be seen in Fig. 2, the result of the addition can only be forwarded to one of the inputs of the adder. Due to the design of a slice in a Virtex-4, it is not possible to put a mux in front of the other operand when only one LUT is used per bit in the adder. This complicates forwarding, since it must be possible to forward either operand to either input of the adder. To solve this problem, the previous pipeline stage is responsible for ensuring that the correct operand appears on the inputs. An example is shown in Table 1. It should also be noted that only the principles for the register forwarding pipeline stage are shown in the figure. In our implementation this has been merged into the same LUTs that are used to implement the forwarding shown in Fig. 1, to reduce the logic level.

| Instruction sequence | Forwarded operand | Control signals |
|---|---|---|
| add r2,r1,r0 | - | - |
| add r2,r1,r2 | OpB | Force0=1 Select=1 |
| add r2,r2,r1 | OpA | Swap=1 Force0=1 Select=1 |
| sub r2,r1,r2 | OpB | Force1=1 Select=1 |
| sub r2,r2,r1 | OpA | Swap=1 InvA=1 Select=1 |
| sub r2,r2,r2 | Both | Replace with set r2,#0 |
| add r2,r2,r2 | Both | Cannot forward directly |

Table 1. Forwarding operands to the arithmetic unit

4.3. Branching

Branches always have one delay slot. If absolute addressing is used for the jump address, the processor can immediately start executing the target instruction after the delay slot.

The processor has 4 status flags: Z (zero), V (overflow), N (negative), C (carry). An arithmetic or logic instruction will change these flags. The critical path of this unit is the Z flag generation. This is performed partly in the AU and LU units. In the AU, the 20 lower bits are preprocessed

Fig. 3. Address generation unit

the memory. This also means that the register used must be written to the register file before being used for addressing memory. We believe that this is an acceptable tradeoff, as one very common usage for this addressing mode is accessing variables on the stack, and the stack pointer is unlikely to change very often.

The data memory is byte addressable, which is important if high level languages like C and C++ are used to write programs for the processor.

4.5. DSP Extensions

A few architectural features can greatly improve the performance of DSP applications.
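Looking back at the operand forwarding cases in Table 1, the selection of control signals can be illustrated with a small Python sketch. It is only a software model of the table, not the hardware implementation: the function name `forward_controls`, the `Controls` tuple, and the hazard check (comparing each source register with the destination of the preceding AU instruction) are assumptions made for illustration. Only the signal names and the add/sub rows are taken from Table 1.

```python
from typing import NamedTuple, Optional, Tuple


class Controls(NamedTuple):
    """Control signals named in Table 1 (all default to 0)."""
    swap: int = 0
    force0: int = 0
    force1: int = 0
    inv_a: int = 0
    select: int = 0


# (operation, forward OpA?, forward OpB?) -> control signals, copied from Table 1.
TABLE_1 = {
    ("add", False, True): Controls(force0=1, select=1),
    ("add", True, False): Controls(swap=1, force0=1, select=1),
    ("sub", False, True): Controls(force1=1, select=1),
    ("sub", True, False): Controls(swap=1, inv_a=1, select=1),
}


def forward_controls(op: str, src_a: str, src_b: str,
                     prev_dest: str) -> Tuple[str, Optional[Controls]]:
    """Model the Table 1 decision for an AU instruction `op rd, src_a, src_b`
    that directly follows an AU instruction writing `prev_dest`.

    Assumption: an operand needs forwarding exactly when its register
    equals the previous destination. Only add and sub are covered.
    """
    fwd_a = src_a == prev_dest
    fwd_b = src_b == prev_dest
    if not (fwd_a or fwd_b):
        return "no forwarding", Controls()
    if fwd_a and fwd_b:
        # Table 1: x - x is replaced by a constant, x + x cannot be forwarded.
        if op == "sub":
            return "replace with set #0", None
        return "cannot forward directly", None
    return "forward", TABLE_1[(op, fwd_a, fwd_b)]


if __name__ == "__main__":
    # Each call assumes the previous AU instruction wrote r2.
    for op, a, b in [("add", "r1", "r2"), ("sub", "r2", "r1"), ("add", "r2", "r2")]:
        print(op, a, b, forward_controls(op, a, b, "r2"))
```

The lookup keeps the single-operand cases as data, so the special handling of the two "Both" rows stands out as the only branching logic.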