Experiences with Soft-Core Processor Design

Experiences with Soft-Core Processor Design Franjo Plavec Blair Fort Zvonko G. Vranesic Stephen D. Brown University of Toronto Department of Electrical and Computer Engineering Toronto,ON, Canada {plavec, fort, zvonko, brown}@eecg.toronto.edu Abstract tions and others. In this work, we present a new implementation of the Nios architecture, called UT Nios [3]. Soft-core processors exploit the flexibility of Field Pro- In the paper we refer to the original Nios implementation grammable Gate Arrays (FPGAs) to allow a system de- as Altera Nios, and use the term Nios for the Nios architec- signer to customize the processor to the needs of a target ture, independent of the implementation. To assess the per- application. This paper describes the UT Nios implementa- formance of the UT Nios and compare it with Altera Nios, tion of Altera’s Nios architecture. A benchmark set appro- a UT Nios benchmark set was defined. priate for soft-core processors is defined. Using the bench- The paper is organized as follows. In section 2, the key mark set, the performance of UT Nios is explored and com- features of the Nios architecture are described, and the UT pared with the commercial implementation. Nios implementation is presented. Section 3 defines the UT Nios benchmark set, and gives the performance comparison of UT and Altera Nios is given. Section 4 concludes the paper. 1. Introduction 2. UT Nios Field Programmable Gate Arrays (FPGAs) are becom- ing increasingly popular for implementation of logic cir- 2.1. Nios architecture cuits. Inclusion of abundant memory resources in FPGAs has made them suitable for implementation of embedded Nios is a highly customizable RISC architecture. One of SOPC systems, where a complete system fits on a single the architectural parameters that is selected at design time programmable chip. A processor unit in such a system is is the width of the datapath, which can be 16 or 32 bits [4]. usually a soft-core processor. The instruction word is 16 bits wide, regardless of the data- A soft-core processor is a microprocessor fully described path width, to reduce the code size compared to the 32 bits in software, usually in an HDL, which can be synthesized wide instruction word. We will focus on the 32-bit datap- in programmable hardware, such as FPGAs. Soft-core pro- ath,whichismorelikelytobeusedinhighperformance cessors implemented in FPGAs can be easily customized applications. to the needs of a specific target application. The two major The Nios instruction set supports only the basic arith- FPGA manufacturers provide commercial soft-core proces- metic and logic instructions, with optional support for in- sors. Xilinx offers its MicroBlaze processor [1], while Al- teger multiplication. The instruction set can be augmented tera has Nios and Nios II processors [2]. In this paper we with up to five custom instructions. A custom instruc- focus on the Nios architecture. tion performs a user-defined operation in hardware, thus Nios is a RISC soft-core processor optimized for imple- speeding-up the critical parts of an application [5]. Nios has mentation in Altera FPGAs. Many architectural parameters a load-store architecture, with most instructions using reg- of Nios can be customized at design time, including the dat- ister file operands. Although the size of the register file is apath width, register file size, cache size, custom instruc- configurable, only 32 registers are visible to programs at any given time. The register file is divided into register win- This work was supported in part by NSERC, CITR and University of dows. The sliding windows overlap, facilitating fast context Toronto Master’s Fellowship switching on entrance and exit from procedures. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) 1530-2075/05 $ 20.00 IEEE ¨ ' ( ] ¨ ' 5 ¨ ] ¥ ¦ ¨ ] ¥ ¨ ' ¨ ] e _ R R _ _ d ! _ T H U F V W X D L M V h ¥ R ¨ & ¨ ' ( % £ ¥ h e Y Z Y Z ? < / > ? 0 @ ) * ) + % P f f g , - . / 0 1 Y Z Y < / > ? 0 @ [ A A 0 / \ \ * 6 \ Z 0 - 8 Z : ? 6 < / > ? 0 @ Y Z Y . 0 ? > [ A A 0 / \ \ < / > ? 0 @ 2 ¨ ¨ 3 ¦ 5 ¨ % ¨ ¨ * 6 \ Z 0 - 8 Z : ? 6 \ ¨ ¨ ¨ ' 5 1 ¡ £ ¨ ¨ ¨ ¨ ¨ ' 5 ¨ R ¨ ¨ S + < ¨ ¨ + 6 7 8 9 : ; < / > ? 0 @ ! # < ? A - B / ¨ ¥ ¦ ¨ ¨ + 6 7 8 9 : ; < / > ? 0 @ < ? A - B / C D F G H D I Y Z Y ) ? 0 i Y 0 A : 6 j J K L M N G K H N k ? j : 8 Figure 1. UT Nios datapath Processor, memory and other peripherals are intercon- Most instructions commit in the execute stage. The re- nected using the automatically generated Avalon bus [6]. quired calculations are performed by the arithmetic and This is a synchronous bus defining the interface and proto- logic unit (ALU). The results are stored in the register file col for connecting master and slave components in the sys- and control registers. Since the result of the instruction exe- tem. Memory modules and peripherals in the system can cution may be coming from different sources, such as mem- reside on or off-chip, but the bus logic is always placed on- ory for load instructions or ALU for arithmetic and logic in- chip. structions, the result is selected by a register file write mul- tiplexer. 2.2. UT Nios datapath All UT Nios components were described in Verilog HDL. The current version of UT Nios does not support some optional components, such as instruction and data UT Nios datapath, shown in Figure 1, is organized in caches, multiplication support in hardware, and advanced 4 pipeline stages: fetch (F), decode (D), operand (O), and debug support [3]. execute (X). Prefetch unit fetches instructions from the instruction memory. The address of the next instruction to be fetched is stored in the prefetch program counter (PPC). 3. Performance evaluation If the pipeline stalls, the fetched instruction is temporarily stored in a FIFO buffer. Otherwise, the instruction is for- 3.1. UT Nios benchmark set warded to the decode stage, where it is decoded and ad- dresses of its register operands are calculated. A set of benchmark programs was defined to assess the In the operand stage, immediate operands are formed. performance of UT Nios and compare it with the perfor- Since the instruction word is only 16 bits wide, it is not pos- mance of Altera Nios. Although several benchmark sets for sible to specify a 16-bit immediate constant in a single in- embedded systems already exist, we found most of them struction word. 16-bit constants are formed by preloading to be inappropriate for benchmarking soft-core processors. an 11-bit prefix into the K register using the PFX instruc- Since many of these benchmarks contain computationally tion. An instruction following PFX specifies the remaining intensive programs, which are much more efficiently imple- 5 bits to form a 16-bit immediate operand. Since the Nios ar- mented in custom hardware, it is not likely that a soft-core chitecture supports several different addressing modes, de- processor would be used for such an application. A new set pending on the addressing mode used the operands may be of benchmark programs was defined to measure the perfor- coming from several different sources. Therefore, proper mance of soft-core processors. operands have to be selected from the possible sources. The UT Nios benchmark set is based on MiBench [7], a Operand handling also includes the data forwarding logic. freely available benchmark suite. A subset of the MiBench Branch instructions commit in the operand stage, because benchmarks was selected and augmented with other pro- all of the necessary information is available (through the grams that better characterize the performance of UT Nios. data forwarding logic); this reduces the branch penalty. Since Nios-based systems generally do not include a file- Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) 1530-2075/05 $ 20.00 IEEE system, all benchmarks were adapted to use the input data • 1 MB of 32-bit off-chip SRAM that contains the pro- from memory. gram, the data, and the interrupt vector table. The UT Nios benchmark set is divided into three cate- • Avalon tri-state bridge with default settings gories: test benchmarks, which test the performance of specific architectural features of the processor, toy benchmarks, • UART peripheral with default settings used to down- which are small programs performing simple calculations, load the programs to the development board and com- and application benchmarks that represent the real applica- municate with the program. tions expected to run on the processor. • Timer peripheral with default settings used to measure Test benchmarks include Loops, that tests the perfor- the program execution time. mance of branch instructions, Memory, which performs op- erations on an array residing in memory, Pipeline,which The processor’s instruction and data masters were both tests the data forwarding capabilities of the pipeline imple- connected to the on-chip memory and the tri-state bridge, mentation, and PipelineM, which tests the data forwarding while the data master was also connected to the UART and when instruction operands are in the data memory. timer peripherals. The tri-state bridge was connected to the th Toy benchmarks include Fibo, which calculates the 9 off-chip SRAM. All arbitration priorities were set to 1. Both Fibonacci number using a recursive procedure, Multiply, systems were compiled using Quartus II 4.1 SP2 target- which uses a triply nested loop to multiply two integer ma- ing the Stratix EP1S10F780C6ES FPGA chip, optimizing trices, and QsortInt, a program that sorts an integer array the design for speed, and using the standard fit (highest ef- using the qsort algorithm.

Load more