
Design of Parallel Vector/Scalar Floating Point Co-Processor for FPGAs



Available online at www.elixirpublishers.com (Elixir International Journal)

Computer Science and Engineering
Elixir Comp. Sci. & Engg. 44 (2012) 7490-7493

E. Manikandan and K. Immanuvel Arokia James
Veltech Multitech Dr. Rangarajan Dr. Sakunthala Engineering College

ARTICLE INFO

Article history:
Received: 30 September 2011;
Received in revised form: 16 March 2012;
Accepted: 24 March 2012;

Keywords:
Co-Processor,
Parallelism,
Floating Point Unit,
FPGA.

ABSTRACT

Current FPGA soft processor systems use dedicated hardware modules or accelerators to speed up data-parallel applications. This work explores an alternative approach of using a soft vector processor as a general-purpose accelerator. Due to the complexity and expense of floating point hardware, floating point computations are usually converted to fixed point operations or implemented using floating-point emulation in software. As the technology advances, more and more homogeneous computational resources and fixed function embedded blocks are added to FPGAs, and hence implementation of floating point hardware becomes a feasible option. In this research we have implemented a high performance, autonomous floating point vector co-processor (FPVC) that works independently within an embedded processor system. We have presented a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The FPVC is completely autonomous from the embedded processor, exploiting parallelism and exhibiting greater speedup than alternative vector processors. The FPVC also supports scalar computation so that loops can be executed independently of the main embedded processor.

© 2012 Elixir All rights reserved.

Introduction

Reconfigurable hardware bridges the gap between ASICs (Application Specific Integrated Circuits) and processor-based systems. Recently, there has been an increased interest in using reconfigurable hardware for multimedia, signal processing and other computationally intensive embedded processing applications. These applications perform floating point arithmetic computation for high data accuracy and high performance. Reconfigurable hardware allows the designer to customize the computational units in order to best match application requirements and, at the same time, optimize device resource utilization. Because of these advantages, extensive research has been done to efficiently implement floating point computations on reconfigurable hardware. Floating point (FP) computations can be categorized in three classes:
1. Software library
2. General purpose floating point unit
3. Application specific floating point data path

Embedded RAMs in FPGAs provide large storage structures. While the capacity of a given block RAM is fixed, multiple block RAMs can be connected through the interconnection network to form larger capacity RAM storage. A key limitation of block RAMs is that they have only two access ports, allowing just two simultaneous reads or writes. The multiply-accumulate blocks, also referred to as DSP blocks, have dedicated circuitry for performing multiply and accumulate operations. These DSP blocks can also perform addition, subtraction and logic functions.

The major FPGA companies provide embedded cores, both hard and soft, for use with their processors. Altera has the Nios II soft core, and Xilinx offers the MicroBlaze soft and PowerPC hard cores on their FPGAs. All these large embedded logic blocks make more efficient use of on-chip FPGA resources. However, they can also waste on-chip resources if they are not being used. In this work, we explore the utilization of these embedded blocks on Virtex FPGAs in implementing floating-point operations and vector processing.

Related Work

The implementation of a floating point unit in general purpose computing is extremely common, but it makes an interesting case study for an FPGA based reconfigurable computing system. Up to now there have been many research efforts applied to the implementation of an FPGA based floating point unit. This research can be categorized based on the type of communication with the main processor, precision support, number of computation units and level of autonomy.

One of the earliest works in this area was done by Fagin et al. [1]. They implemented IEEE-754 standard compliant floating point adder and multiplier functions on the FPGA for design space exploration. They found that the floating point unit substantially improves performance, but technology limitations made it difficult to implement floating point units at that time.

Recently, Pittman et al. [2] have implemented a custom floating point unit (CFPU) which is compliant with the IEEE 754 standard and improves floating-point intensive computation performance. The CFPU is implemented on the Xilinx Virtex FPGA based eMIPS platform, which is partitioned into fixed and reconfigurable regions. They demonstrated various trade-offs for area, power and speed with the help of software profiling and run time reconfiguration for the CFPU.

The Xilinx MicroBlaze processor supports a single precision floating point unit in hardware that is tightly coupled to the embedded processor pipeline. The FPU implements addition, subtraction, multiplication and comparison. Floating point division and square root are available as an extension, as well as conversions from integers to floating point and vice versa. If the FPU is not instantiated, floating point operations are emulated in software.

The Xilinx PowerPC hard core processor [3] has a floating point unit available that supports IEEE-754 floating-point arithmetic operations in single or double precision. Floating point instructions supported include add, subtract, multiply, divide, square root and fused multiply-add instructions. The FPU is tightly coupled to the PowerPC processor core with the Auxiliary Processing Unit (APU) interface [4]. The Xilinx FPU includes 32 floating point registers.
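When no FPU is instantiated, as noted above, designers typically fall back on software emulation or on fixed point arithmetic. As a point of reference for the fixed point alternative, the following C sketch shows a Q16.16 multiply-accumulate; the format choice and helper names are our own illustration and are not part of the FPVC design.

    #include <stdint.h>

    /* A minimal sketch of the fixed point fallback discussed above.
     * Q16.16 (16 integer bits, 16 fractional bits) is assumed purely
     * for illustration; the type and helper names are hypothetical. */
    typedef int32_t q16_16;

    static inline q16_16 q_from_float(float f) { return (q16_16)(f * 65536.0f); }
    static inline float  q_to_float(q16_16 q)  { return (float)q / 65536.0f; }

    /* Multiply in a 64-bit intermediate, then shift back to Q16.16. */
    static inline q16_16 q_mul(q16_16 a, q16_16 b)
    {
        return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
    }

    /* y = a * x + y, carried out entirely in integer arithmetic. */
    static inline q16_16 q_madd(q16_16 a, q16_16 x, q16_16 y)
    {
        return q_mul(a, x) + y;
    }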


Vector Processing Overview

Most current microprocessors have scalar instruction sets. A scalar instruction set is one which requires a separate opcode and related operand specifiers for every operation to be performed. Vector processors provide vector instructions in addition to scalar instructions. This section reviews vector processing in general, defines terms used in the rest of the paper and lists recent trends in vector execution architecture.

Vector Operation

The code below (Figure 2.1) is called the SAXPY/DAXPY loop, which forms the inner loop of the Basic Linear Algebra Subprograms (BLAS) library.

    for (i = 0; i < n; i++)
        Y[i] = A[i] * x + Y[i];

In this code, A and Y are vectors and x is a scalar value. Each iteration of the SAXPY/DAXPY loop performs the six steps below:
1. Load an element from vector A
2. Load an element from vector Y
3. Load the scalar value x
4. Multiply A's element by x
5. Add the result of the multiplication to the element of Y
6. Store the result back in vector Y

For a scalar processor, these operations will be performed in a loop. A vector processor provides direct instruction set support for operations on whole vectors, i.e., on multiple data elements rather than on a single scalar value. A vector instruction specifies operand vectors, a vector length, and an operation to be applied element-wise to these vector operands. Assuming the vector length is equal to the number of elements in each vector register, the SAXPY/DAXPY operation can be performed with just six instructions. Thus, vector operations reduce the dynamic instruction bandwidth requirement.

Vector Memory and Vector Register Architecture

There are two main classes of vector architecture: vector memory architecture and vector register architecture. In a vector memory architecture, such as the CDC STAR-100, operands are read from memory and the results of the operation performed on the operands are stored back into memory. In a vector register architecture, such as the Cray series, operands are read from vector registers and the results of the operation performed on the operands are stored back in the vector registers. Vector memory architecture has a higher memory bandwidth requirement than vector register architecture.

Vector Length Control

A vector processor has a natural length determined by the number of elements in each vector register, which is called the Maximum Vector Length (MVL). It is highly unlikely that a given program will have a vector length that equals the MVL. The size of all the vector operations for SAXPY/DAXPY depends on "n", which may not be known until run time. The solution is to create a vector-length register (VLR). The VLR controls the length of any vector operation. However, the value of the VLR cannot be greater than the natural vector length of the processor.
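The interaction of the MVL and the VLR is commonly expressed through strip-mining, where a loop of arbitrary length n is processed in chunks of at most MVL elements. The sketch below illustrates this for the SAXPY loop; the vec type and the intrinsic-style calls (set_vlr, vload, vmuls, vadd, vstore) are hypothetical stand-ins modeled in software, not actual FPVC mnemonics.

    #define MVL 16                      /* example hardware maximum vector length */
    static int vlr = MVL;               /* software model of the vector length register */
    static void set_vlr(int vl) { vlr = vl; }

    /* Software models of hypothetical vector instructions, each honoring the VLR. */
    typedef struct { float e[MVL]; } vec;
    static vec  vload(const float *p) { vec v; for (int k = 0; k < vlr; k++) v.e[k] = p[k]; return v; }
    static void vstore(float *p, vec v) { for (int k = 0; k < vlr; k++) p[k] = v.e[k]; }
    static vec  vmuls(vec a, float s)  { for (int k = 0; k < vlr; k++) a.e[k] *= s; return a; }
    static vec  vadd(vec a, vec b)     { for (int k = 0; k < vlr; k++) a.e[k] += b.e[k]; return a; }

    /* Strip-mined SAXPY: Y[i] = A[i]*x + Y[i], processed in strips of at most MVL. */
    void saxpy_stripmine(int n, const float *a, float x, float *y)
    {
        for (int i = 0; i < n; i += MVL) {
            int vl = (n - i < MVL) ? (n - i) : MVL;  /* last strip may be shorter */
            set_vlr(vl);                  /* VLR bounds every following vector op */
            vec va = vload(&a[i]);        /* load a strip of A */
            vec vy = vload(&y[i]);        /* load a strip of Y */
            vy = vadd(vmuls(va, x), vy);  /* element-wise multiply-add */
            vstore(&y[i], vy);            /* store the strip back to Y */
        }
    }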
Vector Lane

The vector lanes of a vector unit are shown in detail in Figure 2.2. Each vector lane has a complete copy of the functional units, a partition of the vector register file and vector flag registers. All vector lanes receive the same control signals and operate independently without communication for most vector instructions. With more vector lanes, a fixed-length vector can be processed in fewer cycles, improving performance. A processor which supports vector operations through parallel lanes is generally known as a Single Instruction Multiple Data (SIMD) processor. Many popular microprocessors have extended their instruction set architecture (ISA) to support SIMD instructions, such as SSE, MMX and PowerPC AltiVec.

Design of FPVC

This section describes the design and implementation of the Floating Point Vector Co-processor (FPVC) targeting Xilinx FPGAs. Three key features distinguish our work in floating-point architecture: a unified approach to scalar and vector processing, support for a different latency in each functional unit, and simplicity of organization. The initial part of this section provides the overall architecture of the processor. The next part describes the Instruction Set Architecture (ISA) features, whereas the unified vector core is described in the following parts. Finally, FPGA specific novel inter-lane communication features, vector compression and expansion, and configurable parameters are described.

Floating Point Vector Co-processor Architecture

The Floating Point Vector Co-processor (FPVC) is a configurable soft-core vector processor architecture developed specifically for FPGAs. It leverages the configurability of FPGAs to provide many parameters to configure the processor for a specific application for the desired performance and area. The FPVC instruction set supports both scalar and vector instructions with unified register files and execution units. Instruction set features are heavily borrowed from the instruction sets of VIRAM and RISC processors such as PowerPC and MicroBlaze. Currently the FPVC does not support certain bit manipulation instructions, but it adds new instructions to take advantage of DSP functionality and embedded memories.

FPVC: The Vector Co-processor

Some of the key design features of the Floating Point Vector Core are listed below; a conceptual model of the unified register file follows the list.
• Completely autonomous from the main processor
• Supports single precision and 32-bit integer arithmetic operations
• 4 stage RISC pipeline for integer arithmetic and memory access
• Variable length RISC pipeline for floating point arithmetic
• Unified vector/scalar general purpose register file
• Supports a modified Harvard style memory architecture where there are separate level 1 instruction and data RAMs but unified level 2 memory
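As a way to picture the unified vector/scalar register file named in the feature list, the following C model treats a scalar operand as element 0 of a vector register. This is purely a conceptual model under assumed parameters (32 registers, an MVL of 16); it does not depict the FPVC's actual RTL organization.

    #include <stdint.h>

    /* Conceptual model of the unified vector/scalar register file:
     * one bank of 32 registers, each MVL words wide; a scalar operand
     * is simply element 0 of a register.  Sizes and the element-0
     * convention are illustrative assumptions, not the FPVC's RTL. */
    #define NUM_REGS 32
    #define MVL      16

    typedef union {
        uint32_t word[MVL];   /* 32-bit integer view of the elements */
        float    flt[MVL];    /* single-precision view               */
    } reg_t;

    static reg_t rf[NUM_REGS];                /* the unified register file */

    static float read_scalar(int r)           { return rf[r].flt[0]; }
    static float read_elem(int r, int i)      { return rf[r].flt[i]; }
    static void  write_elem(int r, int i, float v) { rf[r].flt[i] = v; }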

Local Instruction and Data RAM

The FPVC implements a modified Harvard style memory system architecture. The FPVC's memory system is divided into two levels: main memory and local memory. Main memory is connected to the FPVC through the master port of the system interface, whereas local memory sits in between. Apart from our approach, many options exist for connecting the FPVC to the main memory, such as through unified memory, separate instruction and data caches, or a direct connection to main memory. For off-chip memory, caches are used to hide the memory latencies, but for streaming applications in the area of embedded and scientific computing this may not hold. This range of different memory system configurations could be interesting to explore in the future.

System Bus Interface

The FPVC has one slave (SLV) port for communicating with the main processor and one master (MST) port for main memory accesses. The system bus interface for the master and slave ports is not restricted to a specific bus protocol. The slave port interface can be connected to any type of bus, including point-to-point and shared buses. In the current design implementation, we have implemented the Processor Local Bus (PLB) [3] as the system bus interface. The PLB can be configured for a 32-bit, 64-bit or 128-bit interface. Data alignment is also done in the system bus interface. The master port of the system bus interface includes a Direct Memory Access (DMA) controller to provide a software prefetch mechanism for the vector core. DMA transfer setup includes the three steps below (illustrated in the sketch that follows):
1. Write the source/destination address in main memory into the global address register.
2. Write the destination/source address in local memory into the local address register.
3. Configure the DMA transfer parameters: the direction of the DMA, the size of the DMA, and which local memory is involved in the transfer. The state of the DMA transfer is provided in the configuration register.
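From software, the three-step setup above might look like the following sketch. All register names, addresses and the bit layout are our assumptions for illustration only; they are not taken from the FPVC design.

    #include <stdint.h>

    /* Hypothetical memory-mapped DMA registers on the FPVC master port. */
    #define FPVC_GLOBAL_ADDR (*(volatile uint32_t *)0x80000000u) /* main-memory address    */
    #define FPVC_LOCAL_ADDR  (*(volatile uint32_t *)0x80000004u) /* local-memory address   */
    #define FPVC_DMA_CTRL    (*(volatile uint32_t *)0x80000008u) /* config/status register */

    #define DMA_DIR_READ  (0u << 31)  /* main memory -> local memory          */
    #define DMA_DIR_WRITE (1u << 31)  /* local memory -> main memory          */
    #define DMA_MEM_DATA  (1u << 30)  /* select the local data RAM            */
    #define DMA_BUSY      (1u << 29)  /* set while a transfer is in progress  */

    /* Prefetch 'bytes' of main memory into local data RAM (steps 1-3).
     * The size is assumed to occupy the low bits of the control word. */
    static void fpvc_dma_prefetch(uint32_t main_addr, uint32_t local_addr, uint32_t bytes)
    {
        FPVC_GLOBAL_ADDR = main_addr;                           /* step 1 */
        FPVC_LOCAL_ADDR  = local_addr;                          /* step 2 */
        FPVC_DMA_CTRL    = DMA_DIR_READ | DMA_MEM_DATA | bytes; /* step 3 */
        while (FPVC_DMA_CTRL & DMA_BUSY)   /* state read back from the      */
            ;                              /* configuration register        */
    }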
Vector-Scalar ISA

The goals of the Vector-Scalar ISA (Figure 3.1) design are to be flexible and scalable and to provide a simple architecture suitable for a wide range of floating point applications. Due to the trade-offs between performance and design complexity, we selected to design a new ISA which is inspired by RISC instruction architectures such as the Power ISA [3], the MicroBlaze ISA [2] and the VIRAM vector ISA.

Vector-Scalar Pipeline

The FPVC is based on the classic dynamically scheduled (in-order issue and out-of-order completion) RISC pipeline. The four stages of the pipeline are Instruction Fetch, Decode, Execution and Write Back. The pipeline is intentionally kept short so integer vector instructions can complete in a small number of cycles, to eliminate the need for forwarding logic and to reduce area. Due to the short pipeline, floating point instructions spend most of their time in the floating point unit, which optimizes the overall execution latency. As both scalar and vector instructions are executed from the same instruction pipeline, both types of instructions can be freely mixed in the program execution and stored in the same local instruction memory.

As shown in Figure 3.2, the fetch and decode stages are common to all instructions. The instruction fetch (IF) stage is used to access the instruction. During the IF stage, the next instruction address is determined and sent to the local instruction RAM, which returns the instruction by the end of the cycle. We assume that all instructions fit in the local instruction RAM; therefore its size puts a limit on the size of the program. An alternative approach is an instruction cache, which would support larger programs at the cost of increased complexity. The fetch unit of the pipeline always assumes that branches are not taken and fetches an instruction from the next instruction address. Hence, the branch penalty for the vector co-processor will be two cycles.

During the decode stage (ID), instructions pass through three steps, all performed in a single clock cycle. In the first step, the instruction is decoded and data hazard checkers perform checks to find out whether the current instruction's source and/or destination registers are in flight. Due to dynamic scheduling, not only RAW (Read after Write) hazards exist, but the data hazard checker must also check for WAW (Write after Write) and WAR (Write after Read) hazards. Until all hazards are cleared, the Instruction Decode unit stalls the next instruction fetch.

Once all hazards are cleared, the instruction enters the vector state machine. If the instruction is a scalar instruction, then the vector state machine issues the instruction to the execution unit and the newly fetched instruction enters the decode stage in the next clock cycle. If it is a vector instruction, the instruction will be repeatedly issued to the execution unit based on the maximum vector length stored in the MVL register. As the instruction encoding can address 32 short vector registers as source/destination registers, the vector address register together with the source/destination register references the current instruction's operand data. Finally, in the last step, when the decode stage issues an instruction to the execution unit, it updates the scoreboard to keep track of in-flight instructions for the data hazard checks. Once the instruction reaches the end of the execution unit, the result is written back to the registers.
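The hazard conditions in the decode-stage discussion above can be summarized in a short behavioral sketch. This models generic scoreboard-style RAW/WAW/WAR detection, simplified to one outstanding reader and writer per register; it is not the FPVC's actual decode logic.

    #include <stdbool.h>

    #define NUM_VREGS 32

    static bool pending_write[NUM_VREGS]; /* destination of an in-flight instruction */
    static bool pending_read[NUM_VREGS];  /* source of an in-flight instruction      */

    /* An instruction may issue only if none of these conflicts exist. */
    static bool hazard(int src1, int src2, int dst)
    {
        if (pending_write[src1] || pending_write[src2]) return true; /* RAW */
        if (pending_write[dst])                         return true; /* WAW */
        if (pending_read[dst])                          return true; /* WAR */
        return false;
    }

    /* At issue: record the registers as in flight (scoreboard update). */
    static void issue(int src1, int src2, int dst)
    {
        pending_read[src1] = pending_read[src2] = true;
        pending_write[dst] = true;
    }

    /* At write-back: clear the scoreboard entries for this instruction. */
    static void complete(int src1, int src2, int dst)
    {
        pending_read[src1] = pending_read[src2] = false;
        pending_write[dst] = false;
    }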

Experimental Setup

We have tested the FPVC for correctness as well as for area usage and speed. The FPVC is implemented in VHDL and synthesized using Xilinx ISE 10.1 CAD tools targeting Virtex-5 FPGAs. We have compared various FPVC configurations against a hard processor (PowerPC 440 with the Xilinx FPU, Figure 4.1) using various linear algebra kernels. We have also compared area and speed for each configuration of the FPVC.

Conclusion

The completely autonomous floating point vector/scalar co-processor exhibits speedup on important linear algebra kernels when compared to the implementation used by most practitioners: the embedded processor FPU provided by Xilinx. The FPVC is easier to implement at the cost of a decrease in performance compared to a custom hardware implementation. Hence the FPVC occupies a middle ground in the range of designs that make use of floating point. The FPVC is completely autonomous; thus, the PowerPC can be doing independent work while the FPVC is computing floating point solutions. The FPVC is configurable at design time: the number of lanes, the size of vectors and the local memory sizes can all be configured to fit the application.

References

[1] B. Fagin and C. Renard, "Field Programmable Gate Arrays and Floating Point Arithmetic", IEEE Transactions on VLSI Systems, Volume 2, September 1994.
[2] Z. Jin, R. N. Pittman, and A. Forin, "Reconfigurable Custom Floating-Point Instructions", Microsoft Research, no. MSR-TR-2009-157, August 2009.
[3] "Embedded Processor Block in Virtex-5 FPGAs." http://www.xilinx.com/support/documentation/user_guides/ug200.pdf, 2009.
[4] "Virtex-5 APU Floating-Point Unit v1.01a." http://www.xilinx.com/support/documentation/ip_documentation/apu_fpu_virtex5.pdf, April 2009. Data Sheet DS693.
[5] C. Brunelli, F. Garzia, J. Nurmi, et al., "A FPGA Implementation of an Open-Source Floating Point Computation System", International Symposium on System-on-Chip, 29-32, 2005.
[6] T. Rodolfo, N. L. V. Calazans, et al., "Floating Point Hardware for Embedded Processors in FPGAs: Design Space Exploration for Performance and Area", International Conference on Reconfigurable Computing and FPGAs, 2009.
[7] "MicroBlaze Processor Reference Guide." http://www.xilinx.com/support/documentation/sw_manuals/mb_reference_guide.pdf, 2008.