Design of Parallel Vector/Scalar Floating Point Co-Processor for FPGAs
E. Manikandan et al. / Elixir Comp. Sci. & Engg. 44 (2012) 7490-7493
Available online at www.elixirpublishers.com (Elixir International Journal)
Computer Science and Engineering

Design of parallel vector/scalar floating point co-processor for FPGAs
E. Manikandan and K. Immanuvel Arokia James
Veltech Multitech Dr. Rangarajan Dr. Sakunthala Engineering College

ARTICLE INFO
Article history:
Received: 30 September 2011
Received in revised form: 16 March 2012
Accepted: 24 March 2012
Keywords: Co-Processor, Parallelism, Floating Point Unit, FPGA

ABSTRACT
Current FPGA soft processor systems use dedicated hardware modules or accelerators to speed up data-parallel applications. This work explores an alternative approach: using a soft vector processor as a general-purpose accelerator. Due to the complexity and expense of floating point hardware, these algorithms are usually converted to fixed point operations or implemented using floating-point emulation in software. As technology advances, more and more homogeneous computational resources and fixed-function embedded blocks are added to FPGAs, and hence implementation of floating point hardware becomes a feasible option. In this research we have implemented a high-performance, autonomous floating point vector co-processor (FPVC) that works independently within an embedded processor system. We present a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The FPVC is completely autonomous from the embedded processor, exploiting parallelism and exhibiting greater speedup than alternative vector processors. The FPVC supports scalar computation so that loops can be executed independently of the main embedded processor.

© 2012 Elixir All rights reserved.
Introduction
Reconfigurable hardware bridges the gap between ASICs (Application Specific Integrated Circuits) and microprocessor-based systems. Recently, there has been increased interest in using reconfigurable hardware for multimedia, signal processing and other computationally intensive embedded applications. These applications perform floating point arithmetic for high data accuracy and high performance. Reconfigurable hardware allows the designer to customize the computational units to best match application requirements and, at the same time, optimize device resource utilization. Because of these advantages, extensive research has been done to efficiently implement floating point computations on reconfigurable hardware. Floating point (FP) computations can be categorized in three classes:
1. Software library
2. General purpose floating point unit
3. Application specific floating point data path

Embedded RAMs in FPGAs provide large storage structures. While the capacity of a given block RAM is fixed, multiple block RAMs can be connected through the interconnection network to form larger-capacity RAM storage. A key limitation of block RAMs is that they have only two access ports, allowing just two simultaneous reads or writes. The multiply-accumulate blocks, also referred to as DSP blocks, have dedicated circuitry for performing multiply and accumulate operations. These DSP blocks can also perform addition, subtraction and barrel-shifter functions.

The major FPGA companies provide embedded cores, both hard and soft, for use with their processors. Altera has the Nios II soft core, and Xilinx offers the MicroBlaze soft core and PowerPC hard core on its FPGAs. All these large embedded logic blocks make more efficient use of on-chip FPGA resources. However, they can also waste on-chip resources if they are not being used. In this work, we explore the utilization of these embedded blocks on Xilinx Virtex FPGAs in implementing floating-point operations and vector processing.

Related Work
The implementation of a floating point unit in general purpose computing is extremely common, but it makes an interesting case study for an FPGA-based reconfigurable computing system. Up to now there have been many research efforts applied to the implementation of an FPGA-based floating point unit. This research can be categorized based on the type of communication with the main processor, precision support, number of computation units and level of autonomy.

One of the earliest works in this area was done by Fagin et al. [1], who implemented IEEE-754 standard compliant floating point adder and multiplier functions on an FPGA for design space exploration. They found that the floating point unit substantially improves performance, but technology limitations made it difficult to implement floating point units at that time.

Recently, Pittman et al. [2] have implemented a custom floating point unit (CFPU) which is compliant with the IEEE 754 standard and improves floating-point-intensive computation performance. The CFPU is implemented on the Xilinx Virtex FPGA based eMIPS platform, which is partitioned into fixed and reconfigurable regions. They demonstrated various trade-offs of area, power and speed with the help of software profiling and run-time reconfiguration of the CFPU.

The Xilinx MicroBlaze processor supports a single precision floating point unit in hardware that is tightly coupled to the embedded processor pipeline. The FPU implements addition, subtraction, multiplication and comparison. Floating point division and square root are available as an extension, as well as conversions from integers to floating point and vice versa. If the FPU is not instantiated, floating point operations are emulated in software. The Xilinx PowerPC hardcore processor [3] has a floating point unit available that supports
IEEE-754 floating-point arithmetic operations in single or double precision. Floating point instructions supported include add, subtract, multiply, divide, square root and fused multiply-add. The FPU is tightly coupled to the PowerPC processor core through the Auxiliary Processing Unit (APU) interface [4]. The Xilinx FPU includes 32 floating point registers.

Vector Processing Overview
Most current microprocessors have scalar instruction sets. A scalar instruction set is one that requires a separate opcode and related operand specifiers for every operation to be performed. Vector processors provide vector instructions in addition to scalar instructions. This chapter reviews vector processing in general, defines terms used in the rest of the thesis and lists recent trends in vector execution architecture.

Vector Operation
The code in Figure 2.1 is called the SAXPY/DAXPY loop, which forms the inner loop of the Basic Linear Algebra Subprograms library [42]. In this code, A and Y are vectors and x is a scalar value:

for (i = 0; i < n; i++)
    Y[i] = A[i] * x + Y[i];

Each iteration of the SAXPY/DAXPY loop performs the six steps below:
1. Load an element from vector A
2. Load an element from vector Y
3. Load the scalar value x
4. Multiply A's element with x
5. Add the result of the multiplication to the element of Y
6. Store the result back in vector Y

On a scalar processor, these operations are performed in a loop. A vector processor provides direct instruction set support for operations on whole vectors, i.e., on multiple data elements rather than on a single scalar value.

Vector Lane
The vector lanes of a vector unit are shown in detail in Figure 2.2. Each vector lane has a complete copy of the functional units, a partition of the vector register file, and vector flag registers. All vector lanes receive the same control signals and operate independently, without communication, for most vector instructions. With more vector lanes, a fixed-length vector can be processed in fewer cycles, improving performance. A processor that supports vector operations through parallel lanes is generally known as a Single Instruction Multiple Data (SIMD) processor. Many popular microprocessors have extended their instruction set architecture (ISA) to support SIMD instructions, such as Intel SSE and MMX and the PowerPC SIMD extensions.

Design of FPVC
This section describes the design and implementation of the Floating Point Vector Co-processor (FPVC) targeting Xilinx FPGAs. Three key features distinguish our work in floating-point architecture: a unified approach to scalar and vector processing, support for a different latency in each functional unit, and simplicity of organization. The initial section of this chapter provides the overall architecture of the processor. The next section describes the Instruction Set Architecture (ISA) features, whereas the unified vector core is described in the following sections. Finally, FPGA-specific novel inter-lane communication features, vector compression and expansion, and configurable parameters are described.

Floating Point Vector Co-processor Architecture
The Floating Point Vector Co-processor (FPVC) is a configurable soft-core vector processor architecture developed specifically for FPGAs. It leverages the configurability of FPGAs to provide many parameters to configure the processor for a specific application for the desired performance and area.
A vector instruction specifies the operand vectors, a vector length, and an operation to be applied element-wise to these vector operands. Assuming the vector length is equal to the

The FPVC instruction set supports both scalar and vector instructions with unified register files and execution units. Instruction set features are heavily borrowed from