Review of Parallel Computing Methods and Tools for FPGA Technology

Radoslaw Cieszewski, Maciej Linczuk, Krzysztof Pozniak, Ryszard Romaniuk
Institute of Electronic Systems, Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw, Poland

May 20, 2013

ABSTRACT

Parallel computing is emerging as an important area of research in computer architectures and software systems. Many algorithms can be greatly accelerated using parallel computing techniques. Specialized parallel computer architectures are used for accelerating specific tasks. The measuring systems of High-Energy Physics Experiments often use FPGAs for fine-grained computation. An FPGA combines many benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible and can be reconfigured over the lifetime of the system. FPGAs therefore have the potential to achieve far greater performance than software, as a result of bypassing the fetch-decode-execute cycle of traditional processors and possibly exploiting a greater level of parallelism. Creating parallel programs implemented in FPGAs is not trivial. This paper presents existing methods and tools for fine-grained computation implemented in FPGAs using Behavioral Description and High Level Programming Languages.

Keywords: Parallel Computing, Algorithmic Synthesis, High-Level Synthesis, Behavioral Synthesis, Electronic System Level Synthesis, FPGA, ASIC, DSP, High-Energy Physics Experiment, KX1, Fine-Grained Parallelism, Coarse-Grained Parallelism

1. INTRODUCTION

Parallel computing is a form of computation in which many arithmetical, logical and input/output operations are processed simultaneously. A parallel computer is based on the principle that large problems can be divided into subtasks, which are then solved concurrently. There are many classes of parallel computers:

- Symmetric Multiprocessing (SMP),
- Multicore Computing,
- Grid Computing,
- Cluster Computing,
- Parallel Computing on Graphics Processing Units (GPU),
- Reconfigurable Computing on FPGAs,
- Parallel Computing on ASICs.

An FPGA has a set of hardware resources, including logic and routing, whose function is controlled by on-chip configuration SRAM. Programming the SRAM, either at the start of an application or during execution, allows the hardware functionality to be configured and reconfigured. This approach makes it possible to implement different algorithms and applications on the same hardware. The principal difference, compared to traditional microprocessors, is the ability to make substantial changes to logic and routing. This technique is termed "time-multiplexed hardware" or "run-time reconfigured hardware". In High-Energy Physics Experiments, FPGAs are widely used for accelerating diagnostic algorithms [26, 28, 30, 32–35, 45–48]. The main goal of this article is to review efficient FPGA development tools with High Level Programming Languages for the High-Energy Physics Experiment domain.

Outline

The remainder of this article is organized as follows. Section 2 gives a theoretical base for parallel computing. The comparison of different tools and methods is presented in Section 3. Finally, Section 4 gives the conclusions.

2. BACKGROUND

Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on the central processing unit of one computer; only one instruction may be executed at a time, and after that instruction is finished, the next is executed.

Parallel computing uses multiple resources simultaneously to solve a problem. This is accomplished by decomposing the algorithm into independent parts, so that each part can be executed simultaneously with the others. Such parts of a parallel program are often called fibers, threads, processes or tasks. If the tasks communicate many times per second, the algorithm exhibits fine-grained parallelism; if they communicate fewer times per second, it exhibits coarse-grained parallelism. FPGAs are very flexible and can be programmed to accelerate both fine-grained and coarse-grained applications. FPGAs are also scalable: algorithms that are too complex to fit into a single FPGA are implemented on multi-FPGA systems. From the high-level view shown in Figure 1, the FPGA looks much like a network of processors [49].

[Figure 1. High-Level FPGA Abstraction: an array of processing units connected by a configurable interconnect, with input/output blocks at the periphery.]

An FPGA, however, differs from a conventional multiprocessor in several ways:

- granularity: FPGAs have single-bit processing elements, each of which is controlled independently,
- instruction control: FPGAs are configured with a single instruction resident per processing element,
- interconnect: the FPGA interconnect is dynamic and can be reconfigured over time.

2.1 Amdahl's law and Gustafson's law

Amdahl's law defines the maximum possible speed-up of an algorithm on a parallel computer; Gustafson's law defines the speed-up with P processors when the problem size scales with the machine:

S = \frac{1}{\alpha} = \lim_{P \to \infty} \frac{1}{\alpha + \frac{1 - \alpha}{P}}    (1)

S(P) = P - \alpha(P - 1) = \alpha + P(1 - \alpha)    (2)

where \alpha is the fraction of running time a program spends on non-parallelizable parts and P is the number of processors. These two formulas show that the achievable acceleration depends on the algorithm. Because the sequential parts of the code cannot be parallelized, the speed-up predicted by Amdahl's law is bounded by 1/\alpha, independent of the number of processors. Gustafson's law instead assumes that the parallel portion of the workload grows linearly with the number of processors.
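To make the difference between the two laws concrete, the following minimal C sketch (not from the paper; the serial fraction alpha = 0.1 and the processor counts are illustrative values chosen for this example) evaluates both formulas side by side:

#include <stdio.h>

/* Amdahl's law: speed-up of a fixed-size problem on p processors,
 * where alpha is the non-parallelizable fraction of the running time. */
static double amdahl(double alpha, double p) {
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

/* Gustafson's law: speed-up when the parallel portion of the workload
 * scales with the number of processors. */
static double gustafson(double alpha, double p) {
    return p - alpha * (p - 1.0);
}

int main(void) {
    const double alpha = 0.1;   /* illustrative: 10% serial code */
    for (int p = 1; p <= 1024; p *= 4)
        printf("P=%4d  Amdahl S=%5.2f  Gustafson S=%7.2f\n",
               p, amdahl(alpha, p), gustafson(alpha, p));
    return 0;
}

With a 10% serial fraction, the Amdahl speed-up saturates just below 1/alpha = 10 regardless of P, while the Gustafson speed-up keeps growing with P; the two laws answer different questions (fixed problem size versus problem size scaled with the machine).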
2.2 Parallelism level

FPGA-based computers can exploit parallelism:

- at the bit level,
- through pipelining,
- at the instruction level,
- at the data level,
- at the task level.

The main advantage over conventional microprocessors is the ability to perform calculations at the lowest level of parallelism, the bit level. According to Flynn's taxonomy, FPGAs are classified as Multiple Instruction Multiple Data (MIMD) machines.

2.3 Memory and communication architectures

Parallel computing defines two main memory models:

- Shared Memory,
- Distributed Memory.

Shared memory refers to a block of memory that can be accessed by several different processing units of a parallel computer. Such memory may be accessed simultaneously by multiple tasks, either to provide communication among them or to avoid redundant copies, and it is an efficient means of passing data between tasks. In distributed memory systems, each processing unit has its own local memory and computations are performed on local data. If remote data is required, the subtask must communicate with the remote processing units; engaging remote processors in this operation adds some overhead. The two models are contrasted in the sketch below.
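As a minimal software analogy (not from the paper), the following C sketch computes the same array sum under both models: POSIX threads stand in for shared-memory processing units, while a forked child process that returns its partial result over a pipe stands in for a distributed-memory node with explicit communication.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static long data[1000];
static long partial[2];   /* shared result slots, visible to all threads */

/* Shared-memory model: each thread sums half of an array that every
 * thread can address directly, and writes its result straight into a
 * shared slot; no explicit communication is needed. */
static void *sum_half(void *arg) {
    int id = *(int *)arg;
    long s = 0;
    for (int i = id * 500; i < (id + 1) * 500; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < 1000; i++)
        data[i] = i;

    /* Shared memory: two threads, one address space. */
    pthread_t t[2];
    int id[2] = {0, 1};
    for (int k = 0; k < 2; k++)
        pthread_create(&t[k], NULL, sum_half, &id[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL);
    printf("shared-memory sum:      %ld\n", partial[0] + partial[1]);

    /* Distributed memory: after fork() the child has its own copy of
     * the data; its partial sum must be sent back explicitly over a
     * pipe, which models the communication overhead of remote access. */
    int fd[2];
    if (pipe(fd) != 0)
        return 1;
    if (fork() == 0) {                     /* child: "remote" node  */
        long s = 0;
        for (int i = 500; i < 1000; i++)
            s += data[i];
        write(fd[1], &s, sizeof s);        /* explicit message pass */
        _exit(0);
    }
    long local = 0, remote = 0;
    for (int i = 0; i < 500; i++)
        local += data[i];
    read(fd[0], &remote, sizeof remote);   /* receive remote result */
    printf("distributed-memory sum: %ld\n", local + remote);
    return 0;
}

Both variants print the same total; the difference is that the threaded version communicates through the shared address space itself, while the forked version pays for an explicit transfer, mirroring the overhead described above.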
Hybrid architectures combine these two main approaches: each processing element has its own local memory as well as access to the memory of remote processors. The system architecture influences the choice of the parallel programming models and tools presented in Section 3.

2.4 Recent trends

2.4.1 Recent trend of FPGA technologies

FPGA technology has improved dramatically since the 1980s. Recent trends in FPGA technology are presented in Figure 2 [45].

[Figure 2. Recent Trend of FPGA Technologies: five panels charting, for Altera, Xilinx, Actel and Lattice devices over the years 2001-2013, the number of logic cells (count), the number of DSP blocks (count), SRAM capacity (kB), I/O block speed (LVDS, Mb/s) and SERDES speed (Mb/s).]

During the last few years, the number of CLBs, the number of DSP blocks and the SRAM capacity have increased enormously.