High-Performance Reconfigurable Computing
Total Page:16
File Type:pdf, Size:1020Kb
High-Performance Reconfigurable Computing Tarek El-Ghazawi Director, Institute for Massively Parallel Applications and Computing Technology (IMPACT) Co-Director, NSF Center for High-Performance Reconfigurable Computing (CHREC) The George Washington University ICFPT07 12/11/07 1 Acknowledgements ARSC, AMI, Cray, DoD, HPTi, NASA, NSF/CHREC, SGI, SRC, Star Bridge, Xtreme Data, many others ICFPT07 12/11/07 2 1 Outline Architectures and Systems Tools and Programming Applications Performance Wrap-up ICFPT07 12/11/07 3 Reconfigurable Supercomputing (RSC) Efficient high performance computing using parallel and distributed systems of both reconfigurable hardware resources and conventional microprocessors This tutorial establishes the current status, the direction taken, and the potential for RSC ICFPT07 12/11/07 4 2 Top 500 Supercomputers Rank Site Computer Processors Year Rmax Rpeak eServer Blue DOE/NNSA/LLNL Gene Solution 1 United States 212992 2007 478200 596378 IBM Forschungszentrum Blue Gene/P 2 Juelich (FZJ) Solution 65536 2007 167300 222822 Germany IBM SGI/New Mexico SGI Altix ICE Computing Applications 8200, Xeon quad 3 Center (NMCAC) core 3.0 GHz 14336 2007 126900 172032 United States SGI Cluster Platform Computational Research 3000 BL460c, Laboratories, TATA Xeon 53xx 3GHz, 4 SONS 14240 2007 117900 170880 Infiniband India HP Cluster Platform 3000 BL460c, Government Agency Xeon 53xx 5 Sweden 2.66GHz, 13728 2007 102800 146430 Infiniband HP ICFPT07 12/11/07 5 Reconfigurable Computers The microchip that rewires itself Scientific American – June 1997 0 Computers that modify their hardware circuits as they operate are opening a new era in computer design. 0 Reconfigurable computers architecture is based on FPGAs (Field Programmable Gate Arrays) Source: [Sci97] ICFPT07 12/11/07 6 3 Execution Model for HPRCs μP •Transfer of Control •Input Data RP PC •Output Data Piplines, Systolic Arrays, SIMD, ... •Transfer of Control Fine grain computations with the RP, others with the MP Interaction between RP and MP can be blocking or asynhronous This scenario is replicated across the whole system and standard HPC parallel programming paradigms used for interactions ICFPT07 12/11/07 7 Synergism between μPand RPs µP RP(FPGA-based) Processing Style Software→Control Flow Hardware→Data Flow (von Neumann) Temporal – reuse of Spatial – Unfolding fixed hardware parallel operations with changeable hardware Parallelism Exploited Coarse-Grain Fine-Grain Clocking Rate Very Fast Relatively Slow Saturating Rate Increasing Speed Applications Relatively Easy Partitioning (S.W./Parallel Harder Programming) Commercial COTS, multipurpose COTS, multipurpose Availability ICFPT07 12/11/07 8 4 Capacity Trends Virtex-5 550 MHz 24M gates* Virtex-II Pro Virtex-4 450 MHz 500 MHz 8M gates* 16M gates* Virtex-II 450 MHz Spartan-3 8M gates 326 MHz Virtex-E 5M gates 240 MHz 4M gates Virtex XC4000 200 MHz Spartan-II 100 MHz 1M gates 200 MHz 250K gates 200K gates Spartan Xilinx Device Complexity 80 MHz XC3000 XC5200 40K gates 85 MHz 50 MHz 7.5K gates 23K gates XC2000 50 MHz 1K gates 1985 1987 1991 1995 1998 1999 2000 2002 2003 2004 2006 Year Source: http://class.ece.iastate.edu/cpre583/lectures/Lect-01.ppt ICFPT07 12/11/07 9 WHAT’S NEW IN THE VIRTEX-5 FPGA FAMILY ICFPT07 12/11/07 10 5 The Design Cycle (That we want you to avoid !) Design and implement a simple encryption Specification unit with RC5 cipher with fixed key Functional simulation Library IEEE; use ieee.std_logic_1164.all; use ieee.std_logic_unsigned.all; entity RC5_core is port( clock, reset, encr_decr: in std_logic; HDL (Hardware data_input: in std_logic_vector(31 downto 0); data_output: out std_logic_vector(31 downto 0); out_full: in std_logic; key_input: in std_logic_vector(31 downto 0); Description Language) key_read: out std_logic; ); end RC5_core; model Synthesis Post-synthesis simulation netlist ICFPT07 12/11/07 11 The Design Cycle (That we want you to avoid !) Implementation (Mapping, Placing & Routing) Timing simulation Downloading and Testing On board testing ICFPT07 12/11/07 12 6 General Architecture of an FPGA-Based Board CLK Processing Processing Processing Element Element Element (PE#0) (PE#1) (PE#N-1) I/O CARD BUS LOCAL LOCAL LOCAL MEMORY MEMORY MEMORY BUS INTERFACE CONTROLLER COMMON MEMORY / INTERCONNECT NETWORK ICFPT07 12/11/07 13 Reconfigurable Computing Boards (Accelerators) Boards may have one or several interconnected FPGA chips Support different bus standards, e.g. PCI, PCI-X, VME May have direct real-time data I/O through a daughter board Boards may have local onboard memory (OBM) to handle large data while avoiding the system bus (e.g. PCI) bottleneck ICFPT07 12/11/07 14 7 Reconfigurable Computing Boards (Accelerators) Many boards per node can be supported Host program (e.g. C) to interface user (and μP) with board via a board API Driver API functions may include functionalities such as Reset, Open, Close, Set Clocks, DMA, Read, Write, Download Configurations, Interrupt, Readback ICFPT07 12/11/07 15 Some Reconfigurable Boards Vendors ANNAPOLIS MICRO SYSTEMS, INC. (http://www.annapmicro.com) University of Southern California -USC/ISI (http://www.east.isi.edu) AMONTEC (http://www.amontec.com/chameleon.shtml) XESS Corporation (http://www.xess.com) CELOXICA (http://www.celoxica.com) CESYS (http://www.cesys.com) TRAQUAIR (http://www.traquair.com) SILICON SOFTWARE: (http://www.silicon-software.com) ALPHA DATA: (http://www.alpha-data.com) Associated Professional Systems: (http://www.associatedpro.com) NALLATECH: (http://www.nallatech.com) ICFPT07 12/11/07 16 8 Representative Example Boards From Annapolis Micro Systems (AMI) http://www.annapmicro.com & Nallatech http://www.nallatech.com ICFPT07 12/11/07 17 WILDFORCETM 256k x 32 Xilinx 4062XL’s dual port RAM Source: [AMS02] ICFPT07 12/11/07 18 9 Source: [AMS02] ICFPT07 12/11/07 19 WILDSTARTM II for VME Backplane I/O Backplane I/O I/O #0 P0 P2 I/O #1 172 Flash 168 88 Flash 88 168 Flash 16 16 16 DDR2 36 36 DDR2 DDR2 36 36DDR2 DDR2 36 36 DDR2 SRAM SRAM SRAM SRAM SRAM SRAM 2, 4 MB 2, 4 MB 2, 4 MB 2, 4 MB 2, 4 MB 2, 4 MB PE 0 PE 2 PE 1 DDR2 36 TM 36 DDR2 DDR2 36 TM 36 DDR2 DDR2 36 TM 36 DDR2 SRAM VIRTEX II SRAM SRAM VIRTEX II SRAM SRAM VIRTEX II SRAM 2, 4 MB XC2V 6000, 8000 2, 4 MB 2, 4 MB XC2V 6000, 8000 2, 4 MB 2, 4 MB XC2V 6000, 8000 2, 4 MB DDR2 36 36 DDR2 DDR2 36 36 DDR2 DDR2 36 36 DDR2 SRAM SRAM SRAM SRAM SRAM SRAM 2, 4 MB 2, 4 MB 2, 4 MB 2, 4 MB 2, 4 MB 2, 4 MB 172 172 32 3 32 3 32 3 DDR Prog DDR Prog DDR Prog SDRAM Osc SDRAM Osc SDRAM Osc 64 MB 64 MB 64 MB 32 32 32 32 32/64 BitsPCI 33/66 MHz 32 Controller Flash 16 Master PCLK Clock MCLK Differential Generator ICLK Single Ended PCI to VME Bridge Copyright Annapolis Micro Systems, Inc. 2002 Source: [AMS02] VME BUS ICFPT07 12/11/07 20 10 WILDSTAR™ II Pro Reproduced and displayed with permission ICFPT07 12/11/07 21 WILDSTAR™ II Pro Reproduced and displayed with permission ICFPT07 12/11/07 22 11 Nallatech's BenNUEY-PCI-4E ICFPT07 12/11/07 23 Clusters and Networks of Reconfigurable Computers (NORCs) ICFPT07 12/11/07 24 12 Networks of Reconfigurable Computers (NORCs) Expensive reconfigurable workstations can be at times underutilized Large problems may need to be spread over a number of workstations Many problems may need a high throughput environment So, need S/W system to remotely schedule and monitor reconfigurable tasks, send data and bit- streams, and collect results ICFPT07 12/11/07 25 Example : GWU/GMU Extended JMS http://www.gwu.edu/~hpc/lsf/ http://ece.gmu.edu/lucite/ Team from GWU and GMU with DoD support Considered extending job management systems (JMS’s) to recognize reconfigurable computing resources and support the needed functionalities Evaluated many implementations of JMS’s and selected LSF (Load Sharing Facility) for implementation ICFPT07 12/11/07 26 13 Networked Reconfigurable Resources Management System ICFPT07 12/11/07 27 Architecture of a typical Job Management System Scheduling Policies Resource Manager Resource Available Requirements Resources Resource Job Scheduler User Monitor Server Job Jobs & Resource Allocation Dispatcher their requirements and Job Execution ICFPT07 12/11/07 28 14 LSF LSF (Load Sharing Facility) is the product of Platform Computing LSF is a layer of software services on top of UNIX and Windows NT operating systems The LSF Suite is a set of software modules that manage distributed computing resources and workloads LSF creates a single system view on a network of heterogeneous computers so that the whole network of computing resources can be utilized effectively and managed easily ICFPT07 12/11/07 29 General Architecture of LSF Submission host other Master host other Execution host hosts hosts MLIM LIM Load information SBD Batch API MBD Child SBD queue bsub app RES LIM – Load Information Manager MLIM – Master LIM User job MBD – Master Batch Daemon SBD – Slave Batch Daemon RES – Remote Execution Server ICFPT07 12/11/07 30 15 Extension of LSF to Reconfigurable Hardware other Master host other Execution host hosts hosts Universal ELIM MLIM board independent Remote Load LIM user information SBD MBD 2 bsub board Child SBD Board dependent API plug-in queue RES Status of the RUser job Local LUser job board user Board 2 API Board 2 API FPGA Board 2 driver board ICFPT07 12/11/07 31 GWU/GMU NORCs Testbed Used in Experiments Execution Host 1 SLAAC-1V HPCL 2 FIREBIRD V1000 Windows XP – PIV 1.3 GHz, 256 MB RAM Submission & Master Host Execution Host 2 WILDFORCE HPCL 3 LINUX 2.2.5 – PII 450 MHz, 256 MB RAM Linux RH7.0 – PIII