The Past and Future of FPGA Soft Processors

Jan Gray
Gray Research LLC
ReConFig 2014 Keynote, 9 Dec 2014

Copyright © 2014, Gray Research LLC. Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) license. http://creativecommons.org/licenses/by/4.0/

In Celebration of Soft Processors
• Looking back
• Interlude: “old school” soft processor, revisited
• Looking ahead

New Engines Bring New Design Eras

1. EARLY DAYS

1985-1990: Prehistory
• XC2000, XC3000: not quite up to the job
  – Early multi-FPGA coprocessors
  – ~8-bit MISCs

1991: XC4000

1991: RISC4005 [P. Freidin]
• The first monolithic general purpose FPGA CPU
• “FPGA Devices: 1 Xilinx XC4005 ... On-board RAM: 64K Words (16 bit words) Notes: A 16 bit RISC processor that requires 75% of an XC4005, 16 general registers, 4 stage pipeline, 20 MHz. Can be integrated with peripherals on 1 FPGA, and ISET can be extended.
… Includes a macro assembler, gate level simulator, ANSI C compiler, and a debug monitor.” [Steve Guccione: List of FPGA-based Computing Machines, http://www.cmpware.com/io.com/guccione/HW_list.html]
Photos: Philip Freidin

1994-95: Gathering Steam
• Communities: FCCM, comp.arch.fpga [http://fpga-faq.org/archives/index.html]
• Research, commercial interest
  – OneChip, V6502

1995: J32
• 32-bit RISC + “SoC”
• Integer only
• 33 MHz ÷ 2φ
• 4-stage pipeline
• <60% of XC4010
• C++ XNF .bit

J32 Microarchitecture

1995-96: XC4000E and FLEX10K

1998: XSOC/xr16 Kit
+ http://www.xess.com/shop/product/xs40-005e/

1998: xr16
• 40 MHz 16-bit RISC, DMA
• XC4005E/XL/Spartan-10, 265 LUTs
• LCC C compiler, simulator
• Building a RISC System in an FPGA, Circuit Cellar series [http://fpgacpu.org/xsoc/cc.html]
• FPGA CPU News, fpga-cpu list

1998: xr16 Datapath, XC4005E Floorplan

1998: Virtex

2000: Nios, SOPC Builder

2000: FPGA Chip Multiprocessors
• 3rd gen 16/32-bit RISC PE: 200/330 LUTs + 1 BRAM
• 8 cores fit in an XCV50E, 60 in an XCV600E

2001-02: Virtex-II/Pro, MicroBlaze, EDK

2002: The End of the Beginning
• Diverse 3rd party soft processors
  – Little MCUs – KCPSM/PicoBlaze
  – Commercial RISCs – ARC, LEON SPARC
  – Legacy ISAs – 6502, Z80, 68000
  – Hobbyist / open source – OpenRISC
  – Language specific cores – Java, Forth, Erlang
  – Teaching – Chalmers, Cornell, Georgia Tech, Hiroshima, Mich. State, NM Tech, Riverside, Tokai, UCSC, Valladolid, Virginia Tech, WUStL
• Nios, MicroBlaze: comprehensive SoC platforms

2001-2014: MicroBlaze Evolution/Configurability
Version:
1. 3-stage, mul, bshift, CoreConnect
2. div, FSL, I$, D$, 150 MHz = 100 DMIPS
3. cache links
4. FPU, debug trace
5. 5-stage pipeline, = 240 DMIPS
6. 3/5-stage
7. MMU, exceptions, Linux
8. AXI4, fault tolerance, 330 MHz, = 400 DMIPS
~4.4× faster / 12 years = +13%/year

Remarkable (But Typical) Applications
http://forums.xilinx.com/t5/Xcell-Daily-Blog/Mars-Curiosity-Rover-s-MAHLI-images-a-dusty-penny-on-Mars-with/ba-p/369275
http://www-robotics.jpl.nasa.gov/publications/Reg_Willson/Edgett_etal_MAHLI_7Jul2012published.pdf
http://forums.xilinx.com/t5/Xcell-Daily-Blog/The-Search-for-Gravity-Waves-and-Dark-Energy-Gets-Help-from/ba-p/491048

The Utility of Soft Processors
• Run existing software, in the FPGA
  – Replace external MCU
  – Run RTOS / Linux / drivers / networking / web server
• Control plane
  – Replace complex state machines
  – Hardware boot, diagnostics, telemetry
• Accelerators
  – Customize with app-specific instruction sets
  – Tightly couple software to accelerators
• Computer architecture research
  – RAMP*, CHERI, FlexPRET

2. DOING IT “OLD SCHOOL”: AN AUSTERE APPROACH TO SOFT PROCESSOR DESIGN

Design Study: How Many Austere 32-bit RISCs Fit in a Modern FPGA?
Virtex-7 690T: (433K 6-LUTs, 1470 BRAMs) ÷ (330 LUTs, 1 BRAM) = ?
• What about…?
  – 6-LUTs? DSPs?
  – NoC? Memory model?
• Let’s see

Planning an Austere RISC PE – Psilos
• Assumptions
  – Xilinx 6-LUTs
  – CMP on-die shared memory
  – Run small kernels (no I$)
  – MicroBlaze integer subset
• Essential: I-fetch, reg file, ALU, PC/branches, lw/sw
• Configurable: lb/sb, lh/sh
• Cluster-shared: mul, bshift
  • A<<k = A×2^k    A>>k = mulh(A×2^(32-k))
• Pipeline design
  – Critical path: operand regs → ALU → mux → mux → operand regs
  – BRAM for instructions, data
  – IF:DC:EX:MEM
• 2-core clusters
  – Share 4 KB instruction BRAM
  – Share 4 KB local data BRAM
  – Share 32×32=64 mul
  – 2 BRAM 10R×16C slices
  – 1 PE ≤ ½×10×16×4×¾ = 240 LUTs

RISC PE Datapath
[Figure: REG FILE supplies operands A and B to the ALU and MULT*, producing RESULT; NEXTPC, PC and IR; ports IN_AD, IN_IN, D_AD, D_IN, D_OUT, DO.]

RISC PE Technology Mapping
[Figure: the same datapath annotated with resource costs: 2 @ 4 LUTs/6 bits, 2 @ 4 LUTs/8 bits, 1 LUT/bit (three places), MULT*: 4 DSPs (clustered).]

Floorplan: <200 6-LUTs, 300 MHz

2-PE Cluster
• PEs share 4 KB instruction RAM and 4 KB data RAM.
• NOC interface can write instruction RAM and read/write data RAM.
• PE 0 enjoys dedicated ports. PE 1 shares with NOC. Priority: NOC, PE 1.
• Not shown: PEs 0 and 1 share the non-pipelined 32x32=64 multiplier.
[Figure: PE 0 and PE 1 on either side of a DATA BRAM and an INSN BRAM; signals P0_AD, P0_DO, P1_AD, P1_DO, N_AD, N_DI, N_DO.]

4-PE Cluster
• 2x2=4 PE CLUSTER
• 4:1 concentration of NOC accesses.
• NOC interface can write instruction RAM and read/write data RAM.
• Upon receipt of a read-request, a read-response (e.g. write) is injected back into NOC.
[Figure: 2-PE CLUSTER 0 and 2-PE CLUSTER 1 with PE ports P0 to P3 (_AD, _DO) and NOC signals N_I, N_O, N_AD, N_DO, N_DI.]
(Not shown: read-response FIFO.)

A 25R×10C 2-D Torus NoC Using 250 5-port Routers (!)
[Figure: each ROUTER has ports N, W, E, S plus I and O to its 4-PE CLUSTER.]

Hoplite: Austere X:Y Router (Unidirectional 2D Torus)
• Wide. Fast. Simple. Tiny.
• Bufferless, but no dropped traffic.
• Dimension Ordered Routing: X>Y. Priority: N, W, I.
• No segmentation/reassembly. No flits. No VCs. Simple flow control.
• Grossly unfair, but deadlock-free.
[Figure: ROUTER with ports N, W, E, S, I, O attached to a 4-PE CLUSTER (8 KB DATA).]

Phalanx
[Image: http://www.freewebs.com/apworldpolitics/phalanx%20facts%20picture.jpg]
• 25R×10C×4-PEs + routers
  – 1000 PEs / Virtex-7-690T
• OK, but …?
  – Latency?
  – External memory system?
  – Workloads?

3. THE FUTURE OF SOFT PROCESSORS

FPGAs: What’s Next?
• Today
  – 20-28 nm low-power
  – 600 KLUTs, 1000s DSPs
  – 2.5D packaging
    • 1.2 MLUTs
    • Hetero: 28 Gbps serdes die
  – + 2×ARM A9s+SoC
    • ARM tools and IP
• Soon? (speculative)
  – 10-14 nm lower-power but Dennard
  – 2+ MLUTs, 1000s DSP-FPUs, TFLOPS!
  – 2.5D packaging
    • 4+ MLUTs
    • + DRAM, 50+ Gbps serdes
  – “Datacenter edition”?
    • 4-8×64-bit ARM+SoC + DRAM + ?
  – Opportunities and challenges!
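Looking back at the section 2 design study, the slide arithmetic can be checked with a short script. This is a minimal sketch, not code from the talk: the function names are hypothetical, 32-bit registers are modeled with masked Python integers, the right shift is taken to be a logical (unsigned) shift, the figure of 4 LUTs per slice is the usual Xilinx 7-series count, and the ¾ factor is read here as a utilization allowance, which the slides do not spell out.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers with Python's unbounded ints


def shl_via_mul(a: int, k: int) -> int:
    """A << k as the low word of A * 2**k, for 0 <= k <= 31."""
    return (a * (1 << k)) & MASK32


def shr_via_mulh(a: int, k: int) -> int:
    """Logical A >> k as the high word of A * 2**(32 - k), for 1 <= k <= 31.

    k = 0 is excluded because 2**32 does not fit in a 32-bit operand.
    """
    return ((a * (1 << (32 - k))) >> 32) & MASK32


def pe_upper_bound(luts: int, brams: int, luts_per_pe: int, brams_per_pe: int) -> int:
    """PEs that fit if the scarcer of the two resources is the only limit."""
    return min(luts // luts_per_pe, brams // brams_per_pe)


def pe_lut_budget(rows: int = 10, cols: int = 16, luts_per_slice: int = 4) -> int:
    """Half of a rows x cols slice region, scaled by the slide's 3/4 factor."""
    return rows * cols * luts_per_slice * 3 // (2 * 4)


if __name__ == "__main__":
    a = 0x80000003
    print(hex(shl_via_mul(a, 4)))   # same value as (a << 4) & MASK32
    print(hex(shr_via_mulh(a, 4)))  # same value as a >> 4
    print(pe_upper_bound(433_000, 1470, 330, 1))  # Virtex-7 690T vs. 330-LUT PE
    print(pe_lut_budget())          # per-PE LUT ceiling in a 2-PE cluster
    print(25 * 10 * 4)              # Phalanx: 25R x 10C clusters of 4 PEs
```

With the slide's numbers, the LUT count is the binding limit (1312 PEs of 330 LUTs each), the per-PE budget comes out at 240 LUTs, and the Phalanx array of 1000 PEs sits under that bound.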
Catapult [ISCA14]
• Microsoft Research + Bing joint study and pilot
• Accelerate Bing search query ranking with FPGAs at datacenter scale
• Doubled throughput /or/ greatly reduced latency

Server Node += FPGA
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs, 2 SSDs
• 10 Gb Ethernet

Catapult FPGA Accelerator Card
• Altera Stratix V D5: 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• 8GB DDR3-1333
• PCIe Gen 3 x8
• Powered by PCIe slot
• FPGA – FPGA: torus network

1,632 Server Pilot Deployed in a Production Datacenter

FPGA Accelerator for Search Ranking
[Figure: 8-Stage Pipeline of FPGA 0 to FPGA 7, one per server, including FE: Feature Extraction, FFE: Free-Form Scoring Expressions and Compute Score stages; RaaS Servers send a Document Scoring Request, Route to Head, and receive Return Score.]

FFE: Free Form Expressions
• {Query, Document}
• ~4K Dynamic Features: Occurrences_0 = 7, Occurrences_1 = 4, Tuples_0_1 = 1
• FFE #1 = (2*Occurrences_0 + Occurrences_1) / (2 * Tuples_0_1)
• ~2K Synthetic Features: FFE #1 = 9
• L2Score, Score

FFE Soft Processor Arrays
• Many FFE “programs” → soft processor approach
• 4 hardware threads/core
• 6 cores/cluster
  – Shared input vector
  – Shared pipelined FPUs
• 10 clusters/FPGA
  – 60 cores, 240 threads
• C++ compiler
[Figure: Cluster 0 with Core 0 to Core 5, FST, Complex, Output.]

The Design Productivity Challenge
• Catapult Bing ranking: ~20 KLOCs C++ → FPGAs
• RTL design productivity << software development
  – Smaller talent pool, weak tools, bespoke, fragile
• Missing essentials
  – Abstraction builders: languages, types, libraries, services, OS
  – Reuse, composability, portability, longevity
• Often the workload is software already, changes often
  – Expensive to port and maintain

Making HW Dev More Like SW
• Vivado HLS one core
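
As a footnote to the FFE slide, the worked example can be reproduced in a few lines. This is a sketch under one stated assumption: the operator between the two parenthesized terms of FFE #1 did not survive in the slide text, and division is assumed here because it is the choice that yields the slide's result of 9 from the given feature values. The function name `ffe_1` and the dictionary layout are hypothetical.

```python
from typing import Mapping


def ffe_1(features: Mapping[str, float]) -> float:
    """Evaluate FFE #1, assuming the two parenthesized terms are divided."""
    numerator = 2 * features["Occurrences_0"] + features["Occurrences_1"]
    denominator = 2 * features["Tuples_0_1"]
    return numerator / denominator


if __name__ == "__main__":
    # Dynamic feature values as given on the slide.
    dynamic_features = {"Occurrences_0": 7, "Occurrences_1": 4, "Tuples_0_1": 1}
    print(ffe_1(dynamic_features))
```

Each such expression is a small, independent program over a shared input vector of features, which fits the slide's case for mapping many FFE "programs" onto an array of soft processor cores.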
