The Past and Future of FPGA Soft Processors
Jan Gray
Gray Research LLC
[email protected]
ReConFig 2014 Keynote
9 Dec 2014
Copyright © 2014, Gray Research LLC. Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) license. http://creativecommons.org/licenses/by/4.0/

In Celebration of Soft Processors
• Looking back
• Interlude: “old school” soft processor, revisited
• Looking ahead

9 Dec 2014 ReConFig 2014

New Engines Bring New Design Eras

1. EARLY DAYS

1985-1990: Prehistory
• XC2000, XC3000: not quite up to the job
  – Early multi-FPGA coprocessors
  – ~8-bit MISCs

1991: XC4000

1991: RISC4005 [P. Freidin]
• The first monolithic general purpose FPGA CPU
• “FPGA Devices: 1 Xilinx XC4005 ... On-board RAM: 64K Words (16 bit words) Notes: A 16 bit RISC processor that requires 75% of an XC4005, 16 general registers, 4 stage pipeline, 20 MHz. Can be integrated with peripherals on 1 FPGA, and ISET can be extended.
… Includes a macro assembler, gate level simulator, ANSI C compiler, and a debug monitor.” [Steve Guccione: List of FPGA-based Computing Machines, http://www.cmpware.com/io.com/guccione/HW_list.html]
Photos: Philip Freidin

1994-95: Gathering Steam
• Communities: FCCM, comp.arch.fpga [http://fpga-faq.org/archives/index.html]
• Research, commercial interest
  – OneChip, V6502

1995: J32
• 32-bit RISC + “SoC”
• Integer only
• 33 MHz ÷ 2φ
• 4-stage pipeline
• <60% of XC4010
• C++ → XNF → .bit

J32 Microarchitecture

1995-96: XC4000E and FLEX10K

1998: XSOC/xr16 Kit + http://www.xess.com/shop/product/xs40-005e/

1998: xr16
• 40 MHz 16-bit RISC, DMA
• XC4005E/XL/Spartan-10, 265 LUTs
• LCC C compiler, simulator
• Building a RISC System in an FPGA, Circuit Cellar series [http://fpgacpu.org/xsoc/cc.html]
• FPGA CPU News, fpga-cpu list

1998: xr16 Datapath
XC4005E Floorplan

1998: Virtex

2000: Nios, SOPC Builder

2000: FPGA Chip Multiprocessors
• 3rd gen 16/32-bit RISC PE: 200/330 LUTs + 1 BRAM
• 8 cores fit in an XCV50E, 60 in an XCV600E

2001-02: Virtex-II/Pro, MicroBlaze, EDK

2002: The End of the Beginning
• Diverse 3rd party soft processors
  – Little MCUs: KCPSM/PicoBlaze
  – Commercial RISCs: ARC, LEON SPARC
  – Legacy ISAs: 6502, Z80, 68000
  – Hobbyist / open source: OpenRISC
  – Language specific cores: Java, Forth, Erlang
  – Teaching: Chalmers, Cornell, Georgia Tech, Hiroshima, Mich. State, NM Tech, Riverside, Tokai, UCSC, Valladolid, Virginia Tech, WUStL
• Nios, MicroBlaze: comprehensive SoC platforms

2001-2014: MicroBlaze Evolution/Configurability
Version:
1. 3-stage, mul, bshift, CoreConnect
2. div, FSL, I$, D$, 150 MHz = 100 DMIPS
3. cache links
4. FPU, debug trace
5. 5-stage pipeline, = 240 DMIPS
6. 3/5-stage
7. MMU, exceptions, Linux
8. AXI4, fault tolerance, 330 MHz, = 400 DMIPS
~4.4× faster / 12 years = +13%/year

Remarkable (But Typical) Applications
http://forums.xilinx.com/t5/Xcell-Daily-Blog/Mars-Curiosity-Rover-s-MAHLI-images-a-dusty-penny-on-Mars-with/ba-p/369275
http://www-robotics.jpl.nasa.gov/publications/Reg_Willson/Edgett_etal_MAHLI_7Jul2012published.pdf
http://forums.xilinx.com/t5/Xcell-Daily-Blog/The-Search-for-Gravity-Waves-and-Dark-Energy-Gets-Help-from/ba-p/491048

The Utility of Soft Processors
• Run existing software, in the FPGA
  – Replace external MCU
  – Run RTOS / Linux / drivers / networking / web server
• Control plane
  – Replace complex state machines
  – Hardware boot, diagnostics, telemetry
• Accelerators
  – Customize with app-specific instruction sets
  – Tightly couple software to accelerators
• Computer architecture research
  – RAMP*, CHERI, FlexPRET

2. DOING IT “OLD SCHOOL”: AN AUSTERE APPROACH TO SOFT PROCESSOR DESIGN

Design Study: How Many Austere 32-bit RISCs Fit in a Modern FPGA?
Virtex-7 690T: (433K 6-LUTs, 1470 BRAMs) ÷ (330 LUTs, 1 BRAM) = ?
• What about…?
  – 6-LUTs? DSPs?
  – NoC? Memory model?
• Let’s see

Planning an Austere RISC PE – Psilos
• Assumptions
  – Xilinx 6-LUTs
  – CMP on-die shared memory
  – Run small kernels (no I$)
  – MicroBlaze integer subset
• Essential: I-fetch, reg file, ALU, PC/branches, lw/sw
• Configurable: lb/sb, lh/sh
• Cluster-shared: mul, bshift
  • A<<k = A×2^k   A>>k = mulh(A×2^(32-k))
• Pipeline design
  – Critical path: operand regs → ALU → mux → mux → operand regs
  – BRAM for instructions, data
  – IF:DC:EX:MEM
• 2-core clusters
  – Share 4 KB instruction BRAM
  – Share 4 KB local data BRAM
  – Share 32×32=64 mul
  – 2 BRAM 10R×16C slices
  – 1 PE ≤ ½×10×16×4×¾ = 240 LUTs

RISC PE Datapath
[Figure: datapath diagram. Labels: REG FILE, A, B, ALU, MULT*, RESULT, NEXTPC, PC, IR, D_AD, D_IN, D_OUT, DO, IN_AD, IN_IN.]

RISC PE Technology Mapping
[Figure: the same datapath annotated with resource costs: 2 @ 4 LUTs/6 bits; 1 LUT/bit (three places); 2 @ 4 LUTs/8 bits; MULT*: 4 DSPs (clustered).]

Floorplan: <200 6-LUTs, 300 MHz

2-PE Cluster
[Figure: PE 0, PE 1, DATA BRAM, INSN BRAM and NOC interface. Signals: P0_AD, P0_DO, P1_AD, P1_DO, N_AD, N_DI, N_DO; BRAM ports AD, DI, DO; PE ports D_AD, D_OUT, D_IN, IN_AD, IN_IN.]
PEs share 4 KB instruction RAM and 4 KB data RAM.
NOC interface can write instruction RAM and read/write data RAM.
PE 0 enjoys dedicated ports. PE 1 shares with NOC. Priority: NOC, PE 1.
Not shown: PEs 0 and 1 share the non-pipelined 32x32=64 multiplier.

4-PE Cluster
[Figure: 2-PE CLUSTER 0 and 2-PE CLUSTER 1 combined into a 2x2=4 PE CLUSTER. Signals: P0_AD,P0_DO through P3_AD,P3_DO; N_AD, N_DO, N_DI, N_I, N_O.]
4:1 concentration of NOC accesses.
NOC interface can write instruction RAM and read/write data RAM.
Upon receipt of a read-request, a read-response (e.g. write) is injected back into NOC.
(Not shown: read-response FIFO.)

A 25R×10C 2-D Torus NoC Using 250 5-port Routers (!)
[Figure: ROUTER with ports N, W, E, S, I, O attached to a 4-PE CLUSTER.]

Hoplite: Austere X:Y Router (Unidirectional 2D Torus)
[Figure: ROUTER with ports N, W, E, S, I, O attached to a 4-PE CLUSTER (8 KB DATA).]
Wide. Fast. Simple. Tiny.
Bufferless, but no dropped traffic.
Dimension Ordered Routing: X>Y. Priority: N, W, I.
No segmentation/reassembly. No flits. No VCs. Simple flow control.
Grossly unfair, but deadlock-free.

Phalanx
[Image: http://www.freewebs.com/apworldpolitics/phalanx%20facts%20picture.jpg]
• 25R×10C×4-PEs + routers
  – 1000 PEs / Virtex-7-690T
• OK, but …?
  – Latency?
  – External memory system?
  – Workloads?

3. THE FUTURE OF SOFT PROCESSORS

FPGAs: What’s Next?
• Today
  – 20-28 nm low-power
  – 600 KLUTs, 1000s DSPs
  – 2.5D packaging
    • 1.2 MLUTs
    • Hetero: 28 Gbps serdes die
  – + 2×ARM A9s+SoC
    • ARM tools and IP
• Soon? (speculative)
  – 10-14 nm lower-power but Dennard
  – 2+ MLUTs, 1000s DSP-FPUs, TFLOPS!
  – 2.5D packaging
    • 4+ MLUTs
    • + DRAM, 50+ Gbps serdes
  – “Datacenter edition”?
    • 4-8×64-bit ARM+SoC + DRAM + ?
  – Opportunities and challenges!
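
The sizing arithmetic quoted on the design study, Psilos, and Phalanx slides can be checked with a short script. This is only a sketch: the constants are the figures printed on the slides, while the function and variable names are made up for illustration, and the meaning of the ¾ factor in the per-PE budget is not stated in the deck, so the code reproduces it as written without interpreting it.

```python
# Back-of-envelope check of the PE-count figures quoted on the slides.
# All names here are illustrative; only the numbers come from the deck.

DEVICE_LUTS = 433_000   # Virtex-7 690T: "433K 6-LUTs"
DEVICE_BRAMS = 1_470    # Virtex-7 690T: "1470 BRAMs"
PE_LUTS = 330           # year-2000 32-bit RISC PE: "330 LUTs"
PE_BRAMS = 1            # "... + 1 BRAM"


def naive_pe_bound(device_luts, device_brams, pe_luts, pe_brams):
    """Upper bound on PE count if the device held nothing but PEs."""
    return min(device_luts // pe_luts, device_brams // pe_brams)


def pe_lut_budget(rows=10, cols=16, luts_per_slice=4, pes_per_cluster=2):
    """Slide formula 1/2 x 10 x 16 x 4 x 3/4, in exact integer arithmetic.

    The 3/4 factor is copied from the slide; what it accounts for
    is an open assumption here.
    """
    region_luts = rows * cols * luts_per_slice
    return region_luts * 3 // 4 // pes_per_cluster


def phalanx_pe_count(rows=25, cols=10, pes_per_cluster=4):
    """One router and one 4-PE cluster per torus node."""
    return rows * cols * pes_per_cluster


if __name__ == "__main__":
    print(naive_pe_bound(DEVICE_LUTS, DEVICE_BRAMS, PE_LUTS, PE_BRAMS))  # 1312
    print(pe_lut_budget())      # 240
    print(phalanx_pe_count())   # 1000
```

The naive bound is LUT-limited (1312 PEs against 1470 BRAMs), so the 1000-PE Phalanx layout sits below it, which leaves LUTs for the 250 routers.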
• Catapult [ISCA14]
• Microsoft Research + Bing joint study and pilot
• Accelerate Bing search query ranking with FPGAs at datacenter scale
• Doubled throughput /or/ greatly reduced latency

Server Node += FPGA
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs, 2 SSDs
• 10 Gb Ethernet

Catapult FPGA Accelerator Card
• Altera Stratix V D5: 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• 8GB DDR3-1333
• PCIe Gen 3 x8
• Powered by PCIe slot
• FPGA – FPGA: torus network

1,632 Server Pilot Deployed in a Production Datacenter

FPGA Accelerator for Search Ranking
[Figure: 8-stage pipeline of FPGA 0 through FPGA 7, each attached to a RaaS server. Labels: Document Scoring Request, Route to Head, FE: Feature Extraction, FFE: Free-Form Scoring Expressions, Compute Score, Return Score.]

FFE: Free Form Expressions
{Query, Document}
~4K Dynamic Features: Occurrences_0 = 7, Occurrences_1 = 4, Tuples_0_1 = 1
~2K Synthetic Features: FFE #1 = (2*Occurrences_0 + Occurrences_1) / (2 * Tuples_0_1) = 9
L2Score, Score

FFE Soft Processor Arrays
• Many FFE “programs” → soft processor approach
• 4 hardware threads/core
• 6 cores/cluster
  – Shared input vector
  – Shared pipelined FPUs
• 10 clusters/FPGA
  – 60 cores, 240 threads
• C++ compiler
[Figure: Cluster 0 with Core 0 through Core 5, FST, Complex, Output.]

The Design Productivity Challenge
• Catapult Bing ranking: ~20 KLOCs C++ → FPGAs
• RTL design productivity << software development
  – Smaller talent pool, weak tools, bespoke, fragile
• Missing essentials
  – Abstraction builders: languages, types, libraries, services, OS
  – Reuse, composability, portability, longevity
• Often the workload is software already, changes often
  – Expensive to port and maintain

Making HW Dev More Like SW
• Vivado HLS one core
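
As an aside on the Free Form Expressions slide earlier in this section, the FFE #1 example can be written as a tiny software function over a feature dictionary. This is a hedged sketch, not the Catapult implementation: the feature names and values are the ones on the slide, the division between the two parenthesized terms is inferred from the slide's stated result of 9, and `ffe_1` and `features` are hypothetical names.

```python
# A minimal sketch of one free-form expression evaluated in software.
# Feature values are those shown on the slide; everything else is illustrative.

features = {
    "Occurrences_0": 7,
    "Occurrences_1": 4,
    "Tuples_0_1": 1,
}


def ffe_1(f):
    """FFE #1 from the slide; the '/' is inferred from the quoted result."""
    numerator = 2 * f["Occurrences_0"] + f["Occurrences_1"]
    denominator = 2 * f["Tuples_0_1"]
    return numerator / denominator


if __name__ == "__main__":
    print(ffe_1(features))  # 9.0
```

With thousands of such expressions changing often, running them as programs on soft processor cores avoids rebuilding RTL for each change, which is the motivation the FFE slides give for the processor-array approach.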