Altera Powerpoint Guidelines
Total Page:16
File Type:pdf, Size:1020Kb
Ecole Informatique 2016 IN2P3, Ecole Polytechnique, 27 Mai 2016 OpenCL On FPGA Marc Gaucheron INTEL Programmable Solution Group Agenda FPGA architecture overview Conventional way of developing with FPGA OpenCL: abstracting FPGA away ALTERA BSP: abstracting FPGA development Examples 2 A bit of Marketing ? 3 FPGA architecture overview FPGA Architecture: Fine-grained Massively Parallel Millions of reconfigurableLet’s zoom logic in elements I/O Thousands of 20Kb memory blocks Thousands of Variable Precision DSP blocks Dozens of High-speed transceivers Multiple High Speed configurable Memory Controllers I/O I/O Multiple ARM© Cores I/O 5 FPGA Architecture: Basic Elements Basic Element 1-bit configurable 1-bit register operation (store result) Configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB 6 FPGA Architecture: Flexible Interconnect … Basic Elements are surrounded with a flexible interconnect 7 FPGA Architecture: Flexible Interconnect … … Wider custom operations are implemented by configuring and interconnecting Basic Elements 8 FPGA Architecture: Custom Operations Using Basic Elements 16-bit add 32-bit sqrt … … … Your custom 64-bit bit-shuffle and encode Wider custom operations are implemented by configuring and interconnecting Basic Elements 9 FPGA Architecture: Memory Blocks addr Memory Block data_out data_in 20 Kb Can be configured and grouped using the interconnect to create various cache architectures 10 FPGA Architecture: Memory Blocks addr Memory Block data_out data_in 20 Kb Few larger Can be configured and grouped using Lots of smaller caches the interconnect to create various caches cache architectures 11 FPGA Architecture: Floating Point Multiplier/Adder Blocks data_in data_out Dedicated floating point multiply and add blocks 12 DSP block architecture Floating-Point DSP Fixed-Point DSP 13 Elementary math functions supporting floating point Coverage of ~70 elementary math functions Trigonometrics misc Patented & published efficient mapping to FPGA hardware FPHypot FPRangeReduction Polynomial approximation, Horner’s method, truncated multipliers, … Basic Floating Point Compliant to OpenCL & IEEE754 accuracy standards FPAdd FPAddExpert FPAddN Rounding mode options for fundamental operators FPSubExpert FPAddSub \ Half- to Double-precision FPAddSubExpert Trigonometrics of pi*x FPFusedAddSub FPMul FPSinPiX FPMulExpert FPCosPiX FPConstMul Inverse trigonometric functions FPTanPiX FPAcc FPCotPiX FPArcsinX\ FPSqrt FPArcsinPi FPDivSqrt Exp, Log and Power FPArccosX FPRecipSqrt FPCbrt FPLn FPArccosPi FPDiv FPLn1px FPArctanX FPInverse FPLog10 FPArctanPi FPFloor FPLog2 FPArctan2 FPCeil FPExp FPRound FPExpFPC Conversion FPRint FPExpM1 FPFrac FPExp2 FXPToFP FPMod FPExp10 FPToFXP FPDim FPPowr FPToFXPExpert FPAbs FPToFXPFused FPMin Trig with argument reduction FPToFP FPMax FPSinX Macro Operators FPMinAbs Fixed and floating point FPCosX FPMaxAbs FPSinCosX FPFusedHorner FPMinMaxFused Floating point only FPTanX FPFusedHornerExpert FPMinMaxAbsFused FPCotX FPFusedHornerMulti FPCompare FPFusedMultiFunction FPCompareFused 14 FPGA Architecture: Configurable Connectivity = Efficiency Blocks are connected into a custom data-path that matches your application. Streaming data-path more efficient than copying to/from global memory 15 16 1GHz 5.5M Core Performance Logic Elements % Heterogeneous up to Up to 70 Lower Power 1TB/s 3D SIP Integration Up to 10 Intel TFLOPS 14 nm Tri-Gate Most Quad-Core Comprehensive Cortex-A53 Security ARM Processor Developing with FPGA 18 Typical Programmable Logic Design Flow Design specification Design entry/RTL coding - Behavioral or structural description of design RTL simulation - Functional simulation (Mentor Graphics ModelSim® or other 3rd-party simulators) - Verify logic model & data flow (no timing delays) M512 Synthesis (Mapping) LE - Translate design into device specific primitives - Optimization to meet required area & performance constraints M4K/M9K I/O - Quartus II synthesis, Precision Synthesis, Synplify/Synplify Pro, Design Compiler FPGA - Result: Post-synthesis netlist Place & route (Fitting) - Map primitives to specific locations inside target technology with reference to area & performance constraints - Specify routing resources to be used - Result: Post-fit netlist 19 Typical Programmable Logic Design Flow t clk Timing analysis - Verify performance specifications were met - Static timing analysis Gate level simulation (optional) - Timing simulation - Verify design will work in target technology PC board simulation & test - Simulate board design - Program & test device on board - Use on-chip tools for debugging 20 Application Development Paradigm ASIC FPGA Programmers Parallel Programmers Standard CPU Programmers 21 The magic trick ? HDL Coder FpgaC 22 OpenACC to FPGA 23 OpenCL: abstracting FPGA away Altera OpenCL Program Overview 2010 research project Public release 13.1 (Nov 2013) Toronto Technology Center Channels (Streaming IO) 2011 Development started Example Designs SoC Support Proof of concept 9 customer evaluations Release 14.0 (June 2014) 2012 Early Access Program Platforms Emulator/Profiler Demo’s at Supercomputing ‘12 Rapid Prototyping Over 60 customer evaluations 2013 First public release Release 14.1 (Nov 2014) Arria 10 support Publically available May 2013 Shared Virtual Memory (PoC) Passed Conformance Testing >8500 programs run properly Release 15.1 (Nov 2015) Kernel Update Library support 25 OpenCL Use Model: Abstracting the FPGA away Host Code OpenCL Accelerator Code main() { __kernel void sum read_data( … ); (__global float *a, manipulate( … ); __global float *b, clEnqueueWriteBuffer( … ); __global float *y) clEnqueueNDRange(…,sum,…); { clEnqueueReadBuffer( … ); int gid = get_global_id(0); display_result( … ); y[gid] = a[gid] + b[gid]; } } Standard Altera Offline Verilog gcc Compiler Compiler EXE AOCX Quartus II Accelerator Host SoC FPGA combines these in single device 26 Software Development Flow Modify kernel.cl x86 Emulator (sec) Functional Bugs? Hardware performance Optimization Report (sec) Stall-free pipeline? Memory met? coalesced? Profiler (hours) DONE! 27 OpenCL on GPU/Multi-Core CPU Architectures Conceptually many parallel threads Simplified View Each thread runs sequentially on a different processing element (PE) Fixed #s of Functional Units, Registers available on each PE Many processing elements are available to provide significant parallel speedup Host CPU Work Distribution IU IU IU IU IU IU IU IU IU IU IU IU IU IU IU IU SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory TF TF TF TF TF TF TF TF TEX L1 TEX L1 TEX L1 TEX L1 TEX L1 TEX L1 TEX L1 TEX L1 L2 L2 L2 L2 L2 L2 28 Memory Memory Memory Memory Memory Memory OpenCL on FPGA OpenCL kernels are translated into a highly parallel circuit A unique functional unit is created for every operation in the kernel Memory loads / stores, computational operations, registers Functional units are only connected when there is some data dependence dictated by the kernel Pipeline the resulting circuit with a new thread on each clock cycle to keep functional units busy Amount of parallelism is dictated by the number of pipelined computing operations in the generated hardware 29 Mapping a simple program to an FPGA CPU instructions High-level code R0 Load Mem[100] R1 Load Mem[101] Mem[100] += 42 * Mem[101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem[100] 30 CPU activity, step by step R0 Load Mem[100] A Time R1 Load Mem[101] A R2 Load #42 A R2 Mul R1, R2 A R0 Add R2, R0 A Store R0 Mem[100] A 31 On the FPGA we unroll the CPU hardware… R0 Load Mem[100] A Space R1 Load Mem[101] A R2 Load #42 A R2 Mul R1, R2 A R0 Add R2, R0 A Store R0 Mem[100] A 32 … and specialize by position R0 Load Mem[100] A 1. Instructions are fixed. Remove “Fetch” R1 Load Mem[101] A R2 Load #42 A R2 Mul R1, R2 A R0 Add R2, R0 A Store R0 Mem[100] A 33 … and specialize R0 Load Mem[100] A 1. Instructions are fixed. Remove “Fetch” 2. Remove unused ALU ops R1 Load Mem[101] A R2 Load #42 A R2 Mul R1, R2 A R0 Add R2, R0 A Store R0 Mem[100] A 34 … and specialize R0 Load Mem[100] A 1. Instructions are fixed. Remove “Fetch” 2. Remove unused ALU ops 3. Remove unused Load / Store R1 Load Mem[101] A R2 Load #42 A R2 Mul R1, R2 A R0 Add R2, R0 A Store R0 Mem[100] A 35 … and specialize R0 Load Mem[100] 1. Instructions are fixed. Remove “Fetch” 2. Remove unused ALU ops 3. Remove unused Load / Store R1 Load Mem[101] 4. Wire up registers properly! And propagate state. R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem[100] 36 … and specialize R0 Load Mem[100] 1. Instructions are fixed. Remove “Fetch” 2. Remove unused ALU ops 3. Remove unused Load / Store R1 Load Mem[101] 4. Wire up registers properly! And propagate state. 5. Remove dead data. R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem[100] 37 … and specialize R0 Load Mem[100] 1. Instructions are fixed. Remove “Fetch” 2. Remove unused ALU ops 3. Remove unused Load / Store R1 Load Mem[101] 4. Wire up registers properly! And propagate state. 5. Remove dead data. R2 Load #42 6. Reschedule! R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem[100] 38 Custom data-path on the FPGA matches your algorithm! High-level code Mem[100] += 42 * Mem[101] Custom data-path load load 42 Build exactly what you need: Operations Data widths