Bill Jenkins, Intel Programmable Solutions Group (Intel Proprietary, prepared for LRZ)

Agenda
9:00 am  Welcome
9:15 am  Introduction to FPGAs
9:45 am  FPGA Programming Models: RTL
10:15 am FPGA Programming Models: HLS
11:00 am Lab 1: HLS Flow
11:45 am Lunch
12:30 pm FPGA Programming Models: OpenCL
1:00 pm  High Performance Data Flow Concepts
1:30 pm  Lab 2: OpenCL Flow
2:15 pm  Introduction to DSP Builder
3:00 pm  Introduction to Acceleration Stack
4:00 pm  Lab 3: Acceleration Stack
4:30 pm  Curriculum & University Program Coordination

The Problem: Flood of Data

By 2020, the average internet user will generate ~1.5 GB of traffic per day
Smart hospitals will be generating over 3 TB per day
Self-driving cars will be generating over 4,000 GB per day… each
  - radar: ~10-100 KB per second
  - sonar: ~10-100 KB per second
  - gps: ~50 KB per second
  - lidar: ~10-70 MB per second
  - cameras: ~20-40 MB per second
A connected plane will be generating over 40 TB per day
A connected factory will be generating over 1 PB per day
All numbers are approximate. Sources:
- http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
- http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
- https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172

Typical HPC Workloads
Astrophysics, Genomics / Bio-Informatics, Artificial Intelligence, Molecular Dynamics*, Big Data Analytics, Financial, Weather & Climate, Cyber Security
* Source: https://comp-physics-lincoln.org/2013/01/17/molecular-dynamics-simulations-of-amphiphilic-macromolecules-at-interfaces/
Fast Evolution of Technology
We now have the compute to solve these problems today in near real-time
Bigger Data
- Image: 50 MB / picture
- Audio: 5 MB / song
- Video: 47 GB / movie

Better Hardware
- Transistor density doubles every 18 months
- Cost / GB in 1995: $1000.00
- Cost / GB in 2015: $0.03

Smarter Algorithms
- Advances in neural networks leading to better accuracy in training models
50+ Years of Moore's Law: Computing has Changed…

The Urgency of Parallel Computing
If engineers keep building processors the way we do now, CPUs will get even faster but they’ll require so much power that they won’t be usable.
—Patrick Gelsinger, former Intel Chief Technology Officer, February 7, 2001. Source: http://www.cnn.com/2001/tech/ptech/02/07/hot.chips.idg/
Implications for High Performance Computing

50 GFLOPS/W, ~100 MW (2022)
Challenges Scaling Systems to Higher Performance

- CPU intensive: result is slow performance and excessive power requirements
- Memory intensive: result is a memory bottleneck (high latency)
- IO intensive: result is a system I/O bottleneck

Need to think about compute offload as well as ingress/egress processing.
Diverse Application Demands
The Intel Vision

Heterogeneous Systems:
▪ Span from CPU to GPU to FPGA to dedicated devices with consistent programming models, languages, and tools

CPUs | GPUs | FPGAs | ASSPs
Heterogeneous Computing Systems

Modern systems contain more than one kind of processor.
▪ Applications exhibit different behaviors:
  – Control intensive (searching, parsing, etc.)
  – Data intensive (image processing, data mining, etc.)
  – Compute intensive (iterative methods, financial modeling, etc.)
▪ Gain performance by using the specialized capabilities of different types of processors
Separation of Concerns

Two groups of developers:
▪ Domain experts concerned with getting a result
  – Host application developers leverage optimized libraries
▪ Tuning experts concerned with performance
  – Typical FPGA developers that create optimized libraries

The Intel® Math Kernel Library is a simple example of raising the level of abstraction to the math operations:
▪ Domain experts focus on formulating their problems
▪ Tuning experts focus on vectorization and parallelization
FPGA Enabled Performance and Agility

Workload Optimization: ensure Xeon cores serve their highest-value processing
Efficient Performance: improve performance/watt
Real-Time: high-bandwidth connectivity and low-latency parallel processing
Developer Advantage: code re-use across Intel FPGA data center products

FPGAs enhance CPU-based processing by accelerating algorithms and minimizing bottlenecks.
FPGAs Provide Flexibility to Control the Data Path

Compute Acceleration/Offload:
▪ Workload-agnostic compute
▪ FPGAaaS
▪ Virtualization

(Diagram: FPGA attached to an Intel® Xeon® Processor)

Inline Data Flow Processing:
▪ Machine learning
▪ Object detection and recognition
▪ Advanced driver assistance systems (ADAS)
▪ Gesture recognition
▪ Face detection

Storage Acceleration:
▪ Machine learning
▪ Cryptography
▪ Compression
▪ Indexing
FPGA Architecture

Field Programmable Gate Array (FPGA):
▪ Millions of logic elements
▪ Thousands of embedded memory blocks
▪ Thousands of DSP blocks
▪ Programmable interconnect
▪ High-speed transceivers
▪ Various built-in hardened IP

Used to create custom hardware! (Diagram: DSP blocks, memory blocks, logic modules, programmable routing switches)
FPGA Architecture: Basic Elements

The Basic Element pairs a 1-bit configurable operation with a 1-bit register (to store the result).
Configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB
FPGA Architecture: Flexible Interconnect

Basic Elements are surrounded with a flexible interconnect.
FPGA Architecture: Flexible Interconnect (continued)

Wider custom operations are implemented by configuring and interconnecting Basic Elements.
FPGA Architecture: Custom Operations Using Basic Elements

Examples: a 16-bit add, a 32-bit sqrt, your custom 64-bit bit-shuffle and encode.
Wider custom operations are implemented by configuring and interconnecting Basic Elements.
FPGA Architecture: Memory Blocks

20 Kb memory blocks with addr, data_in, and data_out ports.
Can be configured and grouped using the interconnect to create various cache architectures.
FPGA Architecture: Memory Blocks (continued)

Can be configured and grouped using the interconnect to create various cache architectures: lots of smaller caches, or a few larger caches.
FPGA Architecture: Floating Point Multiplier/Adder Blocks

Dedicated floating-point multiply and add blocks.
DSP Blocks

Thousands of DSP blocks in modern FPGAs:
▪ Configurable to support multiple features
  – Variable-precision fixed-point multipliers
  – Adders with accumulation register
  – Internal coefficient register bank
  – Rounding
  – Pre-adder to form a tap-delay line for filters
  – Single-precision floating-point multiplication, addition, accumulation
FPGA Architecture: Configurable Routing

Blocks are connected into a custom data-path that matches your application.
FPGA Architecture: Configurable IO

The custom data-path can be connected directly to custom or standard IO interfaces for inline data processing.
FPGA I/Os and Interfaces

FPGAs have flexible IO features to support many IO and interface standards:
▪ Hardened memory controllers
  – Available interfaces to off-chip memory such as HBM, HMC, DDR SDRAM, QDR SRAM, etc.
▪ High-speed transceivers
▪ PCIe* hard IP
▪ Phase-locked loops
*Other names and brands may be claimed as the property of others.

Intel® FPGA Product Portfolio
Wide range of FPGA products for a wide range of applications
The families range from non-volatile, low-cost, single-chip, small-form-factor devices, through low-power cost-sensitive parts and midrange devices balancing cost, power, and performance, up to high-performance state-of-the-art FPGAs.
▪ Product features differ across families
  – Logic density, embedded memory, DSP blocks, transceiver speeds, IP features, process technology, etc.
Mapping a Simple Program to an FPGA

High-level code:
    Mem[100] += 42 * Mem[101]

CPU instructions:
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
First let's take a look at execution on a simple CPU

(Diagram: a simple CPU datapath with PC, instruction fetch, load/store units, a register file with A/B/C ports, and an ALU)

Fixed and general architecture:
- General "cover-all-cases" data-paths
- Fixed data-widths
- Fixed operations

Looking at a Single Instruction

(Same datapath, with only the hardware touched by one instruction active)

Very inefficient use of hardware!
Sequential Architecture vs. Dataflow Architecture

(Diagram: the sequential CPU executes load, load-42, multiply, add, and store one after another over time, reusing the same resources; the FPGA dataflow architecture wires those operations into a spatial pipeline)
Custom Data-Path on the FPGA Matches Your Algorithm!

High-level code:
    Mem[100] += 42 * Mem[101]

Custom data-path: build exactly what you need:
- Operations
- Data widths
- Memory size & configuration
Efficiency: throughput / latency / power
Advantages of Custom Hardware with FPGAs

▪ Custom hardware!
▪ Efficient processing
▪ Fine-grained parallelism
▪ Low power
▪ Flexible silicon
▪ Ability to reconfigure
▪ Fast time-to-market
▪ Many available I/O standards

(Die diagram labels: DSP blocks, transceiver channels, M20K blocks, transceiver PCS, I/O PLLs, memory controllers, PCIe* IP, core logic, IOs)
RTL
FPGA Development and Programming Tools

From software developers to hardware developers:
▪ Software Developer: Intel® FPGA SDK for OpenCL; Intel® SoC FPGA Embedded Design Suite (EDS)
▪ Algorithm Designer: DSP Builder for Intel® FPGAs
▪ IP Library Developer: Intel® HLS Compiler
▪ HDL Designer: Verilog, VHDL
All flows build on the Intel® Quartus® Prime Design Software.

Verilog, VHDL, and the Intel® FPGA SDK for OpenCL are currently supported by the Acceleration Stack. High Level Synthesis can be used manually by following an app note.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos.

Traditional FPGA Design Entry

Circuits are described using Hardware Description Languages (HDL) such as VHDL or Verilog. A designer must describe the behavior of the algorithm to create a low-level digital circuit:
▪ Logic, registers, memories, state machines, etc.
Design times range from several months to even years!
Traditional FPGA Design Flow

A time-consuming effort:
1. Behavioral simulation
2. HDL synthesis
3. Place & route / timing analysis / timing closure
4. Board simulation & test
Intel® Quartus® Prime Design Software

Default operating environment (window panes): Project Navigator, Tool View window, Tasks window, IP Catalog, Messages window.
Intel® Quartus® Prime Design Software Projects

Description:
▪ Collection of related design files & libraries
▪ Must have a designated top-level entity
▪ Targets a single device
▪ Settings stored in the software settings file (.qsf)
▪ Compiled netlist information stored in the qdb folder in the project directory
Create new projects with the New Project Wizard
▪ Can also be created using Tcl scripts
Intel® FPGA Design Store

Download complete example design templates for specific development kits: https://cloud.altera.com/devstore/platform/
Design examples include design files, device programming files, and software code as required.
Install .par files and select one as a template in the New Project Wizard.
Device Selection

Filter the device list: choose the device family & family category (transceiver options, SoC options, etc.), then choose the specific part from the list.

Tcl: set_global_assignment -name FAMILY "device family name"
Tcl: set_global_assignment -name DEVICE
Chip Planner

Graphical view of:
▪ Layout of device resources
▪ Routing channels between device resources
▪ Global clock regions
Uses:
▪ View placement of design logic
▪ View connectivity between resources used in the design
▪ Make placement assignments
▪ Debug placement-related issues
Open from the Tools menu or toolbar.
(Window panes: device floorplan, aka Chip View; Report window; Selected Node Properties; Tasks window; Layers Settings. The floorplan distinguishes, for example, a memory block in use from an unused LAB.)
Floorplan Views

Views range from overall device resource usage, through lower-level block usage, down to the lowest-level routing detail. Zoom in for detailed logic implementation & routing usage.
Pin Planner

Interactive graphical tool for assigning pins:
▪ Drag & drop pin assignments
▪ Set pin I/O standards
▪ Reserve future I/O locations
Open from the Assignments menu → Pin Planner, the toolbar, or the Tasks window.
Default window panes: Package View, All Pins list, Groups list, Tasks window, Report window.
Pin Planner Window

(Window panes: Groups list, toolbar, Package View, Tasks pane, All Pins list)
The Programmer

Open from the Tools menu → Programmer.
State Machine Editor

File menu → New, or the Tasks window; select State Machine File (.smf).
Create state machines in the GUI:
▪ Manually, by adding individual states, transitions, and output actions
▪ Automatically, with the State Machine Wizard (Tools menu & toolbar)
Generate state machine HDL code (required): VHDL, Verilog, or SystemVerilog.
Double-click states & transitions to edit properties: name, equations, actions.
Platform Designer

Components in a system use different interfaces to communicate (some standard, some non-standard). A typical system requires significant engineering work to design custom interface logic, and integrating design blocks and intellectual property (IP) is tedious and error-prone.

(Diagram: a 32-bit processor master and a 64-bit PCI Express* master connected through an address decoder, arbiter, interrupt controller, and width adapters to five slaves of 8-, 32-, 16-, 32-, and 64-bit widths)
Automatic Interconnect Generation

Avoids error-prone integration, saves development time with automatic logic & HDL generation, and lets you focus on your value-add blocks. Platform Designer improves productivity by automatically generating the system interconnect logic.
The Platform Designer GUI

Access from the Tools menu, toolbar, or Tasks window. Tabs are draggable and detachable.
(Window panes: IP Catalog, System Contents, Hierarchy, Messages)
Lab 1
High Level Synthesis
Can Also Be Wrapped With Higher Level Flows

(Diagram: C/C++ functions are compiled by the Intel® HLS Compiler into RTL IP blocks, which are then assembled alongside other IP in Platform Designer)
The Software Programmer's View

main(…) {
    for( … ) {
        …
    }
}   →   Compiler

Programmers develop in mature software environments:
– Ideas can easily be expressed in languages such as 'C'
– Typically start with a simple sequential program
– Use parallel APIs / language extensions to exploit multiple cores for additional performance
– Compilation times are almost instantaneous
– Immediate feedback
– Rich debugging tools
High Level Design is the Bridge Between HW & SW

There are 100x more software engineers than hardware engineers; this is key to widespread adoption of FPGAs in the datacenter.

Debugging software is much faster than debugging hardware, and many functions are easier to specify in software than in RTL. Simulation of RTL takes thousands of times longer than running software, and design exploration is much easier and faster in software.

We need to raise the level of abstraction:
▪ Similar to what assembly programmers did with C over 30 years ago
– (Today) Abstract away FPGA design with higher-level languages
– (Today) Abstract away FPGA hardware behind platforms
– (Tomorrow) Leverage pre-compiled libraries as software services
HLS Use Model

(Diagram: the same C/C++ code (src.c, lib.h) feeds both a standard gcc/g++ compile, producing an EXE, and the HLS compiler (i++), which uses directives to turn selected functions into FPGA IP (HDL) for the Intel® Quartus® ecosystem)

100% Makefile compatible.
Intel® HLS Compiler

▪ Targets Intel® FPGAs
▪ Command-line executable: i++
▪ Builds an IP block to be integrated into a traditional FPGA design using FPGA tools (e.g., Platform Designer)
▪ Leverages a standard C/C++ development environment
Goal: the same performance as hand-coded RTL with 10-15% more resources.
HLS Procedure

1. Create the component and testbench in C/C++
2. Functional verification with g++ or i++ (iterate on the C/C++ source)
   • Use -march=x86-64
   • Both compilers are compatible with GDB
3. Compile with i++ -march=<FPGA family> to generate the HDL IP
4. Run Quartus® Prime compilation on the generated IP
   • Generate QoR metrics
5. Integrate the IP with the rest of your FPGA system
Intel® HLS Compiler Usage and Output

Develop with C/C++ (produces a GDB-compatible executable):
    i++ -march=x86-64 src.c    →  a.exe|a.out

Run the compiler for HLS:
    i++ -march=<FPGA family> src.c
    →  a.exe|a.out: executable which will run calls to the component (func) in a simulation of the synthesized IP
    →  a.prj/components/func/: all the files necessary to include the IP in a Quartus project (.qsys, .ip, .v, etc.)

a is the default output name; the -o option can be used to specify a non-default output name.
HLS Procedure: x86 Emulation

1. Create the component and testbench in C/C++
2. Functional verification with g++ or i++ (iterate on the C/C++ source)
   • Use -march=x86-64
   • Both compilers are compatible with GDB
3. Compile with i++ -march=<FPGA family> to generate the HDL IP
4. Run Quartus® Prime compilation on the generated IP
   • Generate QoR metrics
5. Integrate the IP with the rest of your FPGA system
Simple Example Program: i++ and g++ Flow

Terminal commands and outputs:
    $ g++ test.cpp
    $ ./a.out
The same test.cpp compiles unchanged with i++.
g++ Compatibility

The Intel HLS Compiler is command-line compatible with g++:
▪ Similar command-line flags, x86 behavior, and compilation flow
▪ Changing "g++" to "i++" should just work
i++ Options: g++ Compatible Options

Option           Description
-h               Display help information
-o <name>        Specify the output file name
-c               Instructs the compiler to generate the object files and not the executable
-march=<target>  Select the target: x86-64 or an FPGA family
-v               Verbose mode
-g               Generate debug information (default)
-g0              Do not generate debug information
-I <dir>         Add a directory to the include search path
i++ Options: FPGA Related Options

Option               Description
--component <names>  Comma-separated list of function names to compile into components
--clock <spec>       Optimize the component RTL for the specified clock frequency or period
-ghdl                Enable full debug visibility and logging of all signals when the verification executable is run
--quartus-compile    Compiles the resulting HDL files using the Intel® Quartus® Prime software
--simulator <name>   Specify the simulator used for verification
--x86-only           Only create the executable for the testbench, no RTL or cosim support
--fpga-only          Create the FPGA component project, RTL, and cosim support, but no testbench binary

Example: i++ -march=<FPGA family> …
The Default Interfaces

component int add(int a, int b) {
    return a+b;
}

C++ Construct               HDL Interface
Scalar arguments            Conduits associated with the default start/busy interface
Pointer arguments           Avalon memory master interface
Global scalars and arrays   Avalon memory master interface

(Generated module signals: clock, start, busy, done, stall, a[31:0], b[31:0], returndata[31:0])

Note: more on interfaces later.
Example Makefile

FILE := myapp
DEVICE := Arria10

all:

gpp: $(FILE).cpp
	g++ $(GCFLAGS) $(FILE).cpp -o $(FILE).out

emu: $(FILE).cpp
	i++ $(GCFLAGS) $(FILE).cpp -o $(FILE)_emu.out

fpga: $(FILE).cpp
	i++ $(GCFLAGS) $(FILE).cpp -o $(FILE)_fpga.out -march=$(DEVICE)
x86 Debugging Tools: printf/cout, gdb, Valgrind

Develop with C/C++ (produces a GDB-compatible executable):
    i++ -march=x86-64 src.c    →  a.exe|a.out
Using printf()

Requires "HLS/stdio.h"
▪ Maps to <stdio.h>
Using printf(): Example

// test.cpp
#include "HLS/stdio.h"

void say_hello() {
    printf("Hello from the component\n");
}

int main() {
    printf("Hello from the testbench\n");
    say_hello();
    return 0;
}

Terminal commands and output:
$ i++ test.cpp
$ ./a.out
Hello from the testbench
Hello from the component
$ i++ test.cpp -march=Arria10 --component say_hello
$ ./a.out
Hello from the testbench
Debugging Using gdb

i++ integrates well with GNU gdb:
▪ Debug data is generated by default
  – Unlike g++, -g is enabled by default; use -g0 to turn off debug data
-march=x86-64 flow:
▪ Can step through any part of the code (including the component)
-march=<FPGA family> flow:
▪ Can step through the testbench only (the component runs in the RTL simulator)
gdb Example

// test.cpp
#include "HLS/hls.h"
#include "HLS/stdio.h"
…

Terminal commands and output:
$ i++ test.cpp -march=x86-64 -o test-x86
$ gdb ./test-x86
…
Debugging with Valgrind

"Valgrind is an instrumentation framework for building dynamic analysis tools."
▪ Valgrind tools can detect:
  – Memory leaks
  – Invalid pointer uses
  – Use of uninitialized values
  – Mismatched use of malloc/new vs. free/delete
  – Doubly freed memory
▪ Use it to debug the component and testbench in the x86 emulation flow
Simple Valgrind Example

Terminal commands and output:
$ i++ test.cpp
$ ./a.out
Count val: 1
…
Segmentation Fault

Example program:
// test.cpp
#include "HLS/stdio.h"
…

int bin_count(int *bins, int a) {
    return ++bins[a % 16];
}

int main() {
    int *bins = (int *) malloc(16 * sizeof(int));
    srand(0);
    for (int i = 0; i < 256; i++) {
        int x = rand();
        int res = bin_count(bins, x);
        printf("Count val: %d\n", res);
    }
    free(bins);
    return 0;
}
HLS Procedure: Cosimulation

1. Create the component and testbench in C/C++
2. Functional verification with g++ or i++ (iterate on the C/C++ source)
   • Use -march=x86-64
   • Both compilers are compatible with GDB
3. Compile with i++ -march=<FPGA family> to generate the HDL IP
4. Run Quartus® Prime compilation on the generated IP
   • Generate QoR metrics
5. Integrate the IP with the rest of your FPGA system
Example Component/Testbench Source

#include "HLS/hls.h"
…

$ i++ -march=…
Translation from C Function API to HDL Module

All component functions are synthesized to HDL:
▪ Each synthesized component is an independent HDL module
Component functions can be declared:
▪ Using the component keyword in the source
▪ Specifying "--component <names>" on the i++ command line
Cosimulation

Combines the x86 testbench with RTL simulation. The HDL code for the component (Verilog) runs in an RTL simulator; an RTL testbench is automatically created from the software main(), and everything else called from main runs on x86 as the testbench.
Communication uses the SystemVerilog Direct Programming Interface (DPI):
▪ Allows C/C++ to interface with SystemVerilog
▪ An inter-process communication (IPC) library passes testbench input data to the RTL simulator and returns the data back to the x86 testbench
Cosimulation: Verifying HLS IP

The Intel® HLS compiler automatically compiles and links the C++ testbench with an instance of the component running in an RTL simulator.
▪ To verify the RTL behavior of the IP, just run the executable generated by the HLS compiler targeting the FPGA architecture
  – Any call to the component function becomes a call into the simulator through DPI
Default Simulation Behavior

Function calls into the simulator are sequential by default.

#include "HLS/hls.h"
#include "stdio.h"

component int acc(int a, int b) {
    return a+b;
}

int main() {
    int x1, x2, x3;
    x1 = acc(1, 2);
    x2 = acc(3, 4);
    x3 = acc(5, 6);
    …
}
Streaming Simulation Behavior

Use enqueue function calls to stream data into the component.

#include "HLS/hls.h"
#include "stdio.h"

component int acc(int a, int b) {
    return a+b;
}

int main() {
    int x1, x2, x3;
    altera_hls_enqueue(&x1, &acc, 1, 2);
    altera_hls_enqueue(&x2, &acc, 3, 4);
    altera_hls_enqueue(&x3, &acc, 5, 6);
    altera_hls_component_run_all("acc");
    …
}
Viewing Component Waveforms

▪ Compile the design with the i++ -ghdl flag
  – Enables full visibility and logging of all HDL signals in simulation
▪ After cosimulation execution, the waveform is available at a.prj/verification/vsim.wlf
▪ Examine it with the ModelSim GUI:
  – vsim a.prj/verification/vsim.wlf
Viewing Waveforms in ModelSim

Locate the component, then add its signals to the waveform.
Cosimulation Design Process

Compile and verify on x86:
– Iterate on the algorithm
– Test functionality
– Functional verification of the design
– Debug using gdb/valgrind

Compile for FPGA:
– Examine the FPGA reports
– Use the reports as feedback on where the bottlenecks are
– Iterate on the architecture

Simulate using ModelSim:
– Test latency and performance (through verification stats)
Main HTML Report

The Intel® HLS Compiler automatically generates an HTML report that analyzes various aspects of your function, including area, loop structure, memory usage, and system data flow.
▪ Located at a.prj/reports/report.html
Many types of reports are available.
HTML Report: Summary

Overall compile statistics:
▪ FPGA resource utilization
▪ Compile warnings
▪ Quartus® fitter results
  – Available after Quartus compilation
▪ etc.
HTML Report: Loops

Serial loop execution hinders function dataflow circuit performance.
▪ Use the Loop Analysis report to see if and how each loop is optimized
  – Helps identify component pipeline bottlenecks

For each loop, the report answers:
– Unrolled? If yes: fully unrolled automatically, or partially unrolled via #pragma unroll?
– Pipelined? If yes: what is the initiation interval (the launch frequency of a new iteration)? Are there dependencies preventing the optimal II?
– If neither: what is the reason for serial execution?
Loop Unrolling

Loop unrolling: replicate hardware to execute multiple loop iterations at once.
▪ Simple loops are unrolled by the compiler automatically
▪ The user may use #pragma unroll to control loop unrolling
▪ The loop must not have a dependency from iteration to iteration

(Diagram: a serial loop executing Op 1 then Op 2 per iteration is unrolled so iterations 1…5 execute their operations side by side)
Loop Pipelining

Loop pipelining: launch loop iterations as soon as their dependencies are resolved.
▪ Initiation interval (II): the launch frequency (in cycles) of a new loop iteration
  – II=1 is optimally pipelined: no dependencies, or all dependencies can be resolved in 1 cycle

(Diagram: serial execution runs each iteration's Op 1, Op 2, Op 3 to completion before the next iteration starts; pipelined execution overlaps iterations i0…i3, launching a new iteration every II cycles)
HTML Report: Loop Analysis

The Loop Analysis report shows how loops are implemented, with the ability to correlate with source code. Examples:
▪ A compiler-added loop (not in the code): an implicit infinite loop allowing the component to run continuously in pipelined fashion
▪ A pipelined loop with II=1
▪ A pipelined loop with II=2 due to a memory dependency
▪ A fully unrolled loop, due to a user #pragma unroll
HTML Report: Area Analysis

View detailed estimated resource consumption by system or source line:
▪ Analyze data control overhead
▪ View the memory implementation
▪ Shows resource usage: ALUTs, FFs, RAMs, DSPs
▪ Identifies inefficient uses
HTML Report: Component Viewer

Displays an abstracted netlist of the HW implementation:
▪ View the data flow pipeline
  – Loads and stores
  – Interfaces, including stream reads and writes
  – Memory structure
  – Loop structure
  – Possible performance bottlenecks: unpipelined loops are colored light red; stallable points are red
Mouse over a node to see a tooltip and details. Correlates with source code.

HTML Report: Memory Viewer
Displays the local memory implementation and accesses:
▪ Visualize the memory architecture
  – Banks, widths, replication, etc.
▪ Visualize the load-store units (LSUs)
  – Stall-free?
  – Arbitration
  – Red indicates stallable
Mouse over a node to see a tooltip and details. Correlates with source code.

HTML Report: Verification Statistics
Reports execution statistics from the testbench run, available after the component is simulated (i.e., the testbench executable ran):
▪ Number and type of component invocations
▪ Latency of the component
▪ Dynamic initiation interval of the component
▪ Data rates of streams
Measurements are based on the latest execution of the testbench.
HLS Procedure: Integration

1. Create the component and testbench in C/C++
2. Functional verification with g++ or i++ (iterate on the C/C++ source)
   • Use -march=x86-64
   • Both compilers are compatible with GDB
3. Compile with i++ -march=<FPGA family> to generate the HDL IP
4. Run Quartus® Prime compilation on the generated IP
   • Generate QoR metrics
5. Integrate the IP with the rest of your FPGA system
Quartus® Generated QoR Metrics for IP

Use the Intel® Quartus® Prime software to generate quality-of-results reports:
▪ i++ creates the Quartus project in a.prj/quartus
▪ To generate QoR data (final resource utilization, fmax):
  – Run quartus_sh --flow compile quartus_compile
  – Or use the i++ --quartus-compile option
▪ The results become part of the HTML report
  – a.prj/reports/report.html, Summary page
Intel® Quartus® Software Integration

The a.prj/components directory contains all the files to integrate:
▪ One subdirectory for each component
  – Portable; can be moved to a different location if desired
Two use scenarios:
1. Instantiate in HDL
2. Add the IP to a Platform Designer system
HDL Instantiation

Add the components to the Intel® Quartus® Prime project.
Platform Designer System Integration Tool

(Diagram: a catalog of available IP (interface protocols, memory, DSP, embedded, bridges, PLLs) plus custom components and custom systems are connected into one design; Platform Designer accelerates development, simplifies integration, and automates integration tasks, generating the HDL)
Platform Designer Integration

A Platform Designer component is generated for each HLS component:
▪ For PD Standard: under a.prj/components/…
Platform Designer HLS Component Example

Example: a cascaded low-pass filter and high-pass filter, built from HLS components.
HLS-Backed Components

▪ A generic component can be used in place of the actual IP core
▪ Choose Implementation Type: HLS
  • Specify the HLS source files
  • Compile the component
  • Run cosim
  • Display the HTML report
Lab 2
OpenCL
Intel FPGA SDK for OpenCL™ Flow

A system-level view: the OpenCL host program runs on the host, while each kernel (foo.cl) is compiled into an HDL IP core and integrated, together with the Board Support Package, into the FPGA.

Kernel compiler:
▪ Optimized pipelines from C/C++
Board support package (created by a hardware developer):
▪ Timing closure, pinouts, periphery planning: we've got it covered
System integrator (Quartus runs behind the scenes):
▪ Optimized I/O interconnects
OpenCL

A hardware-agnostic compute language:
▪ Invented by Apple
▪ 2008: specification donated to the Khronos Group
▪ Now managed by the Khronos Group

OpenCL consists of the OpenCL C kernel language plus a C/C++ host API.

What does OpenCL™ give us?
▪ An industry-standard programming model
▪ Functional portability across platforms
▪ A well-thought-out specification
110 Heterogeneous Platform Model
(Diagram: the OpenCL platform model – a host attached to one or more devices, each device with its own global memory. Example x86 platform: host connected to devices over PCIe)
111 OpenCL Use Model: Abstracting the FPGA away
Host Code:
main() {
  read_data( … );
  manipulate( … );
  clEnqueueWriteBuffer( … );
  clEnqueueNDRange( …, sum, … );
  clEnqueueReadBuffer( … );
  display_result( … );
}
OpenCL Accelerator Code:
__kernel void sum( __global float *a, __global float *b, __global float *y )
{
  int gid = get_global_id(0);
  y[gid] = a[gid] + b[gid];
}
(Diagram: the host code compiles with standard gcc to an EXE for the host; the kernel code compiles with the Altera Offline Compiler – Quartus Prime generating Verilog underneath – to an AOCX for the accelerator)
112 OpenCL Host Program Pure software written in standard C/C++ languages Communicates with the accelerator devices via an API which abstracts the communication between the host processor and the kernels
main() {
  read_data_from_file( … );
  manipulate_data( … );
  clEnqueueWriteBuffer( … );      // copy data from host to FPGA
  clEnqueueNDRange( …, sum, … );  // tell the FPGA to run a particular kernel
  clEnqueueReadBuffer( … );       // copy data from FPGA to host
  display_result( … );
}
113 OpenCL Kernels
__kernel void sum( __global float *a,
                   __global float *b,
                   __global float *answer )
{
  int xid = get_global_id(0);
  answer[xid] = a[xid] + b[xid];
}
Kernel: data-parallel function ▪ Defines many parallel threads ▪ Each thread has an identifier specified by "get_global_id" ▪ Contains keyword extensions to specify parallelism and memory hierarchy
Executed by an OpenCL device ▪ CPU, GPU, FPGA
Example: float *a = 0 1 2 3 4 5 6 7, float *b = 7 6 5 4 3 2 1 0 → float *answer = 7 7 7 7 7 7 7 7
Code portable, NOT performance portable ▪ Between FPGAs it is!
114 Software Engineer’s View of an OpenCL System
Dataflow Processor Local memory (shallow, fast, random)
Host Compute Engines
Global Memory (deep, fast, bursting)
Device contains compute engines that run the kernel. Host talks to global memory through OpenCL routines. Global memory is large, fast, and likes to burst. Local memory is small, fast, and supports random access.
115 FPGA OpenCL Architecture
(Diagram: FPGA with external memory controllers & PHYs to DDR and PCIe to an Intel® Xeon® host processor; a global memory interconnect feeds multiple kernel pipelines, each with a BRAM-backed local memory interconnect)
Modest external memory bandwidth. Extremely high internal memory bandwidth. Highly customizable compute cores.
116 Start with a Reference Platform (1/2)
Two targets: Network Enabled (requirement: low latency) and High Performance Computing (requirement: compute power / memory bandwidth).
(Diagram: both boards layer the OpenCL API, HAL, UMD and KMD in software over a Stratix V FPGA architecture with CPLD, flash, DDR3 banks, DMA and PCIe holding the OpenCL kernels; the network-enabled board adds 10G UDP ports)
Global memory: DDR and QDRII+ vs. a large amount of DDR. IO channels: 2x10GbE (MAC/UOE) vs. none (minimize IP overhead).
117 Start with a Reference Platform (2/2) Host and accelerator in same package: SoC
(Diagram: the processor and FPGA share global memory (DDR); the OpenCL kernels also connect to scratch DDR, a camera over DVI and a monitor over DVO)
SoC boards range from $99 to >$1000.
118 Development Flow using SDK
Modify kernel.cl → x86 Emulator (seconds): functional bugs? → Optimization Report (seconds): memory dependencies? → Profiler (hours): hardware performance not met? → DONE!
119 Compiling Kernel
Run the Altera Offline Compiler at the command prompt ▪ aoc --board
/mydesigns/matrixMult$ aoc matrixMul.cl
aoc: Selected target board bittware_s5pciehq
Estimated Resource Usage Summary:
  Logic utilization          : 52%
  Dedicated logic registers  : 23%
  Memory blocks              : 31%
  DSP blocks                 : 54%
120 Executing the kernel: clCreateProgramWithBinary
fp = fopen("file.aocx", "rb");
fseek(fp, 0, SEEK_END);
lengths[0] = ftell(fp);
binaries[0] = (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]);
rewind(fp);
fread(binaries[0], lengths[0], 1, fp);
fclose(fp);
Host API call sequence (host.c, OpenCL.h): clGetPlatforms → cl_platform; clGetDevices → cl_device; clCreateContext → cl_context; clCreateCommandQueue → cl_command_queue; clCreateProgramWithBinary (with the .aocx bitstream bytes) → cl_program; clCreateKernel → cl_kernel; clEnqueueNDRangeKernel. With the offline compiler the OpenCL "program" is a pre-built bitstream, so clCreateProgramWithBinary replaces the usual clBuildProgram-from-source path.
121 Development Flow using SDK
Modify kernel.cl → x86 Emulator (seconds): functional bugs? → Optimization Report (seconds): memory dependencies? → Profiler (hours): hardware performance not met? → DONE!
122 Emulator – The Flow
Generate an emulation aocx:
kernel void convolution( global int *filter_coef, global int *input_image, global int *output_image ) {
  int grid = get_group_id(0);
  …
}
aoc -march=emulator conv.cl → conv.aocx
Run the host program with the emulator aocx ▪ Host compile does not change ▪ set CL_CONTEXT_EMULATOR_DEVICE_ALTERA=
c:\opencl>aoc -march=emulator conv.cl
c:\opencl>dir
host.exe conv.cl conv.aocx
c:\opencl>host.exe
running… done!
123 Printf
Can use printf within kernel on FPGA ▪ Adds some memory traffic overhead
In the emulator, printf runs on IA ▪ Useful for fast debug iterations
124 Development Flow using SDK
Modify kernel.cl → x86 Emulator (seconds): functional bugs? → Optimization Report (seconds): memory dependencies? → Profiler (hours): hardware performance not met? → DONE!
125 Optimization Report
Intel FPGA SDK for OpenCL provides a static report to identify performance bottlenecks when writing single-threaded kernels. Use -c to stop after generating the reports ▪ aoc -c
126 Optimization Report Example
127 Development Flow using SDK
Modify kernel.cl → x86 Emulator (seconds): functional bugs? → Optimization Report (seconds): memory dependencies? → Profiler (hours): hardware performance not met? → DONE!
128 Profiler – the flow 1. Generate program bitstream with profiling enabled
kernel void convolution( global int *filter_coef, global int *input_image, global int *output_image ) {
  int grid = get_group_id(0);
  …
}
aoc --profile conv.cl → conv.aocx
2. Run the host program with the instrumented aocx
c:\opencl>dir
host.exe conv.aocx
c:\opencl>host.exe
running… done!
c:\opencl>dir
host.exe conv.aocx profile.mon
3. Run the profiler GUI: aocl report
129 Dynamic Profiler
Intel FPGA SDK for OpenCL enables users to get runtime information about their kernel performance Bottlenecks, bandwidth, saturation, pipeline occupancy
(Profiler GUI screenshot: performance stats and execution times)
131 Execution of Threads on FPGA – Naïve Approach
Threads can be executed on replicated pipelines in the FPGA
kernel void add( global int* Mem ) { ... Mem[100] += 42*Mem[101]; }
132 Execution of Threads on FPGA – Naïve Approach
Threads can be executed on replicated pipelines in the FPGA
(threads t0, t1, t2 each occupy their own pipeline copy)
kernel void add( global int* Mem ) { ... Mem[100] += 42*Mem[101]; }
133 Execution of Threads on FPGA – Naïve Approach
Threads can be executed on replicated pipelines in the FPGA – Throughput = 1 thread per cycle – Area inefficient
(threads t0–t5 across clock cycles, one pipeline copy per parallel thread)
134 Execution of Threads on FPGA
Better method involves taking advantage of pipeline parallelism – Attempt to create a deeply pipelined implementation of the kernel – On each clock cycle, we attempt to send in a new thread
(threads t2, t1, t0 queued at the pipeline entrance)
kernel void add( global int* Mem ) { ... Mem[100] += 42*Mem[101]; }
135–139 Execution of Threads on FPGA
Better method involves taking advantage of pipeline parallelism – Attempt to create a deeply pipelined implementation of the kernel – On each clock cycle, we attempt to send in a new thread
kernel void add( global int* Mem ) { ... Mem[100] += 42*Mem[101]; }
(Animation across these slides: threads t0, t1, t2 enter on successive clock cycles and advance one pipeline stage per cycle until they drain out of the pipeline)
140 Execution of Threads on FPGA
Better method involves taking advantage of pipeline parallelism – Throughput = 1 thread per cycle
kernel void add( global int* Mem ) { ... Mem[100] += 42*Mem[101]; }
(threads t0–t5 occupy successive pipeline stages across clock cycles)
141 OpenCL on Intel FPGAs
Main assumption made in the previous OpenCL programming model – data-level parallelism exists in the kernel program. Not all applications are well suited to this assumption – some applications do not map well to data-parallel paradigms, yet data-parallel workloads are the only ones GPUs support well.
143 Data-Parallel Execution
On the FPGA, we use the idea of pipeline parallelism to achieve acceleration
kernel void sum( global const float *a, global const float *b, global float *c ) { int xid = get_global_id(0); c[xid] = a[xid] + b[xid]; }
(threads t0, t1, t2 flow through a Load, Load → + → Store pipeline)
Threads can execute in an embarrassingly parallel manner
144 Data-Parallel Execution - Drawbacks
Difficult to express programs which have partial dependencies during execution
kernel void sum( global const float *a, global const float *b, global float *c ) { int xid = get_global_id(0); c[xid] = c[xid-1] + b[xid]; }
(same Load, Load → + → Store pipeline, threads t0–t2)
Would require complicated hardware and new language semantics to describe the desired behavior
145 Solution: Tasks and Loop-Pipelining Allow users to express programs as a single thread
for (int i=1; i < n; i++) { c[i] = c[i-1] + b[i]; }
Pipeline parallelism is still leveraged to efficiently execute loops in Intel's FPGA OpenCL ▪ Parallel execution inferred by the compiler ▪ Loop pipelining
(iterations i=0, i=1, i=2 flow through a Load → + → Store pipeline)
146 Loop Carried Dependencies
Loop-carried dependencies are dependencies where one iteration of the loop depends upon the results of another iteration of the loop
kernel void state_machine(ulong n) {
  t_state_vector state = initial_state();
  for (ulong i = 0; i < n; i++) {
    …
  }
}
The variable state in iteration 1 depends on the value from iteration 0. Similarly, iteration 2 depends on the value from iteration 1, etc.
147 Loop Carried Dependencies
To achieve acceleration, we can pipeline each iteration of a loop containing loop carried dependencies – Analyze any dependencies between iterations – Schedule these operations – Launch the next iteration as soon as possible
148 Loop Pipelining Example
No loop pipelining: iterations i=0, i=1, i=2 run back-to-back – no overlap of iterations! With loop pipelining: iterations i=0 through i=5 are overlapped across clock cycles – looks almost like multi-threaded execution! Finishes faster because iterations are overlapped.
149 Parallel Threads vs. Loop Pipelining
So what's the difference? Parallel threads launch 1 thread per clock cycle in pipelined fashion (t0–t5); loop dependencies may not be resolved in 1 clock cycle (i=0–i=5). Loop pipelining enables pipeline parallelism *AND* the communication of state information between iterations.
150 Image Filter
(Diagram: memory in → 3×3 filter window F[3][3] → memory out)
151 Harnessing Dataflow to Reduce Memory Bandwidth
Data Movement in GPUs
Data is moved from the host over PCI Express. Instructions and data are constantly sent back and forth between host cache and memory and GPU memory ▪ Requires buffering larger data sets before passing to the GPU to be processed ▪ Significant latency penalty ▪ Requires high memory and host bandwidth ▪ Requires sequential execution of kernels
(Diagram: JPG → Uncompress → RGB → Image Filter → RGB* → Compress → JPG*, every stage staged through global memory)
153 Altera_Channels Extension
An FPGA has programmable routing. Can't we just send data across wires between kernels?
Advantages:
– Reduce memory bandwidth
– Lower latency through fine-grained synchronization between kernels
– Reduce complexity (wires are trivial compared to memory access) o Lower cost, lower area, higher performance
– Enable modular dataflow design through small kernels exchanging data
– Different workgroup sizes and degrees of parallelism in connected modules
(Diagram: JPG → Uncompress → RGB → Image Filter → RGB* → Compress → JPG*, stages now linked by channels instead of global memory)
154 Data Movement in FPGAs
FPGA allows for result reuse between instructions. Ingress/egress to custom functions is 100% flexible. Multiple memory banks of various types directly off the FPGA – Algorithms can be architected to minimize buffering to external memory or host memory – Multiple optional memory banks can be used to allow simultaneous access
(Diagram: Kernel 1 → Kernel 2 → Kernel 3 inside the FPGA, flanked by optional memory banks and 100G / PCIe / SRIO / USB interfaces)
Example: Multi-Stage Pipeline
An algorithm may be divided into multiple kernels: – Modular design patterns – Partition the algorithm into kernels with different sizes and dimensions – Algorithm may naturally split into both single-threaded and NDRange kernels
Generating random data for a Monte Carlo simulation:
kernel void rng(int seed) {          // Single-Threaded
  int r = seed;
  while (true) {
    r = rand(r);
    write_channel_altera(RAND, r);
  }
}
kernel void sim(...) {               // NDRange
  int gid = get_global_id(0);
  int rnd = read_channel_altera(RAND);
  out[gid] = do_sim(data, rnd);
}
156 Traditional Data Movement Without Channels
(Diagram: DDR behind a memory controller; DMA → Kernel → DDR → Kernel → DMA, with PCIe to host system memory)
157 Data Movement Using Channels
(Diagram: kernels connected directly by FIFOs for data in/out, bypassing DDR; DMA and PCIe still available)
158 Data Movement Using Host Channels
(Diagram: DMA streams data between host memory and a kernel directly over PCIe)
159 An Even Closer Look: FPGA Custom Architectures
Kernel Replication with num_compute_units using OpenCL ▪ Step #1: Design an efficient kernel ▪ Step #2: How can we scale it up?
kernel void PE() { … }   // processing element (task-based kernel)
160 Kernel Replication With Intel® FPGA SDK for OpenCL
Attribute to specify a 1-dim or 2-dim array of kernels. Added API to identify each kernel in the array:
__attribute__((num_compute_units(4,4)))
kernel void PE() {
  row = get_compute_id(0);
  col = get_compute_id(1);
  …
}
Processing elements (task-based kernels) form a 4×4 grid. Compile-time constants allow the compiler to specialize each PE.
161 Kernel Replication With Intel® FPGA SDK for OpenCL
Topology can be expressed with software constructs ▪ Channel connections specified through compute IDs
channel float4 ch_PE_row[4][4];
channel float4 ch_PE_col[4][4];
channel float4 ch_PE_row_side[4];
channel float4 ch_PE_col_side[4];
__attribute__((num_compute_units(4,4)))
kernel void PE() {
  row = get_compute_id(0);
  col = get_compute_id(1);
  float4 a, b;
  if (row == 0) a = read_channel(ch_PE_col_side[col]);
  else          a = read_channel(ch_PE_col[row-1][col]);
  if (col == 0) …
}
162 Matrix Multiply in OpenCL
Every PE / feeder is a kernel. Communication via OpenCL channels. Data-flow network model.
Software control: – Compute unit granularity – Spatial locality – Interconnect topology – Data movement – Caching – Banking
(Diagram: Load A / Load B feeders streaming into a PE array over channels, with a Drain C interconnect back to DDR4)
Performance: ~1 TFLOPs
163 Traditional CNN
Inew(x, y) = Σx′=−1..1 Σy′=−1..1 Iold(x + x′, y + y′) × F(x′, y′)
A filter (3D space) slides over the input feature map (a set of 2D images) to produce the output feature map; repeat for multiple filters to create multiple "layers" of the output feature map.
164 CNN On FPGA
Want to minimize accessing external memory. Want to keep resulting data between layers on the device and between computations. Want to leverage reuse of the hardware between computations.
Parallelism in the depth of the kernel window and across output features. Defer complex spatial math to random access memory. Re-use hardware to compute multiple layers.
165 Efficient Parallel Execution of Convolutions
▪ Parallel Convolutions – Different filters of the same convolution layer processed in parallel in different processing elements (PEs) ▪ Vectored Operations – Across the depth of the feature map ▪ PE array geometry can be customized to the hyperparameters of a given topology
(Diagram: external DDR feeding double-buffered filters in on-chip RAM; filter parallelism across the output depth)
Design Exploration with Reduced Precision
Tradeoff between performance and accuracy ▪ Reduced precision allows more processing to be done in parallel ▪ Using a smaller floating-point format does not require retraining of the network ▪ FP11 benefit over using INT8/9 – No need to retrain, better performance, less accuracy loss
FP16: sign, 5-bit exponent, 10-bit mantissa
FP11: sign, 5-bit exponent, 5-bit mantissa
FP10: sign, 5-bit exponent, 4-bit mantissa
FP9: sign, 5-bit exponent, 3-bit mantissa
FP8: sign, 5-bit exponent, 2-bit mantissa
167 Lab 3
168 DSP Builder Advanced Blockset
169 The Mathworks* Design Environment
▪ Matlab* – Algorithm development and analysis – High-level technical computing language – Simple C-like language – Efficient with vectors and matrices – Built-in mathematical functions – Interactive environment for algorithm development – 2D/3D graphing tool for data visualization
▪ Simulink* – Model-based design – Hierarchical block diagram design & simulation tool – Digital, analog/mixed signal & event-driven – Visualize signals – Integrated with MATLAB*
(Diagram: a validated MATLAB*/Simulink* design flows to third-party DSP/embedded software tools and EDA tools, targeting hardware: DSP, control, software)
170 DSP Builder for Intel® FPGAs
Enables MathWorks* Simulink for Intel FPGA design – a device-optimized Simulink* DSP blockset ▪ Key features: – High-level design exploration – HW-in-the-loop verification – IP generation for Intel® Quartus SW / Platform Designer
171 FPGA Design Flow – Traditional
Development (system-level design and simulation – system engineer, MATLAB*/Simulink* tools), implementation (HDL coding, DSP IP – hardware engineer, Precision*/Synplify* SW, Intel® Quartus® Prime SW), verification (RTL simulation, hardware verification – verification engineer, ModelSim* tools, development kits)
172 FPGA Design Flow – DSP Builder for Intel® FPGAs
DSP Builder for Intel® FPGAs spans algorithm-level modeling through system-level verification: a single Simulink* representation drives synthesis and RTL simulation across the same tool chain.
173 Core Technologies
▪ IP (ready-made) library – Multi-rate, multi-channel filters – Waveform synthesis (NCO/DDS/mixers) ▪ Custom IP creation using the primitive library – Vectorization – Zero latency – Scheduled – Aligned ▪ Automatic pipelining ▪ Automatic folding and resource sharing ▪ Multichannel designs with automatic vectorization ▪ Avalon® Memory-Mapped and Streaming interfaces – RTL generation ▪ Design exploration across device families ▪ System integration – Platform Designer – Processor integration ▪ High-performance floating-point designs ▪ System-in-the-loop accelerated simulation
174 Advanced Blockset – High Performance DSP IP
Over 150 device-optimized DSP building blocks for Intel® FPGAs ▪ DSP building blocks ▪ Interfaces ▪ IP library blocks ▪ Primitives library blocks – Math and basic blocks ▪ Vector and complex data types
175 Build Custom FFTs from FFT Element Library
▪ Quickly build DSP designs using complete FFT IP functions from the FFT Library ▪ Build custom radix-2² FFTs using blocks from the FFT Element Library
FFT IP Library: FFT, FFT_float, VFFT, VFFT_float, BitReverseCoreC, VariableBitReverse, Floating-Point Twiddle Gen
FFT Element Library: FFT Pruning and Twiddle, Bit vector combine, Butterfly Unit, Choose Bits, Dual Twiddle Memory, Edge Detect, Crossover Switch
176 Filter and Waveform Synthesis Library
DSP Builder includes a comprehensive waveform IP library ▪ Automatic resource sharing based on sample rate ▪ Support for super-sample-rate architectures
IP implementations: FIR (half-band, L-band, symmetric, decimating, fractional rate, interpolation, single-rate, super sample rate), CIC (decimating, interpolating, super sample rate), Mixer (complex, real, super sample rate), NCO (super sample rate, multi-bank)
177 Library is Technology Independent
▪ Target device using a Device block ▪ Same model generates optimized RTL for each FPGA and speed grade
178 Datapath Optimization for Performance
Automatic timing-driven synthesis of the model – based on the specified device and clock frequency
Optimizations: Pipelining (inserts registers to improve Fmax), Algorithmic retiming (moves registers to balance pipelining), Bit growth management (manages bit growth for fixed-point designs), Multi-rate optimizations (optimizes hardware based on sample rate)
(Diagram: before/after datapath A → B → C showing bit growth and retiming)
Custom IP Generation
Textbook-based design entry: a tapped-delay-line FIR (coefficient multiplies b0…b7, z⁻¹ delays, adder chain). Model primitive features: • Vector support • Parameterizable • Zero-latency block • ALU folding. Describe what to do, not when to do it.
180 ALU Design Folding Improves Area Efficiency
Optimizes hardware usage for low-throughput designs ▪ Arranges one of each resource in a central arithmetic logic unit (ALU) fashion ▪ Folding factor = clock rate / data rate ▪ Performed when folding factor > 500
(Diagram: multiplies A, B, C time-division multiplexed onto one multiplier under FSM control)
TDM Resource Sharing
Clock rate = sample rate: two parallel F(.) units between READ and WRITE. Clock rate = 2× sample rate: one F(.) unit is time-shared – SERIALIZE, compute, DESERIALIZE on TDM_CLK.
182 TDM Design: Trade-Off Example
49-tap symmetric single-rate FIR filter, Stratix 10 resources:
Clock 72 MHz, sample 72 MSPS: 898 LUT4s, 26 mults, 0 memory bits, TDM factor 1
Clock 144 MHz, sample 72 MSPS: 1082 LUT4s, 14 mults, 0 memory bits, TDM factor 2
Clock 288 MHz, sample 72 MSPS: 741 LUT4s, 8 mults, 0 memory bits, TDM factor 4
Clock 72 MHz, sample 36 MSPS: 1082 LUT4s, 14 mults, 0 memory bits, TDM factor 2
183 2 Antenna DUC Reference Design
Clock rate = 179.2 MHz. Chain: ChannelFIR (ChanCount 4, output 11.2 MSPS, period 16) → Interpolate2FIR (output 22.4 MSPS, period 8) → Interpolate4FIR (output 89.6 MSPS, period 2) → Deinterleaver (splits 4 channels into 2 complex I/Q channels at 89.6 MSPS) → ComplexMixer (2 complex channels; I′ = I·cos − Q·sin, Q′ = I·sin + Q·cos) with sin/cos from an NCO (2 complex channels, 89.6 MSPS, sine sinA1/sinA2, cosine cosA1/cosA2). Reference design included with DSP Builder.
184 Changing the Design without DSP Builder
▪ Tedious and time consuming ▪ Channel count = 8, 16, 32 ▪ Clock rate = 2x, 4x
Specification: SampleRate = 11.2, ChanCount = 8
185 Changing the Design with DSP Builder
▪ Modifications done in minutes ▪ Design still looks the same
Specification: SampleRate = 11.2, ChanCount = 8
186 Five Design Iterations < 1 Hour
Arria® 10, 6 channel: requested clock 250 MHz, actual Fmax (slow model, 85C) 351, 10 multipliers (18x18), 686 registers, 0 kbits block memory
Arria 10, 6 channel: requested 450, Fmax 458, 6 multipliers, 465 registers, 0 kbits
Arria 10, 12 channel: requested 450, Fmax 458, 10 multipliers, 818 registers, 0 kbits
Stratix® 10, 6 channel: requested 450, Fmax 524, 6 multipliers, 1267 registers, 0 kbits
Stratix 10, 12 channel: requested 450, Fmax 484.5, 10 multipliers, 1863 registers, 25.8 kbits
187 Generates Reusable IP for Platform Designer
▪ Platform Designer is the system integration environment for Intel® FPGAs ▪ DSP Builder designs are fully compatible with the Platform Designer IP catalog ▪ Integrate with other FPGA IPs – Processors – State machines – Streaming interfaces ▪ Design reuse fully supported
188 Typical Design Flow
Identify the system architecture, design filters and choose the desired Fmax and device. Set the top-level system parameters in the MATLAB® software using the ‘params’ file – number of channels, performance, etc.
Build the system using the Advanced Blockset tool. Simulate the design using Simulink® and ModelSim® tools. Target the right FPGA family and compile. As system design specs change, edit the ‘params’ file and repeat.
189 Design Flow – Create Model
Create a new blank model. Select New Model Wizard from the DSP Builder menu.
190 Top Level Testbench
The top level of a DSPB-AB design is a testbench. Must include Control and Signals blocks.
191 Design Flow – Synthesizable Model
Enter the design in the subsystem. The Device block marks the top level of the FPGA.
192 Design Flow – ModelIP Blocks
Filters Library – Single rate, multi-rate, and fractional rate FIR filters – Decimating and interpolating cascaded integrator comb (CIC) filters
Waveform Synthesis Library – Real and complex mixer – Numerically controlled oscillator (NCO)
Note: Supports super-sample-rate (data rate > system clock frequency) interpolate-by-2 filters.
Note: The NCO block supports frequency hopping (each channel can hop to a different frequency from a pool of frequencies)
193 Design Flow – ModelPrim Blocks
ChannelIn and ChannelOut blocks delineate the boundary of a synthesizable primitive subsystem. Add a SynthesisInfo block to control pipelining and latency and to view resource usage of the subsystem.
194 Design Flow – Parameterize the Design
C-structure-like template. Runs when the model is opened or a simulation is run.
195 Design Flow – Processor Interface
Drop memory and registers into the design. ModelIPs have a built-in memory-mapped interface to control registers and coefficient registers.
196 Design Flow – Running Simulink Simulation
Creates files in the location specified by the Control block ▪ VHDL code ▪ Timing constraints file (.sdc) ▪ DSPB-AB subsystem Quartus® IP file
197 Design Flow – Documentation Generation
Get accurate resource utilization of all modules right after simulation, without place & route: DSP Builder > Resource Usage, DSP Builder > View Address Map
198 Design Verification
The Run ModelSim block loads the design into the ModelSim simulator for RTL simulation.
199 Design Flow – System Integration
Add the subsystem from the Component pick list.
200 Using FPGAs Just Got Easier
Gap: creating full-stack accelerated applications on FPGA is difficult and time consuming.
Increase abstraction: FPGA accelerator SW applications (loadable workloads), software frameworks, prebuilt accelerator functions and libraries, orchestration / rack management.
Increase ease of use: the Open Programmable Acceleration Engine (OPAE) provides a standard C API to a standardized FPGA interface manager provided for a specific board; pre-built low-level FPGA management solutions – FPGA Interface Manager with standard I/O interfaces, OS driver, FPGA IO interfaces (ecosystem) – on the Intel® FPGA Programmable Accelerator Card (PAC).
* Other names and brands may be claimed as the property of others.
202 Acceleration Stack for Intel® Xeon® CPU with FPGAs
Comprehensive architecture for data center deployments – rack-level solutions.
Faster time to revenue ▪ Fully validated Intel® board ▪ Standardized frameworks and high-level compilers ▪ Partner-developed workload accelerators
Simplified management ▪ Supported in VMware vSphere* 6.7 Update 1* ▪ Rack management and orchestration framework integration
Broad ecosystem support ▪ Upstreaming FPGA drivers to the Linux* kernel ▪ Qualified by industry-leading server OEMs ▪ Partnering with IP partners, OSVs, ISVs, SIs, and VARs
Stack layers: user applications; industry standard software frameworks; acceleration libraries; Intel developer tools (Intel Parallel Studio XE, Intel FPGA SDK for OpenCL™, Intel Quartus® Prime); acceleration environment (Intel Acceleration Engine with OPAE Technology, FPGA Interface Manager (FIM)); OS & virtualization environment; Intel® hardware.
* Demonstrated at VMWorld Las Vegas, August 28-30, 2018. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
203 Acceleration Stack Provides FPGA Orchestration in Cloud/Data Center
IP store of workload accelerators: Intel developed IP, end-user developed IP, 3rd-party developed IP (Workload 1 … Workload N). Public and private cloud/datacenter users launch workloads; FPGA-enabled orchestration software and software-defined infrastructure place workloads onto a virtualized resource pool (storage, network, compute) with static/dynamic FPGA programming (secure Xeon VM + FPGA).
Server Virtualization for the Acceleration Stack with VMware
Out-of-the-box support from VMware for the Intel Arria 10 PAC and Acceleration Stack in the upcoming vSphere 6.7 U1. Server virtualization enables customers to deploy FPGA workload acceleration with a lower total cost of ownership (server: Intel Xeon + Intel Arria 10 Programmable Acceleration Card with Acceleration Stack, accelerator IP, compute solution stack).
205 Migrating FPGA-Accelerated Workload with vMotion*
1. Run the application on bare metal. 2. Implement on the VMware ESXi* hypervisor. 3. PVRDMA# connects the application to a remote Intel® FPGA PAC / FPGA device. 4. Use vMotion* to move the application (e.g. an image inference workload) from one CPU + FPGA server to another.
Server composition with Lenovo xClarity Pod Manager* and VMware vSphere*; continuous application acceleration during vMotion – an industry first.
# Unoptimized, proof-of-concept code. Not part of a shipping product. See supplementary slide for system configuration details.
* Other names and brands may be claimed as the property of others.
Components of Acceleration Stack: Overview
▪ User Application – developed by the user (domain expert)
▪ User, Intel, and 3rd-party libraries (tuning expert)
▪ Open Programmable Acceleration Engine (OPAE) and drivers – provided by Intel
▪ FPGA Interface Manager (FIM) – PCIe* signaling and management, provided by Intel
▪ Accelerator Functional Unit (AFU) – user, Intel, or 3rd-party IP; plugs into the AFU slot (tuning expert)
▪ Intel FPGA Programmable Acceleration Card – qualified and validated for volume deployment, provided by OEMs
▪ Intel® Xeon® CPU, connected to the FPGA over PCIe

PAC with Intel® Arria® 10 FPGA
• Low-profile (half-length, half-height) PCIe* slot card: 168 mm × 56 mm, maximum component height 14.47 mm, PCIe ×16 mechanical
• 128 MB flash for storage of the FPGA configuration
• Board Management Controller (BMC): server-class monitoring system, accessed via USB or PCIe
• USB 2.0 port for board firmware update and FIM image recovery
• QSFP+ slot accepts pluggable optical modules
• 2 banks of DDR4-2133 SDRAM, 4 GB each (64-bit data, 8-bit ECC), 8 GB total
• Powered from the PCIe +12 V rail: 70 W total board power, 45 W FPGA power
• PCIe x8 Gen3 connectivity to the Intel® Xeon® host

PAC with Intel® Stratix® 10 FPGA
• ¾-length, full-height, dual-slot PCIe* card
• 128 MB flash for storage of the FPGA configuration and BMC firmware
• Board Management Controller (BMC): server-class monitoring system, accessed via USB or PCIe
• USB 2.0 port for board firmware update and FIM image recovery
• 2x QSFP+ slots accept pluggable optical modules, up to 100GbE each
• 4 banks of DDR4-2400 SDRAM, 8 GB each (64-bit data, 8-bit ECC), 32 GB total
• Powered from the PCIe +12 V rail: 225 W total board power
• PCIe Gen3 x16 connectivity to the Intel® Xeon® host

Nearly Transparent Software Application Use Model
Object model (Properties → Token → Handle → Objects):
1. Discover / search for a resource
2. Acquire ownership of the resource
3. Map AFU registers to user space
4. Allocate / define shared memory
5. Start / stop computation on the AFU and wait for the result
Teardown mirrors setup: deallocate shared memory, unmap MMIO, relinquish ownership, and (optionally) reconfigure the AFU.

Enumeration and Discovery
fpgaEnumerate() matches devices (objtype FPGA_DEVICE or FPGA_ACCELERATOR, e.g. AFU_ID 0xabcdef) against the requested properties:

    fpga_properties prop;
    fpga_token token;
    fpga_guid myguid;   /* 0xabcdef */

    fpgaGetProperties(NULL, &prop);
    fpgaPropertiesSetObjectType(prop, FPGA_ACCELERATOR);
    fpgaPropertiesSetGUID(prop, myguid);
    fpgaEnumerate(&prop, 1, &token, 1, &n);
    fpgaDestroyProperties(&prop);

Acquire and Release Accelerator Resource
fpgaOpen() turns a token into a handle:

    fpga_token token;
    // ... enumeration ...
    fpga_handle handle;

    fpgaOpen(token, &handle, 0);
    /* ... use the accelerator ... */
    fpgaClose(handle);

Memory-Mapped I/O
fpgaMapMMIO(…, &mmio_ptr) maps the AFU control registers into the (virtual) address space of the software application process, alongside its TEXT, DATA, and BSS segments; through libopae-c, the registers are then accessed with fpgaReadMMIO() and fpgaWriteMMIO().

Management and Reconfiguration
A software application with admin privilege loads a green bitstream (GBS) file (e.g. xyz.gbs) from storage and partially reconfigures the AFU slot with fpgaReconfigureSlot(…, buf, len, 0), replacing the AFU (e.g. AFU_ID 0xabcdef → 0xbe11e5). The GBS metadata carries the interface_id and afu_id:

    fpga_handle handle;  /* handle to device */
    FILE *gbs_file;
    void *gbs_ptr;
    size_t gbs_size;

    /* Read bitstream file */
    gbs_ptr = malloc(gbs_size);
    fread(gbs_ptr, 1, gbs_size, gbs_file);

    /* Program GBS to FPGA */
    fpgaReconfigureSlot(handle, 0, gbs_ptr, gbs_size, 0);

Where to Get AFUs for the FPGA
Self-developed:
▪ Higher productivity: C/C++ with the Intel® HLS Compiler or the Intel® FPGA SDK for OpenCL™
▪ Performance-optimized: VHDL or Verilog
Externally sourced:
▪ Intel® reference designs, contracted engagements, ecosystem partners

Growing the Xeon+FPGA Ecosystem
▪ IP and solutions: a portfolio of accelerator solutions developed by Intel and third-party technologists to expedite application development and deployment
▪ Developer community: enabling software developer access via Intel Builder programs, the AI Academy, the Intel Developer Zone (IDZ), and Rocketboards.org; committed to an open-source vision
▪ Universities: reaching over 200,000 students per year with FPGA publications, workshops, and hands-on research labs
▪ ISV partners: expanding the reach for system vendors with platforms and ready-to-use application workloads

Growing List of Accelerator Solution Partners
Easing development and data center deployment of Intel FPGAs for workload optimization: data analytics, finance, genomics, media transcoding, cyber security, AI.

Intel PAC Top Solutions for Data Center Acceleration
▪ Cassandra: 96% latency reduction
▪ PostgreSQL: ½ TCO
▪ Genomics (GATK): 2.5X performance
▪ JPEG2Lepton / JPEG2Webp: 3-4X performance
▪ Big data streaming analytics: 5X performance
▪ Financial (Black-Scholes): 8X performance
▪ Network security/monitoring: 3X performance

Customer Application: Big Data
Applications running on Spark/Kafka platforms.
Current solution: run Spark/SQL on a cluster of CPUs.
Challenge: for many applications in financial services, genomics, intelligence agencies, etc.,
Spark performance does not meet customers' SLA requirements, especially for delay-sensitive streaming workloads.

Customer Application: Risk Management
Acceleration framework for risk management (financial back-testing).
Current solution: deploy a cluster of CPUs or GPUs with complex data access.
Challenge: traditional risk-management methods are compute-intensive, time-consuming applications – 10+ hours for financial back-testing.

Leverage FPGA Developers and Build Your Own
Two development flows target the same platform:
▪ HDL programming: HDL sources (or C sources first translated by the Intel® HLS Compiler) are synthesized and placed-and-routed by Intel® Quartus Prime Pro into an AFU image, while the host software is built by a SW compiler into an exe. The AFU application runs on OPAE software over the FIM, on CPU + FPGA, and can first be verified in the AFU Simulation Environment (ASE); both ASE and OPAE come from Intel.
▪ OpenCL programming: host code goes through the SW compiler and kernels through the OpenCL compiler, producing an exe and an AFU image; the OpenCL emulator plays the role ASE plays in the HDL flow, and deployment again runs on OPAE software over the FIM, on CPU + FPGA.

AFU Overview Flow
AFU RTL → ASE simulation (test & validate, together with the compiled OPAE SW application) → hardware system (Intel® Xeon® + FPGA) via Quartus® compilation of the AF.
The AFU Simulation Environment (ASE) enables seamless portability to real hardware:
▪ Allows fast verification of OPAE software together with the AF RTL, without hardware – the SW application loads the ASE library and connects to the RTL simulation
▪ For execution on hardware, the application loads the runtime library instead, and the RTL is compiled by Intel® Quartus into an FPGA bitstream

FPGA Components of Acceleration Stack
▪ FPGA Interface Unit (FIU): PCIe* interface and Core Cache Interface (CCI)
▪ Partial Reconfiguration (PR) region containing the Accelerator Functional Unit (AFU)
▪ EMIF local memory interfaces to DDR4 (additional banks on the Stratix 10 PAC card)
▪ High Speed Serial Interface (HSSI) to QSFP+: 10Gb/40Gb (100Gb on the Stratix 10 PAC card)
* Could be other interfaces in the future (e.g. UPI)

AFU Development Flow Using OPAE SDK
Steps: specify the platform configuration → design the AFU → specify the build configuration → generate the ASE build environment → verify the AFU with ASE.
▪ The AFU requests the ccip_std_afu top-level interface classes: $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/hello_afu.json
▪ AFU RTL files implement the accelerated function: $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/afu.sv
▪ List all source files and the platform configuration file: $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/filelist.txt
In a terminal window, enter these commands:

    cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu
    afu_sim_setup --source hw/rtl/filelist.txt build_sim

Compile the AFU and platform simulation models and start the simulation server process:

    cd build_sim
    make
    make sim

In a 2nd terminal window, compile the host application and start the client process:

    export ASE_WORKDIR=$OPAE_PLATFORM_ROOT/hw/samples/hello_afu/build_sim/work
    cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/sw
    make clean
    make USE_ASE=1
    ./hello_afu

AFU Simulation Environment (ASE)
A hardware/software co-simulation environment for Intel Xeon + FPGA development. It uses the simulator Direct Programming Interface (DPI) for HW/SW connectivity:
▪ Not cycle accurate (used for functional correctness)
▪ Converts SW API calls into CCI transactions
It provides a transactional model for the Core Cache Interface (CCI-P) protocol and a memory model for the FPGA-attached local memory, and validates compliance with:
▪ The CCI-P protocol specification
▪ The Avalon® Memory-Mapped (Avalon-MM) Interface Specification
▪ The Open Programmable Acceleration Engine

Simulation Complete
The AFU simulator window acts as the server; the application SW window is the client.

AFU Development Flow Using OPAE SDK (cont.)
After verifying the AFU with ASE, generate the AF build environment and build the AF:

    cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu
    afu_synth_setup --source hw/rtl/filelist.txt build_synth
    cd build_synth
    $OPAE_PLATFORM_ROOT/bin/run.sh

Using the Quartus GUI
Compiling the AFU uses a command-line-driven PR compilation flow that builds the PR-region AF as a .gbs file to be loaded onto the OPAE hardware platform. The Quartus GUI can still be used for the following types of work:
▪ Viewing compilation reports
▪ Interactive timing analysis
▪ Adding SignalTap instances and nodes

Lab 3

Getting Started with Acceleration
Deployment flow:
1. Buy a server with a PAC – server OEM (e.g. Dell)
2. Install the server OS – OS vendor website (e.g. CentOS, RHEL)
3. Download & install the deployment package of the Acceleration Stack – Intel website
4. Download & install the workload – vendor website
5. Write the host application
Development flow:
1. Download & install the developer package of the Acceleration Stack – Intel website
2. Download & install HLS or OpenCL
3. Download & install a simulator (optional)
4. Create & simulate the workload

Getting Qualified Hardware is Step 1
▪ Now: Dell PowerEdge* R640, R740, R740xd, R840, R940xa
▪ Now: PRIMERGY* RX2540 M4
▪ Available soon: HPE ProLiant* DL360, DL380
▪ And more coming…

Programmable Acceleration Cards (PAC)
▪ Intel® Arria® 10 accelerator card – broadest deployment at lowest power: 40G, PCIe* Gen3 x8; ½-length, ½-height, single-slot PCIe card; lowest power, 66 W TDP
▪ Intel Stratix® 10 accelerator card – highest performance and throughput: 2x 100G, PCIe Gen3 x16; ¾-length, full-height, dual-slot PCIe card; up to 225 W maximum

The Intel® portal for all things related to FPGA acceleration:
• Acceleration Stack for Intel® Xeon® with FPGAs
• FPGA Acceleration Platforms
• Acceleration Solutions & Ecosystem
• Knowledge Center
• FPGA as a Service
• 01.org* (* 01.org is an open source community site)

Follow-On Courses
https://www.intel.com/content/www/us/en/programmable/support/training/overview.html
Introduction to Cloud Computing
Introduction to High Performance Computing (HPC)
Introduction to Apache™ Hadoop
Introduction to Apache Spark™
Introduction to Kafka™
Introduction to Intel® FPGAs for Software Developers
Introduction to the Acceleration Stack for Intel® Xeon® CPU with FPGAs
Application Development on the Acceleration Stack for Intel® Xeon® CPU with FPGAs
Building RTL Workloads for the Acceleration Stack for Intel® Xeon® CPU with FPGAs
OpenCL™ Development with the Acceleration Stack for Intel® Xeon® CPU with FPGAs
Intel FPGA OpenCL and HLS trainings

Teaching Resources
University-focused content & curriculum:
▪ Semester-long laboratory exercises for hands-on learning, with solutions
▪ Tutorials and online workshops for self-study on key use cases
▪ Free library of IP common for student projects
▪ Example designs and sample projects
Easy-to-use, powerful software tools:
▪ Quartus Prime CAD environment
▪ ModelSim
▪ Intel FPGA Monitor Program for assembly & C development
▪ Intel® SDK for OpenCL™
Applications
▪ Intel OpenVINO™ toolkit (Visual Inference & Neural Network Optimization)

Teaching Resources (cont.)
Hardware designed for education:
▪ 4 different FPGA kits with a variety of peripherals to match project needs
▪ Compact designs with robust shielding to provide longevity
▪ Reduced academic prices (range: $55-$275)
▪ Donations available in some circumstances
Support:
▪ Total access to all developer resources – documentation, design examples, support forum, virtual or on-demand trainings

DE-Series Development Boards
▪ DE10-Standard – Cyclone V FPGA + SoC – $259
▪ DE1-SoC – Cyclone V FPGA + SoC – $175
▪ DE10-Nano – Cyclone V FPGA + SoC – $99
▪ DE10-Lite – Max 10 FPGA – $55
Visit our website for full specs on these boards; see the full catalog of Intel FPGA boards & kits at www.terasic.com

Board comparison (DE10-Lite / DE10-Nano / DE1-SoC / DE10-Standard):
Academic price:           $55 / $99 / $175 / $259
FPGA:                     Max® 10 / Cyclone® V / Cyclone® V / Cyclone® V
Logic elements:           50,000 / 110,000 / 85,000 / 110,000
SoC (dual-core ARM Cortex-A9): – / 800 MHz / 925 MHz / 925 MHz
Memory:                   64 MB SDRAM / 1 GB DDR3 SDRAM (HPS) / 1 GB DDR3 SDRAM (HPS) + 64 MB SDRAM (FPGA) / 1 GB DDR3 SDRAM (HPS) + 64 MB SDRAM (FPGA)
PLLs:                     4 / 9 / 9 / 9
GPIO count:               500 / 469 / 469 / 469
7-segment displays:       6 / – / 6 / 6
Switches:                 10 / 4 / 10 / 10
Buttons:                  2 / 2 / 4 / 4
LEDs:                     10 / 8 / 10 / 10
Clocks:                   2x 50 MHz / 3x 50 MHz / 4x 50 MHz / 4x 50 MHz
GPIO headers:             40-pin header / 2x 40-pin header / 2x 40-pin header / 40-pin header
Video out:                VGA 12-bit DAC / HDMI / VGA 24-bit DAC / VGA 24-bit DAC
ADC channels:             8 / – / 8 + programmable voltage range / 8 + programmable voltage range
Video in:                 – / – / NTSC, PAL, multi-format / NTSC, PAL, multi-format
Audio in/out:             – / – / line in/out, microphone in (24-bit audio CODEC) / line in/out, microphone in (24-bit audio CODEC)
Ethernet:                 – / Gigabit / 10/100/1000 Ethernet (x1) / 10/100/1000 Ethernet (x1)
USB:                      – / 1x USB OTG / 2x USB 2.0 (Type A) / 2x USB 2.0 (Type A)
LCD:                      – / – / – / 128x64 backlit
Micro SD card support:    – / ✓ / ✓ / ✓
Accelerometer:            ✓ / ✓ / ✓ / ✓
PS/2 mouse/keyboard port: – / – / ✓ / ✓
Infrared:                 – / – / ✓ / ✓
HSMC header:              – / – / – / ✓
Arduino header:           ✓ / ✓ / – / –

Undergrad Lab Exercise Suites: Digital Logic
The first digital hardware course in an EE, CompEng, or CS curriculum; traditionally introduced sophomore year. Offered in VHDL or Verilog.
Lab 1 - Switches, Lights, and Multiplexers
Lab 2 - Numbers and Displays
Lab 3 - Latches, Flip-flops, and Registers
Lab 4 - Counters
Lab 5 - Timers and Real-Time Clock
Lab 6 - Adders, Subtractors, and Multipliers
Lab 7 - Finite State Machines
Lab 8 - Memory Blocks
Lab 9 - A Simple Processor
Lab 10 - An Enhanced Processor
Lab 11 - Implementing Algorithms in Hardware
Lab 12 - Basic Digital Signal Processing

Undergrad Lab Exercise Suites: Comp Organization
Typically the second hardware course in an EE, CompEng, or CS curriculum: an introduction to microprocessors & assembly-language programming. Uses the ARM processor (on the SoC kits) or the Nios II soft processor, with the Intel FPGA Monitor Program for compiling & debugging assembly & C code.
Lab 1 - Using an ARM Cortex-A9 System or Nios II System
Lab 2 - Using Logic Instructions with the ARM Processor
Lab 3 - Subroutines and Stacks
Lab 4 - Input/Output in an Embedded System
Lab 5 - Using Interrupts with Assembly Code
Lab 6 - Using C Code with the ARM Processor
Lab 7 - Using Interrupts with C Code
Lab 8 - Introduction to Graphics and Animation

Intel FPGA Monitor Program
A design environment used to compile, assemble, download & debug programs for the ARM* Cortex*-A9 processor in Intel's Cyclone® V SoC FPGA devices:
▪ Compile programs, specified in assembly language or C, and download the resulting machine code into the hardware system
▪ Display the machine code stored in memory
▪ Run the ARM processor, either continuously or by single-stepping
instructions
▪ Modify the contents of processor registers
▪ Modify the contents of memory, as well as memory-mapped registers in I/O devices
▪ Set breakpoints that stop the execution of a program at a specified address, or when certain conditions are met
Clean and simple UX. Tutorials at fpgauniversity.intel.com. Download independently or as part of the University Program Installer (always free!).

Undergrad Lab Exercise Suites: Embedded Systems
Typically the third hardware course in an EE, CompEng, or CS curriculum; combines hardware and software, with an introduction to embedded Linux.
Lab 1 - Getting Started with Linux
Lab 2 - Developing Linux Programs that Communicate with the FPGA
Lab 3 - Character Device Drivers
Lab 4 - Using Character Device Drivers
Lab 5 - Using ASCII Graphics for Animation
Lab 6 - Introduction to Graphics and Animation
Lab 7 - Using the ADXL345 Accelerometer
Lab 8 - Audio and an Introduction to Multithreaded Applications

Lab Exercise Suites: Machine Learning Basics
Machine Learning on FPGAs: a senior- or grad-level course in an EE, CompEng, CS, or data science curriculum. Teaches how to use the Intel® SDK for OpenCL™ Applications with FPGAs; a basic understanding of AI fundamentals is recommended.*
Lab 1 – Introduction to OpenCL
Lab 2 – Image Processing
Lab 3 – Lane Detection for Autonomous Driving
Lab 4 – Linear Classifier for Handwritten Digits
Lab 5 – Neural Networks
Lab 6 – Using the Deep Learning Accelerator Library
Lab 7 – Integrating OpenCL Accelerators into Existing Software
* For foundational AI & machine learning curricula, visit our partner program, the Intel AI Academy

AI Academy Course Outline
Runs in the cloud on an Arria 10 PAC card. Contains slides, lab exercises, and recordings for each class.
https://software.intel.com/en-us/ai-academy/students/kits/dl-inference-fpga
Class 1 - Introduction to FPGAs for deep learning inferencing
Class 2 - Building a deep learning computer vision application w/ acceleration using a DL framework
Class 3 - Introduction to the OpenVINO™ toolkit
Class 4 - Introduction to the Deep Learning Accelerator Suite for Intel FPGAs
Class 5 - Introduction to the Acceleration Stack for Intel Xeon CPU with FPGAs
Lab 1 - Deploy an application on an Intel CPU using a DL framework
Lab 2 - Deploy an application on an Intel CPU using the OpenVINO toolkit
Lab 3 - Accelerate the application on an Intel FPGA

In-Person Workshops
Throughout the year our technical outreach team visits universities and industry conferences around the world to conduct hands-on workshops that train professors and students on how to use Intel FPGAs for education and research. Topics:
Intro to FPGAs and Quartus (4 hrs.)
Embedded Design using Nios II (4 hrs.)
High-Speed IO (4 hrs.)
High-Level Synthesis (4 hrs.)
Static Timing Analysis of Digital Circuits (4 hrs.)
Machine Learning Acceleration (4 hrs.)
Simulation & Debug (4 hrs.)
Embedded Linux (4 hrs.)
Modern Applications of FPGAs (1 hr.)
How to Get Hired in the Tech Industry (1 hr.)
Contact us at [email protected] to inquire about scheduling a workshop

Find Materials: FPGAUniversity.INTEL.com
Membership: FPGAUniversity.INTEL.com

Contact the University Team
Rebecca Nevin – Outreach Manager, Intel FPGA University Program – [email protected]
Larry Landis – Senior Manager, New User Experience Group – [email protected]

GPU Comparison
How do GPUs Deal With Fine-Grained Data Sharing?
Some GPU techniques involve implicit SIMT synchronization. FPGA threads aren't warp-locked, so implicit synchronization doesn't make sense:
▪ FPGAs do exactly what you ask them to do, the way you code it

An Even Closer Look: CUDA Execution Model
The CUDA scheduler maps an NDRange of data onto grids, thread blocks, warps, and threads:

                             FERMI   FERMI   KEPLER  KEPLER  MAXWELL
                             GF100   GF104   GK104   GK110   GM107
                             SM      SM      SMX     SMX     SMM
    Compute Capability       2.0     2.1     3.0     3.5     5.0
    Shared Memory/SM         48KB    48KB    48KB    48KB    64KB
    32-bit Registers/SM      32768   32768   64K     64K     64K
    Max Threads/Thread Block 1024    1024    1024    1024    1024
    Max Thread Blocks/SM     8       8       16      16      32
    Max Threads/SM           1536    1536    2048    2048    2048
    Threads/Warp             32      32      32      32      32
    Max Warps/SM             48      48      64      64      64
    Max Registers/Thread     63      63      63      255     255

FPGA Execution Model
A single block of data flows through custom instructions; multiple blocks of data flow through multiple copies of the custom instructions – and all execute in parallel.

Divergent Control Flow on GPU
In a loop such as for (i = 0; i < N; i++) with a data-dependent branch on x[i], the GPU uses the SIMT pipeline to save area on control logic: a branch mask (mask = (x[i] < …)) selects which lanes take Path A and which take Path B, and the divergent paths execute serially.

Divergent Control Flow: Just Fine for FPGA
The FPGA data path already has all operations in silicon:
▪ Speculatively execute both Path A and Path B, and overlap the branch-condition computation with them
▪ Absorb the branch into one block and compress the schedule – there is no longer any control flow

Memory Hierarchy
1. Register data: registers in FPGA fabric
2. Private data: registers in FPGA fabric
3. Local memory: on-chip RAMs
4. Global memory: off-chip external memory

Dynamic Coalescing
For a CPU/GPU, the cache and memory controller handle external memory access. For an FPGA, we create dynamic coalescing hardware matched to the specific memory characteristics it is connected to:
– Re-order memory accesses at runtime to exploit data locality
– DDR is extremely inefficient at random access
– Access with row bursts whenever possible

On-chip FPGA Memory
"Local" memory uses on-chip block RAM resources:
– Very high bandwidth (8 TB/s)
– Random access in 2 cycles
– Limited capacity
The memory system is customized to your application – a huge value proposition over fixed-architecture accelerators. The banking configuration (number of banks, width) and the interconnect are all customized for your kernel, and automatically optimized to eliminate or minimize access contention.
Key idea: let the compiler minimize bank contention.
– If your code is optimized for another architecture (e.g. array[tid + 1] to avoid bank collisions), undo the fixed-architecture workarounds
– They can prevent the optimal structure from being inferred

FPGA Local Memory
Local memory is built from M20K blocks arranged as banks (Bank0 … Bank7) behind an arbitration network serving the load/store units. Split memory into logical banks:
▪ An N-bank configuration can handle N requests per clock cycle, as long as each request addresses a different bank
▪ Manipulate memory addresses so that parallel threads are likely to access different banks – reduce collisions

Local Memory Attributes
Annotations added to local memory variables to improve throughput or reduce area.
Banking control:
– numbanks
– bankwidth
Port control:
– numreadports / numwriteports
– singlepump / doublepump

numbanks(N) and bankwidth(N) memory attribute
What does it do? Specifies the banking geometry for your local memory system (a bank = a single independent memory system).
What is it for? Can be used to optimize LSU-to-memory connectivity in an effort to boost performance. Banking should be set up to maximize "stall-free" accesses.

Not stall-free – both LSUs can target the same memory, so an arbitration network is required:

    local int lmem[8][4];

    #pragma unroll
    for (int i = 0; i < 4; i += 2) {
        lmem[i][x] = …;
    }

Stall-free – with 8 banks of 16 bytes each, every row of lmem is its own bank (Bank 0 … Bank 7), so the two unrolled accesses always hit different banks. Masking the column index tells the compiler there are no out-of-bounds accesses:

    local int __attribute__((numbanks(8),
                             bankwidth(16)))
        lmem[8][4];

    #pragma unroll
    for (int i = 0; i < 4; i += 2) {
        lmem[i][x & 0x3] = …;
    }

numreadports/numwriteports and singlepump/doublepump memory attribute
What does it do? Specifies the number of read and write ports of the local memory system, and whether it is clocked at the kernel clock rate (singlepump) or at twice that rate (doublepump).
What is it for? Controls the number of memory blocks used to implement the local memory system.

    // Single-pumped: three read ports and one write port are
    // implemented by replicating the data across three M20K blocks.
    local int __attribute__((singlepump,
                             numreadports(3),
                             numwriteports(1)))
        lmem[16];

    // Double-pumped: the same ports fit in a single M20K block,
    // because each block services two accesses per kernel clock.
    local int __attribute__((doublepump,
                             numreadports(3),
                             numwriteports(1)))
        lmem[16];