Bill Jenkins, Intel Programmable Solutions Group

Agenda

9:00 am  Welcome
9:15 am  Introduction to FPGAs
9:45 am  FPGA Programming Models: RTL
10:15 am FPGA Programming Models: HLS
11:00 am Lab 1: HLS Flow
11:45 am Lunch
12:30 pm FPGA Programming Models: OpenCL
1:00 pm  High Performance Data Flow Concepts
1:30 pm  Lab 2: OpenCL Flow
2:15 pm  Introduction to DSP Builder
3:00 pm  Introduction to the Acceleration Stack
4:00 pm  Lab 3: Acceleration Stack
4:30 pm  Curriculum & University Program Coordination

The Problem: Flood of Data

By 2020, the average internet user will generate ~1.5 GB of traffic per day

Smart hospitals will be generating over 3 TB per day
Self-driving cars will be generating over 4,000 GB per day… each
A connected plane will be generating over 40 TB per day
A connected factory will be generating over 1 PB per day

Sensor data rates: radar ~10-100 KB per second, sonar ~10-100 KB per second, GPS ~50 KB per second, lidar ~10-70 MB per second, cameras ~20-40 MB per second

All numbers are approximate.
Sources:
http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172

Typical HPC Workloads

Astrophysics, Genomics / Bio-Informatics, Artificial Intelligence, Molecular Dynamics*, Big Data Analytics, Financial, Weather & Climate, Cyber Security

* Source: https://comp-physics-lincoln.org/2013/01/17/molecular-dynamics-simulations-of-amphiphilic-macromolecules-at-interfaces/

Fast Evolution of Technology

We now have the compute to solve these problems in near real-time

Bigger Data
– Image: 50 MB / picture
– Audio: 5 MB / song
– Video: 47 GB / movie

Better Hardware
– Transistor density doubles every 18 months
– Cost / GB in 1995: $1000.00
– Cost / GB in 2015: $0.03

Smarter Algorithms
– Advances in neural networks leading to better accuracy in training models

50+ Years of Moore's Law

Computing has Changed…

The Urgency of Parallel Computing

If engineers keep building processors the way we do now, CPUs will get even faster but they’ll require so much power that they won’t be usable.

—Patrick Gelsinger, former Intel Chief Technology Officer, February 7, 2001
Source: http://www.cnn.com/2001/tech/ptech/02/07/hot.chips.idg/

Implications to High Performance Computing

(Chart: 50 GFLOPS/W, ~100 MW, 2022)

Challenges Scaling Systems to Higher Performance

CPU Intensive → Result: slow performance, excessive power requirements
Memory Intensive → Result: memory bottleneck
IO Intensive → Result: system I/O bottleneck (high latency)

Need to think about compute offload as well as ingress/egress processing

Diverse Application Demands

The Intel Vision

Heterogeneous Systems: ▪ Span from CPU to GPU to FPGA to dedicated devices with consistent programming models, languages, and tools

CPUs, GPUs, FPGAs, ASSPs

Heterogeneous Computing Systems

Modern systems contain more than one kind of processor
▪ Applications exhibit different behaviors:
– Control intensive (searching, parsing, etc.)
– Data intensive (image processing, data mining, etc.)
– Compute intensive (iterative methods, financial modeling, etc.)
▪ Gain performance by using the specialized capabilities of different types of processors

Separation of Concerns

Two groups of developers:
▪ Domain experts concerned with getting a result
– Host application developers leverage optimized libraries
▪ Tuning experts concerned with performance
– Typical FPGA developers that create optimized libraries

The Intel® Math Kernel Library is a simple example of raising the level of abstraction to the math operations:
▪ Domain experts focus on formulating their problems
▪ Tuning experts focus on vectorization and parallelization

FPGA Enabled Performance and Agility

Workload Optimization: ensure cores serve their highest value processing

Efficient Performance: improve performance/watt

Real-Time: high bandwidth connectivity and low-latency parallel processing

Developer Advantage: code re-use across Intel FPGA data center products

FPGAs enhance CPU-based processing by accelerating algorithms and minimizing bottlenecks

FPGAs Provide Flexibility to Control the Data Path

Compute Acceleration/Offload ▪ Workload agnostic compute ▪ FPGAaaS ▪ Virtualization

Intel® Xeon® Processor

Inline Data Flow Processing
▪ Machine learning
▪ Object detection and recognition
▪ Advanced driver assistance systems (ADAS)
▪ Gesture recognition
▪ Face detection

Storage Acceleration
▪ Machine learning
▪ Cryptography
▪ Compression
▪ Indexing

FPGA Architecture

Field Programmable Gate Array (FPGA)
▪ Millions of logic elements
▪ Thousands of embedded memory blocks
▪ Thousands of DSP blocks
▪ Programmable interconnect
▪ High speed transceivers
▪ Various built-in hardened IP

Used to create custom hardware out of logic modules and programmable routing switches!

FPGA Architecture: Basic Elements

Basic Element: a 1-bit configurable operation plus a 1-bit register (to store the result)

Configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB

FPGA Architecture: Flexible Interconnect

Basic Elements are surrounded with a flexible interconnect

FPGA Architecture: Flexible Interconnect

Wider custom operations are implemented by configuring and interconnecting Basic Elements

FPGA Architecture: Custom Operations Using Basic Elements

16-bit add

32-bit sqrt
Your custom 64-bit bit-shuffle and encode

Wider custom operations are implemented by configuring and interconnecting Basic Elements

FPGA Architecture: Memory Blocks

Memory Block: 20 Kb, with addr, data_in, and data_out ports

Can be configured and grouped using the interconnect to create various cache architectures: lots of smaller caches, or a few larger caches

FPGA Architecture: Floating-Point Multiplier/Adder Blocks

Dedicated floating-point multiply and add blocks

DSP Blocks

Thousands of DSP blocks in modern FPGAs
▪ Configurable to support multiple features
– Variable-precision fixed-point multipliers
– Adders with accumulation register
– Internal coefficient register bank
– Rounding
– Pre-adder to form tap-delay line for filters
– Single-precision floating-point multiplication, addition, accumulation

FPGA Architecture: Configurable Routing

Blocks are connected into a custom data-path that matches your application.

FPGA Architecture: Configurable IO

The custom data-path can be connected directly to custom or standard IO interfaces for inline data processing

FPGA I/Os and Interfaces

FPGAs have flexible IO features to support many IO and interface standards
▪ Hardened memory controllers
– Available interfaces to off-chip memory such as HBM, HMC, DDR SDRAM, QDR SRAM, etc.
▪ High-speed transceivers
▪ PCIe* Hard IP
▪ Phase-locked loops

*Other names and brands may be claimed as the property of others

Intel® FPGA Product Portfolio

Wide range of FPGA products for a wide range of applications

Product families range from non-volatile, low-cost, single-chip, small-form-factor devices, through low-power cost-sensitive and midrange (cost/power/performance balance) families, up to high-performance, state-of-the-art FPGAs

▪ Product features differ across families
– Logic density, embedded memory, DSP blocks, transceiver speeds, IP features, process technology, etc.

Mapping a Simple Program to an FPGA

High-level code:
Mem[100] += 42 * Mem[101]

CPU instructions:
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]

First let's take a look at execution on a simple CPU

(Diagram: a general-purpose CPU data-path with PC/fetch logic, register file, ALU, and load/store units)

Fixed and general architecture:
– General "cover-all-cases" data-paths
– Fixed data-widths
– Fixed operations

Looking at a Single Instruction

(Same data-path; a single instruction exercises only a small part of it)

Very inefficient use of hardware!

Sequential Architecture vs. Dataflow Architecture

Sequential CPU Architecture vs. FPGA Dataflow Architecture:

(Diagram: the CPU reuses shared resources over time to execute load, load, 42, multiply, add, store; the FPGA dataflow architecture lays the same operations out spatially as a pipeline)

Custom Data-Path on the FPGA Matches Your Algorithm!

High-level code

Mem[100] += 42 * Mem[101]

Custom data-path: build exactly what you need
– Operations
– Data widths
– Memory size & configuration

Efficiency: throughput / latency / power

Advantages of Custom Hardware with FPGAs

▪ Custom hardware!
▪ Efficient processing
▪ Fine-grained parallelism
▪ Low power
▪ Flexible silicon
▪ Ability to reconfigure
▪ Fast time-to-market
▪ Many available I/O standards

(Die diagram: DSP blocks, M20K blocks, transceiver channels and PCS, I/O PLLs, PLLs, memory controllers, PCIe* IP, IOs, core logic)

*Other names and brands may be claimed as the property of others

RTL

FPGA Development and Programming Tools

Personas span from software developer to hardware developer:
▪ Software Developer: Intel® SoC FPGA Embedded Design Suite (EDS), Intel® FPGA SDK for OpenCL
▪ Algorithm Designer: DSP Builder, Intel® HLS Compiler
▪ IP Library Developer / HDL Designer: Verilog, VHDL

All flows build on the Intel® Quartus® Prime Design Software

Verilog, VHDL, and the Intel® FPGA SDK for OpenCL are currently supported by the Acceleration Stack. High-Level Synthesis can be used manually by following an application note.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

Traditional FPGA Design Entry

Circuits are described using Hardware Description Languages (HDL) such as VHDL or Verilog

A designer must describe the behavior of the algorithm to create a low-level digital circuit
▪ Logic, registers, memories, state machines, etc.

Design times range from several months to even years!

Traditional FPGA Design Flow

Time-Consuming Effort

Behavioral Simulation

HDL Synthesis

Place & Route / Timing Analysis / Timing Closure

Board Simulation & Test

Intel® Quartus® Prime Design Software

Default Operating Environment

Project Navigator

Tool View window

Tasks window IP Catalog

Messages window

Intel® Quartus® Prime Design Software Projects

Description
▪ Collection of related design files & libraries
▪ Must have a designated top-level entity
▪ Targets a single device
▪ Settings stored in the software settings file (.qsf)
▪ Compiled netlist information stored in the qdb folder in the project directory

Create new projects with the New Project Wizard
▪ Can also be created using Tcl scripts

Intel® FPGA Design Store

Download complete example design templates for specific development kits: https://cloud.altera.com/devstore/platform/

Design examples include design files, device programming files, and software code as required

Install .par files and select as template in the New Project Wizard

Device Selection

Choose device family & family category
Filter the device list (transceiver options, SoC options, etc.)
Choose the specific part from the list

Tcl: set_global_assignment -name FAMILY "<device family name>"
Tcl: set_global_assignment -name DEVICE <part number>

Chip Planner

Graphical view of:
▪ Layout of device resources
▪ Routing channels between device resources
▪ Global clock regions

Uses:
▪ View placement of design logic
▪ View connectivity between resources used in design
▪ Make placement assignments
▪ Debug placement-related issues

Access: Tools menu or toolbar

(Window panes: device floorplan (aka Chip View), Report window, Selected Node Properties, Tasks window, Layers Settings; the floorplan shows, for example, a memory block in use and an unused LAB)

Floorplan Views

Overall device resource usage

Lower level block usage

Lowest level routing detail

Zoom in for detailed logic implementation & routing usage

Pin Planner

Interactive graphical tool for assigning pins
▪ Drag & drop pin assignments
▪ Set pin I/O standards
▪ Reserve future I/O locations

Access: Assignments menu → Pin Planner, toolbar, or Tasks window

Default window panes
▪ Package View
▪ All Pins list
▪ Groups list
▪ Tasks window
▪ Report window

Pin Planner Window

(Window panes: Groups list, Toolbar, Package View, Tasks pane, All Pins list)

The Programmer

Tools menu → Programmer

State Machine Editor

File menu → New, or Tasks window; select State Machine File (.smf)

Create state machines in the GUI
▪ Manually, by adding individual states, transitions, and output actions
▪ Automatically, with the State Machine Wizard (Tools menu & toolbar)

Double-click states & transitions to edit properties: name, equations, actions

Generate state machine HDL code (required)
▪ VHDL
▪ Verilog
▪ SystemVerilog

Platform Designer

Components in a system use different interfaces to communicate, some standard, some non-standard (e.g. a 32-bit processor master and a 64-bit PCI Express* master addressing 8-, 16-, 32-, and 64-bit slaves)

A typical system requires significant engineering work to design custom interface logic: bus interfaces, interrupt controller, address decoder, arbiter, width adapters

Integrating design blocks and intellectual property (IP) is tedious and error-prone

Automatic Interconnect Generation

Avoids error-prone integration

Saves development time with automatic logic & HDL generation

Enables you to focus on value-add blocks

Platform Designer improves productivity by automatically generating the system interconnect logic (address decoder, arbiter, width adapters, bus interfaces)

The Platform Designer GUI

Access in Tools menu, toolbar, or Tasks window

Draggable, detachable tabs

(Panes: IP Catalog, System Contents, Hierarchy, Messages)

Lab 1

High Level Synthesis

Can Also Be Wrapped With Higher Level Flows

C and C++ functions feed the Intel® HLS Compiler, which produces RTL IP for Platform Designer

The Software Programmer's View

Programmers develop in mature software environments
– Ideas can easily be expressed in languages such as 'C'
– Typically start with a simple sequential program
– Use parallel APIs / language extensions to exploit multiple cores for additional performance
– Compilation times are almost instantaneous
– Immediate feedback
– Rich debugging tools

High Level Design is the Bridge Between HW & SW

100x more software engineers than hardware engineers: key to wide-spread adoption of FPGAs in the datacenter

Debugging software is much faster than hardware
Many functions are easier to specify in software than RTL
Simulation of RTL takes thousands of times longer than software

Design exploration is much easier and faster in software

We need to raise the level of abstraction (Transistors → RTL → Software, with increasing productivity and abstraction)
▪ Similar to what assembly programmers did with C over 30 years ago
– (Today) Abstract away FPGA design with higher-level languages
– (Today) Abstract away FPGA hardware behind platforms
– (Tomorrow) Leverage pre-compiled libraries as software services

HLS Use Model

(Diagram: one C/C++ code base (src.c, lib.h): a standard gcc/g++ compile of main() and its call tree produces an EXE, while functions marked with directives go through the HLS compiler to produce HDL IP for the Intel® Quartus® ecosystem)

100% Makefile compatible

Intel® HLS Compiler

Targets Intel® FPGAs
Command-line executable: i++
Builds an IP block
▪ To be integrated into a traditional FPGA design using FPGA tools

Flow: C/C++ source → HLS Compiler → IP → Platform Designer

Leverages a standard C/C++ development environment

Goal: the same performance as hand-coded RTL with 10-15% more resources

HLS Procedure

1. Create component and testbench in C/C++
2. Functional verification with g++ or i++ (functional iterations on the C/C++ source)
   • Use -march=x86-64
   • Both compilers are compatible with GDB
3. Compile with i++ -march=<FPGA family> for HLS (architectural iterations)
   • Generates IP
   • Examine the compiler-generated reports
   • Verify the design in simulation
4. Run Intel® Quartus® Prime compilation on the generated IP
   • Generate QoR metrics
5. Integrate the IP with the rest of your FPGA system

Intel® HLS Compiler Usage and Output

Develop with C/C++ (a GDB-compatible executable):
src.c + lib.h → i++ -march=x86-64 src.c → a.exe|out

Run the compiler for HLS:
src.c + lib.h → i++ -march=<FPGA family> --component func src.c →
▪ a.exe|out: an executable whose calls to func run in a simulation of the synthesized IP
▪ a.prj/components/func/: all the files necessary to include the IP in a Quartus project (.qsys, .ip, .v, etc.)
▪ a.prj/reports/: component hardware implementation reports
▪ a.prj/verification/: simulation testbench
▪ a.prj/quartus/: Quartus project to compile all IP

a is the default output name; the -o option can be used to specify a non-default output name

HLS Procedure: x86 Emulation

(The HLS procedure flow again, highlighting step 2: functional verification with g++ or i++ using -march=x86-64; both compilers are compatible with GDB)

Simple Example Program: i++ and g++ Flow

Example program (test.cpp):

// test.cpp
#include <stdio.h>
int main() {
  printf("Hello world\n");
  return 0;
}

Terminal commands and outputs (using the default -march=x86-64):

$ g++ test.cpp
$ ./a.out
Hello world
$
$ i++ test.cpp
$ ./a.out
Hello world
$

g++ Compatibility

The Intel HLS Compiler is command-line compatible with g++
▪ Similar command-line flags, x86 behavior, and compilation flow
▪ Changing "g++" to "i++" should just work
▪ x86 behavior should match g++
– Except for integer promotion (discussed later)
▪ No source modifications required (for x86 mode)
▪ Support for GNU Makefiles

i++ Options: g++ Compatible Options

Option | Description
-h | Display help information
-o <name> | Specify a non-default output name
-c | Instructs the compiler to generate the object files and not the executable
-march=<arch> | Compile for architecture: x86-64 (default) or an FPGA family
-v | Verbose mode
-g | Generate debug information (default)
-g0 | Do not generate debug information
-I<dir> | Add <dir> to the include path
-D<macro>[=<val>] | Define <macro> with <val> or 1
-L<dir> -l<library> | Library search directory and library name when linking

Example: i++ -march=x86-64 myfile.cpp -o myexe

i++ Options: FPGA Related Options

Option | Description
--component <components> | Specify a comma-separated list of function names to be synthesized to RTL
--clock <target> | Optimizes the RTL for the specified clock frequency or period
-ghdl | Enable full debug visibility and logging of all signals when the verification executable is run
--quartus-compile | Compiles the resulting HDL files using the Intel® Quartus® Prime software
--simulator <simulator> | Specify the simulator used for verification; "none" skips testbench generation
--x86-only | Only create the executable for the testbench; no RTL or cosim support
--fpga-only | Create the FPGA component project, RTL, and cosim support, but no testbench binary

Example: i++ -march=<FPGA family> --component mycomp --clock 400MHz myfile.cpp

There are many other optimization options available; see the Intel HLS Compiler Reference Manual

The Default Interfaces

component int add(int a, int b) {
  return a+b;
}

C++ construct → HDL interface:
▪ Scalar arguments → conduits associated with the default start/busy interface
▪ Pointer arguments → Avalon memory-mapped master interface
▪ Global scalars and arrays → Avalon memory-mapped master interface

(Generated module ports: start, busy, done, stall, a[31:0], b[31:0], returndata[31:0], clock)

Note: more on interfaces later
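The defaults can be overridden with explicit interface classes. As a minimal sketch, assuming the ihc::stream_in/ihc::stream_out types from "HLS/hls.h" (the component itself is hypothetical), a component can expose Avalon streaming ports instead of argument conduits:

#include "HLS/hls.h"

// Hypothetical component: Avalon-ST style streaming ports instead of
// scalar conduits; read() and write() are blocking.
component void scale_stream(ihc::stream_in<int> &in,
                            ihc::stream_out<int> &out) {
  int x = in.read();   // pop one value from the input stream
  out.write(x * 2);    // push the result to the output stream
}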

Example Makefile

FILE := myapp
DEVICE := Arria10

all: gpp emu fpga

gpp: $(FILE).cpp
	g++ $(GCFLAGS) $(FILE).cpp -o $(FILE).out

emu: $(FILE).cpp
	i++ $(GCFLAGS) $(FILE).cpp -o $(FILE)_emu.out

fpga: $(FILE).cpp
	i++ $(GCFLAGS) $(FILE).cpp -o $(FILE)_fpga.out -march=$(DEVICE)

x86 Debugging Tools: printf/cout, gdb, Valgrind

Develop with C/C++ (a GDB-compatible executable):
src.c + lib.h → i++ -march=x86-64 src.c → a.exe|out

Using printf()

Requires "HLS/stdio.h"
▪ Maps to <stdio.h> when appropriate

Can be included in the testbench or the component
▪ Used with no limitations in the x86 emulation flow

printf statements inside the component are ignored for HDL generation
▪ Ignored in the cosimulation flow with an HDL simulator

Using printf(): Example

Example program (test.cpp):

// test.cpp
#include "HLS/stdio.h"

void say_hello() {
  printf("Hello from the component\n");
}

int main() {
  printf("Hello from the testbench\n");
  say_hello();
  return 0;
}

Terminal commands and output:

$ i++ test.cpp
$ ./a.out
Hello from the testbench
Hello from the component
$
$ i++ test.cpp -march=Arria10 --component say_hello
$ ./a.out
Hello from the testbench
$

Debugging Using gdb

i++ integrates well with GNU gdb
▪ Debug data is generated by default
– Unlike g++, -g is enabled by default; use -g0 to turn off debug data

The -march=x86-64 flow:
▪ Can step through any part of the code (including the component)

The -march=<FPGA family> flow:
▪ Can step through testbench code
▪ gdb does not see the component-side execution (that runs in an HDL simulator)

gdb Example

Example program (test.cpp):

// test.cpp
#include "HLS/hls.h"
#include "HLS/stdio.h"

component void say_hello() {
  printf("Hello from the component\n");
}

int main() {
  printf("Hello from the testbench\n");
  say_hello();
  return 0;
}

Terminal commands and output:

$ i++ test.cpp -march=x86-64 -o test-x86
$ gdb ./test-x86
………
(gdb)

$ i++ test.cpp -march=Arria10 -o test-fpga
$ gdb ./test-fpga
………
(gdb)

Debugging with Valgrind

"Valgrind is an instrumentation framework for building dynamic analysis tools."
▪ Valgrind tools can detect:
– Memory leaks
– Invalid pointer uses
– Use of uninitialized values
– Mismatched use of malloc/new vs free/delete
– Doubly freed memory
▪ Use it to debug the component and testbench in the x86 emulation flow

Simple Valgrind Example

Example program (test.cpp):

    // test.cpp
 1  #include "HLS/stdio.h"
 2  #include <stdlib.h>
 3
 4  int bin_count (int *bins, int a) {
 5    return ++bins[a];
 6  }
 7
 8  int main() {
 9    int *bins = (int *) malloc(16 * sizeof(int));
10    srand(0);
11    for (int i = 0; i < 256; i++) {
12      int x = rand();
13      int res = bin_count(bins, x);
14      printf("Count val: %d\n", res);
15    }
16    return 0;
17  }

Terminal commands and output:

$ i++ test.cpp
$ ./a.out
Segmentation Fault
$ valgrind --leak-check=full --show-reachable=yes ./a.out
……
==9744== Invalid read of size 4
==9744==    at 0x4006B3: bin_count(int*, int) (test.cpp:5)
==9744==    by 0x400723: main (test.cpp:13)
==9744==  Address 0x1b31075dc is not stack'd, malloc'd or (recently) free'd
==9744== Process terminating with default action of signal 11 (SIGSEGV)
==9744==  Access not within mapped region at address 0x1B31075DC
==9744==    at 0x4006B3: bin_count(int*, int) (test.cpp:5)
==9744==    by 0x400723: main (test.cpp:13)
……
==9744== 64 bytes in 1 blocks are still reachable in loss record 1 of 1
==9744==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==9744==    by 0x4006ED: main (test.cpp:9)
……
Segmentation fault

Valgrind: Segmentation Fault Fixed

int bin_count (int *bins, int a) {
  return ++bins[a % 16];
}

int main() {
  int *bins = (int *) malloc(16 * sizeof(int));
  srand(0);
  for (int i = 0; i < 256; i++) {
    int x = rand();
    int res = bin_count(bins, x);
    printf("Count val: %d\n", res);
  }
  free(bins);
  return 0;
}

HLS Procedure: Cosimulation

(The HLS procedure flow again, highlighting simulation: after compiling with i++ -march=<FPGA family>, verify the generated IP in simulation)

Example Component/Testbench Source

#include "HLS/hls.h"
#include "assert.h"
#include "HLS/stdio.h"
#include "stdlib.h"

component int accelerate(int a, int b) {
  return a+b;
}

int main() {
  srand(0);
  for (int i=0; i<10; ++i) {
    int x=rand() % 10;
    int y=rand() % 10;
    int z=accelerate(x, y);
    printf("%d + %d = %d\n", x, y, z);
    assert(z == x + y);
  }
  return 0;
}

Compile: i++ -march=<FPGA family> --component accelerate mysource.cpp

accelerate() becomes an FPGA component
– Use the --component i++ argument or the component attribute in source

main() becomes the testbench for component accelerate()

Translation from C Function API to HDL Module

All component functions are synthesized to HDL
▪ Each synthesized component is an independent HDL module

Component functions can be declared:
▪ Using the component keyword in source
▪ Specifying "--component <name>" on the command line

Cosimulation

Combines the x86 testbench with RTL simulation

HDL code for the component runs in an RTL simulator
▪ An RTL testbench is automatically created from the software main(); everything else called from main runs on x86 as the testbench

Communication uses the SystemVerilog Direct Programming Interface (DPI)
▪ Allows C/C++ to interface with SystemVerilog
▪ An inter-process communication (IPC) library passes testbench input data to the RTL simulator and returns the data back to the x86 testbench

Cosimulation: Verifying HLS IP

The Intel® HLS compiler automatically compiles and links the C++ testbench with an instance of the component running in an RTL simulator
▪ To verify the RTL behavior of the IP, just run the executable generated by the HLS compiler targeting the FPGA architecture
– Any call to the component function becomes a call into the simulator through DPI

src.c + lib.h → i++ -march=<FPGA family> src.c → a.exe|out, which exchanges each IP function call and its data with the simulation in a.prj/verification/

Default Simulation Behavior

Function calls to the simulator are sequential by default:

#include "HLS/hls.h"
#include "stdio.h"

component int acc(int a, int b) {
  return a+b;
}

int main() {
  int x1, x2, x3;
  x1 = acc(1, 2);
  x2 = acc(3, 4);
  x3 = acc(5, 6);
  …
}

Streaming Simulation Behavior

Use enqueue function calls to stream data into the component:

#include "HLS/hls.h"
#include "stdio.h"

component int acc(int a, int b) {
  return a+b;
}

int main() {
  int x1, x2, x3;
  altera_hls_enqueue(&x1, &acc, 1, 2);
  altera_hls_enqueue(&x2, &acc, 3, 4);
  altera_hls_enqueue(&x3, &acc, 5, 6);
  altera_hls_component_run_all("acc");
  …
}

Viewing Component Waveforms

▪ Compile the design with the i++ -ghdl flag
– Enables full visibility and logging of all HDL signals in simulation
▪ After the cosimulation executes, the waveform is available at a.prj/verification/vsim.wlf
▪ Examine it with the ModelSim GUI:
– vsim a.prj/verification/vsim.wlf

Viewing Waveforms in ModelSim

Locate the component, then add its signals to the waveform

Cosimulation Design Process

Compile and verify on x86 → Compile for FPGA → Simulate using ModelSim

Compile and verify on x86:
– Iterate on the algorithm
– Functional verification of the design
– Debug using gdb/valgrind

Compile for FPGA:
– Examine the FPGA reports
– Iterate on the architecture
– Use the reports as feedback on what the bottlenecks are

Simulate using ModelSim:
– Test functionality
– Test latency and performance (through verification stats)

Main HTML Report

The Intel® HLS Compiler automatically generates an HTML report that analyzes various aspects of your function, including area, loop structure, memory usage, and system data flow
▪ Located at a.prj/reports/report.html

Many types of reports

HTML Report: Summary

Overall compile statistics
▪ FPGA resource utilization
▪ Compile warnings
▪ Quartus® fitter results
– Available after Quartus compilation
▪ etc.

HTML Report: Loops

Serial loop execution hinders the performance of the function's dataflow circuit
▪ Use the Loop Analysis report to see if and how each loop is optimized
– Helps identify component pipeline bottlenecks

For each loop:
– Unrolled? If yes: fully unrolled, partially unrolled, automatically unrolled, or via #pragma unroll?
– Pipelined? If yes: what is the Initiation Interval (launch frequency of a new iteration)? Are there dependencies preventing optimal II?
– Not pipelined? What is the reason for serial execution?

Loop Unrolling

Loop unrolling: replicate hardware to execute multiple loop iterations at once (see the sketch after the diagram)
▪ Simple loops are unrolled by the compiler automatically
▪ Use #pragma unroll to control loop unrolling
▪ The loop must not have a dependency from iteration to iteration

(Diagram: a serial loop running Op 1 then Op 2 each iteration is unrolled so iterations 1…5 execute Op 1 and Op 2 side by side)
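A minimal sketch of the pragma in use (the component and its fixed trip count are hypothetical): with independent iterations, full unrolling replicates the operation hardware four times:

// Hypothetical component: iterations are independent, so the
// multiply hardware is replicated 4x by the full unroll.
component void scale4(int data[4]) {
  #pragma unroll
  for (int i = 0; i < 4; i++) {
    data[i] = data[i] * 3;  // all four multiplies happen in parallel
  }
}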

Loop Pipelining

Loop pipelining: launch a loop iteration as soon as its dependencies are resolved
▪ Initiation interval (II): the launch frequency (in cycles) of a new loop iteration
– II=1 is optimally pipelined
– Requires no dependencies, or dependencies that can be resolved in 1 cycle

(Diagram: serial execution runs iterations i0, i1, i2 through Op 1/Op 2/Op 3 back-to-back; pipelined execution overlaps them)
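A hedged illustration of where II grows (hypothetical component; exact cycle counts depend on the target device): an accumulation carried between iterations cannot always be resolved in one cycle:

// Hypothetical component: the loop-carried dependency on 'total'
// must resolve before the next iteration launches; a multi-cycle
// floating-point add can therefore push II above 1.
component float sum_all(float *data, int n) {
  float total = 0.0f;
  for (int i = 0; i < n; i++) {
    total += data[i];
  }
  return total;
}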

HTML Report: Loop Analysis

The Loop Analysis report shows how loops are implemented
– Ability to correlate with the source code

Typical entries:
– A compiler-added loop (not in the code): an implicit infinite loop that allows the component to run continuously in pipelined fashion
– Pipelined loop, II=1
– Pipelined loop, II=2 due to a memory dependency
– Fully unrolled loop, due to a user #pragma unroll

HTML Report: Area Analysis

View detailed estimated resource consumption by system or source line
▪ Analyze data control overhead
▪ View the memory implementation
▪ Shows resource usage: ALUTs, FFs, RAMs, DSPs
▪ Identifies inefficient uses

HTML Report: Component Viewer

Displays an abstracted netlist of the HW implementation
▪ View the data flow pipeline
– See loads and stores
– Interfaces, including stream reads and writes
– Memory structure
– Loop structure
– Possible performance bottlenecks
– Unpipelined loops are colored light red
– Stallable points are red

Mouse over a node to see a tooltip and details. Correlates with the source code.

HTML Report: Memory Viewer

Displays the local memory implementation and accesses
▪ Visualize the memory architecture
– Banks, widths, replication, etc.
▪ Visualize load-store units (LSUs)
– Stall-free?
– Arbitration
– Red indicates stallable

Mouse over a node to see a tooltip and details. Correlates with the source code.

HTML Report: Verification Statistics

Reports execution statistics from testbench execution; available after the component is simulated (i.e. the testbench executable ran)
▪ Number and type of component invocations
▪ Latency of the component
▪ Dynamic initiation interval of the component
▪ Data rates of streams

Measurements are based on the latest execution of the testbench

HLS Procedure: Integration

(The HLS procedure flow again, highlighting the final steps: run Intel® Quartus® Prime compilation on the generated IP to obtain QoR metrics, then integrate the IP with the rest of your FPGA system)

Quartus® Generated QoR Metrics for IP

Use the Intel® Quartus® Prime software to generate quality-of-results reports
▪ i++ creates the Quartus project in a.prj/quartus
▪ To generate QoR data (final resource utilization, fmax):
– Run quartus_sh --flow compile quartus_compile
– Or use the i++ --quartus-compile option
▪ The report is part of the HTML report
– a.prj/reports/report.html
– Summary page

Intel® Quartus® Software Integration

The a.prj/components directory contains all the files to integrate

▪ One subdirectory for each component
– Portable; can be moved to a different location if desired

Two use scenarios:
1. Instantiate in HDL
2. Add the IP to a Platform Designer system

HDL Instantiation

Add components to the Intel® Quartus project
▪ .qsys for Standard Edition
▪ .ip for Pro Edition

Instantiate the component module in your design
▪ Use the template a.prj/components/<component>/<component>_inst.v

Platform Designer System Integration Tool

Catalog of available IP and systems:
▪ Interface protocols
▪ Memory IP
▪ DSP
▪ Embedded
▪ Bridges
▪ PLL
▪ Custom components
▪ Custom systems

Connect custom IP, simplify integration, accelerate development, and automate integration tasks (HDL is generated for you)

Platform Designer Integration

A Platform Designer component is generated for each HLS component:
▪ For Platform Designer Standard: a.prj/components/<component>/<component>.qsys
▪ For Platform Designer: a.prj/components/<component>/<component>.ip

In Platform Designer, instantiate the component from the IP Catalog in the HLS project directory
▪ Add the IP directory to the IP Catalog search locations
– May use a.prj/components/**/*
▪ Can be stitched together with other user IP or Intel® Quartus® IP with compatible interfaces

See the tutorials under tutorials/usability

Platform Designer HLS Component Example

Example
▪ Cascaded low-pass filter and high-pass filter, both instantiated as HLS components

HLS-Backed Components

▪ A generic component can be used in place of the actual IP core
▪ Choose Implementation Type: HLS
– Specify the HLS source files
– Compile the component
– Run cosim
– Display the HTML report

Lab 2

OpenCL

Intel® FPGA SDK for OpenCL™ Flow

A system-level view: the OpenCL host program plus kernels (foo.cl), compiled against a Board Support Package, with the System Integrator stitching in the HDL IP core.

Kernel compiler:
▪ Optimized pipelines from C/C++

Board support package (created by a hardware developer):
▪ Timing closure, pinouts, periphery planning: we've got it covered

System integrator (Quartus runs behind the scenes):
▪ Optimized I/O interconnects

OpenCL

A hardware-agnostic compute language
▪ Invented by Apple; the specification was donated to the Khronos Group in 2008, which manages it today

OpenCL C kernels plus a C/C++ host API

What does OpenCL™ give us?
▪ An industry-standard programming model
▪ Functional portability across platforms
▪ A well-thought-out specification

Heterogeneous Platform Model

OpenCL platform model: a Host plus one or more Devices, each device with its own Global Memory, and Host Memory on the host side

Example x86 platform: a CPU host connected to accelerator devices over PCIe

OpenCL Use Model: Abstracting the FPGA Away

Host code:

main() {
  read_data( … );
  manipulate( … );
  clEnqueueWriteBuffer( … );
  clEnqueueNDRange(…, sum, …);
  clEnqueueReadBuffer( … );
  display_result( … );
}

OpenCL accelerator code:

__kernel void sum(__global float *a,
                  __global float *b,
                  __global float *y) {
  int gid = get_global_id(0);
  y[gid] = a[gid] + b[gid];
}

The host code goes through a standard gcc compiler to an EXE; the accelerator code goes through the offline compiler (with Quartus Prime behind it) to an AOCX bitstream

OpenCL Host Program

Pure software written in standard C/C++
Communicates with the accelerator devices via an API which abstracts the communication between the host processor and the kernels

main() {
  read_data_from_file( … );
  manipulate_data( … );

  clEnqueueWriteBuffer( … );     // copy data from host to FPGA
  clEnqueueNDRange(…, sum, …);   // tell the FPGA to run a particular kernel
  clEnqueueReadBuffer( … );      // copy data from FPGA to host

  display_result( … );
}
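For orientation, a hedged sketch of the buffer and argument boilerplate those elided calls imply (standard OpenCL host API; the variable names and size N are hypothetical):

// Hypothetical host fragment: context/queue/kernel are assumed to
// exist already (see the clCreateProgramWithBinary slide).
cl_mem d_a = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(float), NULL, &err);
cl_mem d_y = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, &err);

clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, N * sizeof(float), h_a, 0, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_y);

size_t global = N;  // one work-item per data element
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

clEnqueueReadBuffer(queue, d_y, CL_TRUE, 0, N * sizeof(float), h_y, 0, NULL, NULL);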

OpenCL Kernels

__kernel void sum(__global float *a,
                  __global float *b,
                  __global float *answer) {
  int xid = get_global_id(0);
  answer[xid] = a[xid] + b[xid];
}

Kernel: a data-parallel function
▪ Defines many parallel threads
▪ Each thread has an identifier specified by get_global_id
▪ Contains keyword extensions to specify parallelism and memory hierarchy

Executed by an OpenCL device
▪ CPU, GPU, FPGA

Code is portable, NOT performance portable
▪ Between FPGAs it is!

Example: a = {0 1 2 3 4 5 6 7}, b = {7 6 5 4 3 2 1 0} → result = {7 7 7 7 7 7 7 7}

Software Engineer's View of an OpenCL System

A dataflow processor with:
▪ Local memory (shallow, fast, random access)
▪ Compute engines
▪ Global memory (deep, fast, bursting)
▪ A host connection

The device contains compute engines that run the kernel
The host talks to global memory through OpenCL routines
Global memory is large, fast, and likes to burst
Local memory is small, fast, and supports random access
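A hedged OpenCL C sketch of that division of labor (the kernel and the fixed work-group size of 64 are hypothetical):

// Hypothetical kernel: burst-friendly global reads are staged into
// local memory, which then serves random accesses.
__kernel void reverse_tiles(__global const float *in, __global float *out) {
  __local float tile[64];                // small, fast, random access
  int lid = get_local_id(0);
  int gid = get_global_id(0);

  tile[lid] = in[gid];                   // contiguous global read (bursts well)
  barrier(CLK_LOCAL_MEM_FENCE);          // whole work-group has loaded

  out[gid] = tile[63 - lid];             // random-access local read
}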

FPGA OpenCL Architecture

(Diagram: an Intel® Xeon® host connects over PCIe; external memory controllers & PHYs connect the FPGA to DDR*; a global memory interconnect feeds several kernel pipelines, each surrounded by block RAMs on a local memory interconnect)

Modest external memory bandwidth
Extremely high internal memory bandwidth
Highly customizable compute cores

Start with a Reference Platform (1/2)

Network Enabled: requires low latency | High Performance Computing (HPC): requires compute power / memory bandwidth

(Board diagrams: Stratix V FPGAs with the OpenCL API / HAL / UMD / KMD host stack, DMA over PCIe, DDR3 banks, 10G UDP, CPLD, flash, and bridge)

Global memory: DDR and QDRII+ (network) vs. a large amount of DDR (HPC)
IO channels: 2x10GbE (MAC/UOE) (network) vs. none, to minimize IP overhead (HPC)

Start with a Reference Platform (2/2)

Host and accelerator in the same package: SoC

(Diagram: an SoC FPGA where the processor runs the host code and OpenCL kernels run in the fabric, with scratch DDR, FPGA global memory DDR, a DVI camera in, and a DVO monitor out)

Boards span price points from $99 and $175 through $250 to >$1000

Development Flow Using the SDK

Modify kernel.cl, then iterate:
▪ x86 Emulator (seconds): functional bugs?
▪ Optimization Report (seconds): memory dependencies?
▪ Profiler (hours): hardware performance not met?
→ DONE!

Compiling a Kernel

Run the Altera Offline Compiler at a command prompt:
▪ aoc <kernel file> --board <board name>
▪ Run aoc --list-boards to see all available boards

AOC performs system integration to generate the kernel hardware system and runs the Quartus Prime software to compile the design

/mydesigns/matrixMult$ aoc matrixMul.cl
aoc: Selected target board bittware_s5pciehq

; Estimated Resource Usage Summary     ;
; Resource                  ; Usage    ;
; Logic utilization         ; 52%      ;
; Dedicated logic registers ; 23%      ;
; Memory blocks             ; 31%      ;
; DSP blocks                ; 54%      ;

Executing the Kernel: clCreateProgramWithBinary

Reading the .aocx file into memory:

fp = fopen("file.aocx", "rb");
fseek(fp, 0, SEEK_END);
lengths[0] = ftell(fp);
binaries[0] = (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]);
rewind(fp);
fread(binaries[0], lengths[0], 1, fp);
fclose(fp);

Host API sequence (the .aocx, produced offline by the compiler from the .cl file, is the OpenCL "program", i.e. the bitstream):
clGetPlatforms → cl_platform
clGetDevices → cl_device
clCreateContext → cl_context
clCreateCommandQueue → cl_command_queue
clCreateProgramWithBinary → cl_program
clBuildProgram
clCreateKernel → cl_kernel
clEnqueueNDRangeKernel
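As a hedged sketch of the create/build/kernel calls themselves (standard OpenCL API, reusing the names from the snippet above; error handling omitted, and "sum" is a hypothetical kernel name):

cl_int status, bin_status;
cl_program program = clCreateProgramWithBinary(
    context, 1, &device,                        // one target device
    lengths, (const unsigned char **)binaries,  // the .aocx image and its size
    &bin_status, &status);
clBuildProgram(program, 1, &device, "", NULL, NULL);  // finalizes the pre-built binary
cl_kernel kernel = clCreateKernel(program, "sum", &status);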

Development Flow Using the SDK

(The same flow: modify kernel.cl, then check functional bugs in the x86 emulator, memory dependencies in the optimization report, and hardware performance with the profiler)

Emulator: The Flow

1. Generate the emulation aocx:

// conv.cl
kernel void convolution(global int *filter_coef,
                        global int *input_image,
                        global int *output_image) {
  int grid = get_group_id(0);
  …
}

aoc -march=emulator conv.cl → conv.aocx

2. Run the host program with the emulator aocx
▪ The host compile does not change
▪ set CL_CONTEXT_EMULATOR_DEVICE_ALTERA=<number of devices>

c:\opencl>aoc -march=emulator conv.cl
c:\opencl>dir
host.exe conv.cl conv.aocx
c:\opencl>host.exe
running… done!

Printf

Can use printf within a kernel on the FPGA
▪ Adds some memory traffic overhead

In the emulator, printf runs on IA
▪ Useful for fast debug iterations
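A hedged example of the kind of kernel-side tracing this enables (the kernel is hypothetical):

__kernel void debug_scale(__global const int *in, __global int *out) {
  int gid = get_global_id(0);
  printf("work-item %d saw %d\n", gid, in[gid]);  // emulator: runs on IA; FPGA: adds memory traffic
  out[gid] = 2 * in[gid];
}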

Development Flow Using the SDK

(The same flow, now focusing on the optimization report for memory dependencies)

Optimization Report

The Intel FPGA SDK for OpenCL provides a static report to identify performance bottlenecks when writing single-threaded kernels

Use -c to stop after generating the reports
▪ aoc -c <kernel file>
▪ The report is in: <kernel file folder>/reports/report.html

Optimization Report Example

Development Flow Using the SDK

(The same flow, now focusing on the profiler when hardware performance targets are not met)

Profiler: The Flow

1. Generate the program bitstream with profiling enabled:

// conv.cl
kernel void convolution(global int *filter_coef,
                        global int *input_image,
                        global int *output_image) {
  int grid = get_group_id(0);
}

aoc --profile conv.cl → conv.aocx

2. Run the host program with the instrumented aocx:

c:\opencl>dir
host.exe conv.aocx
c:\opencl>host.exe
running… done!
c:\opencl>dir
host.exe conv.aocx profile.mon

3. Run the profiler GUI: aocl report

Dynamic Profiler

The Intel FPGA SDK for OpenCL enables users to get runtime information about their kernel performance:
bottlenecks, bandwidth, saturation, pipeline occupancy

(Profiler GUI: performance stats and execution times)

Execution of Threads on FPGA – Naïve Approach

Threads can be executed on replicated pipelines in the FPGA
– Each thread (t0, t1, t2, …) runs on its own copy of the pipeline
– Throughput = 1 thread per cycle
– Area inefficient

kernel void add( global int* Mem ) {
  ...
  Mem[100] += 42*Mem[101];
}

Execution of Threads on FPGA

A better method takes advantage of pipeline parallelism
– Attempt to create a deeply pipelined implementation of the kernel
– On each clock cycle, we attempt to send in a new thread
– Threads t0, t1, t2, … advance one stage per cycle, keeping the pipeline full
– Throughput = 1 thread per cycle

kernel void add( global int* Mem ) {
  ...
  Mem[100] += 42*Mem[101];
}

OpenCL on Intel FPGAs

The main assumption made in the preceding OpenCL programming model:
– Data-level parallelism exists in the kernel program

Not all applications are well suited to this assumption
– Some applications do not map well to data-parallel paradigms, yet those are the only workloads GPUs support

Data-Parallel Execution

On the FPGA, we use the idea of pipeline parallelism to achieve acceleration

kernel void sum(global const float *a,
                global const float *b,
                global float *c) {
  int xid = get_global_id(0);
  c[xid] = a[xid] + b[xid];
}

(Pipeline: Load, Load → + → Store, with threads t0, t1, t2 in flight)

Threads can execute in an embarrassingly parallel manner

Data-Parallel Execution: Drawbacks

Difficult to express programs which have partial dependencies during execution

kernel void sum(global const float *a,
                global const float *b,
                global float *c) {
  int xid = get_global_id(0);
  c[xid] = c[xid-1] + b[xid];
}

This would require complicated hardware and new language semantics to describe the desired behavior

Solution: Tasks and Loop Pipelining

Allow users to express programs as a single thread

for (int i=1; i < n; i++) { c[i] = c[i-1] + b[i]; }

Pipeline parallelism is still leveraged to efficiently execute loops in Intel's FPGA OpenCL
▪ Parallel execution is inferred by the compiler
▪ Loop pipelining: iterations i=0, i=1, i=2 overlap in the Load → + → Store pipeline

Loop-Carried Dependencies

Loop-carried dependencies are dependencies where one iteration of the loop depends upon the results of another iteration of the loop

kernel void state_machine(ulong n) {
  t_state_vector state = initial_state();
  for (ulong i = 0; i < n; i++) {
    state = next_state(state);
    …
  }
}

The variable state in iteration 1 depends on the value from iteration 0. Similarly, iteration 2 depends on the value from iteration 1, etc.

Loop-Carried Dependencies

To achieve acceleration, we can pipeline each iteration of a loop containing loop-carried dependencies
– Analyze any dependencies between iterations
– Schedule these operations
– Launch the next iteration as soon as possible

kernel void state_machine(ulong n) {
  t_state_vector state = initial_state();
  for (ulong i = 0; i < n; i++) {
    state = next_state(state);
    …
  }
}

Loop Pipelining Example

No loop pipelining: iterations i=0, i=1, i=2 run back-to-back, with no overlap of iterations
With loop pipelining: iterations i=0 … i=5 overlap in time; it looks almost like multi-threaded execution!

Finishes faster because the iterations are overlapped

Parallel Threads vs. Loop Pipelining

So what's the difference?

Parallel threads launch 1 thread per clock cycle in pipelined fashion, but loop dependencies may not be resolved in 1 clock cycle

Loop pipelining enables pipeline parallelism AND the communication of state information between iterations
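A hedged single work-item sketch of that state communication (the kernel is hypothetical; the shift-register idiom is a common Intel FPGA OpenCL pattern): a running 4-sample windowed sum is carried from one iteration to the next while the loop still pipelines:

// Hypothetical task kernel: per-iteration state (window + running sum)
// is carried from one loop iteration to the next.
kernel void window_sum(global const int *restrict in,
                       global int *restrict out, int n) {
  int window[4] = {0, 0, 0, 0};  // maps to registers (a shift register)
  int sum = 0;                   // state carried between iterations
  for (int i = 0; i < n; i++) {
    sum -= window[3];            // retire the oldest sample
    #pragma unroll
    for (int j = 3; j > 0; j--)
      window[j] = window[j - 1]; // shift
    window[0] = in[i];
    sum += window[0];
    out[i] = sum;
  }
}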

Image Filter

(Diagram: pixels stream from memory through a 3x3 filter F[3][3] and back to memory)
Harnessing Dataflow to Reduce Memory Bandwidth

Data Movement in GPUs

Data is moved from the host over PCI Express; instructions and data are constantly sent back and forth between host cache/memory and GPU memory
▪ Requires buffering larger data sets before passing them to the GPU to be processed
▪ Significant latency penalty
▪ Requires high memory and host bandwidth
▪ Requires sequential execution of kernels

(Pipeline through global memory: JPG → Uncompress → RGB → Image Filter → RGB* → Compress → JPG*)
Altera Channels Extension

An FPGA has programmable routing. Can't we just send data across wires between kernels?

Advantages:
– Reduce memory bandwidth
– Lower latency through fine-grained synchronization between kernels
– Reduce complexity (wires are trivial compared to memory access): lower cost, lower area, higher performance
– Enable modular dataflow design through small kernels exchanging data
– Different workgroup sizes and degrees of parallelism in connected modules

(With channels, Uncompress → Image Filter → Compress pass data directly instead of bouncing through global memory)

Data Movement in FPGAs

The FPGA allows result reuse between instructions
Ingress/egress to custom functions is 100% flexible
Multiple memory banks of various types hang directly off the FPGA
– Algorithms can be architected to minimize buffering to external memory or host memory
– Multiple optional memory banks can be used to allow simultaneous access

(Diagram: flexible I/O such as 100G, PCIe, SRIO, or USB feeds Kernel 1 → Kernel 2 → Kernel 3 inside the FPGA, with optional memories on the side)

Example: Multi-Stage Pipeline

An algorithm may be divided into multiple kernels:
– Modular design patterns
– Partition the algorithm into kernels with different sizes and dimensions
– The algorithm may naturally split into both single-threaded and NDRange kernels

Generating random data for a Monte Carlo simulation:

// Single-threaded
kernel void rng(int seed) {
  int r = seed;
  while (true) {
    r = rand(r);
    write_channel_altera(RAND, r);
  }
}

// NDRange
kernel void sim(...) {
  int gid = get_global_id(0);
  int rnd = read_channel_altera(RAND);
  out[gid] = do_sim(data, rnd);
}

Traditional Data Movement Without Channels

(Diagram: host system memory → PCIe → DMA → memory controller/DDR → kernel → kernel → back out through DDR and DMA)

Data Movement Using Channels

(Diagram: kernels exchange intermediate data directly through an on-chip FIFO, and Data In / Data Out I/O channels stream through FIFOs into the kernels; DDR and the PCIe/DMA path remain available)

Data Movement Using Host Channels

(Diagram: data streams between host memory and the kernel directly over PCIe/DMA, without staging in DDR)

An Even Closer Look: FPGA Custom Architectures

Kernel replication with num_compute_units using OpenCL
▪ Step #1: design an efficient kernel
▪ Step #2: how can we scale it up?

kernel void PE() {
  …
}

(A processing element (PE): a task-based kernel)

Kernel Replication With Intel® FPGA SDK for OpenCL

An attribute specifies a 1-dim or 2-dim array of kernels, and an added API identifies each kernel's place in the array:

__attribute__((num_compute_units(4,4)))
kernel void PE() {
  row = get_compute_id(0);
  col = get_compute_id(1);
  …
}

(A 4x4 grid of processing elements, i.e. task-based kernels)

Compile-time constants allow the compiler to specialize each PE

Kernel Replication With Intel® FPGA SDK for OpenCL

Topology can be expressed with software constructs
▪ Channel connections are specified through compute IDs

channel float4 ch_PE_row[4][4];
channel float4 ch_PE_col[4][4];
channel float4 ch_PE_row_side[4];
channel float4 ch_PE_col_side[4];

__attribute__((num_compute_units(4,4)))
kernel void PE() {
  row = get_compute_id(0);
  col = get_compute_id(1);
  float4 a, b;

  if (row == 0)
    a = read_channel(ch_PE_col_side[col]);
  else
    a = read_channel(ch_PE_col[row-1][col]);
  if (col == 0)
    …
}

Matrix Multiply in OpenCL

Every PE / feeder is a kernel
Communication via OpenCL channels
A data-flow network model

Software control:
– Compute unit granularity
– Spatial locality
– Interconnect topology
– Data movement
– Caching
– Banking

(Diagram: feeder kernels load A and B from DDR4 into the PE array; a drain interconnect collects C)

Performance: ~1 TFLOPs

Traditional CNN

$$I_{\text{new}}(x, y) = \sum_{x'=-1}^{1}\ \sum_{y'=-1}^{1} I_{\text{old}}(x + x',\, y + y') \times F(x', y')$$

(A filter slides over the input feature map, a set of 2D images forming a 3D space, to produce the output feature map)

Repeat for multiple filters to create multiple "layers" of the output feature map
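Read directly, the formula is just a pair of nested sums. A hedged C sketch for one output pixel (the function, the HEIGHT/WIDTH bounds, and the border handling are hypothetical):

#define HEIGHT 512   // hypothetical image bounds
#define WIDTH  512

// One output pixel of the 3x3 convolution above; assumes
// 1 <= x < HEIGHT-1 and 1 <= y < WIDTH-1 so the window stays in bounds.
float conv3x3(const float I_old[HEIGHT][WIDTH],
              const float F[3][3], int x, int y) {
  float acc = 0.0f;
  for (int dx = -1; dx <= 1; dx++)
    for (int dy = -1; dy <= 1; dy++)
      acc += I_old[x + dx][y + dy] * F[dx + 1][dy + 1];
  return acc;
}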

CNN on FPGA

Want to minimize accesses to external memory
Want to keep resulting data between layers on the device and between computations
Want to leverage reuse of the hardware between computations

Parallelism in the depth of the kernel window and across output features. Defer complex spatial math to random access memory.

Re-use hardware to compute multiple layers.

Efficient Parallel Execution of Convolutions

▪ Parallel convolutions
– Different filters of the same convolution layer are processed in parallel in different processing elements (PEs)
▪ Vectored operations
– Across the depth of the feature map
▪ The PE array geometry can be customized to the hyperparameters of a given topology

(Diagram: external DDR feeds double-buffered filters in on-chip RAM; the PE array exploits filter parallelism across the output depth)

Design Exploration with Reduced Precision

Tradeoff between performance and accuracy
▪ Reduced precision allows more processing to be done in parallel
▪ Using a smaller floating-point format does not require retraining the network
▪ FP11 benefit over INT8/9: no need to retrain, better performance, less accuracy loss

FP16: sign, 5-bit exponent, 10-bit mantissa
FP11: sign, 5-bit exponent, 5-bit mantissa
FP10: sign, 5-bit exponent, 4-bit mantissa
FP9: sign, 5-bit exponent, 3-bit mantissa
FP8: sign, 5-bit exponent, 2-bit mantissa

Lab 3

DSP Builder Advanced Blockset

The MathWorks* Design Environment

MATLAB*: algorithm development and analysis
– High-level technical computing language
– Simple C-like language
– Efficient with vectors and matrices
– Built-in mathematical functions
– Interactive environment for algorithm development
– 2D/3D graphing tools for data visualization

Simulink*: model-based design
– Hierarchical block diagram design & simulation tool
– Digital, analog/mixed-signal & event-driven modeling
– Visualize signals
– Integrated with MATLAB*

(A validated design flows from Simulink* to third-party EDA tools and DSP/embedded software tools: hardware on one side, DSP and control software on the other)

DSP Builder for Intel® FPGAs

Enables MathWorks* Simulink* for Intel FPGA design – a device-optimized Simulink* DSP blockset
▪ Key features:
– High-level design exploration
– HW-in-the-loop verification
– IP generation for the Intel® Quartus® Prime software / Platform Designer

FPGA Design Flow – Traditional

Development → Implementation → Verification: system-level design and DSP IP (system engineer), HDL coding (hardware engineer), RTL simulation, system-level simulation, and hardware verification (verification engineer).

Tools: MATLAB*/Simulink* and development kits for system design; Precision*, Synplify*, and the Intel® Quartus® Prime software for implementation; ModelSim* for verification.

FPGA Design Flow – DSP Builder for Intel® FPGAs

The same Development → Implementation → Verification flow, but DSP Builder for Intel® FPGAs spans algorithm-level modeling, synthesis, RTL simulation, and system-level verification from a single Simulink* representation, alongside the Precision*, Synplify*, Intel® Quartus® Prime, and ModelSim* tools.

Core Technologies

▪ IP (ready-made) library
– Multi-rate, multi-channel filters
– Waveform synthesis (NCO/DDS/mixers)
▪ Custom IP creation using the primitive library
– Vectorization
– Zero latency
– Scheduled, aligned RTL generation
▪ Automatic pipelining
▪ Automatic folding and resource sharing
▪ Multichannel designs with automatic vectorization
▪ Avalon® Memory-Mapped and Streaming interfaces
▪ Design exploration across device families
▪ High-performance floating-point designs
▪ System integration
– Platform Designer
– Processor integration
▪ System-in-the-loop accelerated simulation

Advanced Blockset – High-Performance DSP IP

Over 150 device-optimized DSP building blocks for Intel® FPGAs
▪ DSP building blocks
▪ Interfaces
▪ IP library blocks
▪ Primitives library blocks – math and basic blocks
▪ Vector and complex data types

Build Custom FFTs from the FFT Element Library

▪ Quickly build DSP designs using complete FFT IP functions from the FFT library
▪ Build custom radix-2² FFTs using blocks from the FFT Element Library

FFT IP Library: FFT, FFT_float, VFFT, VFFT_float, BitReverseCoreC, VariableBitReverse, Floating-Point Twiddle Gen
FFT Element Library: Pruning and Twiddle, Bit Vector Combine, Butterfly Unit, Choose Bits, Dual Twiddle Memory, Edge Detect, Crossover Switch

Filter and Waveform Synthesis Library

DSP Builder includes a comprehensive waveform IP library
▪ Automatic resource sharing based on sample rate
▪ Support for super-sample-rate architectures

IP implementations:
– FIR: half-band, L-band, symmetric, decimating, fractional-rate, interpolating, single-rate, super sample rate
– CIC: decimating, interpolating, super sample rate
– Mixer: complex, real, super sample rate
– NCO: super sample rate, multi-bank

Library Is Technology Independent

▪ Target device using a Device block ▪ Same model generates optimized RTL for each FPGA and speed grade


Datapath Optimization for Performance

Automatic timing-driven synthesis of the model, based on the specified device and clock frequency:

Optimization | Description
Pipelining | Inserts registers to improve Fmax
Algorithmic retiming | Moves registers to balance pipelining
Bit-growth management | Manages bit growth for fixed-point designs
Multi-rate optimizations | Optimizes hardware based on sample rate

Custom IP Generation

Model primitive features:
▪ Textbook-based design entry (e.g. a FIR tap-delay line drawn with coefficient multipliers b0…b7, z⁻¹ delays, and adders)
▪ Vector support
▪ Parameterizable
▪ Zero-latency blocks
▪ ALU folding
You describe what to do, not when to do it.

ALU Design Folding Improves Area Efficiency

Optimizes hardware usage for low-throughput designs
▪ Arranges one of each resource in a central arithmetic-logic-unit (ALU) fashion, time-division multiplexing (TDM) the operands (e.g. multiplies A, B, and C share one multiplier under FSM control)
▪ Folding factor = clock rate / data rate – e.g. a 300 MHz clock with a 0.5 MSPS data rate gives a folding factor of 600
▪ Performed when the folding factor is > 500

TDM Resource Sharing

At clock rate = sample rate, each function unit F(.) exists in hardware and feeds the adder directly. At clock rate = 2× sample rate, a single F(.) is time-shared, with serialize/deserialize stages and read/write buffers running on TDM_CLK.

TDM Design: Trade-Off Example

Resources for a 49-tap symmetric single-rate FIR filter on Stratix 10:

Clock Rate | Sample Rate | LUT4s | Mults | Memory bits | TDM Factor
72 MHz | 72 MSPS | 898 | 26 | 0 | 1
144 MHz | 72 MSPS | 1082 | 14 | 0 | 2
288 MHz | 72 MSPS | 741 | 8 | 0 | 4
72 MHz | 36 MSPS | 1082 | 14 | 0 | 2

2-Antenna DUC Reference Design

[Block diagram: ChannelFIR → Interpolate2FIR → Interpolate4FIR → ComplexMixer (driven by an NCO) → Deinterleaver, all at a 179.2 MHz clock.]
– ChannelFIR: ChanCount = 4, output sample rate 11.2 MSPS (output period 16); output sequence I1, I2, Q1, Q2, zeros(1, 16-4)
– Interpolate2FIR: ChanCount = 4, output sample rate 22.4 MSPS (output period 8); output sequence I1, I2, Q1, Q2, zeros(1, 8-4)
– Interpolate4FIR: ChanCount = 4, sample rate 89.6 MSPS (period 2); ChanWireCount = ceil(4/2) = 2, ChanCycleCount = ceil(4/2) = 2; output sequence I1, I2, Q1, Q2; demux output i sequence = I1, I2, output q sequence = Q1, Q2
– ComplexMixer: ChanCount = 2 (complex channels), sample rate 89.6 MSPS (period 2); I' = I*cos – Q*sin, Q' = I*sin + Q*cos
– NCO: ChanCount = 2 (complex channel), sample rate 89.6 MSPS (period 2); sine sequence sinA1, sinA2; cosine sequence cosA1, cosA2
– Deinterleaver: antenna 1 sequence I1, –; antenna 2 sequence I2, –

Reference design included with DSP Builder.

Changing the Design without DSP Builder

▪ Tedious and time-consuming
▪ Channel count = 8, 16, 32
▪ Clock rate = 2×, 4×
[Block diagram as before: FIR → FIR 2 → FIR 4 → complex mixer (NCO) → deinterleaver. Specification: SampleRate = 11.2, ChanCount = 8.]

Changing the Design with DSP Builder

▪ Modifications done in minutes
▪ The design still looks the same
[Same block diagram, with a splitter added. Specification: SampleRate = 11.2, ChanCount = 8.]

Five Design Iterations in Under 1 Hour

 | Arria® 10, 6 ch | Arria 10, 6 ch | Arria 10, 12 ch | Stratix® 10, 6 ch | Stratix 10, 12 ch
Requested clock (MHz) | 250 | 450 | 450 | 450 | 450
Actual Fmax (slow model, 85 °C) | 351 | 458 | 458 | 524 | 484.5
Multiplier count (18×18) | 10 | 6 | 10 | 6 | 10
Logic resources (registers) | 686 | 465 | 818 | 1267 | 1863
Block memory resources (kbits) | 0 | 0 | 0 | 0 | 25.8

Generates Reusable IP for Platform Designer

▪ Platform Designer is the system integration environment for Intel® FPGAs
▪ DSP Builder designs are fully compatible with the Platform Designer IP Catalog
▪ Integrate with other FPGA IP: processors, state machines, streaming interfaces
▪ Design reuse is fully supported across projects

Typical Design Flow

1. Identify the system architecture, design the filters, and choose the desired Fmax and device
2. Set the top-level system parameters in the MATLAB® software using the ‘params’ file – number of channels, performance, etc.
3. Build the system using the Advanced Blockset tool
4. Simulate the design using the Simulink® and ModelSim® tools
5. Target the right FPGA family and compile
6. As the system design specs change, edit the ‘params’ file and repeat

Design Flow – Create Model

Create a new blank model: select New Model Wizard from the DSP Builder menu.


Top-Level Testbench

The top level of a DSPB-AB design is a testbench; it must include Control and Signals blocks.

Design Flow – Synthesizable Model

Enter the design in the subsystem; the Device block marks the top level of the FPGA.

Design Flow – ModelIP Blocks

Filters library:
– Single-rate, multi-rate, and fractional-rate FIR filters
– Decimating and interpolating cascaded integrator-comb (CIC) filters
Waveform synthesis library:
– Real and complex mixers
– Numerically controlled oscillator (NCO)
Note: supports super-sample-rate (data rate > system clock frequency) interpolate-by-2 filters.
Note: the NCO block supports frequency hopping (each channel can hop to a different frequency from a pool of frequencies).

Design Flow – ModelPrim Blocks

ChannelIn and ChannelOut blocks delineate the boundary of a synthesizable primitive subsystem.

Add a SynthesisInfo block to control pipelining and latency and to view resource usage of the subsystem.

Design Flow – Parameterize the Design

A C-structure-like template; it runs when the model is opened or a simulation is started.

Design Flow – Processor Interface

Drop memory and registers into the design.

ModelIP blocks have a built-in memory-mapped interface to control registers and coefficient registers.

Design Flow – Running a Simulink Simulation

Creates files in the location specified by the Control block:
▪ VHDL code
▪ Timing constraints file (.sdc)
▪ DSPB-AB subsystem Quartus® IP file


Design Flow – Documentation Generation

Get accurate resource utilization for all modules right after simulation, without place & route: DSP Builder > Resource Usage, DSP Builder > View Address Map.


Design Verification

The Run ModelSim block loads the design into the ModelSim simulator.

[Screenshot: RTL simulation.]

Design Flow – System Integration

Add the _hw.tcl directory to the Qsys IP search path: Qsys > Tools > Options > IP Search Path.


Add the subsystem from the component pick list.

Using FPGAs Just Got Easier

Gap: creating full-stack accelerated applications on an FPGA is difficult and time-consuming. The acceleration stack raises the abstraction level and ease of use:
– Application layer: the SW application plus the FPGA accelerator workload (loadable), under orchestration / rack management
– Software frameworks and libraries
– Open Programmable Acceleration Engine (OPAE): provides a standard C API to accelerator functions; includes the OS driver
– FPGA Interface Manager: a prebuilt, standardized low-level FPGA interface (standard I/O interfaces) provided for a specific board
– Intel® FPGA Programmable Accelerator Card (PAC)
– Pre-built accelerator solutions from the ecosystem

Acceleration Stack for Intel® Xeon® CPU with FPGAs – a Comprehensive Architecture for Data Center Deployments

Faster time to revenue
▪ Fully validated Intel® board
▪ Standardized frameworks and high-level compilers
▪ Partner-developed workload accelerators
Simplified management
▪ Supported in VMware vSphere* 6.7 Update 1 (demonstrated at VMworld Las Vegas, August 28–30, 2018)
▪ Rack management and orchestration framework integration
Broad ecosystem support
▪ Upstreaming FPGA drivers to the Linux* kernel
▪ Qualified by industry-leading server OEMs
▪ Partnering with IP partners, OSVs, ISVs, SIs, and VARs

Stack layers (top to bottom): user applications and rack-level solutions; industry-standard software frameworks; acceleration libraries; Intel developer tools (Intel Parallel Studio XE, Intel FPGA SDK for OpenCL™, Intel Quartus® Prime); acceleration environment (Intel Acceleration Engine with OPAE technology, FPGA Interface Manager (FIM)); OS & virtualization environment; Intel® hardware.

Acceleration Stack Provides FPGA Orchestration in the Cloud/Data Center

Workload accelerators come from an IP store (Intel-developed, 3rd-party-developed, and end-user-developed IP). Public and private cloud/datacenter users launch workloads; software-defined infrastructure and virtualized orchestration software (FPGA-enabled) place each workload on a resource pool of storage, network, and compute, handling static/dynamic FPGA programming. A secure pairing of a Xeon VM and an FPGA executes the workload.

Server Virtualization for the Acceleration Stack with VMware

Out-of-the-box support from VMware for the Intel Arria 10 PAC and the Acceleration Stack in the upcoming vSphere 6.7 U1. Server virtualization enables customers to deploy FPGA workload acceleration with a lower total cost of ownership: the application runs on an Intel Xeon server with an Arria 10 PAC carrying the accelerator IP, on top of the Acceleration Stack.

Migrating an FPGA-Accelerated Workload with vMotion*

An image-inference workload moves between CPU + FPGA servers:
1. Run the application on bare metal
2. Implement it on the VMware ESXi* hypervisor
3. PVRDMA# connects the application to a remote Intel® FPGA PAC / FPGA device
4. Use vMotion* to move the application from one server to another

Server composition with Lenovo xClarity Pod Manager* and VMware vSphere*; continuous application acceleration during vMotion – an industry first.
# Unoptimized proof-of-concept code; not part of a shipping product. See the supplementary slide for system configuration details.

Components of the Acceleration Stack: Overview

▪ Application – developed by the user (domain expert)
▪ Libraries – user, Intel, and 3rd-party (tuning expert)
▪ Open Programmable Acceleration Engine (OPAE) and drivers – provided by Intel
▪ FPGA Interface Manager (FIM) – signaling and management, provided by Intel
▪ Accelerator Functional Unit (AFU) – user, Intel, or 3rd-party IP that plugs into the AFU slot (tuning expert)
▪ Intel FPGA Programmable Acceleration Card – qualified and validated for volume deployment, provided by OEMs

The Intel® Xeon® CPU connects to the FPGA over PCIe*.

PAC with Intel® Arria® 10 FPGA

• Low-profile (half-length, half-height) PCIe* slot card; 168 mm × 56 mm; maximum component height 14.47 mm; PCIe ×16 mechanical
• 128 MB flash for storage of the FPGA configuration
• Board Management Controller (BMC): server-class monitoring system, accessed via USB or PCIe
• USB 2.0 port for board firmware updates and FIM image recovery
• QSFP+ slot accepts pluggable optical modules
• 2 banks of DDR4-2133 SDRAM, 4 GB each (64-bit data, 8-bit ECC), 8 GB total
• Powered from the PCIe +12 V rail; 70 W total board power, 45 W FPGA power
• PCIe ×8 Gen3 connectivity to the Intel® Xeon® host

PAC with Intel® Stratix® 10 FPGA

• ¾-length, full-height, dual-slot PCIe* slot card
• 128 MB flash for storage of the FPGA configuration and BMC firmware
• Board Management Controller (BMC): server-class monitoring system, accessed via USB or PCIe
• USB 2.0 port for board firmware updates and FIM image recovery
• 2× QSFP+ slots accept pluggable optical modules, up to 100GbE each
• 4 banks of DDR4-2400 SDRAM, 8 GB each (64-bit data, 8-bit ECC), 32 GB total
• Powered from the PCIe +12 V rail; 225 W total board power
• PCIe Gen3 ×16 connectivity to the Intel® Xeon® host

Nearly Transparent Software Application Use Model

Object model: properties, token, and handle objects.

Flow: discover/search for a resource → acquire ownership of the resource → map AFU registers to user space → allocate/define shared memory → start/stop computation on the AFU and wait for the result. To tear down: deallocate shared memory, unmap MMIO, relinquish ownership, and (optionally) reconfigure the AFU.
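A hedged sketch (not from the slides) mapping these steps onto OPAE C API calls that the following slides introduce one by one; error handling is omitted and the buffer length is an assumed example value:

    /* requires <opae/fpga.h> and a pre-built fpga_properties filter `prop` */
    fpga_properties prop;  fpga_token token;  fpga_handle handle;
    uint64_t *mmio_ptr;    void *buf;         uint64_t wsid;
    uint32_t n;            uint64_t len = 4096;   /* assumed buffer size */

    fpgaGetProperties(NULL, &prop);
    fpgaEnumerate(&prop, 1, &token, 1, &n);          /* discover / search   */
    fpgaOpen(token, &handle, 0);                     /* acquire ownership   */
    fpgaMapMMIO(handle, 0, &mmio_ptr);               /* map AFU registers   */
    fpgaPrepareBuffer(handle, len, &buf, &wsid, 0);  /* shared memory       */
    /* ... start computation via MMIO and wait for the result ... */
    fpgaReleaseBuffer(handle, wsid);                 /* deallocate memory   */
    fpgaUnmapMMIO(handle, 0);                        /* unmap MMIO          */
    fpgaClose(handle);                               /* relinquish          */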

Enumeration and Discovery

fpgaEnumerate() matches resources (an FPGA_DEVICE and its FPGA_ACCELERATOR with a given AFU GUID, e.g. 0xabcdef) against an fpga_properties filter and returns fpga_token handles:

    fpga_properties prop;
    fpga_token token;
    fpga_guid myguid; /* 0xabcdef */
    uint32_t n;

    fpgaGetProperties(NULL, &prop);
    fpgaPropertiesSetObjectType(prop, FPGA_ACCELERATOR);
    fpgaPropertiesSetGUID(prop, myguid);
    fpgaEnumerate(&prop, 1, &token, 1, &n);

    fpgaDestroyProperties(&prop);
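One hedged usage note (not on the slide): every OPAE call returns an fpga_result, so production code would check both the result and the match count, e.g.:

    fpga_result res = fpgaEnumerate(&prop, 1, &token, 1, &n);
    if (res != FPGA_OK || n == 0) {
        /* no matching accelerator found – handle the error */
    }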

Acquire and Release an Accelerator Resource

fpgaOpen() turns an fpga_token into an fpga_handle, acquiring ownership of the accelerator; fpgaClose() releases it:

    fpga_token token;
    fpga_handle handle;

    // ... enumeration ...
    fpgaOpen(token, &handle, 0);
    /* ... use the accelerator ... */
    fpgaClose(handle);

Memory-Mapped I/O

fpgaMapMMIO(…, &mmio_ptr) maps the accelerator's control registers into the SW application's (virtual) process address space. The application then accesses the AFU control registers directly through the returned mmio_ptr, or via the libopae-c accessors fpgaReadMMIO() / fpgaWriteMMIO().
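A hedged sketch of the calls named above, using the 64-bit accessors from the OPAE C API; the 0x40 and 0x48 register offsets are made-up examples, not a real AFU's register map:

    /* assumes an open fpga_handle `handle` from the previous slide */
    uint64_t *mmio_ptr = NULL;
    uint64_t status;

    fpgaMapMMIO(handle, 0, &mmio_ptr);        /* map MMIO space 0          */
    fpgaWriteMMIO64(handle, 0, 0x40, 0x1);    /* hypothetical control reg  */
    fpgaReadMMIO64(handle, 0, 0x48, &status); /* hypothetical status reg   */
    fpgaUnmapMMIO(handle, 0);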

Management and Reconfiguration

A SW application (with admin privilege) loads a green bitstream (GBS) file from storage – e.g. xyz.gbs, whose metadata carries an interface_id and an afu_id – and partially reconfigures the AFU slot via fpgaReconfigureSlot(…, buf, len, 0) through libopae-c. After partial configuration the slot's AFU_ID changes (e.g. from 0xabcdef to 0xbe11e5).

Management and Reconfiguration

    fpga_handle handle;   /* handle to device */
    FILE *gbs_file;
    void *gbs_ptr;
    size_t gbs_size;

    /* Read the bitstream file (size determined from the file itself) */
    fseek(gbs_file, 0, SEEK_END);
    gbs_size = ftell(gbs_file);
    rewind(gbs_file);
    gbs_ptr = malloc(gbs_size);
    fread(gbs_ptr, 1, gbs_size, gbs_file);

    /* Program the GBS to the FPGA */
    fpgaReconfigureSlot(handle, 0, gbs_ptr, gbs_size, 0);
    /* ... */

Where to Get AFUs for the FPGA

Self-developed
▪ Higher productivity: C/C++ programming languages via the Intel® HLS Compiler or the Intel® FPGA SDK for OpenCL™
▪ Performance-optimized: VHDL or Verilog

Externally sourced
▪ Intel® reference designs
▪ Contracted engagement
▪ Ecosystem partner

All paths produce an Accelerator Functional Unit (AFU).

Growing the Xeon+FPGA Ecosystem

▪ IP and solutions – a portfolio of accelerator solutions developed by Intel and third-party technologists to expedite application development and deployment
▪ Developer community – enabling software developers access via Intel Builder programs, the AI Academy, the Intel Developer Zone (IDZ), and Rocketboards.org; committed to an open-source vision
▪ Universities – reaching over 200,000 students per year with FPGA publications, workshops, and hands-on research labs
▪ ISV partners – expanding the reach for system vendors with platforms and ready-to-use application workloads

Growing List of Accelerator Solution Partners

Easing development and data center deployment of Intel FPGAs for workload optimization

Partner domains: data analytics, finance, genomics, media transcoding, cyber security, AI.

Intel PAC Top Solutions for Data Center Acceleration

Solution | Benefit
Cassandra | 96% latency reduction
PostgreSQL | ½ TCO
Genomics (GATK) | 2.5× performance
JPEG2Lepton / JPEG2Webp | 3–4× performance
Big data streaming analytics | 5× performance
Financial (Black-Scholes) | 8× performance
Network security/monitoring | 3× performance

Customer application: big data applications running on Spark/Kafka platforms. Current solution: run Spark/SQL on a cluster of CPUs. Challenge: for many applications (FinServ, genomics, intelligence agencies, etc.), Spark performance does not meet customers' SLA requirements, especially for delay-sensitive streaming workloads.

Customer application: risk-management acceleration framework (financial back-testing). Current solution: deploy a cluster of CPUs or GPUs with complex data access. Challenge: traditional risk-management methods are compute-intensive, time-consuming applications – 10+ hours for financial back-testing.

Leverage FPGA Developers and Build Your Own

Two routes, both using ASE and OPAE from Intel:

HDL programming: the AFU RTL is synthesized and placed & routed by the Intel® Quartus Prime Pro software into an AFU image, while the host C code goes through a SW compiler into an executable; the AFU application is first verified in the AFU Simulation Environment (ASE) with the OPAE software, then deployed on CPU + FPGA through OPAE and the FIM.

OpenCL programming: the OpenCL kernels go through the OpenCL compiler (built on the Intel® HLS Compiler) into an AFU image, the host code through a SW compiler into an executable; the application can run against the OpenCL emulator or on CPU + FPGA through the OpenCL runtime, OPAE software, and FIM.

AFU Development Flow Overview

The AFU RTL feeds both the AFU Simulation Environment (simulation, test & validate the AFU) and the Quartus® compilation that generates the AF; the OPAE SW application compilation targets the hardware system (Xeon® + FPGA).

The AF Simulation Environment (ASE) enables seamless portability to real HW:
▪ Allows fast verification of OPAE software together with the AF RTL, without hardware – the SW application loads the ASE library and connects to an RTL simulation
▪ For execution on HW, the application loads the runtime library, and the RTL is compiled by Intel® Quartus into an FPGA bitstream

FPGA Components of the Acceleration Stack

The FPGA Interface Unit (FIU) bridges the PCIe* link (via the Core Cache Interface, CCI) to the Accelerator Functional Unit (AFU) in the partial reconfiguration (PR) region, and provides EMIF interfaces to local DDR4 memory plus a high-speed serial interface (HSSI: 10Gb/40Gb QSFP+, 100Gb**).
* Could be other interfaces in the future (e.g. UPI)  ** Stratix 10 PAC card

AFU Development Flow Using the OPAE SDK

1. Specify the platform configuration – the AFU requests the ccip_std_afu top-level interface class:
   $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/hello_afu.json
2. Design the AFU – RTL files implementing the accelerated function:
   $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/afu.sv
3. Specify the build configuration – list all source files and the platform configuration file:
   $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/filelist.txt
4. Generate the ASE build environment – in a terminal window, enter these commands:
   cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu
   afu_sim_setup --source hw/rtl/filelist.txt build_sim

5. Verify the AFU with ASE – compile the AFU and platform simulation models, then start the simulation server process:
   cd build_sim
   make
   make sim
   In a second terminal window, compile the host application and start the client process:
   export ASE_WORKDIR=$OPAE_PLATFORM_ROOT/hw/samples/hello_afu/build_sim/work
   cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/sw
   make clean
   make USE_ASE=1
   ./hello_afu

AFU Simulation Environment (ASE)

A hardware/software co-simulation environment for Intel Xeon + FPGA development.

Uses the simulator Direct Programming Interface (DPI) for HW/SW connectivity:
▪ Not cycle-accurate (used for functional correctness)
▪ Converts SW API calls to CCI transactions

Provides a transactional model for the Core Cache Interface (CCI-P) protocol and a memory model for the FPGA-attached local memory.

Validates compliance with:
▪ The CCI-P protocol specification
▪ The Avalon® Memory-Mapped (Avalon-MM) Interface Specification
▪ The Open Programmable Acceleration Engine

Simulation Complete

[Screenshots: AFU simulator window (server); application SW window (client).]

6. Generate the AF build environment:
   cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu
   afu_synth_setup --source hw/rtl/filelist.txt build_synth
7. Generate the AF:
   cd build_synth
   $OPAE_PLATFORM_ROOT/bin/run.sh

Using the Quartus GUI

Compiling the AFU uses a command-line-driven PR compilation flow
▪ Builds the PR-region AF as a .gbs file to be loaded onto the OPAE hardware platform
The Quartus GUI can still be used for the following types of work:
▪ Viewing compilation reports
▪ Interactive timing analysis
▪ Adding SignalTap instances and nodes

Lab 3

Getting Started with Acceleration

Deployment flow:
1. Buy a server with a PAC – server OEM (e.g. Dell)
2. Install the server OS – OS vendor website (e.g. CentOS, RHEL)
3. Download & install the deployment package of the Acceleration Stack – Intel website
4. Download & install the workload – vendor website
5. Write the host application

Development flow:
1. Download & install the developer package of the Acceleration Stack – Intel website
2. Download & install HLS or OpenCL
3. Download & install a simulator (optional)
4. Create & simulate the workload

Getting Qualified Hardware Is Step 1

Now: Dell PowerEdge* R640, R740, R740xd, R840, R940xa. Now: PRIMERGY* RX2540 M4. Available soon: HPE ProLiant* DL360, DL380. And more coming.

Programmable Acceleration Cards (PAC)

Intel® Arria® 10 Accelerator Card – broadest deployment at lowest power
▪ 40G, PCIe* Gen3 ×8
▪ ½-length, ½-height, single-slot PCIe card
▪ Lowest power: 66 W TDP

Intel Stratix® 10 Accelerator Card – highest performance and throughput
▪ 2× 100G, PCIe Gen3 ×16
▪ ¾-length, full-height, dual-slot PCIe card
▪ Up to 225 W maximum

Programmable Solutions Group 234 • Acceleration Stack for Intel® Xeon® with FPGAs • FPGA Acceleration Platforms Intel® portal for all things related • Acceleration Solutions & Ecosystem to FPGA acceleration • Knowledge Center for• FPGALRZ as a Service Intel Proprietary• 01.org *

* 01.org is an open source community site

Follow-On Courses
https://www.intel.com/content/www/us/en/programmable/support/training/overview.html

Introduction to Cloud Computing

Introduction to High Performance Computing (HPC)

Introduction to Apache™ Hadoop

Introduction to Apache Spark™

Introduction to Kafka™

Introduction to Intel® FPGAs for Software Developers

Introduction to the Acceleration Stack for Intel® Xeon® CPU with FPGA

Application Development on the Acceleration Stack for Intel® Xeon® CPU with FPGAs

Building RTL Workloads for the Acceleration Stack for Intel® Xeon® CPU with FPGAs

OpenCL™ Development with the Acceleration Stack for Intel® Xeon® CPU with FPGAs

Intel FPGA OpenCL and HLS Trainings

Teaching Resources

University-focused content & curriculum
▪ Semester-long laboratory exercises for hands-on learning, with solutions
▪ Tutorials and online workshops for self-study on key use cases
▪ Free library of IP common for student projects
▪ Example designs and sample projects

Easy-to-use, powerful software tools
▪ Quartus Prime CAD environment
▪ ModelSim
▪ Intel FPGA Monitor Program for assembly & C development
▪ Intel® SDK for OpenCL™ Applications
▪ Intel OpenVINO™ toolkit (visual inference & neural network optimization)

Teaching Resources (cont.)

Hardware designed for education
▪ 4 different FPGA kits with a variety of peripherals to match project needs
▪ Compact designs with robust shielding for longevity
▪ Reduced academic prices (range: $55–$275)
▪ Donations available in some circumstances

Support
▪ Full access to all developer resources
– Documentation
– Design examples
– Support forum
– Virtual or on-demand trainings

DE-Series Development Boards

DE10-Standard – Cyclone V FPGA + SoC – $259
DE1-SoC – Cyclone V FPGA + SoC – $175
DE10-Nano – Cyclone V FPGA + SoC – $99
DE10-Lite – Max 10 FPGA – $55

Visit our website for full specs on these boards; see the full catalog of Intel FPGA boards & kits at www.terasic.com

Programmable Solutions Group 240 Full-Featured Beginner FPGA Dev Kit FPGA+SoC Academic Dev Kit Academic Dev Kit Dev Kit Intel DE10-Lite Intel DE10-Nano Intel DE1-SoC Intel DE10-Standard Academic Price $55 $99 $175 $259 FPGA Max® 10 Cyclone® V Cyclone® V Cyclone® V Logic Elements 50,000 110,000 85,000 110,000 ARM Cortex-A9 Dual-Core 800 MHz 925 MHz 925 MHz System-on-Chip (SoC)  1 GB DDR3 SDRAM (HPS), 64 MB 1 GB DDR3 SDRAM (HPS), Memory 64 MB SDRAM 1 GB DDR3 SDRAM (HPS) SDRAM (FPGA) 64 MB SDRAM (FPGA) PLLs 4 9 9 9 GPIO Count 500 469 469 469 7 Segment Displays 6  6 6 Switches 10 4 10 10 Buttons 2 2 4 4 LEDs 10 8 10 10 Clocks (2x) 50 MHz (3x) 50 MHz (4x) 50 MHz (4x) 50 MHz GPIO Count 40-pin header (2x) 40-pin header (2x) 40-pin header 40-pin header Video Out VGA 12-bit DAC HDMI VGA 24-bit DAC VGA 24-bit DAC ADC Channels  8 8 + programmable voltage range 8 + programmable voltage range Video In   NTSC, PAL, Multi-format NTSC, PAL, Multi-format Line In/Out, Microphone In (24 bit Line In/Out, Microphone In Audio In/Out   Audio CODEC) (24 bit Audio CODEC) Ethernet  Gigabit 10/100/1000 Ethernet (x1) 10/100/1000 Ethernet (x1) USB OTG  1x USB OTG 2x USB 2.0 (Type A) 2x USB 2.0 (Type A) LCD    128x64 backlit Micro SD Card Support  ✓ ✓ ✓ Accelerometer ✓ for✓ LRZ✓ ✓ PS/2 Mouse/Keyboard Port   ✓ ✓ Infrared Intel Proprietary ✓ ✓ HSMC Header    ✓ Arduino Header ✓ ✓  

Undergrad Lab Exercise Suites: Digital Logic

The first digital hardware course in an EE, CompEng, or CS curriculum; traditionally introduced in sophomore year. Offered in VHDL or Verilog.

Lab 1 - Switches, Lights, and Multiplexers
Lab 2 - Numbers and Displays
Lab 3 - Latches, Flip-flops, and Registers
Lab 4 - Counters
Lab 5 - Timers and Real-Time Clock
Lab 6 - Adders, Subtractors, and Multipliers
Lab 7 - Finite State Machines
Lab 8 - Memory Blocks
Lab 9 - A Simple Processor
Lab 10 - An Enhanced Processor
Lab 11 - Implementing Algorithms in Hardware
Lab 12 - Basic Digital Signal Processing

Undergrad Lab Exercise Suites: Comp Organization

Typically the second hardware course in an EE, CompEng, or CS curriculum: an introduction to microprocessors & assembly-language programming. Uses an ARM processor (on the SoC kits) or the NIOS II soft processor, with the Intel FPGA Monitor Program for compiling & debugging assembly & C code.

Lab 1 - Using an ARM Cortex-A9 System or NIOS II System
Lab 2 - Using Logic Instructions with the ARM Processor
Lab 3 - Subroutines and Stacks
Lab 4 - Input/Output in an Embedded System
Lab 5 - Using Interrupts with Assembly Code
Lab 6 - Using C Code with the ARM Processor
Lab 7 - Using Interrupts with C Code
Lab 8 - Introduction to Graphics and Animation

Intel FPGA Monitor Program

A design environment used to compile, assemble, download & debug programs for the ARM* Cortex*-A9 processor in Intel’s Cyclone® V SoC FPGA devices
▪ Compile programs, specified in assembly language or C, and download the resulting machine code into the hardware system
▪ Display the machine code stored in memory
▪ Run the ARM processor, either continuously or by single-stepping instructions
▪ Modify the contents of processor registers
▪ Modify the contents of memory, as well as memory-mapped registers in I/O devices
▪ Set breakpoints that stop the execution of a program at a specified address, or when certain conditions are met

Clean and simple UX. Tutorials at fpgauniversity.intel.com. Download independently or as part of the University Program installer (always free!).

Undergrad Lab Exercise Suites: Embedded Systems

Typically the third hardware course in an EE, CompEng, or CS curriculum; combines hardware and software, with an introduction to embedded Linux.

Lab 1 - Getting Started with Linux
Lab 2 - Developing Linux Programs that Communicate with the FPGA
Lab 3 - Character Device Drivers
Lab 4 - Using Character Device Drivers
Lab 5 - Using ASCII Graphics for Animation
Lab 6 - Introduction to Graphics and Animation
Lab 7 - Using the ADXL345 Accelerometer
Lab 8 - Audio and an Introduction to Multithreaded Applications

Lab Exercise Suites: Machine Learning Basics

Machine Learning on FPGAs – a senior or grad-level course in an EE, CompEng, CS, or data science curriculum. Teaches how to use the Intel® SDK for OpenCL™ Applications with FPGAs. A basic understanding of AI fundamentals is recommended*.

Lab 1 – Introduction to OpenCL
Lab 2 – Image Processing
Lab 3 – Lane Detection for Autonomous Driving
Lab 4 – Linear Classifier for Handwritten Digits
Lab 5 – Neural Networks
Lab 6 – Using the Deep Learning Accelerator Library
Lab 7 – Integrating OpenCL Accelerators into Existing Software

*For foundational AI & machine learning curricula, visit our partner program, the Intel AI Academy

AI Academy Course Outline

Runs in the cloud on the Arria 10 PAC card. Contains slides, lab exercises, and recordings for each class: https://software.intel.com/en-us/ai-academy/students/kits/dl-inference-fpga

Class 1 - Introduction to FPGAs for deep learning inferencing
Class 2 - Building a deep learning computer vision application w/ acceleration using a DL framework
Class 3 - Introduction to the OpenVINO™ toolkit
Class 4 - Introduction to the Deep Learning Accelerator Suite for Intel FPGAs
Class 5 - Introduction to the Acceleration Stack for Intel Xeon CPU with FPGAs

Lab 1 - Deploy an application on an Intel CPU
Lab 2 - Deploy an application on an Intel CPU using the OpenVINO toolkit
Lab 3 - Accelerate the application on an Intel FPGA

In-Person Workshops

Throughout the year our technical outreach team visits universities and industry conferences around the world to conduct hands-on workshops that train professors and students on how to use Intel FPGAs for education and research.

Topics:

Intro to FPGAs and Quartus (4 hrs.)
Simulation & Debug (4 hrs.)
Static Timing Analysis of Digital Circuits (4 hrs.)
High-Speed IO (4 hrs.)
Embedded Linux (4 hrs.)
Embedded Design using Nios II (4 hrs.)
High-Level Synthesis (4 hrs.)
Machine Learning Acceleration (4 hrs.)
Modern Applications of FPGAs (1 hr.)
How to Get Hired in the Tech Industry (1 hr.)

Contact us at [email protected] to inquire about scheduling a workshop.

Find Materials: FPGAUniversity.INTEL.com


Membership: FPGAUniversity.INTEL.com


Contact the University Team

Rebecca Nevin – Outreach Manager, Intel FPGA University Program – [email protected]
Larry Landis – Senior Manager, New User Experience Group – [email protected]

GPU Comparison

How Do GPUs Deal with Fine-Grained Data Sharing?

Some GPU techniques involve implicit SIMT synchronization

FPGA threads aren’t warp-locked, so implicit synchronization doesn’t make sense
▪ FPGAs do exactly what you ask them to do, the way you code it

An Even Closer Look: CUDA Execution Model

[Diagram: the NDRange data is divided into a grid of thread blocks; each block's threads are grouped into warps and dispatched by the CUDA scheduler.]

 | FERMI GF100 (SM) | FERMI GF104 (SM) | KEPLER GK104 (SMX) | KEPLER GK110 (SMX) | MAXWELL GM107 (SMM)
Compute capability | 2.0 | 2.1 | 3.0 | 3.5 | 5.0
Shared memory/SM | 48 KB | 48 KB | 48 KB | 48 KB | 64 KB
32-bit registers/SM | 32768 | 32768 | 64K | 64K | 64K
Max threads/thread block | 1024 | 1024 | 1024 | 1024 | 1024
Max thread blocks/SM | 8 | 8 | 16 | 16 | 32
Max threads/SM | 1536 | 1536 | 2048 | 2048 | 2048
Threads/warp | 32 | 32 | 32 | 32 | 32
Max warps/SM | 48 | 48 | 64 | 64 | 64
Max registers/thread | 63 | 63 | 63 | 255 | 255

FPGA Execution Model

[Diagram: a single block of data flows through a pipeline of custom instructions; multiple blocks of data flow through multiple pipelines of custom instructions, and all of them execute in parallel.]

Divergent Control Flow on GPU

The GPU uses a SIMT pipeline to save area on control logic: in a loop such as for (i = 0; i < …) with a data-dependent branch, both sides of the branch occupy the pipeline and a predicate mask (e.g. mask = (x[i] < …)) selects which lanes commit each path.

Divergent Control Flow: Just Fine for FPGA

The FPGA datapath already has all operations in silicon
▪ Speculatively execute both paths
▪ Overlap the branch-condition computation with Path A and Path B
▪ Absorb the branch and both paths into one block and compress the schedule – no longer any control flow
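A minimal illustration (not from the slides) of what this transformation amounts to in source terms; f, g, and t are placeholder names:

    /* Illustrative only: both branch paths are computed speculatively and
     * the branch collapses into a select (a multiplexer in hardware). */
    static float f(float x) { return x * 2.0f; }   /* path A (example) */
    static float g(float x) { return x + 1.0f; }   /* path B (example) */

    float branch_free(float x, float t) {
        float pathA = f(x);                /* computed every cycle */
        float pathB = g(x);                /* computed every cycle */
        return (x < t) ? pathA : pathB;    /* mux – no control flow remains */
    }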

Memory Hierarchy

1. Register data: registers in the FPGA fabric
2. Private data: registers in the FPGA fabric
3. Local memory: on-chip RAMs
4. Global memory: off-chip external memory

External Memory Dynamic Coalescing

For a CPU/GPU, the cache and memory controller handle coalescing. For an FPGA, we create dynamic coalescing hardware matched to the characteristics of the specific memory it is connected to:
– Re-order memory accesses at runtime to exploit data locality
– DDR is extremely inefficient at random access
– Access with row bursts whenever possible
[Diagram: a load pipeline reorders scattered addresses into bursts before issuing them to DDR.]

On-Chip FPGA Memory

“Local” memory uses on-chip block RAM resources
– Very high bandwidth (~8 TB/s)
– Random access in 2 cycles
– Limited capacity

The memory system is customized to your application – a huge value proposition over fixed-architecture accelerators.

The banking configuration (number of banks, width) and the interconnect are all customized for your kernel, automatically optimized to eliminate or minimize access contention.

Key idea: let the compiler minimize bank contention
– If your code is optimized for another architecture (e.g. array[tid + 1] to avoid bank collisions), undo the fixed-architecture workarounds
– They can prevent the optimal structure from being inferred

FPGA Local Memory

[Diagram: eight M20K blocks form Bank0–Bank7, connected to four load/store units through an arbitration network.]

Split memory into logical banks
▪ An N-bank configuration can handle N requests per clock cycle as long as each request addresses a different bank
▪ Manipulate memory addresses so that parallel threads are likely to access different banks – reducing collisions

Local Memory Attributes

Annotations added to local memory variables to improve throughput or reduce area.
Banking control:
– numbanks
– bankwidth
Port control:
– numreadports / numwriteports
– singlepump / doublepump

numbanks(N) and bankwidth(N) Memory Attributes

What do they do? They specify the banking geometry of your local memory system; a bank is a single independent memory system.

What are they for? They can be used to optimize LSU-to-memory connectivity to boost performance; banking should be set up to maximize “stall-free” accesses.

numbanks(N) and bankwidth(N) Memory Attributes

Not stall-free:

    local int lmem[8][4];

    #pragma unroll
    for (int i = 0; i < 4; i += 2) {
        lmem[i][x] = …;
    }

With the default banking geometry, both LSUs contend for lmem[8][4] through an arbitration network.

numbanks(N) and bankwidth(N) Memory Attributes

Stall-free:

    local int __attribute__((numbanks(8),
                             bankwidth(16)))
        lmem[8][4];

    #pragma unroll
    for (int i = 0; i < 4; i += 2) {
        lmem[i][x & 0x3] = …;
    }

Each of the 8 rows becomes its own 16-byte bank (Bank 0–7), so the two LSUs hit different banks; masking the index tells the compiler there are no out-of-bounds accesses.

numreadports/numwriteports and singlepump/doublepump Memory Attributes

What do they do? numreadports/numwriteports specify the number of read/write ports of the local memory system; singlepump/doublepump specify its pumping (the memory runs at 1× or 2× the kernel clock).

What are they for? They control the number of memory blocks used to implement the local memory system.

numreadports/numwriteports and singlepump/doublepump Memory Attributes

    local int __attribute__((singlepump,
                             numreadports(3),
                             numwriteports(1)))
        lmem[16];

Single-pumped: three M20K blocks are replicated so that reads read_0, read_1, and read_2 each get their own port, all sharing one write.

    local int __attribute__((doublepump,
                             numreadports(3),
                             numwriteports(1)))
        lmem[16];

Double-pumped: the same three read ports and one write port fit in a single M20K running at twice the clock.
