Opencl and the Opencl Logo Are Trademarks of Apple Inc
Total Page:16
File Type:pdf, Size:1020Kb
IWOCL *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Agenda ▪ Intel® FPGA SDK for OpenCL™ ▪ Optimizing ND Range Kernels ▪ Single Work-Item Execution ▪ Using Channels / Pipes ▪ Optimizing Memory *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 2 Innovation Across the Board FPGA/CPLD FPGA FPGA FPGA PowerSoCs Lowest Cost, Cost/Power Balance Mid-range FPGAs Optimized for High-efficiency Lowest Power SoC & Transceivers SoC & Transceivers High Bandwidth Power Management Resources Embedded Soft and Design Development Intellectual Hard Processors Software Kits Property (IP) Nios® II ▪ Industrial ▪ Computing Arm* ▪ Enterprise Programmable Solutions Group 3 *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Intel® FPGA SDK for OpenCL™ Section Agenda ▪ Introduction ▪ Intel® FPGA SDK for OpenCL™ Usage ▪ Overview of Debug and Optimizing Reports *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 5 Intel® FPGA SDK for OpenCL™ Usage OpenCL OpenCL Host Program Kernels Standard C Intel FPGA OpenCL Libraries Offline Compiler Compiler (OpenCL Kernel Compiler) Executable Binary File Programming File *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 6 FPGA Architecture DSP Block ▪ Massive Parallelism – Millions of logic elements – Thousands of embedded memory blocks – Thousands of Variable Precision DSP blocks – Programmable routing – Dozens of High-speed transceivers – Various built-in hardened IP Adaptive Logic Module (ALM) ▪ FPGA Advantages Programmable Routing Switch – Custom hardware! – Efficient processing Lookup FFFF Table FF – Low power FF Logic – Ability to reconfigure Modules – Fast time-to-market Programmable Solutions Group 7 FPGA Architecture for OpenCL™ Implementation Precompiled periphery (BSP) FPGA External External DDR Memory Controller Memory Controller Processor Host Interface & PHY & PHY Global Memory Interconnect On-Chip On-Chip Memory Memory Custom Built Kernel Kernel Kernel System Pipeline Pipeline Local Memory Interconnect Local Memory Interconnect *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 8 OpenCL™ Kernels to Dataflow Circuits Each kernel is converted into custom dataflow hardware (Compute Unit) ▪ Gain the benefits of FPGAs without the length design process ▪ Implement C operators as circuits – HDL code located in <SDK Installation>\ip – Load Store units to read/write memory – Arithmetic units to perform calculations – Flow control units – Connect circuits according to data flow in the kernel ▪ May replicate circuit to accelerate algorithm *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 9 Compilation Example Kernel compiled into dataflow circuit with flow control ▪ Includes branch and merge units For Entry __kernel void my_kernel ( __global float *a, Load a[i] Load b[i] __global float *b, __global float *c, int N) a[i] + b[i] { int i; aoc for (i = 0; i < N; i++) Store c[i] c[i] = a[i] + b[i]; } For End Programmable Solutions Group 10 Pipeline Execution of NDRange Kernels and Loops ▪ For NDRange work-items and loop iterations ▪ On each cycle the portions of the pipeline are processing different threads ▪ While work-item 2 is being loaded, work-item 1 is being added, and work-item 0 is being stored Example Workgroup with 8 work-items Thread IDs 3021 Load 201 7 76 765 6754 57643 46532 35421 24310 10 Store 0 + 3012 Load 201 Programmable Solutions Group 11 Simultaneous Multithreading Execution Model Tasks distributed through multiple queues can run in parallel ▪ Same device or multiple devices ▪ AOC implements dedicated compute units for each kernel – Different kernels can run in parallel Implicit Parallelism Sequential execution Task Parallelism in in Algorithm with one queue OpenCL™ implementation u = foo(x); myqueue.enqueueNDRangeKernel(cl_foo,…) Q1.clEnqueueNDrangeKernel(cl_foo,…) y = bar(x); myqueue.enqueueNDRangeKernel(cl_bar,…) Q2.clEnqueueNDrangeKernel(cl_bar,…) Device foo_CU cl_foo K1 K1 K2 cl_bar, cl_foo K2 bar_CU cl_bar *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 12 Intel® FPGA SDK for OpenCL™ Section Agenda ▪ Introduction ▪ Intel® FPGA SDK for OpenCL™ Usage – SDK Content - AOCL Utility – Kernel Compilation - Runtime – Host Compilation - Libraries ▪ Overview of Debug and Optimizing Reports *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 13 SDK Components ▪ Offline Compiler (AOC) – Translates your OpenCL™ C kernel source file into an Intel® FPGA hardware image – Requires Intel Quartus® Development Environment ▪ Host Libraries – Provides the OpenCL host API to be used by OpenCL host applications ▪ AOCL Utility – Perform various tasks related to the board, drivers, and compile process ▪ Intel Code Builder for OpenCL API with FPGA kernel development framework – Provides Microsoft* Visual Studio or Eclipse-based IDE for code development *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 14 Compiling Kernels __kernel void sum (__global float *a, __global float *b, __global float *y) Run the Offline Compiler { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; ▪ aoc -list-boards } – List available boards within the current board package ▪ aoc –board=<board> <kernel file> Offline Compiler – Compile the kernel to a board in the board package – Generates the kernel hardware system and compiles it using AOCX the Intel® Quartus® Prime software to target a specific board Programmable Solutions Group 15 OpenCL™ Libraries Create libraries from RTL or OpenCL™ source and call those library functions from User OpenCL code Verilog OpenCL™ User’s OpenCL code VHDL Library AOC AOCX See the Intel® FPGA SDK for OpenCL Programming Guide for detailed examples *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 16 aoc Output Files ▪ <kernel file>.aoco – Intermediate object file representing the created hardware system ▪ <kernel file>.aocx – Kernel executable file used to program FPGA ▪ Inside <kernel file> folder – <kernel file folder>\reports\report.html – Interactive HTML report – Static report showing optimization, detailed area, and architectural information – <kernel file>.log compilation log – Intel® Quartus® Prime software generated source and report files Programmable Solutions Group 17 Intel FPGA Preferred Board for OpenCL ▪ Intel® FPGA Preferred Board for OpenCL™ – Available for purchase from preferred partners – Passes conformance testing ▪ Download and install Intel FPGA OpenCL compatible BSP from vendor – Supplies board information required by the offline compiler – Provides software layer necessary to interact with the host code including drivers *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 18 Custom Platform Framework of host software and FPGA interface design to enable the use of OpenCL™ on a custom board ▪ FPGA design, software, and board bring up skills required ▪ Custom BSP provides Host Software FPGA Board Hardware User OpenCL host DMA – Timing-closed Hardware application – MMD software layer DDR / QDR OpenCL Lib – Some AOCL utility function HAL OpenCL MMD Interface IP/XCVR kernel Provided by Intel® User-Provided Custom User Application FPGA Platform *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 19 Compiling the Host Program main() { read_data( … ); manipulate( … ); ▪ Include CL/opencl.h or CL/cl.hpp clEnqueueWriteBuffer( … ); clEnqueueNDRange(…,sum,…); clEnqueueReadBuffer( … ); ▪ Use a conventional C compiler (Visual Studio*/GCC) display_result( … ); } ▪ Add $INTELFPGAOCLSDKROOT/host/include to your file search path Intel FPGA Standard – Recommended to use aocl compile-config Libraries C Compiler ▪ Link to Intel® FPGA OpenCL™ libraries – Link to libraries located in the $INTELFPGAOCLSDKROOT/host/<OS>/lib directory – Recommended to use aocl link-config *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 20 Intel® FPGA SDK for OpenCL™ Section Agenda ▪ Introduction ▪ Intel® FPGA SDK for OpenCL™ Usage ▪ Overview of Debug and Optimizing Reports *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos Programmable Solutions Group 21 Kernel Development Flow and Tools Modify kernel.cl Functional bugs? Emulator (secs) Poor performance? HTML Report (~1 min) Loop inefficiencies? Loop Optimization Report Undesired hardware structure? Detailed Area Report Sub-optimal memory Architectural Viewer interconnect? Profiler (full compile time) Done Programmable Solutions Group 22 Debugging Kernels Using printf printf instructions in kernels are supported ▪ Conforms to OpenCL™ 1.2 specification – No usage limitations – Can use inside if-then-else statements, loops, etc. ▪ Order of concurrent calls (from different work-items) are not guaranteed ▪ aoc allocates 64kB global buffer for printfs – Once kernel execution completes, contents are printed to standard output – If the buffer overflows, kernel execution stalls until the host reads