18-643 Lecture 11: OpenCL for FPGA

Shashank Obla Department of ECE Carnegie Mellon University

18-643-F20-L11-S1, James C. Hoe, CMU/ECE/CALCM, ©2020 Housekeeping

• Your goal today: understand Intel’s interpretation of OpenCL for FPGAs
• Notices
  – Handout #6: lab 2, due Monday noon, 10/12
  – Handout #7: lab 3 (look for on Friday)
  – Project status report due each Friday; 3 weeks to project proposal!!
• Readings
  – skim Ch 10, Reconfigurable Computing
  – for concrete reference: Intel SDK for OpenCL: Programming Guide and Best Practices Guide

18-643-F20-L11-S2, James C. Hoe, CMU/ECE/CALCM, ©2020 Khronos’ OpenCL

18-643-F20-L11-S3, James C. Hoe, CMU/ECE/CALCM, ©2020 Two Parts to OpenCL
1. Platform model
   – host (CPU & memory)
   – 1 or more accelerator devices + device-side mem hierarchy: global/local/private
   – for host code to interact with devices
     • launch compute kernels to devices
     • prepare (load/unload) device memory
2. Kernel
   – perfect triply-nested loops
   – no loop-carried dependence

18-643-F20-L11-S4, James C. Hoe, CMU/ECE/CALCM, ©2020 OpenCL Platform Model (OpenCL terms introduced in bold)

[Figure: OpenCL platform model --- a host CPU with host memory connected to one or more compute devices, each with its own global memory]

18-643-F20-L11-S5, James C. Hoe, CMU/ECE/CALCM, ©2020 What are these “compute devices”??? Basic Host Program Example

main ( ) {
   . . . get device and queue . . .
   . . . allocate memory buf objects . . .
   . . . get kernel object . . .
   while ( ) {
      . . . initialize memory buf data . . .
      . . . bind buf objects to kernel arguments . . .
      . . . add kernel and buf objects to device queue . . .
      . . . wait for kernel to finish . . .
      . . . retrieve memory buf object for result . . .
   }
}
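A concrete (and heavily simplified) version of this flow using the standard OpenCL host API is sketched below. The kernel name "foo", the bounds R0/R1/R2, the buffer size, the host arrays hostA/hostC, the loop condition more_work, and the already-built program object are placeholders, and all error checking is omitted.

   // minimal host-side sketch (assumes "program" was already built for the device)
   cl_platform_id platform;  cl_device_id device;
   clGetPlatformIDs(1, &platform, NULL);
   clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

   cl_context ctx     = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
   cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

   cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  size, NULL, NULL);   // device memory buffers
   cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, size, NULL, NULL);
   cl_kernel k = clCreateKernel(program, "foo", NULL);                       // get kernel object

   while (more_work) {
      clEnqueueWriteBuffer(q, bufA, CL_TRUE, 0, size, hostA, 0, NULL, NULL); // initialize buf data
      clSetKernelArg(k, 0, sizeof(cl_mem), &bufA);                           // bind bufs to kernel args
      clSetKernelArg(k, 1, sizeof(cl_mem), &bufC);
      size_t gws[3] = {R2, R1, R0};        // NDRange bounds (see the kernel slides that follow)
      clEnqueueNDRangeKernel(q, k, 3, NULL, gws, NULL, 0, NULL, NULL);       // add kernel to queue
      clFinish(q);                                                           // wait for kernel to finish
      clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, size, hostC, 0, NULL, NULL);  // retrieve result
   }

In the Intel FPGA flow the program object typically comes from a precompiled .aocx image loaded with clCreateProgramWithBinary rather than from online compilation of source.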

18-643-F20-L11-S6, James C. Hoe, CMU/ECE/CALCM, ©2020 What are these “kernels”??? What are these kernels?

18-643-F20-L11-S7, James C. Hoe, CMU/ECE/CALCM, ©2020 Specifically talking about OpenCL C Conceptually . . .

for (int i=0; i < R0; i++)

for (int j=0; j < R1; j++)

for (int k=0; k < R2; k++) {
      << local variable declarations >>
      << arbitrary C-code with access to global memory >>
   }

• Loop body must be data-parallel
  – local variables limited to scope
  – disallow loop-carried dependencies through global memory
  ==> statements from different iterations can interleave in any order (using disambiguated local variables)

18-643-F20-L11-S8, James C. Hoe, CMU/ECE/CALCM, ©2020 Concretely . . .

• Only specify loop body as a kernel function

  __kernel void foo(<< pointers to global mem buf >>) {
     int i=get_global_id(2), j=get_global_id(1), k=get_global_id(0);
     << local variable declarations >>
     << arbitrary C-code with access to global memory >>
  }

• Triply-nested loops hardwired as NDRange
  – specified as 3 integer constants, i.e., the loop bounds (R0, R1, R2)
  – 1 execution of kernel function is a work-item; a work-item has private memory for local var’s
  – 1 kernel execution is R0×R1×R2 work-items

18-643-F20-L11-S9, James C. Hoe, CMU/ECE/CALCM, ©2020 Example: N-by-N MMM

__kernel void mmm(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   for(int k=0; k < N; k++)
      C[i*N+j] += A[i*N+k] * B[k*N+j];
}

18-643-F20-L11-S10, James C. Hoe, CMU/ECE/CALCM, ©2020 (For Your Reference: N-by-N MMM in C)

float A[N][N], B[N][N], C[N][N];

for(int i=0; i < N; i++)
   for(int j=0; j < N; j++)
      for(int k=0; k < N; k++)
         C[i][j] += A[i][k] * B[k][j];

• Note:
  – Loop body of the inner-most loop is not data-parallel---dependency through C[i][j]
  – Loop body of the second inner-most loop is data-parallel (a kernel variant exploiting this is sketched below)
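One standard fix (a sketch, not from the slides) is to keep the running sum in a per-work-item private variable, so the only global-memory write to C happens once, after the k loop:

   __kernel void mmm(__global float *A, __global float *B, __global float *C) {
      int i = get_global_id(1);
      int j = get_global_id(0);
      float sum = 0.0f;                  // private memory: one accumulator per work-item
      for (int k = 0; k < N; k++)        // N assumed to be a compile-time constant
         sum += A[i*N+k] * B[k*N+j];
      C[i*N+j] = sum;                    // single store; no read-modify-write of global C
   }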

18-643-F20-L11-S11, James C. Hoe, CMU/ECE/CALCM, ©2020 Work-Group

• Partition NDRange of R0×R1×R2 work-items into 3D work-groups of G0×G1×G2 work-items

  – G0/1/2 must divide R0/1/2 evenly
  – get_local_id(dim): id within group
  – get_group_id(dim): id of group
  (example: NDRange=(9, 3, 6), Group Size=(3, 3, 3), Group Range=(3, 1, 2))
• Work-group signifies “locality” b/w work-items
  – executed together by a processing element
  – can share per-group local memory
  – can synchronize by barrier()
  (a kernel sketch using these features appears after the GPGPU slide below)
Why do we need this?

18-643-F20-L11-S12, James C. Hoe, CMU/ECE/CALCM, ©2020 OpenCL Kernels on GPGPUs

• A work-item is a CUDA thread
• A work-group executes as a thread block---broken down into 32-work-item SIMD warps
• Work-groups from same and different kernels are interleaved on a Streaming Processor (128 SIMD lanes, 1 INT + 1 FMA per lane, 1.73GHz)
• 1 kernel can execute on up to 20 StrmProc’s (as 1 compute device), peak 8,873 GFLOPS
• Global=GDDR; local=96KB SRAM; private=256KB SRAM
(terms in italic-bold; numbers for GTX 1080)
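The work-group features above (get_local_id, per-group local memory, barrier) look like the following in a kernel. This is a minimal illustrative sketch; GRP is a hypothetical compile-time work-group size that the host must also pass as the local work size.

   __kernel void neighbor_avg(__global const float *in, __global float *out) {
      __local float tile[GRP];            // per-group local memory (shared by the work-group)
      int gid = get_global_id(0);
      int lid = get_local_id(0);          // id within the work-group
      tile[lid] = in[gid];                // each work-item stages one element
      barrier(CLK_LOCAL_MEM_FENCE);       // synchronize: whole tile is now visible to the group
      int left = (lid + GRP - 1) % GRP;   // safely read a neighbor's element from local memory
      out[gid] = 0.5f * (tile[lid] + tile[left]);
   }

On a GPU this work-group maps to a thread block and tile lands in the per-SM SRAM; on the host side the group size is chosen via the local_work_size argument of clEnqueueNDRangeKernel.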

18-643-F20-L11-S13, James C. Hoe, CMU/ECE/CALCM, ©2020 Aside: Barrel Processor [HEP, Smith]

• Each cycle, select a “ready” thread from the scheduling pool
  – only one instruction per thread in flight at once
  – on a long-latency stall, remove the thread from scheduling
• Simpler and faster implementation since
  – no data dependence, hence no stall or forwarding
  – no penalty in making the pipeline deeper

[Figure: pipeline diagram---instructions from threads T1 through T8 (a “warp”) are interleaved one per cycle through pipeline stages A–H, so each thread’s next instruction issues only after its previous one completes]

18-643-F20-L11-S14, James C. Hoe, CMU/ECE/CALCM, ©2020 To get 8,873 GFLOPS . . .

• # work-items ≥ 128 x StrmProc pipeline depth x 20
• Computation entirely of Fused-Multiply-Add (interleaved warps, so no worries about RAW stalls)
• No if-then-else (branch divergence)
BTW:
• ld’s and st’s take up inst. issue BW off the top
• 320 GB/sec DRAM BW ==> AI > 108 SP FP ops per float accessed
  – only certain access patterns can sustain peak BW
  – SIMD ld’s and st’s in a warp must go to the same memory line (memory coalescing), as in the sketch below
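For example (an illustrative sketch, not from the slides), consider how the 32 work-items of one warp index a global array:

   __kernel void access_demo(__global const float *in, __global float *out) {
      int t = get_global_id(0);
      float a = in[t];        // coalesced: a warp reads 32 consecutive floats, i.e., one memory line
      float b = in[t * 32];   // not coalesced: the same 32 reads scatter across 32 different lines,
                              // wasting most of the 320 GB/sec
      out[t] = a + b;
   }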

18-643-F20-L11-S15, James C. Hoe, CMU/ECE/CALCM, ©2020 Intel OpenCL for FPGA

18-643-F20-L11-S16, James C. Hoe, CMU/ECE/CALCM, ©2020 OpenCL FPGA Platform Model

[Figure: OpenCL FPGA platform model --- host CPU and host memory connected to FPGA compute devices; each device contains custom kernels and its own global memory (DRAM)]

Compute devices synthesized from kernel functions

18-643-F20-L11-S17, James C. Hoe, CMU/ECE/CALCM, ©2020 Example: N-by-N MM“Add”

__kernel void mma(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   C[i*N+j] = A[i*N+j] + B[i*N+j];
}

• NDRange=(N, N, 1) (host-side launch sketched below)
• Note in this example:
  – data-parallel kernel function
  – no loop in kernel function
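On the host side, this NDRange is what gets passed to clEnqueueNDRangeKernel. A minimal sketch, with q and mma_kernel as placeholder handles and argument binding assumed done as in the basic host program earlier:

   size_t gws[3] = {N, N, 1};   // dim 0 -> j, dim 1 -> i, dim 2 unused
   clEnqueueNDRangeKernel(q, mma_kernel, 3, NULL, gws, NULL, 0, NULL, NULL);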

18-643-F20-L11-S18, James C. Hoe, CMU/ECE/CALCM, ©2020 Fully-pipelined Kernel NDRange = (N,N,1) (gid0,gid1) stream =(0,1),…(0,N-1),(1,0)…(1,N-1)……(N-1,0)…(N-1,N-1)

[Figure: fully-pipelined datapath --- the (gid0, gid1) stream feeds three address calculations (&A[i*N+j], &B[i*N+j], &C[i*N+j]); loads of A[i*N+j] and B[i*N+j] feed an adder whose result is stored to C[i*N+j]]

18-643-F20-L11-S19, James C. Hoe, CMU/ECE/CALCM, ©2020 fully unrolled loop in kernel fxn also okay What about MMM?

__kernel void mmm(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   for(int k=0; k < N; k++)
      C[i*N+j] += A[i*N+k] * B[k*N+j];
}

18-643-F20-L11-S20, James C. Hoe, CMU/ECE/CALCM, ©2020 Use of Single Work-Item Kernel

__kernel void mmm(__global float *A, … *B, … *C) {
   for(int i=0; i < N; i++)
      for(int j=0; j < N; j++)
         for(int k=0; k < N; k++)
            C[i*N+j] += A[i*N+k] * B[k*N+j];
}

[Figure: kernel-to-kernel streaming --- host ↔ PCIe ↔ global mem (DRAM) → load kernel → kernelA → chnl → kernelB → chnl → krnlC → store kernel → global mem (DRAM) ↔ PCIe ↔ host]
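In Intel’s OpenCL the load→kernel→kernel→store chain in the figure is written with the channels extension (cl_intel_channels). The sketch below is illustrative, with made-up kernel/channel names and a trivial compute stage; each kernel would be launched as a single work-item kernel.

   #pragma OPENCL EXTENSION cl_intel_channels : enable
   channel float c_in;                           // on-chip FIFOs between kernels;
   channel float c_out;                          // data never round-trips through DRAM

   __kernel void load(__global const float *A, int n) {
      for (int i = 0; i < n; i++)
         write_channel_intel(c_in, A[i]);        // stream from global mem into the pipeline
   }

   __kernel void compute(int n) {
      for (int i = 0; i < n; i++)
         write_channel_intel(c_out, 2.0f * read_channel_intel(c_in));
   }

   __kernel void store(__global float *C, int n) {
      for (int i = 0; i < n; i++)
         C[i] = read_channel_intel(c_out);       // only the final result touches global mem
   }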

18-643-F20-L11-S22, James C. Hoe, CMU/ECE/CALCM, ©2020 Different programming mindset from GPU Program-to-HW, not Function-to-IP

[Figure: FPGA organized as a shell (PCIe controller, DRAM controller, device memory interconnect) plus user space holding the HLS kernels, their local memories, and the interconnect to global mem (DRAM); host CPU and host mem attach over PCIe]

Object of design/optimization is the entire FPGA system

18-643-F20-L11-S23, James C. Hoe, CMU/ECE/CALCM, ©2020 Inside an example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
   for (int j = 0; j < WB; j += BWB) {
      for (int k = 0; k < WA; k += BWA) {
         for (int ii = 0; ii < BHA; ++ii) {
            for (int jj = 0; jj < BWB; ++jj) {
               #pragma unroll
               for (int kk = 0; kk < BWA; ++kk) {
                  C[(i + ii) * WC + j + jj] +=
                     A[(i + ii) * WA + k + kk] * B[(k + kk) * WB + j + jj];
               }
            }
         }
      }
   }
}

18-643-F20-L11-S24, James C. Hoe, CMU/ECE/CALCM, ©2020 Performance-Geared Synthesis

• Automatic pipelining and unrolling
  – no pipeline directive, but disable_loop_pipelining, so you still have control
  – unroll factor controls extent of unrolling
• Local memory optimizations
  – if you don’t say anything, the tool does its best
  – banks, replicates, private copies and interconnect
(see the sketch below for the corresponding pragmas/attributes)
[Figure: local memory organized into banks (bank 0, bank 1), each bank with replicates (replicate 0, replicate 1); bytes/cycle = width × ports]
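The corresponding source-level controls look roughly like the sketch below. The kernel and constants are made up; the pragma and attribute spellings follow the Intel SDK for OpenCL guides, so treat the exact names and placement as assumptions to be checked against the version you use.

   __kernel void pragma_demo(__global const float *w, __global float *out) {
      // local memory geometry hints; without them the compiler picks banks/replicates itself
      __local float buf[64] __attribute__((numbanks(2), max_replicates(2)));

      #pragma unroll 4                     // partial unroll: 4 copies of the loop body
      for (int k = 0; k < 64; k++)
         buf[k] = w[k];

      float acc = 0.0f;
      #pragma disable_loop_pipelining      // opt out of the automatic loop pipelining
      for (int k = 0; k < 64; k++)
         acc += buf[k];

      out[0] = acc;
   }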

18-643-F20-L11-S25, James C. Hoe, CMU/ECE/CALCM, ©2020 Global Memory Interconnect

• Load-Store Units – automatically chosen based on access pattern

                    Sequential          Random Access
   Area Efficient   Prefetching         Pipelined
   Area Expensive   Burst-Coalesced     Burst-Coalesced Cached (loads only)

• Constant memory (see sketch below)
• DDR banking automatically handled
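A __constant buffer is just another kernel argument; a tiny illustrative sketch (names made up):

   __kernel void scale_shift(__constant float *coef,     // __constant: read-only data served
                             __global const float *x,    // from an on-chip constant cache
                             __global float *y) {
      int i = get_global_id(0);
      y[i] = coef[0] * x[i] + coef[1];
   }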

18-643-F20-L11-S26, James C. Hoe, CMU/ECE/CALCM, ©2020 Revisiting the example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
   for (int j = 0; j < WB; j += BWB) {
      for (int k = 0; k < WA; k += BWA) {
         for (int ii = 0; ii < BHA; ++ii) {
            for (int jj = 0; jj < BWB; ++jj) {
               #pragma unroll
               for (int kk = 0; kk < BWA; ++kk) {
                  C[(i + ii) * WC + j + jj] +=
                     A[(i + ii) * WA + k + kk] * B[(k + kk) * WB + j + jj];
               }
            }
         }
      }
   }
}

18-643-F20-L11-S27, James C. Hoe, CMU/ECE/CALCM, ©2020 Using Local Memory

Variant 1: jj loop fully unrolled

for (int ii = 0; ii < BHA; ++ii) {
   #pragma unroll
   for (int jj = 0; jj < BWB; ++jj) {
      #pragma unroll
      for (int kk = 0; kk < BWA; ++kk) {
         C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
      }
   }
}

Variant 2: jj loop not unrolled (#pragma unroll 1)

for (int ii = 0; ii < BHA; ++ii) {
   #pragma unroll 1
   for (int jj = 0; jj < BWB; ++jj) {
      #pragma unroll
      for (int kk = 0; kk < BWA; ++kk) {
         C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
      }
   }
}
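For completeness, a hedged sketch of how the tiles might be staged into A_local/B_local before the compute loops above; the copy loops are not from the slides, but the indexing mirrors the tiled global-memory version (note B is stored transposed, matching B_local[jj][kk]):

   // inside the i/j/k blocking loops, before the compute loops shown above
   for (int ii = 0; ii < BHA; ++ii) {
      #pragma unroll
      for (int kk = 0; kk < BWA; ++kk)
         A_local[ii][kk] = A[(i + ii) * WA + k + kk];
   }
   for (int kk = 0; kk < BWA; ++kk) {
      #pragma unroll
      for (int jj = 0; jj < BWB; ++jj)
         B_local[jj][kk] = B[(k + kk) * WB + j + jj];   // transposed into B_local[jj][kk]
   }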

18-643-F20-L11-S28, James C. Hoe, CMU/ECE/CALCM, ©2020 Also, Coding Style

• Express structure
  – describe models such as systolic arrays
  – instantiate explicit registers
  – timing
  – vectorize operations
• Not so readable as “program”

[Figure: systolic array of PEs fed by feeders FedA (rows) and FedB (columns)]

   float PE(struct vec_float_t_bool valA, vec_float_t valB, float *accum) {
      float oldAcc = __fpga_reg(accum[0]);
      float sum = (valA.c ? 0.0f : oldAcc);
      #pragma unroll
      for (int i = 0; i < VEC; i++) {
         sum += valA.data[i] * valB[i];
      }
      #pragma unroll
      for (int i = 0; i < ACCUM_SIZE - 1; i++) {
         accum[i] = accum[i + 1];
      }
      accum[ACCUM_SIZE - 1] = sum;
      return oldAcc;
   }

   while (1) {
      ...
      #pragma unroll
      for (int row = 0; row < ROWS; row++) {
         #pragma unroll
         for (int col = 0; col < COLS; col++) {
            // compute and store outputs in shift register
            float result = PE(fedA[row], fedB[col], accum[row][col]);
            if (fedA[row].c) {
               drain[col][row * ACCUM_SIZE] = result;
            }
            fedA[row].data = __fpga_reg(__fpga_reg(fedA[row].data));
            fedA[row].c = __fpga_reg(__fpga_reg(fedA[row].c));
            fedB[col] = __fpga_reg(__fpga_reg(fedB[col]));

18-643-F20-L11-S29, James C. Hoe, CMU/ECE/CALCM, ©2020 An excerpt from a 500-line device code Parting Thoughts

• OpenCL = platform model/API + SIMD language
  – kernel language forces regular, data-parallel computation
  – SIMD parallelism different on GPUs vs FPGAs
• FPGA OpenCL = platform model/API + memory system + “regular” HLS
  – single-work-item kernels unlock a different parallelism style
  – kernel-kernel channels alleviate DRAM and PCIe bottleneck for streaming use cases
  – develop/debug/analysis tools integral to appeal

GPUs great at SIMD; FPGAs good for more than SIMD

18-643-F20-L11-S30, James C. Hoe, CMU/ECE/CALCM, ©2020