The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Hamid Reza Zohouri*†1, Satoshi Matsuoka*‡
*Tokyo Institute of Technology, †Edgecortix Inc., Japan, ‡RIKEN Center for Computational Science (R-CCS)
{zohour.h.aa@m, matsu@is}.titech.ac.jp
1 This work was conducted at Tokyo Institute of Technology.

Abstract—Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in the literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's ability to maximize memory performance in their design. For some of these cases, we provide work-arounds to improve memory bandwidth efficiency; however, a general solution will require major changes in the memory controller itself.

Keywords—Intel FPGA, OpenCL, memory controller, memory bandwidth efficiency, blocking, padding

I. INTRODUCTION

Due to the ever-increasing need for more power-efficient means of accelerating computational workloads, FPGAs are now being considered as a potentially more-efficient alternative to GPUs [1]. With the programmability challenge of using FPGAs being significantly alleviated by the adoption of modern C/C++ and OpenCL-based High-Level Synthesis (HLS) techniques [2, 3] in the past few years, research groups, large IT companies and even cloud providers are more rapidly adopting these devices to accelerate different workloads. In line with this trend, a lot of effort has been made to optimize different applications on FPGAs. However, most such efforts have been focused on kernel-level optimizations such as optimizing the loop initiation interval, reducing area usage, and maximizing parallelism, while the critical effect of the external memory and its controller on performance is largely overlooked. It has been shown that due to the low external memory bandwidth and low byte-to-FLOP ratio of modern FPGAs, it is frequently not possible to fully utilize the compute performance of these devices for typical applications [1, 4]. Moreover, even this limited external memory bandwidth cannot always be efficiently utilized due to limitations in the memory controller [4, 5, 6]. In this work, we focus on analyzing the efficiency of the external memory controller on Intel FPGAs for OpenCL-based designs under different configurations. Our contributions are as follows:

• We create an OpenCL-based memory benchmark suite for Intel FPGAs with support for different memory access patterns and multiple compile-time and run-time-configurable parameters.

• Using our benchmark suite, we perform a comprehensive analysis of the memory controller performance and efficiency on Intel FPGAs with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of blocking.

• We outline one performance bug in Intel's compiler, and multiple deficiencies in the memory controller, leading to significant loss of memory performance for typical applications. In some of these cases, we provide work-arounds to improve the memory performance.

II. METHODOLOGY

A. Memory Benchmark Suite

For our evaluation, we develop an open-source benchmark suite called FPGAMemBench, available at https://github.com/zohourih/FPGAMemBench. The benchmark suite covers multiple memory access patterns that are regularly found in HPC applications, with different numbers of input and output arrays. Moreover, multiple compile-time and run-time-configurable parameters are provided that can affect the memory access pattern and/or performance.

Fig. 1. Access pattern for a) 1D, b) 1.5D, and c) 2.5D overlapped blocking

Three main classes of memory access patterns are supported by our benchmark suite. The first class, shown in Fig. 1 a), implements 1D overlapped blocking. In this case, 1D input and output arrays are used and, as shown by the red arrows in the figure, memory accesses happen in a block-based manner starting from the address −halo, continuing linearly in a vectorized fashion to bsize − halo, before the next block starts from address csize = bsize − 2 × halo. Here, bsize is the block size, which is a compile-time variable, and the blocks are overlapped by 2 × halo indexes, with halo being a run-time variable that controls the halo/overlap size. A halo size of zero results in simple sequential reading/writing with no overlapping. The second and third classes implement 1.5D and 2.5D overlapped spatial blocking, respectively, which are widely used in 2D and 3D stencil computation [5, 6, 7, 8, 9, 10]. For the 1.5D class, the x dimension is blocked and memory accesses are streamed row by row until the last index in the y dimension, before moving to the next block (Fig. 1 b)). Similarly, in the 2.5D class, the x and y dimensions are blocked, while the accesses are streamed plane by plane in the z dimension. In the 1.5D class, blocks are overlapped by 2 × halo columns, while in the 2.5D class, they are overlapped by 2 × halo rows and columns. All out-of-bound accesses (negative indexes at the beginning of each blocked dimension, and out-of-bound indexes in the rightmost and bottommost blocks if the input dimensions are not divisible by csize) are avoided in the kernels using conditional statements.

Each class of memory access patterns has both a standard and a channelized implementation. The former performs all memory operations in the same kernel, while the latter uses one kernel for reading and one kernel for writing, with data being passed between them using on-chip channels. For the 1D blocking class, each kernel file includes both an NDRange (NDR) and a Single Work-item (SWI) implementation of 5 kernels with different numbers of input and output arrays. As shown in Fig. 2, four global memory buffers are used in total and the supported array configurations are R1W0, R1W1, R2W1, R3W1, and R2W2; here, R and W refer to "read" and "write" respectively, and the numbers following them show the number of arrays that are read or written. To prevent the R1W0 kernel from being optimized out by the compiler, a one-time memory write is also added to that kernel. The R1W1 and R2W2 kernels implement one and two direct buffer copies, respectively, while the R2W1 and R3W1 kernels sum up the input values to generate the output. The 1.5D and 2.5D blocking classes support all the above array configurations except R1W0. All Single Work-item kernels use collapsed loops with the exit condition optimization from [4, 5, 6] for best timing and hence are constructed as a doubly-nested loop, with the fully-unrolled innermost loop having a trip count equal to the vector size, and the outer loop having an initiation interval of one. In the NDRange kernels, the work-groups have the same number of dimensions as the input, and memory access coalescing is performed using loop unrolling. Even though the same behavior could be achieved in this kernel model using the compiler's built-in SIMD attribute with no performance difference, it was avoided since it would have resulted in multiple channel call sites in the channelized kernels. The "float" data type is used in all our kernels.

Fig. 2. Overview of kernels and arrays
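To make the 1D access pattern and the collapsed-loop kernel structure described above concrete, the following is a minimal OpenCL C sketch of an R1W1 Single Work-item kernel; identifier names, the argument list, and the exact loop bookkeeping are illustrative and do not necessarily match the FPGAMemBench source.

#ifndef VEC
#define VEC 16      // vector size: elements accessed per outer-loop iteration (compile-time)
#endif
#ifndef BSIZE
#define BSIZE 1024  // block size (compile-time)
#endif

__kernel void r1w1_swi(__global const float* restrict in,
                       __global float* restrict out,
                       const int  num_blocks,   // number of overlapped blocks
                       const long size,         // number of valid indexes in the arrays
                       const int  halo,         // run-time halo/overlap size
                       const int  pad)          // run-time padding; start moves to pad - halo
{
    const int  csize      = BSIZE - 2 * halo;               // distance between consecutive block starts
    const long total_iter = (long)num_blocks * (BSIZE / VEC);
    long iter = 0;
    int  block = 0, chunk = 0;
    while (iter != total_iter)                               // collapsed doubly-nested loop
    {
        #pragma unroll                                        // vectorization by loop unrolling
        for (int i = 0; i < VEC; i++)
        {
            long gx = (long)block * csize - halo + chunk * VEC + i; // index covered by this block
            if (gx >= 0 && gx < size)                         // skip out-of-bound (halo) indexes
                out[pad + gx] = in[pad + gx];
        }
        chunk++;
        if (chunk == BSIZE / VEC) { chunk = 0; block++; }
        iter++;
    }
}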
To allow testing different alignment configurations, our benchmark suite supports array padding, where the start of all global arrays is padded by a run-time-configurable number of indexes (pad), moving the starting address to pad − halo. Moreover, row-wise padding is supported for the 1.5D and 2.5D classes, and column-wise padding is supported for the 2.5D class. As a final note, Intel FPGA SDK for OpenCL provides no user control over the final operating frequency of the design and automatically tries to maximize the operating frequency. This limitation creates a significant hurdle for benchmarking since it requires the effect of the variable operating frequency, which is not necessarily linear, to also be taken into account. To overcome this issue, we created a work-around to override the final operating frequency of the design to a user-defined value, the details of which are provided in our open-source benchmark repository. However, this work-around is usable only if the design's maximum-achievable operating frequency is above the user-defined value.

B. Hardware and Software Setup

For our evaluation, we use a Nallatech 385A board equipped with an Arria 10 GX 1150 FPGA and two banks of DDR4 memory running at 2133 MT/s. Each memory bank is connected to the FPGA via a 64-bit bus, providing a theoretical peak memory bandwidth of 34.128 GB/s (here and throughout, 1 GB = 10^9 bytes and 1 GiB = 2^30 bytes). The memory controller of the FPGA runs at 1/8 of the memory frequency (266.666 MHz). We also use an old Tesla K20X and a modern Tesla V100 SXM2 GPU for comparison. The peak memory bandwidth of these GPUs is 249.6 and 897.0 GB/s, respectively. All our machines use an updated version of CentOS 7, and we use GCC v7.4.0 for compiling the host codes and CUDA v10.0 for the GPUs. The latest BSP released by Nallatech for our board is only compatible with Quartus v17.1.0 and hence, we cannot use the latest version of Intel's compiler for our evaluation. However, since support for BSP backward compatibility with up to 2 lower major versions of Quartus was added in Intel FPGA SDK for OpenCL v18.1, it is possible to use the v17.1.0 BSP with all 18.1.x versions of the compiler. Hence, we use Intel FPGA SDK for OpenCL v18.1.2, Quartus v17.1.2, and Nallatech BSP v17.1.0 in our evaluation. Even though we encountered multiple cases where our kernels compiled using this setup hung during execution for no clear reason, we continued using this version so that our evaluation reflects the latest-possible version of the compiler, and worked around the hangs by changing the placement and routing seed. Since the issue is likely caused by the backward compatibility feature, we would not recommend this setup for a production environment. We employed test functions in every case to ensure the correctness of the kernel outputs.
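For reference, the quoted peak bandwidth and controller clock follow directly from these memory parameters (derived values, not additional measurements):

2 banks × 64 bits × 2133 MT/s = 2 × 8 B × 2.133 GT/s = 34.128 GB/s
2133.33 MT/s ÷ 8 = 266.666 MHz (memory controller and kernel clock)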
C. Benchmark Settings

For our FPGA evaluation, we use array sizes of 1 GiB, for a total memory usage of 4 GiB. All input arrays are initialized with random single-precision floating-point numbers. For the 1D and 1.5D kernels, bsize is set to 1024, and for 2.5D, it is set to 256×256. Moreover, in the NDRange kernels, the work-group size is set to bsize divided by the vector size. None of these parameters have a direct effect on performance. All our kernels are compiled with the manually-defined operating frequency of 266.666 MHz (same as the memory controller). If a kernel does not meet the timing requirement for this frequency, we change the compilation seed and/or increase the target operating frequency in the OpenCL compiler until the requirement is met. Every kernel is run 5 times and the average run time is measured. For timing measurement, we only measure the kernel run time (excluding PCI-E transfers) using the high-precision clock_gettime() function.

For GPU evaluation, we use the NDRange kernels of our benchmark suite with a vector size of 1 (vectorization by loop unrolling is not applicable to GPUs), and adjust the work-group size (equal to bsize here) for each GPU to maximize its performance. It should be noted that in practical applications, unless the number of registers per thread is a limitation, the work-group size on GPUs can be changed freely to maximize memory performance – something that does not apply to the vector size on FPGAs due to its area overhead.
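As an illustration of the kernel-only timing methodology described above (no PCI-E transfers included), a minimal host-side sketch follows; function names, queue setup, and error handling are simplified and not taken from the benchmark's host code.

#include <time.h>
#include <CL/cl.h>

static double elapsed_ms(struct timespec s, struct timespec e)
{
    return (e.tv_sec - s.tv_sec) * 1e3 + (e.tv_nsec - s.tv_nsec) * 1e-6;
}

double time_kernel(cl_command_queue q, cl_kernel k, size_t gsize, size_t lsize)
{
    struct timespec start, end;
    clFinish(q);                                   /* make sure prior transfers are done */
    clock_gettime(CLOCK_MONOTONIC, &start);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
    clFinish(q);                                   /* wait for the kernel only */
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ms(start, end);                 /* averaged over 5 runs in our methodology */
}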

III. EVALUATION

A. No Overlapping

As the first step, we use the 1D class with blocking disabled (halo = 0) to evaluate standard sequential accesses and determine the effect of basic parameters.

1) Effect of Vector Size: To evaluate the effect of vector size, we use the NDRange variation of our kernels with the default compiler setting that interleaves the global memory arrays between the external memory banks. Since the total size of the memory bus is 128 bits (2 × 64 bits) and both the kernel and the memory controller run at 1/8 of the external memory frequency, a total access size of 1024 bits (8 × 128 bits = 32 floats) per clock should be enough to saturate the memory bandwidth. Fig. 3 shows the performance of all array configurations with all power-of-two vector sizes from 1 to 32. Non-power-of-two vector sizes were avoided since the size of the memory ports generated by the compiler is always increased to the nearest bigger power of two and the extra bytes are masked out, resulting in significant waste of memory bandwidth.

Fig. 3. Performance scaling with vector size with interleaving enabled

For the R1W0 kernel, we expect performance to scale linearly with vector size and saturate the memory bandwidth at a vector size of 32. Based on the results, performance scales linearly up to a vector size of 16, but does not scale beyond that. This shows that even with interleaving, memory performance will not scale to more memory banks than there are global arrays in the kernel, regardless of vector size. For the R1W1 case, we expect memory performance to increase linearly until it saturates at a vector size of 16. However, in practice, performance scaling becomes non-linear above a vector size of 8; even though performance keeps increasing up to a vector size of 32, the memory bandwidth is still not saturated. For the other array configurations, performance scales linearly only up to a vector size of 4 and stops scaling at a vector size of 16, while the maximum achieved memory bandwidth is close to only half the peak bandwidth. Based on our results, the memory controller/interface performs poorly when more arrays are used than there are memory banks and the vector size is large. Our performance results for small vector sizes also show that no run-time coalescing is performed by the OpenCL memory interface on Intel FPGAs.

Fig. 4 shows the efficiency of the memory controller for the aforementioned benchmarks. The efficiency in this case is the ratio of measured to expected throughput; the latter is calculated by multiplying the number of arrays by the vector size by the operating frequency (266.666 MHz) by the data type size (4 bytes). In case the expected throughput is higher than the peak throughput, we use the peak throughput. Based on the results, the memory controller achieves ~100% efficiency with a vector size of 1 for all array configurations, while ~100% efficiency is achieved with a vector size of 2 for all configurations except R3W1. For larger vector sizes up to 8, only the simpler R1W0 and R1W1 configurations achieve full efficiency, and the controller efficiency starts falling drastically after that. Efficiency still increases with a vector size of 32 for some of the configurations since the expected throughput is capped by the peak theoretical bandwidth in these cases, while the actual increase in measured bandwidth is minimal (Fig. 3). Moreover, the area/power efficiency would decline in such cases since the increase in memory performance will be smaller than the increase in area usage.

Fig. 4. Memory controller efficiency with interleaving enabled
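As a worked example of this efficiency metric (derived from the formula above, not an additional measurement): for R2W1 with a vector size of 8, the expected throughput is 3 arrays × 8 floats × 4 bytes × 266.666 MHz ≈ 25.6 GB/s; with a vector size of 16 it would be ≈ 51.2 GB/s, which is capped at the 34.128 GB/s peak. The reported efficiency is the measured bandwidth divided by this expected value.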
2) Effect of Interleaving: Fig. 5 shows the performance of all array configurations with interleaving disabled. In this case, we experimentally found that the highest performance is achieved in every case if the four arrays are banked as shown by the hachured boxes in Fig. 2. With interleaving disabled, performance is either the same or higher in nearly every case compared to the same configuration with interleaving enabled. Here, the R1W1 configuration achieves near-linear performance scaling up to a vector size of 16 and nearly saturates the memory bandwidth. Since performance does not improve beyond this point, we believe the ~32.6 GB/s we measure for this configuration is the peak achievable memory bandwidth on this board, which puts the maximum-achievable memory bandwidth efficiency at 95.7%. Moreover, our results do not show any tangible performance benefit for using the default compiler flow with interleaving enabled, apart from allowing the allocation of buffers that are larger than the size of one memory bank on the FPGA board.

Fig. 5. Performance scaling with vector size with interleaving disabled

Fig. 6 shows the memory controller efficiency with interleaving disabled. The expectation is that for the double and triple-array kernels a vector size of 16, and for the quadruple-array ones a vector size of 8, should be enough to saturate the memory bandwidth. However, in practice, this is only achieved for the double-array case. For the R2W2 case, double the necessary vector size can achieve reasonable memory throughput (at the cost of lower area and power efficiency); however, for the remaining multi-array cases, even a vector size of 32 does not allow achieving more than 85% of the peak memory bandwidth, with the R2W1 case being the worst at ~70% bandwidth efficiency.

Fig. 6. Memory controller efficiency with interleaving disabled
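Disabling interleaving and manually partitioning the buffers between the banks is done partly at compile time and partly at buffer-allocation time. The sketch below shows the commonly documented mechanism in Intel FPGA SDK for OpenCL; the flag and header names follow the SDK's programming guide (older releases use ALTERA-suffixed equivalents), and the buffer-to-bank assignment shown is illustrative rather than the exact one used for Figs. 5 and 6.

/* Kernel compiled with burst-interleaving disabled for the default global memory:
 *   aoc -no-interleaving=default kernel.cl ...
 * Each buffer is then pinned to a DDR bank with the CL_CHANNEL_n_INTELFPGA flags
 * (declared in CL/cl_ext_intelfpga.h in recent SDK versions).                      */
#include <CL/cl.h>
#include <CL/cl_ext_intelfpga.h>

static cl_mem alloc_in_bank(cl_context ctx, size_t bytes, int bank, cl_int *err)
{
    cl_mem_flags flags = CL_MEM_READ_WRITE |
        (bank == 1 ? CL_CHANNEL_1_INTELFPGA : CL_CHANNEL_2_INTELFPGA);
    return clCreateBuffer(ctx, flags, bytes, NULL, err);
}

/* Illustrative R2W1 assignment: both inputs in bank 1, output in bank 2:
 *   cl_mem a = alloc_in_bank(ctx, bytes, 1, &err);
 *   cl_mem b = alloc_in_bank(ctx, bytes, 1, &err);
 *   cl_mem c = alloc_in_bank(ctx, bytes, 2, &err);                                */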
Our results show that very specific requirements need to be met to achieve the highest possible memory throughput on Intel FPGAs programmed using OpenCL. Specifically, the number of global arrays in the design must be the same as the number of memory banks, the vector size should be 16, and interleaving should be disabled with the arrays manually partitioned between the memory banks. Reasonable memory throughput can still be achieved if the number of arrays is larger but divisible by the number of banks. However, such cases will require larger vector sizes than should be necessary to achieve such performance, leading to lowered area and power efficiency. It is clear that following such strict patterns to achieve high bandwidth efficiency is not always feasible, which means the efficiency of the memory controller will become a major bottleneck in a large class of OpenCL-based designs on Intel FPGAs that cannot satisfy these requirements. As a work-around, if the number of arrays is larger than the number of memory banks but the size and access pattern of the arrays are exactly the same, it is possible to merge all the read arrays into one array-of-structs structure, and all the write arrays into another, resulting in an R1W1 configuration. Here, the same index of all the arrays can be read/written using one memory access, and the memory accesses can be further vectorized to achieve maximum memory bandwidth efficiency. However, this work-around requires that the number of read and write arrays be a power of two, or else the total size of the memory ports will not be a power of two after the arrays are merged, leading to wasted memory bandwidth due to bit masking. For other configurations (e.g. R3W1), no straightforward work-around can be considered except improving the memory controller itself.
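A minimal sketch of this merging work-around for a hypothetical R2W1 kernel whose two read arrays have identical size and access pattern (type and kernel names are illustrative, not from the benchmark suite); after merging, the kernel effectively becomes R1W1:

typedef struct { float a, b; } pair_t;  /* 8 bytes: power-of-two size, so no port padding */

__kernel void r2w1_merged(__global const pair_t* restrict in,  /* replaces two read buffers */
                          __global float* restrict out)
{
    const int i = get_global_id(0);
    const pair_t p = in[i];     /* one coalesced access reads index i of both former arrays */
    out[i] = p.a + p.b;         /* same computation as the original R2W1 kernel */
}
/* Vectorization by loop unrolling is omitted for brevity. */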

As a comparison, Fig. 7 shows the memory controller efficiency (measured performance relative to peak) of our kernels on NVIDIA Tesla K20X and V100 GPUs. We experimentally find work-group sizes of 128 and 512 to achieve the highest performance on these GPUs, respectively. On the old K20X GPU, with the exception of the single-array R1W0 kernel, the kernels achieve similar performance with a memory bandwidth efficiency of ~75%. The low performance of the R1W0 kernel in this case is likely due to thread divergence caused by the branching over the write operation that is added to prevent the kernel from being optimized out, rather than due to the memory controller efficiency. On the modern V100 GPU, however, reasonable memory bandwidth efficiency can be achieved for all kernels, with the minimum efficiency being 88%. These results show that memory performance on NVIDIA GPUs has much less variation across different array configurations compared to Intel FPGAs, pointing to a noticeably more efficient memory controller on these devices.

Fig. 7. Memory controller efficiency on GPUs

3) Effect of Kernel Programming Model: Fig. 8 shows the performance of the Single Work-item version of our kernels with interleaving disabled. Compared to the results of the NDRange kernels (Fig. 5), no visible difference in memory performance can be observed. We did not observe any noticeable difference with interleaving enabled, either. Hence, we conclude that with respect to memory performance, neither of the kernel programming models is preferred.

Fig. 8. Non-interleaved performance scaling with SWI kernels

4) Effect of Channels: Fig. 9 shows the performance of the channelized variation of our kernels with interleaving disabled for both kernel programming models. The channel depth in this case is set to 16, which is reasonable for the regular kernels studied here and is also small enough to avoid using Block RAMs. Our expectation is that the existence of channels in the kernels should not have any effect on memory performance since it merely increases the pipeline latency. However, our results paint a different picture. Comparing Fig. 9 (top) and Fig. 5 shows that in multiple cases there are minor performance improvements in the channelized NDRange kernel, especially for the triple and quadruple-array kernels where the amount of contention on the memory bus is higher. Similar minor improvements can also be seen with interleaving enabled (results are omitted for brevity). We speculate that the existence of the channels increases the pipeline's ability to absorb stalls from external memory reads and prevents them from propagating to the memory write operation, leading to minor performance improvements. The case of the Single Work-item kernel, however, is the opposite: the performance of the channelized kernels is lower in most cases, with the difference being as large as 20% for the R1W1 case (Fig. 9 bottom vs. Fig. 8). A similar performance regression is also seen with interleaving enabled. Increasing the channel depth up to an unreasonable value of 512 had no effect on performance, eliminating this parameter as the possible cause of the problem. Older versions of the compiler, both v17.1.0 and v16.1.2, exhibited the same problem, but the performance loss was largest with v17.1.0 and smallest with v16.1.2. We also reproduced this performance regression on our Stratix V board using different versions of the compiler. Our results point to a possible flaw in the compiler which affects a large range of applications, from simple single-input single-output applications that use on-chip channels to simplify kernel design, to highly-optimized designs relying on replicated autorun kernels to implement rings and systolic arrays of processing elements. In fact, we encountered this exact issue in our previous work [4, 5, 6], but we could not find the cause and hence avoided using the v17.0+ versions of the compiler due to lower performance. We have reported this issue to Intel, and they have been able to reproduce it. However, a fix, if any, will likely never be available on boards with Arria 10 and older FPGAs due to the lack of BSP updates. As a work-around, for designs that separate memory access kernels from compute (similar to [5, 6]), the memory kernels can be converted to NDRange to avoid this problem.

Fig. 9. Channelized kernel performance; top: NDR, bottom: SWI
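For reference, the channelized variant splits reading and writing into separate kernels connected by an on-chip channel. A minimal Single Work-item sketch using the Intel channels extension is shown below; the kernel names and the scalar (non-vectorized) data path are illustrative, not the benchmark's exact code.

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float ch __attribute__((depth(16)));     /* channel depth of 16, as used above */

__kernel void mem_read(__global const float* restrict in, const long size)
{
    for (long i = 0; i < size; i++)
        write_channel_intel(ch, in[i]);          /* load from memory, push into channel */
}

__kernel void mem_write(__global float* restrict out, const long size)
{
    for (long i = 0; i < size; i++)
        out[i] = read_channel_intel(ch);         /* pop from channel, store to memory */
}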

5) Effect of Operating Frequency: Fig. 10 shows the performance of our evaluated kernels with multiple different operating frequencies, including one above the memory controller frequency (300 MHz), with interleaving disabled; the red line marks the memory controller frequency. We chose a vector size of 8 for this experiment to prevent the effect of operating frequency from being hidden by saturation of the memory bandwidth. Based on the results, performance of all the kernels scales linearly with operating frequency, until a specific threshold is reached above which performance starts decreasing with higher frequency due to increased contention on the memory bus. For the single and double-array kernels, this threshold is not reached up to 300 MHz since the achieved memory bandwidth for these kernels with a vector size of 8 is still far from the peak memory bandwidth. For the rest of the configurations, however, performance saturates between 200 and 250 MHz and decreases after that. Similar effects were observed with interleaving enabled or with other vector sizes using either of the two kernel programming models. As such, seed and Fmax sweeping can be used to increase the operating frequency, and consequently the memory performance, with small resource overhead [4, 5, 6]. On the other hand, for multi-array kernels with wide vectors where a higher operating frequency can decrease performance, such as the case discussed in Section 4.3.1.6 of [4], our work-around to override the operating frequency can be used to increase memory performance by reducing the operating frequency instead.

Fig. 10. Performance scaling with operating frequency for vector size of 8
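As an illustration of such sweeping (not the exact commands used for our results), the target frequency and the placement-and-routing seed are passed to the offline compiler on the command line, e.g.:

aoc -fmax=266.666 -seed=1 benchmark.cl -o benchmark_s1.aocx
aoc -fmax=280 -seed=7 benchmark.cl -o benchmark_s7.aocx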

6) Effect of Alignment: To determine the effect of memory access alignment, we utilize basic array padding, which adds a fixed offset to the memory addresses. Fig. 11 shows the performance of the NDRange R1W1 kernel with interleaving disabled and padding sizes ranging from 0 (no padding) to 32 floats. Each line in this figure represents a specific vector length.

Fig. 11. R1W1 performance with padding for different vector sizes

Our results show that the memory controller on the Intel Arria 10 FPGA is not capable of performing any memory access realignment at all, losing up to ~50% of its performance due to unaligned memory accesses. We observed the same patterns regardless of kernel programming model, compiler version, interleaving state, and array configuration. The only difference in these cases is the amount of performance that is lost; e.g., if the memory controller is oversubscribed using larger vector sizes than are necessary to saturate the memory bandwidth, the amount of performance lost due to memory access misalignment gets smaller. We also observed the same behavior on our Stratix V board, and expect that the same will also apply to Stratix 10 GX FPGAs with DDR memory. We previously suggested that a memory address alignment of 512 bits is required to achieve maximum memory bandwidth on Intel FPGAs [5, 6]. However, our results here show that the required alignment is not fixed and is instead equal to the vector size, making the alignment requirement stricter as the vector size increases. With all these, a vector size of 32 (and possibly higher values) does not seem to be affected by memory access alignment; albeit, as discussed before, using such large vector sizes is not area efficient and might not even be feasible in many cases due to limited FPGA resources. In comparison, [11] reports that even though the same behavior was also exhibited by early Nvidia Tesla cards released over 10 years ago, it was largely addressed with the release of the Tesla C2050. The effect of padding on our GPUs is shown in Fig. 12. Performance variations relative to padding can be observed on both GPUs, with maximum performance achieved at padding sizes that are a multiple of 8, pointing to a required alignment size of 256 bits. Here, the Tesla V100 GPU, which uses HBM memory, exhibits higher performance variation than the K20X, which uses GDDR memory, but the maximum loss of performance with misalignment on this GPU is only ~13% – still far lower than on our FPGA.

Fig. 12. Effect of padding on GPUs; top: Tesla K20X, bottom: Tesla V100
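To make the alignment requirement concrete (numbers derived from the discussion above): with a vector size of 16, each coalesced access covers 16 × 4 B = 64 B (512 bits), so full performance requires 512-bit-aligned addresses; a padding of, e.g., 8 floats shifts every access by 32 B, leaving it only 256-bit-aligned, which the memory controller cannot compensate for. With a vector size of 4, the required alignment is only 4 × 4 B = 16 B (128 bits), so the same padding causes no misalignment.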

B. 1D Blocking

In this section, we enable overlapping and analyze its effect alongside other parameters.

1) Effect of Overlapping: Fig. 13 shows the effect of halo size (different bar colors) on the R1W1 kernel for different vector sizes (different clusters) with interleaving disabled. The vector size of 1 is skipped here since performance in that case is not affected by alignment. Moreover, we limited the halo sizes to powers of two since the alignment characteristics for odd halo sizes are the same as for a halo size of 1, and for even values, performance only changes depending on the biggest power of two that the halo size is divisible by; e.g., performance for halo sizes of 6 and 14 is the same as for a halo size of 2, 12 and 20 are the same as 4, and so on. The array size for this experiment has been adjusted for every halo size to the nearest size to 1 GiB where the number of array indexes is a multiple of csize, to avoid skewing the bandwidth measurement by out-of-bound processing in the last blocks (Fig. 1). Redundant memory accesses are also counted in the bandwidth calculation here.

Fig. 13. Memory controller efficiency with overlapping for R1W1 kernel

Based on the results, high performance efficiency can be achieved as long as the halo size is a multiple of the vector size, but if this condition is not met, performance can suffer by up to 50%. Here, if the halo size is not a multiple of the vector size, the starting address of the first block and every subsequent access inside the block will not be aligned (Fig. 1 a)). Moreover, since the block size in our case is a power of two, the distance between the starting addresses of consecutive blocks, equal to csize, will not be a power of two, leading to the starting addresses of subsequent blocks also being unaligned. Hence, all accesses will be unaligned, similar to the case of basic padding discussed in Section III.A.6). Similar performance patterns were also observed with the other kernel configurations. The only anomaly we encountered was a vector size of 8 for the R3W1 kernel, where the unaligned cases achieved slightly higher performance than the aligned ones.

Fig. 14 shows the memory bandwidth efficiency (relative to the theoretical peak) on both of our GPUs for the R1W1 kernel with varying halo sizes. The figure clearly shows that on the GPUs, the performance is not only a lot more stable than on our FPGA, but the cache hierarchy on the GPUs also absorbs the redundant memory accesses and increases the performance beyond what would be possible with pure reliance on the external memory. If we only consider the non-redundant accesses, an interesting pattern emerges. On the K20X GPU, performance variation is similar to that of Fig. 12 for small halo sizes. But as the halo size increases, the cache hit rate decreases and hence, performance does not increase relative to the increase in redundancy. On the V100, however, the cache behaves more efficiently and the non-redundant performance closely follows the pattern of Fig. 12, where maximum efficiency is achieved if all accesses are at least 256-bit-aligned. It should be noted that since we use a smaller bsize (= work-group size) on the K20X, the amount of redundancy on this GPU is higher than on the V100 for a fixed halo size.

Fig. 14. Effect of overlapping on GPU performance for R1W1 kernel

2) Effect of Padding: In our previous work [4, 5, 6] we showed that padding can be used to improve memory access alignment when overlapped blocking is used. Fig. 15 shows the effect of different padding sizes on memory bandwidth efficiency with different halo sizes for the R1W1 kernel with a vector size of 16 and interleaving disabled.

Fig. 15. Effect of padding on overlapped blocking with different halo sizes

A vector size of 16 requires a 512-bit alignment for maximum efficiency (Section III.A.6)). This requires the starting address of the first block to be 512-bit-aligned, and the distance between consecutive blocks (csize) to be divisible by 512 bits so that all accesses are 512-bit-aligned. The first restriction requires a halo size divisible by 512 bits, and the second one requires divisibility by 256 bits. As Fig. 15 shows, when the halo size is divisible by the vector size, maximum performance can be achieved without padding. For a halo size of 8 (half the vector size), the second requirement for full alignment is already met, and the first requirement can also be met by padding the array with half the vector size, leading to maximum performance. For other halo sizes, it is impossible to meet the requirements for full alignment with padding; though, using a padding size equal to the halo size can still improve performance since it will move the starting point of some of the blocks to an aligned address. This pattern exactly follows what we previously reported with overlapped blocking for stencil computation on Intel FPGAs [4, 5, 6].
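As a concrete illustration of these two requirements (values derived from the parameters above, not additional measurements): with bsize = 1024 and halo = 4, csize = 1024 − 2 × 4 = 1016 floats = 4064 bytes, which is not a multiple of 64 bytes (512 bits), so even if padding aligns the first block, the starting addresses of the subsequent blocks drift out of alignment. With halo = 8, csize = 1008 floats = 4032 bytes = 63 × 64 bytes, and a padding of 8 floats moves the start of the first block from index −8 to index 0, so every block start, and hence every access, is 512-bit-aligned, matching the full-efficiency cases in Fig. 15.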

Fig. 16 shows the effect of padding for the R1W1 kernel on our GPUs for varying halo sizes. For brevity, we limited the halo sizes to powers of two up to 8, since higher values can achieve maximum memory bandwidth efficiency without padding (Fig. 14), and limited padding to two values: one the same size as the halo size, which achieves maximum memory bandwidth efficiency, and the other 0, as a representative of every other padding size. Similar to the case of the FPGA, maximum memory bandwidth efficiency can be achieved on our GPUs with padding if the memory accesses already have half of the required alignment without padding (128 bits for the GPUs), and improvements are also observed in the other cases. The stark contrast between our GPUs and the FPGA is that on the Tesla K20X and V100, even without the help of padding, the maximum performance loss due to memory access misalignment is 5% and 13%, respectively, while on the FPGA, even with the help of padding, the performance loss can reach as high as 45%. Moreover, on the GPUs, the required alignment size is fixed at 256 bits, while on the FPGA it can reach the more restrictive value of 512 bits depending on the vector size.

Fig. 16. Effect of padding for overlapped blocking on GPUs

3) Effect of Cache: Intel FPGA SDK for OpenCL Offline Compiler will generate a cache for burst-coalesced global memory accesses in a kernel "when the memory access pattern is data-dependent or appears to be repetitive" [2]. Testing with different versions of the compiler showed that this cache is generated for the Single Work-item variation of our kernels up to v17.0.2, but never for NDRange, while newer versions do not generate the cache at all despite the obvious overlapping of memory accesses. Using v16.1.2, we compared the performance of our kernels with the cache enabled and disabled and, despite the simple pattern of overlapping and the large Block RAM overhead of the cache, we observed no tangible performance difference, which points to a critical disadvantage of the FPGA compared to the GPUs.

C. 1.5D and 2.5D Blocking

Fig. 17 shows the FPGA bandwidth efficiency for 1.5D and 2.5D blocking using the R1W1 kernel with a vector size of 16, interleaving disabled, and varying halo (different clusters) and padding sizes (different bar colors). The halo size of zero in these figures represents the bandwidth efficiency with non-overlapped blocking as used in applications such as general matrix multiplication. For the 1.5D blocking case, we initially chose the x dimension to be the nearest multiple of csize to 16384; however, we observed an unexplainable performance slow-down for vector sizes of 16 and 32 which only happened with specific multiples of bsize above 8192 (and their nearest multiples of csize), with no clear pattern. We suspect this could be yet another performance bug in the compiler/memory interface. In the end, we chose a size of 18432 for the x dimension to avoid this issue, and 16384 for the y dimension, resulting in a buffer size of ~1152 MiB. For the 2.5D blocking case, we fixed the size of the z dimension to 256, and chose both x and y to be equal to the nearest multiple of csize to 1024.
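For reference, the quoted 1.5D buffer size follows directly from these dimensions: 18432 × 16384 single-precision values × 4 bytes = 1,207,959,552 bytes = 1152 MiB.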

Fig. 17. Overlapped blocking efficiency; top: 1.5D, bottom: 2.5D

Both of these figures closely resemble the patterns seen in Fig. 13 and Fig. 15, showing that the memory access pattern has little effect on bandwidth efficiency as long as the alignment characteristics remain unchanged. However, a few anomalies are observed here. For 1.5D blocking with a halo size of 0, padding sizes that are a multiple of 16 achieve 10% lower performance than no padding, which is counter-intuitive since they result in the same alignment; though this anomaly is not a source of concern since padding is not required in this case. Efficiency is also ~5% lower than the achievable maximum value when the halo size is 16 and enough padding is used to allow fully-aligned accesses, while this issue does not exist when the halo size is 32. These anomalies are not observed with other vector sizes and their cause is unclear. For the 2.5D case, we observe that performance gradually decreases as the halo size gets larger, which also happens with other vector sizes. Our benchmark avoids the out-of-bound rows and columns on the sides of the grid when overlapping is used (Fig. 1), and indexes falling in these areas are not counted for measuring memory bandwidth. However, since the compiler generates the pipeline in a way that the minimum latency of global memory accesses is absorbed without stalling the pipeline, these memory accesses, even though they are skipped, still incur that minimum latency and hence have the same effect on performance as a valid memory access, minus the possibility of stalling the pipeline. For 2.5D blocking, since the difference between bsize and halo is smaller, as the halo gets larger, the latency overhead of these uncounted accesses increases relative to the total run time, resulting in a minor reduction in measured performance. The same effect can also be seen with 1D and 1.5D blocking and very large halo sizes. It is worth noting that in 1.5D and 2.5D blocking, memory access alignment also depends on the size of the x (for both) and y (for 2.5D) dimensions. For cases where these values are not a multiple of the vector size, row-wise and column-wise padding can be used to correct the alignment.

Our results here prove that the model accuracy values we reported in our previous work on stencil computation using overlapped blocking [4, 5, 6] were indeed a reflection of the memory controller efficiency. The ~85% model accuracy reported in [4, 5, 6] for 2D stencils was the result of the compiler bug we reported in Section III.A.4), and converting the memory read/write kernels to NDRange allowed us to achieve ~100% of the predicted performance. Moreover, the 55-60% model accuracy for the Diffusion 3D benchmarks with a vector size of 16 in the same works is close to the memory bandwidth efficiency values we reported in Fig. 17 for a halo and padding size of 4, which has the same alignment characteristics as the configuration of those benchmarks. The small difference between these values is due to the effect of non-linear performance scaling with operating frequencies above the memory controller frequency at large vector sizes (Section III.A.5)), and the aforementioned latency of out-of-bound memory accesses which are skipped. The performance bug caused by on-chip channels (Section III.A.4)) also applies to these cases, but for the configurations reported in [4, 5, 6], the performance loss caused by memory access misalignment hides the effect of this bug and using NDRange for the memory kernels does not improve the performance anymore. One important point to note is that in many applications that rely on overlapped blocking, the halo size is not a configurable parameter. Hence, if the halo size is a value that does not allow maximum memory bandwidth efficiency with padding, there might be no straightforward way to improve memory performance except by modifying the memory controller itself to buffer and realign the memory accesses at run-time.

As a final comparison, Fig. 18 shows the performance of the R1W1 kernel on both of our GPUs with different halo and padding sizes. The block size for the 1.5D blocking case is 1024, and for the 2.5D blocking case, even though slightly higher performance could be achieved using a block size of 16×16, we used 32×32 so that we could evaluate halo sizes up to 8 (the halo must be smaller than bsize/2). Here, the variation of performance with respect to alignment is similar to Fig. 16, with the main difference being lower performance improvement from caching on both GPUs, due to the more complex access pattern and lower cache hit rate, especially with 2.5D blocking.

Fig. 18. Overlapped blocking efficiency on GPUs; top: 1.5D, bottom: 2.5D

IV. RELATED WORK

The most well-known memory benchmark in HPC is STREAM [12]. BabelStream [13] is a popular implementation of this benchmark with support for different programming languages and devices. However, it does not support FPGAs and only provides a small subset of the functionality of our benchmark suite. Our benchmark suite is specifically designed for FPGAs and covers a wider range of array configurations and memory access patterns with control over FPGA-specific parameters such as operating frequency, vector size, interleaving, etc. BabelStream covers more computation patterns than our suite, but the pattern of computation only affects pipeline latency on FPGAs, which has a negligible effect on memory performance as long as the input size is large enough to saturate the pipeline. In [14], the authors present an implementation of the STREAM benchmark for FPGAs using OpenCL with support for continuous and strided memory patterns, and evaluate it with different buffer and vector sizes on Xilinx and Intel FPGAs. This work uses an early version of the Intel compiler, ignores the important effect of operating frequency variability, and does not explore the source of performance variations. We perform a more comprehensive analysis with different array, vector size, and alignment configurations, and discuss the performance trends in depth. Another port of the STREAM benchmark for FPGAs is also available at [15], and results for a Stratix 10 board with four DDR4 memory banks are reported. Their results confirm that also on Stratix 10, interleaving does not allow performance scaling to more memory banks than there are global buffers in the kernel. Moreover, they achieve a maximum bandwidth efficiency of ~85% with interleaving disabled (relative to the number of banks used), which is ~10% lower than Arria 10.

V. CONCLUSION

In this work we presented a comprehensive analysis of the memory controller and memory bandwidth efficiency of Intel FPGAs using OpenCL. Our results showed that to achieve maximum memory performance, very strict requirements need to be met which only apply to a small portion of real-world applications; in typical cases, it might not be possible to achieve more than 70% of the peak memory bandwidth. Moreover, we showed that the memory controller is very sensitive to memory access alignment and up to 50% of the memory bandwidth might be lost in typical applications that rely on overlapped blocking due to unaligned accesses. Even though array padding can improve the performance in such cases, it is not always enough to achieve maximum performance, and major improvements in the memory controller itself will be required for Intel FPGAs to be competitive against GPUs in memory bandwidth efficiency.

ACKNOWLEDGEMENT

We would like to thank Andrei Vassiliev from Scientific Concepts International Corporation for pointing us in the right direction for overriding the operating frequency of OpenCL kernels. We would also like to thank Intel for donating FPGA software licenses through their university program. This work was performed under the auspices of the Real-world Big-Data Computation Open Innovation Laboratory, Japan.

REFERENCES

[1] H. R. Zohouri et al., "Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, 2016.
[2] Intel Corp., "Intel FPGA SDK for OpenCL Pro Edition: Best Practices Guide," 1 July 2019. [Online]. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf
[3] Xilinx, Inc., "UG1023 (v2019.1): SDAccel Environment User Guide," 22 May 2019. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug1023-sdaccel-user-guide.pdf
[4] H. R. Zohouri, "High Performance Computing with FPGAs and OpenCL," Ph.D. dissertation, Tokyo Institute of Technology, 2018. [Online]. Available: https://arxiv.org/abs/1810.09773
[5] H. R. Zohouri, A. Podobas and S. Matsuoka, "Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 2018.
[6] H. R. Zohouri, A. Podobas and S. Matsuoka, "High-Performance High-Order Stencil Computation on FPGAs Using OpenCL," in IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 2018.
[7] A. Nguyen et al., "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," in ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, 2010.
[8] H. M. Waidyasooriya and M. Hariyama, "Multi-FPGA Accelerator Architecture for Stencil Computation Exploiting Spacial and Temporal Locality," IEEE Access, vol. 7, pp. 53188-53201, 2019.
[9] Y. Chi, J. Cong, P. Wei and P. Zhou, "SODA: Stencil with Optimized Dataflow Architecture," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, CA, USA, 2018.
[10] P. S. Rawat et al., "Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations," Proceedings of the IEEE, vol. 106, pp. 1902-1920, Nov. 2018.
[11] M. Harris, "How to Access Global Memory Efficiently in CUDA C/C++ Kernels," NVIDIA Corporation, 7 Jan. 2013. [Online]. Available: https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/
[12] J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, vol. 2, 1995.
[13] T. Deakin, J. Price, M. Martineau and S. McIntosh-Smith, "GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models," in High Performance Computing, Cham, 2016.
[14] S. W. Nabi and W. Vanderbauwhede, "MP-STREAM: A Memory Performance Benchmark for Design Space Exploration on Heterogeneous HPC Devices," in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018.
[15] [Online]. Available: https://github.com/pc2/stream-fpga