Improving the Performance of Medical Imaging Applications Using SYCL
Total Page:16
File Type:pdf, Size:1020Kb
ANL/ALCF-19/4 Improving the Performance of Medical Imaging Applications using SYCL Argonne Leadership Computing Facility About Argonne National Laboratory Argonne is a U.S. Department of Energy laboratory managed by UChicago Argonne, LLC under contract DE-AC02-06CH11357. The Laboratory’s main facility is outside Chicago, at 9700 South Cass Avenue, Argonne, Illinois 60439. For information about Argonne and its pioneering science and technology programs, see www.anl.gov. DOCUMENT AVAILABILITY Online Access: U.S. Department of Energy (DOE) reports produced after 1991 and a growing number of pre-1991 documents are available free via DOE’s SciTech Connect (http://www.osti.gov/scitech/) Reports not in digital format may be purchased by the public from the National Technical Information Service (NTIS): U.S. Department of Commerce National Technical Information Service 5301 Shawnee Rd Alexandria, VA 22312 www.ntis.gov Phone: (800) 553-NTIS (6847) or (703) 605-6000 Fax: (703) 605-6900 Email: [email protected] Reports not in digital format are available to DOE and DOE contractors from the Office of Scientific and Technical Information (OSTI): U.S. Department of Energy Office of Scientific and Technical Information P.O. Box 62 Oak Ridge, TN 37831-0062 www.osti.gov Phone: (865) 576-8401 Fax: (865) 576-5728 Email: [email protected] Disclaimer This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor UChicago Argonne, LLC, nor any of their employees or officers, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of document authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof, Argonne National Laboratory, or UChicago Argonne, LLC. ANL/ALCF-19/4 Improving the Performance of Medical Imaging Applications using SYCL prepared by Zheming Jin Argonne Leadership Computing Facility, Argonne National Laboratory December 3, 2019 Improving the Performance of Medical Imaging Applications using SYCL I. INTRODUCTION II. BACKGROUND A. SYCL In this report, we are interested in applying the SYCL programming model to medical imaging applications for a C++ AMP, CUDA, Thrust C++ are representative single- study on performance portability and programming source C++ programming models for accelerators [3]. Such productivity. The SYCL standard specifies a cross-platform languages are easy to use and can be type-checked as abstraction layer that enables programming of heterogeneous everything sits in a single source file. They facilitate offline computing systems using standard C++. As opposed to the compilation so that the binary can be checked at compile Open Computing Language (OpenCL) programming model, time. A SYCL program, based on a single-source C++ model in which host and device code are generally written in as shown in Figure 1, can be compiled for a host while different programming languages [1], SYCL can combine kernel(s) are extracted from the source and compiled for a host and device code for an application in a type-safe way to device. A SYCL device compiler parses a SYCL application improve development productivity and performance and generates intermediate representations (IR). A standard portability. host compiler parses the same application to generate native Rodinia is a widely used open-source benchmark suite host code. The SYCL runtime will load IR at runtime, for heterogeneous computing. We choose two medical enabling other compilers to parse it into native device code. imaging applications (Heart Wall and Particle Filter) in the Hence, people can continue to use existing toolchains for a Rodinia benchmark suite [ 2 ], migrate the OpenCL host platform, and choose preferred device compilers for a implementations of the applications to the SYCL target platform. implementations, and evaluate their performance and The design of SYCL allows for the combination of the productivity on Intel® microprocessors that contains a central performance and portability features of OpenCL and the processing unit (CPU) and an integrated graphics processing flexibility of using high-level C++ abstractions. Most of the unit (GPU). Currently, the maturing SYCL compilers, based abstraction features of C++, such as templates, classes, and on a conformant implementation of the SYCL 1.2.1 Khronos operator overloading, are available for a kernel function in specification, are optimized for Intel® computing platforms. SYCL. Some C++ language features, such as virtual The experimental results are promising in terms of the functions, virtual inheritance, throwing/catching exceptions, raw performance and productivity. Although the SYCL and run-time type-information, are not allowed inside kernels implementation of the Heart Wall application does not due to the capabilities of the underlying OpenCL standard. execute successfully on a CPU, it is on average 15% faster These features are available outside kernel scope. than the OpenCL implementation on an Intel® Iris™ Pro A SYCL application is logically structured in three integrated GPU. For the Particle Filter application, the scopes: application scope, command-group scope, and kernel performance difference between the SYCL and OpenCL scope. The kernel scope specifies a single-kernel function implementations is less than 1% on the GPUs for most cases, that will be executed on a device after compilation. The but the SYCL implementation is on average 4.5X faster on command-group scope specifies a unit of work that will an Intel® Xeon® four-core CPU. Arguably, we use lines of comprise of a kernel function and buffer accessors. The code as a way to measure programming productivity in software. The SYCL programs reduce the lines of code of the OpenCL programs by 52% and 38% for the Heart Wall and Particle Filter, respectively. The results indicate that SYCL is a promising programming model for heterogeneous computing with the maturing compilers. We organize the remainder of the report as follows. Section II introduces the SYCL programming model, compares the major differences between an OpenCL program and a SYCL program, and describes the characteristics of the two applications. Section III describes the SYCL programming model in more details, and shows the SYCL implementation of a kernel function in the Particle Filter as an example. In Section IV, we evaluate and analyze the performance of the applications on the CPUs and GPUs. Section V concludes the report. Figure 1. SYCL is a single-source programming model TABLE I. MAPPING FROM OPENCL TO SYCL TABLE II. CHARACTERISTICS OF THE TWO OPENCL APPLICATIONS Step OpenCL Program SYCL Program App. LOC LOC Number of Max number of 1 Platform query (host) (kernel) kernels kernel arguments 2 Device query of a platform Device selector class HW 775 1043 1 34 3 Create context for devices PF 720 149 4 20 4 Create command queue for context Queue class 5 Create memory objects Buffer class 6 Create program object and CUDA programming models for a heterogeneous 7 Build a program Lambda expressions computing device. 8 Create kernel(s) Table II summarizes the characteristics of the two 9 Set kernel arguments applications in terms of the OpenCL host and kernel 10 Submit a SYCL kernel Enqueue a kernel object for execution programs. For the PF application, we focus on the single- to a queue precision floating-point version. We measure LOC of the 11 Transfer data from device to host Implicit via accessors OpenCL host and kernel programs using the utility “cloc” 12 Event handling Event class 13 Release resources Implicit via destructor [6], and count the number of kernels and the maximum number of kernel arguments of a single or multiple kernel(s). While comments and blank lines are not included, and multiple lines are joined into one line for the OpenCL built- application scope specifies all other codes outside of a in function calls, the LOC of both applications indicate that command-group scope. A SYCL kernel function may be developing the host programs is tedious and error-prone. In defined by the body of a lambda function, by a function the original OpenCL and CUDA applications, the authors object or by the binary generated from an OpenCL kernel reduced the significant GPU overhead of data transfer and string. Although an OpenCL kernel is interoperable in the kernel launch for the kernel of the HW application by SYCL programming model, in this report we focus on the combining multiple kernels into one kernel call. Hence, the implementation of kernel functions using lambda functions. kernel size of the HW application is much larger than the Table I lists the steps of creating an OpenCL application sizes of the four kernels in the PF application. On the other and their corresponding steps in SYCL. The first three steps hand, both kernels require a great number of kernel in OpenCL are reduced to the instantiation of a device arguments. They reduce programming productivity as it selector class in SYCL. A selector searches a device of a makes debugging difficult when the order of setting kernel user’s provided preference (e.g., GPU) at runtime. The arguments in a host program does not strictly match that in a SYCL queue class is an out-of-order queue that encapsulates kernel function. a queue for scheduling kernels on a device. A kernel function in SYCL can be invoked as a lambda function. The function III. IMPLEMENTATIONS is grouped into a command group object, and then it is Overall, it is relatively straightforward to port an submitted to execution via command queue.