Analysis of GPGPU Programs for Data-Race and Barrier Divergence

Analysis of GPGPU Programs for Data-race and Barrier Divergence Santonu Sarkar1, Prateek Kandelwal2, Soumyadip Bandyopadhyay3 and Holger Giese3 1ABB Corporate Research, India 2MathWorks, India 3Hasso Plattner Institute fur¨ Digital Engineering gGmbH, Germany Keywords: Verification, SMT Solver, CUDA, GPGPU, Data Races, Barrier Divergence. Abstract: Todays business and scientific applications have a high computing demand due to the increasing data size and the demand for responsiveness. Many such applications have a high degree of parallelism and GPGPUs emerge as a fit candidate for the demand. GPGPUs can offer an extremely high degree of data parallelism owing to its architecture that has many computing cores. However, unless the programs written to exploit the architecture are correct, the potential gain in performance cannot be achieved. In this paper, we focus on the two important properties of the programs written for GPGPUs, namely i) the data-race conditions and ii) the barrier divergence. We present a technique to identify the existence of these properties in a CUDA program using a static property verification method. The proposed approach can be utilized in tandem with normal application development process to help the programmer to remove the bugs that can have an impact on the performance and improve the safety of a CUDA program. 1 INTRODUCTION ans that the program will never end-up in an errone- ous state, or will never stop functioning in an arbitrary With the order of magnitude increase in computing manner, is a well-known and critical property that an demand in the business and scientific applications, de- operational system should exhibit (Lamport, 1977). velopers are more and more inclined towards massi- A parallel program can reach an unsafe state when vely data-parallel computing platform like a general- multiple threads enter into a race condition, conflict purpose graphics processing unit (GPGPU) (Nickolls with each other for data access enters into a dead- and Dally, 2010). GPGPUs are programmable co- lock situation and so on. Such a situation never arises processors where designers can develop applications in a sequential program. Thus, unlike a sequential in CUDA if they are using NVIDIA GPGPU. CUDA software, detecting a violation of the safety property (Compute Unified Device Architecture) is an exten- and identifying its cause in a parallel program is far sion to standard ANSI C through additional key- more challenging. Petri net based verification of pa- words. A CUDA program follows a Single-Program rallel programs has been reported in (Bandyopadhyay Multiple- Data (SPMD) model, where a piece of code, et al., 2017; Bandyopadhyay et al., 2018), however, it called kernel, is executed by hundreds of threads de- checks only the computational equiavlence. pending on the device’s computing capability, in a One of the well-known techniques for detecting lock-step fashion (where all the participating thre- such an unsafe operation in a program is the pro- ads execute the same instruction). The developer can perty verification method. This approach models po- bundle these threads into a set of thread-blocks, where tential causes for a program safety violation such as a set of threads in a block can share their data and data-races, or deadlock as properties of the program. synchronize their actions through a built-in barrier The verification program analyzes these properties, function called __syncthreads(). extracted from the code, for their satisfiability. While While sharing of data among threads is essential this technique has been proposed for a long time and for any reasonably complex parallel application, it can used in the context of sequential programs, recently lead the program to a unsafe state where the behavior it has gained renewed interest with the resurgence of of the program is unpredictable and incorrect. The GPGPUs and its programming model. Analysis of safety property of a program, which informally me- a parallel program differs from the analysis of a se- 460 Sarkar, S., Kandelwal, P., Bandyopadhyay, S. and Giese, H. Analysis of GPGPU Programs for Data-race and Barrier Divergence. DOI: 10.5220/0006834904600471 In Proceedings of the 13th International Conference on Software Technologies (ICSOFT 2018), pages 460-471 ISBN: 978-989-758-320-9 Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved Analysis of GPGPU Programs for Data-race and Barrier Divergence quential program because of the possible ordering of rate runtime traces of memory accesses by different instructions, can be many (unless restricted otherwise threads, which is analyzed to discover various bugs with synchronization calls or atomic access) when like data races and bank conflicts.In comparison, our these instructions are executed in parallel by multiple approach analyzes the symbolic constraints associa- threads. The analysis of such a program requires the ted with the potentially conflicting access, along with methodology to consider all possible inter-leavings of checking for divergent barriers, which are not handled the instructions where there is no specific ordering in the former case. imposed by the program. This has a significant im- PUG tool (Li, 2010) is a symbolic verifier based pact on the number of permutations that need to be approach to detect data race conflicts, barrier diver- analyzed. This problem can be severe in the case of gence, and bank conflicts. PUGpara (Li and Gopala- GPU programs, as the number of threads can be in krishnan, 2012) is an extension of the PUG tool which the order of thousands, with absolutely no synchro- can also check for functional correctness and equiva- nization available between two threads belonging to lence of CUDA kernels. PUGpara models a single separate blocks. parameterized thread and uses symbolic analysis for We propose a static verification tool that can ana- verification of data race and equivalence checking. A lyze a CUDA program and verify all possible data- limitation of the tool is that the tool does not detect all races spanning all the barrier intervals, which may the races in a program, and exits on the first encounter occur when the program is executed on a GPU. The of a race. Also, the barrier divergence detection me- major contributions of this paper are as follows: chanism in presence of branches, checks if the num- 1. Existing literature such as (Li and Gopala- ber of barriers executed along the path to be the same krishnan, 2010; Lv et al., 2011) detects a sin- in the absence of branch divergence, which is not cor- gle data race or a set of races between two bar- rect as per the CUDA specification, where for each rier functions. This, in turn, requires the de- barrier synchronization call, it is expected that either velopers to repeatedly run and fix the data race all the threads within the block should execute it, or conditions, which can be quite cumbersome and none at all. time-consuming when the program is large, invol- Performance degradation analysis of GPUs con- ving many barriers. Thus, our approach generates cerning bank conflicts and coalesced global memory more comprehensive results in one run which can has been modeled as an unsatisfiability problem in substantially improve the program repairing time. (Lv et al., 2011). The approach produces a counter- 2. Our method can detect whether the barriers are example where threads with consecutive thread-ids divergent. may not access the consecutive memory locations. Here factors like data-type and the nature of access 3. We have provided a formal analysis of our data play an essential role (CUD, 2017). While the ap- race and barrier divergence detection algorithms proach reported in the paper is novel, it is indeed and proved that our approach is sound, i.e., it does not robust in considering all possible scenarios of un- not have any false positive results. coalesced access or complex bank conflict scenarios. The paper has been organized as follows. In A combination of static and dynamic analysis has Section 3, we provide an overview of the approach. been proposed by (Zheng et al., 2011; Zheng et al., Next, we elaborate the program translation mecha- 2014) where, initially a static analysis is used to re- nism in Section 4. In Section 5 we elaborate the verifi- duce the number of statements to be instrumented for cation approach. We report the experimental study of checking data races. Next, the access patterns of the our tool in Section 6. Finally, we conclude our paper. instrumented statements are collected and analyzed using dynamic analysis. An approach presented in (Said et al., 2011) ge- 2 RELATED WORK nerates concrete thread schedules that can determi- nistically trigger a particular data race. This appro- Unlike a sequential program, a parallel program exe- ach also leverages the satisfiability modulo theory cution can lead to several new safety and correctness (SMT) to generate the schedule, modeled as con- issues (Kirk and Hwu, 2016; Lin et al., 2015). A more straints. The approach, however, has been develo- recent article (Ernstsson et al., 2017) further substan- ped for task-parallelism in Java involving fork/join se- tiates several safety issues in GPU programming. mantics. An early approach by Boyer et al. (Boyer et al., Recently a synchronous delayed visibility se- 2008) proposed a dynamic analysis of a CUDA pro- mantics based program verification has been propo- gram. The CUDA kernel is instrumented to gene- sed(Betts et al., 2015) to detect data races and barrier 461 ICSOFT 2018 - 13th International Conference on Software Technologies divergence. Unlike this approach, we are using the notion of the predicated form of a thread in our ap- s h a r e d i n t A[ 1 0 2 4 ] , i n t B[ 1 0 2 4 ] , proach while translating the program to assist in the i n t C[ 1 0 2 4 ] ; d e v i c e void d a t a r a c e ( ) generation of symbolic constraints.

Analysis of GPGPU Programs for Data-Race and Barrier Divergence

Subchapter 2.4–Hp Server Rp5400 Series

Evolving GPU Machine Code

Readingsample

MT-088: Analog Switches and Multiplexers Basics

Processor-103

A Remotely Accessible Network Processor-Based Router for Network Experimentation

Abatement of Area and Power in Multiplexer Based on Cordic Using

Graphical Microcode Simulator with a Reconfigurable Datapath

HP 9000 V2500 Enterprise Server Client/Server with 24 C360 Front-Ends

Unit 2 : Combinational Circuit

16-Channel Analog Multiplexer/Demultiplexer Rev

Multiplexer Circuit