
Universität Augsburg

The Model of Computation of CUDA and its Formal Semantics

Based on the Master’s Thesis of Axel Habermaier

Axel Habermaier

Report 2011-14 October 2011

Institut für Informatik, D-86135 Augsburg
Copyright © Axel Habermaier

Institut für Informatik
Universität Augsburg
D-86135 Augsburg, Germany
http://www.Informatik.Uni-Augsburg.DE
— all rights reserved —

Abstract

We formalize the model of computation of modern graphics cards based on the specification of Nvidia's Compute Unified Device Architecture (CUDA). CUDA programs are executed by thousands of threads concurrently and have access to several different types of memory with unique access patterns and latencies. The underlying hardware uses a single instruction, multiple threads execution model that groups threads into warps. All threads of the same warp execute the program in lockstep. If threads of the same warp execute a data-dependent control flow instruction, control flow might diverge and the different execution paths are executed sequentially. Once all paths complete execution, all threads are executed in parallel again. An operational semantics of a significant subset of CUDA's memory operations and programming instructions is presented, including shared and non-shared memory operations, atomic operations on shared memory, cached memory access, recursive calls, control flow instructions, and thread synchronization. Based on this formalization we prove that CUDA's single instruction, multiple threads execution model is safe: For all threads it is provably true that a thread only executes the instructions it is allowed to execute, that it does not continue execution after processing the last instruction of the program, and that it does not skip any instructions it should have executed. On the other hand, we demonstrate that CUDA's inability to handle control flow instructions individually for each thread can cause unexpected program behavior in the sense that a liveness property is violated.

Contents

1 Introduction

2 Overview of CUDA
  2.1 Evolution of GPUs
  2.2 Architecture of a Modern GPU
    2.2.1 Chip Layout
    2.2.2 Processing Units
    2.2.3 Memories and Caches
    2.2.4 Compute Capability
  2.3 CUDA Programs
    2.3.1 Thread Organization
    2.3.2 Memory Types
    2.3.3 Introduction to PTX
    2.3.4 Compilation Infrastructure
  2.4 Alternatives to CUDA
    2.4.1 OpenCL
    2.4.2 Direct Compute

3 Conventions, Global Constants, and Rules
  3.1 Conventions
  3.2 Global Constants
  3.3 Rules

4 Formal Memory Model
  4.1 Overview of Memory Programs
  4.2 Formalization of Caches
    4.2.1 Cache Lines
    4.2.2 Cache Operations
  4.3 Formalization of the Memory Environment
  4.4 Formalization and Formal Semantics of Memory Programs
    4.4.1 Implicit Program State
    4.4.2 Boolean Expressions
    4.4.3 Operational Expressions
    4.4.4 Program Statements
  4.5 Memory Operation Semantics
  4.6 Formal Semantics of the Memory Environment
  4.7 Summary

5 Formal Semantics of CUDA's Model of Computation
  5.1 Formalization of PTX
    5.1.1 PTX Instructions and Features Included in the Formalization
    5.1.2 Definition of Program Environments
    5.1.3 Representation of PTX Programs as Program Environments
  5.2 Expression Semantics
  5.3 Thread Semantics
    5.3.1 Thread Actions
    5.3.2 Thread Rules
    5.3.3 Conditional Thread Execution
  5.4 Warp Semantics
    5.4.1 Formalization of the Branching
    5.4.2 Formalization of Votes and Barriers
    5.4.3 Warp Rules
  5.5 Thread Block Semantics
    5.5.1 Formalization of the Thread Block Synchronization Mechanism
    5.5.2 Thread Block Rules
  5.6 Grid Semantics
  5.7 Context Semantics
    5.7.1 Formalization of Warp, Thread Block, and Grid Scheduling
    5.7.2 Context Input/Output
    5.7.3 Context Rules
  5.8 Device Semantics
    5.8.1 Linking the Program Semantics to the Memory Environment
    5.8.2 Device Input/Output
    5.8.3 Device Rules
  5.9 Summary

6 Correctness of the Branching Algorithm
  6.1 Implications of Warp Level Branching
  6.2 Control Flow Consistency
  6.3 Single Warp Semantics
    6.3.1 Modified Thread Rules
    6.3.2 Modified Warp Rules
  6.4 Proof of the Safety Property
  6.5 Formalization of the Liveness Property
  6.6 Summary

7 Conclusion

List of Symbols

List of Listings

List of Figures

Bibliography

1 Introduction

Stanford University's Folding@Home1 is a distributed application designed to study protein folding and protein folding diseases. With more than 70 peer-reviewed papers2, the project's aim is to help form a better understanding of the protein folding process and related diseases including Alzheimer's disease, Parkinson's disease, and many forms of cancer. Even though some proteins fold in only a millionth of a second, computer simulations of protein folding are extremely slow, taking years in some cases3. Because the calculations are so computationally demanding, the Folding@Home project distributes them to thousands of client computers. People from around the world support the project by downloading the client and donating their idle CPU time to the project. Additionally, in recent years people have been able to run the client on their graphics card as well, resulting in a significant increase in computational power devoted to Folding@Home. All in all, the computational resources available to the project have already crossed the petaFLOPS barrier.

Since the molecular dynamics simulations performed by the Folding@Home client are computationally expensive, running them on GPUs has the potential of drastically reducing computation times. With the advent of general purpose GPU programming supported by both Nvidia and AMD graphics cards, it has become viable to develop new Folding@Home clients that run on the GPU instead of the CPU. The results speak for themselves: The GPU clients outperform their CPU counterparts by at least two orders of magnitude, even though the peak theoretical power of the GPUs has not yet been reached, i.e. further optimizations are still possible. Additionally, many molecules simulated by the CPU client are too small to fully utilize the graphics card because of its massively parallel nature. Therefore, it is expected that the performance gap between CPUs and GPUs will increase even further for simulations of larger molecules [1].

But even today GPUs already play an important role for Folding@Home, as illustrated by figure 1.1. Even though the number of active GPUs contributing to the project is far smaller than the number of CPUs, the actual floating point operations per second exceed those of the CPUs by an order of magnitude. The same applies to the PlayStation 3, which runs a special client optimized for the console's Cell chip.

Platform         TFLOPS   Active Processors
CPUs (Windows)      213              223933
AMD GPUs            697                6481
Nvidia GPUs        2199                8760
PlayStation 3      1671               28090

Figure 1.1: Folding@Home Client Statistics4

1 http://folding.stanford.edu/
2 http://folding.stanford.edu/English/Papers, last access 2010-10-28
3 http://folding.stanford.edu/English/Science, last access 2010-10-28
4 based on http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats, last access 2010-10-28; see also http://folding.stanford.edu/English/FAQ-flops

Figure 1.2: Screenshot of the Folding@Home Client for Nvidia GPUs

There are some issues, however, that the GPU client has to deal with. As general purpose GPU programming has only recently been introduced, development tools and environments are not yet as mature as their CPU counterparts. Additionally, GPU programming is still very close to the hardware, so problems like efficient memory accesses, CPU/GPU synchronization, and divergent control flow must be considered carefully. Precision of floating point operations is an important issue too, as using double precision computations instead of single precision ones is significantly slower on today's GPUs.

Another area where GPUs excel is the reconstruction phase of magnetic resonance imaging. There, the data sampled by a scanner needs to be transformed into an image that is then presented to a human for further analysis. Again, a speedup of two orders of magnitude can be achieved by offloading the required transformations to the GPU. As [2, 8] shows, hardware-specific optimizations are of utmost importance, though: A naive implementation is only about ten times faster than the equivalent CPU program. However, once memory accesses are optimized and the hardware's trigonometry function units are fully utilized, GPUs outperform CPUs by a factor of about 100 in total and one specific sub-problem is even computed around 357 times faster.

Figure 1.3: MRI Scan of a Human Head5

As these two examples show, GPUs can significantly speed up certain applications and operations — provided that they suit the novel model of computation of GPUs, which vastly

5http://de.wikipedia.org/wiki/Magnetresonanztomographie, last access 2010-10-28

differs from the traditional one of CPUs. Because of those tremendous performance improvements, Folding@Home will most likely continue to focus on the development of the GPU clients. But besides Folding@Home, there are many other research projects in physics, chemistry, biology, and other sciences that benefit significantly from GPU-accelerated computations. On the other hand, GPUs are also used outside academia in real-world applications where they might affect people's lives or health, as the aforementioned MRI example illustrates. Thus, developers of such applications must be able to guarantee that their programs behave correctly in all cases — either by extensive testing or by formally proving certain properties of their programs. However, general purpose GPU programming is a novel field of research and academic interest is mostly focused on finding ways to adapt specific problems and algorithms to the GPU's programming model and on optimizing GPU programs in order to make the computations as efficient as possible. Correctness, on the other hand, is only of secondary interest and is not proven formally in most cases. By contrast, this report is aimed at formalizing the model of computation of CUDA, Nvidia's general purpose GPU programming infrastructure. With some further work that is outside the scope of this report, it should be possible to use the presented formalization to formally prove properties of GPU-accelerated programs.

Based on an informal overview of the hardware architecture and the programming model in chapter 2, CUDA's memory model and program semantics are formalized in chapters 4 and 5, respectively. Although the formalization is not exhaustive, it includes most of the important features and instructions supported by CUDA, like cached memory operations, thread synchronization, thread divergence caused by data-dependent control flow instructions, atomic memory operations, and many more. Due to the complexity of the underlying hardware and CUDA, contrasting GPU and CPU semantics as a whole is outside the scope of this report. However, chapter 6 indeed shows a difference in program behavior caused by the GPU's single instruction, multiple thread (SIMT) architecture. While in real-world scenarios the SIMT architecture is mostly relevant for achieving optimal performance, it might also cause deadlocks that would not occur if the program were executed on the CPU. We show that only terminating programs are provably correct in the general case.

2 Overview of CUDA

Compared to traditional x86 CPUs, CUDA utilizes a different programming model to take advantage of the massively parallel nature of modern GPUs. CUDA programs are concurrently executed by thousands of threads in parallel and have access to a variety of different memory types, all of which have distinct access characteristics, sizes, and uses. Unlike x86 CPUs, GPUs do not execute each thread individually, but rather group threads into warps. These warps are executed individually, but all threads of the same warp execute their instructions in lockstep. If the control flow of threads of the same warp diverges due to a data-dependent conditional branch instruction, execution of the warp is serialized for each unique path, disabling the threads that did not take the path. When all paths complete, the threads converge back to the same path and they execute concurrently in lockstep again.

Since CUDA was first released as recently as November 2006 [3], the programming model is still very close to the hardware. Therefore, CUDA developers must take into account the distinctive traits and feature sets of the hardware for a program to run as efficiently as possible on the GPU. While efficiency and performance are not the primary topic of this report, the formalization of the semantics is indeed affected by the novelty of general purpose GPU programming; we generally define the semantics in a low-level fashion and often refer to the hardware implementation of a specific feature. As GPUs have just recently gained features such as indirect function calls, recursion and dynamic memory allocation, it will probably take several years before high-level abstractions like virtual machines similar to Java will become viable from a performance standpoint. At that point, it should be possible to define a formal semantics that is less closely tied to the underlying hardware.

Since the hardware is such an integral part of CUDA, we first take a brief look at the architecture of modern GPUs before we give an overview of CUDA's programming model. As the primary use of modern GPUs is still the acceleration of graphics rendering, we explain the evolution of CUDA based on the evolution of the underlying hardware from fixed-function graphics co-processors to fully programmable general purpose processors. In this chapter, we also introduce the programming languages and compiler infrastructure used to develop CUDA applications and outline some of the differences and similarities of CUDA compared to other general purpose GPU programming frameworks. Except for section 2.4, this entire report focuses on Nvidia's GPUs, as CUDA is an Nvidia-exclusive technology. We focus on CUDA instead of multi-vendor frameworks such as OpenCL and Direct Compute because CUDA's specification was the most comprehensive one at the time of writing.

2.1 Evolution of GPUs

In the late 1990s, GPUs were used exclusively to accelerate fixed-function graphics processing. Using an API like OpenGL or Direct3D, a developer instructed the GPU to draw triangles on the screen, using a combination of different states to specify how the triangles

should be textured, lit, alpha-blended, etc. At that time, developers were only able to use the functionality that was supported by the hardware and exposed by the graphics API; programming the GPU directly was not yet possible. Having only a limited instruction set and no programmability was the main reason why the GPUs of that time outperformed the CPUs: The chip developers were able to use the chip's transistor budget to implement slow operations like texture filtering and raster operations in hardware, as they did not have to implement transistor-heavy functionality like data caches and branch prediction logic. Furthermore, graphics rendering is an inherently parallel task; each vertex of a geometry set and each rasterized pixel can be processed independently and in parallel. This allowed the GPUs to increase the performance by just putting more of the same functional units on the chip.

Eventually, it became apparent that the fixed-function design severely limited the graphical effects that can be achieved by GPU-based renderers. In 2001, Nvidia introduced a new generation of graphics processors which for the first time allowed the GPU to be programmed directly by the developer. The new programmability could be used to specify shader programs that operated on individual vertices and pixels; however, shaders had to be very short and control flow instructions were missing. Over time, GPUs became faster by allowing more shaders to be executed in parallel, while at the same time increasing the instruction count limit and adding new operations to the instruction set. At the same time, both OpenGL and Direct3D introduced the C-like languages GLSL and HLSL, respectively, which could be used to develop shaders in a higher-level fashion than writing assembly code.

With the DirectX 10 generation of graphics cards, GPUs gained the ability to be used for general purpose processing, circumventing the traditional graphics APIs thanks to the introduction of CUDA (Nvidia) and CAL (AMD). Previously, some developers had already attempted to use GPUs for non-graphics related tasks, but had to fit their algorithms into the limits imposed by the graphics APIs. As those were not designed for such usage, the developers had to work around the limitations of both the API and the hardware, reducing the possible performance gains. Another important change of the DirectX 10 generation of GPUs was the introduction of unified shading. Previous generations of graphics cards had distinct hardware units for vertex and pixel shaders. That was problematic in some cases; for example, if a draw operation was heavily pixel shader bound, the vertex shader parts of the chip were running idle. As DirectX 10 added yet another type of shader, the geometry shader, the traditional design of the shader units would have become too wasteful. Therefore, all DirectX 10 chips consist of unified shading units, allowing the same hardware units to be used to execute vertex, geometry, and pixel shaders. For general purpose programming, this meant that a program was now able to use all hardware units available on the graphics card. However, some of the fixed-function units that are still needed for graphics performance remain unusable; they sit idle when the chip executes a non-graphics program. Future generations of graphics cards will probably replace more and more of the current fixed-function functionality by programmable units that then might be usable by non-graphics related programs as well.
With CUDA, developers were now able to write general purpose applications without having to pay attention to the graphics heritage of the GPUs. Additionally, Nvidia added dedicated functionality to the chips which — until now — is only relevant to general purpose programming, like shared and non-shared readable and writable memory, atomic memory operations, and explicit synchronization points. The DirectX 11 generation added even more features like recursion, a cache hierarchy for improved latency, indirect function

calls, and more. All of these features are exposed directly to the developers who can use, for example, an extended version of C to program the GPU [2, 2]. Some of these features, especially indirect function calls, are likely to be supported by future versions of the shader programming languages as well. DirectX 11, for instance, has already added support for interfaces and classes to HLSL, although currently calls to an interface function are not virtual but resolved at shader link time [4]. For the future, it is expected that the importance of graphics APIs will vanish whereas the general purpose programmability APIs will gain more traction. Tim Sweeney, one of the developers of the successful Unreal Engine 3, is already predicting that they will write their next generation renderer in a programming language such as C++ or CUDA, abandoning OpenGL and Direct3D entirely1. Another possibility is to take a hybrid approach, where the graphics APIs are used for the basic rendering work and the greater flexibility of general purpose APIs is used to accelerate post-processing effects. For instance, there is already a game that uses a compute shader, the DirectX 11 equivalent of a CUDA program, to calculate the screen space ambient occlusion effect for scenes rendered with a traditional renderer2.

2.2 Architecture of a Modern GPU

GPUs have both strengths and weaknesses compared to traditional (multi-core) CPUs. Therefore, it depends on the program whether it can take advantage of the GPU's superior computational power or whether it is actually faster to run it on the CPU. Because of their graphics acceleration heritage, GPUs specialize in highly parallel as well as compute and memory bandwidth intensive tasks, whereas CPUs excel at sequential, control flow intensive ones. Hence, GPUs have more computational power and bandwidth than CPUs at the cost of an increased operation latency. This latency, however, can be completely hidden if the program being run is sufficiently parallel, i.e. a massive number of threads can be run concurrently, as explained later on.

While there are many different chips capable of executing CUDA applications, we focus on the latest generation of graphics cards called Fermi or GF100. The remainder of this section discusses those parts of Fermi's architecture that influence the semantics of CUDA programs, leaving out all the complexities that are not relevant for this report. The figures below are simplified versions of figures found in [5] and [6]. Additional information about Fermi's architecture can be found in [5], [7], [3], [8], [9], [6], and [2]3.

Fermi is Nvidia's current high-end chip and has — like all recent graphics chips — a modular design: With relatively low engineering effort, performance, mainstream and low-end chips can be derived from the top model, offering fewer features and less performance at a lower price point. All of these GPUs use the same basic architecture and only differ in memory sizes and the number of functional units still left in the chip. Thus, when we speak of the "Fermi architecture", we actually mean the basic architecture described below, parameterized over the memory sizes and the number of functional units. For a better understanding, the following sections and figures use the concrete values of the high-end chip GF100 instead of abstract placeholders. Later, we do indeed use global constants to

1 http://arstechnica.com/gaming/news/2008/09/gpu-sweeney-interview.ars, last access 2010-10-10
2 http://www.anandtech.com/show/2848/2, last access 2010-10-10
3 Unless noted otherwise, all referenced Nvidia documents can be found at http://developer.nvidia.com/object/gpucomputing.html, last accessed 2010-10-31.

parameterize the domains and functions, as we want the semantics to be applicable to all Fermi-based GPUs.

2.2.1 Chip Layout

Figure 2.1: Fermi Chip Layout

Figure 2.1 gives an overview of Fermi's basic chip layout. The GPU has six memory partitions positioned around a central L2 cache. Fermi supports cached read and write access to shared and non-shared memory as well as atomic and volatile memory operations. Section 2.2.3 takes a closer look at the different kinds of memory found on a Fermi chip.

The GPU uses the host interface for all communications with the CPU. The CPU has read and write access to the GPU's memory, loads CUDA program code onto the GPU and launches programs on the GPU with a given set of input parameters. Once the program has been completed, the CPU is informed that the output produced by the program is ready for consumption by the host program. The communication between the CPU and GPU can be complex as both processors operate independently and in parallel. It is the graphics driver's responsibility to synchronize both processors when specific events occur. We do not go into further details concerning the interaction of CPUs and GPUs and also do not consider multi-GPU setups.

The Giga Thread scheduler is responsible for distributing waiting threads to the 16 streaming multiprocessors (SMs). The scheduling algorithm used by the scheduler is unknown. An SM executes many threads in parallel and issues read and write requests to the L2 cache or directly to the DRAM. The architecture of SMs is explained in further detail in section 2.2.2.

As mentioned above, Fermi's architecture was designed to be flexible, such that parts of the chip can be easily removed to produce cheaper and smaller chips. This is achieved by reducing the L2 cache size, the number of DRAM partitions, and the number of SMs on the chip, as well as by removing other fixed-function, graphics-related functionality not shown in figure 2.1. In fact, for GF104, the first derivative of GF100, Nvidia also modified the SMs

in an attempt to improve graphics performance at the expense of reducing the performance of certain CUDA-exclusive features. The GF104 changes, however, only affect the number and capabilities of the computation units within the SMs and therefore have no direct effect on the semantics of CUDA programs.

2.2.2 Processing Units

Figure 2.2: Fermi Streaming Multiprocessor

Each streaming multiprocessor can have up to a total of 1536 threads scheduled for execution. The architecture used to manage such a large number of threads is shown in figure 2.2. Each SM consists of 32 CUDA cores capable of performing floating point and integer calculations. Additionally, there are 16 load-store units for memory operations and four special-function units for transcendental operations and other special functions like sin, cos, and exp. The units read from and write to the register file consisting of 2^15 = 32768 32-bit registers. Memory operations can optionally use the L1 cache when accessing the DRAM. For efficient thread communication, the SM also provides up to 48 KByte of shared memory.

Nvidia uses a single instruction, multiple threads (SIMT) architecture to maximize the utilization of the SM's functional units. When the Giga Thread scheduler dispatches threads to the SM, the SM groups 32 threads into a warp. All threads of a warp execute the same instruction in parallel. The two warp schedulers simultaneously schedule two warps ready

to execute their next instruction. Usually, after a warp has executed an instruction, it is not immediately scheduled again to execute its next instruction. To hide the latency caused by memory operations or even register reads and writes, the warp schedulers always try to schedule a warp that can be executed immediately without waiting for any data to become available. Whereas context switches for threads are an expensive operation on CPUs, there is no overhead associated with warp switching on the GPU; all thread state, such as the current instruction address and the registers, is stored by the SM for the lifetime of the warp. This allows GPUs to minimize the impact of high-latency operations like DRAM access; instead of stalling the warp or the entire SM, the warp schedulers are free to schedule any other warp, without any overhead, for maximum efficiency.

The scheduling algorithm used by the warp schedulers is unknown. It is known, however, that one scheduler manages all warps with odd ids, whereas the other scheduler manages the warps with even ids. Each scheduler can only utilize 16 of the 32 cores — except when scheduling double precision instructions, in which case all 32 cores are used by one of the schedulers and the other scheduler cannot issue any instruction to the cores. Each scheduler can issue an instruction to one of the four execution blocks of an SM; execution blocks are the CUDA cores divided into two blocks with 16 cores each, the group of 16 load-store units, and the four special-function units. Depending on the type of the operation, an issued instruction takes several clock cycles to complete: operations executed on the cores or load-store units take two clock cycles to complete, whereas operations run on the four special-function units take eight cycles to complete. In any case, it takes only one clock cycle to issue the instructions and the warp schedulers can already issue the next instructions to idle units without waiting for previous instructions to complete, as illustrated by figure 2.3.

Figure 2.3: Utilization of Execution Blocks

Giving up individual thread execution in favor of the SIMT model allows the GPU to be designed with fewer transistors and enables some memory operation optimizations otherwise impossible. A smaller chip usually allows higher clock speeds, and coalescing memory operations performed by many threads into fewer but larger memory transactions has the potential of significant performance improvements. Moreover, the hardware must fetch and process an instruction only once per warp and not once per thread, allowing these costs to be amortized over many threads [2, 6.1]. However, warps become problematic when threads are allowed to use data-dependent control flow instructions that cause threads to take different execution paths. In graphics programming, shaders used to avoid control flow instructions for performance reasons, and often the decision of which path to take could already be made

at compile-time. With shaders getting more complex and the introduction of general purpose programming, supporting branching efficiently became increasingly important. Compared to older generations of graphics cards, the SIMT architecture employs a sophisticated algorithm to support arbitrary branching of individual threads. Basically, divergent control flow is handled by executing all paths serially with all threads not on the current path deactivated. After all paths have completed, all threads execute the same path again in parallel. Thus, the algorithm's worst case performance reduction is proportional to the warp size. For this reason, CUDA developers try to avoid control flow instructions which cause threads of the same warp to take different paths. In contrast, different warps are executed independently, so there is no performance gain or penalty when they are executing common or disjoint code paths. Since the handling of divergent control flow at the warp level can cause unexpected program behavior, we examine the branching algorithm used by Fermi thoroughly in later chapters and prove its correctness for terminating programs in chapter 6.
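To make the serialized execution of divergent paths concrete, consider the following CUDA-C sketch; the kernel and variable names are ours and not taken from any Nvidia document. The branch condition depends on the thread index, so every warp contains threads that take the if path as well as threads that take the else path.

__global__ void divergentKernel(float *data)
{
    int idx = threadIdx.x;

    // The condition depends on the thread index, so within a warp of 32 threads
    // both branches are taken. The hardware executes the two paths one after the
    // other, masking off the threads that did not take the current path.
    if (idx % 2 == 0)
        data[idx] = data[idx] * 2.0f;   // even threads active, odd threads disabled
    else
        data[idx] = data[idx] + 1.0f;   // odd threads active, even threads disabled

    // Here the warp has reconverged and all 32 threads execute in lockstep again.
    data[idx] += 1.0f;
}

A branch whose condition is uniform across the warp, for example one that depends only on the block index, does not cause divergence and incurs no serialization.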

2.2.3 Memories and Caches

Figure 2.4: Fermi’s Memories and Caches

As shown in figure 2.4, there are many different types of memories and caches placed at various locations on the chip. Each one of the 16 streaming multiprocessors has its own register file and shared memory, as well as L1, constant, and texture caches. Threads cannot access the memories and caches of any SM other than the one they are executed on. A number of registers is assigned to each thread running on an SM, the exact number depending on the program executed by the threads. Once assigned to a thread, a register is exclusively accessible by its assigned thread and becomes available for reassignment only when the thread terminates. If a CUDA program requires many registers per thread to execute, the maximum number of threads that can be concurrently scheduled on an SM decreases when the SM runs out of register file space. Having fewer warps available for scheduling might degrade performance, as it becomes more likely that all warps are blocked waiting for the completion of a high-latency operation like uncached DRAM access.

The SM's shared memory and L1 cache are actually the same on-chip memory that can be configured to provide either 16 KByte of shared memory and a 48 KByte L1 cache or vice versa. Accessing shared memory is almost as fast as accessing a register, so using shared memory is significantly faster than accessing the DRAM. However, Fermi can use the L1 cache to reduce DRAM access times. Furthermore, the 768 KByte L2 cache shared by all SMs also speeds up DRAM accesses, whereas the constant cache is used to speed up access to special constant data in the DRAM.

In graphics mode, many pixel shaders sample values from textures stored in DRAM. Although not shown in figure 2.2, all SMs have special hardware units for texture sampling operations that perform address calculations and filtering operations efficiently in hardware. We do not consider texture operations in the following chapters, despite the fact that CUDA programs can indeed use the streaming multiprocessor's texturing hardware.

The DRAM is shared by all streaming multiprocessors. With up to several gigabytes, it is the largest type of memory on the GPU. In graphics mode, the DRAM is used to hold all texture and vertex data. In general purpose computing mode, any data can be stored in the DRAM and all threads have full read and write access. Based on this hardware-centric overview of the available memory types, section 2.3.2 revisits this topic from the software point of view.
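The split between shared memory and L1 cache can be influenced from the host: the CUDA runtime lets a program state a per-kernel preference. The sketch below uses the cudaFuncSetCacheConfig call with a hypothetical kernel name; the runtime treats the preference as a hint, so this is an illustration rather than a guaranteed configuration.

#include <cuda_runtime.h>

__global__ void stencilKernel(float *data);   // hypothetical kernel defined elsewhere

void configureOnChipMemory(void)
{
    // Request 48 KByte of shared memory and 16 KByte of L1 cache for this kernel;
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferShared);
}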

2.2.4 Compute Capability

Since the release of CUDA in 2006, new features have been added to the hardware and subsequently to CUDA as well. The features and instructions supported by a GPU are defined by the GPU's compute capability, which also specifies some hardware constants like the maximum number of concurrently resident threads on an SM or the number of registers per SM. Fermi's compute capability is 2.0. The first digit, the major revision number, denotes the core architecture of the GPU. The second digit, the minor revision number, reflects incremental improvements to the core architecture. A device of higher compute capability is backward compatible with a device of lower compute capability, as devices of higher compute capability support a superset of the features of older devices [8, 3, 1.2.1 and 2.5 respectively]. During the writing of this report, Nvidia released the GF104 chip, which supports compute capability version 2.1. However, no new instructions were added with version 2.1. The most significant changes concern the architecture of streaming multiprocessors and the fact that the warp schedulers now support the scheduling of two successive, independent instructions of the same warp in parallel [10, 5.2.3 and G.4.1]. The formal semantics defined in the following chapters reflects compute capability version 2.0.

An overview of supported instructions and device parameters for all current GPUs can be found in [3, chapter G.1]. For example, atomic memory operations were introduced with compute capability version 1.1 and are therefore unsupported by hardware of compute capability 1.0. The number of registers per SM has increased from 8192 for devices of compute capability 1.0 and 1.1 to 32768 for Fermi. On the other hand, the warp size remained constant at 32 threads per warp for all currently available GPUs. There are also device parameters, such as the number of streaming multiprocessors on the chip, that can vary even within the same compute capability version. As this report focuses solely on Fermi, we consider all CUDA features to be supported by the underlying hardware but still abstract from specific numbers such as the number of registers per SM, as mentioned above. Since all lower compute capability versions are subsets of compute capability 2.0, it should be possible to construct a formal semantics for

older CUDA versions by removing the unsupported features from the semantic domains and rules defined in the following chapters. However, this report does not explore the feasibility of this idea.
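The compute capability and several of the device parameters mentioned in this section can be queried at runtime through the standard CUDA runtime API; the following sketch simply prints a few of them for device 0.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query the first CUDA device

    // prop.major and prop.minor encode the compute capability, e.g. 2.0 for GF100.
    printf("compute capability:        %d.%d\n", prop.major, prop.minor);
    printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("registers per block:       %d\n", prop.regsPerBlock);
    printf("warp size:                 %d\n", prop.warpSize);
    printf("shared memory per block:   %d bytes\n", (int)prop.sharedMemPerBlock);
    return 0;
}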

2.3 CUDA Programs

At least for the time being, operating systems such as Windows and Linux launch all programs on the CPU. Therefore, a CUDA application is started on the CPU and must use the CUDA runtime to launch a calculation on the GPU. As the CPU and GPU are usually fully independent chips, they operate in parallel, making explicit synchronization necessary. The graphics driver is responsible for handling the synchronization details and hence hides these complexities from the CUDA developer. The part of the CUDA application that is executed on the CPU is called the host program, whereas the part that is executed on the GPU is called the device program or kernel.

Program 2.1 is a simple CUDA program that squares the values of an array4. It is written in CUDA-C, an extension of the C programming language developed by Nvidia to simplify the development of general purpose applications for the GPU. Even though we use CUDA-C for the introductory example, we eventually switch to using PTX, an assembly level language for GPU programming. We use PTX to define the formal semantics of CUDA, because PTX makes it more explicit what operations are actually performed on the hardware. For example, in CUDA-C we don't know if a = 1; is a cached or uncached write, whereas in PTX the statement st.global.ca [a], 1; states the cache operation ca explicitly. By using PTX, we avoid having to "guess" which cache operation to use in this case. PTX is introduced in section 2.3.3.

We explore the basic structure of CUDA programs based on how the host (the CPU) and the device (the GPU) execute program 2.1. When the operating system launches program 2.1, the CPU starts executing the function main at line 7. main is a standard C function which is entirely executed by the CPU. It calls CUDA runtime functions to interface with the GPU, but all interactions with the GPU are completely handled by the CUDA runtime and the graphics driver. In line 9, two pointer variables are declared that are used to hold the input array which is to be squared by the GPU. In this example, we only start 32 threads on the GPU; therefore, we have to allocate enough memory to hold 32 floating point values. The host memory for the array is allocated in line 11 and the pointer to the starting address of the allocated memory is stored in a_h. The next line uses the CUDA function cudaMalloc to allocate an array of equal size on the GPU's DRAM. The pointer to the address of the array in device memory is stored in a_d; it has no meaning for the host and dereferencing it would very likely result in either an access violation or reading garbage. The CPU then fills the host array with some values. Afterwards, these values are copied into device memory using CUDA's cudaMemcpy function. Both a_h and a_d now hold the same data, once in host memory and once in device memory. This is necessary because the GPU usually has no access to host memory and therefore cannot read a_h directly. Now that the GPU has access to all required input data, the CPU launches the squareArray function on the GPU in line 17. The <<< ... >>> syntax specifies the execution configuration, in this

4 Program 2.1 is a simplified version of the program found at http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/.

 1 __global__ void squareArray(float *a)
 2 {
 3     int idx = threadIdx.x;
 4     a[idx] = a[idx] * a[idx];
 5 }
 6
 7 int main(void)
 8 {
 9     float *a_h, *a_d;
10     int size = 32 * sizeof(float);
11     a_h = (float *)malloc(size);
12     cudaMalloc((void **) &a_d, size);
13
14     // Initialize host array...
15
16     cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
17     squareArray <<< 1, 32 >>> (a_d);
18     cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
19
20     // Use the results...
21
22     free(a_h);
23     cudaFree(a_d);
24 }

Listing 2.1: A simple program written in CUDA-C that squares the values of an array

case it means that the GPU should use 32 threads to execute the function. Also, the pointer to the array in device memory is passed to the GPU. The GPU runs the squareArray function defined at line 1, generally called a kernel, once for each of the 32 threads. The Giga Thread scheduler assigns the threads to one streaming multiprocessor. Since the warp size of all current GPUs is 32, the SM puts all threads into one single warp. The warp then executes all 32 threads concurrently. threadIdx.x is a predefined variable accessible by all functions running on the GPU and returns the index of the thread; here a unique value between 0 and 31 is returned for each thread. At line 4, the thread index is used to read the input data from global memory using the pointer passed to the kernel. Subsequently, the squared value is written back to the array. Since thread indices are unique, this means that all 32 elements of the array are squared and updated at the same time. After the GPU has completed the execution of the kernel, the CPU resumes execution at line 18 where it copies back the result values from device memory into host memory. The CPU can then use the results and eventually frees the memory allocated on both the device and the host before the program exits. CUDA supports some more complex forms of CPU and GPU synchronization like streams and events described in [3, 3.2.7.5 and 3.2.7.6], whose primary purpose is to improve the level of parallelism between the CPU and the GPU. However, we do not consider any form of CPU/GPU interaction for the remainder of this report and focus solely on the GPU semantics. Now that we have some basic understanding of the structure of CUDA programs, we take a closer look at how CUDA organizes threads on the GPU and the characteristics of the different types of memory available to CUDA programs. Thereafter we give a brief introduction to

the PTX programming language and the compilation process of CUDA programs.
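One practical remark about the host side of Listing 2.1: a kernel launch returns control to the CPU immediately, and it is the blocking cudaMemcpy in line 18 that implicitly waits for the kernel to finish. A slightly more defensive variant — our own sketch, reusing the variables of Listing 2.1 — makes the synchronization and the error check explicit:

// Hedged extension of Listing 2.1's host code: launch the kernel, then wait for
// the GPU and check the error status before copying the results back.
squareArray <<< 1, 32 >>> (a_d);

cudaError_t err = cudaDeviceSynchronize();   // block until the kernel has finished
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);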

2.3.1 Thread Organization

When a kernel function is launched on the GPU, it is executed by many threads in parallel. The hardware automatically assigns a thread index to each thread that the program can use to identify the thread. For example, program 2.1 uses the thread index to read from and write to a specific memory address by using the thread index to access a value of an array. Thread indices are three-dimensional, because CUDA programs often deal with two- or three-dimensional problems. For instance, a GPU-accelerated raytracer might use one thread to calculate the final color of one pixel. The location of a pixel is stored in a two-dimensional vector which can thus be trivially mapped to the thread's index. While three-dimensional indices are convenient, sometimes a program might also require a thread's one-dimensional index; the CUDA documentation specifies how a one-dimensional index can be obtained from a three-dimensional one [3, 2.2].
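For reference, the flattening described in the CUDA documentation amounts to a simple row-major calculation; the helper below (its name is ours) computes a thread's one-dimensional index within its block.

// One-dimensional thread index within a block of dimensions
// (blockDim.x, blockDim.y, blockDim.z), laid out in row-major order.
__device__ int flatThreadIndex(void)
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}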

Figure 2.5: CUDA Thread Hierarchy (each context contains one or more grids, each grid consists of two-dimensionally indexed thread blocks, and each block consists of three-dimensionally indexed threads)

Figure 2.5 shows how CUDA organizes threads on the GPU. Threads are grouped into thread blocks, also called cooperative thread arrays, and within a thread block each thread index is unique. All threads belonging to the same thread block are executed on the same streaming multiprocessor. When a streaming multiprocessor has free capacity, the Giga Thread scheduler assigns an unscheduled thread block to the SM. The SM creates all threads and allocates the necessary resources to execute them. Threads with consecutive indices are grouped into a warp, i.e. threads 0 to 31 belong to warp 0, threads 32 to 63 belong to

warp 1, and so on. If the number of threads within a thread block is not a multiple of the warp size, the last warp is not fully populated. Warps are not shown in figure 2.5, because they are not part of the CUDA specification [2, 4.5]. A streaming multiprocessor must have enough resources available to create all threads of a block, otherwise the Giga Thread scheduler cannot issue a pending thread block to the SM. Therefore, the number of thread blocks and threads that can concurrently be allocated on a streaming multiprocessor depends on the number of registers each thread requires to execute the program, the number of threads within a thread block, the SM's register file size, the maximum number of resident warps and blocks allowed on a processor, and the available and required amount of shared memory.

Threads of the same thread block can communicate and share data through shared memory and can also synchronize their execution to avoid race conditions when accessing shared memory. Threads not belonging to the same thread block can only cooperate using global memory. However, no synchronization mechanism exists between threads of different blocks, so reading global data written by other threads is generally unsafe.

The streaming multiprocessors assign a unique block index to each block. Blocks are organized two-dimensionally within grids and are distributed among the streaming multiprocessors. All blocks of a grid contain the same number of threads and execute the same program concurrently. When a kernel function is launched, the GPU creates a grid and the required number of thread blocks and begins distributing thread blocks to available processing cores. For devices of compute capability 2.0, each thread block can consist of up to 1024 threads if not restricted by any other hardware constraints and each grid can contain thousands of blocks. CUDA programs can also access the block index as well as the block and grid dimensions in addition to the thread index, which can be used to uniquely identify a thread within the grid. Referring back to the raytracer example, 1024 threads might not be enough to raytrace an image of acceptable resolution if each thread corresponds to one pixel. Therefore, the calculation must be spread across several thread blocks and processors. If the raytraced image is to be stored in a two-dimensional array in global memory, the memory address where each thread stores its result can be calculated using the thread and block index as well as the block dimension.

Grids belong to a context. A context can consist of many grids with different dimensions that execute distinct programs. Up to 16 grids of the same context can execute concurrently. All grids of the same context share a common virtual address space, thus threads belonging to different contexts cannot access each other's memory locations. While there can be several contexts created on the device, only one context can be active at a time. Even though it is not explicitly stated by the CUDA documentation, it is safe to assume that grid execution and therefore also context execution cannot be preempted; Nvidia recently announced that preemption will be a feature supported by future GPUs5.
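The address calculation for the raytracer example mentioned above might look as follows; the kernel name, the row-major image layout, and the bounds check are our assumptions, not part of the report.

__global__ void raytracePixel(float *image, int width, int height)
{
    // Global pixel coordinates derived from block index, block dimension, and thread index.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)
        return;                        // threads of partially filled blocks do nothing

    image[y * width + x] = 0.0f;       // placeholder for the actual color computation
}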

2.3.2 Memory Types

CUDA programs have access to all of the different memory types mentioned in section 2.2.3. We now revisit the topic of memories and caches from the software point of view and look at the restrictions and performance traits of each type of memory. Figure 2.6 gives an overview of some of the properties of the available CUDA memory types.

5http://www.hexus.net/content/item.php?item=26552

Type               Access   Scope     Location   Cached
Registers          R/W      Thread    SM         No
Special Registers  Read     Thread    SM         No
Shared Memory      R/W      CTA       SM         No
Local Memory       R/W      Thread    DRAM       Yes
Global Memory      R/W      Context   DRAM       Yes
Constant Memory    Read     Context   DRAM       Yes
Texture Memory     Read     Context   DRAM       Yes

Figure 2.6: Properties of CUDA Memory Types

Registers, special registers, and shared memory are located within the streaming multiprocessors and are thus only accessible by the SM they belong to. These low-latency memories are not cached, because access times are generally low. On the other hand, the local, global, constant, and texture memories are all distinct sections of DRAM memory located off the chip. Accessing DRAM is at least an order of magnitude slower than accessing on-chip memory; therefore, caches are used to amortize the costs. Whenever threads of the same warp write to the same memory location, the results are unpredictable; the behavior even depends on the compute capabilities of the device the application is executed on. For devices based on the Fermi architecture, only one thread performs the write, but which one is undefined [3, G.4.2]. If more than one thread of the same warp performs an atomic read-modify-write on the same memory location, all operations are serially executed, but the order in which the operations are performed is undefined [3, 4.1].

Registers. When a thread is allocated on a streaming multiprocessor, the SM assigns the required number of registers to the thread, which are subsequently accessible by this thread only. Each thread belonging to the same grid requires the same number of registers to execute the program. Register access is fast, but registers might be blocked after being used as an argument for a high-latency operation such as DRAM memory access. For optimal performance, a thread may execute subsequent instructions that do not depend on any of the blocked registers. If a register used by the next instruction is blocked, however, the thread and consequently the thread's warp cannot be scheduled for execution. In this case, the warp scheduler checks if there is any other warp ready to execute its next instruction and schedules this one instead. That way, latencies can be hidden as long as there are enough other warps to execute in the meantime.

Special Registers. Special registers are predefined by the hardware. They convey parameters like the thread index, the block index, the grid size, and so on to a thread. Obviously, the thread index register returns a different value for each thread of the same thread block, whereas other registers like the grid size register hold the same value for all threads of the same grid. Still, even grid-scoped special registers are duplicated, because threads of the same grid are usually executed on many different streaming multiprocessors, each of which provides its own copy of the register.

Shared Memory. A streaming multiprocessor dynamically partitions its shared memory and assigns one of those partitions to each of its thread blocks. Threads of the same thread block can then share data through their assigned shared memory partition but are unable to access any other partition on the same or another streaming multiprocessor. CUDA does not

attempt to protect the developer from the problems that can occur due to conflicting access to shared memory. Developers must manually take care of race conditions, read-after-write hazards and the like using atomic operations, thread barriers, or memory fences. Access to shared memory is almost as fast as accessing a register. Therefore, it might be beneficial to copy data located in global memory to shared memory if it is accessed several times by several different threads of the same block. Shared memory has 32 banks on current GPUs. Banks are interleaved; more precisely, successive 32-bit words are assigned to successive banks. If all threads of a warp access locations in shared memory that reside in different banks, all 32 reads or writes can be performed in parallel. However, if threads access locations where two or more locations reside in the same bank, there is a bank conflict and access has to be serialized. For example, if all threads of a warp read data from successive indices of a 64-bit double precision floating point array in shared memory, the reads of threads belonging to the first half-warp are in conflict with the reads of the threads belonging to the second half-warp. Consequently, the reads are serialized and serviced by two independent shared memory accesses. Avoiding bank conflicts has the potential of significant performance improvements.

Local Memory. Local memory resides in DRAM and is slow to access. To improve latency, threads can use the streaming multiprocessors' L1 caches and the global L2 cache to speed up reads and writes to local memory. A read of a memory address already cached in L1 has a latency that is almost as low as reading a register. Each thread has its own local memory space and has no access to the local memory of other threads. The local memory is used by the compiler to spill registers to memory when the thread requires more registers than can be reasonably assigned to it. Moreover, the function call stack resides in local memory. Additionally, a thread may use local memory to store other private data that does not fit into registers such as arrays. With thousands of threads in-flight, the amount of local memory allocated to each thread consumes a significant amount of DRAM. In the case of the GeForce GTX 480, if all streaming multiprocessors were fully utilized and all threads had the maximum amount of local memory allocated, the amount of memory required would be 11.25 GByte6, which is an order of magnitude more than the available amount of DRAM on the graphics card. The specification does not state what the CUDA runtime does to remedy this situation, so we have to assume that fewer threads are launched to prevent the program from running out of memory. All in all, the amount of global memory available to a CUDA program is the amount of DRAM memory minus the amount of constant memory and reserved local memory. Consequently, the amount of global memory available to a CUDA program is unknown statically; we only know it is less than the amount of DRAM on the board.

Global Memory. Global memory is shared by all threads of the same context and is not sequentially consistent [9, 5.1.4]. Like for local memory, the streaming multiprocessors' L1 caches and the global L2 cache speed up accesses to global memory. Conflicting writes to the same address by different threads result in undefined behavior and CUDA does not supply any mechanism for global thread synchronization.
The only way to synchronize memory access is thread block synchronization or launching dependent grids within the same stream on the CPU side. Other than that, atomic operations or memory fences might be helpful in avoiding some of the common shared data issues.
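As a concrete instance of such an atomic operation, the sketch below (kernel name and predicate are hypothetical) lets all threads of a grid increment a single counter in global memory; atomicAdd performs the read-modify-write indivisibly, although the order of the increments remains undefined.

__global__ void countMatches(const int *data, int n, int *counter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // The atomic read-modify-write avoids the undefined behavior of conflicting
    // plain writes; the hardware serializes the updates in an unspecified order.
    if (idx < n && data[idx] == 42)
        atomicAdd(counter, 1);
}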

6 15 streaming multiprocessors times 1536 threads per SM times 512 KByte of local memory per thread

The L1 caches of different streaming multiprocessors are not kept coherent for global memory locations. This can result in a thread reading a stale value from a memory location that has long been updated by another thread [9, table 80]: Suppose thread 1 on SM 1 writes the value a to global memory location l. On SM 2, thread 2 reads global memory location l after the effect of thread 1's write is visible to all other threads. If a thread on SM 2 has previously accessed location l and the value stored at l is still cached in the L1 cache, thread 2's read operation returns the stale value from the cache. Then again, this coherency problem only makes an unpredictable situation even more unpredictable, i.e. even without the stale data problem it cannot be statically known whether thread 2 reads the old or the new value at location l. Thread 2 might read location l before thread 1 even executes the write command, or before thread 1's write command is fully processed by the memory subsystem. In that case, thread 2 reads the old value. If thread 2 executes the read command after thread 1's write command is fully processed, thread 2 might get the new value or the old value depending on whether the L1 cache contains a valid cache line for address l. All in all, reading and writing the same global memory address from different threads should be avoided.

Constant Memory. Constant memory is also located in DRAM and cached by several constant caches [11, IV.J]. Threads can only read constant memory locations; the CPU can also write to constant memory. Before Fermi introduced L1 and L2 caches for local and global memories, access to constant memory was generally faster because of the constant caches that already existed on previous generation devices. Even with cached global and local memory operations, there are still some hardware optimizations that might make reads of constant memory faster than global or local reads. Constant memory is partitioned into eleven 64 KByte banks on current GPUs. Bank zero is used for all statically sized and allocated constant variables, whereas the other banks are used to support the usage of constant arrays whose size is not known at compile-time. We neglect the concept of constant banks entirely, because allocations of constant memory occur on the host side of a program and are therefore outside the scope of the semantics. At runtime, constant banks introduce no semantically interesting features, as the only complications that arise are some advanced address calculations that are necessary to access different arrays within the same bank. The semantics fully supports any kind of address calculations for all types of addressable memory, so constant banks add nothing new.

Texture Memory. In graphics mode, the GPU typically samples thousands of texels each frame. What makes texture memory special is the layout of the texture data in the DRAM and in the texture caches. Since threads of the same warp typically sample texels that are close together in 2D, the texture memory and caches are optimized for 2D spatial locality [3, 5.3.2.5]. A CUDA application might benefit from this optimization if its data is organized appropriately.
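The stale-value scenario described above for global memory can be illustrated with two minimal kernels (names ours); whether the reader observes the old or the new value depends on timing and on the contents of its streaming multiprocessor's L1 cache, neither of which the program can control.

__global__ void writer(int *location)
{
    *location = 42;          // cached global write, eventually visible via the L2 cache
}

__global__ void reader(int *location, int *result)
{
    // May return the new value, the old value, or a stale copy held in this SM's
    // L1 cache; as stated above, such conflicting accesses should simply be avoided.
    *result = *location;
}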

2.3.3 Introduction to PTX

The parallel thread execution virtual machine and instruction set architecture, from now on referred to as PTX, is designed to allow efficient execution of general purpose parallel programs on Nvidia GPUs. PTX provides a stable instruction set spanning several compute capability versions and abstracts from the hardware instruction set of the target device. As already mentioned in section 2.2.4, devices of different major compute capability versions have different core architectures, whose hardware instruction sets might differ. PTX makes

low-level programming possible without tying the program to a specific compute capability version. Section 2.3.4 explains in greater detail how PTX programs are compiled down to the hardware instruction set. The squareArray kernel of program 2.1 is written in CUDA-C, an extension to ANSI C developed by Nvidia. As the formal semantics developed in this report is close to the hardware, C is already too high-level a language to base the semantics on. We therefore define the formal semantics on an extended subset of PTX. This section focuses on the basic features and characteristics of PTX; more detailed information about PTX can be found in the PTX specification [9].

 1  .version 2.1
 2  .target sm_20
 3
 4  .entry squareArray (.param .u32 a)
 5  {
 6      .reg .u32 arrayIndex;
 7      .reg .u32 address;
 8      .reg .f32 value;
 9
10      ld .param .u32 address, [a];
11      mul .u32 arrayIndex, %tid.x, 4;
12      add .u32 address, address, arrayIndex;
13      ld .global .f32 value, [address];
14      mul .f32 value, value, value;
15      st .global .f32 [address], value;
16      exit;
17  }

Listing 2.2: PTX version of program 2.1’s squareArray kernel

Program 2.2 is the PTX version of the squareArray kernel of program 2.1 already discussed earlier. The .version and .target directives inform the PTX compiler about the PTX language version and the target architecture the program is intended to run on. All 1.x PTX programs can be compiled for all sm_2x targets because of backward compatibility; the reverse is not possible. In line 4, the entry point squareArray is defined. Each PTX program can define multiple entry points, so the same PTX program can be used to launch different kernels on the GPU. Like its C counterpart, the kernel has one parameter, a. In the C version, a is a pointer to a floating point array in global memory. In PTX, however, a is an unsigned integer in the parameter state space where the value of the pointer to global memory is stored. Function input and return parameters are stored in registers or in local memory depending on hardware constraints, whether the function is or might be called recursively during the execution of the program, and what is optimal in terms of performance. It is up to the PTX compiler to decide where the parameters are actually located. In the case of kernel parameters, though, the actual storage location of input parameters is always constant memory [3, B.1.4]. In line 10, the ld instruction is used to load the value at address a into register address. In this case, the compiler replaces the .param state space of the ld operation with .const, because the .param value a loaded by this operation lies in constant memory. The address stored in register address is the location of the floating point array in global memory. In order to access the array at the index corresponding to the thread's thread index,

the thread index stored in special register %tid.x is first multiplied by 4 in line 11 and then added to the starting address of the array in line 12. This mimics the semantics of the C expression a[idx]: To the location of a, add idx times the size of one element stored in the array. The multiplication with the element size is implicit in C, but explicit in PTX. The size of a floating point value is 4 bytes. In line 13, the program loads the array's value stored at the computed address in global memory and squares the loaded value in line 14. The computed value is subsequently written back to the same address in global memory in line 15. The exit statement in the next line signals the termination of the threads to the warp. PTX supports five basic types: signed integers, unsigned integers, floating point values, bit (untyped) values, and predicates. Except for predicates, each type can be used with different sizes. For instance, .f32 denotes a 32 bit floating point value and .b64 specifies a 64 bit untyped value. There are complex conversion rules between the different types, defined in [9, Table 13]. Furthermore, registers are virtual in PTX, so even though we defined two registers arrayIndex and address in lines 6 and 7, the compiler might decide to use only one physical register to improve resource utilization on the hardware. In the following chapters, the formal semantics is defined on an extended subset of PTX. From the features discussed in this section, we ignore data types, the parameter state space, and virtual registers and assume that type checks, parameter state space resolution, and virtual register to physical register mapping have already been performed by a compiler. This allows us to focus on the core semantical issues without being overwhelmed by PTX's convenience features. The instructions and features supported by the semantics can be found in section 5.1.
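The address arithmetic of lines 11 and 12 can also be written out at the CUDA-C level. The following kernel sketch (ours, not from the report) makes the otherwise implicit scaling by the element size explicit, mirroring the PTX mul/add/ld/st sequence line by line.

__global__ void squareArrayExplicit(float *a)
{
    unsigned int arrayIndex = threadIdx.x * 4u;                  // mul .u32 arrayIndex, %tid.x, 4
    char *address = reinterpret_cast<char *>(a) + arrayIndex;    // add .u32 address, address, arrayIndex
    float value = *reinterpret_cast<float *>(address);           // ld .global .f32 value, [address]
    value = value * value;                                       // mul .f32 value, value, value
    *reinterpret_cast<float *>(address) = value;                 // st .global .f32 [address], value
}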

2.3.4 Compilation Infrastructure

Nvidia supplies a compiler that separates host and device code of a CUDA-C file and compiles each of those either for the CPU or for the GPU. The device code can either be compiled into PTX or into binary GPU code. Binary code is architecture specific, meaning that code compiled for a device of compute capability x.y can only be run on devices of compute capability x.z with z ≥ y. If a CUDA application has PTX code embedded into its executable file instead, a just-in-time compiler compiles the PTX program into binary code at runtime. This approach has several advantages. First of all, the CUDA application also supports future hardware that was not yet available at compile time, as a PTX program for some specific compute capability can always be compiled for a device of equal or greater compute capability. Moreover, future compiler updates might generate more efficient code, so the application's performance might increase without the need to recompile and redeploy the application. On the other hand, just-in-time compilation increases application load time [3, 3.1]. Hybrid approaches can be used as well, i.e. code can be compiled ahead-of-time for some devices and just-in-time for others. Nvidia has recently released a first beta version of the cuobjdump disassembler, a tool that decompiles binary code into a PTX-like assembly language. At the time of this writing, the cuobjdump disassembler can only be downloaded by registered Nvidia developers, and decompilation of programs compiled for compute capability 2.x targets is not yet supported. Still, some valuable insights can be gained by inspecting the binary code produced by the PTX compiler for 1.x targets. Some of the assumptions that we make in later chapters are based on observations made using the cuobjdump disassembler.
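The just-in-time path described above can be sketched with the CUDA driver API: a PTX image embedded in the executable as a string is compiled for the installed device when the module is loaded. This is only an illustrative sketch under our own assumptions; error handling is omitted and ptxSource stands for the embedded PTX text, e.g. the contents of listing 2.2.

#include <cuda.h>

extern const char *ptxSource;   // the embedded PTX program, stored as a string (placeholder)

CUfunction loadSquareArray()
{
    CUdevice device;
    CUcontext context;
    CUmodule module;
    CUfunction kernel;

    cuInit(0);
    cuDeviceGet(&device, 0);
    cuCtxCreate(&context, 0, device);
    cuModuleLoadData(&module, ptxSource);                 // JIT-compiles the PTX for this GPU
    cuModuleGetFunction(&kernel, module, "squareArray");  // entry point from listing 2.2
    return kernel;
}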

2.4 Alternatives to CUDA

CUDA is an Nvidia-exclusive technology, so only Nvidia GPUs are capable of executing CUDA applications. While this exclusivity is beneficial in that the application programming model can be fully optimized for the hardware, it makes it impossible to support other parallel processors such as AMD GPUs or IBM's Cell chip. Basically, an application supporting both AMD and Nvidia GPUs would have to incorporate two versions of any kernel it executes: One written in CUDA, and one written for the Compute Abstraction Layer (CAL), AMD's CUDA equivalent [12]7. Both the Khronos Group and Microsoft released cross-vendor parallel computing frameworks to remedy this situation. The remainder of this section briefly compares CUDA to those two competing technologies.

2.4.1 OpenCL

Apple initiated the development of OpenCL. Like the OpenGL graphics API, OpenCL is an open standard developed by the Khronos Group. The specification [13] currently defines OpenCL version 1.1, which has many similarities to CUDA for devices of compute capability 1.x. Newer CUDA features such as recursion are not yet supported by OpenCL.

CUDA                        OpenCL
Grid                        NDRange
Thread Block                Work Group
Thread                      Work Item
Streaming Multiprocessor    Compute Unit
CUDA Core                   Processing Element
Global Memory               Global Memory
Constant Memory             Constant Memory
Shared Memory               Local Memory
Local Memory                Private Memory

Figure 2.7: Mapping Between CUDA and OpenCL Terminology

Figure 2.7 shows how several CUDA concepts can be mapped directly to their OpenCL counterparts [2, chapter 11]. OpenCL's thread and memory hierarchies as well as the underlying hardware model are closely related to CUDA's. This is to be expected, however, as CUDA's and OpenCL's main goals are to achieve the highest possible performance. Therefore, the specification must respect the characteristics of the underlying hardware. In contrast to CUDA, the underlying hardware can also be ATI GPUs, x86 CPUs, or IBM's Cell chip [13, 12]. OpenCL defines a programming language similar to C that is used to develop OpenCL kernels. There is no standalone compiler to create binary code for an OpenCL kernel. The code of some OpenCL kernel is stored within the host program's executable as a string. Before a kernel can be executed on an OpenCL device, it has to be compiled at runtime. Nvidia's OpenCL drivers compile an OpenCL program into PTX [14, 2.2.1], but are currently unable to cache the compiled program to emulate CUDA's precompilation support [15, 10].

7All referenced AMD documents can be found at http://developer.amd.com/gpu/ATIStreamSDK/pages/Documentation.aspx, last accessed 2010-10-15.

In a way, the semantics defined in this report also defines a semantics for OpenCL, but we do not check if Nvidia's compilation process fully preserves OpenCL's semantics. This is actually unlikely, as OpenCL does not expose the SIMT execution model of Nvidia GPUs [16, 3], which can lead to unexpected program behavior as shown in chapter 6. Current ATI GPUs also organize threads in warps, called wavefronts, so the same problem should also exist for ATI GPUs [17, 1.3.2]. Whereas the warp size is 32 for all current Nvidia GPUs, the wavefront size varies between 16, 32, and 64 for current ATI GPUs [12, Glossary-A-9]. Hence, OpenCL kernels might exhibit vastly different performance characteristics depending on the hardware they are executed on. For a wavefront or warp size of n threads, performance is reduced by a factor proportional to n in the worst case. Additionally, OpenCL programs have to deal with different feature sets supported by the hardware.

2.4.2 Direct Compute

Microsoft introduced a new shader type with DirectX 11: compute shaders. Written in HLSL, a high-level C-like programming language for DirectX shaders, compute shaders are similar to CUDA and OpenCL kernels.

CUDA                        Direct Compute
Grid                        Dispatch
Thread Block                Thread Group
Thread                      Thread
Streaming Multiprocessor    Multiprocessor
CUDA Core
Shared Memory               Group-shared Memory

Figure 2.8: Mapping Between CUDA and Direct Compute Terminology

Due to the similarities between CUDA and Direct Compute, it is again possible to map some of the CUDA concepts to their Direct Compute counterparts as shown in figure 2.8 [18]. However, there are some differences concerning memory. In Direct Compute, memory is partitioned into resources, which are either textures or buffers for arbitrary data formats. The application uses resource views for these resources, which define properties like access patterns or implicit format conversions and subsequently allow the graphics driver to optimize memory access. It is also possible to create several views for the same resource. Compute shaders operate on resource views and can use group-shared memory for thread communication within the same thread group [19]. CUDA, OpenCL and Direct Compute can be used in conjunction with graphics rendering. For computer games, Direct3D is the dominant graphics API and compute shaders can be tightly integrated into a Direct3D 11-based rendering engine. For instance, [20] shows the advantages a deferred rendering implementation using compute shaders has over a traditional pixel shader based implementation; the compute shaders' greater flexibility makes it possible to render many more lights in the scene at the same performance level. Additionally, compute shaders are already used in recent games to accelerate texture compression algorithms8 and post processing effects such as screen space ambient occlusion9.

8http://www.pcgameshardware.com/aid,776086, last access 2010-10-10
9http://www.anandtech.com/show/2848/2, last access 2010-10-10

The feature set that must be supported by compute-shader-compatible hardware is more strictly defined than in the OpenCL specification. Currently there are two feature levels, one for DirectX 10 compatible graphics cards and one for DirectX 11 compatible GPUs. Once an application has verified that the hardware supports one or both of the feature sets, it can be sure that all of the corresponding compute shader features are supported by the hardware. Still, the specific implementation of a feature might expose different performance characteristics for different hardware devices. Consequently, it might be necessary to develop different implementations of the same algorithm not only because of features unsupported by some hardware, but also because of features running too slowly on some hardware.

3 Conventions, Global Constants, and Rules

Due to the low-level nature of PTX and the complexity of CUDA's programming model, we have to define the domains and rules of the formal semantics in extensive detail. To help reduce the clutter, we introduce some conventions in section 3.1 that allow us to express the concepts as concisely as possible. Furthermore, many of the domains and functions defined in chapters 4 and 5 depend on certain parameters like the number of registers per SM or the amount of global memory available on the graphics card. We avoid having to explicitly state the required parameters for each domain and function by declaring these parameters as global constants in section 3.2 and by implicitly parameterizing all definitions over this set of constants.

3.1 Conventions

Throughout the remainder of this report, we use the following conventions in all formal definitions wherever appropriate:

Optional Elements and Failure Elements. We use ε and ⊥ to denote the missing element and the failure element, respectively. For some domain A, it is always true that ε ∉ A and ⊥ ∉ A. We define the liftings Aε := A ∪ {ε}, A⊥ := A ∪ {⊥}, and Aε,⊥ := A ∪ {ε} ∪ {⊥}. For some a ∈ A, we know that a ≠ ε and a ≠ ⊥, whereas for aε ∈ Aε, we only know that aε ≠ ⊥, but aε might be ε or any other element of A. When we write a ∈ A or a = f(a′) for some function f : A → Aε, a cannot be ε, so the selection or assignment is only possible if the selected or returned value is not ε. For example, this allows us to shorten some side conditions in function definitions with case distinctions. The following two definitions of the function g : A → A⊥ illustrate the usage of this convention. With function f defined as above, the following two definitions of g are equivalent:

g(a′) = a    if a = f(a′)              g(a′) = aε   if aε = f(a′) ∧ aε ≠ ε
        ⊥    if ε = f(a′)                      ⊥    if ε = f(a′)

Functions. Oftentimes, the codomain of some function includes the failure element. Usually the definitions of those functions contain case distinctions. Whenever we do not specify all cases for such a function, the remaining cases are implicitly assumed to return ⊥. Alternatively, we use the word "otherwise" to denote a side condition that is true if the side conditions of all other cases are not true. We do not use partial functions. It is sometimes necessary to change the value a function returns for a specific input. We establish the notation f[a ↦ b] for some function f : A → B that serves this purpose and is defined as follows:

f[a ↦ b](a′) = b       if a′ = a
               f(a′)   otherwise

For convenience, we use the following notations to update several values at the same time: f[a1 a2 ↦ b] = f[a1 ↦ b][a2 ↦ b], f[a1 a2 ↦ b1 b2] = f[a1 ↦ b1][a2 ↦ b2], and _ represents all input values, such that f[_ ↦ b](a) returns b for all input values a. This is only well-defined if all updated input values are mutually distinct.

Lists. Some domains defined in the following chapters comprise sets or lists of elements of other domains. In all of these cases, we consistently use lists rather than a combination of sets and lists for reasons of clarity, even if the ordering of elements inside the list is not important. The domain A∗ denotes the set of all lists with elements in A whose length is arbitrary but finite. The domain A^n denotes the set of all lists with elements in A whose length is exactly n ∈ N. ε denotes the empty list. The operations on lists that we use throughout the next chapters are briefly and informally introduced in figure 3.1.

Operation              Description

a :: ~a                Concatenates an element and a list. For convenience, we also use the
                       concatenation operator to concatenate two lists.
|~a|                   Returns the length of a list; obviously, |~a| = n for some list ~a ∈ A^n with
                       n ∈ N.
a ∈ ~a                 True if the list contains element a somewhere. This is well-defined as
                       lists are always of finite length.
~a_k                   Returns the element at position k ∈ N in the list. Undefined if |~a| < k.
~a[k ↦ a]              Returns a new list where the element at position k ∈ N is replaced with
                       a. This notational convention is equivalent to that of function updates,
                       so we use the same convenience notations for multiple updates at the
                       same time. Undefined if |~a| < k.
a ◦ ~a                 Inserts a anywhere in ~a. For convenience, we also use this operator to
                       insert one list into another. In that case, the elements of the first list
                       are inserted in any order and at any location into the second list; in
                       other words, the first list is not contiguously inserted into the second one.
a ~a                   Same as above; however, if the list already contains element a, it is not
                       inserted again.
~a = ⋃ i∈{1,...,n} aε,i  Similar to writing ~a = aε,1 ◦ ... ◦ aε,n; however, if an aε,i = ε, it is not
                       inserted into the list, hence ε ∉ ~a.
~a \ a                 Removes all occurrences of a from the list, thus a ∉ ~a \ a.
a1 ... an              An alternative to writing a1 ◦ ... ◦ an.

Figure 3.1: Brief overview of list operations for elements a and lists ~a
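To give these notational conventions a concrete, purely illustrative reading (ours, not part of the formalization), the function update f[a ↦ b] from above and the list update ~a[k ↦ a] from figure 3.1 can be modelled as ordinary C++ functions; all names below are placeholders.

#include <cstddef>
#include <functional>
#include <vector>

template <typename A, typename B>
std::function<B(A)> updateFn(std::function<B(A)> f, A a, B b)
{
    // f[a -> b]: behaves like f everywhere except at a, where it returns b.
    return [=](A x) { return x == a ? b : f(x); };
}

template <typename T>
std::vector<T> updateList(std::vector<T> xs, std::size_t k, T x)
{
    xs.at(k - 1) = x;   // ~a[k -> x]; the report indexes list positions starting at 1
    return xs;
}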

Meta-Variables. Words starting with a capital letter denote a semantic domain. For instance, the set of two-dimensional vectors might be defined as Vector2 = R × R. To denote a single element of a domain, we use meta-variables that are identical to the domain's name starting with a lower-case letter; in this example, vector2. The default meta-variable for a domain can be overridden by explicitly specifying the meta-variable when defining the domain, such as v ∈ Vector2 = R × R. In this case, v denotes an element of domain Vector2. We also use meta-variables in variously decorated forms such as v′, v″, v0, v1, and so on. We adorn a domain's meta-variable with an arrow at the top to denote a list of elements of the domain, for example ~v ∈ Vector2∗.

Handling of Tuples. Most of the semantic domains we deal with in the following chapters consist of several sub-domains. Let X = A × B × C × D × E∗ be a domain and x ∈ X be an element of that domain. To get the projection of x to one of its sub-domains, we subscript x with the meta-variable of the domain we want to project to. For instance, to get x's projection to domain D, we write xd = π4x. We use the same convention if the projected sub-domain is a list, so x~e = π5x. It often happens that we have a complex tuple type instance like (a, b, c, d, ~e) ∈ X, but in a rule or function we actually only need a few of the variables. For example, suppose we only need to use c and ~e in a rule. In that case, we write x . c, ~e. This way, we can reference the entire tuple with x and do not need any projections to access x's sub-domains. When using this projection syntax, we list the meta-variables of the sub-domains in the order they were specified in the definition of X. If this notational convention causes any ambiguities, we either resort to the standard tuple notation or list all sub-domains in our projection notation. To update only some of a tuple's values, we use a notation similar to the projection syntax, replacing . with /. Thus, if x . a, b, c, d, ~e, we write x′ = x / a′, ~e′ as an alternative to x′ = (a′, b, c, d, ~e′). The projection and update syntax for tuples is used frequently throughout the rules and functions defined in the following chapters.

3.2 Global Constants

We implicitly parameterize all domains, functions, and rules defined in the following chapters over the set of global constants listed in figure 3.2. Unless the name of a constant is self-explanatory, we give a short description or further information for the constant. We also list the values of the constants based on the specification of the Fermi-based GeForce GTX 480 in order to give an idea about the typical magnitude of the constants' values. The values are collected from [3], [9], and the graphics card's product page on the Nvidia website1.

3.3 Rules

Except for the semantics of memory programs presented in section 4.4.4, we formalize CUDA's model of computation using a structured operational semantics, i.e. a small-step semantics. We define the rules of the semantics over transition systems implicitly defined by specifying their configurations ⟨c⟩ and their transitions ⟨c⟩ →x ⟨c′⟩. We identify each transition system defined throughout this report by its transition symbol; in the example above, →x is the transition system's name. Transitions can optionally carry input and output values, in which case we annotate the transition arrow with the input in above and the output out below it. Therefore, the general structure of some rule (r) for some

1http://www.nvidia.com/object/product_geforce_gtx_480_us.html, last access 2010-10-09

transition system →, some configuration Γ, some Boolean side condition b, and some context function C(Γ) is:

           Γ → Γ′
(r)   ─────────────────   if b
        C(Γ) → C(Γ′)

For instance, consider rule (seq1) of transition system →Op defined in section 4.4.4. The configurations and transitions are defined as follows; there are two possible configurations and two possible transitions that might take place. Each transition outputs a yield action ya. Transition system →Op is special in the sense that it does not define a true small-step semantics; rather, it defines micro steps which resemble more of a big-step semantics, and it then uses a small-step semantics to connect sequences of micro steps. A micro step ends once a statement yields execution or terminates the program.

Configurations :  ⟨stm, η, σ⟩ ∈ MemStm × MemEnv × Σ,   ⟨η, σ⟩ ∈ MemEnv × Σ
Transitions :     ⟨stm, η, σ⟩ −→Op_ya ⟨stm′, η′, σ′⟩,   ⟨stm, η, σ⟩ −→Op_ya ⟨η′, σ′⟩

Rule (seq1) executes the first statement of a sequential composition, which yields program execution. There are two possibilities: either the first statement is completed, or its remaining statements have to be executed during the next micro step.

              ⟨stm1, η, σ⟩ −→Op_yield ⟨stm1′, η′, σ′⟩
(seq1a)   ───────────────────────────────────────────────────
            ⟨stm1 stm2, η, σ⟩ −→Op_yield ⟨stm1′ stm2, η′, σ′⟩

              ⟨stm1, η, σ⟩ −→Op_yield ⟨η′, σ′⟩
(seq1b)   ───────────────────────────────────────────────────
            ⟨stm1 stm2, η, σ⟩ −→Op_yield ⟨stm2, η′, σ′⟩

We combine these two rules into one rule that handles both cases by using a notational convention: ⟨[stm1′] stm2, η′, σ′⟩ represents a configuration of both configuration types, depending on whether the first statement is fully processed during the current micro step.

Thus, rules (seq1a) and (seq1b) can be combined into:

             ⟨stm1, η, σ⟩ −→Op_yield ⟨[stm1′,] η′, σ′⟩
(seq1)   ───────────────────────────────────────────────────
           ⟨stm1 stm2, η, σ⟩ −→Op_yield ⟨[stm1′] stm2, η′, σ′⟩

We define Γx to be the set of infinite sequences of execution steps of transition system →x. For example, ⟨stm1, η1, σ1⟩ −→Op_ya1 ⟨stm2, η2, σ2⟩ −→Op_ya2 ... ∈ ΓOp. Obviously, runs of terminating memory programs are not elements of ΓOp.

Constant             GTX 480           Remark

MaxBlocksPerGrid     < 2^32            The maximum number of blocks per grid. The actual
                                       number might be lower due to other hardware constraints.
                                       The specification does not state a specific value.
MaxThreadsPerBlock   1024              The maximum number of threads per block. The actual
                                       number might be lower due to other hardware constraints.
MaxGridSize x,y,z    65535, 65535, 1   Currently, the value of z must be 1 [3, 2.2].
MaxBlockSize x,y,z   1024, 1024, 64
WarpSize             32                The number of threads per warp.
MaxResidentBlocks    8                 The maximum number of resident blocks per SM.
MaxResidentThreads   1536              The maximum number of resident threads per SM; the
                                       maximum number of resident warps per SM is therefore
                                       MaxResidentThreads / WarpSize = 1536 / 32 = 48.
NumBarrierIndices    16                The number of barriers that can be used by each
                                       thread block.
NumProcessors        15                The number of streaming multiprocessors on the chip.
NumCores             32                The number of CUDA cores per SM.
NumConWarps          2                 The maximum number of warps that can be executed
                                       concurrently on each SM.
NumConKernels        16                The maximum number of kernels that can be executed
                                       concurrently on the GPU.
MaxProgSize          2 million         The maximum number of instructions per kernel.
MaxRegSize           4 Byte            The maximum register size in bytes.
MaxMemOpSize         128 Byte          The maximum memory request size in bytes.
SharedMemBanks       32
GlobalMemSize        < 1536 MByte
LocalMemSize         512 KByte         The amount of local memory per thread.
ConstMemSize         64 KByte
SharedMemSize        16 or 48 KByte    The amount of shared memory per SM.
RegFileSize          32768             The number of registers per SM.
Level2CacheSize      768 KByte
Level1CacheSize      16 or 48 KByte    The size of the L1 cache per SM.

Figure 3.2: Global Constants Used Throughout the Semantics
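Several of these constants can be queried from the hardware at runtime. The following host-side sketch (ours; the mapping given in the comments is our reading and only approximate, e.g. regsPerBlock coincides with the per-SM register file size on Fermi) prints the fields of cudaDeviceProp that correspond to entries of figure 3.2.

#include <cstdio>
#include <cuda_runtime.h>

void printConstants()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    std::printf("WarpSize           = %d\n",  prop.warpSize);            // 32 on the GTX 480
    std::printf("NumProcessors      = %d\n",  prop.multiProcessorCount); // 15
    std::printf("MaxThreadsPerBlock = %d\n",  prop.maxThreadsPerBlock);  // 1024
    std::printf("MaxGridSize_x      = %d\n",  prop.maxGridSize[0]);      // 65535
    std::printf("SharedMemSize      = %zu\n", prop.sharedMemPerBlock);   // 16 or 48 KByte
    std::printf("ConstMemSize       = %zu\n", prop.totalConstMem);       // 64 KByte
    std::printf("RegFileSize        = %d\n",  prop.regsPerBlock);        // 32768 registers
    std::printf("GlobalMemSize      = %zu\n", prop.totalGlobalMem);      // < 1536 MByte
}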

4 Formal Memory Model

CUDA features a complex memory model whose semantics are only implicitly and partially documented. We base the following discussions on what is said about memory in [3], [9], [7], and [5]. We explicitly state whenever we have to interpret or knowingly deviate from the informal specification of the memory model in those documents. Additionally, we incorporate the information found in patents [21], [22], and [23]. These patents have been released quite recently and fit our general understanding of the memory model gained from the CUDA specification. We assume that the patents describe the actual hardware implementation, although we cannot be certain about that. But as it is unlikely that Nvidia releases patents so early with no hardware yet or at all that puts the patented techniques to use, we feel it is safe to use the information provided by these patents whenever the other documents remain too vague. The memory model is complex but plays an important part in developing and optimizing CUDA applications. Furthermore, it is unlike the x86 memory model developers have grown accustomed to. Therefore, we intend to formalize the memory model as precisely as possible given the limited amount of information available. With the exception of textures, all types of memories introduced in chapters 2.2.3 and 2.3.2 are discussed in the subsequent sections. We ignore textures because the GPU performs several optimizations to the layout of the data in memory which are entirely unspecified. Additionally, texture address calculations and filtering operations would complicate the formalization but add little to the deeper understanding of the semantics of CUDA programs. For a similar reason, we do not formalize the constant cache. In contrast to what [3] suggests, there seem to be three constant caches on the GPU [11, IV.J], some of which appear to be used as instruction caches as well. Moreover, as data in constant memory does not change during the execution of a program, cached reads of constant memory are fully transparent to the program; the program cannot even specify whether it wants to use the caches when fetching a constant value. Consequently, we do not formalize the constant caches for reasons of brevity. On the other hand, the streaming multiprocessors’ L1 caches do have an impact on program semantics and are therefore included in the formalization. The L2 cache is formalized to fill the gap between the L1 caches and DRAM. Furthermore, PTX programs can specify a cache operation that should be used to read or write a value from or to global and local memory. The operations currently supported by the hardware and PTX are listed in [9, Table 80] and [9, Table 81]. The cache operation specified by a load or store instruction determines the eviction policy used to decide if the cache line containing the requested address can be evicted, whether the data should be retrieved from or written to the cache or memory directly, and whether the data is evicted immediately after the current request completes. Embedding all of this behavior into a single transition system would result in fairly verbose and complex rules; it is impossible to define self-contained rule systems for each cache and memory type as the dependencies and interactions between them are complex. Instead, we opt for another approach where we develop a programming language that we in turn use to define the semantics of the memory model. We give a brief introduction to the features of the language in section 4.1 and formally

define its syntax and semantics throughout the rest of this chapter. We have to define some basic domains first before we are able to concentrate on the formalization of the semantics. To uniquely identify in-flight memory operations, we define an abstract domain of memory operation indices and assign a unique element of this domain to each memory operation.

mid ∈ MemOpIdx

Furthermore, we allow byte-by-byte access to all memory types. The number of bytes read or written by one request ranges between one byte and MaxMemOpSize bytes. The global constant MaxMemOpSize determines the maximum number of bytes that can be read from or written to memory by a single request.

size ∈ MemOpSize = {1, ..., MaxMemOpSize}

However, this approach is in contrast to the actual hardware implementation of certain memory operations. For instance, global memory accesses are always serviced with either 32 or 128 byte transactions [10, G.4.2]. Still, our formalization of the memory model is indeed capable of simulating the hardware's actual behavior; in the case of global memory accesses by using a 128 byte aligned base address and a memory operation size of 128 byte. We later introduce an abstract translator function that translates memory requests issued by threads into memory operations. One of the function's responsibilities is to ensure that the correct alignments and memory sizes are used to access memory. Additionally, the function coalesces memory requests of threads of the same warp into fewer but larger ones in accordance with the rules outlined in [3, G.4]. We do not give a concrete implementation of this function as the list of coalescence rules found in the specification is not exhaustive.

In a similar fashion, we define the domain of data words. Data words represent values that are passed around in PTX programs, hence we assume that each register is big enough to hold an entire data word. In reality, however, registers can only store 32 bits. For larger data types like double precision floating point values, PTX transparently uses two registers to store the data. Since we already rely on the compiler to correctly handle register assignment, we avoid further complicating the formalization of the memory model by abstracting from data types and sizes and by assuming that all registers are large enough to store all values. Also, a register can be either in a ready or blocked state; blocked registers cannot be accessed by their threads because they are used by some pending memory operation as either source or destination registers.

b ∈ Byte = {0, 1}^8                          RegState = {ready, blocked}
d ∈ DataWord = Byte^MaxRegSize               RegValue = DataWord × RegState

Threads perform their address calculations in the virtual address space of their context. When a memory request is translated into a memory operation, the virtual address or addresses requested by the threads are translated into physical addresses. We define the abstract domain PhysMemAddr that we use to address a location in any of the different memory types. The smallest addressable unit is one byte for global, local, shared, and constant memory and one register for the register file. From the point of view of PTX programs, virtual and physical addresses are just data, hence the domain of physical memory addresses is a subset of the domain of data words.

pAddr ∈ PhysMemAddr ⊆ DataWord
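The alignment duty of the abstract translator function mentioned above can be sketched as follows; this is our own illustration under the assumption that a request is mapped onto a single 128-byte aligned transaction of MaxMemOpSize bytes, and the names are placeholders — the translator itself, including its coalescing rules, remains abstract in the formalization.

#include <cstdint>

constexpr std::uint64_t kMaxMemOpSize = 128;   // bytes, cf. figure 3.2

// Base address of the 128-byte transaction that covers physAddr.
std::uint64_t transactionBase(std::uint64_t physAddr)
{
    return physAddr & ~(kMaxMemOpSize - 1);
}

// Offset of the requested byte within that transaction.
std::uint64_t transactionOffset(std::uint64_t physAddr)
{
    return physAddr & (kMaxMemOpSize - 1);
}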

The next section informally introduces the aforementioned memory programs and gives a brief overview of the memory model's basic structure. With this in mind, we formalize the caches and the memory environment in sections 4.2 and 4.3, respectively. These sections include all of the functions that are needed to define the semantics of memory programs in section 4.4. Subsequently, section 4.6 puts all these different pieces together to shape the overall semantics of the CUDA memory model.

4.1 Overview of Memory Programs

As already mentioned above, we define the semantics for a language that we in turn use to define the semantics of the memory model instead of using the traditional rule-based approach. Using memory programs in favor of large rule systems simplifies the formalization and allows us to add support for new memory operations without affecting previously defined programs. This is especially interesting because we do not give program implementations for volatile memory accesses, as the specification is severely lacking information about the interplay of volatile accesses and caches. But if Nvidia ever releases more detailed information about this subject, it will only be a matter of writing some memory programs for volatile memory operations, leaving the rest of the memory model unaffected. The memory programs declaratively define what each memory operation does and in which order caches and memories are accessed. They invoke mathematically defined functions that specify how caches and memories are manipulated and if the state manipulation is actually possible; it might be possible, it might not be possible at all, or it might be possible but not right now. We use a mechanism similar to cooperative multithreading to support the case that manipulating a cache or memory is not possible at the moment but might eventually become possible later on. When threads execute a PTX program, our memory model formalization launches an instance of a corresponding memory program for each memory operation performed by the threads. Only one memory program is executed at a time. Distinct memory operations are not processed concurrently but interleaved; the order in which distinct programs are processed is mostly undefined. Moreover, a program is free to yield execution at any time, allowing another program to continue. All of this is illustrated by program 4.1 that we use to give an informal overview of the semantics of memory programs. The overview is intended to demonstrate the context for the mathematical definitions in the subsequent sections where we begin to formally define the semantics of memory programs and the underlying mathematical framework. The semantics of the memory model is inspired by Esterel's constructive behavioral semantics [24] in the sense that each program executes an uninterruptible series of instructions until it yields execution. The semantics as a whole is defined by arbitrarily choosing and executing some unfinished programs during each step. The memory program presented in listing 4.1 defines the semantics of a store operation to global memory with the .wb cache operation. When some threads of a warp execute a st.global.wb statement, one or more instances of program 4.1 are launched to service the request; whether one or more program instances are launched depends on whether the requests issued by the threads can all be coalesced into a single memory transaction. The order in which different memory programs are processed is generally undefined — except for volatile reads and writes as explained later. Each program instance implicitly carries some state that includes the requested program address, the amount of bytes to read or write, the

1  discardL1;
2  yield;
3
4  writeL2;
5  classifyNormalL2;
6  releaseRegs;
7  end;

Listing 4.1: Memory program for global writes using the .wb cache operation

read or written data, the index of the processor the threads that issued the request belong to, and more. Program execution begins at line 1 where the L1 cache is instructed to discard any matching cache line. Writes to global memory are not cached in L1 because the L1 caches of different SMs are not kept coherent as mentioned in section 2.3.2. The discard operation removes a matching cache line from the cache, even if the cache line contains dirty data that would have to be written back first. However, this is not a problem in this situation: As global writes never write to L1, the cache line cannot be dirty, and even if it were, the data would be overwritten anyway. If the discard is successful or if the address is not cached, the memory program executes the yield statement which causes the memory subsystem to stop executing the program for the moment. Other programs can be executed now; we allow this to happen because the specification does not give any guarantees concerning the atomicity of memory operations. At some later point in time, program 4.1 continues execution at line 4. However, it is also possible that the discard operation fails, because the cache line is pinned by at least one other memory operation, meaning that it cannot be evicted right now. In this case, the memory program performs an implicit yield and tries to execute line 1 again at a later point in time. The program attempts to execute the discard operation until it eventually succeeds and normal program execution continues. Once the discard and the subsequent yield complete, the data is written to the L2 cache in line 4. Again, this operation can succeed immediately or cause an implicit yield. If the write eventually succeeds, the cache line’s eviction class is set to normal, all registers blocked by the operation are released, and the program terminates. If the write to the L2 cache is instantaneous, the last three lines are executed atomically and cannot be interrupted by any other memory program.

4.2 Formalization of Caches

The CUDA documentation [3] mentions several caches that are present on the GPU: the global L2 cache and the constant, texture, and L1 caches found on each streaming multiprocessor. In reality, however, it appears that there are three levels of constant caches, two levels of texture caches, plus three levels of instruction caches, all of which have varied associativity characteristics and cache line sizes [11]. For the reasons mentioned above, we do not formalize texture memory and therefore we also ignore the texture caches. Neither do we consider the constant and instruction caches, as these only affect execution performance but not program behavior. We only look at the L1 caches found in each streaming multiprocessor and the global L2 cache, all of which can be used to speed up reads from and writes to global and local memory. Since the L1 caches are not kept coherent as mentioned before, they might

affect program behavior and are thus included in our formalization. A cache stores a copy of a small section of memory in a cache line. When a memory read request is issued by some threads and this request is to be cached, the cache determines whether it can service the request by checking if there is a cache line that contains a copy of the requested memory location. If so, there is a cache hit and the request does not have to go to a higher cache or to memory. In this case, the latency of the memory operation is up to an order of magnitude lower than in the case of a cache miss, where another cache or the memory itself must service the request. The value fetched from memory or a higher cache as the result of a cache miss is stored in a cache line so that subsequent requests to the same memory location can be serviced more efficiently. However, after some time all cache lines are occupied, so a cache line has to be cleared before a more recently requested location can be cached. The policy used to evict cache lines can severely affect performance, as it must avoid evicting cache lines that will most likely be requested again later on. For this reason, PTX programs can provide hints as to the likeliness of a subsequent request to the same address. Furthermore, requests can choose to circumvent the L1 cache or both the L1 cache and the L2 cache altogether. We define a set of domains and operations to formalize the behavior of the caches. We leave some operations abstract, as we do not want to deal with complexities like cache associativity and eviction rules; again, these features only affect performance. We also use the same formalization for both the L1 and the L2 caches because no differences in functionality can be deduced from [3] and [9]. We first explain the structure of the caches before we formalize the supported operations.

4.2.1 Cache Lines

We define all cache operations on lists of cache lines; that way, we can easily describe the L1 and L2 caches as lists of cache lines of distinct fixed sizes and use the same operations for both types of caches. For each cache line, we have to know whether the cache line is dirty; a set dirty flag indicates that a new value has been written to the cache line that has not yet been written back to a higher cache or memory. Obviously, the cache should not evict a dirty cache line without writing back the new value first, as this would result in a loss of data. Additionally, each cache line has a valid flag that indicates whether the data stored in the cache line is valid. Using the valid flag, we can reserve cache lines. Suppose a memory request comes in, a matching cache line is not found, and the value has to be retrieved elsewhere in the hierarchy. Before the cache issues the command to retrieve the value, it first reserves an empty cache line that it can use to store the retrieved value. Otherwise, the cache would have to queue retrieved data and copy it into an empty cache line as soon as one becomes available. This would be problematic, however, because subsequent requests to the same address would all be cache misses too and unnecessary additional requests to the same memory location would be propagated up the hierarchy. Hence, the cache reserves a cache line first before it asks the memory subsystem to get the value and sets the cache line's valid flag to false. Subsequent requests to the same address that arrive before the data was retrieved from the memory subsystem are blocked until the cache line's data is valid and are later serviced directly from the cache. Formally, we define the dirty and valid flags as

Dirty = B Valid = B

As mentioned above, we do not formalize the eviction policy used by the caches, although [22] does give a rough idea of the algorithm used. However, we do pay attention to a cache line's eviction class, which is either first, normal, or last. Cache lines of class first are evicted before cache lines of class normal are evicted, if there are any clean cache lines of the first class. Some cache operations supported by PTX change a cache line's classification, hence we support it in our formalization; moreover, for some of the supported cache operations, the reclassification of cache lines is the defining feature. In contrast to [22], we do not support the class last because [9, Table 80] and [9, Table 81] do not mention this class being used by any of the cache operations. We assume this class is relevant in graphics mode only.

ec ∈ EvClass ::= first | normal

We assume that each cache line is big enough to store the largest possible memory request. In reality, however, this might not be the case, i.e. even though the largest request might access 128 byte of memory, the cache might only be able to store 64 byte per cache line. So this obviously only works if the function that translates memory accesses of threads into memory operations is aware of these limitations and issues, for example, two 64 byte requests instead of one 128 byte request to the cache and memory. Conversely, writing smaller values into the cache line is unproblematic.

cw ∈ CacheWord = Byte^MaxMemOpSize

cl ∈ CacheLine = PhysMemAddrε × CacheWordε × Dirty × Valid × MemOpIdx∗ × EvClass

Since we do not consider cache associativity — cache line lookup is hidden behind an abstract function — we can now define the L1 and L2 caches as fixed-size lists of cache lines and can use the same subsequently defined functions to describe their behavior:

L1Cache = CacheLine^Level1CacheSize        L2Cache = CacheLine^Level2CacheSize

We use the constant init to designate the default state of an unused cache line. Neither data, a program address, nor any memory operation indices are stored, the eviction class is normal, and the cache line is marked as clean and invalid.

init = (ε, ε, false, false, ε, normal)
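The CacheLine domain and the init constant can be given an illustrative C++ encoding (ours, not part of the formalization); the field names are our own and MemOpIdx is represented by a plain integer identifier.

#include <array>
#include <cstdint>
#include <optional>
#include <vector>

enum class EvClass { First, Normal };

struct CacheLine {
    std::optional<std::uint64_t> pAddr;                // PhysMemAddr_eps: eps (empty) when unused
    std::optional<std::array<std::uint8_t, 128>> cw;   // CacheWord_eps (MaxMemOpSize bytes)
    bool dirty = false;                                // Dirty
    bool valid = false;                                // Valid
    std::vector<std::uint32_t> mids;                   // MemOpIdx*: pinning memory operations
    EvClass ec = EvClass::Normal;                      // EvClass
};

// init = (eps, eps, false, false, eps, normal)
const CacheLine init{};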

4.2.2 Cache Operations

This section presents the operations supported by the caches. We first present some helper functions that we subsequently make use of to define the actual cache operations. We briefly explain the intention behind each function and then give a formal definition and concrete implementation of the function itself — unless the function is abstract, of course.

Helper Functions

The following two functions unused and allUsed decide whether a single cache line is unused and whether all of the cache lines are in use, respectively. The cache line’s program address determines if the cache line is in use; if the program address field is empty, the cache line is unused.

unused : CacheLine → B                          allUsed : CacheLine∗ → B
unused(cl) ⇔ clpAddrε = ε                       allUsed(−→cl) ⇔ ¬∃ cl ∈ −→cl . unused(cl)

A cache line is pinned if its list of memory indices is not empty. The functions pin and unpin are conveniently defined to add or remove a memory index from the cache line's list of indices. A memory index is only added to the list, however, if it is not already present, so the list of memory operation indices is duplicate-free.

unpin : CacheLine × MemOpIdx → CacheLine        pin : CacheLine × MemOpIdx → CacheLine
unpin(cl . −−→mid, mid) = cl / −−→mid \ mid       pin(cl . −−→mid, mid) = cl / mid −−→mid

Oftentimes we have to figure out whether there already is a cache line that holds a copy of the requested address. The function isCached recursively walks through the list of cache lines and returns true if and only if a matching entry is found. In particular, the function returns true even if the cache line has only been reserved but does not yet contain valid data.

isCached : CacheLine∗ × PhysMemAddr → B

isCached(ε, pAddr) = false
isCached(cl ◦ −→cl, pAddr) = true                      if clpAddr = pAddr
                             isCached(−→cl, pAddr)     otherwise
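For illustration only, the helper functions can be read as the following C++ sketch (ours), reusing the CacheLine struct sketched in section 4.2.1; a cache is represented as a vector of cache lines and all names are placeholders.

#include <algorithm>
#include <cstdint>
#include <vector>

bool unused(const CacheLine &cl) { return !cl.pAddr.has_value(); }

bool allUsed(const std::vector<CacheLine> &cls)
{
    return std::none_of(cls.begin(), cls.end(), unused);
}

CacheLine pin(CacheLine cl, std::uint32_t mid)
{
    // add mid only if it is not already pinning the line (duplicate-free list)
    if (std::find(cl.mids.begin(), cl.mids.end(), mid) == cl.mids.end())
        cl.mids.push_back(mid);
    return cl;
}

CacheLine unpin(CacheLine cl, std::uint32_t mid)
{
    cl.mids.erase(std::remove(cl.mids.begin(), cl.mids.end(), mid), cl.mids.end());
    return cl;
}

bool isCached(const std::vector<CacheLine> &cls, std::uint64_t pAddr)
{
    // true even for a line that has only been reserved (valid == false)
    return std::any_of(cls.begin(), cls.end(),
                       [&](const CacheLine &cl) { return cl.pAddr == pAddr; });
}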

The CUDA documentation does not give any details about the inner workings of the caches. On the other hand, [11] suggests that most caches found on the GPU are in some way associative. Associativity means that only a subset of all cache lines are eligible to store a certain address. The purpose of the abstract function eligible is to abstract from the concrete mechanism used to decide which cache lines can be used to cache a certain address. Given all cache lines and a memory address, it returns the set of cache lines that are eligible to cache the request. If the cache is not associative and does not use any other sophisticated mechanism to allocate cache lines for a request, the function might simply return the original set of cache lines. As functions are deterministic by their very nature, eligible always returns the same set of eligible cache lines for the same input, guaranteeing that cached values are found reliably.

eligible : CacheLine∗ × PhysMemAddr → CacheLine∗

canEvict decides whether a given cache line can be evicted. We do not give an implementation for this function; further details about the eviction policy most likely employed by Fermi GPUs can be found in [22]. For convenience, we define a function with the same name that takes a memory address instead of a specific cache line. If a cache line matching the address exists, it uses the abstract canEvict function to decide whether the cache line can be evicted.

Otherwise, it returns ⊥. It is well-defined because there cannot be more than one cache line with a matching address; this is implicitly guaranteed by all of the operations defined in this section.

canEvict : CacheLine∗ × CacheLine → B           canEvict : CacheLine∗ × ProgAddr → B⊥

Even though canEvict is abstract, we expect the function to adhere to the following two axioms: Dirty and pinned cache lines are never eligible for eviction.

canEvict(−→cl, cl . true, false) = false         canEvict(−→cl, cl . −−→mid) = false   if −−→mid ≠ ε

The write helper function copies the given cache word into the given cache line. The cache line’s valid and dirty flags are set to true. We assume there is a hardware optimization for the case that the written value is equal to the value already stored in the cache line. That way, we can keep the old dirty flag and might be able to skip an unnecessary write-back. For instance, the value in the cache line might have been read from memory, thus the data is valid and not dirty. If a program writes the same value to the address, there is no point in copying the same value to memory. However, we do not know if such an optimization really exists in hardware, though it seems likely.

write : CacheLine × CacheWord → CacheLine

write(cl . cw, cw) = cl
write(cl . cwε, cw′) = cl / cw′, true, true      if cwε ≠ cw′

Operations

The function reserve reserves a cache line for a specific memory address. It should not be used if there already is a cache line with a matching address, as there may be only one single cache line for each address. On the other hand, if there is no matching cache line, there are three possible cases: If there is an eligible cache line that is currently unused, this cache line is initialized and the requested memory address and the index of the requesting memory operation are copied into the cache line. For convenience, no memory operation index may be given; this allows us to use reserve in the write function defined below without having to remove the operation index immediately after the call to reserve. In the second case, there is no unused cache line in the set of eligible cache lines, but one of the eligible lines can be evicted. Since the function canEvict guarantees that it does not allow eviction of dirty cache lines, we can immediately reinitialize the cache line as outlined for the first case, because no write-back of the cached value is required. If no eligible cache line is empty or can be evicted, the reservation fails. However, failure in this case is expected — and the reservation will be retried at some point in the future — so we do not return ⊥ in this case, but return empty cache lines instead to designate this situation. The semantics of the memory program interprets this return value and pauses the memory program until the reservation eventually succeeds. reserve splits the given set of cache lines into the reserved cache line and the rest of the cache lines sans the reserved cache line. A function that uses the results returned from reserve therefore has to merge the result to obtain the original amount of cache lines again. While

this seems unintuitive, it is unfortunately necessary because oftentimes the reserved cache line needs to be manipulated further, so returning it directly from reserve is advantageous.

reserve : CacheLine∗ × PhysMemAddr × MemOpIdxε → (CacheLineε × CacheLine∗)⊥

reserve(ε, pAddr, midε) = ⊥
reserve(−→cl, pAddr, midε) = ⊥                                    if isCached(−→cl, pAddr)

reserve(−→cl, pAddr, midε) = (init / (pAddr, midε), −→cl \ cl)    if unused(cl)
                             (init / (pAddr, midε), −→cl \ cl)    if canEvict(−→cl′, cl) ∧ allUsed(cl ◦ −→cl′)
                             (ε, ε)                               otherwise

                             where cl ◦ −→cl′ = eligible(−→cl, pAddr)   if ¬isCached(−→cl, pAddr)

Before reading a value from a cache, memory programs must first check whether the requested value is actually cached at the moment. If it is not cached, it is the memory program's responsibility to retrieve the value from a higher level of the memory hierarchy after it has reserved a cache line for the requested memory address; this is not done automatically by the cache itself. Consequently, a call to read returns ⊥ if the requested address is not cached. If it is cached, however, there are two possible cases: Either the data is valid, in which case the data is immediately returned and the cache line is unpinned — the unpinning has no effect if the memory operation has not pinned the cache line before —, or the data is invalid, in which case no data is returned and the cache line is pinned. The intended usage scenario of the read function is to first check whether an address is actually cached and if so to read the value by invoking read. This way, the value is either read right now and the program continues, or it cannot be read right now and the program yields execution. Later, when it continues execution, it immediately calls read again. Again, it either gets the value or it has to wait for a later point in time. If the memory subsystem ever copies the requested value into the cache line, the program is guaranteed to read the value eventually, because the cache line cannot be evicted as long as the program does not unpin the cache line, i.e. as long as the program has not read the value.

read : CacheLine∗ × PhysMemAddr × MemOpIdx → (CacheLine∗ × CacheWordε)⊥

read(ε, pAddr, mid) = ⊥
read(−→cl, pAddr, mid) = (unpin(cl, mid) ◦ −→cl′, clcw)   if −→cl = (cl . pAddr) ◦ −→cl′ ∧ clvalid
                         (pin(cl, mid) ◦ −→cl′, ε)        if −→cl = (cl . pAddr) ◦ −→cl′ ∧ ¬clvalid
                         ⊥                                 if ¬isCached(−→cl, pAddr)

Memory programs use the function write to insert a value into the cache. write may only be used if the write request originates from the program; if the cache is to be updated after a cache miss, the function update defined below must be used instead. There are three possibilities when writing a value into a cache: If the requested address is not yet cached but it is possible to reserve a cache line right now, the cache line is reserved and the data is copied into the cache. Secondly, the requested address is already cached, so we can just copy the data into the cache line. However, we might write data into a cache line that has only been reserved and contains no valid data yet. For the moment, this is not a problem. But once

the value requested from a higher level of the hierarchy due to an earlier cache miss arrives, the value just written might be overwritten by an old value. Therefore, the update function defined below must not overwrite cache lines that have been updated in the meantime. The third possibility is that the requested address is not cached and a reservation is currently impossible. In this case, the memory program yields execution and retries at a later point in time, just as outlined above for cached reads.

write : CacheLine∗ × PhysMemAddr × CacheWord → (CacheLine∗)⊥

write(ε, pAddr, cw) = ⊥
write(−→cl, pAddr, cw) = write(cl, cw) ◦ −→cl′   if ¬isCached(−→cl, pAddr) ∧ reserve(−→cl, pAddr, ε) = (cl, −→cl′)
                         write(cl, cw) ◦ −→cl′   if −→cl = (cl . pAddr) ◦ −→cl′
                         ε                        if ¬isCached(−→cl, pAddr) ∧ reserve(−→cl, pAddr, ε) = (ε, ε)

The update function updates a cache after a cache miss. The function cannot be called when the cache line has not been reserved previously. If a cache line for the given memory address can be found, the data is written and the cache line is set to valid — unless there was a write to this cache line in the meantime, as outlined above.

update : CacheLine∗ × PhysMemAddr × CacheWord → (CacheLine∗)⊥

update(ε, pAddr, cw) = ⊥
update(−→cl, pAddr, cw) =
    (cl / cw, false, true) ◦ −→cl′   if −→cl = (cl . pAddr) ◦ −→cl′ ∧ ¬clvalid
    −→cl                             if −→cl = (cl . pAddr) ◦ −→cl′ ∧ clvalid
    ⊥                                if ¬isCached(−→cl, pAddr)
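A corresponding Haskell sketch of update, reusing the CacheLine stand-in from the previous sketch, makes the protection against overwriting newer data explicit: a line that became valid while the miss was outstanding is left untouched.

-- Reuses the CacheLine stand-in from the sketch above. update only fills a line that is
-- still invalid, i.e. still waiting for this miss; the formal definition additionally
-- clears the dirty flag and sets the valid flag.
updateCache :: [CacheLine] -> Int -> [Int] -> Maybe [CacheLine]
updateCache cls pAddr cw =
  case break ((== pAddr) . clAddr) cls of
    (_, [])         -> Nothing                  -- ¬isCached: calling update here is an error
    (pre, cl : post)
      | clValid cl  -> Just cls                 -- newer data already present, keep it
      | otherwise   -> Just (pre ++ cl { clWord = cw, clValid = True } : post)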

The evict function is used to specifically request the eviction of the cache line denoted by the given address. In that case, eviction classes are ignored, so for example a cache line of normal eviction policy is evicted even though a cache line of class first would normally have been evicted first. Still, eviction is only possible if the cache line is not dirty and not pinned; otherwise the memory program has to wait until later before it can try again.

evict : CacheLine∗ × PhysMemAddr → (CacheLine∗)⊥

evict(ε, pAddr) = ⊥
evict(−→cl, pAddr) =
    init ◦ −→cl′   if −→cl = (cl . pAddr) ◦ −→cl′ ∧ ¬cldirty ∧ cl−−→mid = ε
    −→cl           if ¬isCached(−→cl, pAddr)
    ε              if −→cl = (cl . pAddr) ◦ −→cl′ ∧ (cldirty ∨ cl−−→mid ≠ ε)

The function discard is similar to evict. The difference is that the cache line denoted by the given address is evicted in any case if it is not pinned by any pending memory operation. In particular, this includes dirty cache lines. Hence, if used unwisely, data loss can occur. However, some cache operations like .lu specifically request that a cache line be deleted from the L1 cache without a prior write-back [9, Table 80]. This can be used to avoid needless write-backs of values that are never used again, for example when a register spilled to local memory is restored.

discard : CacheLine∗ × PhysMemAddr → (CacheLine∗)⊥

discard(ε, pAddr) = ⊥
discard(−→cl, pAddr) =
    init ◦ −→cl′   if −→cl = (cl . pAddr) ◦ −→cl′ ∧ cl−−→mid = ε
    −→cl           if ¬isCached(−→cl, pAddr)
    ε              if −→cl = (cl . pAddr) ◦ −→cl′ ∧ cl−−→mid ≠ ε

4.3 Formalization of the Memory Environment

The types of memory supported by our formalization of CUDA’s memory model are constant, global, local, and shared memory. What these four types of memory have in common is that they are addressable by PTX programs; registers, on the other hand, are not. We define the domain StateSpace to refer to one of the four addressable memory types. Additionally, we are interested in the two subsets of StateSpace that designate the memory types supported by atomic memory operations and the memory types that lie in DRAM.

StateSpace = {.global, .local, .shared, .const}
ssa ∈ StateSpaceatomic = {.global, .shared} ⊂ StateSpace
ssd ∈ StateSpacedevice = {.global, .const, .local} ⊂ StateSpace

As mentioned above, the domain PhysMemAddr is used to address a location in any of the memory types. We define the following subsets of this address space that identify locations in a specific memory type. For the addressable memory types, each address corresponds to one byte of memory. For registers, each address identifies a register in an SM's register file.

GlobalAddr ⊂ PhysMemAddr        SharedAddr ⊂ PhysMemAddr
ConstAddr ⊂ PhysMemAddr         RegAddr ⊂ PhysMemAddr
LocalAddr ⊂ PhysMemAddr

However, global, local, and constant memory are all mapped to locations in DRAM. We therefore have to assume that global, local, and constant addresses uniquely identify a location in DRAM, meaning that the same address cannot be used by more than one of the DRAM memory types.

GlobalAddr ∩ LocalAddr = ∅        GlobalAddr ∩ ConstAddr = ∅
LocalAddr ∩ ConstAddr = ∅

Moreover, the number of addresses for each memory type is limited by the amount of available memory on the graphics card:

|GlobalAddr| = GlobalMemSize        |SharedAddr| = SharedMemSize
|ConstAddr| = ConstMemSize          |RegAddr| = RegFileSize
|LocalAddr| = LocalMemSize · NumProcessors · MaxResidentThreads

42 Shared memory is organized into banks such that successive 32 bit words are assigned to successive banks. Memory operations affecting different shared memory banks can be serviced concurrently; if more than one 32 bit word is read from the same bank, a bank conflict occurs and access is serialized. However, we only define abstract functions that implicitly define the bank layout and resolve bank conflicts. We do not give concrete implementations because bank conflicts merely affect program performance. By contrast, we do have to consider bank locking to correctly define the semantics of shared memory atomic operations, even though it is not officially documented when and how shared memory banks are locked. Shared memory atomic operations are not guaranteed to be atomic — in contrast to atomic operations on global memory [9, Table 105] — because of the way the PTX compiler uses the locking mechanism to implement atomic operations on shared memory. Therefore, we cannot simply omit the concept of shared memory banks, even though they are solely relevant for performance consideration except for this one case. Consequently, a shared memory bank can either be in locked or unlocked state.

Lock = B

We identify shared memory banks by their bank index. The maximum number of shared memory banks is determined by the compute capability of the underlying hardware.

BankIdx = {1, . . . , SharedMemBanks}

The abstract function banks returns the indices of the banks that are accessed when a given number of bytes starting at the specified address is read from or written to shared memory.

banks : SharedAddr × MemOpSize → BankIdx∗
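Although the formalization deliberately leaves banks abstract, one plausible concrete instance can be sketched in Haskell. It assumes 32 banks of successive 32 bit words, the layout of Fermi-class hardware; neither the bank count nor the layout is implied by the abstract definition.

import Data.List (nub)

-- Assumed bank count; the formal constant SharedMemBanks stays abstract.
sharedMemBanks :: Int
sharedMemBanks = 32

-- Bank indices (1-based, as in BankIdx) touched by an access of `size` bytes
-- starting at shared memory address `addr`, assuming 32-bit-word interleaving.
banks :: Int -> Int -> [Int]
banks addr size =
  nub [ (word `mod` sharedMemBanks) + 1
      | word <- [addr `div` 4 .. (addr + size - 1) `div` 4] ]

Under these assumptions, banks 0 8 yields [1, 2] and banks 4 4 yields [2]; two accesses with disjoint results are conflict-free.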

The formalization of a memory is a function that maps an address to a byte or a register value. For shared memory, we also have to keep track of each bank's lock state. Similarly, we keep track of each register's blocked state. We neglect CUDA's concept of constant memory banks for the reasons outlined in section 2.3.2.

GlobalMem = GlobalAddr → Byte        LocalMem = LocalAddr → Byte
ConstMem = ConstAddr → Byte          RegFile = RegAddr → RegValue
SharedMem = (SharedAddr → Byte) × (BankIdx → Lock)

With the above formalization of the different memory types, we can finally define the memory environment that we base the semantics of memory programs on. The memory environment comprises global, local, and constant memory as well as the L2 cache. In addition, we define a domain that encapsulates the memories found on each streaming multiprocessor, i.e. shared memory, the register file, and the L1 cache. The memory environment consequently contains a processor memory environment for each streaming multiprocessor.

ProcMem = SharedMem × L1Cache × RegFile
η ∈ MemEnv = GlobalMem × ConstMem × LocalMem × L2Cache × ProcMem^NumProcessors

The streaming multiprocessors are successively numbered. These numbers represent the indices of the processors and uniquely identify each one. We assume that the location of a processor in the list of processor memories in the memory environment corresponds to its index number.

pid ∈ ProcIdx = {0, . . . , NumProcessors − 1}

Similar to caches, we now formalize the operations supported on the memory environment that the memory program semantics later makes use of, namely reading and writing of any type of memory. To do this, we first define the following helper functions that reduce the number of projections required to define the rules and functions found in the remainder of this chapter. These functions allow us to access a processor's L1 cache, shared memory, or register file.

l1Cache : MemEnv × ProcIdx → L1Cache            regFile : MemEnv × ProcIdx → RegFile
l1Cache(η, pid) = η−−−−−−−→procMem, pid, l1Cache    regFile(η, pid) = η−−−−−−−→procMem, pid, regFile

sharedMem : MemEnv × ProcIdx → SharedMem
sharedMem(η, pid) = η−−−−−−−→procMem, pid, sharedMem

The function readdevice reads a value from device memory, i.e. global, constant, or local memory. It returns a list of bytes, the size of which depends on the size of the memory request. Additionally, we check whether the given address actually matches the specified state space. If it does not, ⊥ is returned. We use the abbreviated list concatenation syntax b1 . . . bn instead of b1 :: . . . :: bn to make the definition more readable.

readdevice : MemEnv × StateSpacedevice × PhysMemAddr × MemOpSize → (Byte∗)⊥

readdevice(η, .global, pAddr, size) = ηglobalMem(pAddr + 0) . . . ηglobalMem(pAddr + size − 1)
    if {pAddr + 0, . . . , pAddr + size − 1} ⊆ GlobalAddr
readdevice(η, .local, pAddr, size) = ηlocalMem(pAddr + 0) . . . ηlocalMem(pAddr + size − 1)
    if {pAddr + 0, . . . , pAddr + size − 1} ⊆ LocalAddr
readdevice(η, .const, pAddr, size) = ηconstMem(pAddr + 0) . . . ηconstMem(pAddr + size − 1)
    if {pAddr + 0, . . . , pAddr + size − 1} ⊆ ConstAddr
readdevice(η, ssd, pAddr, size) = ⊥   otherwise

We define a similar function for shared memory reads. However, shared memory reads can access several different locations concurrently, thus the function is passed a list of addresses and access sizes and returns a list of read values. We assume that the function that translates the shared memory accesses of threads into memory operations deals with bank conflicts such that requests that are in conflict are split into two or more conflict-free requests. Therefore, readshared can safely ignore the possibility of bank conflicts. Additionally, non-atomic reads of shared memory do not care about a bank being locked, hence readshared succeeds even if an accessed bank is locked.

readshared : MemEnv × ProcIdx × PhysMemAddr∗ × MemOpSize∗ → ((Byte∗)∗)⊥

readshared(η, pid, ε, ε) = ε
readshared(η, pid, pAddr :: −−−−→pAddr, size :: −−→size) =
    (s(pAddr + 0) . . . s(pAddr + size − 1)) :: readshared(η, pid, −−−−→pAddr, −−→size)
    if s = π1(sharedMem(η, pid)) ∧ {pAddr + 0, . . . , pAddr + size − 1} ⊆ SharedAddr
readshared(η, pid, −−−−→pAddr, −−→size) = ⊥   otherwise

Finally, we define two functions that enable memory programs to write to global, local, and shared memory. Obviously, threads are not allowed to write to constant memory. Only the host program can write to constant memory; however, the semantics of the host program are outside the scope of this report.

writedevice : MemEnv × StateSpacedevice × PhysMemAddr × MemOpSize × Byte∗ → MemEnv⊥

writedevice(η . globalMem, .global, pAddr, size, ~b) =
    η / globalMem[pAddr + 0 . . . pAddr + size − 1 ↦ ~b0 . . . ~bsize−1]
    if {pAddr + 0, . . . , pAddr + size − 1} ⊆ GlobalAddr
writedevice(η . localMem, .local, pAddr, size, ~b) =
    η / localMem[pAddr + 0 . . . pAddr + size − 1 ↦ ~b0 . . . ~bsize−1]
    if {pAddr + 0, . . . , pAddr + size − 1} ⊆ LocalAddr
writedevice(η, ssd, pAddr, size, ~b) = ⊥   otherwise

Again, several shared memory locations can be written concurrently, bank conflicts are already accounted for, and locked banks do not prevent a non-atomic write.

writeshared : MemEnv × ProcIdx × PhysMemAddr∗ × MemOpSize∗ × ((Byte∗)∗) → MemEnv⊥

writeshared(η, pid, ε, ε, ε) = η
writeshared(η . −−−−−−−→procMem, pid, pAddr :: −−−−→pAddr, size :: −−→size, ~b :: −→~b) =
    writeshared(η / −−−−−−−→procMem[pid ↦ −−−−−−−→procMempid / s], pid, −−−−→pAddr, −−→size, −→~b)
    where s′ = sharedMem(η, pid)
        ∧ s = s′ / π1(s′)[pAddr + 0 . . . pAddr + size − 1 ↦ ~b0 . . . ~bsize−1]
    if {pAddr + 0, . . . , pAddr + size − 1} ⊆ SharedAddr
writeshared(η, pid, −−−−→pAddr, −−→size, −→~b) = ⊥   otherwise

4.4 Formalization and Formal Semantics of Memory Programs

We have already informally presented the syntax and semantics of memory programs in section 4.1; we now define them formally based on the formalization of caches and the memory environment found in the preceding sections. There are four basic terms we use throughout the remainder of this chapter: instantaneous execution, deferred execution, micro steps, and macro steps. As the preceding sections have already mentioned several times, a memory program is able to yield execution and allow other programs to advance before it resumes execution itself. This yielding can either happen explicitly by processing a yield statement or implicitly according to the semantics of a cache operation or shared memory bank locking. When an operation does not perform an implicit yield, we call the operation instantaneous. Otherwise, it defers execution until a later point in time. There are operations that are guaranteed to be instantaneous in any case and others whose behavior depends on the state of both the memory environment and the program. When a program is allowed to execute, it executes all instructions until either an explicit or implicit yield is performed; we call such a series of executed instructions a micro step. Micro steps are uninterruptible and never concurrent, i.e. no two programs ever execute a micro step at the same time. We assign one of the following yield actions to each program statement to decide whether the micro step is not yet completed and the program should continue to execute, the micro step is completed and the program should yield execution, or the program has terminated.

ya ∈ YieldAction ::= continue | yield | end

Macro steps, on the other hand, form the semantics of the memory model as a whole and are defined in section 4.6. Among other things, they consist of an arbitrary number of serially executed micro steps of different programs. Within a single macro step, each program can perform zero, one, or more micro steps. Memory programs can manipulate the state of the memory environment as well as their own implicit state. Additionally, they can execute different code paths depending on the state of the memory environment. Therefore, a memory program consists of Boolean expressions and operational expressions as well as program statements that chain up these expressions and define the general program flow. In order to keep things simple, program statements do not contain variable assignments, function calls, or while-loops. Still, the language is sufficiently expressive to define the semantics of the memory operations supported by CUDA; some exemplary memory programs can be found in section 4.5. The next sections formalize the implicit program state as well as the grammar and semantics of Boolean expressions, operational expressions, and program statements.

4.4.1 Implicit Program State

Memory programs cannot use variables to store data and pass information around. Instead, all relevant state is implicit and inaccessible to the program. The semantics of Boolean and operational expressions, however, pass parts of the implicit state as input to the cache and memory operations defined in the previous sections. Hence, we have to formalize the implicit state of memory programs before we define the semantics of expressions. The program state Σ as specified below might seem overly bloated. However, we deliberately do not define it in a more fine-grained way. The coarse structure of the state avoids the need for a multitude of projections in function definitions and the like. Nevertheless, we do partition the state domain into four sub-domains: Σatomic for atomic memory operations, Σdevice for DRAM memory operations, Σshared for shared memory operations, and Σreg for register update operations.

σ ∈ Σ = Σatomic ∪ Σdevice ∪ Σshared ∪ Σreg

Except for Σreg, which is somewhat of a special case, all of these program states contain one or more byte lists that are used to hold input and output data, i.e. either a value read from a cache or memory or the value provided by the program that should be written to a cache or memory. Reading from and writing to these byte lists is handled by the program semantics implicitly, not by the programs themselves. To highlight the intended usage of the byte lists, we introduce the following three aliases:

vin ∈ Valuein = Byte∗        vi/o ∈ Valuein/out = Byte∗        vout ∈ Valueout = Byte∗

Typically, read requests from threads of the same warp are coalesced into fewer, but larger memory transactions. When that happens, it is unclear which value should be written to which output register. As we do not know how requests are coalesced — this is done by an abstract function defined later on —, we also do not know how the value retrieved by the memory program should be decomposed and copied into the destination registers of the requesting threads. We solve this issue by assuming that a decomposition function is given to each program that has to decompose a retrieved value. The abstract function that translates thread requests into one or more memory operations creates a decomposition function that reverts the effects of coalescing and stores it in the program's implicit state. The program semantics can then use the decomposition function to successfully write the retrieved values to the destination registers.
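To illustrate the idea, the following Haskell sketch shows a decomposition function for the situation discussed in the example below, where two 4 byte reads of consecutive addresses are coalesced into a single 8 byte transaction. The register file type, the state type, and the registers r1 and r2 are simplified stand-ins; partially applying decompose to the two destination registers yields a function of the shape required by the DecompFunc domain defined next.

import qualified Data.Map as Map

-- Simplified stand-ins: a register file maps register addresses to byte lists, and the
-- only piece of implicit program state we model is the retrieved value list.
type RegFile = Map.Map Int [Int]
type Value   = [Int]

-- Decomposition function for an 8 byte coalesced read whose first half belongs to
-- register r1 and whose second half belongs to register r2; r1 and r2 are chosen by
-- the (abstract) coalescing function when the request is created.
decompose :: Int -> Int -> RegFile -> Value -> RegFile
decompose r1 r2 regs bytes =
  Map.insert r2 (drop 4 bytes) (Map.insert r1 (take 4 bytes) regs)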

df DecompFunc = RegFile Σ RegFile ∈ × → Suppose two threads execute the statement ld.global.ca.s32 r, [a], where r is the destination register that stores the result of the read and a is the location of the value to read. Furthermore, suppose that the values of r and a are r1 and a1 for the first thread and r2 and a2 for the second one. Moreover, assume that a2 = a1 + 4. Because both threads read a 4 byte signed integer, the request might be coalesced into one 8 byte request at location a1. After the completion of the corresponding memory program, the program’s state contains the retrieved value in its value list ~b. The decomposition function subsequently copies ~b0 ...~b3 ~ ~ into register r1 and b4 ... b7 into register r2. When a thread issues a read request to memory, the destination register of the read is blocked until the read is completed, i.e. the value is written to the register. For a write request, the source register is blocked until the memory subsystem has picked up the value. A thread cannot be scheduled if its next instruction accesses a blocked register. CUDA does not specify when exactly registers are unblocked; it is only said that registers used in a write request are unblocked “much more quickly” than those used in a read request [9, 6.6]. Therefore, we make it the responsibility of the memory programs to unblock blocked registers at some point during the lifetime of the program. Consequently, the program’s state contains a list of all registers that were blocked when the request was issued. All of these registers are unlocked at the program’s command. All programs are identified by a unique memory program index and their states optionally contain a decomposition function and a list of blocked registers the program should release. Furthermore, all states store the index of the streaming multiprocessor of the threads that issued the request. The SM index is used to find out which register file, L1 cache, or shared memory should be manipulated by the program. Additionally, Σatomic stores the location and the size of the value that should be manipulated by the program. It also comprises the atomic operation that should be performed as well as the state space — global or shared — that should be accessed. We assume that a single memory program performs only one atomic operation. Several atomic operations on distinct addresses could be executed in parallel just

47 as well. The CUDA specification does not give any hints as to what the hardware really does, but as atomic operations cannot possibly affect one another, this really has no effect on program behavior. On the other hand, the PTX specification clearly states that the memory model does not guarantee atomicity of atomic operations on shared memory with respect to other, non-atomic memory operations on the same address [9, Table 105]; neither does the formalization presented in this chapter. Depending on the type of the atomic operation, an atomic memory request either has one or two input values. An atomic operation writes the original value at the accessed location to the destination register.

σa ∈ Σatomic = PhysMemAddr × MemOpSize × ProcIdx × StateSpaceatomic × AtomicOp
              × Valueout × Valuein × Valuein × DecompFunc × RegAddr∗ × MemOpIdx

The program state for DRAM memory operations is used by programs for both read and write operations. Consequently, the byte list in the program's state is used either for the input or output value. The state also stores the request's address and size.

σd ∈ Σdevice = PhysMemAddr × MemOpSize × StateSpacedevice × CacheOpε × ProcIdx
              × Valuein/out × DecompFuncε × RegAddr∗ × MemOpIdx

As memory operations on shared memory can concurrently read or write several different locations, the state of a shared memory program contains several addresses and access sizes. Again, it is assumed that any possible bank conflicts are resolved before memory programs and their states are created by splitting conflicting accesses into several conflict-free ones. Just like the state of DRAM memory operations, the shared memory operation state utilizes the byte list for input and output purposes.

σs ∈ Σshared = ProcIdx × PhysMemAddr∗ × MemOpSize∗ × Valuein/out∗ × DecompFuncε
              × RegAddr∗ × MemOpIdx

Some memory operations only update registers, but do not touch other types of memory. This is a bit of a special case: because we can use the decomposition function to update the registers, the program's state does not have to store the register addresses or the new values. Memory programs with a Σreg program state are issued when threads perform operations that write to registers. Basically, this includes all arithmetic, logic, and shift instructions supported by PTX. The decomposition function inherently supports coalescing register updates of several threads into one memory operation, assuming the hardware supports that.

σr ∈ Σreg = ProcIdx × DecompFunc × RegAddr∗ × MemOpIdx

4.4.2 Boolean Expressions

A memory program can use Boolean expressions in an if-else statement in order to execute different code paths depending on the state of the memory environment. The supported operations are negation, conjunction, and disjunction as well as the two constant values true and false. Additionally, a memory program accessing global or local memory can query whether the requested address is currently cached in L1 or L2.

BExp ::= true | false
       | !BExp | BExp and BExp | BExp or BExp
       | isCachedL1 | isCachedL2

Boolean expressions are always instantaneous, thus a program never implicitly yields execution due to the evaluation of a Boolean expression. Furthermore, cache queries are only valid if the memory program actually handles a cached operation. Otherwise, the evaluation of the expression returns ⊥ and the program is stuck. We generally use ⊥ as a sanity check to avoid faulty memory programs, i.e. memory programs with undefined behavior.

B~− : BExp → (MemEnv × Σ → B⊥)

B~trueησ = true
B~falseησ = false
B~!bExpησ = ¬B~bExpησ
B~bExp1 and bExp2ησ = B~bExp1ησ ∧ B~bExp2ησ
B~bExp1 or bExp2ησ = B~bExp1ησ ∨ B~bExp2ησ
B~isCachedL1ησd = isCached(l1Cache(η, σpid), σpAddr)   if σd,copε ≠ ε
B~isCachedL2ησd = isCached(ηl2Cache, σpAddr)           if σd,copε ≠ ε
B~bExpησ = ⊥   otherwise
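The following Haskell sketch transcribes B~− into executable form. The Env record is a simplified stand-in for the pair of memory environment and program state — it only records the answers to the two cache queries and whether the program handles a cached operation — and Nothing plays the role of ⊥.

data BExp = BTrue | BFalse | BNot BExp | BAnd BExp BExp | BOr BExp BExp
          | IsCachedL1 | IsCachedL2

data Env = Env { cachedInL1 :: Bool, cachedInL2 :: Bool, isCachedOp :: Bool }

-- A cache query in a program that does not handle a cached operation yields Nothing (⊥),
-- leaving the program stuck; all other cases are total.
evalB :: BExp -> Env -> Maybe Bool
evalB BTrue        _ = Just True
evalB BFalse       _ = Just False
evalB (BNot b)     e = not <$> evalB b e
evalB (BAnd b1 b2) e = (&&) <$> evalB b1 e <*> evalB b2 e
evalB (BOr  b1 b2) e = (||) <$> evalB b1 e <*> evalB b2 e
evalB IsCachedL1   e = if isCachedOp e then Just (cachedInL1 e) else Nothing
evalB IsCachedL2   e = if isCachedOp e then Just (cachedInL2 e) else Nothing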

4.4.3 Operational Expressions

Operations allow a memory program to change its own implicit state as well as the state of the memory environment. Based on the program's implicit state, the operations access the correct type of memory and select the processor memory environment of the streaming multiprocessor that the threads that issued the request belong to. Additionally, a program uses operations to release blocked registers and to write the result of a read operation to the destination registers. Furthermore, shared memory banks are locked and unlocked using the corresponding operational expressions.

OpExp ::= readMem | writeMem | atomOp | releaseRegs | writeToRegs
        | lock | unlock
        | readL1 | readL2 | writeL1 | writeL2
        | reserveL1 | reserveL2 | updateL1 | updateL2
        | evictL1 | evictL2 | discardL1 | discardL2
        | classifyFirstL1 | classifyFirstL2
        | classifyNormalL1 | classifyNormalL2

In contrast to Boolean expressions, operations are generally not instantaneous; some of them are guaranteed to be instantaneous and others might defer execution depending on the state of the memory environment. The function O~− returns a yield action to indicate whether execution should continue. In any case, an operation never terminates a program. Typically, operations that change the state of the memory environment do so using data stored in the program's implicit state, whereas operations that update the program's implicit state do not alter the state of the memory environment.

We define the semantics for operational expressions as follows. To keep the semantics function O~− as concise as possible, it depends on several helper functions, tagged with the subscript op, that hook up the current states of the memory environment and the program to the functions that actually perform the operations. The cache and memory operations are defined in the preceding sections, and we define the helper functions that O~− depends on later on. But as can already be seen from the definition of O~−, only cache operations and shared memory bank locking potentially defer execution; reads and writes of memory, atomic operations as well as register writes and unblocking are always instantaneous. As for Boolean expressions, the semantics of operational expressions return ⊥ whenever a program's state does not permit the execution of an operation.

O~  : OpExp (MemEnv Σ (MemEnv Σ YieldAction) ) − → × → × × ⊥ O~readMemησ = (η, readMemop(η, σ), continue)

O~writeMemησ = (writeMemop(η, σ), σ, continue)

O~atomOpησa = (η, atomop(σa), continue)

O~releaseRegsησ = (releaseRegsop(η, σ), σ, continue)

O~writeToRegsησ = (writeToRegsop(η, σ), σ, continue)

O~lockησa = lockop(η, σa) if σa,ssa = .shared

O~unlockησa = (unlockop(η, σa), σa, continue) if σa,ssa = .shared O readL1 ~ ησd = L1op(η, σd,pid, readCacheop(σd)) if σd,copε , ε O readL2 ~ ησd = L2op(η, readCacheop(σd)) if σd,copε , ε O writeL1 ~ ησd = L1op(η, σd,pid, writeCacheop(σd)) if σd,copε , ε O writeL2 ~ ησd = L2op(η, writeCacheop(σd)) if σd,copε , ε O reserveL1 ~ ησd = L1op(η, σd,pid, reserveop(σd)) if σd,copε , ε O reserveL2 ~ ησd = L2op(η, reserveop(σd)) if σd,copε , ε O updateL1 ~ ησd = L1op(η, σd,pid, updateop(σd)) if σd,copε , ε O updateL2 ~ ησd = L2op(η, updateop(σd)) if σd,copε , ε O evictL1 ~ ησd = L1op(η, σd,pid, evictop(σd)) if σd,copε , ε O evictL2 ~ ησd = L2op(η, evictop(σd)) if σd,copε , ε O discardL1 ~ ησd = L1op(η, σd,pid, discardop(σd)) if σd,copε , ε O discardL2 ~ ησd = L2op(η, discardop(σd)) if σd,copε , ε O classifyFirstL1 ~ ησd = L1op(η, σd,pid, classifyop(σd, first)) if σd,copε , ε O classifyFirstL2 ~ ησd = L2op(η, classifyop(σd, first)) if σd,copε , ε O classifyNormalL1 ~ ησd = L1op(η, σd,pid, classifyop(σd, normal)) if σd,copε , ε O classifyNormalL2 ~ ησd = L2op(η, classifyop(σd, normal)) if σd,copε , ε O~opExpησ = otherwise ⊥ In the remainder of this section, we define the helper functions in the order they are used in the above definition of O~ . Most of these functions dissect the program’s state and − subsequently use the cache and memory operations presented in the previous sections to perform the actual operation.

50 The function readMemop invokes the read functions either for shared memory or DRAM depending on the program’s state. The value read by readMemop is copied into the program state while the memory environment is left unchanged. For atomic memory program states, the value retrieved from memory is written to the Valueout domain.

readMemop : MemEnv Σ Σ × → ⊥ readMemop(η, σa . pAddr, size, .global) = σa / readdevice(η, .global, pAddr, size)

readMemop(η, σa . pAddr, size, pid, .shared) = σa / readshared(η, pid, pAddr, size)0

readMemop(η, σs . pid, −−−−→pAddr, −−→size) = σs / readshared(η, pid, −−−−→pAddr, −−→size)

readMemop(η, σd . pAddr, size, ssd) = σd / readdevice(η, ssd, pAddr, size)

The function writeMemop invokes the write functions either for shared memory or DRAM depending on the program’s state. It does not alter the program state but instead changes the memory environment. For atomic memory program states, the written value is stored in the first Valuein domain.

writeMemop : MemEnv Σ MemEnv × → ⊥ writeMemop(η, σa . pAddr, size, .global, vin) = writedevice(η, .global, pAddr, size, vin)

writeMemop(η, σa . pAddr, size, pid, .shared, vin) = writeshared(η, pid, pAddr, size, vin)

writeMemop(η, σs . pid, −−−−→pAddr, −−→size, −−→vi/o) = writeshared(η, pid, −−−−→pAddr, −−→size, −−→vi/o) writeMemop(η, σd . pAddr, size, ssd, vi/o) = writedevice(η, ssd, pAddr, size, vi/o)

The function atomop performs an atomic reduction on the data previously retrieved from the memory environment and on the value specified by the issuing thread. Before a memory program can use the atomOp instruction, it must load the value on which to perform the atomic operation into the program’s state. After the call to atomOp, the program must write back the computed value. The program has to guarantee atomicity of these three operations, hence they must be executed in the same micro step or shared memory bank locking must be used. The memory programs for atomic operations on shared and global memory are presented in section 4.5. The semantics of the atomic reductions supported by CUDA are specified in [9, Table 105] and [3, B.11]; we introduce the abstract function aop to perform the atomic reductions in accordance with CUDA’s atomic reduction semantics.

atomop : Σatomic Σatomic → atomop(σa . aop, vout, vin,1, vin,2) = σa / vout, aop(vout, vin,1, vin,2), vin,2

For reasons of clarity, we split the implementation of releaseRegsop into two functions. The first one unblocks all of the given registers in the given register file. The second one selects the register file of the streaming multiprocessor stored in the program’s state. Then it uses the first function to unblock the registers affected by the program and writes the updated register file back into the memory environment.

releaseRegsop : RegFile RegAddr∗ RegFile × → releaseRegsop(regFile, regAddr1 ... regAddrn) = regFile[regAddr ... regAddr regFile(regAddr ) / ready ... regFile(regAddr ) / ready] 1 n 7→ 1 n

51 releaseRegsop : MemEnv Σ MemEnv × → releaseRegsop(η . −−−−−−−→procMem, σ . pid, −−−−−−→regAddr) =

η / −−−−−−−→procMem[pid −−−−−−−→procMem / releaseRegsop(regFile(η, pid), −−−−−−→regAddr)] 7→ pid

writeToRegsop copies the value retrieved from the memory environment to the destination registers of the memory program. To do this, the function invokes the memory program’s decomposition function, if there is one. All writeToRegsop does itself is to select the register file corresponding to the memory program’s streaming multiprocessor and to update it accordingly. The decomposition function does not unblock any registers.

writeToRegsop : MemEnv Σ MemEnv × → ⊥ writeToRegsop(η . −−−−−−−→procMem, σ . pid, df ) = η / −−−−−−−→procMem[pid −−−−−−−→procMem / regFile] 7→ pid where regFile = df (regFile(η, pid), σ)

writeToRegsop(η, σ) = otherwise ⊥

The function lockop locks the banks the requested shared memory locations belong to. If one of the banks is already locked, the program yields execution and tries again during the next micro step.

lockop : MemEnv Σatomic MemEnv Σatomic YieldAction × → × × lockop(η . −−−−−−−→procMem, σa . pAddr, size, pid) =

(η / −−−−−−−→procMem[pid −−−−−−−→procMem / s], σa, continue) 7→ pid where s0 = sharedMem(η, pid) s = s0 / π2(s0)[banks(pAddr, size) true] ∧ 7→ if bankIdx banks(pAddr, size) . π2(s0)(bankIdx) = false ∀ ∈ lockop(η, σa . pAddr, size, pid) = (η, σa, yield)

if bankIdx banks(pAddr, size) . π2(sharedMem(η, pid))(bankIdx) = true ∃ ∈
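The all-or-nothing character of bank locking can be sketched in Haskell as follows. The lock map is a simplified stand-in for the lock component of SharedMem; acquisition fails as a whole if any requested bank is held, mirroring the deferral in lockop, and the unconditional release corresponds to unlockop defined next.

import qualified Data.Map as Map

type BankIdx = Int
type Locks   = Map.Map BankIdx Bool   -- True = locked

-- All-or-nothing acquisition: either every requested bank is free and all of them are
-- locked in one micro step, or nothing changes and the program must yield (Nothing).
tryLock :: [BankIdx] -> Locks -> Maybe Locks
tryLock bs locks
  | all (\b -> not (Map.findWithDefault False b locks)) bs
              = Just (foldr (\b m -> Map.insert b True m) locks bs)
  | otherwise = Nothing

-- Unlocking is unconditional and therefore always instantaneous.
unlock :: [BankIdx] -> Locks -> Locks
unlock bs locks = foldr (\b m -> Map.insert b False m) locks bs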

unlockop unlocks the banks accessed by the shared memory request. It is always instanta- neous because it simply unlocks all accessed shared memory banks. No deadlocks can occur due to shared memory bank locking as a request either locks or unlocks all accessed banks or none at all.

unlockop : MemEnv Σatomic MemEnv × → unlockop(η . −−−−−−−→procMem, σa . pAddr, size, pid) = η / −−−−−−−→procMem[pid −−−−−−−→procMem / s] 7→ pid where s0 = sharedMem(η, pid) s = s0 / π2(s0)[banks(pAddr, size) false] ∧ 7→ Section 4.2 defines all cache operations over sets of cache lines allowing us to use the same functions to operate on the L1 caches and the L2 cache. Ideally, we would also use one helper function per operational expression to wire up the semantics of the operational expression to the actual cache operation. However, with the L2 cache being embedded directly into the memory environment and the L1 caches being scattered throughout the processor memory environments, the process of selecting and updating the L1 and L2 caches differs significantly. But that is in fact the only difference; once a set of cache lines is selected, calling the actual cache operation and deciding whether execution is instantaneous or deferred works exactly

52 the same regardless of the cache being modified. To avoid the redundancy, we define the hook-up functions only once for both types of caches and then pass these functions to the L1op and L2op functions that perform the actual cache selection and updates. For that matter, we define the domain of cache functions. Such a function manipulates cache lines as well as the state of the memory program and decides whether program execution should continue.

cf CacheFunc : CacheLine∗ (Σdevice CacheLine∗ YieldAction) ∈ → × × ⊥ Cache functions are passed to L1op and L2op. There, the corresponding L1 cache or the L2 cache is selected and the cache lines are passed to the provided cache function. The values returned by the cache function are subsequently used to update the selected cache. The functions then return the updated memory environment and program state as well as the yield action generated by the cache function.

L1op : MemEnv ProcIdx CacheFunc (MemEnv Σ YieldAction) × × → × × ⊥ L1op(η . −−−−−−−→procMem, pid, cf ) = (η / −−−−−−−→procMem[pid −−−−−−−→procMem / −→cl], σ , ya) 7→ pid d where cf (l1Cache(η, pid)) = (σd, −→cl, ya)

L1op(η, pid, cf ) = otherwise ⊥

L2op : MemEnv CacheFunc (MemEnv Σ YieldAction) × → × × ⊥ L2op(η . l2Cache, cf ) = (η / −→cl, σd, ya) where cf (l2Cache) = (σd, −→cl, ya)

L2op(η, cf ) = ⊥   otherwise

The following helper functions are passed to L1op and L2op using function currying. In general, they call the cache operations defined in section 4.2 and decide whether the memory program is allowed to continue execution. Otherwise, the program yields execution and retries at a later point in time. Some cache operations, like eviction reclassification, are however always instantaneous. The conversion of an element of the domain Valuein/out into an element of the domain CacheWord always succeeds without data loss because a cache line is big enough to store the largest possible memory request.
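The currying pattern can be sketched in Haskell as follows. The cache, state, and environment types are heavily simplified stand-ins, but the division of labour mirrors L1op and L2op: the cache function is oblivious to the cache level, and the two wrappers only differ in how the cache is selected and written back.

-- Simplified stand-ins for the formal domains.
data YieldAction = Continue | Yield | End deriving Show
type Cache = [Int]        -- placeholder for CacheLine∗
type State = Int          -- placeholder for Σdevice

type CacheFunc = Cache -> Maybe (State, Cache, YieldAction)

-- One L2 cache and one L1 cache per streaming multiprocessor.
data MemEnv = MemEnv { l2 :: Cache, l1s :: [Cache] }

-- Select the L1 cache of processor pid, run the cache function, write the result back.
l1Op :: MemEnv -> Int -> CacheFunc -> Maybe (MemEnv, State, YieldAction)
l1Op env pid cf = do
  (st, cls, ya) <- cf (l1s env !! pid)               -- assumes pid is a valid index
  let l1s' = take pid (l1s env) ++ cls : drop (pid + 1) (l1s env)
  pure (env { l1s = l1s' }, st, ya)

-- Same cache function, but applied to the single L2 cache.
l2Op :: MemEnv -> CacheFunc -> Maybe (MemEnv, State, YieldAction)
l2Op env cf = do
  (st, cls, ya) <- cf (l2 env)
  pure (env { l2 = cls }, st, ya)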

readCacheop : Σ CacheFunc device → readCacheop(σd . pAddr, mid)(−→cl) = (σd / cw, cl−→0, continue)

if (cl−→0, cw) = read(−→cl, pAddr, mid)

readCacheop(σd . pAddr, mid)(−→cl) = (σd, cl−→0, yield) if (cl−→0, ε) = read(−→cl, pAddr, mid)

readCacheop(σ )(−→cl) = otherwise d ⊥

writeCacheop : Σ CacheFunc device → writeCacheop(σd . pAddr, vi/o)(−→cl) = (σd, cl−→0, continue)

if cl−→0 = write(−→cl, pAddr, vi o) cl−→0 , ε / ∧ writeCacheop(σd . pAddr, vi/o)(−→cl) = (σd, −→cl, yield) if ε = write(−→cl, pAddr, vi/o)

writeCacheop(σ )(−→cl) = otherwise d ⊥

53 reserveop : Σ CacheFunc device → reserveop(σ . pAddr, mid)(−→cl) = (σ , cl cl−→0, continue) if (cl, cl−→0) = reserve(−→cl, pAddr, mid) d d ◦ reserveop(σd . pAddr, mid)(−→cl) = (σd, −→cl, yield) if (ε, ε) = reserve(−→cl, pAddr, mid)

reserveop(σ )(−→cl) = otherwise d ⊥

updateop : Σ CacheFunc device → updateop(σd . pAddr, vi/o)(−→cl) = (σd, update(−→cl, pAddr, vi/o), continue)

evictop : Σ CacheFunc device → evictop(σ . pAddr)(−→cl) = (σ , cl−→0, continue) if cl−→0 = evict(−→cl, pAddr) cl−→0 , ε d d ∧ evictop(σd . pAddr)(−→cl) = (σd, −→cl, yield) if ε = evict(−→cl, pAddr)

evictop(σ )(−→cl) = otherwise d ⊥

discardop : Σ CacheFunc device → discardop(σ . pAddr)(−→cl) = (σ , cl−→0, continue) if cl−→0 = discard(−→cl, pAddr) cl−→0 , ε d d ∧ discardop(σd . pAddr)(−→cl) = (σd, −→cl, yield) if ε = discard(−→cl, pAddr)

discardop(σ )(−→cl) = otherwise d ⊥

classifyop : Σ EvClass CacheFunc device × → classifyop(σ . pAddr, ec)((cl . pAddr) −→cl) = (σ , (cl / ec) −→cl, continue) d ◦ d ◦ 4.4.4 Program Statements

To complete the formalization of memory programs, we have to define the grammar and semantics of program statements. As already mentioned above, we keep memory programs as simple as possible and hide most of the complexities behind the semantics of operations: For instance, we do not support variables and variable assignment, because the operations implicitly use the program’s state to store input and result values. Programs also do not support while-loops, as they are subsumed by the combination of yield statements and deferred execution.

stm ∈ MemStm ::= ; | yield; | end;
               | MemStm MemStm
               | if (BExp){ MemStm } else { MemStm }
               | OpExp;

For convenience, we establish the following convention when writing memory programs with case distinctions: If the else path of an if-else statement is empty, i.e. contains only ;, we can omit the else. Thus, the two statements if (bExp){ memStm } else { ; } and if (bExp){ memStm } are equivalent.

Transition system →Op defines the semantics of memory program statements. Each rule outputs a yield action that indicates whether program execution should continue, the micro step is completed, or the program has terminated.

Configurations : ⟨stm, η, σ⟩ ∈ MemStm × MemEnv × Σ,   ⟨η, σ⟩ ∈ MemEnv × Σ
Transitions    : ⟨stm, η, σ⟩ −→Op_ya ⟨stm′, η′, σ′⟩,   ⟨stm, η, σ⟩ −→Op_ya ⟨η′, σ′⟩

Some transitions do not definitely result in either a configuration of type ⟨stm, η, σ⟩ or ⟨η, σ⟩, but rather both configurations are possible depending on the state of the executed program. Thus, a rule using such a transition would have to be split into two rules, one for each case. By convention, we only define one rule and use the notation ⟨[stm, ] η, σ⟩ to mean that either of the two configurations could be the result of the rule.

The remainder of this section presents the rules of transition system →Op. The skip instruction ; neither changes the state of the program nor the state of the memory environment. Furthermore, it does not end the micro step, thus program execution continues.

(nop)   ⟨;, η, σ⟩ −→Op_continue ⟨η, σ⟩

Like skip, yield does not change the memory environment or program state. However, it ends the current micro step; the instruction following the yield will be executed during the next micro step of the program.

(yield)   ⟨yield;, η, σ⟩ −→Op_yield ⟨η, σ⟩

Like skip and yield, end leaves all states unchanged and terminates the program, i.e. the program will not perform any further micro steps. end also ends the current micro step.

(end)   ⟨end;, η, σ⟩ −→Op_end ⟨η, σ⟩

There are three possible cases that need to be considered when defining the semantics of sequential composition of program statements. First of all, the first statement might yield execution. In that case, we have to execute the remainder of the first statement — if there is one — during the next micro step. We do not execute the second statement and make sure we execute it once the first statement is fully processed. Secondly, the first statement might execute an end instruction. In this case, the program terminates so there is nothing left to execute. Finally, the execution of the first statement might complete during the current micro step without yielding execution, thus we have to start executing the second one. It is possible that the second statement is also fully completed during the current micro step; otherwise its execution has to be continued during the program's next micro step.

Op stm1, η, σ [stm0 , ] η0, σ0 h i −−−→yield h 1 i (seq ) 1 Op stm1 stm2, η, σ [stm0 ] stm2, η0, σ0 h i −−−→yield h 1 i

Op stm1, η, σ η0, σ0 h i −−→end h i (seq ) 2 Op stm1 stm2, η, σ η0, σ0 h i −−→end h i

55 Op Op stm1, η, σ η00, σ00 stm2, η00, σ00 [stm0 , ] η0, σ0 h i −−−−−−→continue h i h i −→ya h 2 i (seq ) 3 Op stm1 stm2, η, σ [stm0 , ] η , σ h i −→ya h 2 0 0i Either the if or the else path of an if-else statement is executed depending on the value of the Boolean expression. As mentioned above, Boolean expressions are always instantaneous, hence the evaluation of the expression cannot possibly cause the program to yield execution. The statement of the path taken can either be fully executed during the current micro step or needs to continue execution during the next one. If the evaluation of the Boolean expression is unsuccessful, i.e. B~  returns , neither of the following two rules is applicable and the − ⊥ program is stuck.

Op stm1, η, σ [stm0 , ] η0, σ0 h i −→ya h 1 i (if ) if B bExp ησ tt Op ~  if (bExp){ stm1 } else { stm2 }, η, σ [stm0 , ] η , σ h i −→ya h 1 0 0i

Op stm2, η, σ [stm0 , ] η0, σ0 h i −→ya h 2 i (if ) if B bExp ησ ff Op ~  if (bExp){ stm1 } else { stm2 }, η, σ [stm0 , ] η , σ ¬ h i −→ya h 2 0 0i Executing an operational expression typically either changes the memory environment or the program state. For instantaneous operations, program execution continues with the next instruction. If an operation defers execution, the micro step ends and the operation is attempted again during the next macro step the program participates in. Executing an operation can never terminate a program in accordance with the definition of O~ . If the − operation cannot be performed, i.e. O~  returns , the program is stuck because neither of − ⊥ the following two rules is applicable.

Op (op1) opExp;, η, σ η0, σ0 if O~opExpησ = (η0, σ0, continue) h i −−−−−−→continue h i

Op (op2) opExp;, η, σ opExp;, η0, σ0 if O~opExpησ = (η0, σ0, yield) h i −−−→yield h i
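The rules of →Op can be read as a small-step interpreter. The following Haskell sketch collapses the memory environment and the implicit program state into one abstract state parameter and represents Boolean and operational expressions as functions; it is an executable reading of the rules above, not part of the formalization, and it redefines the simplified stand-ins used in the earlier sketches.

data YieldAction = Continue | Yield | End deriving (Eq, Show)

data MemStm s
  = Skip                                 -- ;
  | YieldStm                             -- yield;
  | EndStm                               -- end;
  | Seq (MemStm s) (MemStm s)            -- sequential composition
  | If (s -> Maybe Bool) (MemStm s) (MemStm s)
  | Op (s -> Maybe (s, YieldAction))     -- operational expression

-- One micro step: execute statements until the program yields or ends.
-- Nothing models a stuck program (⊥ during expression evaluation); the inner
-- Maybe is the statement that remains to be executed in later micro steps.
microStep :: MemStm s -> s -> Maybe (Maybe (MemStm s), s, YieldAction)
microStep Skip       s = Just (Nothing, s, Continue)                -- (nop)
microStep YieldStm   s = Just (Nothing, s, Yield)                   -- (yield)
microStep EndStm     s = Just (Nothing, s, End)                     -- (end)
microStep (Seq a b)  s = do
  (rest, s', ya) <- microStep a s
  case ya of
    Continue -> microStep b s'                                      -- (seq3): rest is empty here
    Yield    -> Just (Just (maybe b (`Seq` b) rest), s', Yield)     -- (seq1)
    End      -> Just (Nothing, s', End)                             -- (seq2)
microStep (If c t e) s = do
  b <- c s                                                          -- instantaneous, may be ⊥
  microStep (if b then t else e) s                                  -- (if_tt)/(if_ff)
microStep (Op f)     s = do
  (s', ya) <- f s
  case ya of
    Continue -> Just (Nothing, s', Continue)                        -- (op1)
    Yield    -> Just (Just (Op f), s', Yield)                       -- (op2): retry later
    End      -> Nothing                                             -- operations never end a program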

4.5 Memory Operation Semantics

We have already defined the semantics of a store to global memory with the .wb cache operation in section 4.1. In this section, we present some additional memory programs to define the semantics of some other memory operations supported by PTX. However, we do not present programs for all supported operations, but only those that illustrate some points worth mentioning. As already stated above, the documentation does not explain how volatile memory operations affect matching cache lines in the caches, hence we leave volatile operations entirely unspecified. Memory program 4.2 defines the semantics of a read of global or local memory with the .ca cache operation. First, we check whether the requested address is already stored in the L1 cache. If so, we read the data from L1 in line 22. We know the data is cached but we do not know whether the cache line is valid, so readL1 might be instantaneous or it might defer execution. In any case, we yield execution after we have read the value to allow other programs to continue. If program 4.2 executes its next micro step, the results of the read are

56 1 if (!isCachedL1) 2 { 3 reserveL1; 4 yield; 5 6 if (!isCachedL2) 7 { 8 reserveL2; 9 yield; 10 11 readMem; 12 yield; 13 14 updateL2; 15 } 16 17 readL2; 18 yield; 19 updateL1; 20 } 21 22 readL1; 23 yield; 24 25 writeToRegs; 26 yield; 27 28 releaseRegs; 29 end;

Listing 4.2: Memory program for global and local reads using the .ca cache operation

written to the destination registers. At this point in time, the L1 cache might have already evicted the cache line; this is not a problem, however, as the data read from the cache is stored in the program’s implicit state. After yielding, we release the destination registers in line 28 and end the program. On the other hand, if the L1 cache does not contain a matching cache line, we have to retrieve the value from L2 and cache the data in L1. Thus, we reserve a cache line in line 3. reserveL1 is either instantaneous or defers execution, depending on whether there is an eligible cache line that is either unused or can be evicted. Reserving also ensures that the cache line cannot be evicted until we read the value in line 22. Once reserveL1 succeeds, we yield execution. In the next micro step, we check whether the L2 cache contains a matching cache line. If so, we retrieve the value from L2 in line 17 and yield again once the read succeeds. Next we update the reserved L1 cache line. updateL1 is definitively instantaneous, as the matching cache line is pinned by the program. However, updateL1 might not write the value retrieved from L2 into L1, because there might have been another memory program that wrote a newer value to L1 while the value was being retrieved from L2. That cannot happen for global data — as global data never writes to L1 —, but it might potentially happen if a thread reads local data and writes to the same location without waiting for the previous read to complete. However, if this situation can indeed arise is unknown. In any case, readL1 instantaneously reads the value and unpins the cache line and the program continues to

execute as outlined above. If the data is not cached in L2 either, the data has to be retrieved from global or local memory and must be cached in L2 first. This works in an analogous manner to updating L1, with the exception that readMem is used to retrieve the value from memory.

1 writeL1; 2 classifyFirstL1; 3 yield; 4 5 releaseRegs; 6 end;

Listing 4.3: Memory program for local writes using the .cg cache operation

The semantics of local writes with cache operation .cg are defined by program 4.3. The data is written to the L1 cache, which either happens instantaneously or is deferred. Afterwards, the cache line’s eviction class is set to first as specified by [9, Table 81]. The program ends after releasing the source registers.

1 evictL1; 2 yield; 3 4 if (!isCachedL2) 5 { 6 reserveL2; 7 classifyNormalL2; 8 yield; 9 10 readMem; 11 yield; 12 13 updateL2; 14 yield; 15 } 16 17 readL2; 18 atomOp; 19 writeL2; 20 yield; 21 writeToRegs; 22 yield; 23 releaseRegs; 24 end;

Listing 4.4: Memory program for atomic operations on global memory

Memory program 4.4 defines the semantics of an atomic memory operation on global memory. First, the matching cache line in the L1 cache is evicted — this seems to be a reasonable thing to do, although the PTX specification does not explicitly mention this step. [7, 12] gives a hint that atomic operations are performed by the raster operation units on data cached in L2. Consequently, we check if the requested address is cached in L2 and load it from global memory if no matching cache line is found. This works in a similar manner

58 to program 4.2. However, as we are dealing with atomic operations, micro steps play an important role here. Loading the data into L2 is by no means atomic. After the data is written to L2, the program yields execution and other operations, including other atomic operations, may access the cache line. In any case, the cache line cannot be evicted, because updateL2 in line 13 does not undo the pinning of the cache line caused by reserveL2 in line 6. In line 17, program 4.4 executes the core of the atomic operation: The data is read from L2 either instantaneously or at a later point in time if isCachedL2 was true but the cache line was only reserved. Once the read succeeds, the atomic reduction is performed and the computed value is written back to L2 during the same micro step, i.e. atomically. After the successful write, which is guaranteed to be instantaneous because the address is cached, the micro step ends. During the subsequent micro steps, the destination registers are unblocked and updated with the result values.

1 lock; 2 readMem; 3 yield; 4 5 atomOp; 6 yield; 7 8 writeMem; 9 yield; 10 11 unlock; 12 yield; 13 14 writeToRegs; 15 yield; 16 releaseRegs; 17 end;

Listing 4.5: Memory program for atomic operations on shared memory

Although not officially documented, atomic operations on shared memory are executed differently by the hardware than atomic operations on global memory. As explained in reference to program 4.4, the raster operation units appear to contain dedicated hardware units to process atomic operations. In contrast, atomic operations on shared memory are emulated by the compiler as can be seen by using Nvidia’s cuobjdump disassembler to inspect a compiled program containing atomic shared memory instructions. The value at the requested address is first loaded into a register and the computation of the atomic reduction is performed using the regular instruction set of the CUDA cores. Afterwards, the computed value is written to shared memory. The compiler uses a special flag of the load and store instructions to signal to the streaming multiprocessor that all accessed shared memory banks should be locked or unlocked. We assume that loads and stores of shared memory that do not have the lock flag set do not respect locked banks and access them anyway. This would explain why atomic operations on shared memory are not atomic with respect to non-atomic operations on the same address [9, Table 105]. Our formalization treats atomic operations on both global and shared memory as parts of the memory model for reasons of brevity. We are confident that the semantics of the PTX compiler’s emulation is equivalent to our formalization.

59 Program 4.5 defines the formal semantics of atomic operations on shared memory. In the first micro step, the program tries to acquire the lock for the accessed shared memory banks. If any of the banks are already locked, lock defers execution and therefore guarantees that only one single atomic operation operates concurrently on the same address. Additionally, no deadlocks can occur because either all or none of the accessed banks are locked. Once the program has acquired the lock, it reads the value, computes the reduction, and writes the result in three separate micro steps. Due to the locking, the program is atomic with respect to other atomic operations accessing the same shared memory locations, but a non-atomic store is indeed able to write a new value to the requested address at some point in between. Once the result is written to shared memory, the banks are unlocked and the program yields again. Releasing and writing to the destination registers is done in the usual way.

1 writeToRegs; 2 yield; 3 releaseRegs; 4 end;

Listing 4.6: Memory program for writes to registers

An instance of program 4.6 is launched whenever threads update the value of a register, i.e. for all arithmetic, logic, and shift instructions supported by PTX. As the writes are performed by the decomposition function, there is not much work left for the program. It merely invokes the decomposition function and releases the blocked registers during a subsequent macro step.

4.6 Formal Semantics of the Memory Environment

Memory programs allow us to define the semantics of CUDA memory operations in isolation. They specify which operations to perform on caches and memory and in which order to perform them. Memory program statements are grouped into micro steps; in between two micro steps of some program, any number of other programs may execute any number of micro steps. In this section, we formalize the semantics of the memory environment as a whole. The memory environment either processes a micro step of some program, flushes one of the L1 caches, or evicts or writes-back a cache line of one of the caches. A macro step comprises several of those actions. The CUDA specification does not state in which order memory operations are processed. For instance, if a thread writes to some location l in global memory and reads location l at the next instruction, it is unclear whether the thread reads the original value or the value it has written one instruction ago. Although tests indicate that the newly written value would be fetched, we do not want to rely on this assumption and rather allow the memory environment to process all in-flight memory operations in any order. The only exception to this are volatile reads and writes. The PTX documentation clearly states that the .volatile specifier enforces sequential consistency [9, Table 84], i.e. all volatile operations of the same thread are processed in the order they are issued. To support this scenario, we introduce an abstract priority mechanism for memory operations. The operations with the highest priorities are always processed first, thus the order of volatile operations can be guaranteed. Once more information about CUDA’s memory model is available, this mechanism can also

60 be used to give some ordering guarantees of non-volatile operations. Priorities are compared using the abstract quasi-order .  π Priority ∈ A memory operation therefore consists of a memory program that defines its semantics, a program state the memory program operates on, and a priority that determines when the operation must be processed.

op ∈ MemOp = MemStm × Σ × Priority

We define the function highestPrio to check whether some priority exceeds or at least matches all of the priorities in a set of memory operations. Typically, there may be many different priorities for which this function returns true, because most likely non-volatile memory operations that were issued independently by distinct threads are equally prioritized. Again, we do not know the order and thus the prioritization of memory operations in the general case. Referring back to the example above, we do not know whether CUDA's memory model prioritizes a thread's write over the subsequent read of the same address to prevent a read-after-write hazard.

highestPrio : Priority × MemOp∗ → B
highestPrio(π, −→op) ⇔ ∀ (op . π′) ∈ −→op . π′ ≼ π

Transition system →M defines the semantics of the memory environment. It consists of two types of rules: The first type modifies the state of the memory environment according to the semantics of a memory program. The other type are silent rules, i.e. rules that may fire at any time without being explicitly invoked or requested. Examples for the latter type are evictions of L1 or L2 cache lines. The CUDA specification gives no hint as to when and how a cache decides to evict a cache line. Therefore, we have to assume that it can happen at any time. This is the reason why cache lines must be pinned when memory programs request a cache line whose data is being retrieved from a higher level of the memory hierarchy: Without the pinning, the cache line might be evicted before the requesting operation gets a chance to read it. In that case, the program is stuck unless some other operation requests the same cache line again and the stuck program gets a chance to retrieve the value before it is evicted again. The pinning mechanism avoids this issue.
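A minimal Haskell sketch of the highestPrio check, assuming the abstract quasi-order ≼ can be modelled by an ordinary Ord instance on priorities and reducing a pending memory operation to a placeholder paired with its priority:

type Priority = Int
type MemOp    = (String, Priority)   -- (program name, priority), a placeholder

-- True iff the given priority is at least as high as that of every pending operation.
highestPrio :: Priority -> [MemOp] -> Bool
highestPrio p ops = all (\(_, p') -> p' <= p) ops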

Configurations : ⟨η | −→op⟩ ∈ MemEnv × MemOp∗
Transitions    : ⟨η | −→op⟩ →M ⟨η′ | −→op′⟩

The (step) and (end) rules select some memory operation which has one of the highest priorities and execute the next micro step of the operation's program. If the program ends, the memory operation is completed and is therefore removed from the list of in-flight operations.

         ⟨stm, η, σ⟩ −→Op_yield ⟨stm′, η′, σ′⟩
(step)   ─────────────────────────────────────────────────────────   if highestPrio(π, −→op)
         ⟨η | (op . stm, σ, π) ◦ −→op⟩ →M ⟨η′ | (op / stm′, σ′) ◦ −→op⟩

         ⟨stm, η, σ⟩ −→Op_end ⟨η′, σ′⟩
(end)    ─────────────────────────────────────────   if highestPrio(π, −→op)
         ⟨η | (op . stm, σ, π) ◦ −→op⟩ →M ⟨η′ | −→op⟩

61 The remaining rules are all silent rules which can be chosen at any time, provided their side conditions hold. All of the streaming multiprocessors’ L1 caches as well as the L2 cache are allowed to evict some non-dirty cache line at any time in accordance with the eviction policy.

(evict ) η . −−−−−−−→procMem op M η / −−−−−−−→procMem[pid −−−−−−−→procMem / −→cl] op L1 h | −→i → h 7→ pid | −→i if canEvict(l1Cache(η, pid), pAddr) −→cl = evict(l1Cache(η, pid), pAddr) −→cl , ε ∧ ∧

(evict ) η . l2Cache op M η / −→cl op L2 h | −→i → h | −→i if canEvict(l2Cache, pAddr) −→cl = evict(l2Cache, pAddr) −→cl , ε ∧ ∧ Similarly, one of the L1 caches can write back a value to the L2 cache at any time. However, a write-back is only possible if the write to the L2 cache succeeds instantaneously. Otherwise, a write-back can only be performed after evicting an eligible line from L2 first. A write-back of some L2 cache line is always possible as writes to global or local memory are guaranteed to be instantaneous.

(wb ) η . l2Cache, −−−−−−−→procMem op M L1 h | −→i → η / −→cl2, −−−−−−−→procMem[pid −−−−−−−→procMem / (pAddr, cw, false, true, −−→mid, ec) −→cl ] op h 7→ pid ◦ 1 | −→i if (pAddr, cw, true, true, −−→mid, ec) −→cl = l1Cache(η, pid) ◦ 1 −→cl2 = write(l2Cache, pAddr, cw) −→cl2 , ε ∧ ∧

(wb ) η . (pAddr, cw, true, true, −−→mid, ec) −→cl op M L2 h ◦ | −→i → writedevice(η, ssd, pAddr, MaxMemOpSize, cw) / (pAddr, cw, false, true, −−→mid, ec) −→cl −→op h  ◦ | i  .global if pAddr GlobalAddr where ssd =  ∈ .local if pAddr LocalAddr ∈ Since the L1 caches are not kept coherent, reading a global memory address might return stale data from the cache as explained in section 2.3.2. However, CUDA does guarantee that global L1 cache lines are invalidated between “dependent grids” [9, Table 81], hence the second grid indeed fetches the correct value from either L2 or DRAM. Still, it is underspecified when and how global L1 cache lines are evicted, thus we use a silent rule to support this scenario. The (flush) rule selects one of the L1 caches and sets all of its global lines to the default state if none of the global cache lines are dirty. Local cache lines remain unchanged.

(flush)   ⟨η . →procMem | →op⟩ →M ⟨η / →procMem[pid ↦ →procMem_pid / →cl_l ∘ init^(0..|→cl_g|)] | →op⟩
          where →cl = l1Cache(η, pid) ∧ →cl_g = { cl ∈ →cl | cl_pAddr ∈ GlobalAddr } ∧ →cl_l = { cl ∈ →cl | cl_pAddr ∈ LocalAddr }
          if ∀ cl ∈ →cl_g . ¬ cl_dirty

A macro step is a sequence of rule applications of transition system →M. For some memory environment η and some in-flight memory operations →op, a macro step is the sequence of transitions ⟨η, →op⟩ →M* ⟨η′, →op′⟩.

4.7 Summary

To cope with the complexities of CUDA’s memory model, we formalize the semantics of a declarative programming language on which we in turn base our formal definition of memory operation semantics. The memory model is currently underspecified in most areas. For instance, in almost all cases the value returned to a thread that reads a location l is unknown if some other thread or even the same thread writes to l at some point in the program. Other examples of undefined behavior are the coalescing of memory operations, the interplay of caches and volatile operations, and how atomic operations on global memory affect the L1 caches.

Furthermore, the memory programs defined in this chapter might yield execution too often and therefore allow memory programs to be deferred longer than would be possible on the hardware. Because of this, memory operations that are completely independent on real hardware might have negative effects on one another in our formalization of the memory model. For example, the way that memory program 4.1 defines the semantics of global writes with the .wb cache operation makes it possible that a read of global address a loads the value into the L1 cache after a store to address a has executed its first micro step. In that case, the old value is stored in the L1 cache and the new value might never be returned to any threads of the same streaming multiprocessor that request the value at address a. We do not know whether this issue might occur on real hardware. Even worse, our formalization most likely does not guarantee liveness, i.e. a program might never be able to execute its remaining micro steps. Future work would probably benefit the most from trying to fill in the missing gaps and giving stronger, more reliable guarantees for some of the most significant problem cases listed throughout this chapter.

With the release of CUDA 3.2 RC, Nvidia provided another detail about memory transaction sizes in [10, G.4.2]. In contrast to what we assumed before the release of version 3.2 RC, only memory accesses that are cached in the L1 cache are serviced with 128 byte transactions, whereas memory accesses that are only cached in L2 are serviced with 32 byte transactions to reduce over-fetch. While our formalization circumvents this issue by having each memory operation specify both its base address and the access size, there might be a problem if a 128 byte L1 access is split into four or fewer 32 byte L2 accesses. However, aside from possible performance gains, the semantic behavior should be the same whether there is one 128 byte access or up to four 32 byte accesses of the L2 cache.

Currently, the formalization of the memory environment does not support the reconfigurable L1 cache and shared memory sizes offered by Fermi-based GPUs. Since the same on-chip memory is used to implement both the L1 cache and shared memory, a streaming multiprocessor can be configured to use 16 KByte of shared memory and 48 KByte of L1 cache or vice versa. The host program decides which configuration to use. However, the configuration selected by the host program is only a suggestion to the CUDA runtime according to the reference manual [25, 4.7.2.3]. Consequently, the runtime is free to choose either of the two configurations even though the host program explicitly specified one of them. We assume this is at least in part due to the fact that the host program can choose different configurations for each kernel.
If a streaming multiprocessor executes thread blocks of different kernels with different memory configurations, it has to choose one of them. We currently avoid this issue altogether by assuming fixed L1 cache and shared memory sizes.

5 Formal Semantics of CUDA’s Model of Computation

The formal definition of CUDA’s program semantics reflects CUDA’s thread hierarchy introduced in section 2.3.1. The semantics of each layer in the hierarchy is built on top of the semantics of the lower layers. Thus, we first define the semantics of a single thread in section 5.3 and formalize the semantics of a single warp on top of that in section 5.4. The semantics of thread blocks presented in section 5.5 rests upon the warp semantics. Similarly, grids build on thread blocks, contexts on grids, and the semantics of the GPU as a whole depends on contexts as presented in sections 5.6, 5.7, and 5.8, respectively. The connection between the formalization of the program semantics and the memory environment is established at the device level. A function is handed down to the threads that allows them to instantaneously retrieve a value from a register. To issue memory requests, a thread that executes a memory instruction creates a memory action message with all required information that travels up the hierarchy. At the device level, the action is given to the memory environment, which instantiates a corresponding memory program to service the request. At some levels of the hierarchy, actions are merged or amended with further data that is required to process the request but is unknown at the thread level; for instance, the issuing thread’s processor index is amended at the thread block level. Actions are also used to handle warp level branching and thread block synchronization. In the next section, we define an extended subset of PTX on which we base the formalization of the semantics. The subsequent sections each define the semantics of one level of the thread hierarchy. We also present a formalization of the warp branching algorithm and the mechanism used to handle thread block synchronization in this chapter.

5.1 Formalization of PTX

We do not define a formal semantics for full PTX as specified in [9]. Rather, we restrict the discussion to an extended subset of the PTX programming language on which we base the formalization of the semantics of CUDA’s model of computation. Basically, we formalize the domain of program environments ProgEnv and informally describe how a real PTX program can be converted into its mathematical representation in the form of an element of ProgEnv. For the purpose of defining a formal semantics, this approach has several advantages over basing the semantics on PTX directly as the following sections explain. We have to define a couple of basic domains that we need in the subsequent sections to formalize PTX. First, we introduce the following abstract domains to denote constants, variables, registers, labels, and function names in PTX programs.

c ∈ Const        v ∈ Var        r ∈ Reg ⊆ Var        f ∈ FuncName        l ∈ Label

We support only a few of the special registers listed in [9, 9], namely the thread index %tid, the thread block size %ntid, the thread block index %ctaid, and the grid size %nctaid. The rest of the special registers exposed by the hardware are mostly used for debugging or diagnostic purposes and are thus less relevant to our formalization.

Dim ::= x | y | z
SReg ::= %tid.Dim | %ntid.Dim | %ctaid.Dim | %nctaid.Dim
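To illustrate how these special registers are typically used, the following fragment of ordinary PTX computes a one-dimensional global thread index; it is a minimal sketch with hypothetical register names and is not taken from the thesis.

    .reg .u32 r1, r2, r3, r4;
    mov.u32 r1, %tid.x;         // thread index within the thread block
    mov.u32 r2, %ntid.x;        // thread block size
    mov.u32 r3, %ctaid.x;       // thread block index within the grid
    mad.lo.u32 r4, r3, r2, r1;  // global index = %ctaid.x * %ntid.x + %tid.x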

pc ∈ ProgAddr ⊆ DataWord

When a thread accesses memory, it does so using virtual memory addresses of its context’s virtual address space. Like program addresses, virtual memory addresses are just a special type of data that we define as follows.

VirtMemAddr ⊆ DataWord

The value stored by a variable is either a register address if the variable represents a register or a virtual memory address if the variable points to either shared, global, local, or constant memory.

VarValue = RegAddr ∪ VirtMemAddr

In the next section, we take a look at the PTX instructions and features supported by the program environments and explain why we add or omit certain instructions or features. Subsequently, we present the formal definition of the domain of program environments in section 5.1.2. We informally explain how a PTX program is converted into a program environment in section 5.1.3.

5.1.1 PTX Instructions and Features Included in the Formalization

The formal semantics presented in this chapter only supports a subset of the full instruction set of PTX as specified in [9]. Since the memory environment does not support texture memory, there is obviously no point in supporting instructions that deal with textures or surfaces. Also, we leave out all instructions that are only useful in debugging or diagnostic scenarios. We abstract from data types, which are explicitly specified by most instructions of PTX. For example, we do not differentiate between arithmetic instructions operating on integer or floating point data. Only instructions that deal with memory access need to know the size of the accessed memory; nevertheless, memory operations do not care about the actual type of the data either. We introduce the meta-operation dop that represents all of PTX’s operations that take some number of operands, perform some operation, and return the resulting data word. We do not model the arithmetic, logic, and shift instructions individually,

because the basic structure of the rules dealing with those instructions is always the same. Thus, we restrict the discussion to the abstract data operation dop for reasons of brevity. We do not support any of PTX’s syntactic sugar, including vector variables, arrays, or the parameterized variable declaration syntax that automatically introduces the variables r1 ... rn. Furthermore, we replace commas between parameters of instructions with spaces to better cope with optional parameters.

In accordance with patent [26], we add the control flow instructions brk, preBrk, preRet, ssy, and sync to the instruction set. The reasons why we need those instructions explicitly in the program are discussed in section 5.4.1; usually the PTX compiler injects those instructions automatically into the program without the programmer’s knowledge. The rest of the branching instructions (ret, exit, bra, and call) are regular PTX instructions. However, in our formalization the call instruction does not support any function call parameters or return values. We deal with those in a preprocessing step when loading a program onto the GPU. Further details about this topic can be found in section 5.1.3. Moreover, we do not allow the use of the .uni specifier that may optionally be used in the case that a branch instruction is guaranteed to be non-divergent for all threads of a warp. However, .uni is solely relevant for performance optimization. Since non-divergent control flow instructions without the .uni specifier are properly handled by our formalization of the warp level branching algorithm, we do not include this specifier for reasons of brevity.

The Fermi architecture introduced generic addressing, which makes it possible to omit the state space when reading or writing global, local, or shared memory. If generic addressing is used, the GPU figures out the type of the memory referenced by the memory operation on its own. While this is an important feature to fully support pointers in CUDA-C, from the point of view of our semantics it only means that some value is subtracted from the specified memory address. Thus, we do not offer generic addressing and require that the state space always be specified on all load and store instructions.

In addition to the aforementioned instructions, we support setp to set the value of a predicate register based on the output of some comparison applied to some operands. Available comparison operators are:

comp ∈ Comparison ::= .eq | .ne | .lt | .le | .gt | .ge

Memory barriers force a thread to idle until all outstanding memory operations can be seen by all threads on the thread block, global, or system level.

mbl ∈ MemBarLevel ::= .cta | .gl | .sys

Warp level votes decide whether the predicates of all or any of the threads participating in the vote are true or whether all threads voted uniformly. Furthermore, the ballot mode constructs a data word where the bit corresponding to a thread’s position in the warp is set to the value of the thread’s predicate. A thread can use logic and shift instructions to use the result of a ballot in a meaningful way.

vm ∈ VoteMode ::= .all | .any | .uni | .ballot

When using the bar instruction, the barrier type specifies whether it is a thread block synchronization point with or without a reduction operation and whether the threads executing the instruction wait for the barrier to complete or may continue execution immediately, only

marking their arrival at the barrier. Supported barrier reduction operations are conjunction, disjunction, and population count, i.e. the number of threads whose predicate is true.

bt ∈ BarrierType ::= .sync | .arrive | .red
br ∈ BarrierRed ::= .and | .or | .popc

Cached memory operations specify the cache operation that should be used to service the request. Furthermore, the .volatile specifier causes a volatile read or write; we support volatile memory operations even though the semantics of volatile memory requests remain undefined for the reasons mentioned in chapter 4.

cop ∈ CacheOp ::= .ca | .cg | .cs | .lu | .cv | .wb | .wt
volatile ∈ Volatility ::= .volatile

Atomic memory operations read the value at the specified address, perform some reduction operation and write the computed value back to the same location. Supported reduction operations include the following:

aop ∈ AtomicOp ::= .and | .or | .cas | .exch | .add | .min | .max

The grammar of the supported instruction set is listed below. Variable declarations, function declarations, and function pointer declarations are not part of the instruction set, as those are dealt with in a preprocessing step outlined in section 5.1.3. A guard can be defined for each instruction; a thread only executes an instruction if the guard evaluates to true. In our formalization, guards are handled separately.
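For orientation, here are a few instructions of the kind covered by the grammar below, written in ordinary PTX syntax; the register and variable names are purely illustrative and the fragment is not taken from the thesis.

    setp.ge.u32 p, r1, r2;                  // set predicate p to r1 >= r2
    @p ld.global.cg.u32 r3, [addr];         // guarded load, cached in L2 only
    atom.shared.add.u32 r4, [counter], r3;  // atomic add on shared memory
    vote.ballot.b32 r5, p;                  // warp level ballot over predicate p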

Instr ::= dop Reg Expr*
        | setp Comparison Reg Expr Expr
        | mov Reg Expr
        | ld Volatilityε StateSpace CacheOpε MemOpSize Reg Expr
        | st Volatilityε StateSpace CacheOpε MemOpSize Expr Reg
        | sync | brk | ret | exit
        | bra Expr | call Expr
        | preBrk Label | preRet Label | ssy Label
        | membar MemBarLevel
        | atom StateSpace AtomicOp MemOpSize Reg Expr Expr Exprε
        | vote VoteMode Reg Expr
        | bar.sync Expr Exprε
        | bar.arrive Expr Expr
        | bar.red BarrierRed Reg Expr Exprε Expr

5.1.2 Definition of Program Environments

Program environments comprise several other environments which we define first. These environments are functions that return ⊥ if they are passed undefined or unused input.

⊥ serves as a sanity check because it causes the semantics to get stuck whenever it tries to do something undefined. The instruction environment maps a program address to its corresponding instruction. If a program ever branches to an undefined program address, the program’s behavior is generally unpredictable: it might crash instantly, it might execute instructions of some other program, or it might try to execute data. However, the way we formalize PTX guarantees that labels always point to valid program addresses. Consequently, only indirect branches can lead to undefined behavior, but that can also happen to real-world PTX programs.

InstrEnv = ProgAddr → Instr⊥

The guard environment returns an expression that represents the guard of the instruction at the specified program address. Expressions are defined in section 5.2; guards that evaluate to false cause a thread to skip the execution of the instruction.

GuardEnv = ProgAddr → Expr⊥

The label environment returns the program address the given label is an alias for. As mentioned above, the program address of a label is always well-defined, i.e. a valid program address.

LabelEnv = Label → ProgAddr⊥

The function environment returns the address of the first instruction of a function denoted by a certain function name.

FuncEnv = FuncName → ProgAddr⊥

Furthermore, we define the variable environment that maps a variable name to its value regardless of the current call stack location. Just like programs running on x86 CPUs, PTX programs have a call stack that identifies the local variables that should be accessed when a function reads or writes non-global data. For that, we define the domain of call stacks that keeps a stack of variable environments and only looks at the topmost of those when a function accesses locally defined variables. However, the call stack does not keep track of return addresses. Once a function returns, it is the responsibility of the warp level branching algorithm to return to the original function call site. This is in stark contrast to the way many x86 programming languages handle their call stack.

ν ∈ VarEnv = Var → VarValue⊥
κ ∈ CallStack = VarEnv*

The local environment returns the variable environment of a function denoted by the program address of its first instruction. According to [9, 5.1.5], only variables pointing to local memory defined locally in a function are allocated on the stack. Hence, the variable environment returned for a specific function only returns valid data for locally declared local memory variables, but not for locally defined registers or variables of other state spaces. When a thread calls a function, the semantics uses the local environment to retrieve the variable environment that contains the function’s local variables. This environment is pushed onto the call stack and popped off the stack once the function has completed execution.

LocalEnv = ProgAddr → VarEnv⊥
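As a small illustration of what ends up in the local environment, consider the following hypothetical PTX function, which is not taken from the thesis: only the stack-allocated local memory variable tmp is recorded in the function's local variable environment, whereas the register r1 is handled by the program environment's global variable environment.

    .func accumulate()
    {
        .local .u32 tmp;        // stack-allocated, tracked by the local environment
        .reg   .u32 r1;         // not stack-allocated
        ld.local.u32 r1, [tmp];
        add.u32 r1, r1, 1;
        st.local.u32 [tmp], r1;
        ret;
    }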

We introduce the abstract domain of program indices to uniquely identify a program environment.

progid ∈ ProgIdx

A program environment incorporates all of the aforementioned environments. It is the mathematical representation of the extended subset of the PTX programming language that we define the formal semantics for. The program environment’s variable environment handles all globally defined variables and all variables that are defined locally in a function but do not point to a local memory location. As mentioned above, these variables are not stored on the stack. It is not possible to store the call stack in the program environment, because its state might differ for each thread during the execution of the program. Hence, the threads manage their call stacks independently.

φ ∈ ProgEnv = InstrEnv × GuardEnv × LabelEnv × FuncEnv × LocalEnv × VarEnv × ProgIdx

There can be several entry points defined for a single PTX program. After the host has loaded a program onto the GPU, it can launch any number of kernels using any of the entry points defined in the program with any grid and thread block dimensions. The grid and block dimensions are limited by the capabilities of the underlying GPU. However, at least one thread must be launched to execute a program.

GridSize = ∏_{dim ∈ Dim} {1, . . . , MaxGridSize_dim}
BlockSize = ∏_{dim ∈ Dim} {1, . . . , MaxBlockSize_dim}

The amount of registers used per thread and the amount of shared memory used per thread block affect how many warps and thread blocks can be allocated concurrently on a single streaming multiprocessor. The lower the amount of allocated warps, the more likely it is that instruction latencies cannot be completely hidden by executing other non-blocked warps.

RegsPerThread ⊆ ℕ        SharedMemPerBlock ⊆ ℕ

We introduce the domain of program configurations that store which entry point to call, the number of thread blocks to launch, the number of threads and the amount of shared memory per thread block as well as the amount of registers required by each thread. All threads start executing the program at the first instruction of the kernel function specified by their program configuration. The Giga Thread scheduler uses the program configuration to issue thread blocks to streaming multiprocessors with available execution capacity.

ξ ∈ ProgConf = FuncName × GridSize × BlockSize × RegsPerThread × SharedMemPerBlock

The following two axioms hold for all program configurations ξ . gridSize, blockSize; the grid and the block dimensions of a kernel must not exceed the maximum allowed values:

gridSize_x · gridSize_y · gridSize_z ≤ MaxBlocksPerGrid
blockSize_x · blockSize_y · blockSize_z ≤ MaxThreadsPerBlock
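For example, a program configuration with blockSize = (32, 4, 2) launches 32 · 4 · 2 = 256 threads per thread block, which satisfies the second axiom on hardware where MaxThreadsPerBlock is at least 256; on Fermi-based GPUs the compute capability fixes this maximum at 1024 threads per block. The block size (32, 4, 2) is purely illustrative and not taken from the thesis.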

5.1.3 Representation of PTX Programs as Program Environments

Obviously, it is only possible to construct a program environment for PTX programs that solely use the instructions found in our formalized instruction set. In this section, we informally outline how the process of transforming a PTX program into a program environment works. At the same time, we discuss some PTX complexities that we get rid of during the process. The process is split into two major steps. The first one applies some code transformations to the PTX program in order to facilitate the second step, where the different environments that make up the program environment are created. Each of the two steps consists of several sub-steps.

Code Transformations

The purpose of the code transformations is to normalize the structure of a PTX program. The construction of the various environments is facilitated because the normalization eliminates special cases or avoids certain PTX complexities altogether. We illustrate the code transformations with the help of program 5.1, which is a valid program that the PTX compiler accepts even though it does nothing meaningful. Applying the code transformations to program 5.1 results in the transformed version of the program shown in listing 5.2; the transformed program is no longer a valid PTX program. We use the domain Prog to designate transformed PTX programs.

1  .global .u32 var = 10 / 2 + 3;
2
3  .func function(.param .f64 p)
4  {
5      .reg .f64 a;
6      ld .param .f64 a, [p];
7  }
8
9  .entry transformationKernel()
10 {
11     .reg .f64 var;
12     .reg .pred p;
13
14     @p call function, (var);
15 }

Listing 5.1: The PTX program used to illustrate the effects of the code transformations

Step 1. The first step is to check whether the given PTX program is actually a valid program, i.e. whether it is syntactically correct and whether all referenced variables, labels, and functions are defined. Furthermore, type checks are performed and instruction parameters are validated as far as possible. This can be done by passing the given PTX program to Nvidia's PTX compiler. If the compiler does not produce any errors, the program is valid and we can continue to step 2.

Step 2. PTX programs support complex arithmetic operations on constant values. This step replaces values that can be evaluated at compile-time by their computed result.

1  .global var0 = 8;
2  .reg %stackBase;
3
4  .func function(.local p0) // p0 has offset 0
5  {
6      .reg a, addr;
7      add addr, %stackBase, [p0];
8      ld .local 64 a, [addr];
9      ret;
10 }
11
12 .entry transformationKernel()
13 {
14     .reg addr, var1, p1;
15     .local l_var1; // offset 0
16
17     @p1 add addr, %stackBase, [l_var1];
18     @p1 st .local 64 [addr], var1;
19     @p1 add %stackBase, %stackBase, 0; // The kernel does not use any local variables, thus the stack base address does not change
20     @p1 preRet;
21     @p1 call function;
22     @p1 sub %stackBase, %stackBase, 0;
23     exit;
24 }

Listing 5.2: Program 5.1 after the application of the code transformations

The result is an element of the domain Const.

Step 3. According to the PTX specification, local memory variables declared within a function body are allocated on the stack [9, 5.1.5]. Semantically, we represent the stack with the CallStack domain defined in section 5.1.2, which manages a stack of local variable environments. To keep things simple, the semantics does not attempt any kind of automatic register preservation when a call or ret instruction is encountered. Instead, we rely on a code transformation to take care of register preservation as well as function parameters and return values. After this step, implicit register preservation is made explicit by a series of regular load/store instructions before and after a function call. Function parameters and return values are handled by the local variable sharing mechanism outlined below. Consequently, the rules of the formal semantics do not have to deal with those two issues; the only thing done by the rules is to push and pop variable environments onto and off the threads' call stacks.

When a function calls another function, some registers might have to be saved. The called function might reuse registers of the caller function, thus the original state of the registers might have to be restored if the caller needs the original value again. In this step, all registers whose state must be preserved are spilled to local memory before the function call and the values in local memory are copied back to the registers once the function call completes. Local memory variables defined in the function's body are allocated on the stack. When we construct the variable environments in later steps, we assign fixed addresses to all variables found in the program, including stack-allocated variables, to avoid special-casing those. To ensure stack semantics for stack-allocated variables, we introduce a new register in the

1  // Called function; offset of r is 0, offset of p is 8 byte
2  .func (.param .u64 r) function(.param .u32 p) { ... }
3
4  // Before the code transformation
5  .entry kernelEntryPoint()
6  {
7      .reg .pred p;
8      .reg .u64 rr;
9      .local .u128 x; // Unused, introduced for illustration purposes
10     .param .u64 r;
11     .param .u32 a;
12
13     st .param .u32 [a], 9;
14     @p call (r), function, (a);
15     ld .param .u64 rr, [r];
16 }
17
18 // After the code transformation
19 .reg .u64 %stackBase; // Added by code transformation
20 .entry kernelEntryPoint()
21 {
22     .reg .pred p;
23     .reg .u64 rr;
24     .local .u128 x; // Offset 0 byte
25     .local .u64 r; // Offset 16 byte
26     .local .u32 a; // Offset 24 byte
27
28     add .u64 rr, %stackBase, [a]; // Computes absolute address of a
29     st .local .u32 [rr], 9;
30     @p add .u64 %stackBase, %stackBase, 16; // Now r lies at offset 0
31     @p call function; // Parameters are removed
32     @p sub .u64 %stackBase, %stackBase, 16; // Now x lies at offset 0 again
33     add .u64 rr, %stackBase, [r]; // Computes absolute address of r
34     ld .local .u64 rr, [rr];
35 }

Listing 5.3: Illustration of the stack construction and deconstruction code transformation

program named %stackBase; if there is already another register with the same name, the other register is renamed. %stackBase stores the address of the current stack frame. The address is increased before a call instruction and reset to the original value after a call instruction. Program 5.3 shows the effects of this step for a simple function call. Figure 5.1 shows the stack layout before and after the function call of program 5.3. The initial value of %stackBase depends on the amount of globally declared local memory variables in the program and their sizes. We assume that we later assign relative local memory addresses in the following way to locally-scoped local memory variables: The address of the function’s return parameter, if there is one, has offset 0. The offset is relative to the current value of %stackBase. The function’s parameters are located at successive locations, taking into account the size of the parameters. So if a function returns a 64 bit value, the offset of the first parameter is 8 byte. When a function calls another function, it can either use registers to pass input parameters and return values, or the parameter state space. In the latter case, this step ensures that

Offset 0              Offset 16    Offset 24
... x  x  x  x        r  r         a  ...
^
%stackBase

                      Offset 0     Offset 8
... x  x  x  x        r  r         a  ...
                      ^
                      %stackBase






Figure 5.1: Stack layout before (top) and after (bottom) the function call of program 5.3

locally-scoped local memory variables are created for the return and input values such that the layout of those variables matches the rules defined above. By incrementing %stackBase to the address of the return parameter of the called function, the called function can again access its return parameter with offset 0 and all of its parameters with their relative offsets. Of course, the variables used to store the return and input values of the called function must have higher offsets than all other local variables of the caller function. For guarded function calls, the same guard is used to enable or disable the execution of the stack management instructions added in this step. That way, if a thread does not participate in the function call, it does not unnecessarily perform all those memory operations. However, that is just an optimization.

Parameters of kernel functions, i.e. functions defined with the .entry specifier, are special in that the parameters lie in constant memory instead of local memory. Kernel functions cannot call themselves recursively and they also cannot be called by other PTX functions. The CUDA runtime ensures that the kernel parameters point to the appropriate addresses in constant memory where the input values specified by the host program are stored. This step also replaces the .param state space of all parameters with either .local, .const, or .reg depending on the actual location of the parameter. Load and store instructions operating on parameters are consequently replaced with regular register accesses or with accesses of local or constant memory. Except for kernel parameters that are guaranteed to lie in constant memory, the PTX compiler tries to use the most suitable state space for function parameters. Typically, this means passing parameters in registers whenever possible. We assume the compiler uses the correct state spaces (i.e. it does not pass parameters via registers if the function is called recursively) but are otherwise not interested in the optimizations performed by the compiler.

Step 4. Data type specifiers on memory operation instructions are replaced with the corresponding memory operation size. All data type specifiers are removed from all other instructions. As mentioned in chapter 4, we assume that all registers are sufficiently large to hold all values that are passed to and returned from any of the instructions. Memory operations, on the other hand, need to know the size of the accessed memory.

Step 5. In this step, it is ensured that ret is the last instruction executed on any path in any function. The return statement is implicit in PTX if not present, so enforcing its presence does

73 not affect the semantics of the program but allows us to eliminate a special case. Afterwards, it is ensured that the last instruction executed by any thread is exit. Basically this means that all ret statements are replaced with exit statements in functions marked with the .entry specifier.

Step 6. Similar to what the Nvidia’s PTX compiler does, we inject the additional control flow instructions into the program that are required to handle divergent control flow. We do not know what algorithm the compiler uses to insert the instructions, but we have to assume they are inserted correctly in accordance with the branching algorithm specified in [26]. In contrast to what the compiler does, we do not use sync flags on instructions but rather a concrete sync instruction for reasons outlined in section 5.4.1. Consequently, when the compiler adds a sync flag to an instruction, a sync instruction should be added directly before the flagged instruction and the flag should be removed.

Step 7. Like C, PTX uses curly braces to create scopes of declaration [9, Table 97]. However, the precise semantics of the scoping rules are undocumented. Program 5.4 demonstrates some of the rules and their oddities. The register declared in line 3 is still visible in line 9. In line 12, a new register with name r is declared that hides the outer declaration. Consequently, the add instruction in line 15 operates on the newly declared r, curiously even though the thread definitely skipped line 12 because of the unconditional jump to the label SKIP. Furthermore, the bra statement in line 16 causes the control flow to jump to line 20 because inner label declarations hide outer labels of the same name. The branch instruction in line 17 shows that it is possible to jump out of a block, however, it is not possible to jump into a block; in particular, it is impossible to jump into a function body. We avoid the complexities

1  .entry kernelEntryPoint()
2  {
3      .reg .s32 r;
4      .reg .pred p;
5
6  BLOCK:
7      {
8          {
9              add .s32 r, r, r;
10
11             bra .uni SKIP;
12             .reg .u64 r;
13
14 SKIP:
15             add .u64 r, r, r;
16             @!p bra BLOCK;
17             @p bra CONTINUE;
18         }
19
20 BLOCK:
21     }
22
23 CONTINUE:
24 }

Listing 5.4: An incomplete PTX program that illustrates the scoping rules of PTX

of the undocumented scoping rules by renaming hidden variables and labels. In this step, it is ensured that variable and label names are unique; otherwise, they are renamed in accordance with the scoping rules. Hence, this step does not change the behavior of the program, but the transformed program no longer has variables or labels that hide an outer definition. This also applies to function pointer hiding. Interestingly, function pointer definitions behave like variable names even though a label is used to name them.

Program Environment Generation

In this second step, we define the function load that constructs a program environment φ for a PTX program generated by the above code transformations. The function is passed the memory environment and a transformed PTX program as well as the index used to identify the generated program environment. It returns an updated memory environment and the constructed program environment, unless there are any errors.

load : MemEnv × Prog × ProgIdx → (MemEnv × ProgEnv)⊥

Step 1. The instruction environment is constructed by assigning a program address to each instruction. The semicolon at the end of each instruction as well as the commas between instruction parameters are removed or replaced by spaces, respectively, as those are not part of the grammar of instructions. Additionally, the address operator [] is removed from all expressions, i.e. an expression [e] is replaced with e.

Step 2. The optional guard of each instruction is added to the guard environment using its instruction's program address. If an instruction has no guard defined, true is added to the guard environment.

Step 3. A program address is assigned to each label in order to construct the label environment. The label's address is the address of its subsequent instruction. The code transformations ensure that there is at least one instruction following each label. If a label at the end of a function body did not have any instructions following it in the original PTX program, the code transformations ensure that the label is now followed by either a ret or an exit. Additionally, the function environment is created by assigning the program address of the first instruction in a function's body to the function's name.

Step 4. The global variable environment and the local environment are constructed. For the former, a virtual memory address or register address is assigned to each variable or register declared by the program. This does not include locally-scoped local memory variables in function bodies, as those are used to build up the local environment. We do not assign absolute virtual memory addresses to stack-allocated variables, but offsets relative to the current stack frame address as explained above. This step must guarantee the correct parameter layout: The return parameter lies at offset 0, the input parameters at successive offsets. The local variables used for function calls must have higher offsets than all other local variables declared in the function that are used only internally by the function.

Step 5. Variables living in global or constant memory can be initialized at program load-time. We do this by directly modifying the memory environment, because memory initialization has to happen synchronously. Once load returns, threads can immediately start to execute the program and the updated memory environment guarantees that they read the initial values of load-time initialized variables in any case.

5.2 Expression Semantics

Some PTX instructions support parameters whose values are determined by evaluating an expression. Not all expressions are valid in all cases according to the specification of PTX [9], so we have to rely on the preprocessing mechanism outlined in section 5.1.3 to only generate valid program environments. Expressions support referencing a variable, a constant, a special register, a label, or a function as well as negation of a variable's value and the addition of a constant to a variable. We leave out the address operator [] because in our formalization, variables generally refer to an address of their storage location.

e ∈ Expr ::= Var | !Var | Const | Var + Const | Label | SReg | FuncName

The function E⟦−⟧ defines the semantics of expressions. It depends on several other functions that we specify before we give the formal definition of E⟦−⟧ itself. First of all, the abstract function C⟦−⟧ evaluates a constant expression and returns the corresponding data word.

C⟦−⟧ : Const → DataWord

Secondly, function V⟦−⟧ retrieves the value of a variable. If the variable is a register, V⟦−⟧ returns the address of the register, i.e. the location of the register in the register file of the calling thread's streaming multiprocessor. All other variables represent a virtual memory address of some type of CUDA memory. Furthermore, V⟦−⟧ takes the current call stack into account. If a variable is defined locally in the current function and it points to an address in local memory, the address stored in the call stack's variable environment is returned. Otherwise, the value stored in the program's global variable environment is retrieved. Since we assume variable names to be unique within a program, the case distinction used in the definition of V⟦−⟧ is well-defined.

76 The expression evaluation function ~  uses the program environment, an expression E − register look up function, and a call stack to compute the value of an expression. It returns an expression value which is obviously equal to the domain of data words; we define the domain of expression values anyway to highlight the types of values that an expression might evaluate to. ExprValue = VirtMemAddr DataWord ProgAddr = DataWord ∪ ∪ If a variable v is evaluated by ~ , v is either a register or another type of variable. If it is E − a register, that is v Reg Var, the evaluation of v with ~  returns a register address as ∈ ⊆ V − explained above. This register address is passed to the expression register look up function which returns the data word stored by the register in the case that the register is not blocked. If it is blocked, % returns , thus ~  returns and the program is stuck. Therefore, the ⊥ E − ⊥ formalization of the warp scheduler must guarantee that a warp is not scheduled if one of its threads would access a blocked register. If v is not a register, the virtual memory address that v represents is returned and since v is a meta-variable for the domain of variables, v cannot possibly refer to a special register. Hence, ~  is well-defined. E − ~  : Expr (ProgEnv ExprRegLookup CallStack) ExprValue E − → × × → ⊥ ~cφ%κ = ~c E C  %( ~vφκ) if v Reg ~vφ%κ =  V ∈ E  ~vφκ otherwise V  0 if ~vφ%κ , 0 ~!vφ%κ =  E E 1 otherwise ~v + cφ%κ = ~vφ%κ + ~cφ%κ E E E ~lφ%κ = φ (l) E labelEnv ~f φ%κ = φ (f ) E funcEnv ~sregφ%κ = %(sreg) E ~eφ%κ = otherwise E ⊥ For reasons of convenience, we define a lifted version of the expression evaluation function that returns ε if it is given an empty expression.

ε~  : Exprε (ProgEnv ExprRegLookup CallStack) ExprValueε, E − → × × → ⊥  ε if eε = ε ε~eεφ%κ =  E  ~e φ%κ otherwise E ε Similarly, we define a convenience function that converts the evaluated value of an ex- pression into a Boolean value. ~  uses the C semantics of Boolean expressions to decide B − whether true or false should be returned: If the expression evaluates to something other than zero, true is returned. Otherwise, false is returned.

~  : Expr (ProgEnv ExprRegLookup CallStack) B B − → × × → ⊥  true if ~eφ%κ , 0 ~eφ%κ =  E B false otherwise

77 5.3 Thread Semantics

In this section, we define the semantics of a single thread using two transition systems. The first one, presented in section 5.3.2, defines the semantics of instructions as they are executed by a thread. The second one, found in section 5.3.3, is based on the first one and checks whether the thread is actually allowed to execute the instruction; the thread could be disabled by the warp or the instruction’s guard could evaluate to false. A thread executes the instruction pointed to by its warp’s program counter. Additionally, the warp might decide that the thread is not allowed to execute the current instruction in the case that the warp is executing a path the thread has not taken. Other than that, the warp has no influence on the instruction semantics at the thread level. CUDA assigns a thread index to each thread that is unique within the thread’s thread block. We use three-dimensional thread indices; [3, 2.2] shows how a three-dimensional index can be converted into a one-dimensional one. Q tid ThreadIdx = dim Dim 0,..., MaxBlockSizedim 1 ∈ ∈ { − } At the warp level, threads are often identified by their warp lane. The warp lane specifies the position of a thread within a warp. For instance, the ballot mode of the vote instruction stores the Boolean value of a thread’s predicate inside a data word, where the position of the bit inside the data word is determined by the thread’s warp lane.

wl WarpLane = 0,..., WarpSize 1 ∈ { − } When a thread issues a memory request to the memory environment, the requested address is obtained by evaluating an expression possibly containing variables. However, addresses are assigned globally to the variables of a program, meaning that a variable referred to in an instruction actually references the same memory location for all threads executing the program. To solve this issue, we define distinct memory offsets for each thread. For registers, a unique register offset is assigned to each thread running on the same streaming multiprocessor. The address assigned to a register at program load-time is then added to each thread’s individual offset to compute a unique register address. Similarly, we assign a local memory offset to each thread. In contrast to the register offset, the local memory offset must be unique for all threads running on the GPU. This is because all threads access the same local memory in DRAM, whereas there is a unique register file for each streaming multiprocessor. We assume the offsets are given in the physical memory space, however, it could just as well be the virtual memory space. Obviously, threads also need an offset for shared memory accesses. However, shared memory information is stored at the thread block level only, so the thread has no knowledge of its thread block’s shared memory offset. Therefore, memory requests are augmented with the thread block’s shared memory offset at the thread block level. Global and constant memory accesses, on the other hand, do not require the use of offsets because all addresses in those types of memory are accessible by all threads belonging to the same context. Different contexts use distinct virtual memory spaces to avoid sharing global and constant physical memory addresses.

ro RegOffset PhysMemAddr ∈ ⊆ lo LocalMemOffset PhysMemAddr ∈ ⊆

78 The domain of threads consists of the thread’s index and warp lane as well as the offsets for register and local memory accesses. Additionally, a thread keeps track of its function call stack and the level of the memory barrier its waits for. A thread is not allowed to execute any further instruction if it executed a memory barrier that is not yet completed. The stored memory barrier value is reset to ε as soon as all of the thread’s pending memory operations surpass the specified barrier level as explained in section 5.8.1. Only if the memory barrier equals ε can the thread and consequently its warp be scheduled again.

θ Thread = CallStack ThreadIdx RegOffset LocalMemOffset MemBarLevel WarpLane ∈ × × × × ε × In each dimension, the thread index ranges between 0 and the block size for all threads θ . tid and all program configurations ξ . blockSize.

0 tidx < blockSizex 0 tidy < blockSizey 0 tidz < blockSizez ≤ ∧ ≤ ∧ ≤ 5.3.1 Thread Actions Threads emit thread actions that travel up the hierarchy to allow control flow instructions to be handled at the warp level, barrier instructions to be handled at the thread block level, and memory operations to be handled by the memory environment at the device level. Consequently, actions travel up to the point where they are handled and disappear afterwards; for instance, control flow actions are no longer relevant at the thread block or even grid level, thus the warp semantics do not pass them up the hierarchy after dealing with them. In some cases, a higher level might amend certain information to an action. For example, the shared memory offset is added to memory operation actions at the thread block level as mentioned above. We introduce the domain of thread actions that comprises control flow actions, memory actions, and communication actions that a thread may send up the hierarchy.

tact Act ::= FlowAct MemAct CommAct ∈ Θ | Θ | Θ Control flow actions inform the warp about the branching instruction that was executed and might also specify some additional information such as the target address of a jump or function call as well as the warp lane of the thread that generated the action.

tact FlowAct ::= sync flow ∈ break WarpLane return WarpLane exit WarpLane | | | ssy ProgAddr preRet ProgAddr preBrk ProgAddr | | | bra WarpLane ProgAddr | call WarpLane ProgAddr | Load, store, and atomic instructions generate memory actions that contain a copy of the information specified by the instruction. Using this information, the memory environment services the requests at the device level. Additionally,an instruction might not access memory but only write a new value to a register. We define a grammar for memory requests that we use to construct memory actions.

LdReq ::= RegAddr VirtMemAddr StateSpace MemOpSize Volatility CacheOp ← ε ε

79 StReq ::= VirtMemAddr StateSpace MemOpSize ExprValue Volatilityε CacheOpε AtomicReq ::= RegAddr VirtMemAddr StateSpace AtomicOp MemOpSize ExprValue ← ExprValueε RegReq ::= RegAddr ExprValue ← Req ::= LdReq StReq AtomicReq RegReq | | | Based on the data provided by a memory request, a memory action adds the thread’s index as well as the thread’s register and local memory offsets to the action. When the memory environment processes the action, it needs this data to calculate physical addresses and to coalesce requests of threads of the same warp into fewer memory transaction for performance reasons. Additionally, all registers that are somehow affected by the operation, i.e. registers that are source or destination operands of the instruction, are stored in the action. The memory environment blocks these registers for as long as the memory operation is processed. A thread cannot be scheduled if its next instruction requires access to a currently blocked register. We group all necessary information for memory actions in the domain MemInf that we can use for vote and barrier actions too. Even though votes and barriers do not generate memory operations at the thread level, some of them generate memory operations for the thread higher up the hierarchy at the warp or thread block level. There, all necessary memory operation data must be present, hence it must be stored in the original vote and barrier actions as shown later on.

MemInf ::= ThreadIdx RegOffset LocalMemOffset RegAddr∗

tactmem MemAct ::= Req MemInf ∈ Θ The abstract function regs retrieves all register addresses that are accessed by an instruction. Since register addresses are independent of the current call stack, the function can use the program environment’s global variable environment to retrieve the addresses of all of the registers referenced by a specific instruction.

regs : Instr ProgEnv RegAddr∗ × → When a thread encounters a barrier instruction, it generates an action to inform the thread block that it has arrived at the barrier. Each barrier is uniquely identified by its barrier index; the number of available barrier indices depends on the underlying hardware. Additionally, the program might optionally specify a thread count, i.e. the number of threads that has to arrive at the barrier before the barrier completes. The thread count must be a multiple of the warp size, because barrier arrival is handled at the warp level. If only one thread of a warp encounters a bar instruction, it is as if all threads of the warp encountered it.

barid BarrierIdx = 0,..., NumBarrierIndices 1 ∈ { − } tc ThreadCnt = n N n mod WarpSize = 0 ∈ { ∈ | } Communication actions either inform the thread’s warp about the values specified by a vote instruction or they are used by the thread’s thread block to update the state of a synchronization barrier. Both vote and barrier actions include all the necessary information to issue a memory operation for the thread at the warp or thread block level.

p Predicate = B ∈

80 tactvote VoteAct ::= vote RegAddr Predicate WarpLane MemInf ∈ tact BarrierAct ::= bar BarrierIdx ThreadCnt RegAddr Predicate MemInf bar ∈ Θ ε ε ε ε tactcomm CommAct ::= VoteAct BarrierAct ∈ Θ | Θ

5.3.2 Thread Rules

Transition system Θ defines the semantics of instructions at the thread level. In order to → be able to evaluate expressions, the program environment and an expression register look up function are given to the rules of Θ. Typically, a thread either generates a control flow, → memory, or communication action when it executes an instruction. There is only one case in which no action is generated.

Configurations : θ instr, φ, % Thread Instr ProgEnv ExprRegLookup, h | i ∈ × × × θ Thread ∈ Θ Transitions : θ instr, φ, % θ0 h | i −−−→tactε

The (dop) rule handles all operations that operate on data words. This includes addition, multiplication, bit shifting, bitwise and, calculation of the reciprocal square root, and many more. The rule is based on the abstract function dop that computes the resulting data word in accordance with the semantics of the corresponding operation specified in [9, 8.7.1, 8.7.2, and 8.7.4]. The expression value returned by the dop function is stored in a memory action that is passed up the hierarchy together with other necessary information to process the action at the device level. The expressions e1 ... en are evaluated using the expression evaluation function ~ . The address of the target register is retrieved by invoking the function ~ . E − V −

(dop) θ . κ, tid, ro, lo instr . dop r e ... en, φ, % h | 1 i Θ θ −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→~rφκ dop( ~e1φ%κ,..., ~enφ%κ) tid ro lo regs(instr,φ) V ← E E A bra instruction has no meaning at the thread level; all branching related decisions are made exclusively at the warp level. However, the warp’s branching algorithm needs to know the target address of the branch that might differ for each thread taking the branch. Together with the warp lane, the target program address is stored in a control flow action that is subsequently processed by the warp. The expression e is either a label, in which case ~  returns a valid program address because we only consider valid PTX programs, or a E − register. If the value stored in the register is not a valid program address, program behavior is undefined.

(bra) θ . κ, wl bra e, φ, % Θ θ h | i −−−−−−−−−−−→bra wl ~eφ%κ E A function call is similar to a branch instruction in that the actual jump to the first instruction of the function is performed at the warp level. The expression e is either a valid function name, in which case ~  returns a valid program address, or a register storing the first address E − of a function with a matching signature. Otherwise, program behavior is unpredictable. Additionally, the local variable environment of the called function is pushed onto the thread’s

81 call stack. Return addresses are also kept track of at the warp level instead of individually for each thread.

Θ (call) θ . (κ . ~ν), wl call e, φ, % θ / φlocalEnv( ~eφ%κ) :: ~ν h | i −−−−−−−−−−−−→call wl ~eφ%κ h E i E The following control flow operations also have no meaning at the thread level and are only used to inform the warp which threads actually executed the instructions. Only labels are valid for ssy, preBrk, and preRet instructions because all threads of the same warp must agree on the address specified by these instructions. The semantics of the control flow instructions are explained in greater detail in section 5.4.1.

(ssy) θ . κ ssy l, φ, % Θ θ h | i −−−−−−−−−→ssy ~lφ%κ E (preBrk) θ . κ preBrk l, φ, % Θ θ h | i −−−−−−−−−−−→preBrk ~lφ%κ E (preRet) θ . κ preRet l, φ, % Θ θ h | i −−−−−−−−−−−→preRet ~lφ%κ E (sync) θ sync, φ, % Θ θ h | i −−−→sync (break) θ . wl break, φ, % Θ θ h | i −−−−−−→break wl (exit) θ . wl exit, φ, % Θ θ h | i −−−−−→exit wl Returning from a function and jumping back to the call site is handled at the warp level as mentioned above. However, at the thread level the topmost local variable environment must be popped off the call stack.

(return) θ . (κ . ν :: ~ν), wl ret, φ, % Θ θ / ~ν h | i −−−−−−−→return wl h i The vote instruction sends a vote action to the thread’s warp. The result of the vote is computed by the warp and stored in destination register r. The rule evaluates the voting expression and stores it as a Boolean value inside the action. The memory actions to store the result of the vote in the destination registers specified by the threads are generated at the warp level, hence the warp needs to know the indices and register offsets of the threads participating in the vote.

(vote) θ . κ, tid, ro, lo, wl instr . vote vm r e, φ, % Θ θ h | i −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→vote ~rφκ ~eφ%κ wl tid ro lo regs(instr,φ) V B A memory barrier or memory fence causes the thread to wait for all prior memory opera- tions to be performed at a certain level. The thread is allowed to resume execution once all prior writes are visible to all threads of the specified level and all prior reads can no longer be affected by writes of another thread at the specified level. For instance, when a thread continues execution after a memory fence of the thread block level, we know that all other threads of the same block will subsequently read the values previously written by the thread. Furthermore, all pending read operations by the thread can no longer be affected by a write of some other thread of the block. It is unspecified when exactly a memory operation is no longer affect by another one, hence memory barriers are not included in our formalization of

the memory environment. To support memory barriers in the program semantics, we store the level of the memory barrier a thread waits for. As long as the stored value is not ε, the thread cannot be scheduled because a prior memory fence has not yet completed. At the device level, we define an abstract function that checks whether a thread's memory fence is completed. If so, the function resets the thread's memory barrier level to ε, which allows the warp scheduler to schedule the thread's warp again. The (membar) rule is the only rule that does not generate a thread action.

(membar)  ⟨θ | membar mbl, φ, %⟩  →Θ  ⟨θ / mbl⟩

A thread block barrier is a synchronization point for all threads of the same thread block. Once a thread reaches a barrier, it is not allowed to continue program execution until a specific number of threads of the same block has reached the barrier instruction as well. Barriers are identified by an index, and the number of threads participating in the barrier can optionally be specified. A thread that processes a bar.sync or bar.red instruction automatically establishes a memory barrier of the .cta level. The thread is not allowed to resume execution as long as either of the two barriers is not yet completed. Both the memory barrier and the synchronization barrier are dealt with at higher levels of the hierarchy and have no meaning at the thread level. If the evaluation of expression e does not yield a valid barrier index, the program is stuck.

(barsync)  ⟨θ . κ | bar.sync e e′ε, φ, %⟩  ─[ bar E⟦e⟧φ%κ Eε⟦e′ε⟧φ%κ ]→Θ  ⟨θ / .cta⟩

The bar.arrive instruction only marks the arrival of the thread at the barrier but does not cause any waiting due to thread block synchronization or an implicit memory barrier. At the thread level, the only difference to the bar.sync instruction is that the number of threads participating in the barrier is not optional and therefore must be specified.

(bararrive)  ⟨θ . κ | bar.arrive e e′, φ, %⟩  ─[ bar E⟦e⟧φ%κ E⟦e′⟧φ%κ ]→Θ  θ

bar.red is similar to bar.sync in that the thread has to wait for the barrier to complete before it is allowed to execute further instructions. An implicit thread block level memory fence is initiated as well. Additionally, the thread block performs a reduction operation once the barrier completes. The result of the reduction is stored in register r; the thread block generates corresponding memory actions once it has computed the result of the reduction. The expression e2 acts as the input of the reduction operation. The rule evaluates e2 and stores it as a Boolean value inside the communication action.

(barred)  ⟨θ . κ, tid, ro, lo | instr . bar.red br r e e1,ε e2, φ, %⟩  ─[ bar E⟦e⟧φ%κ Eε⟦e1,ε⟧φ%κ V⟦r⟧φκ B⟦e2⟧φ%κ tid ro lo regs(instr, φ) ]→Θ  ⟨θ / .cta⟩

The rules for atomic memory operations and non-atomic load and store operations generate a memory action containing the data specified by the instruction. At the device level, the action is added to the memory environment where a corresponding memory program is created to service the request.

(atom)  ⟨θ . κ, tid, ro, lo | instr . atom ss aop size r e1 e2 e3,ε, φ, %⟩  ─[ V⟦r⟧φ%κ ← E⟦e1⟧φ%κ ss aop size E⟦e2⟧φ%κ Eε⟦e3,ε⟧φ%κ tid ro lo regs(instr, φ) ]→Θ  θ

(st)  ⟨θ . κ, tid, ro, lo | instr . st volatileε ss copε size e r, φ, %⟩  ─[ E⟦e⟧φ%κ ss size E⟦r⟧φ%κ volatileε copε tid ro lo regs(instr, φ) ]→Θ  θ

(ld)  ⟨θ . κ, tid, ro, lo | instr . ld volatileε ss copε size r e, φ, %⟩  ─[ V⟦r⟧φκ ← E⟦e⟧φ%κ ss size volatileε copε tid ro lo regs(instr, φ) ]→Θ  θ

The mov instruction evaluates an expression and stores the resulting value in the specified destination register. The instruction can be used to load constant values into a register, to copy the value of a register to another one, to load the program address denoted by a label or function into a register, or to load the address pointed to by a non-register variable into a register.

(mov)  ⟨θ . κ, tid, ro, lo | instr . mov r e, φ, %⟩  ─[ V⟦r⟧φκ ← E⟦e⟧φ%κ tid ro lo regs(instr, φ) ]→Θ  θ

setp uses a comparison operator to compare two operands and stores the resulting value in the destination register. We use the abstract function comp to compute the resulting value in accordance with the default semantics of the comparison operators supported by PTX.

(setp)  ⟨θ . κ, tid, ro, lo | instr . setp comp r e1 e2, φ, %⟩  ─[ V⟦r⟧φκ ← comp(E⟦e1⟧φ%κ, E⟦e2⟧φ%κ) tid ro lo regs(instr, φ) ]→Θ  θ

5.3.3 Conditional Thread Execution

When a warp is scheduled for execution, all of its threads execute the instruction pointed to by the warp's program counter. However, a thread might not be allowed to execute an instruction because its guard evaluates to false. Additionally, the warp may decide to disable a thread because the active threads execute a branch that the disabled thread has not taken. To simplify the rules of the warp semantics presented in the next section, we always ask threads to execute the current instruction regardless of their guard or their activation state in the warp. It is the thread's own responsibility to check whether it should really execute the instruction or whether it should remain idle. To avoid cluttering up the rules of transition system →Θ with these checks, we define transition system →Θe that performs these checks once in a central place and uses the semantics of →Θ if the thread is allowed to execute the instruction. The warp passes its active mask down to the threads. The active mask maps each warp lane to an active state. The thread checks the warp's active mask to determine whether its warp lane is active or inactive and thus whether or not it has to idle.

ActiveState = {active, inactive}
α ∈ ActiveMask = WarpLane → ActiveState

Additionally, the warp hands a thread register lookup function down to the threads. To evaluate expressions, however, an expression register lookup function is required. The latter can be constructed from the former by invoking the thread register lookup function with

the thread's index and register offset. For reasons of brevity, we do not create the expression register lookup function in each rule of transition system →Θ but rather in the rules of transition system →Θe only.

% ∈ ThreadRegLookup = ThreadIdx × RegOffset → ExprRegLookup

Transition system →Θe does not change the thread actions generated by →Θ. Thread actions are sent unchanged to the warp transition system.
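The construction of the expression register lookup from the thread register lookup amounts to partially applying the thread's index and register offset. The following Python sketch illustrates this currying step; the names make_expr_lookup and thread_reg_lookup, as well as the toy register file, are illustrative assumptions and not part of the formalization.

def make_expr_lookup(thread_reg_lookup, tid, ro):
    """Partially apply the thread register lookup to (tid, ro)."""
    def expr_lookup(register):
        return thread_reg_lookup(tid, ro)(register)
    return expr_lookup

# Hypothetical usage: registers are stored per thread index and register offset.
def thread_reg_lookup(tid, ro):
    register_file = {(0, 0): {"r1": 42}}            # toy register file
    return lambda reg: register_file[(tid, ro)][reg]

lookup = make_expr_lookup(thread_reg_lookup, tid=0, ro=0)
assert lookup("r1") == 42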

Configurations:  ⟨θ | φ, pc, α, %⟩ ∈ Thread × ProgEnv × ProgAddr × ActiveMask × ThreadRegLookup,  θ ∈ Thread
Transitions:     ⟨θ | φ, pc, α, %⟩  ─[ tactε ]→Θe  θ′

If the thread is active according to its warp's active mask and the guard evaluates to true, the thread executes the current instruction as defined by the rules of →Θ. Otherwise, the thread does not do anything at all and remains idle.

            ⟨θ | φinstrEnv(pc), φ, %(tid, ro)⟩  ─[ tactε ]→Θ  θ′
(Exectt)  ──────────────────────────────────────────────────────
            ⟨θ . κ, ro, tid, wl | φ, pc, α, %⟩  ─[ tactε ]→Θe  θ′
          if α(wl) = active ∧ B⟦φguardEnv(pc)⟧φ%(tid, ro)κ

(Execff)  ⟨θ . κ, ro, tid, wl | φ, pc, α, %⟩  →Θe  θ
          if α(wl) ≠ active ∨ ¬B⟦φguardEnv(pc)⟧φ%(tid, ro)κ
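Read operationally, the two rules amount to a small dispatch step: a thread only executes the warp's current instruction if its lane is active and its guard holds. The following Python sketch is an informal illustration of that check, not the formal semantics itself; evaluate_guard and step_thread are placeholder names for the guard evaluation and the rules of →Θ.

def step_thread_checked(thread, instr, active_mask, lane, evaluate_guard, step_thread):
    """Execute `instr` only if the lane is active and the guard holds (Exec_tt);
    otherwise leave the thread unchanged and emit no action (Exec_ff)."""
    if active_mask[lane] == "active" and evaluate_guard(thread, instr):
        return step_thread(thread, instr)   # defers to the thread-level rules; returns (thread', action)
    return thread, None                     # idle: no state change, no action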

5.4 Warp Semantics

Warps execute several threads in lockstep. The formalization of the warp semantics assumes that all threads of the same warp are executed concurrently. In reality, however, this is not the case. On GPUs based on the Fermi architecture, each warp can only use 16 of the 32 CUDA cores to execute a floating point instruction, thus threads are split into two groups of 16 threads each and process the instruction over the course of two clock cycles. For instructions that use the streaming multiprocessor’s special function units, the warp is split into eight groups of four threads each, so eight clock cycles are needed to process the instruction. However, we do not represent this behavior in our semantics. While it is possible that an instruction scheduled at a later point in time finishes before an instruction scheduled beforehand, the later instruction cannot affect the outcome of the earlier one. For that to happen, the later instruction would have to write to a register that the earlier instruction reads; but the earlier instruction blocks its source registers, hence no other instruction could possibly write to those registers. It is also not possible that a later instruction writes to a memory location that an earlier instruction is currently issuing read requests for, because there is only one set of load-store units on the streaming multiprocessor, hence another load-store instruction cannot be scheduled before the earlier one finishes. Still, the memory

environment might choose to reorder incoming memory requests, but this is an unrelated problem. A warp can be in either the active or the completed state. When a thread block is issued to a streaming multiprocessor, the SM creates all warps with active state. The warp schedulers subsequently schedule all warps until all of a warp's threads have processed an exit instruction. When this happens, the warp changes to the completed state and its resources are eventually freed, allowing further thread blocks to be issued to the SM as soon as there are enough free resources.

state ∈ WarpState = {active, completed}

Aside from executing several threads concurrently, the warps are also responsible for merging the memory actions generated by the threads into one warp memory action. The memory subsystem uses this information to coalesce memory requests into fewer memory transactions on a per-warp basis. Additionally, the warp handles vote instructions and adds some further information to barrier actions that are later processed by the semantics of thread blocks. The warp handles all control flow and vote actions generated by the threads, so these are not passed further up the hierarchy.

wactmem ∈ MemActΩ ::= MemActΘ*
wactbar ∈ BarrierActΩ ::= bar BarrierType BarrierRedε BarrierActΘ*
wact ∈ ActΩ ::= MemActΩ | BarrierActΩ

We introduce the domain of warps, which stores the warp's state, the threads it manages, and the indices of the barriers it waits for. It is possible that some threads that execute a barrier instruction specify a different barrier index; this is undocumented, so it is just an assumption, as explained in section 5.5.1. If so, we do not allow the warp scheduler to schedule the warp again until all barriers have completed, i.e. the list of barrier indices stored in the warp is empty. Furthermore, a warp manages its execution state. We define the domain of execution states in the next section.

ω ∈ Warp = WarpState × ExecState × BarrierIdx* × Thread^WarpSize

In section 5.4.1, we formalize the branching algorithm used by a warp to handle divergent control flow of its threads. We formalize the handling of communication actions at the warp level in section 5.4.2 and define the rules making up the warp semantics in section 5.4.3.

5.4.1 Formalization of the Branching Algorithm

GPUs employ warps instead of individual thread execution for reasons of performance: Instruction fetching and decoding costs are amortized over 32 threads, memory requests of 32 threads might be coalesced into fewer ones for improved memory access efficiency, and the warp schedulers only have to consider warps, of which there are fewer than threads by a factor of 32. However, it does get more complicated once the control flow of individual threads of the same warp diverges, i.e. when a warp's threads execute data-dependent branching instructions, function calls, breaks, or returns. In fact, the branching algorithm is not equivalent to individual thread branching as found on x86 CPUs, as we show in chapter 6. While we can prove the equivalence for terminating programs, the branching algorithm

can negatively affect program behavior in the general case. In this section, however, we only focus on how the algorithm works. The CUDA specification does not contain many details about the algorithm, so we base this discussion on a patent filed by Nvidia [26]. As already said about the patents used to formalize the memory environment, we do not know whether the hardware actually works the way the patent depicts. However, our observations indicate that the patent indeed describes the hardware implementation: The algorithm described in the patent relies on instructions that are not available in PTX but are indeed injected into the compiled program by the PTX compiler. Furthermore, the depicted algorithm matches the behavior that can be observed in several test cases. We therefore assume that the algorithm outlined informally in patent [26] and formalized in this section accurately reflects the hardware implementation of Fermi-based GPUs. The branching algorithm relies on the following domains and functions. We give a formal definition and a brief overview of the motivation behind them. We then informally illustrate how the branching algorithm works and subsequently present a formalization. We repeat the definition of the active mask already shown in section 5.3.3 for reasons of completeness. The active mask stores the activation state of each thread. A thread whose warp lane is active according to the active mask participates in an execution step, whereas inactive threads do not. αinact denotes the active mask where all threads are inactive.

ActiveState = {active, inactive}
α ∈ ActiveMask = WarpLane → ActiveState

Additionally, the warp manages a disable mask that keeps track of whether and why a thread is disabled. The disable mask ensures that threads that executed a return or break statement are not reactivated before all threads of the warp complete the execution of the function call or the loop. Furthermore, it makes sure that threads that completed the execution of the program are not reactivated again. Without a disable mask, PTX programs could not use nested control flow instructions like branches inside a path of another branch or inside a function body.

DisableState = {enabled, break, return, exit}
δ ∈ DisableMask = WarpLane → DisableState

The warp uses a stack to keep track of the different code paths its threads need to execute. There are four different types of tokens that can be pushed onto the stack. Besides its type, a token also stores an active mask and a program counter that identify the threads that should be activated and the address of the instruction that should be executed once the token is popped off the stack. The token's type determines how the disable mask affects thread reactivation.

TokenType = {sync, diverge, call, break}
τ ∈ ExecToken = TokenType × ActiveMask × ProgAddr

We already used the domain of execution states to define the domain of warps. A warp's execution state consists of the warp's current program address, the warp's current active and disable masks as well as the token stack. It seems that the hardware has special memory dedicated to storing a token stack for each warp. In the theoretical worst case, however, the size of the token stack is unlimited. The hardware is capable of spilling the oldest entries

to global memory if the stack becomes too large, but eventually, it will run out of memory. Spilling parts of the stack to memory and loading it back in when needed happens behind the scenes and is completely transparent to CUDA programs. To avoid further complicating the formalization, we assume that the streaming multiprocessors can store token stacks of unlimited size. The stack allows a CUDA program to arbitrarily branch, limited only by the size of the available memory.

ς ∈ ExecState = ProgAddr × ActiveMask × DisableMask × ExecToken*

If the number of threads belonging to a thread block is not a multiple of the warp size, the last warp is underpopulated. What the hardware does in that case is unspecified, so we assume that dummy threads are allocated whose active mask is set to inactive and whose disable mask is set to exit. Consequently, these dummy threads will never execute a single instruction.
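As an informal illustration, the following Python sketch builds the initial execution state of a warp under the assumption just described: lanes that are not backed by a real thread start out inactive and permanently disabled with exit. The representation (plain dictionaries and tuples) is our own and not part of the formalization.

WARP_SIZE = 32  # assumed warp size

def initial_exec_state(num_real_threads, entry_pc=0):
    """Program counter, active mask, disable mask, and empty token stack for a
    warp that may be underpopulated (num_real_threads < WARP_SIZE)."""
    active_mask = {}
    disable_mask = {}
    for lane in range(WARP_SIZE):
        if lane < num_real_threads:
            active_mask[lane] = "active"
            disable_mask[lane] = "enabled"
        else:
            # dummy lane: it will never execute a single instruction
            active_mask[lane] = "inactive"
            disable_mask[lane] = "exit"
    token_stack = []  # list of (token_type, active_mask, prog_addr) triples
    return entry_pc, active_mask, disable_mask, token_stack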

Informal Overview

We informally illustrate how the branching algorithm works based on the following programs and figures. The programs presented in this section are not valid PTX programs: Register declarations and data types are omitted for reasons of brevity. Moreover, the control flow instructions normally injected by the PTX compiler are shown in the programs, as these are the main focus of attention.

 1 setp .eq p, a, b;
 2 ssy 10;
 3 @p bra DIVERGE;
 4 sub c, a, b;
 5 bra SYNC;
 6 DIVERGE:
 7 mul c, a, b;
 8 SYNC:
 9 sync;
10 div r, c, d;

Listing 5.5: A PTX program with a conditional branch statement

Program 5.5 shows a PTX program with a conditional branch statement in line 3. For the sake of this discussion, we assume the program is executed by two threads; the first one takes the branch and the second one does not. We explain how program 5.5 is executed step by step. Figure 5.2 shows the warp’s execution state for each step. The program counter is the address of the instruction that is executed during a step; the address of an instruction is equal to the line number shown in listing 5.5. The active mask shows which threads are active after the completion of a step. Because of space constraints, we use a bit notation for the active mask. For instance, an active mask value of 01 means that thread 1 is inactive whereas thread 2 is active. The stack column lists the tokens on the stack after the execution of a step. Initially, the program counter is 1, both threads are active according to the active mask and enabled according to the disable mask, and the stack is empty. The disable mask is not shown because both threads remain enabled during the execution of the program.

pc   α    ~τ
 1   11   ε
 2   11   (sync, 11, 10)
 3   10   (diverge, 01, 4) :: (sync, 11, 10)
 7   10   (diverge, 01, 4) :: (sync, 11, 10)
 9   01   (sync, 11, 10)
 4   01   (sync, 11, 10)
 5   01   (sync, 11, 10)
 9   11   ε
10   11   ε

Figure 5.2: Execution state ς . pc, α,~τ during the execution of program 5.5

In line 1, the threads execute the setp instruction. As it is not a control flow instruction, neither the active mask nor the execution stack changes. We assume that predicate register p is true for thread 1 and false for thread 2. The warp increments the program counter to 2 and the threads execute the ssy instruction. The ssy instruction causes the warp to push a sync token onto the stack. Subsequently, the control flow will diverge and both paths are executed with only one of the threads being active. However, the division at line 10 could be executed by both threads in lockstep again. Of course, this is not necessary for correctness, but highly desirable for reasons of performance. The sync instruction in line 9 will resynchronize both threads such that the division is processed by both threads again. The resynchronization works by popping the sync token off the stack that the ssy instruction pushed onto the stack. The token is of type sync, indicating that a previous level of thread synchronization should be reestablished without making any changes to the disable mask. The token contains the current active mask. Once the token is popped off the stack, the warp’s active mask — which at that point only has one of the threads active — is replaced by the token’s one, thus both threads are active again. Additionally, the warp’s program counter is set to the program counter stored in the token, which is the address of the division instruction in line 10. Hence, after diverging on two different paths that are serially executed, both threads execute all instructions in lockstep again beginning with line 10. Program execution then continues at line 3. Since the predicate is true only for thread 1, thread 1 executes the branch instruction and jumps to line 7, whereas thread 2 does not take the branch and executes line 4 next. We call the path executed by thread 1 the taken path and the path executed by thread 2 the not-taken path. Now that there are two different paths to execute, the warp has to serialize execution. By convention, the GPU always executes the taken path first. Consequently, thread 2 is set to inactive because it should not execute the multiplication in line 7. Additionally, the warp pushes a diverge token onto the stack. The purpose of a diverge token is to keep track of other paths that need to be executed later on. In this case, thread 2 must process the subtraction in line 4. Consequently, the token’s program counter is set to 4 and the active mask stored in the token activates thread 2 and deactivates thread 1. So when the diverge token is popped off the stack, thread 2 will execute the not-taken path. The warp now executes the taken path. Thus, thread 1 performs the multiplication in line 7. As the multiplication is not a control flow instruction, both the active mask and the stack remain unchanged. The warp increments the program counter to the next address and thread

1 encounters the sync instruction. sync causes the warp to pop a token off the stack. The diverge token sets the warp's program counter to address 4 and the active mask to the one stored in the token. Here this results in thread 1 being deactivated and thread 2 executing the not-taken path of the conditional branch in line 3. Thread 2 begins the execution of the not-taken path by processing the subtraction in line 4. Again, this is not a control flow instruction so the active mask and the stack are not changed. The warp increments the program counter and thread 2 encounters the branch instruction in line 5. The branch instruction causes thread 2 to skip the multiplication in line 7, because the multiplication may only be performed by threads taking the taken path. The branch instruction causes no further divergence in this case because all active threads executing the current path, here only thread 2, take the branch. Uniform branches do not modify the active mask or the stack. Thus, program execution continues at line 9, where the sync instruction is encountered again. As before, the sync instruction forces the warp to pop a token off the stack. Now the sync token lies on top of the stack, thus when it is popped it causes both threads to be active again and sets the program counter to line 10. Next, both threads execute the division instruction at line 10 concurrently.
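The push/pop discipline of this walkthrough can be mimicked with a few lines of Python. The sketch below only replays the ssy, divergent bra, and sync steps of listing 5.5 for the two threads; it is an informal illustration of the token stack, not the formalization given later in this section, and the tuple representation of tokens and masks is our own.

# Masks are tuples of booleans, one entry per thread (True = active).
stack = []

pc, active = 2, (True, True)
stack.append(("sync", active, 10))        # ssy 10: remember the reconvergence point

# @p bra DIVERGE at line 3: p is true for thread 1 only, so control flow diverges.
taken, not_taken = (True, False), (False, True)
stack.append(("diverge", not_taken, 4))   # the not-taken path resumes at line 4
pc, active = 7, taken                     # execute the taken path first

# sync at line 9 (taken path): pop the diverge token, run the not-taken path.
kind, active, pc = stack.pop()            # -> ("diverge", (False, True), 4)

# sync at line 9 (not-taken path): pop the sync token, reconverge at line 10.
kind, active, pc = stack.pop()            # -> ("sync", (True, True), 10)
assert (pc, active) == (10, (True, True))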

 1 ssy 13;
 2 bra r;
 3 PATH-1:
 4 mul c, a, b;
 5 bra SYNC;
 6 PATH-2:
 7 add c, a, b;
 8 bra SYNC;
 9 PATH-3:
10 sub c, a, b;
11 SYNC:
12 sync;
13 div r, c, d;

Listing 5.6: A PTX program with an indirect branch statement

Program 5.6 and figure 5.3 demonstrate how the branching algorithm handles indirect branches. We assume that the warp size is four and that one thread executes the path denoted by the label PATH-1, two threads execute PATH-2, and the remaining thread executes PATH-3. In figure 5.3, we again list the current program counter as well as the active mask and the stack after the execution of the instruction pointed to by the program counter. As the disable mask does not change in this example, figure 5.3 omits it. Again, we use the bit notation for the active mask, so 0110 means that threads 1 and 4 are disabled, whereas threads 2 and 3 are active. Initially, all threads are active according to the active mask and enabled according to the disable mask, the program counter is 1, and the stack is empty. At first, the ssy instruction is encountered at line 1. As before, it pushes a sync token onto the stack that is used to resynchronize all threads after all paths of the indirect branch instruction in line 2 are completed. The program counter is incremented to 2 and all four threads execute the bra instruction. In this case, the branch instruction has no guard, but it is indeed possible to use guarded indirect branches. There are three possible paths in this case. For indirect branches with more than one distinct target address, the patent does not say which address is executed first. Hence, we randomly choose PATH-2. Let's assume that threads 2 and 4 execute PATH-2, so threads 1 and 3 are disabled in the new active mask.

pc   α      ~τ
 1   1111   (sync, 1111, 13)
 2   0101   (diverge, 1010, 2) :: (sync, 1111, 13)
 7   0101   (diverge, 1010, 2) :: (sync, 1111, 13)
 8   0101   (diverge, 1010, 2) :: (sync, 1111, 13)
12   1010   (sync, 1111, 13)
 2   1000   (diverge, 0010, 2) :: (sync, 1111, 13)
10   1000   (diverge, 0010, 2) :: (sync, 1111, 13)
12   0010   (sync, 1111, 13)
 2   0010   (sync, 1111, 13)
 4   0010   (sync, 1111, 13)
 5   0010   (sync, 1111, 13)
12   1111   ε
13   1111   ε

Figure 5.3: Execution state ς . pc, α,~τ during the execution of program 5.6

If more than one thread branches to the same target address, all those threads execute their path concurrently. Additionally, the warp pushes a diverge token onto the stack. The token's active mask is the not-taken mask, i.e. all threads that later have to execute another path. The token's program address is the address of the branch instruction, meaning that the branch instruction will be executed again once the token is popped off the stack. But since threads 2 and 4 are not active in the token's active mask, it is guaranteed that the branch instruction will execute a different path next time. Threads 2 and 4 perform the addition in line 7 and continue to execute the uniform branch in line 8. The threads skip the subtraction of PATH-3 and process the sync statement in line 12. As before, the sync instruction pops a token off the stack. As the diverge token is the topmost element of the stack, threads 1 and 3 are activated, threads 2 and 4 are deactivated, and the warp's program counter is set to 2 again. This time, however, threads 2 and 4 are not active, hence the branch definitely takes another path. Again, it is undefined which one. We assume thread 1 is now allowed to execute PATH-3, so thread 1 remains active in the warp's active mask and thread 3 is the only thread to remain active in the token's active mask. The token's program counter is set to the indirect branch instruction's address again, as there is still a path left that needs to be executed. Thread 1 executes the subtraction in line 10 and subsequently encounters the sync instruction in line 12. The diverge token is popped off the stack, consequently only thread 3 is active and executes the bra instruction in line 2 again. This time, thread 3 is allowed to execute PATH-1. As the branch is now no longer divergent, no diverge token is pushed onto the stack. Thread 3 performs the multiplication in line 4 and skips over the two other paths because of the uniform branch in line 5. For the last time, the sync instruction at line 12 is processed. The sync token is popped off the stack, causing all threads to process the division in line 13 concurrently. Program 5.7 is intended to demonstrate the necessity of the disable mask. Except for lines 7 and 11, program 5.7 is identical to program 5.5. Here, one of the paths might cause a thread to return from the currently executed function.

 1 setp .eq p, a, b;
 2 ssy 10;
 3 @p bra DIVERGE;
 4 sub c, a, b;
 5 bra SYNC;
 6 DIVERGE:
 7 @p2 ret;
 8 SYNC:
 9 sync;
10 div r, c, d;
11 ret;

Listing 5.7: A PTX program with nested control flow instructions

Consequently, threads that execute the taken path should only perform the division in line 10 if the predicate p2 is false. Otherwise, they execute the return statement in line 7 and should no longer execute any instructions until all threads of the warp have completed execution of the function. Figure 5.4 illustrates the changes made to the warp's execution state as program 5.7 is processed. This time, we also show the disable mask in a shortened notation. A value of 0r0 means that thread 2 is disabled waiting for the other threads of the warp to return from a function, whereas threads 1 and 3 are enabled. Furthermore, we execute the program with three threads. Threads 2 and 3 take the taken path where only thread 2 executes the return, and thread 1 takes the not-taken path. Initially, the warp's program counter is set to 1, all threads are active and enabled, and there is a call token already on the stack. The preRet instruction, which is not shown in program 5.7, must be used to push a call token onto the stack before a function call is made. Once all threads return from the function, the call token is popped off the stack and execution continues at the instruction following the function call, denoted by x in this example.

pc   α     δ     ~τ
 1   111   000   (call, 111, x)
 2   111   000   (sync, 111, 10) :: (call, 111, x)
 3   011   000   (diverge, 100, 4) :: (sync, 111, 10) :: (call, 111, x)
 7   001   0r0   (diverge, 100, 4) :: (sync, 111, 10) :: (call, 111, x)
 9   100   0r0   (sync, 111, 10) :: (call, 111, x)
 4   100   0r0   (sync, 111, 10) :: (call, 111, x)
 5   100   0r0   (sync, 111, 10) :: (call, 111, x)
 9   101   0r0   (call, 111, x)
10   101   0r0   (call, 111, x)
11   111   000   ε
 x   111   000   ε

Figure 5.4: Execution state ς . pc, α, δ,~τ during the execution of program 5.7

Up to the point that thread 2 executes the ret statement in line 7, there are no important differences to point out compared to program 5.5. In line 7, threads 2 and 3 are active but only thread 2 returns from the function whereas thread 3 skips the return and its active and disable mask remain unaffected. Consequently, thread 2 becomes inactive according to the active mask and its disable mask is set to return. Yet, the active mask of the sync token still

mentions thread 2 as being active even though thread 2 must not be reactivated once the sync token is popped off the stack. The disable mask for thread 2 guarantees that only a call token reactivates the thread again, regardless of the sync token normally doing so. An alternative to using a disable mask would be to walk up the stack and remove all disabled threads from the tokens' active masks up to some token that indeed reactivates the thread again. But as parts of the stack might be spilled to global memory, modifying the stack is potentially a slow operation. The disable mask avoids this performance issue. Next, only thread 3 executes the sync statement. The divergence token is popped off the stack and thread 1 executes the not-taken path. When thread 1 encounters the sync statement in line 9 again, the sync token is popped off the stack. Without the disable mask, popping the sync token would reactivate thread 2 and it would execute the division in line 10 even though it should not. Thus, the warp removes all threads from the token's active mask that have been disabled in the meantime. That way it is guaranteed that only threads 1 and 3 execute the division together. Subsequently, they encounter the unguarded return statement in line 11. Since all active threads execute the return, the warp pops a token off the stack. In this example, the call token is popped. All threads are active in the token's active mask and because all threads returned from the function, thread 2 is reactivated and its disable mask is set to enabled again. All three threads continue to execute the instruction at program address x concurrently. The token that is popped off the stack because of the return statement could also be a diverge, sync, or break token instead of a call token. This happens when all threads executing the same path of a branch or loop return from the function. In that case, the remaining paths are executed, if there are any. It is also possible that popping a token off the stack does not activate any threads. Suppose both paths of a branch have all threads execute a return instruction. When the threads of the not-taken path execute the return, there still is a sync token on the stack. However, the token does not activate any threads because all of them are disabled waiting for a function return. Hence, the warp immediately pops another token off the stack. If that token is a call token, it reactivates the threads and program execution continues. Otherwise, additional tokens are popped off the stack until the warp encounters a token that reactivates at least one thread. If all threads have executed an exit instruction, there are no threads that will be activated again and the warp has completed the execution of the program.

Formal Semantics of the Control Flow Instructions

We formalize the branching algorithm by defining a formal semantics for the control flow instructions. At the thread level, control flow instructions have no meaning (except for call and ret, where the threads modify their call stack), so we define their semantics at the warp level. We highlight any differences between our formalization and the informal specification in patent [26] and explain why it is necessary to deviate from the original semantics. First, we define some helper functions that are used by the formalization of the branching algorithm. The uni function decides whether all active threads processed an instruction. When threads execute a control flow instruction, they often pass their warp lane inside the control flow action to the warp. The uni function checks whether all active threads passed their warp lane to the warp, i.e. whether there are any threads that did not execute the instruction because their guard evaluated to false. For instance, a direct branch is uniform if and only if all active threads execute the bra instruction, hence if the instruction has no

guard or the guard evaluates to true for all active threads.

uni : ActiveMask × WarpLane* → B
uni(α, wl1 ... wln)  ⇔  ∀wl ∈ WarpLane . α(wl) = active ⇒ wl ∈ wl1 ... wln

Similarly, we define an overloaded version of the uni function that also takes the target addresses into account. Branches and function calls can be divergent because of guards and because of different target addresses. In general, a branch is uniform if and only if all active threads executed the branch instruction and all of them specified the same target address.

uni : ActiveMask × WarpLane* × ProgAddr* → B
uni(α, wl1 ... wln, pc1 ... pcn)  ⇔  uni(α, wl1 ... wln) ∧ ∀i, j ∈ {1 ... n} . pci = pcj

The taken function computes the taken mask of a control flow instruction. All threads are inactive except for those that executed the instruction and intend to jump to the target address chosen by the warp. For instance, a thread is active in the taken mask of a direct branch if and only if it is active in the warp's current active mask and its guard evaluated to true. For indirect branches, a thread is active in the taken mask for the same reasons as for direct branches, plus if its target address matches the address chosen by the warp.

taken : (WarpLane × ProgAddr)* × ProgAddr → ActiveMask
taken(ε, pc) = αinact
taken(wl pc ∘ −→wp, pc′) = taken(−→wp, pc′)[wl ↦ active]   if pc = pc′
                        = taken(−→wp, pc′)                 otherwise

On the other hand, the notTaken function computes the not-taken mask of a control flow instruction. Basically, the not-taken mask is the opposite of the taken mask except for threads that are already disabled. So we compute the not-taken mask by deactivating all threads in the current active mask that executed the control flow statement and that specified the target address chosen by the warp.

notTaken : ActiveMask × (WarpLane × ProgAddr)* × ProgAddr → ActiveMask
notTaken(α, ε, pc) = α
notTaken(α, wl pc ∘ −→wp, pc′) = notTaken(α, −→wp, pc′)[wl ↦ inactive]   if pc = pc′
                              = notTaken(α, −→wp, pc′)                   otherwise
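Operationally, uni, taken, and notTaken are simple checks and folds over the (lane, target) pairs the threads reported. The following Python sketch mirrors them; dictionaries stand in for masks, the parameter names are ours, and the single uni function covers both overloads by making the target list optional.

def uni(active_mask, lanes, targets=None):
    """True iff every active lane occurs in `lanes` and, if targets are given,
    all reported target addresses coincide."""
    all_active_reported = all(
        lane in lanes
        for lane, state in active_mask.items()
        if state == "active"
    )
    if targets is None:
        return all_active_reported
    return all_active_reported and len(set(targets)) <= 1

def taken(lane_targets, chosen_pc, warp_size=32):
    """Activate exactly the lanes that reported the chosen target address."""
    mask = {lane: "inactive" for lane in range(warp_size)}   # alpha_inact
    for lane, pc in lane_targets:
        if pc == chosen_pc:
            mask[lane] = "active"
    return mask

def not_taken(active_mask, lane_targets, chosen_pc):
    """Deactivate, in a copy of the current mask, every lane that takes the chosen target."""
    mask = dict(active_mask)
    for lane, pc in lane_targets:
        if pc == chosen_pc:
            mask[lane] = "inactive"
    return mask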

We define the operator · \ · to deactivate all active threads in an active mask that are disabled according to the disable mask.

· \ · : ActiveMask × DisableMask → ActiveMask
α′ = α \ δ  ⇔  ∀wl ∈ WarpLane . α′(wl) = inactive   if δ(wl) ≠ enabled
                                α′(wl) = α(wl)      otherwise

The enable function enables a thread in a disable mask if it is active in an active mask and the thread is disabled for the specified reason. It is imperative that a thread is not enabled as long as it is deactivated according to the active mask. Suppose a thread executed a return

instruction, hence its disable mask is set to wait for all threads to return from the function. The other threads, however, call another function, hence an additional call token is pushed onto the stack. The thread is disabled because of the return, hence it is inactive in the call token's active mask. Once the call token is popped off the stack, the thread is not reactivated because it is inactive according to the token's active mask; it has to wait for the next call token to be popped off the stack before it is allowed to execute again. Without checking the token's active mask, the thread would have executed instructions of the same function it previously returned from.

enable : ActiveMask × DisableMask × DisableState → DisableMask
δ′ = enable(α, δ, disableState)  ⇔  ∀wl ∈ WarpLane . δ′(wl) = enabled   if α(wl) = active ∧ δ(wl) = disableState
                                                     δ′(wl) = δ(wl)     otherwise

updExecState pops the topmost token off the stack. The warp's program counter is set to the address specified by the token. For call and break tokens, the disable mask is modified: All threads that waited for this token to be popped off the stack are reactivated. The warp's active mask is set to the token's active mask sans the threads that have been disabled in the meantime.

updExecState : ExecState → ExecState⊥
updExecState(ς . δ, (sync, α, pc) :: ~τ)    = ς / pc, α \ δ, ~τ
updExecState(ς . δ, (diverge, α, pc) :: ~τ) = ς / pc, α \ δ, ~τ
updExecState(ς . δ, (call, α, pc) :: ~τ)    = ς / pc, α \ δ′, δ′, ~τ   if δ′ = enable(α, δ, return)

updExecState(ς . δ, (break, α, pc) :: ~τ)   = ς / pc, α \ δ′, δ′, ~τ   if δ′ = enable(α, δ, break)
updExecState(ς)                             = ⊥                        otherwise

The function unwind unwinds the token stack until either at least one thread is active or the stack is empty. Unwinding guarantees that the warp always makes progress when it is scheduled. We also check for inconsistent token stack states: If the stack is empty but there is at least one thread waiting for a function return or the end of a loop, something has obviously gone wrong. The patent only lists threads waiting for a break as an error case. But as we assume all threads execute an exit as their last instruction, threads waiting for a return are also errors in our formalization. These error cases show that the branching algorithm only works correctly if the compiler injects the right control flow instructions at the right locations. The dependency on the behavior of the compiler also plays an important role in chapter 6, where we prove the correctness of the branching algorithm for terminating programs.

unwind : ExecState → (WarpState × ExecState)⊥
unwind(ς . pc, α, δ, ε) = (completed, ς)   if ∀wl ∈ WarpLane . δ(wl) = exit
                        = ⊥                otherwise
unwind(ς . τ :: ~τ)     = (active, ς′)     if ς′ . α = updExecState(ς) ∧ ∃wl ∈ WarpLane . α(wl) = active
                        = unwind(ς / ~τ)   otherwise
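The interplay of updExecState and unwind can be pictured as ordinary stack code. The sketch below reuses the dictionary representation of the earlier sketches, collapses the execution state into a tuple, and is only meant to convey the control flow of the two functions, not to replace the definitions above.

def upd_exec_state(state):
    """Pop the topmost token; reactivate waiting threads for call/break tokens."""
    pc, active, disable, stack = state
    kind, token_active, token_pc = stack[-1]
    rest = stack[:-1]
    if kind in ("call", "break"):
        waiting = "return" if kind == "call" else "break"
        disable = {wl: "enabled" if token_active[wl] == "active" and d == waiting else d
                   for wl, d in disable.items()}
    # the token's active mask minus the threads disabled in the meantime
    new_active = {wl: "inactive" if disable[wl] != "enabled" else token_active[wl]
                  for wl in token_active}
    return token_pc, new_active, disable, rest

def unwind(state):
    """Pop tokens until some thread becomes active; report completion on an empty stack."""
    pc, active, disable, stack = state
    if not stack:
        if all(d == "exit" for d in disable.values()):
            return "completed", state
        raise RuntimeError("inconsistent token stack")   # corresponds to the bottom element
    new_state = upd_exec_state(state)
    if any(a == "active" for a in new_state[1].values()):
        return "active", new_state
    return unwind(new_state)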

We use transition system →CRS to define the semantics of control flow instructions; the term CRS is established by the patent and stands for call, return, synchronization [26, 0059]. Actually, the transition system is just a function that maps an execution state and a set of control flow actions to a new execution state and a warp state. Nevertheless, we use a transition system instead of a function because it makes the case distinctions more obvious and it increases readability that would otherwise suffer because of space constraints.

Configurations:  ⟨state, ς⟩ ∈ WarpState × ExecState,  ς ∈ ExecState
Transitions:     ς  ─[ −→tactflow ]→CRS  ⟨state, ς⟩

A ssy instruction does not affect the warp's current active and disable masks, but it pushes a sync token onto the stack. Execution continues with the instruction following the ssy. The token's active mask is the warp's current one and its program counter is the one specified by the instruction. All threads must specify the same program counter; different program counters make no sense and result in the warp semantics getting stuck. The program counter of a ssy instruction must be the address of the instruction following the corresponding sync instruction. This is different from the patent, where there are no sync instructions but rather sync flags that can be set on all instructions. In the patent, the program address of a ssy instruction is the address of the instruction with the corresponding sync flag set. There are two reasons why introducing an additional instruction for synchronization is superior to a flag: First, it allows →CRS to handle all control flow actions, including sync, uniformly. Otherwise, we would have to check the sync flag separately in each rule of the warp semantics defined in section 5.4.3. More importantly though, there is a problem with sync flags that the patent provides no workaround for. Consider the case where all threads executing the not-taken path of a conditional branch uniformly execute a return. Since no threads are active, the warp pops a token off the stack. Being inside the not-taken path, the topmost token that is popped is a sync token. According to the patent, the sync token contains the address of the instruction that should actually have been used to pop the token off the stack. Thus, the threads execute the instruction with the sync flag, provided not all threads of the taken path are disabled. The sync flag causes another token to be popped off the stack even though it should not; if the token is a call token, execution of the function is aborted before all threads executed all instructions they should execute. Interestingly, this is not a problem on the actual hardware. Test cases indicate that this situation is handled correctly. There are two possible explanations: Either the patent is too inaccurate, or the hardware somehow knows that the sync token was already popped off the stack and ignores the subsequent sync flag. The patent gives a hint that the GPU might indeed ignore sync flags after a sync token was popped, but the statement found in the patent is open for interpretation [26, 0108]. We avoid the issue altogether by introducing a dedicated sync instruction.

(ssy)  ⟨ς . pc, α, ~τ⟩  ─[ ssy pc1 ... ssy pcn ]→CRS  ⟨active, ς / pc + 1, (sync, α, pc1) :: ~τ⟩
       if ∀i, j ∈ {1 ... n} . pci = pcj

The preBrk and preRet instructions push a break or call token onto the stack, respectively. The token's active mask is set to the warp's current one and the program counter is set to the one provided by the instruction. As for ssy, all threads must agree on the program address. The active and disable masks remain unchanged.

(preBrk)  ⟨ς . pc, α, ~τ⟩  ─[ preBrk pc1 ... preBrk pcn ]→CRS  ⟨active, ς / pc + 1, (break, α, pc1) :: ~τ⟩
          if ∀i, j ∈ {1 ... n} . pci = pcj

(preRet)  ⟨ς . pc, α, ~τ⟩  ─[ preRet pc1 ... preRet pcn ]→CRS  ⟨active, ς / pc + 1, (call, α, pc1) :: ~τ⟩
          if ∀i, j ∈ {1 ... n} . pci = pcj

The sync instruction unwinds the token stack. If the compiler injected the control flow instructions correctly into the program, this either pops a divergence or a synchronization token from the stack. The patent uses a sync flag instead of a dedicated instruction. For the reasons outlined above, we prefer a dedicated instruction.

(sync)  ς  ─[ sync ... sync ]→CRS  unwind(ς)

Breaks, returns, or exits are either uniform or divergent. In the divergent case, the threads that executed the instruction are disabled and their disable mask is set accordingly to indicate that the threads terminated or are waiting for a function call or a loop to complete. The remaining threads execute the instruction following the break, return, or exit.

(brkdiv)  ⟨ς . pc, α, δ⟩  ─[ break wl1 ... break wln ]→CRS  ⟨active, ς / pc + 1, α[wl1 ... wln ↦ inactive], δ[wl1 ... wln ↦ break]⟩
          if ¬uni(α, wl1 ... wln)

(retdiv)  ⟨ς . pc, α, δ⟩  ─[ return wl1 ... return wln ]→CRS  ⟨active, ς / pc + 1, α[wl1 ... wln ↦ inactive], δ[wl1 ... wln ↦ return]⟩
          if ¬uni(α, wl1 ... wln)

(exitdiv)  ⟨ς . pc, α, δ⟩  ─[ exit wl1 ... exit wln ]→CRS  ⟨active, ς / pc + 1, α[wl1 ... wln ↦ inactive], δ[wl1 ... wln ↦ exit]⟩
           if ¬uni(α, wl1 ... wln)

In the uniform case, there are no active threads left, hence the warp unwinds the stack. In contrast to the patent, however, we update the active mask and disable mask for all deactivated threads. Setting the active mask is merely convenient for the proof later on, since the active mask is overwritten anyway during the stack unwinding process; not updating the disable mask, however, is a mistake. For instance, consider the case that all threads executing the taken path of a conditional branch instruction uniformly execute a return. Because no active threads are left to execute, the diverge token is popped off the stack and some other threads execute the not-taken path. Once they are done, the sync token is popped off the stack. However, as we did not update the disable mask of the threads that executed the return, they are reactivated by the sync token and execute instructions they should not execute. This clearly is a mistake in the patent; test programs, however, indicate that the hardware implementation is correct.

(brkuni)  ⟨ς . α, δ⟩  ─[ break wl1 ... break wln ]→CRS  unwind(ς / α[wl1 ... wln ↦ inactive], δ[wl1 ... wln ↦ break])
          if uni(α, wl1 ... wln)

(retuni)  ⟨ς . α, δ⟩  ─[ return wl1 ... return wln ]→CRS  unwind(ς / α[wl1 ... wln ↦ inactive], δ[wl1 ... wln ↦ return])
          if uni(α, wl1 ... wln)

(exituni)  ⟨ς . α, δ⟩  ─[ exit wl1 ... exit wln ]→CRS  unwind(ς / α[wl1 ... wln ↦ inactive], δ[wl1 ... wln ↦ exit])
           if uni(α, wl1 ... wln)

Uniform branches or function calls do not change the active or disable masks and also do not push any tokens onto the stack. Only the warp's program counter is set to the target address of the branch or function call.

(brauni)  ⟨ς . pc, α⟩  ─[ bra wl1 pc1 ... bra wln pcn ]→CRS  ⟨active, ς / pc1⟩
          if uni(α, wl1 ... wln, pc1 ... pcn)

(calluni)  ⟨ς . pc, α⟩  ─[ call wl1 pc1 ... call wln pcn ]→CRS  ⟨active, ς / pc1⟩
           if uni(α, wl1 ... wln, pc1 ... pcn)

The (braguard) rule is used for branch instructions that are divergent because of some guards evaluating to false but all specified target addresses are the same. The warp computes the taken mask and the not-taken mask and pushes a divergence token onto the stack. The token consists of the incremented program counter, i.e. the first address of the not-taken path, and the not-taken mask. The warp continues execution at the target address of the branch, activating only the threads that are active according to the taken mask.

(braguard)  ⟨ς . pc, α, ~τ⟩  ─[ bra wl1 pc1 ... bra wln pcn ]→CRS
            ⟨active, ς / pc1, taken(wl1 pc1 ... wln pcn, pc1), (diverge, notTaken(α, wl1 pc1 ... wln pcn, pc1), pc + 1) :: ~τ⟩
            if ¬uni(α, wl1 ... wln) ∧ ∀i, j ∈ {1 ... n} . pci = pcj

Similarly, the (callguard) rule is used for function calls that are divergent because of some guards evaluating to false but all specified target addresses are the same. The warp computes the taken mask and lets the threads that are active according to the taken mask execute the function’s instructions. Conditional function calls do not have a not-taken path as threads not participating in the call simply do not execute the function’s instructions. Hence, no divergence token is pushed onto the stack in the case of a conditional call. In contrast to the patent, we require a call token be pushed onto the stack prior to all divergent and uniform function calls using the preRet instruction; the patent has the call instruction push a call token onto the stack itself. The call token ensures that execution resumes at the instruction following the call once all threads return from the function. We have to deviate from the patent because the patent does not allow conditional function calls, but our formalization does. If we did not deviate from the patent, the stack would become corrupted when a program uses indirect function calls, as it would be possible that there are two call tokens on the stack for the same function call. We can avoid this issue by either not having the call instruction push a call token onto the stack, or by not requiring that there is a preRet instruction before any indirect function call. We choose the former approach because the

latter would make the preRet instruction redundant, which would detach this formalization from the patent even further. In any case, it is imperative that the preRet instruction before a function call has the same guard as the function call itself; otherwise, a call token might be pushed onto a stack but the function it was intended for is never called, resulting in a corrupted stack.

(callguard)  ⟨ς . pc, α⟩  ─[ call wl1 pc1 ... call wln pcn ]→CRS  ⟨active, ς / pc1, taken(wl1 pc1 ... wln pcn, pc1)⟩
             if ¬uni(α, wl1 ... wln) ∧ ∀i, j ∈ {1 ... n} . pci = pcj

Indirect branches work in a similar way to conditional branches. We randomly choose one of the target addresses (which one the hardware chooses is undefined) and use this address as the beginning of the taken path. All threads wishing to execute the taken path remain active, whereas the rest are deactivated. The diverge token pushed onto the stack comprises the not-taken mask and the address of the indirect branch instruction. Once the token is popped off the stack, the other threads get a chance to execute their designated path. Since the not-taken mask stored in the token and the taken mask used by the warp do not have any active threads in common, it is guaranteed that some other threads are allowed to continue executing the next time the indirect branch instruction is encountered. Indirect branches can also use guards to prevent some threads from jumping to another address; this is not handled by the (bratarget) rule. Eventually, the warp will return to the indirect branch instruction and only one target address will be left to execute. In that case, the (braguard) rule applies if there are any guards that evaluate to false. Otherwise, the branch is uniform and rule (brauni) is used to handle that case. Indirect function calls are handled in an analogous manner.

(bratarget)  ⟨ς . pc, α, ~τ⟩  ─[ bra wl1 pc1 ... bra wln pcn ]→CRS
             ⟨active, ς / pck, taken(wl1 pc1 ... wln pcn, pck), (diverge, notTaken(α, wl1 pc1 ... wln pcn, pck), pc) :: ~τ⟩
             if ∃i, j ∈ {1 ... n} . pci ≠ pcj ∧ k ∈ {1 ... n}

(calltarget)  ⟨ς . pc, α, ~τ⟩  ─[ call wl1 pc1 ... call wln pcn ]→CRS
              ⟨active, ς / pck, taken(wl1 pc1 ... wln pcn, pck), (diverge, notTaken(α, wl1 pc1 ... wln pcn, pck), pc) :: ~τ⟩
              if ∃i, j ∈ {1 ... n} . pci ≠ pcj ∧ k ∈ {1 ... n}

If the threads generate no control flow actions, either the guards of all active threads evaluated to false or all threads executed an instruction that does not generate any control flow action. In that case, the program counter is incremented while the rest of the execution state remains untouched.

(next)  ⟨ς . pc⟩  ─[ ε ]→CRS  ⟨active, ς / pc + 1⟩

5.4.2 Formalization of Votes and Barriers

Threads of the same warp can choose between four different vote modes to perform a reduction operation on the specified source predicates. The warp makes the result of the vote accessible to all participating threads. Only active threads participate in a vote; this is

not a problem for the .all, .any, and .uni vote modes, but the value written to a thread's bit in the resulting data word is undefined for the .ballot mode. We assume that inactive threads are treated as if they voted false in .ballot mode. The vote function computes the result of the vote based on the vote actions generated by the threads. The operation d[n ↦bit 1] changes the value of the n-th bit of a data word d to 1. The .all and .any vote modes are simply conjunctions or disjunctions over all votes cast, respectively. The .uni vote mode returns true if and only if the predicates of all threads participating in the vote have the same value. The .ballot vote mode returns a data word which contains a copy of all source predicates. The n-th bit corresponds to the predicate value of the thread at warp lane n.

vote : VoteMode × VoteAct* → DataWord
vote(.all, −→tactvote)  = ⋀ { tp | t ∈ −→tactvote }
vote(.any, −→tactvote)  = ⋁ { tp | t ∈ −→tactvote }
vote(.uni, −→tactvote)  = ∀t1, t2 ∈ −→tactvote . t1,p = t2,p
vote(.ballot, ε)        = 0
vote(.ballot, tactvote ∘ −→tactvote) = vote(.ballot, −→tactvote)[tactvote,wl ↦bit 1]   if tactvote,p
                                    = vote(.ballot, −→tactvote)                       otherwise
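The four vote modes reduce to a conjunction, a disjunction, an equality test, and a bit-packing step. The compact Python illustration below takes the votes as (warp lane, predicate) pairs, a representation chosen here for clarity; inactive threads are simply absent from the list, which matches the assumption that they vote false in .ballot mode.

def vote(mode, votes):
    """votes: list of (warp_lane, predicate) pairs of the participating threads."""
    predicates = [p for _, p in votes]
    if mode == ".all":
        return all(predicates)
    if mode == ".any":
        return any(predicates)
    if mode == ".uni":
        return len(set(predicates)) <= 1
    if mode == ".ballot":
        word = 0
        for lane, p in votes:
            if p:
                word |= 1 << lane   # set the lane's bit in the result word
        return word
    raise ValueError("unknown vote mode")

# e.g. vote(".ballot", [(0, True), (1, False), (2, True)]) == 0b101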

The comm function handles votes and barriers at the warp level. While votes are entirely processed at the warp level, thread barrier actions are coalesced and passed on to the warp’s thread block. The function extracts the barrier indices the warp has to wait for, if any, and generates a warp action. For barriers, the warp’s thread block updates its barrier states based on the information stored in the warp action. Votes, on the other hand, generate a warp memory action that contains a memory action for each participating thread. The memory action writes the result of the vote into the destination registers of the threads. We use _ to omit irrelevant parts of an instruction.

comm : Instr × CommActΘ* → (BarrierIdx* × ActΩ,ε)⊥
comm(vote vm _, −→tactvote)   = (ε, ⋃ t ∈ −→tactvote . tregAddr ← vote(vm, −→tactvote) tmemInf)
comm(bar.sync _, −→tactbar)   = (⋃ t ∈ −→tactbar . tbarid, bar.sync −→tactbar)
comm(bar.red br _, −→tactbar) = (⋃ t ∈ −→tactbar . tbarid, bar.red br −→tactbar)
comm(bar.arrive _, −→tactbar) = (ε, bar.arrive −→tactbar)

comm(instr, −→tactcomm) = ⊥   otherwise

5.4.3 Warp Rules

Transition system →Ω defines the rules of the warp semantics. It optionally passes a warp action on to the warp's thread block.

Configurations:  ⟨ω | φ, %⟩ ∈ Warp × ProgEnv × ThreadRegLookup,  ω ∈ Warp
Transitions:     ⟨ω | φ, %⟩  ─[ wactε ]→Ω  ω′

Each rule invokes the →Θe transition system for each thread. Whether a thread is allowed to execute the warp's current instruction, i.e. whether the thread is inactive according to the warp's active mask or whether the thread's guard evaluates to false, is checked at the thread level and is of no interest to the warp rules. Threads generate three different types of thread actions: control flow actions, communication actions, or memory actions. As threads of the same warp execute a program in lockstep, we know that all threads generate the same type of thread actions during each step. This enables us to define a distinct rule in transition system →Ω for each of the three thread action types. Rule (bra) handles control flow actions. As those can be handled at the warp level, no warp action travels up the hierarchy. The warp uses transition system →CRS to update its execution state. The rules of →CRS might set the warp's state to completed provided that all threads have completed execution. Rule (bra) is also used when the threads do not generate any thread actions; in this case, rule (next) of transition system →CRS increments the program counter, so the threads execute the next instruction when the warp is scheduled again.
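Since all threads of a warp produce the same kind of action in a given step, the three warp rules can be read as a three-way dispatch. The following Python sketch conveys that reading before the rules are stated formally; the handler names are placeholders, and actions are assumed to be dictionaries carrying a "kind" field.

def warp_step(thread_actions, handle_flow, handle_comm, handle_mem):
    """Dispatch on the kind of action the lockstepped threads produced in this step;
    `thread_actions` holds one action dictionary (or None) per thread."""
    kinds = {a["kind"] for a in thread_actions if a is not None}
    if kinds <= {"flow"}:        # control flow actions, or no actions at all -> rule (bra)
        return handle_flow(thread_actions)
    if kinds == {"comm"}:        # vote and barrier actions -> rule (comm)
        return handle_comm(thread_actions)
    if kinds == {"mem"}:         # memory actions, bundled per warp -> rule (mem)
        return handle_mem(thread_actions)
    raise AssertionError("threads of a warp never mix action kinds in one step")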

         ⟨θ1 | φ, pc, α, %⟩ ─[ tactflow,1ε ]→Θe θ1′   ...   ⟨θn | φ, pc, α, %⟩ ─[ tactflow,nε ]→Θe θn′
(bra)  ─────────────────────────────────────────────────────────────────────────────────────────────
         ⟨ω . (ς . pc, α), θ1 ... θn | φ, %⟩  ─[ ε ]→Ω  ⟨ω / state′, ς′, θ1′ ... θn′⟩
       if ς  ─[ ⋃ i ∈ 1...n . tactflow,iε ]→CRS  ⟨state′, ς′⟩

          ⟨θ1 | φ, pc, α, %⟩ ─[ tactcomm,1ε ]→Θe θ1′   ...   ⟨θn | φ, pc, α, %⟩ ─[ tactcomm,nε ]→Θe θn′
(comm)  ──────────────────────────────────────────────────────────────────────────────────────────────
          ⟨ω . (ς . pc, α), θ1 ... θn | φ, %⟩  ─[ wactε ]→Ω  ⟨ω / (ς / pc + 1), −→barid, θ1′ ... θn′⟩
        if (−→barid, wactε) = comm(φinstrEnv(pc), ⋃ i ∈ 1...n . tactcomm,iε) ∧ ∃i ∈ {1 ... n} . tactcomm,i ≠ ε

When the threads execute a memory operation, the warp coalesces all of the threads' memory actions into a single warp memory action. The memory environment can use this grouping information to optimize memory accesses. Again, the warp's program counter is incremented as memory operations do not affect the control flow of the threads.
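Before giving the (mem) rule, here is an informal Python sketch of the grouping step just described: the per-thread memory actions of one warp step are simply bundled into a single warp-level action. The segment counting is purely illustrative (the 128-byte segment size is an assumption of the example); the actual coalescing rules live in the memory environment and are more involved.

def coalesce(thread_mem_actions, segment_size=128):
    """Bundle the non-empty per-thread memory actions of one warp step into a warp
    memory action and, for illustration only, count how many aligned segments
    the accesses touch."""
    actions = [a for a in thread_mem_actions if a is not None]
    segments = {a["address"] // segment_size for a in actions}
    return {"actions": actions, "touched_segments": len(segments)}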

         ⟨θ1 | φ, pc, α, %⟩ ─[ tactmem,1ε ]→Θe θ1′   ...   ⟨θn | φ, pc, α, %⟩ ─[ tactmem,nε ]→Θe θn′
(mem)  ─────────────────────────────────────────────────────────────────────────────────────────────
         ⟨ω . (ς . pc, α), θ1 ... θn | φ, %⟩  ─[ ⋃ i ∈ 1...n . tactmem,iε ]→Ω  ⟨ω / (ς / pc + 1), θ1′ ... θn′⟩
       if ∃i ∈ {1 ... n} . tactmem,i ≠ ε

5.5 Thread Block Semantics

When a kernel is launched on the GPU, the thread blocks that execute the kernel are not immediately created in the general case. A grid may consist of more threads than can concurrently be handled by the GPU's streaming multiprocessors. The Giga Thread scheduler

checks whether any SMs have enough free resources to execute an additional thread block. If so, one of the available SMs is selected. The SM allocates the warps and threads, and reserves the registers for each thread as well as the shared memory for the thread block. After a thread block has been assigned to an SM, it cannot be moved to another one. Furthermore, all threads and warps of a thread block are allocated immediately, hence the number of threads and warps resident on the SM must not exceed the GPU's hardware limits. Additionally, there must be enough unassigned registers and a sufficient amount of shared memory to allocate the thread block. Once a warp is finished, its resources are freed. Another thread block may be assigned to an SM once a sufficient amount of resources has become available again, even when none of the SM's thread blocks are fully completed yet, provided that there is enough shared memory available. The lifetime of warps belonging to the same thread block can vary greatly; in a contrived program, one warp might execute forever, whereas all other warps terminate immediately. A thread block can be in one of three states: It is either actively executed, completed, or it has not yet been scheduled, i.e. it has not yet been assigned to a streaming multiprocessor.

BlockState = {active, unscheduled, completed}

Like threads, thread blocks are uniquely identified by a three-dimensional index. Each dimension of the index is limited by the maximum grid size supported by the hardware.

bid ∈ BlockIdx = ∏_{dim ∈ {x,y,z}} {0, ..., MaxGridSize_dim − 1}

Within a thread block with warps −→ω, all threads have unique thread indices. Different thread blocks of the same grid, however, share all of their thread indices. Therefore, if a global thread index is required, the thread's thread block index must be taken into account.

∀θ_1, θ_2 ∈ {θ | ω ∈ −→ω ∧ θ ∈ ω_θ⃗} . θ_1 ≠ θ_2 ⇒ θ_1,tid ≠ θ_2,tid

A streaming multiprocessor might execute several thread blocks concurrently. Each of these thread blocks has access to a region of shared memory. However, when threads execute a shared memory load or store instruction on a shared memory variable, they all request the same address because shared memory addresses are assigned globally to the variables. Similar to the register offset, we assign a shared memory offset to each thread block to allow the threads to access the correct region of shared memory. Therefore, the thread block rules amend memory requests with the thread block's offset into shared memory.

so ∈ SharedMemOffset ⊆ PhysMemAddr

The memory environment also needs to know the index of the streaming multiprocessor executing the threads that issued a memory request. Thus, memory requests generated by the scheduled warps are coalesced into a thread block action which stores the thread block's processor index and shared memory offset. Since barrier actions are handled at the thread block level, only memory requests continue to travel up the hierarchy.

bact ∈ Act_B ::= ProcIdx SharedMemOffset MemAct*_Ω

A thread block comprises its block and processor indices, its shared memory offset, the thread block's current state, and its warps. Additionally, a thread block keeps track of its barriers using the domain of barrier environments defined in the next section.

β ∈ ThreadBlock = BlockIdx × ProcIdx × BlockState × BarrierEnv × SharedMemOffset × Warp*
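As a source-level illustration of why the shared memory offset is needed (a hedged sketch, not part of the formalization): a __shared__ array is declared once in a CUDA-C kernel, yet every resident thread block operates on its own physical copy; the per-block offset introduced above is what separates these copies. The kernel assumes a launch with 256 threads per block.

    __global__ void blockLocalSum(const int *in, int *out)
    {
        // One declaration, but each thread block works on its own physical copy;
        // at the program level all blocks refer to the same shared memory address.
        __shared__ int buffer[256];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        buffer[threadIdx.x] = in[gid];
        __syncthreads();

        if (threadIdx.x == 0)
        {
            int sum = 0;
            for (int i = 0; i < blockDim.x; ++i)
                sum += buffer[i];
            out[blockIdx.x] = sum;
        }
    }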

In each dimension, the thread block index ranges between 0 and the grid size for all thread blocks β . bid and all program configurations ξ . gridSize.

0 ≤ bid_x < gridSize_x ∧ 0 ≤ bid_y < gridSize_y ∧ 0 ≤ bid_z < gridSize_z

5.5.1 Formalization of the Thread Block Synchronization Mechanism

Threads belonging to the same thread block can synchronize execution using the bar instruction. Thread execution continues once the specified number of threads has reached the barrier. Barriers can optionally be used to perform a reduction operation across threads. In this section, we formalize the domain of barriers and the operations used by the rules of the thread block semantics to initialize, update, and complete them. However, as the PTX specification only gives a rough outline, we have to make a number of assumptions in this section. For the sake of completeness, we repeat the definitions of the domains of barrier indices and thread counts below. Additionally, we introduce the domain of arrival counts. The thread count specifies how many threads have to reach a barrier before it completes, whereas the arrival count denotes the number of threads that have already reached a barrier.

barid ∈ BarrierIdx = {0, ..., NumBarrierIndices − 1}
tc ∈ ThreadCnt = {n ∈ ℕ | n mod WarpSize = 0}
ac ∈ ArrCnt ⊆ ThreadCnt

The domain of barriers therefore stores both the thread and the arrival count. Optionally, a barrier reduction operation might have to be carried out. Hence the domain stores the reduction type and the thread barrier actions that define the destination registers as well as the input values for the reduction.

bar ∈ Barrier = ThreadCnt × ArrCnt × BarrierRed_ε × BarrierAct*_Θ

The barrier environment maps a barrier index to a barrier. The value ε denotes a barrier index that is currently not in use, i.e. no barrier with the given index has been initiated or the barrier has already been completed. A program can reuse the same barrier index as often as necessary.

χ ∈ BarrierEnv = BarrierIdx → Barrier_ε

We define the function updBar that is responsible for initializing and updating barriers when the threads execute a barrier instruction. Before the function updates the barrier environment, however, it performs several consistency checks. First, warp barrier actions are converted into elements of the domain of barrier information. A barrier information element consists of the barrier index, the barrier's thread count as specified by the threads, the barrier type, and the reduction operation.

i ∈ BarInfo = BarrierIdx × ThreadCnt_ε × BarrierType × BarrierRed_ε

During the conversion process, we perform several consistency checks. There might be thread barrier actions that contradict each other; for instance, one thread uses some barrier without a reduction operation, another thread uses the same barrier with a reduction operation. The PTX documentation clearly specifies that mixing bar.red with either bar.arrive

or bar.sync results in unpredictable behavior. On the other hand, it does not state what happens if different thread counts are specified. Most likely, one of the specified thread counts is chosen randomly; but since randomly selecting some thread count also results in unpredictable behavior, we simply do not support inconsistent barrier information. The barInfo function converts warp barrier actions into barrier information elements. It goes recursively through all generated warp actions and their contained thread actions. If it encounters any contradictions, it returns ⊥. Barrier information is not duplicated, i.e. the list of barrier information elements returned by barInfo contains each barrier index at most once.

barInfo : BarrierAct*_Ω → (BarInfo*)_⊥
barInfo(ε) = ε
barInfo((wact_bar . bt, br_ε, ε) ∘ −→wact_bar) = barInfo(−→wact_bar)
barInfo((wact_bar . bt, br_ε, (tact_bar . barid, tc_ε) ∘ −→tact_bar) ∘ −→wact_bar) =
   −→i         if ∀i′ ∈ −→i . i′_barid = barid ⇒ i′ = i
   i ∘ −→i     if ∀i′ ∈ −→i . i′_barid ≠ barid
   ⊥           otherwise
   where i = (barid, tc_ε, bt, br_ε) ∧ −→i = barInfo((wact_bar / −→tact_bar) ∘ −→wact_bar)

If the generated barrier actions are consistent, the function prepBar initializes uninitialized barriers referenced by the barrier actions and performs consistency checks on already initialized barriers. For uninitialized barriers, the thread count is either set to the value stored in the barrier information element or to the thread count passed to the function, depending on whether the threads specify a thread count. The value passed to the function is the number of threads of the thread block. For already initialized barriers, another consistency check is performed because the barrier could have been initialized with data that contradicts the data stored in the barrier information elements.

prepBar : BarrierEnv × ThreadCnt × BarInfo* → BarrierEnv_⊥
prepBar(χ, tc, ε) = χ
prepBar(χ, tc, (i . barid, tc′_ε, bt, br_ε) ∘ −→i) =
   prepBar(χ[barid ↦ (tc″, 0, br_ε, ε)], tc, −→i)   if χ(barid) = ε ∧ tc″ = (tc′_ε if tc′_ε ≠ ε, tc otherwise)
   prepBar(χ, tc, −→i)                              if (bar . tc′, br_ε) = χ(barid)
                                                     ∧ (tc′_ε ≠ ε ⇒ tc′_ε = tc′)
                                                     ∧ (br_ε ≠ ε ⇒ bt = red)
   ⊥                                                otherwise

Next, the newly generated thread actions are copied into the barrier environment. When a barrier completes, it uses the thread actions to generate the memory operations that store the result of the optional reduction operation. The specification does not state what happens if a barrier with a reduction operation completes but some threads did not execute a corresponding bar.red instruction because they were deactivated according to their warps' active masks. We assume that they are ignored during the reduction. The function updAct

goes recursively through all generated warp actions and their contained thread actions and copies the thread actions into the barrier environment.

updAct : BarrierEnv × BarrierAct*_Ω → BarrierEnv
updAct(χ, ε) = χ
updAct(χ, (wact_bar . bt, br_ε, ε) ∘ −→wact_bar) = updAct(χ, −→wact_bar)
updAct(χ, (wact_bar . bt, br_ε, (tact_bar . barid) ∘ −→tact_bar) ∘ −→wact_bar) =
   updAct(χ[barid ↦ χ(barid) / tact_bar ∘ χ(barid)_−→tact_bar], (wact_bar / −→tact_bar) ∘ −→wact_bar)

Last, the arrival counts of all barriers referenced by at least one thread action are incremented. Barriers are incremented with warp size granularity, i.e. if a thread executes a barrier instruction it is as if all threads of the same warp executed the barrier instruction, even those that are currently inactive because of the warp's active mask or because the guard evaluated to false. The function updArrCnt increments the arrival counts of all barriers referenced by at least one thread action; if n warps reference a barrier, its arrival count is incremented by n times the warp size.

updArrCnt : BarrierEnv × BarrierAct*_Ω → BarrierEnv
updArrCnt(χ, ε) = χ

updArrCnt(χ, wact_bar ∘ −→wact_bar) = updArrCnt(χ[barid_1 ... barid_n ↦
   χ(barid_1) / χ(barid_1)_ac + WarpSize ... χ(barid_n) / χ(barid_n)_ac + WarpSize], −→wact_bar)

where {barid_1, ..., barid_n} = {tact_bar,barid | tact_bar ∈ wact_bar,−→tact_bar}

The updBar function assembles the four functions defined above. Both barInfo and prepBar might return ⊥; however, ⊥ can be assigned to neither −→i nor χ′. Hence, updBar returns ⊥ if there are any inconsistencies found in the generated thread actions.

updBar : BarrierEnv × ThreadCnt × BarrierAct*_Ω → BarrierEnv_⊥
updBar(χ, tc, −→wact_bar) =
   updArrCnt(updAct(χ′, −→wact_bar), −→wact_bar)   if −→i = barInfo(−→wact_bar) ∧ χ′ = prepBar(χ, tc, −→i)
   ⊥                                              otherwise

The function reduce performs a barrier reduction operation. Supported reduction operations are .and, .or, and .popc. The former two operations perform a conjunction or disjunction over all source predicates, respectively. The latter operation returns the population count, i.e. the number of source predicates that are true. As mentioned above, threads that did not execute a corresponding bar.red instruction because they were inactive for some reason are ignored during the computation of the reduction. This is just an assumption, however, as the actual behavior is unspecified.

reduce : BarrierRed × BarrierAct*_Θ → DataWord
reduce(.and, −→tact_bar) = ⋀_{t ∈ −→tact_bar} t_p
reduce(.or, −→tact_bar) = ⋁_{t ∈ −→tact_bar} t_p
reduce(.popc, −→tact_bar) = |{t ∈ −→tact_bar | t_p}|
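At the CUDA-C level, the three reduction operations correspond to the barrier reduction intrinsics available since compute capability 2.0; the following kernel is only a hedged illustration of this correspondence.

    __global__ void barrierReductions(const int *in, int *out)
    {
        int pred = in[threadIdx.x] != 0;

        // These intrinsics are compiled to bar.red with .and, .or, and .popc, respectively.
        int allNonZero = __syncthreads_and(pred);    // conjunction over all threads of the block
        int anyNonZero = __syncthreads_or(pred);     // disjunction
        int numNonZero = __syncthreads_count(pred);  // population count

        if (threadIdx.x == 0)
        {
            out[0] = allNonZero;
            out[1] = anyNonZero;
            out[2] = numNonZero;
        }
    }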

The barAct function generates a memory request to store the result of a reduction operation in the destination registers of all participating threads. Block actions consist of warp memory actions that in turn consist of thread memory actions. Therefore, the function has to group the barrier's thread actions into warps. The memory action for a thread is then added to the appropriate warp memory action.

barAct : Barrier × ProcIdx × SharedMemOffset × Warp* → Act_{B,ε}
barAct((bar . br_ε, −→tact_bar), pid, so, −→ω) =
   pid so ⋃_{ω ∈ −→ω} ⋃_{(tact_bar . regAddr, memInf) ∈ −→tact_bar ∧ tact_bar,tid ∈ {θ_tid | θ ∈ ω_θ⃗}}
      regAddr ← reduce(br_ε, −→tact_bar) memInf      if br_ε ≠ ε
   ε                                                 otherwise

The function remBar removes a completed barrier from the warps' lists of barrier indices. If a warp does not wait for other barriers to complete, it can be scheduled again in a future clock cycle. A warp cannot be scheduled if its list of barrier indices is not empty. As soon as some threads of a warp execute a barrier instruction, the barrier indices stored in the thread actions are added to the warp's list of barrier indices. Since the barrier index can be specified by either a constant or by a register, it is possible that the threads of a warp, including the inactive ones, reach several barriers simultaneously and the warp cannot be scheduled until all barriers have completed. We do not know whether this is actually supported by the hardware as the documentation does not mention this situation.

remBar : BarrierIdx × Warp* → Warp*
remBar(barid, ε) = ε
remBar(barid, (ω . −→barid) ∘ −→ω) = (ω / −→barid \ barid) ∘ remBar(barid, −→ω)

endBar checks whether the arrival count of some barrier is greater than or equal to the barrier's thread count. If so, the barrier is completed and reset to ε in the barrier environment. Additionally, the barrier is removed from the list of barrier indices of all warps of the thread block. Furthermore, if the barrier should perform a reduction operation on completion, a block action is generated that updates the destination registers of all participating threads.

endBar : BarrierEnv × ProcIdx × SharedMemOffset × Warp* → BarrierEnv × Act_{B,ε} × Warp*
endBar(χ, pid, so, −→ω) = (χ[barid ↦ ε], barAct(χ(barid), pid, so, −→ω), remBar(barid, −→ω))
   if (bar . tc, ac) = χ(barid) ∧ ac ≥ tc

Apparently, when a warp completes program execution, it has implicitly arrived at all barriers. This behavior is not documented, but test programs indicate that warps waiting for a barrier are allowed to continue as soon as warps that never reached the barrier complete. Our formalization does not reflect this behavior. The CUDA specification clearly warns about using barrier instructions that are not reached by all warps or by all threads of a warp.
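The warning can be made concrete with a small, hypothetical CUDA-C fragment: a barrier that is reached by only some threads of a block results in undefined behavior, which is exactly the situation our formalization does not attempt to model.

    __global__ void conditionalBarrier(int *data)
    {
        if (threadIdx.x < 16)
        {
            data[threadIdx.x] *= 2;
            __syncthreads();   // undefined: only some threads of the block reach this barrier
        }

        // Safe variant: perform the conditional work first and synchronize unconditionally.
        // if (threadIdx.x < 16) data[threadIdx.x] *= 2;
        // __syncthreads();
    }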

5.5.2 Thread Block Rules

Thread block actions add the thread block's processor index and shared memory offset to memory actions generated by the block's threads. Conversely, the semantics of thread blocks passes these two values to a thread block register look up function in order to obtain a thread register look up function that can be passed down to the warps.

% ∈ BlockRegLookup = BlockIdx × ProcIdx → ThreadRegLookup

Typically, a thread block consists of more warps than can be concurrently executed on the thread block's streaming multiprocessor. But as there can be more than one thread block resident on an SM, warp execution cannot be scheduled at the thread block level. The thread block rules are therefore provided with a warp schedule that determines which of the thread block's warps execute a step. The other warps have to remain idle.

s_Ω ∈ WarpSchedule = Warp*

Transition system →_B defines the semantics of thread blocks. Given the program environment, a thread block register look up function, and a warp schedule, the thread block executes a step, possibly emitting several thread block actions.

Configurations: ⟨β | φ, %, s_Ω⟩ ∈ ThreadBlock × ProgEnv × BlockRegLookup × WarpSchedule, β ∈ ThreadBlock

Transitions: ⟨β | φ, %, s_Ω⟩ −(−→bact)→_B β′

The (exec) rule executes a step of an active thread block. The scheduled warps are executed and the generated warp actions are coalesced into a thread block action or used to update the state of the barrier environment. Thread block actions containing an empty set of warp memory actions are later ignored at the device level. Additionally, the default thread count for newly initialized barriers is calculated and passed to the updBar function. The default thread count is the number of active warps times the warp size; this ensures that the barrier is not waiting for warps to reach the barrier that have already been completed before the barrier is initialized.

(exec)
   ⟨ω_i | φ, %(bid, pid)⟩ −wact_ωε→_Ω ω′_i   for all ω_i ∈ s_Ω
   ──────────────────────────────────────────────────────────────────────────────────
   ⟨β . bid, pid, active, χ, so, ω_1 ... ω_n | φ, %, s_Ω⟩ −(pid so ⋃_{ω ∈ s_Ω} wact_mem,ωε)→_B β / χ′, ω′_1 ... ω′_n

   if ∀ω_i ∈ ω_1 ... ω_n \ s_Ω . ω′_i = ω_i ∧ χ′ = updBar(χ, tc, ⋃_{ω ∈ s_Ω} wact_bar,ωε)
   where tc = |{(ω . active) ∈ ω_1 ... ω_n}| · WarpSize

If there are any completed barriers, the (endBar) rule ends a single barrier after the thread block executed a step. The (endBar) rule can be applied several times during a single step of a thread block, hence several thread block actions might be output by transition system →_B. We ensure that only barriers of active thread blocks can be completed to avoid writing the result of a reduction operation into registers that might already be used by threads of another thread block.

(endBar)
   ⟨β | φ, %, s_Ω⟩ −(−→bact)→_B β′ . active, χ, ω_1 ... ω_n
   ─────────────────────────────────────────────────────────
   ⟨β . pid, so | φ, %, s_Ω⟩ −(bact_ε ∘ −→bact)→_B β′ / χ′, ω′_1 ... ω′_n

   if (χ′, bact_ε, ω′_1 ... ω′_n) = endBar(χ, pid, so, ω_1 ... ω_n) ∧ χ ≠ χ′

After a thread block executed a step, it is checked whether all of the thread block's warps have finished executing the program, i.e. whether their state is set to completed. If so, the thread block's state is set to completed. The resources occupied by the thread block become available again and the Giga Thread scheduler might be able to issue another thread block to the streaming multiprocessor.

(compl)
   ⟨β | φ, %, s_Ω⟩ −(−→bact)→_B β′ . active, ω_1 ... ω_n
   ─────────────────────────────────────────────────────
   ⟨β | φ, %, s_Ω⟩ −(−→bact)→_B β′ / completed

   if ∀i ∈ {1 ... n} . ω_i,state = completed

Completed and unscheduled thread blocks have to remain idle. This rule is required because grids use transition system →_B to advance all of their thread blocks. What a thread block does is decided at the thread block level; for example, warps are executed if there are any scheduled ones and barriers might be completed. However, neither completed nor unscheduled thread blocks are allowed to do any of those things; they have to remain idle whenever their grid performs a step. Without the (idle) rule, the grid semantics would be stuck as soon as there are any unscheduled or completed thread blocks.

(idle) ⟨β . blockState | φ, %, s_Ω⟩ →_B β   if blockState ≠ active

5.6 Grid Semantics

When a kernel is launched on the GPU, a grid comprising several thread blocks is created. We assign an abstract but unique grid index to each grid.

gid ∈ GridIdx

Grids can be either active or completed. Once a grid has completed execution of the kernel, the GPU reports the grid completion to the graphics driver; for the driver, the completion of a grid might be a synchronization event. Driver notification is handled at the context and device level, however.

GridState = {active, completed}

The domain of grids stores the grid's state and index, as well as the program configuration used to execute the program. Among other things, the program configuration selects the entry point of the PTX program to execute, i.e. the first instruction each thread processes, and it determines the number of thread blocks and threads created to execute the kernel.

γ ∈ Grid = GridState × GridIdx × ProgConf × ThreadBlock*

Warp scheduling cannot be performed at the grid level, because a streaming multiprocessor might concurrently execute thread blocks and warps of different grids. Therefore, a block

schedule is given to the rules of the grid semantics. The block schedule determines for each thread block which of its warps should perform a step.

s_B ∈ BlockSchedule = ThreadBlock → Warp*

The rules of the grid semantics patch through all generated thread block actions unchanged. However, they are given a grid register look up function that they have to use to obtain a thread block register look up function.

% ∈ GridRegLookup = GridSize × BlockSize → BlockRegLookup

Within a grid with thread blocks −→β, all thread blocks have unique thread block indices. Different grids of the same context, however, share some or all of their thread block indices depending on the dimensions of the grids.

∀β_1, β_2 ∈ −→β . β_1 ≠ β_2 ⇒ β_1,bid ≠ β_2,bid

Transition system →_Γ defines the semantics of grids. It coalesces the thread block actions output by its thread blocks but does not change them otherwise.

Configurations: ⟨γ | φ, %, s_B⟩ ∈ Grid × ProgEnv × GridRegLookup × BlockSchedule, γ ∈ Grid

Transitions: ⟨γ | φ, %, s_B⟩ −(−→bact)→_Γ γ′

The (exec) rule uses transition system →_B to process the thread blocks. Whether a thread block actually does anything is decided at the thread block level. Unscheduled and completed thread blocks do not perform any actions, whereas active thread blocks are allowed to end a barrier and to execute some of their warps in accordance with the warp schedule obtained from the block schedule.

(exec)
   ⟨β_1 | φ, %′, s_B(β_1)⟩ −(−→bact_1)→_B β′_1   ...   ⟨β_n | φ, %′, s_B(β_n)⟩ −(−→bact_n)→_B β′_n
   ──────────────────────────────────────────────────────────────────────────────────
   ⟨γ . ξ, β_1 ... β_n | φ, %, s_B⟩ −(⋃_{i ∈ {1...n}} −→bact_i)→_Γ γ / β′_1 ... β′_n

   where %′ = %(ξ_gridSize, ξ_blockSize)

Once all of a grid’s thread blocks have completed execution, the grid’s state is set to completed. The context subsequently removes the grid from its list of grids and informs the graphics driver about the completion of the kernel launch. We ensure that rule (compl) can only be applied once by requiring that the grid is in its active state before it is marked completed.

(compl)
   ⟨γ | φ, %, s_B⟩ −(−→bact)→_Γ γ′ . active, β_1 ... β_n
   ─────────────────────────────────────────────────────
   ⟨γ | φ, %, s_B⟩ −(−→bact)→_Γ γ′ / completed

   if ∀i ∈ {1 ... n} . β_i,blockState = completed
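For reference, the program configuration ξ of a grid corresponds to the execution configuration of a CUDA-C kernel launch on the host: the grid and block dimensions chosen there determine how many thread blocks and threads the grid semantics has to create. The host function below is a hedged sketch; the kernel name and sizes are made up.

    __global__ void kernel(float *data);  // entry point; definition omitted

    void launch(float *d_data)
    {
        // Program configuration: a grid of 120 thread blocks with 256 threads each.
        dim3 gridSize(120);
        dim3 blockSize(256);

        // Creates the grid; its thread blocks start out unscheduled and are issued
        // to the streaming multiprocessors as resources become available.
        kernel<<<gridSize, blockSize>>>(d_data);
    }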

5.7 Context Semantics

Contexts allow the host program to split an algorithm into several kernels that share global memory and launch those kernels in a fixed order. This approach works around the limitations of CUDA's thread synchronization capabilities. As CUDA does not support global thread synchronization, splitting an algorithm into several kernels emulates that behavior. The performance cost might be relatively high, though, because kernel launches are a slow operation. But there is no other way to synchronize access to global memory for threads of different thread blocks. A PTX program potentially defines several entry points. The program configuration of a grid determines which entry point the threads execute. A grid therefore always executes one entry point of one PTX program. A context, on the other hand, manages several PTX programs. After a program has been loaded into a context, the memory used by the program is reserved and the host program can launch a grid of thread blocks to execute one of the program's entry points. Additionally, the host program might abort grid execution at any time. Kernels launched within the same context share a virtual address space, hence they are able to access the same memory locations. We define a function to map virtual memory addresses to physical memory addresses. The function is amended to the thread block actions generated by the context's active thread blocks so that the memory environment is able to convert a virtual address to a physical one.

vms ∈ VirtMemSpace = VirtMemAddr → PhysMemAddr_⊥
cact ∈ Act_C = VirtMemSpace Act*_B

Again, we use unique indices to identify contexts. For this reason, we introduce the abstract domain of context indices.

cid ∈ ContextIdx

A context manages a list of program environments. As the program environment executed by a grid is not stored at the grid level, we need a way to figure out which program environment to pass to the grid semantics. The program map lets us do that, as it maps grid indices to their corresponding program environment indices.

pm ∈ ProgMap = GridIdx → ProgIdx_⊥

The domain of contexts stores the context's virtual address space function, the program map, a list of loaded programs, the context index, and a list of grids that are currently executing a kernel of one of the context's program environments. Additionally, the context communicates with the graphics driver using context input/output actions defined in section 5.7.2.
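A hedged CUDA-C sketch of the kernel-splitting pattern just described (the kernel names are illustrative); the formal definition of contexts follows below. Two kernels launched into the same context, and here into the same default stream, execute one after the other, so the boundary between the launches acts as a grid-wide synchronization point on global memory.

    __global__ void phase1(float *data);  // writes intermediate results to global memory
    __global__ void phase2(float *data);  // must observe all global memory writes of phase1

    void runAlgorithm(float *d_data)
    {
        // Kernels in the same stream of a context are executed in launch order, so all
        // thread blocks of phase1 complete before any thread block of phase2 starts.
        phase1<<<120, 256>>>(d_data);
        phase2<<<120, 256>>>(d_data);
    }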

ζ ∈ Context = VirtMemSpace × ProgMap × ProgEnv* × ContextIdx × ContextIO × Grid*

5.7.1 Formalization of Warp, Thread Block, and Grid Scheduling

At the context level we are finally able to schedule warps, thread blocks, and grids. Fermi-based GPUs are capable of executing up to 16 grids belonging to the same context concurrently. Therefore, the utilization of the streaming multiprocessors is only known at the

context level but not at lower levels as mentioned before. The context semantics are responsible for issuing thread blocks to available SMs, for scheduling warps for each SM, and for scheduling grid execution. We define three functions that handle these scheduling tasks; however, as the scheduling algorithms used by the hardware are unknown, we do not give concrete implementations. There are some assumptions that we have to make though. We document them in the remainder of this section. The scheduleGrids function selects a subset of a context's grids that are subsequently allowed to execute a step. As Nvidia has just recently announced that future GPUs will support kernel preemption, we assume that it is currently unsupported.

schedule_Grids : Grid* → Grid*

The purpose of the function issueBlocks is to distribute thread blocks to streaming multiprocessors with available execution capacity. As SM utilization is known at the context level because only one context is active at any point in time, issueBlocks is able to assign unscheduled thread blocks to available SMs. When it assigns a thread block to an SM, it performs the following steps: The thread block's state is set to active and its processor index is set to the index of the processor the thread block is assigned to. All warps and threads of the thread block are created. The warps' program counters are set to the first address of the entry point, which can be obtained from the grid's program configuration. Additionally, the threads' call stacks are initialized with the entry point's local variable environment. Also, the stack offset is set to its initial value; the concrete value depends on the number and sizes of globally defined local memory variables in the program. If no unscheduled thread blocks are left or all streaming multiprocessors are fully occupied, the function returns ⊥. Otherwise, it returns all grids passed to it, with the appropriate changes made to the newly scheduled grids and all other grids unchanged.

issueBlocks : Grid* → Grid*_⊥

A warp contains threads with successive thread indices, i.e. the first warp of a block contains threads 0 to 31, the second warp consists of threads 32 to 63, and so on. We use the one-dimensional representation (tid)_1 of a thread index tid to formalize this axiom.

∀ω ∈ Warp . ∃n ∈ ℕ . n mod WarpSize = 0 ∧ {(θ_tid)_1 | θ ∈ ω_θ⃗} = {n, ..., n + WarpSize − 1}

Additionally, within a warp, threads are sorted in an ascending order based on their one-dimensional thread index representation.

∀ω ∈ Warp . ∀wl_1, wl_2 ∈ WarpLane . wl_1 > wl_2 ⇒
   ∃(θ_1 . tid_1, wl_1), (θ_2 . tid_2, wl_2) ∈ ω_θ⃗ . (tid_1)_1 > (tid_2)_1

If the number of threads executing a thread block β . −→ω with program configuration ξ . blockSize is not a multiple of the warp size, the last warp is filled with dummy threads whose active mask is inactive and whose disable mask is set to exit.

blockSize mod WarpSize ≠ 0 ⇒ ∃ω ∈ −→ω . (blockSize ÷ WarpSize) · WarpSize ∈ {θ_tid | θ ∈ ω_θ⃗}
   ∧ ∀wl ∈ WarpLane . wl ≥ blockSize mod WarpSize ⇒ ω_α(wl) = inactive ∧ ω_δ(wl) = exit

The function canBeScheduled determines whether the given warp is in a state that allows it to execute its next instruction. More precisely, the warp's state must be active; the warp does not wait for any barriers to complete, i.e. its barrier index list is empty; no active thread waits for a memory barrier to complete, i.e. the memory barrier levels of all threads of the warp are ε; and all registers of all active threads used by the next instruction are available, i.e. not blocked.

canBeScheduled : Warp → B

Based on the canBeScheduled function, we define the scheduling function scheduleWarps. It determines a schedule for the warps of scheduled grids, using the supplied grid register look up function to check whether any accessed register of a potentially scheduled warp is blocked, i.e. whether ⊥ is returned for any accessed register. It returns a block schedule for each scheduled grid.

scheduleWarps : Grid* × GridRegLookup × ProgMap × ProgEnv* → (Grid → BlockSchedule_⊥)

5.7.2 Context Input/Output

The graphics driver instructs the GPU to launch kernels, load programs, abort grids, and so on. Some of these actions are handled at the device level, others are processed at the context level. Conversely, the contexts and the device report back to the driver when a grid is completed, a program has been loaded successfully, etc. We define a separate rule system for context I/O actions to avoid cluttering up the rules of the context semantics too much. The driver can order a context to launch a kernel of some previously loaded program with some program configuration. The program configuration determines the entry point to use, as well as the grid and thread block dimensions among other things. Additionally, the driver may abort a grid that has not yet finished execution. Loading a PTX program into a context is handled at the device level, however, because the memory environment might be modified during the loading process.

cin ∈ ContextInput ::= start ProgIdx ProgConf | abort GridIdx

The context informs the graphics driver about completed grids and sends the grid index of a newly started grid back to the host program. The host program needs to know the grid index to be able to abort it. It is unclear whether the GPU or the graphics driver is responsible for index assignment; we assume it is the GPU's task.

cout ∈ ContextOutput ::= started GridIdx | completed GridIdx

The domain of context I/O actions therefore comprises lists of input and output actions. However, we do not enforce any ordering of incoming and outgoing messages.

cio ∈ ContextIO = ContextInput* × ContextOutput*

Transition system →_I/OC handles the I/O actions of a context. There are three different types of transitions: Either only an input action is handled, only an output action is generated, or an input action is handled and an output action is emitted at the same time. The rules of transition system →_I/OC are used by the rules of the context semantics later on.

Configurations: cio ∈ ContextIO

Transitions: cio −cout, cin→_I/OC cio′,   cio −cin→_I/OC cio′,   cio −cout→_I/OC cio′

Once a program has launched a kernel, the corresponding input action is removed from the list of pending input actions. The started message containing the grid index of the grid just launched is stored in the output list and submitted to the graphics driver later on.

(start) ⟨(start progid ξ) ∘ −→cin, −→cout⟩ −started gid, start progid ξ→_I/OC ⟨−→cin, (started gid) ∘ −→cout⟩

(abort) ⟨(abort gid) ∘ −→cin, −→cout⟩ −abort gid→_I/OC ⟨−→cin, −→cout⟩

(compl) ⟨−→cin, −→cout⟩ −completed gid→_I/OC ⟨−→cin, (completed gid) ∘ −→cout⟩

5.7.3 Context Rules

Transition system →_C defines the semantics of contexts. In each step, a context can either output several context actions or a context output message. Alternatively, a context input message can be sent to a context.

Configurations: ⟨ζ | %⟩ ∈ Context × GridRegLookup, ζ ∈ Context

Transitions: ⟨ζ | %⟩ −(−→cact)→_C ζ′,   ⟨ζ | %⟩ −cout→_C ζ′,   ⟨ζ | %⟩ −cin→_C ζ′

Rule (exec) executes a step of each scheduled grid and passes down a block schedule to each executing grid. If the emitted context action does not contain any thread block actions, the action is ignored at the device level.

(exec)
   ⟨γ_i | φ_pm(γ_i,gid), %, s_Γ(γ_i)⟩ −(−→bact_γi)→_Γ γ′_i   for all γ_i ∈ −→γ
   ──────────────────────────────────────────────────────────────────────────────────
   ⟨ζ . vms, pm, −→φ, γ_1 ... γ_n | %⟩ −(⋃_{γ ∈ −→γ} vms −→bact_γ)→_C ζ / γ′_1 ... γ′_n

   if ∀γ_i ∈ γ_1 ... γ_n \ −→γ . γ′_i = γ_i ∧ −→γ = schedule_Grids(γ_1 ... γ_n)
   ∧ s_Γ = scheduleWarps(−→γ, %, pm, −→φ)

If there are any unscheduled thread blocks and streaming multiprocessors with available execution capacity, the (issue) rule uses the issueBlocks function to assign unscheduled thread blocks to available SMs.

(issue)
   ⟨ζ / issueBlocks(γ_1 ... γ_n) | %⟩ −(−→cact)→_C ζ′
   ──────────────────────────────────────────────────
   ⟨ζ . γ_1 ... γ_n | %⟩ −(−→cact)→_C ζ′

If one of the context's grids is completed, it is removed from the context and an output message informing the host program about the completion of the kernel call is generated.

(compl)
   cio −completed γ_i,gid→_I/OC cio′    ⟨ζ | %⟩ −(−→cact)→_C ζ′ . cio, γ_1 ... γ_{i−1} γ_i γ_{i+1} ... γ_n
   ────────────────────────────────────────────────────────────────────────────────────────
   ⟨ζ | %⟩ −(−→cact)→_C ζ′ / cio′, γ_1 ... γ_{i−1} γ_{i+1} ... γ_n

if γ_i,gridState = completed

If a context input message is passed to a context, the message is stored in the context’s input/output list for later processing. Incoming messages are not processed in any particular order.

(in) ⟨ζ . (cio . −→cin) | %⟩ −cin→_C ζ / (cio / cin ∘ −→cin)

Context output messages that have been generated by earlier steps can be output in any order at any point in time.

(out) ⟨ζ . (cio . cout ∘ −→cout) | %⟩ −cout→_C ζ / (cio / −→cout)

A kernel launch command requires that the program the launched entry point belongs to has previously been loaded into the context. A new and unique grid index is generated and subsequently sent back to the host program. The program map is updated so that the correct program environment can be passed to the grid when it executes a step. The abstract function newGrid : GridIdx × ProgConf → Grid creates a new grid initialized with the given grid index and program configuration. All of the grid's thread blocks are created but remain in the unscheduled state until issueBlocks starts assigning thread blocks to streaming multiprocessors. The newly created grid is stored in the context's list of grids.

(start)
   cio −started gid, start progid ξ→_I/OC cio′    ⟨ζ / pm[gid ↦ progid], cio′, γ_1 ... γ_n γ | %⟩ −(−→cact)→_C ζ′
   ─────────────────────────────────────────────────────────────────────────────────────────────
   ⟨ζ . pm, −→φ, cio, γ_1 ... γ_n | %⟩ −(−→cact)→_C ζ′

where γ = newGrid(gid, ξ) ∧ ∃φ ∈ −→φ . φ_progid = progid ∧ ∀γ′ ∈ γ_1 ... γ_n . γ′_gid ≠ gid

The host program can abort execution of any kernel at any point in time. The aborted grid is removed from the context's list of grids and from the program map. Pending memory operations issued by the aborted grid are not aborted; as it is undefined when a grid is actually aborted, it is unpredictable which memory operations are completed before the abortion of the grid. Hence, aborting pending memory operations of aborted grids neither improves nor worsens predictability. Additionally, it is unclear whether the hardware contains logic dedicated to aborting outstanding memory operations of aborted grids.

(abort)
   cio −abort gid→_I/OC cio′    ⟨ζ / pm[gid ↦ ⊥], cio′, γ_1 ... γ_{i−1} γ_{i+1} ... γ_n | %⟩ −(−→cact)→_C ζ′
   ──────────────────────────────────────────────────────────────────────────────────────
   ⟨ζ . pm, cio, γ_1 ... γ_{i−1} γ_i γ_{i+1} ... γ_n | %⟩ −(−→cact)→_C ζ′

if γ_i,gid = gid

5.8 Device Semantics

The device represents the topmost level of the semantics hierarchy. Even though CUDA supports SLI configurations where two or more GPUs can be used by a CUDA application, our formalization focuses on the semantics of a single device only. We also do not take GPU/CPU interactions into account, as this would add yet another level of complexity to the semantics. The domain of devices manages the list of contexts created on the device. Additionally, the device communicates with the host program using device input/output actions as defined in section 5.8.2. The memory manager establishes the connection between the program semantics and the memory environment. We take a closer look at this connection in the following section.

∆ ∈ Device = MemMgr × DeviceIO × Context*

5.8.1 Linking the Program Semantics to the Memory Environment

Memory operations are issued at the thread level. Higher levels of the hierarchy group thread memory actions and amend additional data. At the device level, memory operations are stored in context actions. The memory environment on the other hand does not know about threads or context actions at all. It operates on memory operations which in turn are defined by a memory program, an implicit program state, and a memory operation priority. Consequently, we need to define a mapping from context actions to memory operations to connect the program semantics to the memory environment. First, we define the domain of memory managers. A memory manager consists of a memory environment that represents the state of all types of memory supported by our formalization of CUDA’s memory model. Additionally, it manages a list of pending memory operations as well as a function that maps a thread index to memory operation indices. The latter allows us to check whether a thread has completed a memory barrier by inspecting the state of the pending memory requests issued by the thread.

mm ∈ MemMgr = MemEnv × MemOp* × (ThreadIdx → (MemOpIdx*)_⊥)

The function memOp takes a memory manager and a list of context actions and returns an updated memory manager. This function is responsible for converting memory requests issued by some threads into memory operations understood by the memory environment. We leave the function abstract, as most of the conversion process is not sufficiently specified by the CUDA documentation.

memOp : MemMgr × ContextAction* → MemMgr_⊥

However, there are several assumptions we have to make about the conversion process. First of all, registers are always big enough to store a complete data word; memory operations, on the other hand, might specify smaller access sizes. This is not a problem for read operations as a smaller value can always be written into a larger register. For write operations, though, some of the register's bits are cut off if the access size is smaller than the size of the data word. But this does not pose a problem because of the preprocessing we perform to convert a PTX program into a program environment: During the preprocessing, the compiler checks

whether the program is type safe, i.e. whether all memory operations operate on input operands of matching sizes. For instance, a 64 bit register cannot be the source register of a 32 bit write operation. Type safety guarantees that the source register of a store operation is of equal or smaller size than the memory access size. Consequently, cutting off the highest bits of a register does not result in a loss of data, as these bits are guaranteed to be zero in any case. The memOp function inspects the given context actions. Empty context actions as well as thread block actions with empty warp memory actions are ignored. If the context actions indeed contain valid memory requests, the context actions contain all necessary information for the memOp function to select the correct memory program for the given memory operation type and to initialize the implicit memory program state accordingly. Memory operation coalescing is also performed by memOp. Furthermore, all registers affected by the operations are blocked and will only become available again once the associated memory program releases them. memOp also assigns priorities to all generated memory operations. As explained in chapter 4, in most cases we do not know the order in which memory operations are processed. We do know, however, that volatile reads and writes of the same thread are guaranteed to be processed in the order they were issued, hence we assume that memOp assigns lower priorities to volatile operations that are issued later during the execution of the program. Non-volatile operations might all be prioritized equally or there might be some unknown prioritization. Additionally, if threads of the same warp write to the same location in memory, it is undefined how many writes are actually processed and which write is processed last. The same is true for several atomic operations on the same address performed by threads of the same warp. If there are n concurrent accesses to the same address, the memOp function is free to generate between one and n memory operations with any priority. In our formalization of the PTX semantics, if a thread executes a membar instruction, the requested memory barrier level is stored in the thread's state. As long as the memory barrier level is not reset to ε, the warp scheduler is unable to schedule the thread's warp. The CUDA specification does not explain in detail when a memory barrier completes. Therefore, we define the abstract function updMemBar that checks the state of all memory operations associated with a thread waiting for a memory barrier. If all of those operations have advanced as far as necessary, the updMemBar function resets the thread's memory barrier to ε, allowing it to be scheduled again. updMemBar uses the information stored by the memory manager to retrieve all memory operations associated with a thread. When the memOp function creates a new memory operation, the memory manager's mapping of thread indices to memory operation indices is updated accordingly. To avoid having to clean up this mapping once a memory operation completes, we assume that the memOp function never reuses a memory operation index that it has previously used to identify a memory operation, even after this memory operation has long been completed.

updMemBar : Context × MemMgr → Context
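The undefined ordering of concurrent accesses mentioned above can be observed with a small, hypothetical kernel: if all threads of a warp write to the same global memory location without atomics, it is unspecified which value ends up in memory; with atomics, every operation takes effect, but their processing order remains unspecified.

    __global__ void concurrentWrites(int *plainResult, int *atomicResult)
    {
        // Non-atomic: all threads of the warp store to the same address. How many of
        // these stores are performed and which one is processed last is undefined.
        *plainResult = threadIdx.x;

        // Atomic: every addition is performed, but the processing order is still undefined.
        atomicAdd(atomicResult, (int)threadIdx.x);
    }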

5.8.2 Device Input/Output

We introduce device input and output messages similar to the context input/output mechanism defined in section 5.7.2. The device has to patch through context messages to the context level. Therefore, the device's input actions comprise messages that contain a context

input action and the index of the context the message should be forwarded to. Additionally, the host program can instruct the device to create or destroy a context as well as to load a program into a specific context. The domain Prog represents PTX programs after the application of the code transformations presented in section 5.1.3.

din ∈ DeviceInput ::= load Prog ContextIdx | create | destroy ContextIdx | ContextInput ContextIdx

Conversely, messages output by the device patch through messages emitted by the contexts. Furthermore, messages containing the indices of newly created contexts and recently loaded programs are sent to the host program.

dout ∈ DeviceOutput ::= created ContextIdx | ContextOutput ContextIdx | loaded ProgIdx ContextIdx

Analogous to the domain of context I/O actions, we define the domain of device I/O actions that comprises lists of input and output actions. Like for contexts, we do not enforce any ordering of incoming and outgoing messages.

dio ∈ DeviceIO = DeviceInput* × DeviceOutput*

Transition system →_I/OD handles the I/O actions of a device. A transition can either generate some output based on some input or silently handle some input without generating any output.

Configurations: dio ∈ DeviceIO

Transitions: dio −dout, din→_I/OD dio′,   dio −din→_I/OD dio′

Once a program has been loaded, the corresponding input action is removed from the device’s list of input actions. Moreover, the index of the newly loaded program is returned to the host program later on.

(load) ⟨(load prog cid) ∘ −→din, −→dout⟩ −loaded progid cid, load prog cid→_I/OD ⟨−→din, (loaded progid cid) ∘ −→dout⟩

In a similar fashion, the index of a newly created context is returned to the host program while the corresponding input action is removed.

(create) ⟨(create) ∘ −→din, −→dout⟩ −created cid, create→_I/OD ⟨−→din, (created cid) ∘ −→dout⟩

Destroying a context does not generate any output and removes the corresponding input action from the list.

(destroy) ⟨(destroy cid) ∘ −→din, −→dout⟩ −destroy cid→_I/OD ⟨−→din, −→dout⟩
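On the host side, the device input actions roughly correspond to calls of the CUDA driver API; the sketch below is only meant as an orientation, omits all error handling, and assumes that cuInit has already been called and that the module file name and entry point name exist.

    #include <cuda.h>

    void runModule(CUdevice dev)
    {
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction entry;

        cuCtxCreate(&ctx, 0, dev);                   // device input: create
        cuModuleLoad(&mod, "kernel.ptx");            // device input: load Prog ContextIdx
        cuModuleGetFunction(&entry, mod, "kernel");  // select an entry point of the loaded program

        cuLaunchKernel(entry, 120, 1, 1,             // grid dimensions
                              256, 1, 1,             // thread block dimensions
                              0, 0, NULL, NULL);     // context input: start ProgIdx ProgConf

        cuCtxSynchronize();                          // wait until the grid is reported completed
        cuCtxDestroy(ctx);                           // device input: destroy ContextIdx
    }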

5.8.3 Device Rules

Transition system →_D defines the semantics of the device. It does not generate any device actions, because it represents the highest level of the semantics hierarchy. It can, however, receive input messages from the host program or send output messages to the CPU.

Configurations: ∆ ∈ Device

Transitions: ∆ →_D ∆′,   ∆ −dout→_D ∆′,   ∆ −din→_D ∆′

We also finally define how the register look up functions retrieve register values from the memory environment. Given a memory environment, the regLookup function returns a grid register look up function.

regLookup : MemEnv → GridRegLookup

We define the behavior of the register look up function as follows. The function parameters are specified at different levels of the hierarchy as explained in the preceding sections. Most importantly, the register look up function does not allow retrieving a value from a blocked register.

regLookup(η)(gridSize, blockSize)(bid, pid)(tid, ro)(regAddr) = d   if (d, ready) = regFile(η, pid)(ro + regAddr)

(%tid.dim) = tid_dim
(%ntid.dim) = blockSize_dim
(%ctaid.dim) = bid_dim
(%nctaid.dim) = gridSize_dim
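In CUDA-C, these four special registers are exposed as the built-in variables threadIdx, blockDim, blockIdx, and gridDim; the following made-up kernel shows the usual global index computation based on them.

    __global__ void globalIndex(int *out)
    {
        // threadIdx -> %tid, blockDim -> %ntid, blockIdx -> %ctaid, gridDim -> %nctaid
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        out[gid] = gid;
    }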

The (exec) rule selects a context to execute; contexts cannot be executed concurrently and context scheduling happens implicitly. The regLookup function is used to create a register look up function that is passed all the way down to the thread rules and expression evaluation functions to synchronously retrieve a value from an unblocked register. The updMemBar function checks whether any of the context's threads have completed a memory barrier. All memory requests issued by the context's threads are converted into memory operations using the memOp function. While the selected context executes one step, the memory environment executes a macro step, allowing several memory operations to advance and several silent rules to be invoked. Program execution and memory operation processing are performed in a truly parallel manner.

(exec)
   ⟨updMemBar(ζ_i, mm) | regLookup(η)⟩ −(−→cact)→_C ζ′_i    ⟨η, −→op⟩ →_M* ⟨η′, −→op′⟩
   ─────────────────────────────────────────────────────────────────────────────────
   ⟨∆ . (mm . η, −→op), ζ_1 ... ζ_i ... ζ_n⟩ →_D ⟨∆ / memOp((mm / η′, −→op′), −→cact), ζ_1 ... ζ′_i ... ζ_n⟩

Input messages are inserted into the device's input list and are serviced at a later execution step. There is no guaranteed ordering of input messages.

(in) ⟨∆ . (dio . −→din)⟩ −din→_D ⟨∆ / (dio / din ∘ −→din)⟩

Conversely, output messages created during an earlier execution step are sent to the host program in an unpredictable order.

(out) ⟨∆ . (dio . dout ∘ −→dout)⟩ −dout→_D ⟨∆ / (dio / −→dout)⟩

Input messages addressed to a specific context are forwarded to that context using the (cin) rule.

(cin)
   ⟨ζ_i | regLookup(η)⟩ −cin→_C ζ′_i
   ──────────────────────────────────
   ⟨∆ . (mm . η), (dio . (cin cid) ∘ −→din), ζ_1 ... ζ_i ... ζ_n⟩ →_D ⟨∆ / (dio / −→din), ζ_1 ... ζ′_i ... ζ_n⟩

   if ζ_i,cid = cid

Output actions emitted by a context are added to the device's output list and will be sent to the host program at a later point in time.

(cout)
   ⟨ζ_i . cid | regLookup(η)⟩ −cout→_C ζ′_i
   ──────────────────────────────────────────
   ⟨∆ . (mm . η), (dio . −→dout), ζ_1 ... ζ_i ... ζ_n⟩ →_D ⟨∆ / (dio / (cout cid) ∘ −→dout), ζ_1 ... ζ′_i ... ζ_n⟩

The abstract function newVM : MemEnv × Context* → VirtMemSpace creates a new virtual memory space for a newly created context. Different contexts do not share physical memory addresses, hence newVM does not map any virtual addresses to a physical address that is already mapped by another context. Other than that, the newly created context returned by newCtx : VirtMemSpace × ContextIdx → Context is empty, i.e. it does not yet have any grids, programs, I/O messages, or a program map. The index assigned to the new context must be unique.

(create)
   dio −created cid, create→_I/OD dio′    ⟨∆ / dio′, ζ_1 ... ζ_n ∘ newCtx(vms, cid)⟩ →_D ∆′
   ──────────────────────────────────────────────────────────────────────────────────────
   ⟨∆ . (mm . η), dio, ζ_1 ... ζ_n⟩ →_D ∆′

   if vms = newVM(η, ζ_1 ... ζ_n) ∧ ∀i ∈ {1 ... n} . ζ_i,cid ≠ cid

A context can be destroyed at any time, implicitly aborting all of the context's grids. As already discussed in section 5.7.3 for individual grid abortion, pending memory operations issued by the aborted grids are not canceled.

(destroy)
   dio −destroy cid→_I/OD dio′    ⟨∆ / dio′, ζ_1 ... ζ_{i−1} ζ_{i+1} ... ζ_n⟩ →_D ∆′
   ─────────────────────────────────────────────────────────────────────────────
   ⟨∆ . dio, ζ_1 ... ζ_{i−1} ζ_i ζ_{i+1} ... ζ_n⟩ →_D ∆′

   if ζ_i,cid = cid

Loading a program into a context relies on the load function already defined in section 5.1.3. The newly loaded program is added to the context's program list and the program is assigned a context-wide unique program index. The load function returns a memory environment that guarantees that all threads read the correct values of all global and constant variables for which the PTX program specifies initial values.

(load)
   dio −loaded progid ζ_cid, load prog ζ_cid→_I/OD dio′    ⟨∆ / (mm / η′), dio′, (ζ / φ ∘ ζ_−→φ) ∘ −→ζ⟩ →_D ∆′
   ─────────────────────────────────────────────────────────────────────────────────────────────
   ⟨∆ . (mm . η), dio, ζ ∘ −→ζ⟩ →_D ∆′

   if (η′, φ) = load(η, prog, progid) ∧ ∀φ′ ∈ ζ_−→φ . φ′_progid ≠ progid

5.9 Summary

Our formalization of CUDA's model of computation closely reflects the thread hierarchy established by CUDA for a single GPU. We do not consider multi-GPU setups or the interactions of the GPU and the CPU. While it is possible to execute CUDA programs on a single GPU only, a CUDA program always consists of device code and host code. We completely omit any discussion of the host code semantics; in that sense the formalization of the CUDA program semantics presented in this report is incomplete. But as this report shows, proving a property of a CUDA program will not be an easy task considering the number and the complexity of the rules making up the semantics of CUDA programs and the memory model. Future work should definitely strive to employ tools to check the soundness of the formalization and to enable tool-supported proofs. Just like there are many open questions that remain unanswered by our formalization of the memory model, the formalized program semantics are also based on many assumptions whenever the CUDA specification falls short of details. We circumvent this issue for the warp level branching algorithm by basing our formalization on a patent filed by Nvidia, as the official documentation gives almost no hints as to the workings of the algorithm. In other cases, particularly thread block barrier synchronization, there are also many undefined situations where we either try to make reasonable assumptions or use ⊥ to indicate that the behavior of the program is unpredictable as far as we are aware. Many CUDA programs are written in CUDA-C instead of PTX. While there might be performance reasons to write a program in PTX, the CUDA-C compiler usually does a good job in optimizing the generated PTX code. Nvidia's compiler developers have a lot of experience in writing highly optimized compilers for C-like languages, as GPU drivers have optimized shader code written in HLSL or GLSL for almost a decade now. Nevertheless, defining the semantics for PTX instead of CUDA-C gives us more precise knowledge of the actual hardware operations used. During the formalization of the semantics it becomes clear that even more low-level details are required, so the semantics lies somewhere in between PTX and the actual instruction set of the underlying hardware.

6 Correctness of the Branching Algorithm

Only algorithms that are inherently data-parallel can fully utilize the GPU's functional units. The ratio of memory operations to floating point operations and whether memory accesses can be coalesced also greatly influence the execution efficiency of CUDA programs. Consequently, programmers devote most of their time to optimizing the aforementioned issues. By contrast, less thought is given to the behavior of basic instructions like add, setp, or return; their semantics are expected to be intuitively clear, i.e. they are expected to behave just like their CPU counterparts. The semantics of control flow instructions, however, are not based on individual thread branching that is commonplace on x86 CPUs. Instead, when a warp's threads execute control flow instructions, the branching happens at the warp level as already outlined in section 5.4.1. Nvidia suggests that developers can "essentially ignore" [3, 4.1] this difference when thinking about program correctness, so we should be able to prove the correctness of CUDA's branching algorithm for all valid programs. Informally, correctness means that whenever a thread encounters a control flow instruction, the path taken by the thread must be the same whether the decision is made by the thread individually or by the warp. Additionally, all instructions must be executed in the correct order, the thread must not skip an instruction that it should have executed, and the thread must not execute any further instructions after having executed an exit statement. As we show in the remainder of this section, CUDA's branching algorithm affects program correctness in some cases. This becomes clearer when we decompose the meaning of correctness into a safety and a liveness property:

• Safety: All threads only execute the instructions they are allowed to execute, in the correct order, and without skipping any instructions. The branching algorithm satisfies the safety property.

• Liveness: All threads eventually execute all instructions they must execute. In general, the branching algorithm does not satisfy the liveness property.

Even though the liveness property does not hold in the general case, as shown in the next section, it is still possible to prove the safety property for all programs which fulfill the conditions imposed on them in section 6.2. Furthermore, we narrow down the scope of the semantics to a single warp in section 6.3, as the semantics for thread blocks, grids, contexts, and so on are of no interest for the proof. We then formalize and prove the safety property and give a formal definition of the liveness property in sections 6.4 and 6.5, respectively.

6.1 Implications of Warp Level Branching

The most obvious effects of warp level branching concern program performance. When all threads execute different code paths, the warp has to serially execute all threads and cannot fully utilize all of the streaming processor’s parallel execution capabilities. In the worst case,

program performance decreases by a factor proportional to the GPU's warp size, and the branching algorithm itself is also likely to cause some overhead. Although these performance considerations are generally of high importance for general purpose GPU programming, we focus exclusively on the semantic implications of CUDA's branching algorithm. As already mentioned above, we can decompose the meaning of correctness into a safety and a liveness property. Whereas the safety property holds for all programs including the following one, the liveness property might be violated as subsequently illustrated for program 6.1. Thus, the following example illustrates how the behavior of a CUDA application can be affected by the warp level branching algorithm. Program 6.1 tries to manually establish a critical section that can only be entered by one single thread at any given point in time; critical sections are not directly supported by CUDA. In line 1, the variable lock is declared; it is used to indicate whether a thread is currently executing the critical section. The variable lives in global memory and is therefore accessible by all threads. Inside the kernel, an atomic compare-and-swap operation within a while-loop is used to check whether a thread is allowed to access the critical section. atomicCAS returns 1 for one single thread only; which one is undefined, as the memory model is free to process the atomic memory requests for the location of lock in any order. If atomicCAS does not return 1, meaning that there is already another thread executing the critical section, the thread continues to execute the while-loop until it is finally able to obtain the lock. A thread that is allowed to enter the critical section resets the value of lock to indicate that it has completed the execution of the critical section and that another thread may enter it. These are the semantics a developer might have in mind when writing program 6.1. And if the program were run on an x86 CPU where threads are allowed to branch individually, this is precisely what would happen.

1  __device__ int lock = 1;
2
3  __global__ void kernel()
4  {
5      while (true)
6      {
7          if (atomicCAS(&lock, 1, 0) == 1) // Acquire lock
8          {
9              break;
10         }
11     }
12
13     // Inside critical section
14     lock = 1; // Release lock
15 }

Listing 6.1: A program written in CUDA-C that deadlocks due to warp level branching

CUDA's warp level branching algorithm changes the behavior of the program: When executed on the GPU, the program deadlocks and never terminates if it is executed by more than one thread concurrently. To explain this, let us suppose two threads θ1 and θ2 run program 6.2, which is the PTX version of program 6.1. Kernel execution starts at line 8 where the threads encounter the preBrk instruction and the warp pushes a break token onto the stack.

1  .global .u32 lock = 1;
2
3  .entry kernel ()
4  {
5      .reg .u32 r;
6      .reg .pred p;
7
8      preBrk ENDLOOP;
9
10 BEGINLOOP:
11     atom .global .cas .u32 r, lock, 1, 0; // Acquire lock
12     setp .u32 .eq p, r, 1;
13 @p  break;
14     bra BEGINLOOP;
15
16 ENDLOOP:
17     // Inside critical section
18     st .global .u32 .cg lock, 1; // Release lock
19     exit;
20 }

Listing 6.2: Program 6.1 compiled into PTX; preBrk instruction included for clarity

As the condition of the while-loop in line 5 of program 6.1 is true, no condition needs to be checked and the loop's body is entered immediately at line 11. There, the threads load the address of the global variable lock and execute the atomic compare-and-swap operation. In line 12, the threads check whether the atomic memory operation returned 1 and store the result of the comparison in the predicate register p. Subsequently, the threads execute the conditional break statement in line 13.

Suppose the guard p is true for θ1; consequently p is false for θ2 because of the semantics of the atomicCAS instruction. Thus, θ1 executes the break instruction and rule (brkdiv) of transition system →CRS sets θ1's disable mask to break. θ2, on the other hand, is not disabled and continues to execute program 6.2. In this situation, the program is already deadlocked: θ1 is reactivated only once the break token is popped from the stack, but this cannot happen as long as θ2 is still active. θ2, however, cannot execute the break instruction and become disabled, because it has to wait for the atomic operation to return 1, which in turn can only happen if θ1 continues execution.

Obviously, all terminating programs fulfill the liveness property, as no deadlock can occur if all threads eventually execute an exit instruction. Also, a thread that reaches an exit statement has processed all instructions it must execute, assuming that the safety property holds.
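For illustration only — this fragment is not part of the report's formal development — the following CUDA-C sketch shows a restructuring of program 6.1 that is commonly suggested for intra-warp locks: the critical section and the release of the lock are moved inside the branch that acquires the lock, so the winning thread is never parked behind a break token while the losing threads still spin. The done flag is a name introduced here for illustration, and whether this variant actually avoids the deadlock on a given GPU still depends on which divergent path the warp executes first and on compiler optimizations.

__device__ int lock = 1;

__global__ void kernel()
{
    bool done = false;
    while (!done)
    {
        if (atomicCAS(&lock, 1, 0) == 1) // Acquire lock
        {
            // Inside critical section
            lock = 1;    // Release lock while still inside the taken branch
            done = true; // Leave the loop without executing a break instruction
        }
    }
}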

6.2 Control Flow Consistency

The branching algorithm depends on certain instructions that the compiler injects into PTX programs. If it does so incorrectly, the behavior of a program is unpredictable. Since we do not know the algorithm the compiler uses to inject control flow instructions into PTX programs, we make the injection explicit in the definition of the semantics in chapter 5 by having the instruction environment explicitly contain the required branching instructions. The conversion process that translates a PTX program into a program environment is represented by the abstract function load in section 5.1.3, which, among other things, correctly injects the necessary control flow operations into the program.

For the proof, though, we have to know more about the program's control flow, the location of control flow instructions within the program, and the layout of the warp's stack. As giving a concrete implementation of the abstract load function is outside the scope of this report, we axiomatically define the class of control flow consistent programs instead. Control flow consistent programs allow us to reason about the occurrence of control flow instructions and possible stack layouts, which in turn enables us to prove the safety property without relying on implementation details of the compiler. Future work might render the following axiomatic definition of control flow consistent programs unnecessary by giving a concrete implementation of the load function's branching instruction injection algorithm.

Control flow consistent programs must satisfy two properties which we subsequently define. The first one is stack consistency stack-cons : ProgEnv → B, whose definition depends on the following set operations on active masks: For two active masks α1, α2, we write α1 ⊆ α2 to mean ∀wl ∈ WarpLane. α1(wl) = active ⇒ α2(wl) = active. Furthermore, α = α1 ∪ α2 stands for ∀wl ∈ WarpLane. α(wl) = active ⇔ α1(wl) = active ∨ α2(wl) = active. Moreover, execStateφ denotes the set of all execution states that can occur when a warp executes program φ.

Definition (Stack Consistency): A program φ is stack consistent if and only if the following holds for all execution states (ς . pc, α, δ, ~τ) ∈ execStateφ:

(1) φinstrEnv(pc) = break ⇒ ∃τ ∈ ~τ. τtokenType = break
(2) ∧ φinstrEnv(pc) = return ⇒ ∃τ ∈ ~τ. τtokenType = call
(3) ∧ φinstrEnv(pc) = sync ⇒ ~τ0,tokenType = sync ∨ ~τ0,tokenType = diverge
(4) ∧ φinstrEnv(pc) = sync ∧ ~τ0,tokenType = diverge ⇒ ~τ1,tokenType = sync ∧ ~τ1,pc = pc + 1 ∧ ~τ0,α ∪ α ⊆ ~τ1,α
(5) ∧ φinstrEnv(pc) = sync ∧ ~τ0,tokenType = sync ⇒ ~τ0,pc = pc + 1
(6) ∧ φinstrEnv(pc) = call ⇒ ∃τ ∈ ~τ. τtokenType = call ∧ τpc = pc + 1
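To make these conditions more concrete, the following C++ sketch checks them for a single execution state. It is only an illustration of our reading of the definition above and not part of the formalization; all types (Instr, ExecToken, and so on) are simplified, hypothetical stand-ins for the formal domains, and the stack is represented with its topmost token first.

#include <cstddef>
#include <vector>

// Simplified stand-ins for the formal domains (hypothetical, for illustration).
enum class Instr { Break, Return, Sync, Call, Other };
enum class TokenType { Sync, Diverge, Break, Call };
enum class Lane { Active, Inactive };

struct ExecToken {
    TokenType type;
    std::vector<Lane> activeMask; // indexed by warp lane
    int pc;
};

// alpha1 ⊆ alpha2: every lane active in alpha1 is also active in alpha2.
static bool subsetOf(const std::vector<Lane>& a1, const std::vector<Lane>& a2) {
    for (std::size_t wl = 0; wl < a1.size(); ++wl)
        if (a1[wl] == Lane::Active && a2[wl] != Lane::Active) return false;
    return true;
}

static bool hasToken(const std::vector<ExecToken>& stack, TokenType t) {
    for (const ExecToken& tok : stack)
        if (tok.type == t) return true;
    return false;
}

// Checks conditions (1)-(6) for one execution state; stack[0] is the topmost token.
bool stackConsistent(Instr instr, int pc, const std::vector<Lane>& alpha,
                     const std::vector<ExecToken>& stack) {
    switch (instr) {
        case Instr::Break:                                          // condition (1)
            return hasToken(stack, TokenType::Break);
        case Instr::Return:                                         // condition (2)
            return hasToken(stack, TokenType::Call);
        case Instr::Call:                                           // condition (6)
            for (const ExecToken& tok : stack)
                if (tok.type == TokenType::Call && tok.pc == pc + 1) return true;
            return false;
        case Instr::Sync: {                                         // conditions (3)-(5)
            if (stack.empty()) return false;
            const ExecToken& t0 = stack[0];
            if (t0.type == TokenType::Sync)                         // (3) and (5)
                return t0.pc == pc + 1;
            if (t0.type == TokenType::Diverge) {                    // (3) and (4)
                if (stack.size() < 2 || stack[1].type != TokenType::Sync) return false;
                std::vector<Lane> unionMask = t0.activeMask;        // τ0,α ∪ α
                for (std::size_t wl = 0; wl < alpha.size(); ++wl)
                    if (alpha[wl] == Lane::Active) unionMask[wl] = Lane::Active;
                return stack[1].pc == pc + 1 && subsetOf(unionMask, stack[1].activeMask);
            }
            return false;
        }
        default:
            return true;                                            // no constraint imposed
    }
}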

Definition (Program Address Consistency): A program φ is program address consistent if and only if all control flow instructions reference valid program addresses.

We merely give an informal definition of the meaning of "valid program addresses", because even thread level branching would become totally unpredictable if a control flow instruction ever referenced an invalid program address. For the proof, we never explicitly depend on program address consistency, so the following informal definition is fully adequate: Whenever a thread executes a bra instruction, the address denoted by the operation is a valid program address within the same function. This applies to direct branches, in which case this property can be and is enforced by the compiler, as well as indirect branches, in which case it is not enforceable by the compiler. When a call instruction is encountered, the address specified by the operation is the address of a function's first instruction. Again, for direct function calls this is enforced by the compiler; for indirect function calls the compiler is generally unable to prove it. Additionally, the compiler makes sure that the labels of sync, preBrk, and preRet instructions are valid, i.e. are defined in the same function.

Definition (Control Flow Consistency): A program φ is control flow consistent if and only if it is both stack and program address consistent.

ProgEnvcfc ⊆ ProgEnv denotes the class of control flow consistent programs. We write φ̂ to distinguish control flow consistent programs φ̂ ∈ ProgEnvcfc from non control flow consistent programs φ ∈ ProgEnv \ ProgEnvcfc.

6.3 Single Warp Semantics

Since two distinct warps execute independently and are free to branch individually without affecting one another, we narrow down the scope of the semantics to a single warp. This allows us to focus on what the warp and its threads do when a control flow instruction is executed without having to consider all the complexities of thread blocks, thread synchronization, grids, contexts, and so on. Doing so does not limit the proof's generality, however. We reuse most of the rules and domains of the thread and warp semantics defined in sections 5.3 and 5.4 for reasons of brevity; only some minor changes are necessary. Furthermore, we reuse the memory model completely unchanged from its definition in chapter 4.

6.3.1 Modified Thread Rules

For the proof, a thread has to know the program counter of the next instruction it should execute. Since the program counter of the next instruction is normally only known to the warp, we define a new domain Threadpc for threads ϑ which keeps track of the next program address:

ϑ ∈ Threadpc = ProgAddrε × Thread

Reusing the original domain Thread allows us to also reuse transition system →Θ. The rules of →Θ defined in section 5.3.2 specify how a thread executes PTX instructions and which thread action each instruction generates. We do not have to make any changes to the rules of transition system →Θ, even though we are actually only interested in control flow instructions for the purpose of the proof, in which case →Θ outputs a FlowActε. Reusing all of →Θ's rules and only looking at thread flow actions enables us to abstract from all non control flow instructions while at the same time supporting all previously defined thread operations and interactions between threads with minimal effort.

On the other hand, the transition system →Θe cannot be reused. It is replaced by transition system →Θpc defined below that, in addition to checking the thread's active mask and guard, also updates the thread's program counter. After a thread ϑ has executed its current instruction, ϑ must be updated with the program address of the next instruction that should be executed according to the program's control flow. This is the responsibility of the function branchΘ.

branchΘ : ProgAddr × ExecToken* × ActΘ,ε → ProgAddrε,⊥

If the thread does not output a thread action, or if the thread action is a communication or memory action, the program counter is incremented by 1, as the next instruction that should be executed is the next instruction of the program. This is also true if the thread encounters a ssy, preRet, preBrk, or sync instruction, which have no meaning at the thread level. On the other hand, if the thread performed an exit, there is no next instruction, because the thread terminates. In this case, branchΘ returns an invalid program address, ε, to indicate that the thread must not be allowed to execute any further instructions. In the case of break, return, bra, and call instructions, the program counter is set to the address denoted by the control flow operation. For branches and function calls, this is the address specified by the operation. Breaks and returns, however, do not specify the target address directly. Instead, the target address is stored in the topmost break or call token on the warp's stack of execution tokens.

We pass the warp's current stack ~τ to branchΘ instead of managing a stack of break and call tokens per thread for reasons of brevity. Moreover, the stack of any thread could always be constructed from the warp's stack by removing all diverge and sync tokens as well as all break and call tokens in which the thread's active mask is set to inactive. Hence, per-thread stacks are redundant and we rather use the function topmostToken to get the topmost token of a given type from the warp's stack. When this function is used, it is assumed that such a token exists on the stack. Thus, not finding the specified token is an error and ⊥ is returned.

topmostToken : ExecToken* × TokenType → ExecToken⊥

topmostToken(ε, tokenType) = ⊥

topmostToken(τ :: ~τ, tokenType) = τ                                 if τtokenType = tokenType
topmostToken(τ :: ~τ, tokenType) = topmostToken(~τ, tokenType)       otherwise

We define the function branchΘ as follows, using _ to leave out irrelevant values of an action:

branchΘ(pc,~τ, ε) = pc + 1

branchΘ(pc,~τ, tactcomm) = pc + 1

branchΘ(pc,~τ, tactmem) = pc + 1

branchΘ(pc,~τ, ssy _) = pc + 1

branchΘ(pc,~τ, preRet _) = pc + 1

branchΘ(pc,~τ, preBrk _) = pc + 1

branchΘ(pc,~τ, sync) = pc + 1

branchΘ(pc,~τ, break _) = topmostToken(~τ, break)pc

branchΘ(pc,~τ, return _) = topmostToken(~τ, call)pc

branchΘ(pc,~τ, exit _) = ε

branchΘ(pc,~τ, bra wl pc′) = pc′

branchΘ(pc,~τ, call wl pc′) = pc′
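As a purely illustrative transcription — not part of the formalization and with hypothetical, simplified types — the following C++ sketch mirrors topmostToken and the case distinction of branchΘ; the constant INVALID_PC plays the role of ε, and, like the formal definitions, the break and return cases assume that the required token exists on the stack.

#include <optional>
#include <vector>

enum class TokenType { Sync, Diverge, Break, Call };

struct ExecToken {
    TokenType type;
    int pc;             // program address stored in the token
};

// The stack ~τ is modelled with its topmost token at the front.
using TokenStack = std::vector<ExecToken>;

// topmostToken: the topmost token of the given type, or nothing (⊥).
std::optional<ExecToken> topmostToken(const TokenStack& stack, TokenType type) {
    for (const ExecToken& tok : stack)
        if (tok.type == type) return tok;
    return std::nullopt;
}

// The thread actions branchΘ distinguishes, reduced to their kind and target.
enum class ActKind { None, Comm, Mem, Ssy, PreRet, PreBrk, Sync,
                     Break, Return, Exit, Bra, Call };

struct ThreadAct {
    ActKind kind = ActKind::None;
    int targetPc = -1;   // only meaningful for Bra and Call
};

constexpr int INVALID_PC = -1;   // plays the role of ε

// branchΘ: the program address of the next instruction the thread should execute.
int branchTheta(int pc, const TokenStack& stack, const ThreadAct& act) {
    switch (act.kind) {
        case ActKind::Break:                 // jump to the topmost break token
            return topmostToken(stack, TokenType::Break).value().pc;
        case ActKind::Return:                // jump to the topmost call token
            return topmostToken(stack, TokenType::Call).value().pc;
        case ActKind::Exit:
            return INVALID_PC;               // the thread terminates
        case ActKind::Bra:
        case ActKind::Call:
            return act.targetPc;             // address specified by the operation
        default:
            return pc + 1;                   // all remaining cases continue with pc + 1
    }
}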

In section 5.3.3, the rules of the transition system →Θe only need to know the warp's current program counter and active mask. By contrast, →Θpc depends on the warp's entire execution state ς:

Configurations: ⟨ϑ | φ, ς, %⟩ ∈ Threadpc × ProgEnv × ExecState × ThreadRegLookup, ϑ ∈ Threadpc

Transitions: ⟨ϑ | φ, ς, %⟩ —tactε→Θpc ϑ′

If the thread's active mask is set to active and the guard is true, →Θpc uses →Θ to execute the current instruction, just as →Θe does. Additionally, the generated thread action and the branchΘ function determine the next program address of the thread:

            ⟨θ | φinstrEnv(pc), φ, %(tid, ro)⟩ —tactε→Θ θ′
(Exectt)    ─────────────────────────────────────────────────────────────────────────────
            ⟨ϑ . pc, θ . κ, ro, tid, wl | φ, (ς . α, ~τ), %⟩ —tactε→Θpc ϑ / ⟨branchΘ(pc, ~τ, tactε), θ′⟩

            if α(wl) = active ∧ B⟦φguardEnv(pc)⟧φ%(tid, ro)κ

We have to define two distinct rules for the cases when either the guard is false or the thread is disabled according to the active mask. These two cases cannot be handled in a single rule as done for →Θe because of the program counter updates: When the thread is active but the guard is false, we have to increment the thread's program counter, since the thread has to skip the current instruction and execute the next one in the next step. On the other hand, if the thread's active mask is set to inactive, the thread really does not do anything and the program counter has to remain unchanged. Once the warp activates the thread again, this is the instruction the thread executes next.

(Execff,1)  ⟨ϑ . pc, θ . κ, ro, tid, wl | φ, (ς . α, ~τ), %⟩ →Θpc ϑ / ⟨pc + 1⟩
            if α(wl) = active ∧ ¬B⟦φguardEnv(pc)⟧φ%(tid, ro)κ

(Execff,2)  ⟨ϑ . θ . wl | φ, ς . α, %⟩ →Θpc ϑ      if α(wl) ≠ active

6.3.2 Modified Warp Rules

Warps now manage threads of type Threadpc instead of Thread. Like for threads, we introduce a new domain Warpϑ for warps w in conjunction with the new transition system →Ωϑ:

w ∈ Warpϑ = WarpState × ExecState × BarrierIdx* × Threadpc^WarpSize

Configurations: ⟨w | φ, %⟩ ∈ Warpϑ × ProgEnv × ThreadRegLookup, w ∈ Warpϑ

Transitions: ⟨w | φ, %⟩ —wactε→Ωϑ w′

If the threads execute a control flow instruction, their program counters are updated according to the rules of the →Θpc transition system. The warp's execution state, i.e. program counter, active mask, disable mask, and stack, is updated by →CRS depending on the actions output by the threads.

There is one complication though: When we formalize the safety property, we basically say that the program counters of all active threads are equal to the warp's program counter. This ensures that the warp and the threads always agree on the instruction being executed and consequently guarantees that the threads are always asked to execute the correct instruction. Even though it is an intuitive and easy to understand formalization of the safety property, it does not hold in the case of indirect branching instructions and indirect calls with more than one target address.

Suppose a thread ϑ executes an indirect bra instruction and the rule (bratarget) is used at the warp level to handle the divergent control flow. This means that there are at least two different target addresses and a diverge token with the current program counter is pushed onto the stack. If ϑ's active mask is active in the next step, ϑ executes the instruction specified by the branch instruction, which is the program address stored by both the warp and the thread. The problem occurs if ϑ is not allowed to continue execution immediately. ϑ remains inactive until the diverge token is popped off the stack, but then ϑ's program counter is no longer equal to the warp's program counter: The warp wants all threads that were deactivated during the first encounter of the indirect branch to execute the bra instruction again, whereas ϑ wants to immediately jump to the target address specified by the branch operation.

There are three possible solutions to this problem, all of which have certain advantages and disadvantages:

• Use another formalization of the safety property which avoids this issue.

• Change the behavior of indirect branches and calls such that the instruction is not revisited and an appropriate number of diverge tokens is pushed onto the stack to ensure equivalent behavior.

• Reset the program counters of all deactivated threads that performed the indirect branch or call.

We take the last approach because this is what the hardware actually does and it allows us to use an intuitive formalization of the safety property. This decision comes at the cost of making →Ωϑ less intuitive, as we have to define and use the function reset which detects the described situation and resets the program counters of all affected threads. The reset function resets a thread's program counter to the warp's previous program counter if the thread is not allowed to immediately jump to the target address of an indirect branch or call. If all threads participating in an indirect branch or call agree on one common target address, the aforementioned issue does not occur and reset does not change the threads' program counters. In all other cases besides indirect branches or calls, the threads' program counters remain unchanged as well.

reset : Threadpc^WarpSize × FlowAct* × ProgAddr × ActiveMask × ActiveMask → Threadpc^WarpSize

reset(ϑ1 ... ϑn, bra wl1 pc1 ... bra wln pcn, pc, α, α′) = ϑ1′ ... ϑn′        if ∃i, j ∈ {1 ... n}. pci ≠ pcj
    where ϑi′ = ϑi / pc    if α(wli) = active ∧ α′(wli) = inactive
          ϑi′ = ϑi         otherwise

reset(ϑ1 ... ϑn, call wl1 pc1 ... call wln pcn, pc, α, α′) = ϑ1′ ... ϑn′      if ∃i, j ∈ {1 ... n}. pci ≠ pcj
    where ϑi′ = ϑi / pc    if α(wli) = active ∧ α′(wli) = inactive
          ϑi′ = ϑi         otherwise

reset(ϑ1 ... ϑn, −−−→tactflow, pc, α, α′) = ϑ1 ... ϑn                          otherwise
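The following C++ sketch, again with hypothetical, simplified types and not part of the formalization, illustrates the reset function: only when the bra or call actions of a divergent indirect branch disagree on the target address are the program counters of the threads deactivated in this step rolled back to the warp's previous program counter.

#include <vector>

enum class Lane { Active, Inactive };
using ActiveMask = std::vector<Lane>;   // indexed by warp lane

struct ThreadPc {
    int pc;        // the thread's next program address
    int warpLane;  // the thread's warp lane
};

struct FlowAct {   // a bra or call thread action
    int warpLane;
    int targetPc;
};

// reset: rolls back the program counters of threads that executed an indirect
// bra/call with diverging targets but were deactivated by the warp, so that
// they re-execute the bra/call instruction once they are reactivated.
void reset(std::vector<ThreadPc>& threads, const std::vector<FlowAct>& acts,
           int warpPc, const ActiveMask& before, const ActiveMask& after) {
    bool divergingTargets = false;
    for (const FlowAct& a : acts)
        if (a.targetPc != acts.front().targetPc) { divergingTargets = true; break; }
    if (!divergingTargets)
        return;                          // all other cases: the threads remain unchanged

    for (ThreadPc& t : threads) {
        const int wl = t.warpLane;
        if (before[wl] == Lane::Active && after[wl] == Lane::Inactive)
            t.pc = warpPc;               // re-execute the indirect bra/call later
    }
}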

We can define →Ωϑ's rules based on →Ω's. The only differences are the call to the reset function in (bra) and the replacement of →Θe by →Θpc. For the proof, the (bra) rule is the most interesting one.

            ⟨ϑ1 | φ, ς, %⟩ —tactflow,1ε→Θpc ϑ1′    ...    ⟨ϑn | φ, ς, %⟩ —tactflow,nε→Θpc ϑn′
(bra)       ─────────────────────────────────────────────────────────────────────────────
            ⟨w . ς, ϑ1 ... ϑn | φ, %⟩ →Ωϑ w / ⟨state′, ς′, ϑ1″ ... ϑn″⟩

            if ς —⋃i∈{1...n} tactflow,iε→CRS ⟨state′, ς′⟩
            ∧ ϑ1″ ... ϑn″ = reset(ϑ1′ ... ϑn′, ⋃i∈{1...n} tactflow,iε, ςpc, ςα, ς′α)

            ⟨ϑ1 | φ, ς, %⟩ —tactcomm,1ε→Θpc ϑ1′    ...    ⟨ϑn | φ, ς, %⟩ —tactcomm,nε→Θpc ϑn′
(comm)      ─────────────────────────────────────────────────────────────────────────────
            ⟨w . (ς . pc), ϑ1 ... ϑn | φ, %⟩ —wactε→Ωϑ w / ⟨(ς / pc + 1), −−−→barid, ϑ1′ ... ϑn′⟩

            if (−−−→barid, wactε) = comm(φinstrEnv(pc), ⋃i∈{1...n} tactcomm,iε)
            ∧ ∃i ∈ {1 ... n}. tactcomm,i ≠ ε

            ⟨ϑ1 | φ, ς, %⟩ —tactmem,1ε→Θpc ϑ1′    ...    ⟨ϑn | φ, ς, %⟩ —tactmem,nε→Θpc ϑn′
(mem)       ─────────────────────────────────────────────────────────────────────────────
            ⟨w . (ς . pc), ϑ1 ... ϑn | φ, %⟩ —⋃i∈{1...n} tactmem,iε→Ωϑ w / ⟨(ς / pc + 1), ϑ1′ ... ϑn′⟩

            if ∃i ∈ {1 ... n}. tactmem,i ≠ ε

We need one more transition system, →Ωs, to complete the definition of the single warp semantics. The purpose of →Ωs is to combine the →Ωϑ transition system with the memory environment. The warp actions output by the rules above contain communication and memory actions generated by the warp's threads. Communication actions occur when the threads execute a vote or bar instruction; the former is handled at the warp level by the comm function, the latter has no meaning in the context of a single warp: All threads of the same warp always execute a bar instruction at the same time as explained in section 5.5.1, regardless of how many threads are actually active. Subsequently, the warp is disabled until the required number of other threads in the thread block have reached the barrier as well. Thus, the only effect of a barrier in the context of a single warp is its associated memory barrier.

Memory actions have to be passed to the memory environment. For the purpose of the proof we do not care about the specifics of the memory model and its supported operations, so we reuse the memory environment formalized in chapter 4 for reasons of brevity. We also reuse the MemMgr domain and the memOp function defined in section 5.8.1 to hook up the memory environment to the warp's rule system. To be able to do this, we need a function that packages up a warp action as a context action.

at ∈ ActTranslator = ActΩ,ε → ActC,ε

We do not explicitly state how the process of translating a warp action into a context action works, i.e. what parameters are used for the processor index, grid index, and so on. We just expect an ActTranslator function to be passed to →Ωs that uses parameters consistent with those used by the thread register lookup function % also passed to →Ωs. We do require, however, that the context action returned by the translator is ε if the warp action passed to the translator function is a barrier action.

We encapsulate the translator and the register lookup function as well as the memory manager in a new domain Mem.

m ∈ Mem = MemMgr × ThreadRegLookup × ActTranslator

updMemBar : Warp MemMgr Warp × → We can then specify the two rules of the final transition system Ωs . The first one executes → one step of the warp and allows the memory environment to advance an arbitrary amount of steps at the same time, similar to how the device’s (exec) rule works in section 5.8.3. The rule then uses the action translator function to send any generated memory actions to the memory environment. The second rule handles the case that the warp cannot perform a step because either all of its threads have executed an exit instruction, one of its active threads has to wait for registers to be released, or one of its threads has to wait for a memory barrier to be completed. The memory environment can execute an arbitrary amount of steps nevertheless. The function canBeScheduled introduced in section 5.7.1 is used to decide whether the warp is allowed to execute the next step.

Configurations : w φ, m Warp ProgEnv Mem h | i ∈ ϑ × × Ωs Transitions : w φ, m w0 φ, m0 h | i → h | i

Ω M, w φ, % ϑ w0 η, −→op ∗ η0, −→op0 h | i −−−−→wactε h i → h i (step) w φ, (m . (mm . η, op),%, at) Ωs w φ, (m / memOp((mm / η , −→op ), at(wact ))) h | −→ i → h 0 | 0 0 ε i if canBeScheduled(w)

M, η, −→op ∗ η0, −→op0 (idle) h i → h i w φ, m . (mm . η, op) Ωs updMemBar(w, mm) φ, m / (mm / η , −→op ) h | −→ i → h | 0 0 i if canBeScheduled(w) ¬ 6.4 Proof of the Safety Property

The preceding sections have already referred to the safety property several times, albeit in an informal way; we now give a formal definition and a proof. To recapitulate, the safety property demands that warp level branching respects the control flow of each individual thread: A thread that has executed an exit statement is never again activated by the warp, no instructions that should be executed are skipped, the correct instruction order is maintained, and the thread does not execute any instructions it should not execute.

We now have to find a formalization for all of these points. To do this, we first define two properties and an invariant that we need for the proof of the safety property. Subsequently, we prove two lemmas, a theorem, and a corollary before we finally prove the safety property itself.

First of all, to avoid the execution of instructions that a thread should not execute, we require that the thread's and the warp's program counters are equal whenever the thread is active. We call this the active consistency property act-cons : ExecState × Threadpc → B.

Definition (Active Consistency): A warp w . (ς . pcw, α) is active consistent if and only if the following holds for all of its threads (ϑ . pcϑ, θ . wl) ∈ w~ϑ:

act-cons(ς, ϑ) ⇔ (α(wl) = active ⇒ pcw = pcϑ)

Active consistency also ensures that a thread remains disabled for the remainder of a program after it has executed an exit instruction: The branchΘ function sets the thread's program counter to ε if the thread executes an exit, which is not a program address that can ever be stored in the warp's execution state. Therefore, if we prove active consistency, we solve the exit issue for free. Additionally, active consistency also enforces the correct instruction order and makes skipping of instructions that should be executed impossible as long as the thread remains active.

However, once a thread is deactivated by the warp and resumes execution at a later point in time, active consistency is not enough. When the warp deactivates a thread, and the thread's last instruction was not an exit, there must be a token on the stack that activates the thread once the token is popped off the stack. We call this token the thread's activation token and define the function actToken that finds a thread's activation token on the stack:

actToken : ActiveMask × DisableMask × ExecToken* × WarpLane → ExecToken⊥

A token is a thread's activation token if the thread is currently inactive but has not yet completed execution, if the thread is active according to the token's active mask, and, if the thread is waiting for a call or break token, the token has the matching type. Only the topmost token on the stack fulfilling these requirements is the activation token. Therefore, the activation token is well-defined for all inactive but not yet completed threads. There is no activation token for active or completed threads.

actToken(α, δ, ε, wl) = ⊥

actToken(α, δ, τ :: ~τ, wl) = τ    if α(wl) = inactive ∧ δ(wl) ≠ exit
                                    ∧ τα(wl) = active
                                    ∧ (δ(wl) = break ⇒ τtokenType = break)
                                    ∧ (δ(wl) = return ⇒ τtokenType = call)
actToken(α, δ, τ :: ~τ, wl) = actToken(α, δ, ~τ, wl)    otherwise
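As a last illustrative sketch — hypothetical, simplified types again, not part of the formalization — the following C++ function mirrors the case distinction of actToken and returns nothing (⊥) for active or completed threads.

#include <optional>
#include <vector>

enum class Lane { Active, Inactive };
enum class Disable { Enabled, Break, Return, Exit };
enum class TokenType { Sync, Diverge, Break, Call };

struct ExecToken {
    TokenType type;
    std::vector<Lane> activeMask;  // the token's active mask, indexed by warp lane
    int pc;
};

// actToken: the topmost token that will reactivate warp lane wl, or nothing (⊥).
std::optional<ExecToken> actToken(const std::vector<Lane>& alpha,
                                  const std::vector<Disable>& delta,
                                  const std::vector<ExecToken>& stack,  // topmost first
                                  int wl) {
    for (const ExecToken& tok : stack) {
        const bool isActivationToken =
            alpha[wl] == Lane::Inactive && delta[wl] != Disable::Exit &&
            tok.activeMask[wl] == Lane::Active &&
            (delta[wl] != Disable::Break  || tok.type == TokenType::Break) &&  // break  ⇒ break token
            (delta[wl] != Disable::Return || tok.type == TokenType::Call);     // return ⇒ call token
        if (isActivationToken)
            return tok;
    }
    return std::nullopt;
}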

When a warp deactivates a thread due to a branch, function call, return, or break, we know there must be an activation token on the stack that reactivates the thread. When a thread is reactivated, we have to enforce that the warp asks the thread to execute the correct instruction, i.e. the next instruction the thread should execute, ensuring that no instructions are skipped and that the thread is allowed to execute the instruction. This means that the program address stored in the activation token must be the address of the instruction the thread wants to execute next. We call this the inactive consistency property inact-cons : ExecState × Threadpc → B.

Definition (Inactive Consistency): A warp w . (ς . α, δ, ~τ) is inactive consistent if and only if the following holds for all of its threads (ϑ . pcϑ, θ . wl) ∈ w~ϑ:

inact-cons(ς, ϑ) ⇔ (α(wl) = inactive ∧ δ(wl) ≠ exit ⇒ pcϑ = actToken(α, δ, ~τ, wl)pc)

Together, active and inactive consistency form the safety property. Additionally, the proof of the safety property also relies on the following invariant inv : ExecState → B for some execution state ς . α, δ, ~τ:

(1) ∀wl ∈ WarpLane. α(wl) = inactive ∧ δ(wl) ≠ exit ⇒ actToken(α, δ, ~τ, wl) ≠ ⊥
(2) ∧ (∃wl ∈ WarpLane. α(wl) = active ∨ ∀wl ∈ WarpLane. δ(wl) = exit)
(3) ∧ ∀wl ∈ WarpLane. α(wl) = active ⇒ ∀τ ∈ ~τ. τtokenType ≠ diverge ⇒ τα(wl) = active
(4) ∧ ∀wl ∈ WarpLane. α(wl) = inactive ∧ δ(wl) ≠ exit ∧ ~τ = ~τ1 :: actToken(α, δ, ~τ, wl) :: ~τ2
        ⇒ ∀τ ∈ ~τ2. τtokenType ≠ diverge ⇒ τα(wl) = active
(5) ∧ ∀wl ∈ WarpLane. (α(wl) = active ⇒ δ(wl) = enabled) ∧ (actToken(α, δ, ~τ, wl) ≠ ⊥ ⇒ α(wl) = inactive)
(6) ∧ ∀τ ∈ ~τ. τα ≠ αinact

Furthermore, the following observation is helpful for the proof of the safety property: When the warp uses the unwind function to pop tokens from the stack, it is unclear what happens exactly in the general case. Still, if unwind is called and all threads are inactive, we can prove the following lemma and the resulting corollary:

Lemma 1: Let ς . α, δ, ~τ and ς′ . α′, δ′, ~τ′ be two execution states where ς′ = unwind(ς), and ~ϑ be a set of threads. Then:

|~τ| = |~τ′| + 1 ∧ inv1,4,6(ς) ∧ ∀ϑ ∈ ~ϑ. inact-cons(ς, ϑ)
∧ ∀wl ∈ WarpLane. α(wl) = inactive ∧ ∃wl ∈ WarpLane. δ(wl) ≠ exit
⇒ inv(ς′) ∧ ∀ϑ ∈ ~ϑ. act-cons(ς′, ϑ) ∧ inact-cons(ς′, ϑ)

Proof. For ~τ = τ :: ~τ″, we know that ~τ′ = ~τ″, because unwind removes tokens from the stack in the opposite order they were pushed onto the stack. Thus, if the call to unwind shrinks the stack by one token, the topmost token τ is popped off the stack. The active and disable masks change depending on the token type of τ, as specified by the updExecState function. We know that τ is an activation token for at least one thread — otherwise unwind would remove more than one token from the stack — and thus at least one thread is active according to the active mask α′ and enabled according to the disable mask δ′.

We prove active and inactive consistency for some thread (ϑ . θ . wl) ∈ ~ϑ. The function updExecState sets the warp's program counter to the program counter of the token. Additionally, it enables ϑ's warp lane in the active mask depending on whether τ is ϑ's activation token. If α′(wl) = active, then τ = actToken(α, δ, ~τ, wl) and δ′(wl) ≠ exit. So ϑpc = τpc = ς′pc (by inact-cons(ς, ϑ)) and thus act-cons(ς′, ϑ) holds. If α′(wl) = inactive, inact-cons(ς′, ϑ) holds because τ ≠ actToken(α, δ, ~τ, wl) and inact-cons(ς, ϑ). It remains to be shown that the invariant holds for ς′.

• inv1(ς′): Trivial for all activated threads; for all threads that are still inactive, the activation token has not changed, thus inv1(ς′) follows from inv1(ς).

• inv2(ς′): At least one thread is activated as mentioned above.

• inv3(ς′): Follows from inv4(ς) for all newly activated threads.

• inv4(ς′): Follows from inv4(ς) for all threads that remained inactive.

• inv5(ς′): If the function updExecState sets a thread's active mask to active, it also sets the thread's disable mask to enabled. Also, if τ is not the activation token of a thread, the thread remains inactive.

• inv6(ς′): No new tokens are pushed onto the stack and inv6(ς′) holds because of inv6(ς). □

If there are still threads that have not yet finished execution of the program, the function unwind either removes only one token from the stack or several, depending on where the topmost activation token is located on the stack. Lemma 1 takes care of the former case, lemma 2 of the latter:

Lemma 2: Let ς . α, δ, ~τ and ς′ . α′, δ′, ~τ′ be two execution states where ς′ = unwind(ς), and ~ϑ be a set of threads. Then:

|~τ| > |~τ′| + 1 ∧ inv1,4,6(ς) ∧ ∀ϑ ∈ ~ϑ. inact-cons(ς, ϑ)
∧ ∀wl ∈ WarpLane. α(wl) = inactive ∧ ∃wl ∈ WarpLane. δ(wl) ≠ exit
⇒ inv(ς′) ∧ ∀ϑ ∈ ~ϑ. act-cons(ς′, ϑ) ∧ inact-cons(ς′, ϑ)

Proof. More than one token is popped off the stack, and we know that the last token popped off the stack must be an activation token, because there are still threads that have not yet finished execution and inv1(ς) holds. Thus, ~τ = τ1 :: ... :: τn :: ~τ″ and τn is an activation token for some thread ϑ, whereas all the other τi are not an activation token for any of the threads. So popping τ1 ... τn−1 from the stack does not change the active mask. Moreover, if the active mask remains inactive for all threads, the disable mask cannot change either: Looking at the definition of updExecState makes this obvious for sync and diverge tokens. For call and break tokens, on the other hand, we know that the resulting active mask is inactive for all threads. So the active mask stored in the token for some thread is either inactive, and inactive threads remain inactive according to the definition of the ·\· function, or active, in which case the final value of the active mask for the thread depends on the value of the disable mask for the thread. If the value of the disable mask is different from the token type, the disable mask remains unchanged and the thread remains inactive. If it is not different, the disable mask is set to enabled and the thread is activated, a contradiction to the assumption that τ is not an activation token for any thread. The same contradiction occurs when a thread's active mask is active according to the token and the disable mask is enabled.

We can now construct an intermediate execution state ς″ = ς / τn :: ~τ″; the actual value of the program counter of ς″ does not matter as it is overwritten anyway when the activation token τn is popped off the stack. For all threads, inact-cons(ς″, ϑ) holds because of inact-cons(ς, ϑ), ς″ still contains the same activation tokens as ς, and ς″α is inactive for all warp lanes. Also, inv1,4,6 still hold for ς″ because of inv1,4,6(ς) and the fact that no activation tokens are popped off the stack and all threads remain inactive. We can now apply lemma 1 to ς″ and ς′ = unwind(ς″). As popping tokens 1 to n − 1 has absolutely no effect on the execution state as shown above, it follows that unwind(ς) = unwind(ς″) = ς′ and thus the proof is completed. □

Using lemma 1 and 2 we can prove the general implications a call to unwind has on an execution state with all threads disabled.

Corollary 1: Let ς . α and ς′ be two execution states where ς′ = unwind(ς), and ~ϑ be a set of threads. Then:

inv1,4,6(ς) ∧ ∀ϑ ∈ ~ϑ. inact-cons(ς, ϑ) ∧ ∀wl ∈ WarpLane. α(wl) = inactive
⇒ inv(ς′) ∧ ∀ϑ ∈ ~ϑ. act-cons(ς′, ϑ) ∧ inact-cons(ς′, ϑ)

Proof. First of all, ς′ ≠ ⊥. If there is some wl with ςδ(wl) ≠ exit, then there is an activation token for wl on the stack according to inv1(ς). Hence ς~τ ≠ ε and unwind could not possibly return ⊥.

If all threads have already executed an exit instruction, we know that ς′~τ = ε, ς′α(wl) = inactive, and ς′δ(wl) = exit for all warp lanes wl. inv(ς′), act-cons(ς′, ϑ), and inact-cons(ς′, ϑ) follow trivially for all ϑ ∈ ~ϑ.

Otherwise, one or more tokens are popped off the stack, the last of which is an activation token. Lemma 1 and 2 cover these remaining cases. □

We now have everything we need to formalize and prove theorem 1 which we subsequently use to prove the safety property.

Theorem 1 (Consistency Preservation): Let φ̂ be a control flow consistent program, w be a warp, and m be a memory configuration. Then:

⟨w, φ̂, m⟩ →Ωs ⟨w′, φ̂, m′⟩ ∧ inv(wς) ∧ ∀ϑ ∈ w~ϑ. act-cons(wς, ϑ) ∧ inact-cons(wς, ϑ)
⇒ inv(w′ς) ∧ ∀ϑ ∈ w′~ϑ. act-cons(w′ς, ϑ) ∧ inact-cons(w′ς, ϑ)

Proof. In the following proof, ς . pcw, α, δ, ~τ denotes the execution state of the warp w before the execution of the step, whereas ς′ . pc′w, α′, δ′, ~τ′ denotes the execution state of the warp w′ afterwards. We use some thread (ϑ . pcϑ, θ . wl) ∈ w~ϑ to prove active and inactive consistency. (ϑ′ . pc′ϑ) ∈ w′~ϑ denotes ϑ's state after the execution of the step, so wl = ϑ′θ,wl. We assume that inv(ς), act-cons(ς, ϑ), and inact-cons(ς, ϑ) hold. Additionally, we can also assume stack and program address consistency, because φ̂ ∈ ProgEnvcfc is a control flow consistent program.

There are two rules for transition system →Ωs. If rule (idle) is used, we know that changes to w are restricted to the removal of some memory barriers of some threads. Other than that, the warp's execution state and the threads' program counters remain the same. As the memory environment is irrelevant for the proof, it follows that inv(ς′), act-cons(ς′, ϑ′), and inact-cons(ς′, ϑ′) hold.

The more interesting case is rule (step), where the warp and its threads change their state in accordance with the rules of transition system →Ωϑ. Another case distinction is necessary to handle the three rules of →Ωϑ. However, we can prove the cases (mem) and (comm) together, as these two rules do not deal with control flow instructions. In both rules, pcw is incremented by one, so pc′w = pcw + 1. Furthermore, the warp's execution state remains unchanged except for the program counter, i.e. α = α′, δ = δ′, and ~τ = ~τ′, and thus inv(ς′) holds. If ϑ is inactive, rule (Execff,2) does not change its program counter and therefore pcϑ = pc′ϑ. Then inact-cons(ς′, ϑ′) follows from inact-cons(ς, ϑ). If ϑ is active, either rule (Exectt) or rule (Execff,1) is used. In both cases, the thread's program counter is incremented by one according to the definition of rule (Execff,1) and the function branchΘ for memory and communication actions. Consequently, pc′ϑ = pcϑ + 1 = pcw + 1 = pc′w (using act-cons(ς, ϑ)) and therefore act-cons(ς′, ϑ′).

By far the largest case is rule (bra) of transition system →Ωϑ. The threads emit control flow actions tactflow,1ε ... tactflow,nε that we use to further subdivide this case. We ignore empty control flow actions by declaring −−−→tactflow = ⋃i∈{1...n} tactflow,iε.

• −−−→tactflow = ε. In this case, the threads either execute an instruction that does not generate any thread action, the guards of all active threads are false, or a combination thereof. Either way, rule (next) of transition system →CRS only increments the warp's program counter, leaving the rest of the execution state unchanged. Moreover, if ϑ is active, rules (Exectt) or (Execff,1) both increment ϑ's program counter by one. inv(ς′), act-cons(ς′, ϑ′), and inact-cons(ς′, ϑ′) follow from the exact same reasoning as for the (comm) and (mem) case above.

• −−−→tactflow = ssy pc1 ... ssy pcn. The (ssy) rule of transition system →CRS updates the execution state to α′ = α, δ′ = δ, ~τ′ = τ :: ~τ with τ = (sync, α, pc1), and pc′w = pcw + 1.

  – α(wl) = α′(wl) = active. As above, the thread executes either rule (Exectt) or (Execff,1); both rules set pc′ϑ to pcϑ + 1. Thus act-cons(ς′, ϑ) follows from act-cons(ς, ϑ).

  – α(wl) = α′(wl) = inactive. The thread is also inactive in τ's active mask, hence τ is not the activation token for ϑ. If δ(wl) ≠ exit, there must be another token on the stack that is ϑ's activation token because of inv1(ς) and inact-cons(ς, ϑ). Since (Execff,2) does not change ϑ's program counter, inact-cons(ς′, ϑ) holds.

  inv(ς′) follows from inv(ς) as no threads are deactivated or reactivated in this step, τ is not an activation token for any of the threads, and τα = α = α′ ≠ αinact.

• −−−→tactflow = preBrk pc1 ... preBrk pcn. In an analogous manner to the ssy case.

• −−−→tactflow = preRet pc1 ... preRet pcn. In an analogous manner to the ssy case.

• −−−→tactflow = sync ... sync. When the threads execute a sync instruction, transition system →CRS unwinds the stack. Since program φ̂ is control flow consistent, stack-cons3 ensures that the first token τ is removed from the stack ~τ = τ :: ~τ′ and that τ is either of type sync or diverge. The next instruction the warp executes is the one the token's program counter points to, thus pc′w = τpc.

  – τtokenType = diverge. At least one thread is active in τ's active mask because of inv6. This thread was active when the diverge token was pushed onto the stack but was subsequently deactivated as the taken path was executed. The disable mask for this thread is still enabled because no rule of transition system →CRS changes the disable mask of an inactive thread. Thus, τ is the activation token for this thread.
    ∗ α(wl) = active. By inspection of the thread rules, we know that pc′ϑ = pcϑ + 1.
      · α′(wl) = active. This cannot happen, because when τ was pushed onto the stack, ϑ could not have been active in both the taken and not taken mask at the same time by the definition of the taken and notTaken functions. If ϑ was active in the taken mask, it cannot be active in τ's active mask and is therefore deactivated by τ. On the other hand, if ϑ was active in the not taken mask, ϑ's activation token would be τ, therefore it cannot be active in α because of inv5(ς).
      · α′(wl) = inactive. As specified by stack-cons4, the next token τ′ on the stack is a sync token. τ′ is ϑ's activation token: δ(wl) = enabled because of inv5(ς) and τ′α(wl) = active because of stack-cons4. Moreover, pc′ϑ = pcϑ + 1 = pcw + 1 = τ′pc (using act-cons(ς, ϑ) and stack-cons4), thus inact-cons(ς′, ϑ) holds.
    ∗ α(wl) = inactive. pc′ϑ = pcϑ. If δ(wl) = exit, the thread is not reactivated by popping τ.
      · α′(wl) = active. τ is ϑ's activation token, so act-cons(ς′, ϑ) follows from pc′ϑ = pcϑ = τpc = pc′w (using inact-cons(ς, ϑ)).
      · α′(wl) = inactive. As τ is not ϑ's activation token, inv1(ς) guarantees that there is another token on the stack which is ϑ's activation token if δ(wl) ≠ exit. Because of inact-cons(ς, ϑ) and pc′ϑ = pcϑ, inact-cons(ς′, ϑ) holds.

  – τtokenType = sync. Popping τ has the effect that at least all threads that are active in α remain active in α′ as guaranteed by inv6(ς), inv3(ς), and inv5(ς). Further threads might be activated in this step.
    ∗ α(wl) = active. By inspection of the thread rules, we know that pc′ϑ = pcϑ + 1.
      · α′(wl) = active. pc′ϑ = pcϑ + 1 = pcw + 1 = τpc = pc′w (using act-cons(ς, ϑ) and stack-cons5), thus act-cons(ς′, ϑ) follows.
      · α′(wl) = inactive. As explained above, all active threads remain active, so this cannot happen.
    ∗ α(wl) = inactive. In an analogous manner to the diverge token case.

  It remains to be shown that the invariant still holds after the execution of the step:

  – inv1(ς′): All threads for which τ is not the activation token remain inactive, and because of inv1(ς) there still is an activation token on the stack for them. For all newly deactivated threads, we have shown that the activation token exists.

  – inv2(ς′): As τ is an activation token for at least one thread, there is at least one active thread after the execution of this step.

  – inv3(ς′): Follows from inv3(ς) and inv4(ς).

  – inv4(ς′): Follows from inv3(ς) and inv4(ς).

  – inv5(ς′): If the function updExecState sets a thread's active mask to active, it also sets the thread's disable mask to enabled. Also, if τ is not the activation token of a thread, the thread remains inactive.

  – inv6(ς′): Follows from inv6(ς), because no new tokens are pushed onto the stack during this step.

• −−−→tactflow = break wl1 ... break wln. Breaks can be uniform or divergent, handled by →CRS's rules (brkuni) and (brkdiv) respectively.

  – (brkuni). We construct an intermediate execution state ς″ . αinact, δ″, ~τ with δ″ = δ[wl1 ... wln ↦ break] and ς′ = unwind(ς″). To apply corollary 1, we first observe that all threads are inactive in the active mask of ς″ by the definition of rule (brkuni). We have to show that ς″ is inactive consistent for all threads and that parts one, four, and six of the invariant hold.
    ∗ α(wl) = active. The thread is disabled according to the active mask of ς″ and waits for a break token according to δ″. The branchΘ function sets pc′ϑ to topmostToken(~τ, break)pc. stack-cons1 guarantees that there is a break token on the stack and from inv3(ς) it follows that the topmost break token is ϑ's activation token. Therefore, pc′ϑ = topmostToken(~τ, break)pc = actToken(αinact, δ″, ~τ, wl)pc and inact-cons(ς″, ϑ) holds.
    ∗ α(wl) = inactive. As the stack remains unchanged, inact-cons(ς″, ϑ) follows directly from inact-cons(ς, ϑ).
  inv1(ς″) holds because of inv1(ς) and because there is an activation token on the stack for all newly deactivated threads. inv4(ς″) follows from inv4(ς) for all previously inactive threads and from inv3(ς) for all previously active threads. inv6 holds because there are no new tokens pushed onto the stack. We can now complete this case by applying corollary 1 to ς′ = unwind(ς″).

  – (brkdiv). pc′w = pcw + 1, α′ = α[wl1 ... wln ↦ inactive], δ′ = δ[wl1 ... wln ↦ break], and ~τ′ = ~τ.
    ∗ α(wl) = active. If ϑ remains active after the execution of this step, ϑ does not execute the break instruction, meaning that ϑ's guard evaluates to false. Consequently, act-cons(ς′, ϑ) holds since pc′ϑ = pcϑ + 1 = pcw + 1 = pc′w (using (Execff,1) and act-cons(ς, ϑ)). On the other hand, if ϑ is deactivated in this step, we can reason analogously to the corresponding case of uniform breaks. Therefore, there is an activation token on the stack that reactivates ϑ if it is popped and inact-cons(ς′, ϑ) holds.
    ∗ α(wl) = inactive. ϑ cannot become active in this step and pc′ϑ = pcϑ. As the stack remains unchanged, inact-cons(ς′, ϑ) holds because of inact-cons(ς, ϑ).
  It remains to be shown that the invariant still holds after the execution of the step:
    ∗ inv1(ς′): Follows from inv1(ς) and because there is an activation token on the stack for all newly deactivated threads.
    ∗ inv2(ς′): At least one thread must be active, otherwise ¬uni(α, wl1 ... wln) would not be true and rule (brkuni) should have been used instead.
    ∗ inv3(ς′): Follows from inv3(ς) and because α′ ⊆ α.
    ∗ inv4(ς′): Follows from inv4(ς) for all previously inactive threads and from inv3(ς) for all previously active threads.
    ∗ inv5(ς′): Follows from inv5(ς); for all deactivated threads, both the active mask is set to inactive and the disable mask is set to break.
    ∗ inv6(ς′): Follows from inv6(ς), because no new tokens are pushed onto the stack during this step.

• −−−→tactflow = return wl1 ... return wln. In an analogous manner to the break case.

• −−−→tactflow = exit wl1 ... exit wln. This case is similar to the break case, but there is a difference worth mentioning: Active threads that are deactivated in this step do not have an activation token on the stack. Furthermore, their program counter is set to ε, ensuring that the threads cannot be activated again; pcw = ε is impossible, so activating a completed thread would be a violation of active consistency. Since inactive consistency and most parts of the invariant do not talk about completed threads, these properties trivially hold.

• −−−→tactflow = bra wl1 pc1 ... bra wln pcn. The rules (brauni), (braguard), and (bratarget) of transition system →CRS handle branch thread actions depending on whether all active threads take the branch and how many different target addresses there are. The case that no thread takes the branch, resulting in empty thread actions, is already proven above.

  – (brauni). pc′w = pc1, α′ = α, δ′ = δ, ~τ′ = ~τ. inv(ς′) trivially holds as the relevant parts of the execution state remain unchanged and no threads are deactivated or reactivated in this step. Thus, inact-cons(ς′, ϑ) follows directly from inact-cons(ς, ϑ) if ϑ is inactive. If ϑ is active, function branchΘ sets pc′ϑ to pci if wli = wl. Since all program addresses pc1 ... pcn are the same, act-cons(ς′, ϑ) follows.

  – (braguard). pc′w = pc1, τ = (diverge, notTaken(α, wl1 pc1 ... wln pcn, pc1), pcw + 1), ~τ′ = τ :: ~τ, α′ = taken(wl1 pc1 ... wln pcn, pc1), δ′ = δ.
    ∗ α(wl) = active.
      · α′(wl) = active. Since all program addresses pc1 ... pcn are the same, pc′ϑ = pc1 = pc′w (by branchΘ) and act-cons(ς′, ϑ) holds.
      · α′(wl) = inactive. pc′ϑ = pcϑ + 1 according to the definition of rule (Execff,1). Furthermore, τ is ϑ's activation token because τα(wl) = active and δ(wl) = δ′(wl) = enabled. Consequently, pc′ϑ = pcϑ + 1 = pcw + 1 = τpc (using act-cons(ς, ϑ)) and for this reason inact-cons(ς′, ϑ) holds.
    ∗ α(wl) = inactive. τ is not the activation token of a previously inactive thread, because τα(wl) = inactive. inact-cons(ς′, ϑ) therefore follows from inact-cons(ς, ϑ).
  It remains to be shown that the invariant still holds after the execution of the step:
    ∗ inv1(ς′): Follows from inv1(ς) and because there is an activation token on the stack for all newly deactivated threads.
    ∗ inv2(ς′): At least one thread must be active, otherwise ¬uni(α, wl1 ... wln) would not be true and rule (brauni) should have been used instead.
    ∗ inv3(ς′): Follows from inv3(ς) and because τtokenType = diverge.
    ∗ inv4(ς′): Follows from inv4(ς) and because τtokenType = diverge.
    ∗ inv5(ς′): Follows from inv5(ς), δ′ = δ, and τ is the activation token for all newly deactivated threads.
    ∗ inv6(ς′): Follows from inv6(ς) and at least one thread must be active in the not-taken mask, otherwise ¬uni(α, wl1 ... wln) would not be true and rule (brauni) should have been used instead.

  – (bratarget). τ = (diverge, notTaken(α, wl1 pc1 ... wln pcn, pck), pcw), ~τ′ = τ :: ~τ, α′ = taken(wl1 pc1 ... wln pcn, pck), δ′ = δ, and pc′w = pck for some k ∈ {1 ... n}.
    ∗ α(wl) = active.
      · α′(wl) = active. pc′ϑ = pci if wli = wl (by branchΘ). Since ϑ remains active, k = i and thus pc′ϑ = pci = pck = pc′w and act-cons(ς′, ϑ) holds.
      · α′(wl) = inactive. pc′ϑ = pcϑ because of function reset used by rule (bra). Furthermore, τ is ϑ's activation token because τα(wl) = active and δ(wl) = δ′(wl) = enabled. Consequently, pc′ϑ = pcϑ = pcw = τpc (using act-cons(ς, ϑ)) and for this reason, inact-cons(ς′, ϑ) holds.
    ∗ α(wl) = inactive. See corresponding (braguard) case.
  The invariant holds for the same reasons as shown above for the (braguard) case.

• −−−→tactflow = call wl1 pc1 ... call wln pcn. The rules (calluni), (callguard), and (calltarget) of transition system →CRS handle call thread actions. The two cases (calluni) and (calltarget) can be proven analogously to the corresponding branch cases above; only the proof of rule (callguard) works differently.

  Rule (callguard) modifies the execution state such that α′ = taken(wl1 pc1 ... wln pcn, pc1), δ′ = δ, ~τ′ = ~τ, and pc′w = pc1.

  – α(wl) = active.
    ∗ α′(wl) = active. Since all program addresses pc1 ... pcn are the same, pc′ϑ = pc1 = pc′w (by branchΘ) and act-cons(ς′, ϑ) holds.
    ∗ α′(wl) = inactive. pc′ϑ = pcϑ + 1 according to the definition of rule (Execff,1). stack-cons6 guarantees that there is a token τ on the stack of type call with τpc = pcw + 1. τ is ϑ's activation token because of inv3(ς). Hence, pc′ϑ = pcϑ + 1 = pcw + 1 = τpc (using act-cons(ς, ϑ) and stack-cons6) and for this reason inact-cons(ς′, ϑ) holds.

  – α(wl) = inactive. See corresponding (braguard) case.

  It remains to be shown that the invariant still holds after the execution of the step:

  – inv1(ς′): Follows from inv1(ς) and because there is an activation token on the stack for all newly deactivated threads.

  – inv2(ς′): At least one thread must be active, otherwise ¬uni(α, wl1 ... wln) would not be true and rule (calluni) should have been used instead.

  – inv3(ς′): Follows from inv3(ς) and because the stack is not changed.

  – inv4(ς′): Follows from inv4(ς) and because the stack is not changed.

  – inv5(ς′): Follows from inv5(ς), δ′ = δ, and τ is the activation token for all newly deactivated threads.

  – inv6(ς′): Follows from inv6(ς) and at least one thread must remain active in the not-taken mask, otherwise ¬uni(α, wl1 ... wln) would not be true and rule (calluni) should have been used instead.

This concludes all the cases and hence the proof. □

We formulate the safety property over infinite sequences of execution steps. For this, let ΓΩs be the set of all runs of transition system →Ωs, i.e. ⟨w1, φ, m1⟩ →Ωs ⟨w2, φ, m2⟩ →Ωs ... ∈ ΓΩs. Although such a sequence is infinite, finite runs of terminating programs are also in ΓΩs; for terminating programs, there is a step i where the warp completes the execution of the program, therefore wi = wj for all j > i. The memory might still change after step i if there are any unfinished memory operations left.

Corollary 2 (Safety): Let ⟨w1, φ̂, m1⟩ →Ωs ⟨w2, φ̂, m2⟩ →Ωs ... ∈ ΓΩs be a run of transition system →Ωs for a control flow consistent program φ̂, an initial warp w1, and an initial memory configuration m1. Then:

inv(w1,ς) ∧ ∀ϑ ∈ w1,~ϑ. act-cons(w1,ς, ϑ) ∧ inact-cons(w1,ς, ϑ)
⇒ ∀wi, ϑ ∈ wi,~ϑ. act-cons(wi,ς, ϑ) ∧ inact-cons(wi,ς, ϑ)

Proof. Assume that w1 is active and inactive consistent and that the invariant holds for w1. Using induction, corollary 2 follows directly from theorem 1. □

6.5 Formalization of the Liveness Property

Program 6.1 clearly shows that the liveness property cannot be proven in the general case. CUDA’s branching algorithm has a problem that causes the violation of the liveness property, but investigating possible fixes is outside the scope of this report. In this section, we only give the formal definition of the liveness property and leave the fix to future work. The safety property already ensures that whenever a thread is active or becomes active, the instruction it executes is the correct one. On the other hand, the safety property does not guarantee that a thread executes all instructions; a thread might only execute all instructions up to some point and then never execute the remaining ones. This might happen when the warp deactivates the thread because of a bra, call, break, or return instruction. The safety property guarantees that there is an activation token that reactivates the thread once the token is popped off the stack. However, the safety property does not guarantee that this will ever happen. But if the liveness property were to hold, all tokens pushed onto the stack would eventually be popped from the stack. Referring back to the deadlock problem of program 6.1, the liveness property is violated because the break token pushed onto the stack by the preBrk instruction at the beginning of the program is never popped off the stack. Before the token can be popped, all threads must execute the break instruction which in this situation is impossible.

Liveness Property: Let ⟨w1, φ, m1⟩ →Ωs ⟨w2, φ, m2⟩ →Ωs ... ∈ ΓΩs be a run of transition system →Ωs for a program φ, an initial warp w1, and an initial memory configuration m1. Then:

∀i ∈ ℕ. |wi,ς,~τ| ≠ 0 ⇒ ∃j > i ∈ ℕ. |wj,ς,~τ| < |wi,ς,~τ|

The liveness property compares the stack sizes of two execution states to enforce that eventually all tokens are popped off the stack. This is possible because in a single step either one token is pushed onto the stack or an arbitrary number of tokens is popped off the stack. More precisely, it is impossible that tokens are pushed onto the stack and popped off the stack in the same step. Therefore, it is enough to compare the stack sizes; it is not necessary to keep track of the individual tokens on the stack.

In general, the liveness property does not hold; it is violated by CUDA's current branching algorithm, as program 6.1 demonstrates. Whether the branching algorithm can be fixed such that it no longer violates the liveness property remains to be seen; even if the branching algorithm can be fixed, however, liveness will still not be guaranteed in the general case: If a program contains a while (true) {} statement or infinite recursion, the liveness property is obviously also violated. We are therefore currently preparing a paper in which we classify the branching algorithm as an unfair scheduling strategy, as divergent threads are scheduled in an unfair way. Hence, warp execution might not terminate because of unfairness, whereas all fair schedules of the individual thread semantics would terminate.

The liveness property trivially holds for terminating programs. Suppose all threads finish the execution of the program at or before step i. Consequently, the stack size at step i is 0, because the last thread to execute the exit instruction causes an unwinding process that removes all tokens from the stack. But if the stack size is 0 at step i, the implication of the liveness property is trivially true.

6.6 Summary

CUDA employs a warp level branching algorithm to handle control flow instructions instead of allowing individual thread execution. While this allows for a more efficient hardware implementation, it can also cause significant performance problems if control flow instructions are used unwisely. More importantly, a program might even deadlock because of the branching algorithm.

It is possible to prove the correctness of the branching algorithm for terminating programs. Even for programs that do not terminate, it can be proven that up to the point where a thread becomes inactive and is never activated again, it executes correctly according to its individual control flow. These two different propositions translate into a safety and a liveness property over sequences of execution steps of a warp. If a program fulfills certain restrictions regarding the placement of control flow instructions within its functions, it is possible to prove the safety property. On the other hand, there are programs that violate the liveness property, such that liveness only holds for terminating programs.

Safety and liveness together result in warp level branching semantics equivalent to those of thread level branching. Warp level branching guarantees that "if a thread executes all instructions it must execute (liveness), then it does so correctly (safety)". Thread level branching says that "a thread executes all instructions it must execute and it does so correctly". At least for terminating programs, these two statements are provably equivalent.

7 Conclusion

In this report, we formalized the memory model and the model of computation of Nvidia's CUDA infrastructure for general purpose GPU programming. Our formalization has to be close to the hardware, as design decisions made at the hardware level influence the semantics of CUDA programs. Exploring ways to formalize the semantics in a more high-level fashion was outside the scope of this report but should be considered for the future. Also, the complexity of both CUDA and the formalization suggests that tool support is needed in order to be able to prove any non-trivial properties of a CUDA program. For that, future work might strive to formalize the semantics in a tool like KIV1 and at the same time verify the proof of the warp level branching algorithm presented in this report.

The memory model comprises several different types of memories and caches that have unique performance characteristics and sizes and consequently are used in vastly different scenarios. As the CUDA specification is not very clear about the precise semantics of the memory model, our formalization gives almost no guarantees about the result of a read operation after a write operation to the same location of memory. Additionally, we did not formalize volatile memory requests or texture memory because of a lack of proper documentation. We reduced the complexity of the memory model by defining the semantics of a declarative programming language that we in turn used to define the actual semantics of the memory operations.

CUDA's model of computation is considerably different from the traditional programming model of x86 CPUs. This is due to the different architectures of modern GPUs and CPUs, which affect the respective programming models. CUDA's thread hierarchy closely reflects the architecture of the hardware in the sense that a thread is executed by a CUDA core, a warp is executed by an execution block of a streaming multiprocessor, a thread block runs on one streaming multiprocessor, a grid runs on a device, and a context allows its grids to communicate using DRAM memory. This close mapping of software concepts to the hardware is one of the reasons that CUDA programs often outperform their CPU counterparts, provided the problem at hand is sufficiently parallelizable. Furthermore, by looking at this mapping it becomes apparent why only thread blocks support a synchronization primitive: Threads of the same thread block are executed by the same piece of hardware, hence a synchronization mechanism can be implemented locally, causing only little overhead; thus, the incurred performance penalty is kept within limits. Global thread synchronization, however, would require more logic and would most likely be considerably slower. Consequently, global thread synchronization is currently not supported by the hardware; Nvidia has not yet announced whether it will be supported by future GPU generations.

Even though it is not part of the official CUDA specification, we formalized the warp level branching algorithm, i.e. the hardware's SIMT architecture. Threads of the same warp are executed in lockstep for reasons of performance, but after a data-dependent branch instruction the control flow of the threads within a warp might diverge.

1 http://www.informatik.uni-augsburg.de/lehrstuehle/swt/se/kiv/, last access 2010-10-29

We presented a formal proof that the warp level branching algorithm used by the underlying hardware is correct for terminating programs, but we also showed that it is not equivalent to individual thread branching in the general case. Hence, CUDA developers have to be careful when writing code that contains control flow instructions whose outcome depends on values computed by other threads. If there are any circular dependencies, threads might deadlock because they are not allowed to branch individually. It remains to be seen whether there is a way to fix CUDA's warp level branching algorithm such that deadlocks are avoided and the algorithm is provably correct under all circumstances.
Future work might add the constant caches to the formalization of the memory model. It might also be worth taking a closer look at the special optimizations the hardware employs to handle textures. As general purpose GPU programming is often used in the context of graphics applications like raytracing or post-processing effects of games, it might be desirable to add support for texture operations to our formalization of the semantics of CUDA programs. In addition to texture fetching, the only other important feature missing from our formalization is generic addressing and all instructions associated with that feature. Other than that, we have included all important PTX features, so adding the remaining ones should not take too much effort.
However, Nvidia has already released the release candidate of the next version of CUDA, which adds dynamic memory allocation and deallocation from within kernels. Some changes might be required for the memory model to support the management of allocated and unused memory. Furthermore, Nvidia stated that the Fermi architecture supports C++ exceptions. As of now, it is unknown how exceptions are handled by the hardware and by PTX, but it will be interesting to see if and how exceptions fit into the formalization presented in this report.
The documentation of the next version of CUDA also outlines how warp scheduling of second generation Fermi GPUs differs from our formalization. Apparently, warps can dual-issue instructions, i.e. a warp can execute two successive instructions in parallel if there are no dependencies between them. Since we have abstracted from the scheduling algorithm anyway, it should be possible to adapt our semantics to incorporate the new scheduling behavior: the abstract scheduling function then not only returns the warps that should execute, but also the number of instructions each scheduled warp should execute. In order to make the most efficient use of the available hardware resources, the PTX compiler has to reorder instructions so that instructions can be dual-issued as often as possible. Those compiler optimizations are, however, completely transparent to our formalization.
There is a lot of interest in CUDA both in academia and in industry. However, with the recent introduction of cross-vendor frameworks like OpenCL and Direct Compute, it seems likely that CUDA's predominance will vanish or at least diminish in favor of the non-vendor-specific frameworks. For instance, Folding@Home's recently released GPU3 client already supports OpenCL in addition to CUDA2.
As Nvidia translates OpenCL programs into PTX before execution, it might be worth exploring whether the semantics of OpenCL can be defined on top of the formalization of the PTX semantics presented in this report. It will be interesting to see how OpenCL's ignorance of warp level branching affects the formalization of its semantics. Officially, CUDA does not acknowledge warp level branching either, but we chose to circumvent this issue by explicitly handling control flow instructions at the warp level. It might be necessary to do the same in order to formalize the semantics of OpenCL.

2 http://folding.stanford.edu/English/FAQ-NVIDIA-GPU3, last access 2010-10-29

Moreover, future work should investigate how a higher-level semantics for CUDA could be defined. It should be possible to define the semantics at the CUDA-C level instead of the PTX level if the memory model abstracts from caches. We only included the L1 caches in our formalization because of the inconsistencies that can arise from the fact that the L1 caches of different streaming multiprocessors are not kept coherent. Consequently, we needed to know whether a memory operation accessing global or local memory uses the L1 cache, which is only known at the PTX level. However, given the very few guarantees provided by the memory model, one approach might be to assume that all reads after writes to the same location return undefined data anyway. Then the L1 caches have no effect on the formalization and can be omitted, making it possible to define the semantics at the CUDA-C level.
There are some situations, however, where the result is in fact well-defined, i.e. no read-after-write hazard occurs: After a synchronization barrier, threads of the same thread block are guaranteed to read the latest data from shared memory. Additionally, after a global memory barrier followed by a synchronization instruction, threads of the same thread block are guaranteed to read the latest data from global memory if the corresponding writes were performed by threads of the same thread block. Writes by threads of other streaming multiprocessors might still be missed because of incoherent L1 cache states. These guarantees, which the sketch at the end of this chapter illustrates, might be sufficient to develop a more high-level version of the semantics of CUDA programs.
At that point, it might also become viable to define a semantics that takes the host program and SLI configurations into account. As the CPU and the GPUs are independent processors that operate in parallel, CPU/GPU interactions might affect the correctness of CUDA programs; for instance, the CPU could write to global memory addresses that are currently being used by a kernel executed by one of the GPUs.
Additionally, compute capabilities could be formalized. Since devices of higher compute capability support a strict superset of the features and instructions of devices of lower compute capability, it should be possible to define the semantics of older CUDA versions by removing the unsupported features from the formalization presented in this report. That way, it might be possible to define the formal semantics of previous CUDA versions with little effort.
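To make these guarantees concrete, the following CUDA-C sketch shows the two patterns under which a read after a write is well-defined for threads of the same thread block. The kernel is our own illustrative example; the names are not taken from this report, and __threadfence() is used as the global memory barrier.

    __global__ void visibility(int *gdata, int *out)
    {
        __shared__ int sdata;

        if (threadIdx.x == 0)
            sdata = 1;                    // write to shared memory
        __syncthreads();                  // synchronization barrier
        int a = sdata;                    // guaranteed to read the latest value (1)

        if (threadIdx.x == 0)
            gdata[blockIdx.x] = 2;        // write to global memory by a thread of this block
        __threadfence();                  // global memory barrier ...
        __syncthreads();                  // ... followed by a synchronization instruction
        int b = gdata[blockIdx.x];        // guaranteed to read the latest value (2)

        out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
    }

If the global write had instead been performed by a thread running on a different streaming multiprocessor, the read of gdata could still return stale data due to the incoherent L1 caches, which is exactly why our formalization gives so few guarantees.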

List of Symbols

α ∈ ActiveMask, 84
β ∈ ThreadBlock, 102
γ ∈ Grid, 108
∆ ∈ Device, 115
δ ∈ DisableMask, 87
ζ ∈ Context, 110
η ∈ MemEnv, 43
θ ∈ Thread, 79
ϑ ∈ Threadpc, 125
κ ∈ CallStack, 68
ν ∈ VarEnv, 68
ξ ∈ ProgConf, 69
π ∈ Priority, 61
ϱ ∈ BlockRegLookup, 107
ϱ ∈ ExprRegLookup, 76
ϱ ∈ GridRegLookup, 109
ϱ ∈ ThreadRegLookup, 85
ς ∈ ExecState, 88
σ ∈ Σ, 46
σa ∈ Σatomic, 48
σd ∈ Σdevice, 48
σr ∈ Σreg, 48
σs ∈ Σshared, 48
τ ∈ ExecToken, 87
φ ∈ ProgEnv, 69
χ ∈ BarrierEnv, 103
ω ∈ Warp, 86
ac ∈ ArrCnt, 103
aop ∈ AtomicOp, 67
at ∈ ActTranslator, 129
b ∈ Byte, 33
bact ∈ ActB, 102
bar ∈ Barrier, 103
barid ∈ BarrierIdx, 80
bid ∈ BlockIdx, 102
br ∈ BarrierRed, 67
bt ∈ BarrierType, 67
c ∈ Const, 64
cact ∈ ActC, 110
cf ∈ CacheFunc, 53
cid ∈ ContextIdx, 110
cin ∈ ContextInput, 112
cio ∈ ContextIO, 112
cl ∈ CacheLine, 37
comp ∈ Comparison, 66
cop ∈ CacheOp, 67
cout ∈ ContextOutput, 112
cw ∈ CacheWord, 37
d ∈ DataWord, 33
df ∈ DecompFunc, 47
din ∈ DeviceInput, 117
dio ∈ DeviceIO, 117
dout ∈ DeviceOutput, 117
e ∈ Expr, 76
ec ∈ EvClass, 37
f ∈ FuncName, 64
gid ∈ GridIdx, 108
i ∈ BarInfo, 103
l ∈ Label, 64
lo ∈ LocalMemOffset, 78
m ∈ Mem, 130
mbl ∈ MemBarLevel, 66
mid ∈ MemOpIdx, 33
mm ∈ MemMgr, 115
op ∈ MemOp, 61
p ∈ Predicate, 80
pAddr ∈ PhysMemAddr, 33
pc ∈ ProgAddr, 65
pid ∈ ProcIdx, 44
pm ∈ ProgMap, 110
progid ∈ ProgIdx, 69
r ∈ Reg, 64
ro ∈ RegOffset, 78
sB ∈ BlockSchedule, 109
sΩ ∈ WarpSchedule, 107
size ∈ MemOpSize, 33
so ∈ SharedMemOffset, 102
ssa ∈ StateSpaceatomic, 42
ssd ∈ StateSpacedevice, 42
state ∈ WarpState, 86
stm ∈ MemStm, 54
tact ∈ ActΘ, 79
tactbar ∈ BarrierActΘ, 81
tactcomm ∈ CommActΘ, 81
tactflow ∈ FlowAct, 79
tactmem ∈ MemActΘ, 80
tactvote ∈ VoteAct, 81
tc ∈ ThreadCnt, 80
tid ∈ ThreadIdx, 78
v ∈ Var, 64
vi/o ∈ Valuein/out, 47
vin ∈ Valuein, 47
vout ∈ Valueout, 47
vm ∈ VoteMode, 66
vms ∈ VirtMemSpace, 110
volatile ∈ Volatility, 67
w ∈ Warpϑ, 127
wact ∈ ActΩ, 86
wactbar ∈ BarrierActΩ, 86
wactmem ∈ MemActΩ, 86
wl ∈ WarpLane, 78
ya ∈ YieldAction, 46

List of Listings

2.1 A simple program written in CUDA-C that squares the values of an array . . 16 2.2 PTX version of program 2.1’s squareArray kernel ...... 22

4.1 Memory program for global writes using the .wb cache operation ...... 35 4.2 Memory program for global and local reads using the .ca cache operation . . 57 4.3 Memory program for local writes using the .cg cache operation ...... 58 4.4 Memory program for atomic operations on global memory ...... 58 4.5 Memory program for atomic operations on shared memory ...... 59 4.6 Memory program for writes to registers ...... 60

5.1 The PTX program used to illustrate the effects of the code transformations . . 70 5.2 Program 5.1 after the application of the code transformations ...... 71 5.3 Illustration of the stack construction and deconstruction code transformation 72 5.4 An incomplete PTX program that illustrates the scoping rules of PTX . . . . . 74 5.5 A PTX program with a conditional branch statement ...... 88 5.6 A PTX program with an indirect branch statement ...... 90 5.7 A PTX program with nested control flow instructions ...... 92

6.1 A program written in CUDA-C that deadlocks due to warp level branching . 122 6.2 Program 6.1 compiled into PTX ...... 123

List of Figures

1.1 Folding@Home Client Statistics ...... 4 1.2 Screenshot of the Folding@Home Client for Nvidia GPUs ...... 5 1.3 MRI Scan of a Human Head ...... 5

2.1 Fermi Chip Layout ...... 10 2.2 Fermi Streaming Multiprocessor ...... 11 2.3 Utilization of Execution Blocks ...... 12 2.4 Fermi’s Memories and Caches ...... 13 2.5 CUDA Thread Hierarchy ...... 17 2.6 Properties of CUDA Memory Types ...... 19 2.7 Mapping Between CUDA and OpenCL Terminology ...... 24 2.8 Mapping Between CUDA and Direct Compute Terminology ...... 25

3.1 Brief Overview of List Operations ...... 28 3.2 Global Constants Used Throughout the Semantics ...... 31

5.1 Stack Layout Before and After the Function Call of Program 5.3 ...... 73 5.2 Execution State of Program 5.5 ...... 89 5.3 Execution State of Program 5.6 ...... 91 5.4 Execution State of Program 5.7 ...... 92

Bibliography

[1] Mark S. Friedrichs, Peter Eastman, Vishal Vaidyanathan, Mike Houston, Scott Legrand, Adam L. Beberg, Daniel L. Ensign, Christopher M. Bruns, and Vijay S. Pande. Accelerating molecular dynamic simulation on graphics processing units. Journal of Computational Chemistry, 30(6):864–872, April 2009.

[2] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors — A Hands-on Approach. Morgan Kaufmann Publishers, 2010.

[3] Nvidia. Nvidia CUDA C Programming Guide, 2010. Version 3.1.1.

[4] Microsoft. DirectX SDK, June 2010. http://www.microsoft.com/downloads/en/details.aspx?FamilyID=3021d52b-514e-41d3-ad02-438a3ba730ba, last access 2010-10-05.

[5] Nvidia. Nvidia’s Next Generation CUDA Compute Architecture: Fermi, 2009. Version 1.1, http://www.nvidia.com/object/IO_89570.html, last access 2010-9-27, Whitepaper.

[6] Peter N. Glaskowsky. Nvidia’s Fermi: The First Complete GPU Computing Architecture, 2009. http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf, last access 2010-09-27, Whitepaper.

[7] Nvidia. Nvidia GF100, 2010. Version 1.4, http://www.nvidia.com/object/IO_89569.html, last access 2010-09-27, Whitepaper.

[8] Nvidia. Nvidia CUDA Best Practices Guide, 2010. Version 3.1 Beta.

[9] Nvidia. PTX: Parallel Thread Execution ISA Version 2.1, 2010.

[10] Nvidia. Nvidia CUDA C Programming Guide, 2010. Version 3.2 Beta.

[11] Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. Demystifying GPU Microarchitecture through Microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 235–246. IEEE Computer Society, 2010.

[12] AMD. ATI Stream Computing Compute Abstraction Layer, 2010. Revision 2.01.

[13] Khronos OpenCL Working Group. The OpenCL Specification, 2010. OpenCL Version 1.1, Document Revision 36, http://www.khronos.org/registry/cl/, last access 2010-10-15.

[14] Nvidia. OpenCL Programming Guide for the CUDA Architecture, 2010. Version 3.1 Beta.

[15] Nvidia. Nvidia OpenCL JumpStart Guide, 2010. Version 1.0.

[16] Nvidia. OpenCL Programming Guide for the CUDA Architecture, 2010. Version 3.1.

[17] AMD. ATI Stream Computing OpenCL, 2010. Revision 1.05.

[18] Tianyun Ni. Direct Compute — Bring GPU Computing to the Mainstream. Nvidia, 2009. http://www.nvidia.com/content/GTC/documents/1015_GTC09.pdf, last access 2010-10-14, Presentation.

[19] Matt Sandy. DirectCompute Memory Patterns. Microsoft, 2010. http://tiny.cc/1ugpc, last access 2010-10-14, Presentation.

[20] Johan Andersson. Parallel Graphics in Frostbite — Current and Future. DICE, 2009. http://s09.idav.ucdavis.edu/talks/04-JAndersson-ParallelFrostbite-Siggraph09.pdf, last access 2010-10-14, Presentation.

[21] Ian A. Buck, John R. Nickolls, Michael C. Shebanow, and Lars S. Nyland. Atomic memory operators in a parallel processor, December 2009. Patent US7627723.

[22] Peter B. Holmqvist, George R. Lynch, Patrick R. Marchand, David B. Glasco, and James Roberts. Multi-class data cache policies, May 2010. Patent GB2465474.

[23] Christopher D. S. Donham, John S. Montrym, and Patrick R. Marchand. Out of order graphics L2 cache, July 2009. Patent US7565490.

[24] G. Berry. The Constructive Semantics of Esterel, 2002. Version 3.0, http://www-sop.inria.fr/members/Gerard.Berry/, last access 2010-09-13, Draft Book.

[25] Nvidia. Nvidia CUDA Reference Manual, 2010. Version 3.1 Beta.

[26] Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, and John Erik Lindholm. Indirect Function Call Instructions in a Synchronous Parallel Thread Processor, September 2009. Patents US20090240931 and GB2458556.

[27] Hanne Riis Nielson and Flemming Nielson. Semantics with Applications: A Formal Introduction. John Wiley and Sons, 1999.

[28] Gordon D. Plotkin. A Structural Approach to Operational Semantics. University of Edinburgh, 2004.

[29] Sarita V. Adve and Kourosh Gharachorloo. Shared Memory Consistency Models: A Tutorial. Western Research Laboratory, 1995.

[30] Nvidia. cuobjdump disassembler User Guide, August 2010. Not yet publicly released.

[31] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In MICRO 40: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407–420. IEEE Computer Society, 2007.
