GPU Concurrency and Consistency

Dennis Sebastian Rieber, Technische Informatik, Universität Heidelberg. Email: [email protected]

Abstract—Modern graphics processing units (GPUs) are becoming increasingly popular for general purpose computing. However, Alglave et al. showed in a recent publication that the memory behaviour of software and hardware does not always match what the documentation specifies. This report looks at current consistency models for GPGPU computing and at the advantages and disadvantages of recent models. After that, it looks at emerging technologies that introduce new concepts for software and hardware to create an easier user experience while maintaining or improving performance.

Fig. 1. Message-Passing Litmus

I. INTRODUCTION

While GPGPU computing is gaining popularity, the programming model and memory behaviour are not as well understood as in traditional CPU computing. Using a set of litmus tests, Alglave et al. demonstrated weak memory behaviour for both NVidia and AMD hardware. While weak memory itself is not a problem, they could demonstrate that documentation, instructions and examples by the vendors were flawed and produced unintended results. For example, the message-passing litmus in Figure 1 could not be implemented reliably on certain architectures. The goal was to isolate the memory accesses 11 and 21 from 13 and 23 through memory fences. However, on NVidia's Tesla C2075 the fences were not working as advertised when targeting the L1 cache, resulting in the inability to reliably communicate between two CTAs. On AMD's GCN 1.0 architecture, fence instructions were removed by the compiler without any warning. The research also showed that increased stress on the hardware, i.e. more running threads and memory accesses, inflates the probability of unintended results. However, developers need to be sure that they are writing correct software, so the issue lies with the possibility and not the probability of a failure. Lacking documentation is identified as a source of these problems, especially in the area of memory consistency. [1]

This report discusses what the field of GPU memory models looks like today and what is planned for the future. Section II looks at consistency models in general, where they become important, how they affect users and why things are not as simple on GPUs. Section III looks at what current programming models, represented by OpenCL, offer in terms of memory ordering and where possible flaws are. Section IV introduces the reader to the concept of DeNovo coherence and how this change also influences consistency.

II. MEMORY CONSISTENCY

Before we take a look at GPU consistency, the term consistency in general is introduced, along with some of the more popular consistency concepts for CPUs.

One definition of consistency is: "Consistency dictates the value that a read on a shared variable in a multithreaded program is allowed to return" [2]. While this definition may sound simple at first, it bears some implications that are crucial to the understanding of consistency.
• Does hardware have to guarantee that a write is visible to all threads at the same time? This is where coherence has a strong influence on consistency.
• Is reordering of loads and stores allowed?
• How and where does the user have to intervene to maintain an intended ordering?
• Is stale data allowed in the system?
Every component that alters code needs to stick to the contract that is defined by the consistency model, from the user over compilers to hardware. If one component, for example the compiler, falls out of line, correct execution can not be guaranteed anymore. As an example, consider a basic producer-consumer pattern. In this pattern the particular order of loads and stores is important to maintain correctness. This means that the order of instructions given by the user needs to be maintained, otherwise correct execution is not guaranteed. If the consistency model does not give such a guarantee by default, means to enforce the ordering need to be provided to the user. The user himself has to be aware of these circumstances and use the tools provided by the language.
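A minimal C11 sketch of such a handoff (my example, not taken from the report) uses an atomic flag as the ordering tool; the names data and flag are purely illustrative.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int data;                 /* ordinary shared variable (the payload) */
atomic_int flag = 0;      /* synchronization variable               */

void *producer(void *arg) {
    (void)arg;
    data = 42;                                              /* (1) write payload */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* (2) publish       */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                          /* wait until (2) is visible                   */
    printf("%d\n", data);          /* prints 42: (1) happens-before this read     */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Without the release/acquire pair on flag, neither compiler nor hardware would be obliged to keep the write to data ordered before the write to flag, and the consumer could read a stale value.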
Unlike coherence, which is usually invisible to the user, consistency is visible to the user, and different models of consistency have different effects on how a user has to write software in order to maintain correctness. For all models, the user has to know how the model is going to behave, especially for concurrent programming.

In a strictly single-threaded environment consistency is a lesser problem because reader and writer are always the same entity. Concurrency, however, can create conflicts in concurrent code that are not visible immediately. In this context a conflict represents the situation of two accesses to a shared variable from different threads, where at least one access is a write. Because the instructions of multiple threads interleave in a non-controllable fashion during parallel execution, the consistency model has to provide rules on what has to be read, and how this has to be enforced.

In the following, this report discusses a selection of consistency models relevant to the later discussion and gives a short summary of how each influences users, compilers and hardware.

A. Sequential Consistency

Sequential consistency was defined in 1979 by Lamport and describes the concurrent consistency model that is most intuitive for the user: "...the results of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [6] This means that all memory operations, especially write operations, are perceived in the same order by all processors, and at the same time. It does not matter if this is the order the operations were actually issued in. While this concept is easy to implement on systems with a bus interconnect, it becomes inefficient on modern systems like Hyper-Transport, because every time a memory operation is issued no other processor can do so, since it has to wait for the preceding one to complete.

B. Processor Consistency

This model is the one used by many modern processors, including the x86 processor family. Each processor sees its own write operations in order, using a store buffer, but the write operations of other processors may not appear in order. This is easy to realize with modern system interconnects and saves the processor from stalling on unfinished write and coherence operations. To maintain the processor's intrinsic ordering, a store buffer is used that additionally allows the processor's own reads to bypass outstanding writes; each read makes a look-up in the store buffer to get the latest value, if present. Every other ordering (W2W, R2R, R2W) is maintained. The user, however, suffers from exposure to hardware structures, namely the store buffer. In order to guarantee visibility at certain points, memory fences have to be issued to flush the store buffers.
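The classic store-buffering litmus makes this exposure concrete. The following C11 sketch (my illustration, showing only the two thread bodies) uses relaxed atomics to model the hardware behaviour: with both fences left commented out, the outcome r1 == 0 and r2 == 0 is allowed, because each load can be served while the other core's store still sits in its store buffer.

#include <stdatomic.h>

atomic_int x = 0, y = 0;

int thread1(void) {                      /* r1 = return value */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst); */   /* flushes the store buffer */
    return atomic_load_explicit(&y, memory_order_relaxed);
}

int thread2(void) {                      /* r2 = return value */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst); */
    return atomic_load_explicit(&x, memory_order_relaxed);
}

Enabling both fences rules out the r1 == 0, r2 == 0 outcome; this is exactly the kind of explicit intervention that processor consistency demands from the user.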
C. Weak Consistency

Weak consistency releases every ordering and gives virtually no guarantees in which order memory operations happen. Synchronization accesses fall back to sequential consistency, where (1) all memory operations have to be finished before the synchronization, (2) no memory operations are allowed during a synchronization access and (3) normal data operations resume after the synchronization has finished. This allows compilers and hardware to freely rearrange accesses in order to maximize performance; in hardware, this allows coalescing and out-of-order execution. However, it also means the user has to carefully synchronize each critical point where the actual order of execution matters in order to maintain correctness. It further introduces potential for errors or over-synchronization, which in the end nullifies the performance gained by the optimization freedom.

D. Data-Race-Free

While the preceding models have been mainly hardware oriented, Data-Race-Free (DRF) is a more software oriented model that appears in many specifications of programming languages, such as Java [2], C++11 [2] or OpenCL [7], and has become a de-facto standard for modern concurrent languages. It was observed that a well written program is either data race free or correctly synchronized. This observation, together with the ability to distinguish between data and synchronization accesses on all levels, is leveraged to create the appearance of sequential consistency for the user while allowing relaxation of the memory ordering for compiler and hardware. DRF-0 defines an order →po over all memory operations of a program and an order →so over all synchronization operations. →po is bound to the program order, and the overall happens-before order →hb is defined as the irreflexive, transitive closure of →po and →so:

    →hb = (→po ∪ →so)+

In a multi-threaded environment, the →hb of each thread T interleaves to form a global order <T. [1]
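A small worked instance of these relations (notation as above, example mine): let T1 write a payload Dat and then perform a synchronization store to a flag F, and let T2 perform a synchronization load of F followed by a read of Dat.

    T1:         Dat.st →po F.st
    T1 to T2:   F.st   →so F.ld
    T2:         F.ld   →po Dat.ld
    therefore   Dat.st →hb Dat.ld,   since →hb = (→po ∪ →so)+

Because the write to Dat is ordered before the read by →hb, the two conflicting accesses are not a data race, and the read is guaranteed to return the written value.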
Fig. 2. Transitive message-passing example.

E. C11 Example

In C11, DRF-0 is used to specify the concurrent memory behaviour. Unlike Java, no guarantees are given in case of a data race. Every event of a race is handled by the programmer, using sequentially consistent atomics to resolve the race. C11 further allows the specification of other memory orderings, especially more relaxed models. However, this comes at the price of much higher complexity and unwanted side-effects. For example, it is possible that the relaxed consistency "leaks" out of or into a library, breaking code that is expected to be sequentially consistent, if not used properly.

Figure 2 shows an example of passing a value Dat from T1 to thread T3 over a middle-man T2. The conflicting accesses to A in T1 and T2 are resolved using stores and loads with single-copy atomicity, which count as synchronization operations. This means that the store to Dat can not be reordered with the store to A without violating the →hb of T1. The same applies to the handshake between T2 and T3: A.ld →hb Dat.ld →hb B.st in T2 and B.ld →hb Dat.ld in T3. This chain of orderings creates a transitive relation between T1 and T3 and ensures the correct transmission of Dat.

The C11 standard offers the option to relax the memory ordering for performance gains, at the risk of introducing unwanted race conditions and even leaking weak memory behaviour out of its intended scope.
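In code, the pattern of Figure 2 could look like the following C11 sketch; the thread bodies are my reconstruction from the description above, only the names Dat, A and B are taken from the text. The plain atomic_store and atomic_load calls are sequentially consistent by default.

#include <stdatomic.h>

int Dat;                       /* payload in ordinary memory      */
atomic_int A = 0, B = 0;       /* sequentially consistent atomics */

void t1(void) {                            /* producer            */
    Dat = 1;
    atomic_store(&A, 1);                   /* Dat.st ->hb A.st    */
}

void t2(void) {                            /* middle-man          */
    while (atomic_load(&A) == 0) ;         /* A.ld ...            */
    int v = Dat;                           /* ... ->hb Dat.ld     */
    (void)v;
    atomic_store(&B, 1);                   /* ... ->hb B.st       */
}

void t3(void) {                            /* consumer            */
    while (atomic_load(&B) == 0) ;         /* B.ld ->hb Dat.ld    */
    int v = Dat;                           /* observes T1's write */
    (void)v;
}

Each handshake contributes a synchronization edge, and together with the program order these edges form the transitive →hb chain from T1's store of Dat to T3's load of Dat.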
III. GPU CONSISTENCY - TRADITIONAL APPROACH

When developing memory ordering models for GPUs, the unique hardware architecture creates new problems that need to be addressed by the model. CPUs generally live in a flat world with a single memory address space and no grouping of threads, or at least they are built to appear that way, because the memory hierarchy created by the caches is not accessible to the user.

The basic concept of GPUs requires an adjusted memory ordering model. Both the threading model and the memory are hierarchical, and thus ordering needs to be managed on multiple levels. Threads are grouped into three-dimensional cooperative thread arrays (CTAs), which themselves are again grouped in three-dimensional structures (Grid in CUDA, NDRange in OpenCL). Each thread and each CTA has an ID to uniquely identify a thread. All threads have access to the global memory of the device; additionally, all threads inside a CTA share an exclusive local memory (scratch-pad in CUDA) that is private to the executing compute unit. In this section we will present two concepts that dive into this complexity. First, OpenCL 2.0 will represent the state of the art in programming frameworks for GPUs.

A. OpenCL 2.0

Based on the C11 standard, OpenCL 2.0 is extended to cater to the needs of GPGPU computing. The most significant change is taking the separated address spaces of local and global memory into account by creating multiple, hierarchical happens-before orders. These so-called scopes determine the visibility of atomic operations and are designed to reduce the cost of synchronization operations. For example, if the operations only happen inside a specific CTA, only the threads belonging to it need to see the modifications. This way other CUs are not affected by the resulting flushing of caches and buffers, which reduces the overhead of these operations. The resulting order of modifications to an atomic object is independent between local and global memory.

Just like C11, OpenCL 2.0 uses a relaxed memory ordering that appears to be sequentially consistent if sequentially consistent atomics are used. Again, the memory ordering can be relaxed by the user. This concept aligns with the hierarchical thread and memory model of GPUs, but it also requires the user to always find the smallest possible scope in order to utilize the potential performance. Figure 3 shows how the message-passing over a middle-man is realized in OpenCL. It is the same example as in Figure 2, but now the atomics are bound to the scopes' happens-before order. The used scope needs to include all executing threads, otherwise the ordering is not guaranteed any more. If all threads are in the same CTA, the scope "work group" would suffice to ensure ordering. The synchronization would then only affect the executing CU and be relatively cheap, since it can be performed in local memory if A and B live in this address space. If one of the threads lives in another CTA, the situation is more complex.

Fig. 3. Transitive message-passing in OpenCL 2.0
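In OpenCL C 2.0 this pattern could be written roughly as in the following kernel sketch. The kernel name, buffer layout and work-item roles are my assumptions; the scoped atomic built-ins (atomic_store_explicit and atomic_load_explicit with a memory_scope argument) are part of the OpenCL 2.0 specification. With T1, T2 and T3 in the same work group, memory_scope_work_group is sufficient.

__kernel void message_passing(__global int *Dat,
                              __global atomic_int *A,
                              __global atomic_int *B)
{
    size_t id = get_global_id(0);

    if (id == 0) {                                     /* T1: producer    */
        *Dat = 1;
        atomic_store_explicit(A, 1, memory_order_release,
                              memory_scope_work_group);
    } else if (id == 1) {                              /* T2: middle-man  */
        while (atomic_load_explicit(A, memory_order_acquire,
                                    memory_scope_work_group) == 0) ;
        atomic_store_explicit(B, 1, memory_order_release,
                              memory_scope_work_group);
    } else if (id == 2) {                              /* T3: consumer    */
        while (atomic_load_explicit(B, memory_order_acquire,
                                    memory_scope_work_group) == 0) ;
        int v = *Dat;                                  /* sees T1's write */
        (void)v;
    }
}

If T3 is launched in another work group, every scope below memory_scope_device stops ordering the accesses, which leads to the two scenarios discussed next.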

For this example we assume T3 lives in another CTA than T1 and T2.
1) The user selects the scope "global" for all atomics, and synchronization works as expected because all the CUs on the device, even those not directly involved in the communication, are forced to participate, which makes this an expensive operation. All CUs need to stop working and flush their buffers and caches to the global memory. If the internal interconnect is unordered, the device also needs to wait for an acknowledgement that all CUs are finished before normal memory operations are allowed to resume.
2) The handshake on A between T1 and T2 is performed in the "work group" scope and the handshake on B is performed in "global" scope. Because the action on A is not global, it is not included in the global happens-before order. B, however, is in the global scope, which makes the local atomic A appear like a normal memory operation that is allowed to be reordered. Due to this, there are no guarantees that the program order will be maintained by T2. This is called a heterogeneous race. The result is displayed in Figure 4, where the two happens-before orders run independently; a code sketch of this scenario follows below.

Fig. 4. Race caused by disjoint synchronization orders. It might fail to run correctly, depending on the chosen scopes.

This example shows that while scopes may increase performance, they also create new pitfalls. Since there is no transitive scope relation, not all problems actually benefit from the different scopes, because they need to fall back to the expensive global scope to prevent heterogeneous races. [7]
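To make scenario 2 concrete, the following hedged OpenCL C fragment shows only T2's side, reusing the hypothetical kernel shape from the previous sketch; the essential point is the mismatch of scopes.

__kernel void t2_mixed_scopes(__global int *Dat,
                              __global atomic_int *A,
                              __global atomic_int *B)
{
    /* handshake on A stays in work-group scope ...                      */
    while (atomic_load_explicit(A, memory_order_acquire,
                                memory_scope_work_group) == 0) ;
    /* ... but the handshake on B uses device scope; the synchronization
       on A is not part of the device-scope happens-before order, so
       nothing orders T1's write to Dat before this store for readers in
       other work groups                                                 */
    atomic_store_explicit(B, 1, memory_order_release,
                          memory_scope_device);
}

T3 can therefore observe B == 1 and still read a stale value of Dat; the two happens-before orders never connect.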
B. Heterogeneous Race Free (HRF)

HRF proposes a more formalized model of OpenCL 2.0, although it was released before OpenCL 2.0 was officially specified. The authors propose two related models: (1) HRF-Direct, which roughly corresponds to OpenCL 2.0, and (2) HRF-Indirect, an extension that allows transitivity. Both redefine how the synchronization order →so_S and the program order →po are connected; note the new subscript "S" in →so_S, which represents the scope the order belongs to. HRF-Direct formalizes the unification of all scopes as the global happens-before order →ghb.d:

    →ghb.d = ⋃ S∈K (→po ∪ →so_S)+

where K represents all scopes existing in the executing kernel. This leaves all the separate scopes disjoint, creating the transitivity problem. Note that this is also an approximate formalization of OpenCL 2.0. To resolve this, HRF-Indirect relates all →so_S by moving the unification of all scopes into the individual happens-before orders, creating the new global happens-before order →ghb.i:

    →ghb.i = (→po ∪ ⋃ S∈K →so_S)+

While removing the possibility of heterogeneous races, this change in the formalization has a big impact on how efficiently the synchronization can be realized in hardware. Where →ghb.d only has to flush addresses that are directly involved in the synchronization to the visibility plane, →ghb.i needs to flush the whole visibility plane in order to cover possible transitive relations. Consider the example in Figure 3 again, in the scenario of T1 and T2 living in another CTA than T3. Now the execution where T1 and T2 synchronize locally and T2 and T3 synchronize globally is valid, and it also performs better than always synchronizing on the global plane. However, all local synchronizations now perform worse. [4]

C. The road so far

OpenCL and HRF as presented in the preceding sections share the basic idea of using explicit scopes to order memory accesses on the smallest required level in order to increase performance. This places the burden of memory ordering optimization on the user and relies on his knowledge of the consistency model and the underlying hardware architecture. By staying as close as possible to the existing frameworks, complexity has increased compared to CPUs, which lays more responsibility onto the user. With this increased complexity, new sources of errors have also been introduced that are not intuitive, especially for novice users. While not mentioned in the sections above, NVidia's CUDA basically uses the same principles as OpenCL, but its documentation heavily lacks detailed information on memory consistency, which is the reason OpenCL was used as the representative of today's models. Another issue is the decreased performance portability that scopes cause. If future hardware architectures change, so might the required scopes, and legacy applications could lose performance due to sub-optimal scoping. In conclusion, while the situation is better documented than it was when Alglave et al. published their research, the complexity for the user remains the same. In the following section a new approach to this problem is introduced.

IV. DENOVO COHERENCE

While coherence and consistency are orthogonal problems, they are not disjoint. Consistency gives the user a guideline which value to expect upon reading a shared variable, and it is the responsibility of coherence to actually deliver this value to the reader; it has to keep track of where the latest value of an address is to be found.

The previously presented models are all based on current GPU programming frameworks and hardware architectures and try to do the best they can while maintaining these established technologies. The research group that developed DeNovo coherence took a radical step and redesigned the way a user interacts with concurrency and coherence in a parallel environment. The MESI protocol and its derivatives carry the responsibility of always delivering a correct value alone and are invisible to the user. MESI and its variants are write-invalidate protocols where the ownership of an address is claimed by the writer and each write-miss in the cache triggers a lengthy chain of events in which the hardware makes sure no other cache has ownership of this cache line. During this process, (1) directory lookups are performed, (2) an invalidate needs to be sent to each sharer and (3) all acknowledgements need to be received to authorize the coherence-state transition to "Modified". The cache line in the core that issued the original write operation lingers in a transient state until the coherence protocol has handled all the necessary steps. During this time another core could try to claim ownership of a different address inside the same cache line, and a core that later reads this cache line triggers the process that restores the "Shared" state, which also requires the propagation of the new value. The reason for this expensive machinery is that write operations can appear at any point of execution, to every possible address in the shared memory.

The DeNovo research team claims that modern parallel software frameworks like Deterministic Parallel Java (DPJ) already contain the information needed to guarantee data-race-freedom in parallel phases through software, and that this can be used to create a simpler HW/SW-hybrid coherence protocol.

A. Deterministic Parallelism

Disciplined shared-memory programming models like DPJ or Cilk-Plus allow a redesign of current systems towards models that are more efficient in terms of complexity, power and performance. These languages have parallel phases, like "parallel for" or "fork-join" blocks, that are automatically parallelized by the compiler and tooling. The following conditions need to apply for a framework to be compatible with DeNovo:
1) Guarantee of data-race-freedom and deterministic semantics
2) Concepts for structured parallel control
3) Information about the effects of shared-memory accesses, in other words, which addresses are read and written by which thread during a parallel phase.
This allows the implementation of the following principles. First, knowing when and where which address is accessed by whom allows caches to self-invalidate stale data at the end of a parallel phase; this removes the need for hardware directories keeping track of sharers. Second, the guarantee of data-race-freedom through software eliminates the need for writer-invalidation and thus for transient states. Third, by removing the need to track sharers, cache-to-cache data transfer latency is reduced, since no directory needs to be consulted. Fourth, the guarantee of data-race-freedom in software enables an easier implementation of strong memory ordering in hardware.

Fig. 5. DeNovo Coherence example.

Figure 5 shows an example of both the hardware and the software structure of DeNovo. The code example on the left hand side shows how deterministic code could look. Each phase is annotated with the region it manipulates and the nature of the effect (read or write). Out of this, a data-race-free parallel section is created. The compiler then adds information about the self-invalidation for each thread at the end of the phase. When a phase is closed, all caches are in a legal state. [3]
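The report does not reproduce the listing of Figure 5, but the flavour of such an annotated phase can be sketched in C as follows. DPJ itself is Java with real region and effect annotations; here the annotation is only a comment, and the names phase1, region_r and self_invalidate are purely illustrative placeholders.

/* effect annotation: reads region_r (in), writes region_r (out) */
static void self_invalidate(void)
{
    /* placeholder for the hardware/runtime hook that discards all cached
       words of the region that this core only read during the phase     */
}

void phase1(const int *in, int *out, int begin, int end)
{
    /* data-race-free by construction: each thread works on a disjoint
       index range [begin, end)                                          */
    for (int i = begin; i < end; ++i)
        out[i] = in[i] + 1;

    /* inserted by the compiler/tooling at the end of the parallel phase */
    self_invalidate();
}

The effect annotation carries exactly the per-phase information DeNovo needs: which words may have been written elsewhere and therefore have to be self-invalidated before the next phase starts.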
B. Hardware Architecture

The DeNovo team proposes a cache hierarchy with a private L1 cache and a shared LLC. While Figure 5 shows two cache levels, more can be implemented to deepen the hierarchy. The LLC is responsible for keeping cached data locatable, by storing its value (state "Valid"), fetching it from memory (state "Invalid") or knowing which core currently has it registered (state "Registered"). In hardware, DeNovo implements a true three-state protocol consisting of the "Valid", "Invalid" and "Registered" states. Each cache in DeNovo is organized in cache lines, but the states are managed on a per-word basis. While "Invalid" is self-explanatory, "Valid" corresponds to the "Shared" state in MSI and "Registered" is closest to the "Modified" state; there are differences, however, in how the states are transitioned. The guarantee of data-race-freedom allows an immediate transition from Valid or Invalid to Registered, even if the value is Registered in another core. DeNovo immediately sets the state to Registered and then informs the registry (LLC), which is also immediately updated. The registry then sends an invalidation request to the previous owner. Due to the knowledge that there are no races, we know that this can not cause a conflict. [3]
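A minimal C sketch of this word-granularity state machine follows; the data structures (l1_word_t, llc_word_t) and function names are invented for illustration, only the three states and the no-transient-state argument come from the text above.

#include <stdint.h>

typedef enum { INVALID, VALID, REGISTERED } word_state_t;

typedef struct {
    word_state_t state;          /* state is kept per word, not per line */
    uint32_t     value;
} l1_word_t;

typedef struct {
    int      registered_core;    /* owning core, or -1 if none           */
    uint32_t value;              /* backing value when nobody owns it    */
} llc_word_t;

static void send_invalidate(int core)
{
    (void)core;                  /* network message to the old owner     */
}

/* A write under DeNovo: because the program is data-race-free, the core
   may move the word to Registered immediately, with no transient state,
   and then inform the registry (the LLC), which invalidates the previous
   owner.                                                                */
void denovo_write(int core, l1_word_t *l1, llc_word_t *llc, uint32_t v)
{
    l1->state = REGISTERED;
    l1->value = v;
    int prev  = llc->registered_core;    /* registry lookup              */
    llc->registered_core = core;         /* registry updated immediately */
    if (prev >= 0 && prev != core)
        send_invalidate(prev);           /* cannot race, by DRF guarantee */
}

The crucial difference to MESI is that the writing core never waits for acknowledgements: data-race-freedom guarantees that no other core is accessing the word concurrently, so the transition can be performed on the spot.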
C. Non-Determinism

One of the basic problems of parallel programming is the need to safely modify shared variables that might be accessed by multiple threads during the execution, and one solution to this problem are locks. Locks, however, are inherently racy, which is something the baseline DeNovo eliminates in parallel phases. This so-called non-determinism still needs to be possible, and therefore DeNovoND was introduced. Additional hardware is used to realize these non-deterministic data accesses. The basic requirements for this are (1) a strong isolation of deterministic and non-deterministic operations and (2) the ability to put the deterministic operations into a sequential ordering, even if they contain or are contained within a non-deterministic construct. In DeNovoND this isolation is created by locks with hardware support, namely Queue-on-Sync-Bit (QOSB) locks. Such a lock builds a queue across all cores that request access to a certain word, which implicitly serializes all atomic access operations. To track which addresses are involved in the atomic write operations of a phase, a Bloom filter is used. A Bloom filter uses a series of simple hash functions on the address to create an access signature. When the lock is passed between cores, the filters are exchanged and combined. [9] [10]
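A sketch of such an access signature in C, with invented sizes and hash functions (neither is taken from the DeNovoND papers):

#include <stdint.h>
#include <string.h>

#define SIG_BITS 256
typedef struct { uint64_t bits[SIG_BITS / 64]; } signature_t;

static void sig_clear(signature_t *s)           { memset(s, 0, sizeof *s); }
static void sig_set(signature_t *s, unsigned b) { s->bits[b / 64] |= 1ull << (b % 64); }

/* two cheap hash functions over the word address */
static unsigned h1(uintptr_t a) { return (unsigned)((a >> 2) % SIG_BITS); }
static unsigned h2(uintptr_t a) { return (unsigned)(((a >> 2) * 2654435761u) % SIG_BITS); }

/* record a word written atomically while holding the lock */
static void sig_insert(signature_t *s, uintptr_t addr)
{
    sig_set(s, h1(addr));
    sig_set(s, h2(addr));
}

/* when the lock is handed over, the signatures are exchanged and combined */
static void sig_union(signature_t *dst, const signature_t *src)
{
    for (unsigned i = 0; i < SIG_BITS / 64; ++i)
        dst->bits[i] |= src->bits[i];
}

The receiving core can then conservatively self-invalidate exactly the words whose addresses hash into the combined signature, instead of flushing its whole cache.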
Fig. 6. (L): Runtime of DeNovo-DRF in benchmarks using global synchronization, relative to traditional GPU coherence and consistency. (R): Runtime of HRF and DeNovo-HRF models in benchmarks using global and local synchronization, relative to traditional GPU coherence and consistency. (Images from [8])

D. DeNovo for GPU

Using DeNovo coherence, the team showed in a series of simulations that a consistency model with the same complexity as the CPU's can be implemented for GPUs without losing any significant performance in traditional GPU applications, where the only point of global synchronization is a kernel restart. In scenarios with mid-kernel global synchronization, DeNovo even provided a significantly lower execution time, together with lower dynamic energy consumption and decreased network traffic. The left side of Figure 6 shows the runtime of benchmarks using DeNovo, relative to the execution with existing models; G* represents the reference models, D* DeNovo. Returning to the terms of the traditional GPU consistency models, DeNovo coherence with DRF on GPUs has only the global synchronization scope, ignoring CU locality. However, DeNovo is also compatible with HRF's scoped synchronization, allowing higher performance at the cost of increased complexity for the user. Again, besides the lower execution time, dynamic energy and network efficiency are improved. The right-hand side of Figure 6 shows the relative performance of benchmarks that have local and global synchronization points and compares the reference model GD with HRF (GH), DeNovo-DRF (DD), DeNovo-DRF with region optimization (DD+RO) and DeNovo with HRF (DH). Further research also shows methods to cache the local memory, thus integrating the local address space and its performance advantages without introducing the increased complexity of scoped synchronization. [8] [5]

V. SUMMARY

When looking at how the programming experience for the user has developed on GPUs with respect to memory consistency behaviour, a development into a more convenient direction can be noticed. Newer concepts like OpenCL 2.0 and Heterogeneous-Race-Free give more formalization and information on how memory ordering has to be managed by the user. NVidia's CUDA, which has a wide acceptance in GPGPU computing, is still poorly documented and leaves much room for interpretation to the user. With HRF-Indirect, researchers provided a way to prevent a source of unintuitive errors and already improved the user experience. But all these models ultimately try to work around the restrictions given by current hardware and programming frameworks.

A new model is explored by the DeNovo group that leaves the beaten paths and demonstrates that new concepts for software and hardware could enable easier usability by bringing the CPU's more intuitive behaviour to GPGPU programming without losing performance. Alternatively, we can keep the complicated behaviour and gain even more performance, while saving energy. Although the concept looks promising for both CPU and GPU, it has to be accepted by the hardware, compiler and language developers in order to be successful.

REFERENCES

[1] Sarita V. Adve and Mark D. Hill. Weak ordering—a new definition. SIGARCH Comput. Archit. News, 18(2SI):2–14, May 1990.
[2] Hans-J. Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model. SIGPLAN Not., 43(6):68–78, June 2008.
[3] Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching-Tsun Chou. DeNovo: Rethinking the memory hierarchy for disciplined parallelism. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT '11, 2011.
[4] Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. Heterogeneous-race-free memory models. SIGPLAN Not., 49(4):427–440, February 2014.
[5] Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, and Vikram S. Adve. Stash: Have your scratchpad and cache it too. SIGARCH Comput. Archit. News, 43(3):707–719, June 2015.
[6] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, September 1979.
[7] Khronos OpenCL Working Group. OpenCL 2.0 specification.
[8] Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, pages 647–659, New York, NY, USA, 2015. ACM.
[9] Hyojin Sung and Sarita V. Adve. DeNovoSync: Efficient support for arbitrary synchronization without writer-initiated invalidations. SIGARCH Comput. Archit. News, 43(1):545–559, March 2015.
[10] Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve. DeNovoND: Efficient hardware support for disciplined non-determinism. SIGPLAN Not., 48(4):13–26, March 2013.