CUDA C: Race Conditions, Atomics, Locks, Mutex, and Warps

CUDA C: race conditions, atomics, locks, mutex, and warps Will Landau Race conditions CUDA C: race conditions, atomics, locks, Brute force fixes: atomics, locks, and mutex, and warps mutex Warps Will Landau Iowa State University October 21, 2013 Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 1 / 33 CUDA C: race Outline conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and Race conditions mutex Warps Brute force fixes: atomics, locks, and mutex Warps Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 2 / 33 Race conditions CUDA C: race Outline conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and Race conditions mutex Warps Brute force fixes: atomics, locks, and mutex Warps Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 3 / 33 Race conditions CUDA C: race Race conditions conditions, atomics, locks, mutex, and warps Will Landau I Let int *x point to global memory. *x++ happens in 3 Race conditions steps: Brute force fixes: 1. Read the value in *x into a register. atomics, locks, and mutex 2. Add 1 to the value read in step 1. Warps 3. Write the result back to *x. I If we want parallel threads A and B to both increment *x, then we want something like: 1. Thread A reads the value, 7, from *x. 2. Thread A adds 1 to its value, 7, to make 8. 3. Thread A writes its value, 8, back to *x. 4. Thread B reads the value, 8, from *x. 5. Thread B adds 1 to its value, 8, to make 9. 6. Thread B writes the value, 9, back to *x. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 4 / 33 Race conditions CUDA C: race Race conditions conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: I But since the threads are parallel, we might actually atomics, locks, and mutex get: Warps 1. Thread A reads the value, 7, from *x. 2. Thread B reads the value, 7, from *x. 3. Thread A adds 1 to its value, 7, to make 8. 4. Thread A writes its value, 8, back to *x. 5. Thread B adds 1 to its value, 7, to make 8. 6. Thread B writes the value, 8, back to *x. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 5 / 33 Race conditions CUDA C: race Example: race condition.cu conditions, atomics, locks, mutex, and warps Will Landau 1#include <s t d i o . h> 2#include <s t d l i b . h> Race conditions 3#include <cuda . h> 4#include <c u d a r u n t i m e . h> Brute force fixes: 5 atomics, locks, and 6 g l o b a l v o i d colonel(int ∗a d )f mutex 7 ∗a d += 1 ; 8 g Warps 9 10 int main() f 11 12 int a=0, ∗a d ; 13 14 cudaMalloc ( (void ∗∗) &a d , sizeof(int)); 15 cudaMemcpy ( a d , &a , sizeof(int) , cudaMemcpyHostToDevice); 16 17 float elapsedTime; 18 c u d a E v e n t t start, stop; 19 cudaEventCreate(&start); 20 cudaEventCreate(&stop); 21 cudaEventRecord( start, 0 ); 22 23 c o l o n e l <<<1000,1000>>>(a d ) ; Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 6 / 33 Race conditions CUDA C: race Example: race condition.cu conditions, atomics, locks, mutex, and warps 24 cudaEventRecord( stop, 0 ); Will Landau 25 cudaEventSynchronize( stop ); 26 cudaEventElapsedTime( &elapsedTime, start , stop ); Race conditions 27 cudaEventDestroy( start ); 28 cudaEventDestroy( stop ); Brute force fixes: 29 p r i n t f ("GPU Time elapsed:%f seconds nn", elapsedTime/1000.0); atomics, locks, and 30 mutex 31 32 cudaMemcpy(&a , a d , sizeof(int) , cudaMemcpyDeviceToHost); Warps 33 34 p r i n t f ("a=%d nn", a); 35 cudaFree ( a d ) ; 36 37 g 1 > nvcc r a c e condition.cu −o r a c e c o n d i t i o n 2 > . / r a c e c o n d i t i o n 3 GPU Time elapsed: 0.000148 seconds 4 a = 88 I Since we started with a at 0, we should have gotten a = 1000 · 1000 = 1; 000; 000. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 7 / 33 Race conditions CUDA C: race Race conditions conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and I Race condition: A computational hazard that arises mutex when the results of the program depend on the timing Warps of uncontrollable events, such as the execution order or threads. I Many race conditions are caused by violations of the SIMD paradigm. I Atomic operations and locks are brute force ways to fix race conditions. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 8 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race Outline conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and Race conditions mutex Warps Brute force fixes: atomics, locks, and mutex Warps Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 9 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race Atomics conditions, atomics, locks, mutex, and warps Will Landau I Atomic operation: an operation that forces otherwise Race conditions parallel threads into a bottleneck, executing the Brute force fixes: atomics, locks, and operation one at a time. mutex Warps I In colonel(), replace *a d += 1; with an atomic function, atomicAdd(a d, 1); to fix the race condition in race condition.cu. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 10 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race race condition fixed.cu conditions, atomics, locks, mutex, and warps Will Landau 1#include <s t d i o . h> 2#include <s t d l i b . h> Race conditions 3#include <cuda . h> 4#include <c u d a r u n t i m e . h> Brute force fixes: 5 atomics, locks, and 6 g l o b a l v o i d colonel(int ∗a d )f mutex 7 atomicAdd(a d , 1) ; 8 g Warps 9 10 int main() f 11 12 int a=0, ∗a d ; 13 14 cudaMalloc ( (void ∗∗) &a d , sizeof(int)); 15 cudaMemcpy ( a d , &a , sizeof(int) , cudaMemcpyHostToDevice); 16 17 float elapsedTime; 18 c u d a E v e n t t start, stop; 19 cudaEventCreate(&start); 20 cudaEventCreate(&stop); 21 cudaEventRecord( start, 0 ); 22 23 c o l o n e l <<<1000,1000>>>(a d ) ; Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 11 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race race condition fixed.cu conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: 24 cudaEventRecord( stop, 0 ); atomics, locks, and 25 cudaEventSynchronize( stop ); mutex 26 cudaEventElapsedTime( &elapsedTime, start , stop ); 27 cudaEventDestroy( start ); Warps 28 cudaEventDestroy( stop ); 29 p r i n t f ("GPU Time elapsed:%f seconds nn", elapsedTime/1000.0); 30 31 32 cudaMemcpy(&a , a d , sizeof(int) , cudaMemcpyDeviceToHost); 33 34 p r i n t f ("a=%d nn", a); 35 cudaFree ( a d ) ; 36 37 g Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 12 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race race condition fixed.cu conditions, atomics, locks, mutex, and warps Will Landau 1 > nvcc r a c e c o n d i t i o n f i x e d . cu −arch sm 20 −o r a c e c o n d i t i o n f i x e d Race conditions 2 > . / r a c e d o n c i t i o n f i x e d Brute force fixes: atomics, locks, and 3 GPU Time elapsed: 0.01485 seconds mutex 4 a = 1000000 Warps I We got the right answer this time, and execution was slower because we forced the threads to execute the addition sequentially. I If you're using builtin atomic functions like atomicAdd(), use the -arch sm 20 flag in compilation. I This is to make sure you're using CUDA compute capability (version) 2.0 or above. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 13 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race CUDA C builtin atomic functions conditions, atomics, locks, mutex, and warps I With CUDA compute capability 2.0 or above, you can Will Landau use: Race conditions I atomicAdd() Brute force fixes: I atomicSub() atomics, locks, and mutex I atomicMin() Warps I atomicMax() I atomicInc() I atomicDec() I atomicAdd() I atomicExch() I atomicCAS() I atomicAnd() I atomicOr() I atomicXor() I For documentation, refer to the CUDA C programming guide. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 14 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race atomicCAS(int *address, int compare, conditions, atomics, locks, int val): needed for locks mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and mutex Warps 1.

Load more