CUDA C: race conditions, atomics, locks, mutex, and warps

Will Landau


Iowa State University

October 21, 2013

Outline

- Race conditions
- Brute force fixes: atomics, locks, and mutex
- Warps


Race conditions

- Let int *x point to global memory. (*x)++ happens in 3 steps:
  1. Read the value in *x into a register.
  2. Add 1 to the value read in step 1.
  3. Write the result back to *x.
- If we want parallel threads A and B to both increment *x, then we want something like:
  1. Thread A reads the value, 7, from *x.
  2. Thread A adds 1 to its value, 7, to make 8.
  3. Thread A writes its value, 8, back to *x.
  4. Thread B reads the value, 8, from *x.
  5. Thread B adds 1 to its value, 8, to make 9.
  6. Thread B writes the value, 9, back to *x.

- But since the threads are parallel, we might actually get:
  1. Thread A reads the value, 7, from *x.
  2. Thread B reads the value, 7, from *x.
  3. Thread A adds 1 to its value, 7, to make 8.
  4. Thread A writes its value, 8, back to *x.
  5. Thread B adds 1 to its value, 7, to make 8.
  6. Thread B writes the value, 8, back to *x.
- One increment is lost: *x ends at 8 instead of 9.

Example: race_condition.cu

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #include <cuda_runtime.h>

    __global__ void colonel(int *a_d){
      *a_d += 1;  // unprotected read-modify-write on shared global memory
    }

    int main(){

      int a = 0, *a_d;

      cudaMalloc((void**) &a_d, sizeof(int));
      cudaMemcpy(a_d, &a, sizeof(int), cudaMemcpyHostToDevice);

      float elapsedTime;
      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord( start, 0 );

      colonel<<<1000,1000>>>(a_d);

      cudaEventRecord( stop, 0 );
      cudaEventSynchronize( stop );
      cudaEventElapsedTime( &elapsedTime, start, stop );
      cudaEventDestroy( start );
      cudaEventDestroy( stop );
      printf("GPU Time elapsed: %f seconds\n", elapsedTime/1000.0);

      cudaMemcpy(&a, a_d, sizeof(int), cudaMemcpyDeviceToHost);

      printf("a = %d\n", a);
      cudaFree(a_d);

    }

    > nvcc race_condition.cu -o race_condition
    > ./race_condition
    GPU Time elapsed: 0.000148 seconds
    a = 88

- Since we started with a at 0, we should have gotten a = 1000 · 1000 = 1,000,000.

Race conditions

- Race condition: a computational hazard that arises when the results of the program depend on the timing of uncontrollable events, such as the execution order of threads.
- Many race conditions are caused by violations of the SIMD paradigm.
- Atomic operations and locks are brute force ways to fix race conditions.


Atomics

- Atomic operation: an operation that forces otherwise parallel threads into a bottleneck, executing the operation one at a time.
- In colonel(), replace

      *a_d += 1;

  with an atomic function,

      atomicAdd(a_d, 1);

  to fix the race condition in race_condition.cu.

race_condition_fixed.cu

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #include <cuda_runtime.h>

    __global__ void colonel(int *a_d){
      atomicAdd(a_d, 1);  // atomic read-modify-write: no lost updates
    }

    int main(){

      int a = 0, *a_d;

      cudaMalloc((void**) &a_d, sizeof(int));
      cudaMemcpy(a_d, &a, sizeof(int), cudaMemcpyHostToDevice);

      float elapsedTime;
      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord( start, 0 );

      colonel<<<1000,1000>>>(a_d);

      cudaEventRecord( stop, 0 );
      cudaEventSynchronize( stop );
      cudaEventElapsedTime( &elapsedTime, start, stop );
      cudaEventDestroy( start );
      cudaEventDestroy( stop );
      printf("GPU Time elapsed: %f seconds\n", elapsedTime/1000.0);

      cudaMemcpy(&a, a_d, sizeof(int), cudaMemcpyDeviceToHost);

      printf("a = %d\n", a);
      cudaFree(a_d);

    }

    > nvcc race_condition_fixed.cu -arch sm_20 -o race_condition_fixed
    > ./race_condition_fixed
    GPU Time elapsed: 0.01485 seconds
    a = 1000000

- We got the right answer this time, and execution was slower because we forced the threads to execute the addition sequentially.
- If you're using builtin atomic functions like atomicAdd(), use the -arch sm_20 flag in compilation.
  - This is to make sure you're using CUDA compute capability (version) 2.0 or above.

CUDA C builtin atomic functions

- With CUDA compute capability 2.0 or above, you can use:
  - atomicAdd()
  - atomicSub()
  - atomicMin()
  - atomicMax()
  - atomicInc()
  - atomicDec()
  - atomicExch()
  - atomicCAS()
  - atomicAnd()
  - atomicOr()
  - atomicXor()
- For documentation, refer to the CUDA C programming guide.
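As a rough illustration of how several of these builtins are called, the sketch below folds an array into a running sum, minimum, and maximum. The kernel name arrayStats, the host wrapper runArrayStats, and the test data (data[i] = i - 10) are all made up for this example; only atomicAdd(), atomicMin(), and atomicMax() come from the list above.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread folds one element into three results
// that all threads share, using builtin atomics on global memory.
__global__ void arrayStats(const int *data, int n, int *sum, int *lo, int *hi) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    atomicAdd(sum, data[i]);
    atomicMin(lo, data[i]);
    atomicMax(hi, data[i]);
  }
}

// Hypothetical host wrapper. Fills data[i] = i - 10 and returns
// stats[0] = sum, stats[1] = min, stats[2] = max.
void runArrayStats(int n, int stats[3]) {
  int *h_data = (int *) malloc(n * sizeof(int));
  for (int i = 0; i < n; ++i) h_data[i] = i - 10;

  int init[3] = {0, INT_MAX, INT_MIN};  // neutral starting values
  int *d_data, *d_stats;
  cudaMalloc((void **) &d_data, n * sizeof(int));
  cudaMalloc((void **) &d_stats, 3 * sizeof(int));
  cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_stats, init, 3 * sizeof(int), cudaMemcpyHostToDevice);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  arrayStats<<<blocks, threads>>>(d_data, n, d_stats, d_stats + 1, d_stats + 2);

  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(stats, d_stats, 3 * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_data);
  cudaFree(d_stats);
  free(h_data);
}

int main() {
  int stats[3];
  runArrayStats(1000, stats);
  printf("sum = %d, min = %d, max = %d\n", stats[0], stats[1], stats[2]);
  return 0;
}
```

With this test data the sum is 0 + 1 + ... + 999 minus 10 · 1000, that is 489500, the minimum is -10, and the maximum is 989, regardless of the order in which the threads run.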

atomicCAS(int *address, int compare, int val): needed for locks

1. Read the value, old, located at address.
2. *address = (old == compare) ? val : old;
3. Return old.
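The three steps above can be seen in a small sketch where many threads race to claim one slot. The names claim and runClaim are hypothetical. Every thread calls atomicCAS() with compare = 0, so only the first thread to arrive finds 0 at the address, swaps in its own ID plus one, and gets 0 back; every later thread gets a nonzero return value and leaves the slot unchanged.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread tries to swap (its ID + 1) into *owner,
// but the swap only happens if *owner still holds 0. atomicCAS returns the
// old value, so exactly one thread sees 0 and counts itself as a winner.
__global__ void claim(int *owner, int *winners) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  int old = atomicCAS(owner, 0, id + 1);
  if (old == 0) {
    atomicAdd(winners, 1);
  }
}

// Hypothetical host wrapper: result[0] = final *owner, result[1] = winners.
void runClaim(int blocks, int threads, int result[2]) {
  int init[2] = {0, 0};
  int *d;
  cudaMalloc((void **) &d, 2 * sizeof(int));
  cudaMemcpy(d, init, 2 * sizeof(int), cudaMemcpyHostToDevice);

  claim<<<blocks, threads>>>(d, d + 1);

  cudaMemcpy(result, d, 2 * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d);
}

int main() {
  int result[2];
  runClaim(8, 128, result);
  printf("owner = %d, winners = %d\n", result[0], result[1]);
  return 0;
}
```

Which thread wins varies from run to run, but the winner count should always be 1. This "only one thread gets through" property is what a lock builds on.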

Locks and mutex

- Lock: a mechanism that forces an entire segment of code to be executed atomically.
- mutex: "mutual exclusion", the principle behind locks.
- While a thread is running code inside a lock, it shuts all the other threads out of the lock.

      // The lock is created on the host and passed in, as in blockCounter.cu.
      __global__ void someKernel(Lock myLock){

        // some parallel code

        myLock.lock();
        // some sequential code
        myLock.unlock();

        // some parallel code
      }

The concept

[Figure: illustration of the lock concept (image not recovered).]

Lock.h

    struct Lock {
      int *mutex;  // lives in device global memory: 0 = free, 1 = taken

      Lock() {
        int state = 0;

        cudaMalloc((void**) &mutex, sizeof(int));
        cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
      }

      ~Lock() {
        cudaFree(mutex);
      }

      __device__ void lock() {
        // spin until this thread is the one that changes mutex from 0 to 1
        while(atomicCAS(mutex, 0, 1) != 0);
      }

      __device__ void unlock() {
        atomicExch(mutex, 0);
      }

    };

A closer look at the lock function

    __device__ void lock() {
      while(atomicCAS(mutex, 0, 1) != 0);
    }

- In pseudocode:

      __device__ void lock() {
        repeat {
          do atomically {

            if(mutex == 0) {
              mutex = 1;
              return_value = 0;
            }

            else if(mutex == 1) {
              return_value = 1;
            }
          } // do atomically

          if(return_value == 0)
            exit loop;

        } // repeat
      } // lock

Example: counting the number of blocks

- Compare the two kernels:

      __global__ void blockCounterUnlocked(int *nblocks){
        if(threadIdx.x == 0){
          *nblocks = *nblocks + 1;
        }
      }

      __global__ void blockCounter1(Lock lock, int *nblocks){
        if(threadIdx.x == 0){
          lock.lock();
          *nblocks = *nblocks + 1;
          lock.unlock();
        }
      }

blockCounter.cu

    #include <stdio.h>  // for printf
    #include "../common/lock.h"
    #define NBLOCKS_TRUE 512
    #define NTHREADS_TRUE 512 * 2

    __global__ void blockCounterUnlocked( int *nblocks ){
      if(threadIdx.x == 0){
        *nblocks = *nblocks + 1;
      }
    }

    __global__ void blockCounter1( Lock lock, int *nblocks ){
      if(threadIdx.x == 0){
        lock.lock();
        *nblocks = *nblocks + 1;
        lock.unlock();
      }
    }

    int main(){
      int nblocks_host, *nblocks_dev;
      Lock lock;
      float elapsedTime;
      cudaEvent_t start, stop;

      cudaMalloc((void**) &nblocks_dev, sizeof(int));

      // blockCounterUnlocked:

      nblocks_host = 0;
      cudaMemcpy( nblocks_dev, &nblocks_host, sizeof(int), cudaMemcpyHostToDevice );

      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord( start, 0 );

      blockCounterUnlocked<<<NBLOCKS_TRUE, NTHREADS_TRUE>>>(nblocks_dev);

      cudaEventRecord( stop, 0 );
      cudaEventSynchronize( stop );
      cudaEventElapsedTime( &elapsedTime, start, stop );

      cudaEventDestroy( start );
      cudaEventDestroy( stop );

      cudaMemcpy( &nblocks_host, nblocks_dev, sizeof(int), cudaMemcpyDeviceToHost );
      printf("blockCounterUnlocked <<< %d, %d >>> () counted %d blocks in %f ms.\n",
             NBLOCKS_TRUE,
             NTHREADS_TRUE,
             nblocks_host,
             elapsedTime);

      // blockCounter1:

      nblocks_host = 0;
      cudaMemcpy( nblocks_dev, &nblocks_host, sizeof(int), cudaMemcpyHostToDevice );

      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord( start, 0 );

      blockCounter1<<<NBLOCKS_TRUE, NTHREADS_TRUE>>>(lock, nblocks_dev);

      cudaEventRecord( stop, 0 );
      cudaEventSynchronize( stop );
      cudaEventElapsedTime( &elapsedTime, start, stop );

      cudaEventDestroy( start );
      cudaEventDestroy( stop );

      cudaMemcpy( &nblocks_host, nblocks_dev, sizeof(int), cudaMemcpyDeviceToHost );
      printf("blockCounter1 <<< %d, %d >>> () counted %d blocks in %f ms.\n",
             NBLOCKS_TRUE,
             NTHREADS_TRUE,
             nblocks_host,
             elapsedTime);

      cudaFree(nblocks_dev);
    }

    > nvcc blockCounter.cu -arch sm_20 -o blockCounter
    > ./blockCounter
    blockCounterUnlocked <<< 512, 1024 >>> () counted 47 blocks in 0.057920 ms.
    blockCounter1 <<< 512, 1024 >>> () counted 512 blocks in 0.636064 ms.
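Because the protected update here is a single integer increment, the builtin atomicAdd() from earlier would probably serve just as well as the lock. The following is a minimal sketch under that assumption; the names blockCounterAtomic and countBlocks are hypothetical and are not part of blockCounter.cu.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical variant of blockCounter1: thread 0 of each block adds 1 to
// the shared counter with a builtin atomic instead of taking a lock.
__global__ void blockCounterAtomic(int *nblocks) {
  if (threadIdx.x == 0) {
    atomicAdd(nblocks, 1);
  }
}

// Hypothetical host wrapper: launches the kernel and returns the count.
int countBlocks(int blocks, int threads) {
  int n = 0, *n_dev;
  cudaMalloc((void **) &n_dev, sizeof(int));
  cudaMemcpy(n_dev, &n, sizeof(int), cudaMemcpyHostToDevice);

  blockCounterAtomic<<<blocks, threads>>>(n_dev);

  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(&n, n_dev, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(n_dev);
  return n;
}

int main() {
  printf("counted %d blocks\n", countBlocks(512, 1024));
  return 0;
}
```

A lock earns its keep when the protected section is more than one builtin atomic can express, for example an update that touches several variables together.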

blockCounter.cu pauses indefinitely with this kernel

    __global__ void blockCounter2(Lock lock, int *nblocks){
      lock.lock();
      if(threadIdx.x == 0){
        *nblocks = *nblocks + 1;
      }
      lock.unlock();
    }

- Why? Warps.


Warps

- Warp: a group of 32 threads in the same block that execute in lockstep.
  - That is, they synchronize after every step (as if __syncthreads() is called as often as possible).
- All blocks are partitioned into warps.

Threads in the same warp:

[Figure: timeline of threads A and B, each running Step 1, Step 2, and Step 3, with Wait periods that keep the two threads in step.]
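To make the partition into warps concrete, the sketch below computes which warp and which position within the warp (often called the lane) each thread of a one-dimensional block falls into. The names warpInfo, runWarpInfo, and warpsPerBlock are hypothetical, and the host helper hard-codes the warp size of 32 stated above; warpSize is the builtin device variable.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: for a 1D block, consecutive groups of warpSize
// threads form one warp. Each thread records its warp and lane index.
__global__ void warpInfo(int *warp, int *lane) {
  int t = threadIdx.x;
  warp[t] = t / warpSize;
  lane[t] = t % warpSize;
}

// Hypothetical helper: warps needed for one block, assuming 32 threads
// per warp. A partially filled last warp still counts as a whole warp.
int warpsPerBlock(int threads) {
  return (threads + 31) / 32;
}

// Hypothetical host wrapper: runs one block and copies the indices back.
void runWarpInfo(int threads, int *warp, int *lane) {
  int *d_warp, *d_lane;
  size_t bytes = threads * sizeof(int);
  cudaMalloc((void **) &d_warp, bytes);
  cudaMalloc((void **) &d_lane, bytes);

  warpInfo<<<1, threads>>>(d_warp, d_lane);

  cudaMemcpy(warp, d_warp, bytes, cudaMemcpyDeviceToHost);
  cudaMemcpy(lane, d_lane, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_warp);
  cudaFree(d_lane);
}

int main() {
  const int threads = 1000;  // same block size as race_condition.cu
  int warp[threads], lane[threads];
  runWarpInfo(threads, warp, lane);

  int picks[4] = {0, 31, 32, 999};
  for (int k = 0; k < 4; ++k) {
    int t = picks[k];
    printf("thread %3d: warp %2d, lane %2d\n", t, warp[t], lane[t]);
  }
  printf("warps per block: %d\n", warpsPerBlock(threads));
  return 0;
}
```

For a block of 1000 threads this gives 32 warps: threads 0 to 31 form warp 0, threads 32 to 63 form warp 1, and the last warp (warp 31) holds only threads 992 to 999.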

Threads in different warps:

[Figure: timeline of threads A and B, each running Step 1, Step 2, and Step 3 at its own pace, with no waiting.]

Warps and locks

Threads in different warps:

- Thread A:
  1. Get to the lock first and reserve it.
  2. Do instruction set I (atomic).
  3. Unlock the lock.
  4. Do instruction set II (parallel).
- Thread B:
  1. Get to the lock second and get stuck in the lock's loop.
  2. Wait for thread A to unlock the lock.
  3. Reserve the lock.
  4. Do instruction set I (atomic).

Threads in the same warp:

- Thread A:
  1. Get to the lock first and block all the other threads.
  2. Wait for thread B to exit the lock's loop.
- Thread B:
  1. Get to the lock second and get stuck in the lock's loop.
  2. Wait for thread A to unlock the lock.
- Each thread waits on the other, so neither can proceed. This is why blockCounter2 pauses indefinitely.
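One possible way around the same-warp hang, offered as a sketch and not as the approach used in blockCounter.cu, is to let the threads of a warp take turns so that at most one thread per warp contends for the lock at any moment. The names lockMutex, unlockMutex, countThreads, and countAllThreads are hypothetical. The lock is written as free functions on a raw device pointer to keep the sketch self-contained, and the __threadfence() call and the volatile qualifier are additions meant to keep the counter's reads and writes visible across blocks. Newer GPUs that schedule the threads of a warp independently may not hang on blockCounter2 at all, so treat this as an illustration of the lockstep model described above.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Spin lock in the style of Lock.h: 0 = free, 1 = taken.
__device__ void lockMutex(int *mutex) {
  while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlockMutex(int *mutex) {
  __threadfence();       // publish the critical section's writes first
  atomicExch(mutex, 0);
}

// Every thread increments *count under the lock, but the lanes of a warp
// take turns: in each loop iteration only the thread whose lane matches
// `turn` enters the branch, so no two threads of one warp spin together.
__global__ void countThreads(int *mutex, volatile int *count) {
  int lane = threadIdx.x % warpSize;
  for (int turn = 0; turn < warpSize; ++turn) {
    if (lane == turn) {
      lockMutex(mutex);
      *count = *count + 1;
      unlockMutex(mutex);
    }
  }
}

// Hypothetical host wrapper: returns the final counter value.
int countAllThreads(int blocks, int threads) {
  int init[2] = {0, 0};  // init[0] = mutex, init[1] = counter
  int *d;
  cudaMalloc((void **) &d, 2 * sizeof(int));
  cudaMemcpy(d, init, 2 * sizeof(int), cudaMemcpyHostToDevice);

  countThreads<<<blocks, threads>>>(d, d + 1);

  int result[2];
  cudaMemcpy(result, d, 2 * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d);
  return result[1];
}

int main() {
  printf("counted %d of %d threads\n", countAllThreads(4, 64), 4 * 64);
  return 0;
}
```

The thread holding the lock is always the only active thread of its warp inside the branch, so it can reach the unlock without waiting on its warp-mates. The cost is that the critical section is fully serialized, one thread at a time across the whole grid.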


Resources

- Texts:
  1. J. Sanders and E. Kandrot. CUDA by Example. Addison-Wesley, 2010.
  2. D. Kirk and W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
- Code from today:
  - race_condition.cu
  - race_condition_fixed.cu
  - blockCounter.cu
- Dot product with atomic operations:
  - dot_product_atomic_builtin.cu
  - dot_product_atomic_lock.cu

That's all for today.

- Series materials are available at http://will-landau.com/gpu.