CUDA C: Race Conditions, Atomics, Locks, Mutex, and Warps

CUDA C: race conditions, atomics, locks, mutex, and warps Will Landau Race conditions CUDA C: race conditions, atomics, locks, Brute force fixes: atomics, locks, and mutex, and warps mutex Warps Will Landau Iowa State University October 21, 2013 Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 1 / 33 CUDA C: race Outline conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and Race conditions mutex Warps Brute force fixes: atomics, locks, and mutex Warps Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 2 / 33 Race conditions CUDA C: race Outline conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and Race conditions mutex Warps Brute force fixes: atomics, locks, and mutex Warps Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 3 / 33 Race conditions CUDA C: race Race conditions conditions, atomics, locks, mutex, and warps Will Landau I Let int *x point to global memory. *x++ happens in 3 Race conditions steps: Brute force fixes: 1. Read the value in *x into a register. atomics, locks, and mutex 2. Add 1 to the value read in step 1. Warps 3. Write the result back to *x. I If we want parallel threads A and B to both increment *x, then we want something like: 1. Thread A reads the value, 7, from *x. 2. Thread A adds 1 to its value, 7, to make 8. 3. Thread A writes its value, 8, back to *x. 4. Thread B reads the value, 8, from *x. 5. Thread B adds 1 to its value, 8, to make 9. 6. Thread B writes the value, 9, back to *x. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 4 / 33 Race conditions CUDA C: race Race conditions conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: I But since the threads are parallel, we might actually atomics, locks, and mutex get: Warps 1. Thread A reads the value, 7, from *x. 2. Thread B reads the value, 7, from *x. 3. Thread A adds 1 to its value, 7, to make 8. 4. Thread A writes its value, 8, back to *x. 5. Thread B adds 1 to its value, 7, to make 8. 6. Thread B writes the value, 8, back to *x. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 5 / 33 Race conditions CUDA C: race Example: race condition.cu conditions, atomics, locks, mutex, and warps Will Landau 1#include <s t d i o . h> 2#include <s t d l i b . h> Race conditions 3#include <cuda . h> 4#include <c u d a r u n t i m e . h> Brute force fixes: 5 atomics, locks, and 6 g l o b a l v o i d colonel(int ∗a d )f mutex 7 ∗a d += 1 ; 8 g Warps 9 10 int main() f 11 12 int a=0, ∗a d ; 13 14 cudaMalloc ( (void ∗∗) &a d , sizeof(int)); 15 cudaMemcpy ( a d , &a , sizeof(int) , cudaMemcpyHostToDevice); 16 17 float elapsedTime; 18 c u d a E v e n t t start, stop; 19 cudaEventCreate(&start); 20 cudaEventCreate(&stop); 21 cudaEventRecord( start, 0 ); 22 23 c o l o n e l <<<1000,1000>>>(a d ) ; Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 6 / 33 Race conditions CUDA C: race Example: race condition.cu conditions, atomics, locks, mutex, and warps 24 cudaEventRecord( stop, 0 ); Will Landau 25 cudaEventSynchronize( stop ); 26 cudaEventElapsedTime( &elapsedTime, start , stop ); Race conditions 27 cudaEventDestroy( start ); 28 cudaEventDestroy( stop ); Brute force fixes: 29 p r i n t f ("GPU Time elapsed:%f seconds nn", elapsedTime/1000.0); atomics, locks, and 30 mutex 31 32 cudaMemcpy(&a , a d , sizeof(int) , cudaMemcpyDeviceToHost); Warps 33 34 p r i n t f ("a=%d nn", a); 35 cudaFree ( a d ) ; 36 37 g 1 > nvcc r a c e condition.cu −o r a c e c o n d i t i o n 2 > . / r a c e c o n d i t i o n 3 GPU Time elapsed: 0.000148 seconds 4 a = 88 I Since we started with a at 0, we should have gotten a = 1000 · 1000 = 1; 000; 000. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 7 / 33 Race conditions CUDA C: race Race conditions conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and I Race condition: A computational hazard that arises mutex when the results of the program depend on the timing Warps of uncontrollable events, such as the execution order or threads. I Many race conditions are caused by violations of the SIMD paradigm. I Atomic operations and locks are brute force ways to fix race conditions. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 8 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race Outline conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and Race conditions mutex Warps Brute force fixes: atomics, locks, and mutex Warps Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 9 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race Atomics conditions, atomics, locks, mutex, and warps Will Landau I Atomic operation: an operation that forces otherwise Race conditions parallel threads into a bottleneck, executing the Brute force fixes: atomics, locks, and operation one at a time. mutex Warps I In colonel(), replace *a d += 1; with an atomic function, atomicAdd(a d, 1); to fix the race condition in race condition.cu. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 10 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race race condition fixed.cu conditions, atomics, locks, mutex, and warps Will Landau 1#include <s t d i o . h> 2#include <s t d l i b . h> Race conditions 3#include <cuda . h> 4#include <c u d a r u n t i m e . h> Brute force fixes: 5 atomics, locks, and 6 g l o b a l v o i d colonel(int ∗a d )f mutex 7 atomicAdd(a d , 1) ; 8 g Warps 9 10 int main() f 11 12 int a=0, ∗a d ; 13 14 cudaMalloc ( (void ∗∗) &a d , sizeof(int)); 15 cudaMemcpy ( a d , &a , sizeof(int) , cudaMemcpyHostToDevice); 16 17 float elapsedTime; 18 c u d a E v e n t t start, stop; 19 cudaEventCreate(&start); 20 cudaEventCreate(&stop); 21 cudaEventRecord( start, 0 ); 22 23 c o l o n e l <<<1000,1000>>>(a d ) ; Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 11 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race race condition fixed.cu conditions, atomics, locks, mutex, and warps Will Landau Race conditions Brute force fixes: 24 cudaEventRecord( stop, 0 ); atomics, locks, and 25 cudaEventSynchronize( stop ); mutex 26 cudaEventElapsedTime( &elapsedTime, start , stop ); 27 cudaEventDestroy( start ); Warps 28 cudaEventDestroy( stop ); 29 p r i n t f ("GPU Time elapsed:%f seconds nn", elapsedTime/1000.0); 30 31 32 cudaMemcpy(&a , a d , sizeof(int) , cudaMemcpyDeviceToHost); 33 34 p r i n t f ("a=%d nn", a); 35 cudaFree ( a d ) ; 36 37 g Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 12 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race race condition fixed.cu conditions, atomics, locks, mutex, and warps Will Landau 1 > nvcc r a c e c o n d i t i o n f i x e d . cu −arch sm 20 −o r a c e c o n d i t i o n f i x e d Race conditions 2 > . / r a c e d o n c i t i o n f i x e d Brute force fixes: atomics, locks, and 3 GPU Time elapsed: 0.01485 seconds mutex 4 a = 1000000 Warps I We got the right answer this time, and execution was slower because we forced the threads to execute the addition sequentially. I If you're using builtin atomic functions like atomicAdd(), use the -arch sm 20 flag in compilation. I This is to make sure you're using CUDA compute capability (version) 2.0 or above. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 13 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race CUDA C builtin atomic functions conditions, atomics, locks, mutex, and warps I With CUDA compute capability 2.0 or above, you can Will Landau use: Race conditions I atomicAdd() Brute force fixes: I atomicSub() atomics, locks, and mutex I atomicMin() Warps I atomicMax() I atomicInc() I atomicDec() I atomicAdd() I atomicExch() I atomicCAS() I atomicAnd() I atomicOr() I atomicXor() I For documentation, refer to the CUDA C programming guide. Will Landau (Iowa State University) CUDA C: race conditions, atomics, locks, mutex, and warpsOctober 21, 2013 14 / 33 Brute force fixes: atomics, locks, and mutex CUDA C: race atomicCAS(int *address, int compare, conditions, atomics, locks, int val): needed for locks mutex, and warps Will Landau Race conditions Brute force fixes: atomics, locks, and mutex Warps 1.

CUDA C: Race Conditions, Atomics, Locks, Mutex, and Warps

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support