Lecture 2: different memory and variable types

Prof. Mike Giles
[email protected]

Oxford University Mathematical Institute
Oxford e-Research Centre


Memory

Key challenge in modern computer architecture:
- no point in blindingly fast computation if data can't be moved in and out fast enough
- need lots of memory for big applications
- very fast memory is also very expensive
- end up being pushed towards a hierarchical design


CPU Memory Hierarchy

[Diagram: memory hierarchy, getting faster, more expensive and smaller towards the registers]
- Main memory: 2 - 8 GB, 1GHz DDR3, 200+ cycle access, 20-30 GB/s
- L3 cache: 2-6 MB, 2GHz SRAM, 25-35 cycle access
- L1/L2 cache: 32KB + 256KB, 3GHz SRAM, 5-12 cycle access
- registers


Memory Hierarchy

Execution speed relies on exploiting data locality:
- temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache
- spatial locality: neighbouring data is also likely to be used soon, so load it into the cache at the same time using a 'wide' bus (like a multi-lane motorway)
- this wide bus is the only way to get high bandwidth to slow main memory


Caches

The cache line is the basic unit of data transfer; typical size is 64 bytes ≡ 8 × 8-byte items.

With a single cache, when the CPU loads data into a register:
- it looks for the line in the cache
- if there (hit), it gets the data
- if not (miss), it gets the entire line from main memory, displacing an existing line in the cache (usually the least recently used)

When the CPU stores data from a register: same procedure.


Importance of Locality

Typical workstation:
- 10 Gflops CPU
- 20 GB/s memory ↔ L2 cache bandwidth
- 64 bytes/line

20 GB/s ≡ 300M line/s ≡ 2.4G double/s

At worst, each flop requires 2 inputs and has 1 output, forcing loading of 3 lines ⇒ 100 Mflops

If all 8 variables/line are used, then this increases to 800 Mflops.

To get up to 10 Gflops needs temporal locality, re-using data already in the cache.


Fermi GPU

[Diagram: Fermi GPU with multiple SMs, each with its own L1 cache / shared memory, all connected to a unified L2 cache and to device memory]


Fermi

- 128 byte cache line (32 floats or 16 doubles)
- 384-bit memory bus from device memory to L2 cache
- up to 190 GB/s bandwidth
- unified 768kB L2 cache for all SMs
- each SM has 16kB or 48kB of L1 cache (64kB is split 16/48 or 48/16 between L1 cache and shared memory)
- no global cache coherency as in CPUs, so should (almost) never have different blocks updating the same global array elements


GPU Memory Hierarchy

[Diagram: memory hierarchy, getting faster, more expensive and smaller towards the registers]
- Device memory: 1 - 6 GB, 1GHz GDDR5, 200-300 cycle access, 150 GB/s
- L2 cache: 768KB, 200-300 cycle access, 300 GB/s
- L1 cache: 16/48KB, 80 cycle access
- registers


Importance of Locality

1 Tflops GPU (single precision):
- 150 GB/s memory ↔ L2 cache bandwidth
- 128 bytes/line

150 GB/s ≡ 1200M line/s ≡ 20G double/s

At worst, each flop requires 2 inputs and has 1 output, forcing loading of 3 lines ⇒ 400 Mflops

If all 16 doubles/line are used, then this increases to 6.4 Gflops.

To get up to 500 Gflops needs about 25 flops per double transferred to/from device memory.

Even with careful implementation, many algorithms are bandwidth-limited not compute-bound.
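As an aside (not part of the original slides), the importance of spatial locality can be seen even in a minimal host-side C sketch: traversing a matrix row-by-row uses every element of each cache line, while traversing it column-by-column touches a new line on almost every access, so the second loop is typically several times slower. The array size and the timing approach are arbitrary choices for the illustration.

#include <stdio.h>
#include <time.h>

#define N 4096

static float a[N][N];            // 64MB, far larger than any cache

int main(void) {
  float sum = 0.0f;
  clock_t t0, t1;

  t0 = clock();
  for (int i=0; i<N; i++)        // row-major traversal: consecutive
    for (int j=0; j<N; j++)      // accesses share a cache line
      sum += a[i][j];
  t1 = clock();
  printf("row-major:    %.3f s\n", (double)(t1-t0)/CLOCKS_PER_SEC);

  t0 = clock();
  for (int j=0; j<N; j++)        // column-major traversal: each access
    for (int i=0; i<N; i++)      // is 16KB away from the previous one
      sum += a[i][j];
  t1 = clock();
  printf("column-major: %.3f s\n", (double)(t1-t0)/CLOCKS_PER_SEC);

  return sum != 0.0f;            // keep sum live so the loops are not removed
}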
Practical 1 kernel

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[tid] = threadIdx.x;
}

- 32 threads in a warp will address neighbouring elements of array x
- if the data is correctly "aligned" so that x[0] is at the beginning of a cache line, then x[0] - x[31] will be in the same cache line
- hence we get perfect spatial locality


A bad kernel

__global__ void bad_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[1000*tid] = threadIdx.x;
}

- in this case, different threads within a warp access widely spaced elements of array x
- each access involves a different cache line, so performance will be awful


Global arrays

So far we have concentrated on global / device arrays:
- held in the large device memory
- allocated by host code
- pointers held by host code and passed into kernels
- continue to exist until freed by host code
- since blocks execute in an arbitrary order, if one block modifies an array element, no other block should read or write that same element


Global variables

Global variables can also be created by declarations with global scope within the kernel code file:

__device__ int reduction_lock=0;

__global__ void kernel_1(...) {
  ...
}

__global__ void kernel_2(...) {
  ...
}


Global variables

- the __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU
- the variable can be read and modified by any kernel
- its lifetime is the lifetime of the whole application
- can also declare arrays of fixed size
- can be read/written by host code using the special routines cudaMemcpyToSymbol, cudaMemcpyFromSymbol, or with standard cudaMemcpy in combination with cudaGetSymbolAddress
- in my own CUDA programming, I rarely use this capability but it is occasionally very useful


Constant variables

Very similar to global variables, except that they can't be modified by kernels:
- defined with global scope within the kernel file using the prefix __constant__
- initialised by the host code using cudaMemcpyToSymbol, cudaMemcpyFromSymbol or cudaMemcpy in combination with cudaGetSymbolAddress
- I use it all the time in my applications; practical 2 has an example


Constant variables

- only 64KB of constant memory, but the big benefit is that each SM has an 8KB cache
- when all threads read the same constant, I think it's as fast as a register
- doesn't tie up a register, so very helpful in minimising the total number of registers required
- Fermi provides new capabilities to use the same cache for global arrays declared to be read-only within a kernel


Constants

A constant variable has its value set at run-time.

But code also often has plain constants whose value is known at compile-time:

#define PI 3.1415926f

a = b / (2.0f * PI);

Leave these as they are – they seem to be embedded into the executable code so they don't use up any registers.
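To make the constant-memory mechanics above concrete, here is a minimal sketch (not taken from the lecture or the practicals) of a __constant__ array declared in the kernel file, initialised from the host with cudaMemcpyToSymbol, and read inside a kernel. The names coeffs, scale_kernel and run are invented for the example.

__constant__ float coeffs[3];     // held in constant memory, cached on each SM

__global__ void scale_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  // every thread reads the same constants, so the reads come from the
  // constant cache and no registers are tied up holding the coefficients
  x[tid] = coeffs[0]*x[tid]*x[tid] + coeffs[1]*x[tid] + coeffs[2];
}

// host code: set the constants before launching the kernel
void run(float *d_x, int nblocks, int nthreads)
{
  float h_coeffs[3] = {1.0f, -2.0f, 3.0f};
  cudaMemcpyToSymbol(coeffs, h_coeffs, 3*sizeof(float));

  scale_kernel<<<nblocks,nthreads>>>(d_x);
}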
Registers

Within each kernel, by default, individual variables are assigned to registers:

__global__ void lap(int I, int J,
                    float *u1, float *u2) {

  int i  = threadIdx.x + blockIdx.x*blockDim.x;
  int j  = threadIdx.y + blockIdx.y*blockDim.y;
  int id = i + j*I;

  if (i==0 || i==I-1 || j==0 || j==J-1) {
    u2[id] = u1[id];          // Dirichlet b.c.'s
  }
  else {
    u2[id] = 0.25f * ( u1[id-1] + u1[id+1]
                     + u1[id-I] + u1[id+I] );
  }
}


Registers

- 32K 32-bit registers per SM
- up to 63 registers per thread
- up to 1536 threads (at most 1024 per thread block)
- max registers per thread ⇒ 520 threads
- max threads ⇒ 21 registers per thread
- not much difference between "fat" and "thin" threads


Registers

What happens if your application needs more registers?

They "spill" over into L1 cache, and from there to device memory – the precise mechanism is unclear, but either
- certain variables become device arrays with one element per thread
- or the contents of some registers get "saved" to device memory so they can be used for other purposes, then the data gets "restored" later

Either way, the application suffers from the latency and bandwidth implications of using device memory.


Registers

Avoiding register spill is now one of my main concerns in big applications, but remember:
- with 1024 threads, the 400-600 cycle latency of device memory is usually OK because some warps can do useful work while others wait for data
- provided there are 20 flops per variable read from (or written to) device memory, the bandwidth is not a limiting issue


Local arrays

What happens if your application uses a little array?

__global__ void lap(float *u) {

  float ut[3];

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  for (int k=0; k<3; k++)
    ut[k] = u[tid+k*gridDim.x*blockDim.x];

  for (int k=0; k<3; k++)
    u[tid+k*gridDim.x*blockDim.x] =
      A[3*k]*ut[0] + A[3*k+1]*ut[1] + A[3*k+2]*ut[2];
}


Local arrays

In simple cases like this (quite common) the compiler converts the array to scalar registers:

__global__ void lap(float *u) {
  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  float ut0 = u[tid+0*gridDim.x*blockDim.x];
  float ut1 = u[tid+1*gridDim.x*blockDim.x];
  float ut2 = u[tid+2*gridDim.x*blockDim.x];

  u[tid+0*gridDim.x*blockDim.x] =
    A[0]*ut0 + A[1]*ut1 + A[2]*ut2;
  u[tid+1*gridDim.x*blockDim.x] =
    A[3]*ut0 + A[4]*ut1 + A[5]*ut2;
  u[tid+2*gridDim.x*blockDim.x] =
    A[6]*ut0 + A[7]*ut1 + A[8]*ut2;
}


Local arrays

In more complicated cases, the compiler puts the array into device memory:
- still referred to in the documentation as a "local array" because each thread has its own private copy
- held in L1 cache by default, and may never be transferred to device memory
- 16kB of L1 cache equates to 4096 32-bit variables, which is only 8 per thread when using 1024 threads


Shared memory

In a kernel, the prefix __shared__ as in

__shared__ int   x_dim;
__shared__ float x[128];

declares data to be shared between all of the threads in the thread block – any thread can set its value, or read it.

There can be several benefits:
- essential for operations requiring communication between threads (e.g. ...
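As one illustration of the kind of inter-thread communication that shared memory makes possible (this example is not from the original slides), the sketch below has each thread block sum its own portion of an array. The kernel name block_sum and the fixed 256-entry shared array are assumptions made for the example; it also assumes blockDim.x is a power of 2 no larger than 256.

__global__ void block_sum(float *x, float *sums)
{
  __shared__ float partial[256];        // one entry per thread in the block

  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  partial[threadIdx.x] = x[tid];
  __syncthreads();                      // make all writes visible to the block

  // tree reduction within the block, halving the active threads each step
  for (int d=blockDim.x/2; d>0; d/=2) {
    if (threadIdx.x < d)
      partial[threadIdx.x] += partial[threadIdx.x + d];
    __syncthreads();
  }

  if (threadIdx.x==0) sums[blockIdx.x] = partial[0];
}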