
Lecture 2: different memory and variable types

Prof. Mike Giles
[email protected]
Oxford University Mathematical Institute
Oxford e-Research Centre

Memory

Key challenge in modern architecture:
- no point in blindingly fast computation if data can't be moved in and out fast enough
- need lots of memory for big applications
- very fast memory is also very expensive
- end up being pushed towards a hierarchical design


CPU Memory Hierarchy

- Main memory: 32 – 128 GB, 2GHz DDR4; 200+ cycle access, 60 – 120 GB/s
- L3 cache: 12 – 24 MB, 2GHz SRAM; 25 – 35 cycle access, 200 – 400 GB/s
- L1/L2 cache: 32KB + 512KB, 3GHz SRAM; 5 – 12 cycle access
- registers

Moving down the hierarchy: faster, more expensive, smaller.

Execution speed relies on exploiting data locality:
- temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache
- spatial locality: neighbouring data is also likely to be used soon, so load it into the cache at the same time using a 'wide' bus (like a multi-lane motorway)

This wide bus is the only way to get high bandwidth to the slow main memory.

Caches

The cache line is the basic unit of data transfer; typical size is 64 bytes ≡ 8 × 8-byte items.

With a single cache, when the CPU loads data into a register:
- it looks for the line in the cache
- if there (hit), it gets the data
- if not (miss), it gets the entire line from main memory, displacing an existing line in the cache (usually the least recently used)

When the CPU stores data from a register: same procedure.

Importance of Locality

Typical workstation:
- 20 Gflops per core
- 40 GB/s L3 ↔ L2 cache bandwidth
- 64 bytes/line

40 GB/s ≡ 600M lines/s ≡ 5G doubles/s

At worst, each flop requires 2 inputs and has 1 output, forcing the loading of 3 lines ⇒ 200 Mflops.

If all 8 variables/line are used, then this increases to 1.6 Gflops.

To get up to 20 Gflops needs temporal locality, re-using data already in the L2 cache.
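Written out, the arithmetic on this slide is:

\[
\frac{40\,\mathrm{GB/s}}{64\,\mathrm{bytes/line}} \approx 600\,\mathrm{M\ lines/s},
\qquad
\frac{600\,\mathrm{M\ lines/s}}{3\,\mathrm{lines/flop}} \approx 200\,\mathrm{Mflops},
\qquad
200\,\mathrm{Mflops} \times 8\,\mathrm{doubles/line} = 1.6\,\mathrm{Gflops}.
\]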

Pascal

[Diagram: a Pascal GPU has many SMs, each with its own L1 cache and shared memory, all sharing a unified L2 cache in front of device memory.]

- usually a 32-byte cache line (8 floats or 4 doubles)
- P100: 4096-bit memory path from HBM2 device memory to L2 cache ⇒ up to 720 GB/s bandwidth
- GeForce cards: 384-bit memory bus from GDDR5 device memory to L2 cache ⇒ up to 320 GB/s
- unified 4MB L2 cache for all SMs
- each SM has 64-96kB of shared memory, and 24-48kB of L1 cache
- no global cache coherency as in CPUs, so should (almost) never have different blocks updating the same global array elements
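These sizes differ from card to card; a minimal sketch of how they can be queried at run time with the standard CUDA runtime API (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // properties of device 0

  printf("device name:       %s\n",    prop.name);
  printf("number of SMs:     %d\n",    prop.multiProcessorCount);
  printf("L2 cache size:     %d B\n",  prop.l2CacheSize);
  printf("shared mem per SM: %zu B\n", prop.sharedMemPerMultiprocessor);
  printf("memory bus width:  %d bits\n", prop.memoryBusWidth);

  return 0;
}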

GPU Memory Hierarchy

- Device memory: 4 – 12 GB, 5GHz GDDR5; 200 – 300 cycle access, 250 – 500 GB/s
- L2 cache: 2 – 4 MB; 200 – 300 cycle access, 500 – 1000 GB/s
- L1 cache: 16/32/48KB; 80 cycle access
- registers

Moving down the hierarchy: faster, more expensive, smaller.

Importance of Locality

5 Tflops GPU:
- 320 GB/s memory ↔ L2 cache bandwidth
- 32 bytes/line

320 GB/s ≡ 10G lines/s ≡ 40G doubles/s

At worst, each flop requires 2 inputs and has 1 output, forcing the loading of 3 lines ⇒ 3 Gflops.

If all 4 doubles/line are used, this increases to 13 Gflops.

To get up to 2 Tflops needs about 50 flops per double transferred to/from device memory.

Even with careful implementation, many algorithms are bandwidth-limited, not compute-bound.

Practical 1 kernel

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[tid] = threadIdx.x;
}

- 32 threads in a warp will address neighbouring elements of array x
- if the data is correctly "aligned" so that x[0] is at the beginning of a cache line, then x[0] – x[31] will be in the same cache line – a "coalesced" transfer
- hence we get perfect spatial locality

A bad kernel

__global__ void bad_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[1000*tid] = threadIdx.x;
}

- in this case, different threads within a warp access widely spaced elements of array x – a "strided" access
- each access involves a different cache line, so performance will be awful
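For completeness, a minimal host-side sketch for launching my_first_kernel; the block/grid sizes and the synchronisation call are illustrative, not prescribed by the practical:

#include <cuda_runtime.h>

int main()
{
  const int nblocks = 2, nthreads = 64;            // illustrative sizes
  const int nsize   = nblocks*nthreads;

  float *d_x;
  cudaMalloc((void **)&d_x, nsize*sizeof(float));  // device array

  my_first_kernel<<<nblocks,nthreads>>>(d_x);      // one thread per element
  cudaDeviceSynchronize();                         // wait for the kernel to finish

  cudaFree(d_x);
  return 0;
}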

Global arrays

So far, we have concentrated on global / device arrays:
- held in the large device memory
- allocated by host code
- pointers held by host code and passed into kernels
- continue to exist until freed by host code
- since blocks execute in an arbitrary order, if one block modifies an array element, no other block should read or write that same element

Global variables

Global variables can also be created by declarations with global scope within the kernel code file:

__device__ int reduction_lock=0;

__global__ void kernel_1(...)
{
  ...
}

__global__ void kernel_2(...)
{
  ...
}


Global variables

- the __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU
- the variable can be read and modified by any kernel
- its lifetime is the lifetime of the whole application
- can also declare arrays of fixed size
- can be read/written by host code using the special routines cudaMemcpyToSymbol, cudaMemcpyFromSymbol, or with standard cudaMemcpy in combination with cudaGetSymbolAddress
- in my own CUDA programming, I rarely use this capability, but it is occasionally very useful

Constant variables

Very similar to global variables, except that they can't be modified by kernels:
- defined with global scope within the kernel file using the prefix __constant__
- initialised by the host code using cudaMemcpyToSymbol, cudaMemcpyFromSymbol, or cudaMemcpy in combination with cudaGetSymbolAddress
- I use them all the time in my applications; practical 2 has an example
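A minimal sketch of these host-side symbol routines, reusing reduction_lock from the earlier slide; the __constant__ array coeffs and the function name are illustrative assumptions:

#include <cuda_runtime.h>

__device__   int   reduction_lock = 0;
__constant__ float coeffs[3];                 // illustrative constant array

void setup()
{
  // set the __device__ variable from the host
  int zero = 0;
  cudaMemcpyToSymbol(reduction_lock, &zero, sizeof(int));

  // initialise the __constant__ array from the host
  float h_coeffs[3] = {0.25f, 0.5f, 0.25f};
  cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

  // read a __device__ variable back to the host
  int lock_value;
  cudaMemcpyFromSymbol(&lock_value, reduction_lock, sizeof(int));
}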

Constant variables

- only 64KB of constant memory, but the big benefit is that each SM has a 10KB constant cache
- when all threads read the same constant, it is almost as fast as a register
- doesn't tie up a register, so very helpful in minimising the total number of registers required

Constants

A constant variable has its value set at run-time.

But code also often has plain constants whose value is known at compile-time:

#define PI 3.1415926f

a = b / (2.0f * PI);

Leave these as they are – they seem to be embedded into the executable code so they don't use up any registers.

Don't forget the f at the end if you want single precision; in C/C++, single × double = double.
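A small illustration of that promotion rule (the kernel and variable names here are purely illustrative):

#define PI 3.1415926f                   // single-precision literal, note the f

__global__ void scale(float *a, const float *b, int n)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  if (tid < n)
    a[tid] = b[tid] / (2.0f * PI);      // everything stays in single precision
    // writing 2.0 * PI instead would promote the product to double,
    // and the division would then be carried out in double precision
}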

Registers

Within each kernel, by default, individual variables are assigned to registers:

__global__ void lap(int I, int J,
                    float *u1, float *u2) {

  int i  = threadIdx.x + blockIdx.x*blockDim.x;
  int j  = threadIdx.y + blockIdx.y*blockDim.y;
  int id = i + j*I;

  if (i==0 || i==I-1 || j==0 || j==J-1) {
    u2[id] = u1[id];                    // Dirichlet b.c.'s
  }
  else {
    u2[id] = 0.25f * ( u1[id-1] + u1[id+1]
                     + u1[id-I] + u1[id+I] );
  }
}

- 64K 32-bit registers per SM
- up to 255 registers per thread
- up to 2048 threads (at most 1024 per thread block)
- max registers per thread ⇒ 256 threads
- max threads ⇒ 32 registers per thread
- big difference between "fat" and "thin" threads
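A host-side launch for this 2D kernel might look like the following sketch (the 16×16 block size is an illustrative choice, and I and J are assumed to be multiples of 16):

void run_lap(int I, int J, float *d_u1, float *d_u2)
{
  dim3 threads(16, 16);               // 16 x 16 = 256 threads per block
  dim3 blocks(I/16, J/16);            // one thread per grid point

  lap<<<blocks, threads>>>(I, J, d_u1, d_u2);
  cudaDeviceSynchronize();            // wait for the kernel to finish
}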

Registers

What happens if your application needs more registers?

They "spill" over into L1 cache, and from there to device memory – the precise mechanism is unclear, but either
- certain variables become device arrays with one element per thread, or
- the contents of some registers get "saved" to device memory so they can be used for other purposes, then the data gets "restored" later.

Either way, the application suffers from the latency and bandwidth implications of using device memory.

Local arrays

What happens if your application uses a little array?

__global__ void lap(float *u) {

  float ut[3];

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  for (int k=0; k<3; k++)
    ut[k] = u[tid+k*gridDim.x*blockDim.x];

  for (int k=0; k<3; k++)
    u[tid+k*gridDim.x*blockDim.x] =
        A[3*k]*ut[0] + A[3*k+1]*ut[1] + A[3*k+2]*ut[2];
}
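The coefficient array A is not declared on this slide; a plausible (assumed) declaration, consistent with the earlier discussion of constant memory, would be:

// assumed 3x3 coefficient matrix, held in constant memory and
// initialised by the host with cudaMemcpyToSymbol
__constant__ float A[9];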

Local arrays

In simple cases like this (quite common) the compiler converts to scalar registers:

__global__ void lap(float *u) {
  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  float ut0 = u[tid+0*gridDim.x*blockDim.x];
  float ut1 = u[tid+1*gridDim.x*blockDim.x];
  float ut2 = u[tid+2*gridDim.x*blockDim.x];

  u[tid+0*gridDim.x*blockDim.x] =
       A[0]*ut0 + A[1]*ut1 + A[2]*ut2;
  u[tid+1*gridDim.x*blockDim.x] =
       A[3]*ut0 + A[4]*ut1 + A[5]*ut2;
  u[tid+2*gridDim.x*blockDim.x] =
       A[6]*ut0 + A[7]*ut1 + A[8]*ut2;
}

Local arrays

In more complicated cases, the compiler puts the array into device memory:
- this is because registers are not dynamically addressable – the compiler has to specify exactly which registers are used for each instruction
- still referred to in the documentation as a "local array" because each thread has its own private copy
- held in L1 cache by default, and may never be transferred to device memory
- 48kB of L1 cache equates to 12k 32-bit variables, which is only 12 per thread when using 1024 threads
- beyond this, it will have to spill to device memory

Shared memory

In a kernel, the prefix __shared__, as in

__shared__ int   x_dim;
__shared__ float x[128];

declares data to be shared between all of the threads in the thread block – any thread can set its value, or read it.

There can be several benefits:
- essential for operations requiring communication between threads (e.g. summation in lecture 4)
- useful for data re-use
- alternative to local arrays in device memory

If a thread block has more than one warp, it's not pre-determined when each warp will execute its instructions – warp 1 could be many instructions ahead of warp 2, or well behind.

Consequently, we almost always need thread synchronisation to ensure correct use of shared memory. The instruction

__syncthreads();

inserts a "barrier"; no thread/warp is allowed to proceed beyond this point until the rest have reached it (like a roll call on a school outing).
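A minimal sketch putting both ideas together; the kernel below is an assumed example (not from the practicals) that reverses the elements within each block, with __syncthreads() separating the shared-memory writes from the reads:

__global__ void block_reverse(float *d_out, const float *d_in)
{
  __shared__ float tmp[128];          // one element per thread; blockDim.x <= 128 assumed

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  tmp[threadIdx.x] = d_in[tid];       // each thread writes one element
  __syncthreads();                    // wait until every warp has written its data

  // now safe to read an element written by a thread in a different warp
  d_out[tid] = tmp[blockDim.x - 1 - threadIdx.x];
}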


Shared memory

So far, we have discussed statically-allocated shared memory – the size is known at compile-time.

Can also create dynamic shared-memory arrays, but this is more complex.

The total size is specified by an optional third argument when launching the kernel:

kernel<<<blocks,threads,shared_bytes>>>(...)

Using this within the kernel function is complicated/tedious; see B.2.3 in the Programming Guide.

Read-only arrays

With "constant" variables, each thread reads the same value.

In other cases, we have arrays where the data doesn't change, but different threads read different items.

In this case, we can get improved performance by declaring the global array with the const and __restrict__ qualifiers, so that the compiler knows that it is read-only.
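A minimal sketch combining the two ideas above; the kernel name, the cyclic shift and the scaling parameter are illustrative assumptions:

__global__ void rotate_scale(float *out, const float* __restrict__ in, float a)
{
  extern __shared__ float tmp[];      // dynamic shared memory, size set at launch

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  tmp[threadIdx.x] = a * in[tid];     // "in" is const __restrict__, so the
                                      // compiler knows it is read-only
  __syncthreads();                    // all warps must finish writing first

  // each thread reads its neighbour's element (cyclically within the block)
  out[tid] = tmp[(threadIdx.x+1) % blockDim.x];
}

// launch, giving the shared-memory size in bytes as the optional third argument:
// rotate_scale<<<nblocks, nthreads, nthreads*sizeof(float)>>>(d_out, d_in, 2.0f);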

Non-blocking loads/stores

What happens with the following code?

__global__ void lap(float *u1, float *u2) {
  float a;

  a = u1[threadIdx.x + blockIdx.x*blockDim.x];
  ......
  c = b*a;
  u2[threadIdx.x + blockIdx.x*blockDim.x] = c;
  ......
}

The load doesn't block until the value is needed; the store also doesn't block.
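A hypothetical sketch of how this can be exploited: issue the load early, do independent work, and only use the loaded value afterwards (the extra arithmetic is purely illustrative):

__global__ void overlap_example(float *u1, float *u2, float b)
{
  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  float a = u1[tid];        // load issued here, but the thread does not stall yet

  float c = b*b + 1.0f;     // independent work overlaps with the outstanding load
  c = c*b;

  u2[tid] = c*a;            // first use of a: only here might the thread wait
}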


Active blocks per SM

Each block requires certain resources:
- threads
- registers (registers per thread × number of threads)
- shared memory (static + dynamic)

Together these determine how many blocks can be run simultaneously on each SM – up to a maximum of 32 blocks.

On Pascal:
- the number of active threads depends on the number of registers each one needs
- good to have at least 4 active blocks, each with at least 128 threads
- smaller number of blocks when each needs lots of shared memory
- larger number of blocks when they don't need shared memory

My general advice:
- maybe 4 big blocks (512 threads) if each needs a lot of shared memory
- maybe 12 small blocks (128 threads) if no shared memory is needed
- or 4 small blocks (128 threads) if each thread needs lots of registers

Very important to experiment with different block sizes to find what gives the best performance.
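The CUDA runtime can report this occupancy figure directly; a minimal sketch, assuming the lap kernel from earlier and a candidate block size of 128 threads with no dynamic shared memory:

#include <cstdio>
#include <cuda_runtime.h>

void report_occupancy()
{
  int numBlocks;     // active blocks of this size per SM

  // 128 threads per block, 0 bytes of dynamic shared memory
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, lap, 128, 0);

  printf("active blocks per SM: %d  (%d threads)\n", numBlocks, 128*numBlocks);
}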

Lecture 2 – p. 31 Lecture 2 – p. 32 Summary Key reading dynamic device arrays CUDA Programming Guide, version 9.0: static device variables / arrays Appendix B.1-B.4 – essential constant variables / arrays Chapter 3, sections 3.2.1-3.2.3 registers Other reading: spilled registers Wikipedia article on caches: local arrays en.wikipedia.org/wiki/CPU cache shared variables / arrays web article on caches: lwn.net/Articles/252125/ “Memory Performance and Cache Coherency Effects on an Nehalem Multiprocessor System”: portal.acm.org/citation.cfm?id=1637764
