Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures
SIMD or MIMD?
● Based on past history, the book predicts that for x86:
  – We will get two additional cores per chip per year (linear growth)
  – We will get twice-as-wide SIMD instructions every 4 years (exponential growth)
  – The potential speedup from SIMD is therefore twice that from MIMD!

Introduction: Focusing on SIMD
● SIMD architectures can exploit significant data-level parallelism for:
  – matrix-oriented scientific computing
  – media-oriented image and sound processing
● SIMD is more energy efficient than MIMD
  – Only one instruction needs to be fetched per data operation
  – This makes SIMD attractive for personal mobile devices
● SIMD allows the programmer to continue to think sequentially ("backwards compatibility")
● Three major implementations
  – Vector architectures
  – SIMD extensions
  – Graphics Processing Units (GPUs)

ILP vs. SIMD vs. MIMD
[Figure: projected potential speedup over time for ILP, MIMD, and SIMD.]

Vector Architecture Basics
● The underlying idea: the processor does everything on a set of numbers at a time
  – Read sets of data elements into "vector registers"
  – Operate on those vector registers
  – Disperse the results back into memory in sets
● Vector registers are controlled by the compiler to...
  – ...hide memory latency
    ● The price of latency is paid only once, for the first element of the vector
    ● No need for a sophisticated prefetcher: elements are always loaded consecutively from memory (unless gather/scatter is used)
  – ...leverage memory bandwidth

Example Architecture: VMIPS
● Loosely based on the Cray-1
● Vector registers
  – Each register holds a 64-element vector, 64 bits per element
  – The register file has 16 read ports and 8 write ports
● Vector functional units
  – Fully pipelined: a new instruction can start every cycle
  – Data and control hazards are detected
● Vector load-store unit
  – Fully pipelined
  – One word per clock cycle after the initial latency
● Scalar registers
  – The 32 general-purpose registers of MIPS
  – The 32 floating-point registers of MIPS

How Vector Processors Work: Y = a*X + Y
● MIPS scalar code executes almost 600 instructions for 64 elements:

        L.D    F0,a        ; load scalar a
        DADDIU R4,Rx,#512  ; last address to load
  Loop: L.D    F2,0(Rx)    ; load X[i]
        MUL.D  F2,F2,F0    ; a * X[i]
        L.D    F4,0(Ry)    ; load Y[i]
        ADD.D  F4,F4,F2    ; a*X[i] + Y[i]
        S.D    F4,0(Ry)    ; store into Y[i]
        DADDIU Rx,Rx,#8    ; increment index to X
        DADDIU Ry,Ry,#8    ; increment index to Y
        DSUB   R20,R4,Rx   ; compute bound
        BNEZ   R20,Loop    ; check if done

● The equivalent VMIPS vector code is 6 instructions, with no loop:

        L.D     F0,a       ; load scalar a
        LV      V1,Rx      ; load vector X
        MULVS.D V2,V1,F0   ; vector-scalar multiply
        LV      V3,Ry      ; load vector Y
        ADDVV.D V4,V2,V3   ; add two vectors
        SV      V4,Ry      ; store the sum

Vector Execution Time
● Execution time depends on three factors:
  – Length of the operand vectors
  – Structural hazards
  – Data dependencies
● In VMIPS:
  – Functional units consume one element per clock cycle
  – Execution time is therefore approximately the vector length
● Convoy
  – A set of vector instructions that could potentially execute together
  – No structural hazards within a convoy
  – RAW hazards are avoided through chaining (in a superscalar RISC the same idea is called forwarding)
  – The number of convoys gives an estimate of execution time

Chaining and Chimes
● Chaining
  – Like forwarding in a superscalar RISC
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available
  – Flexible chaining: chaining with any other instruction in execution, not just the one sharing a register
● Chime
  – The unit of time to execute one convoy
  – m convoys execute in m chimes
  – For a vector length of n, this requires m*n clock cycles
  – Ignores some processor-specific overheads, so a chime count is an underestimate
  – Better at estimating time for longer vectors

Example Calculation of Execution Time
● The VMIPS DAXPY sequence above groups into convoys:
  1. LV V1,Rx and MULVS.D V2,V1,F0
  2. LV V3,Ry and ADDVV.D V4,V2,V3
  3. SV V4,Ry
● 3 chimes, 2 FP operations per result, so cycles per FLOP = 1.5
● For 64-element vectors, this requires 64 * 3 = 192 clock cycles (a small sketch of this timing model follows below)
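As a worked companion to the calculation above, here is a minimal C sketch of the chime model. The per-convoy start-up latencies are assumptions: the 12-cycle figure is the Cray-1 vector-load latency quoted on the next slide, and charging the store convoy the same 12 cycles is a guess, not a VMIPS specification.

    #include <stdio.h>

    int main(void) {
        const int n = 64;  /* vector length */
        const int m = 3;   /* convoys: (LV, MULVS.D), (LV, ADDVV.D), (SV) */

        /* Pure chime model: m convoys * n elements = 192 cycles. */
        int chime_cycles = m * n;

        /* Adding start-up overhead: charge each convoy the latency of
           its slowest functional unit before its n elements stream out.
           The store convoy's 12 cycles are an assumption. */
        const int startup[] = {12, 12, 12};
        int total = 0;
        for (int i = 0; i < m; ++i)
            total += startup[i] + n;

        printf("chime estimate : %d cycles\n", chime_cycles);
        printf("with start-up  : %d cycles\n", total);
        return 0;
    }

This prints 192 and 228 cycles, illustrating why the chime count is an underestimate that matters less as vectors get longer.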
Vector Execution Challenges
● Start-up time
  – The latency of the vector functional unit
  – Assume the same latencies as the Cray-1:
    ● Floating-point add → 6 cycles
    ● Floating-point multiply → 7 cycles
    ● Floating-point divide → 20 cycles
    ● Vector load → 12 cycles
● Needed improvements
  – More than 1 element per clock cycle (compare with CPI < 1)
  – Non-64-wide vectors
  – IF statements in vector code
  – Memory system optimizations to support vector processors
  – Multidimensional matrices
  – Sparse matrices
  – Programming a vector computer

Vector Optimizations: Multiple Lanes
[Figure: a single pipeline computes C[i] = A[i] + B[i] one element per cycle, while four lanes (pipe 0 through pipe 3) each hold every fourth vector element (lane 0: V[1], V[5], V[9], ...; lane 1: V[2], V[6], V[10], ...; and so on) and together complete an element group C[1]..C[4] per cycle.]

Arbitrary Length Loops with the Vector-Length Register
● for (i = 0; i < N; ++i) Y[i] = a * X[i] + Y[i];
● What if N < 64 (= the Maximum Vector Length)?
● Use the Vector-Length Register (VLR)
  – Load N into the VLR
  – Vector instructions (loads, stores, arithmetic, ...) use the VLR as the maximum number of iterations
  – By default, VLR = 64
● Vector register utilization
  – If VLR < 64, the registers are not fully utilized
  – Vector start-up time becomes a greater portion of the instruction
  – Use Amdahl's law to calculate the effect of a diminishing vector length

Arbitrary Length Loops with Strip Mining
● for (i = 0; i < N; ++i) Y[i] = a * X[i] + Y[i];
● What if N > 64?
● Use strip mining to split the loop into smaller chunks (an alternative VLR-style formulation is sketched after this slide):

    for (j = 0; j < (N/64)*64; j += 64) {  // N - N%64 iterations
        for (i = 0; i < 64; ++i)
            Y[j+i] = a * X[j+i] + Y[j+i];
    }
    // clean-up code for N not divisible by 64
    if (N % 64)
        for (i = 0; i < N%64; ++i)  // use the VLR here
            Y[j+i] = a * X[j+i] + Y[j+i];
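The clean-up loop above relies on the VLR. For comparison, here is a minimal C sketch of the formulation used in Hennessy and Patterson's text, which processes the odd-sized strip first and then full MVL-sized strips. The inner loop stands in for one vector instruction sequence executed with VLR = vl; the function name and structure are illustrative, not VMIPS semantics.

    #include <stddef.h>

    #define MVL 64  /* maximum vector length, as in VMIPS */

    void daxpy_strip(size_t n, double a, const double *x, double *y) {
        size_t low = 0;
        size_t vl  = n % MVL;        /* odd-sized first strip; 0 if none */
        while (low < n) {
            if (vl == 0) vl = MVL;   /* all remaining strips are full */
            /* This loop models one vector sequence run with VLR = vl. */
            for (size_t i = low; i < low + vl; ++i)
                y[i] = a * x[i] + y[i];
            low += vl;
            vl = MVL;
        }
    }

Handling the short strip first means every later strip is exactly MVL long, so the VLR only needs to be reloaded once.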
Handling Conditional Statements with Mask Registers
● for (i = 0; i < 64; ++i) if (X[i] != 0) X[i] = X[i] - Y[i];
● The values of X are not known until runtime
  – The loop cannot be analyzed at compile time
● The conditional is an inherent part of the program logic
  – Scientific codes are full of conditionals, so hardware support is a must
● A vector mask register is used to prohibit execution of vector instructions on some elements of the vector:

    for (i = 0; i < 64; ++i)
        VectorMask[i] = (X[i] != 0);
    for (i = 0; i < 64; ++i)
        X[i] = X[i] - Y[i];  // only where VectorMask[i] != 0

Example: Vector Mask Register
● for (i = 0; i < 64; ++i) if (X[i] != 0) X[i] = X[i] - Y[i];

    LV      V1,Rx     ; load vector X
    LV      V2,Ry     ; load vector Y
    L.D     F0,#0     ; load FP zero into F0
    SNEVS.D V1,F0     ; set mask bit i if V1[i] != F0
    SUBVV.D V1,V1,V2  ; subtract under vector mask
    SV      V1,Rx     ; store the result back into X

● Vector register utilization drops with the dropping count of 1's in the mask register

Strided Access and Multidimensional Arrays
● Matrix-matrix multiply C = alpha*C + beta*A*B:

    for (i = 0; i < 64; ++i)
        for (j = 0; j < 64; ++j) {
            C[i][j] = alpha * C[i][j];
            for (k = 0; k < 64; ++k)
                C[i][j] = C[i][j] + beta * A[i][k] * B[k][j];
        }

● C[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0] + ...
● A is accessed by rows and B is accessed by columns
● The two most common storage formats:
  1. C: row-major (the matrix is stored by rows)
  2. Fortran: column-major (the matrix is stored by columns)
● There is therefore a need for non-unit stride when accessing memory
  – With row-major storage, the column accesses to B have stride 64 elements * 8 bytes = 512 bytes
  – Best handled by memories with multiple banks

Strided Access and Memory Banks
● Special load/store instructions exist just for accessing strided data
  – LVWS = load vector with stride, and SVWS = store vector with stride
  – The stride (the distance between elements) is placed in a GPR (general-purpose register), just like the vector's start address
● Multiple banks help with fast data access
  – Multiple loads and/or stores can be active at the same time
  – The critical metric is bank busy time: the minimum period of time between successive accesses to the same bank
● For strided accesses, compute the LCM (least common multiple) of the stride and the number of banks; a bank conflict occurs if

    #banks / LCM(stride, #banks) < bank busy time (in cycles)

Strided Access Patterns
[Figure: access patterns over 8 memory banks. With stride 1, successive elements fall in successive banks, so one access completes per CPU cycle; with stride 9 the accesses also rotate through all 8 banks; with stride 8 every access hits the same bank, so each access takes max(1 CPU cycle, 1 bank cycle).]

Example: Memory Banks for the Cray T90
● Design parameters:
  – 32 processors
  – Each processor: 4 loads and 2 stores per cycle
  – Processor cycle time: 2.167 ns
  – Cycle time of the SRAMs: 15 ns
● How many memory banks are required to keep up with the processors?
  – 32 processors * 6 references = 192 references per cycle
  – SRAM busy time: 15 / 2.167 = 6.92, i.e. about 7 CPU cycles
  – Required: 192 * 7 = 1344 memory banks
  – The original design had only 1024 memory banks

Vector Processors and Compilers
● Vector machine vendors supplied good compilers to take advantage of the hardware
  – High profit margins allowed investment in compilers
  – Without well-vectorized code, the hardware cannot deliver high Gflop/s rates
● Vector compilers were...
  – Good at recognizing vectorization opportunities
  – Good at reporting vectorization failures and successes
  – Good at taking hints from the programmer in the form of vendor-specific language extensions such as pragmas

SIMD Extensions for Multimedia Processing
● Modern SIMD extensions in x86 date back to early desktop processing of multimedia
● A typical case of Amdahl's law: speed up the common case, or the slowest component
● Image processing operates on 8-bit colors: 3 components plus 1 alpha (transparency)
  – Typical pixel length: 32 bits
  – Typical operations: additions (see the packed-addition sketch below)
● Audio processing: 8-bit or 16-bit samples (Compact Disc)
● SIMD extensions versus vector processing
  – The vector length is fixed and encoded in the instruction opcode for SIMD
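To make the multimedia-SIMD idea concrete, here is a minimal C sketch of four 8-bit color components packed into one 32-bit pixel and added component-wise with saturation. Real x86 extensions (for example the MMX/SSE2 PADDUSB instruction) perform this in one operation; this scalar emulation only illustrates the data layout and the saturating semantics.

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 32-bit pixels as four independent unsigned bytes,
       saturating each component at 0xFF instead of wrapping. */
    static uint32_t add_pixels_saturated(uint32_t p, uint32_t q) {
        uint32_t r = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t a = (p >> shift) & 0xFF;
            uint32_t b = (q >> shift) & 0xFF;
            uint32_t s = a + b;
            if (s > 0xFF) s = 0xFF;  /* saturate the component */
            r |= s << shift;
        }
        return r;
    }

    int main(void) {
        /* Two packed RGBA pixels; the top components overflow
           and are clamped to 0xFF. Prints 5090FFFF. */
        printf("%08X\n", add_pixels_saturated(0x204080F0u, 0x30509040u));
        return 0;
    }

A SIMD extension executes all four component additions in a single instruction, which is exactly the fixed-length, opcode-encoded vector operation contrasted with vector processors above.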
