Data Parallelism
- Data Parallel: performing a repeated operation (or chain of operations) over vectors of data.
- Conventionally expressed as a loop, but implementations can be constructed to perform the loop operations as a single operation.
- Operations can be conditional on elements (see the a2 assignment).
- Non-unit strides are often used (second example).

Examples of data parallelism:

∀i ∈ 0..n
  a1(i) = b1(i) + c1(i)
  if (b2(i) ≠ 0) → a2(i) = b2(i) + 4
  a3(i) = b3(i) + c3(i + 1)

∀i ∈ 0, 2, 4, ..., n
  a4(i) = b4(i) + c4(i)
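Written as plain C loops, the examples above look like the following (a minimal sketch; the array and function names are illustrative, not from the slides):

```c
/* Elementwise add: a1(i) = b1(i) + c1(i) */
void vec_add(int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Conditional (masked) update: if (b2(i) != 0) then a2(i) = b2(i) + 4 */
void vec_add_masked(int *a, const int *b, int n) {
    for (int i = 0; i < n; i++)
        if (b[i] != 0)
            a[i] = b[i] + 4;
}

/* Non-unit stride: a4(i) = b4(i) + c4(i) for i = 0, 2, 4, ... */
void vec_add_stride2(int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n; i += 2)
        a[i] = b[i] + c[i];
}
```

A vector implementation can perform each of these loops as a single vector operation rather than n scalar iterations.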
Supporting Data Parallelism
Three basic solutions:
- Vector processors
- SIMD processors
- GPU processors
Contrary to the textbook prose, vector processors do not predate SIMD processors by 30 years; the ILLIAC IV project began in 1966, predating vector processors! Of course, the embedding of SIMD in x86 occurred much later.
Interestingly, the Burroughs BSP was effectively the successor to the ILLIAC IV. It had 16 arithmetic units and 17 memory units to facilitate parallel access across a broader range of stride lengths.
Hardware Support for Data Parallelism
- Incorporate vector operations and registers into the architecture.
- Define vector load/store operations.
- Masking can occur at load/store or during execution.
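A minimal C sketch of masking during execution, modeling the vector mask register as one bit per element (the `vmask_t` type and helper names are illustrative, not from any real ISA):

```c
#include <stdint.h>

/* Hypothetical vector mask register: bit i governs element i (up to 8 elements). */
typedef uint8_t vmask_t;

/* Build a mask from a predicate, here "element != 0". */
vmask_t set_mask_nonzero(const int *v, int n) {
    vmask_t m = 0;
    for (int i = 0; i < n; i++)
        if (v[i] != 0)
            m |= (vmask_t)(1u << i);
    return m;
}

/* Masked add: dst[i] = b[i] + c[i] only where the mask bit is set;
 * unmasked elements are left untouched. */
void masked_add(int *dst, const int *b, const int *c, vmask_t m, int n) {
    for (int i = 0; i < n; i++)
        if (m & (1u << i))
            dst[i] = b[i] + c[i];
}
```

The same mask could instead gate the load/store itself, which is the other placement the slide mentions.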
Example of Vector Processing
[Figure: block diagram of a vector processor. Main memory feeds a vector load/store unit; vector registers supply pipelined functional units (FP add/subtract, FP multiply, FP divide, integer, logical); scalar registers sit alongside the vector registers.]
Vector Processing
- Vector registers: high port count to allow multiple concurrent reads/writes each cycle.
- Vector functional units: usually heavily pipelined to permit the processing of a new operation each cycle.
- Vector load/store unit: to feed the beast, this is also pipelined to move data into/out of the core.
- Scalar registers: integer and floating point.
Vector Processing
- Have the functional unit iterate over the vector registers.
- A new vector operation can often be dispatched each clock, provided data dependencies can be satisfied. Chaining can forward intermediate results between adjacent (in-flight) vector operations.
- Convoy: a set of vector operations that can potentially be executed together (no structural hazards within a convoy).
- A vector mask register conditionally controls operations on each vector element.
- Gather-scatter: load/store mechanisms to read/write the non-zero elements of sparse data in memory.
- Hardware often supports multiple lanes (parallel pipelines that each process a subset of the vector elements), and may also allow each element to be subdivided into sub-parts to be operated over (for example, 8-, 16-, or 32-bit sub-elements within a 64-bit element).
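Gather and scatter can be sketched as indexed loads and stores; the following C is a scalar model of what the vector hardware does in one operation (function names are illustrative):

```c
/* Gather: pack selected elements of a sparse source into a dense
 * vector using an index vector (an indexed load). */
void gather(int *dense, const int *src, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        dense[i] = src[idx[i]];
}

/* Scatter: write a dense vector back to the indexed positions
 * (an indexed store, the inverse of gather). */
void scatter(int *dst, const int *dense, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        dst[idx[i]] = dense[i];
}
```

A typical use is gathering the non-zero elements of a sparse vector, operating on the dense copy, then scattering the results back.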
Multiple Lanes
[Figure: element groups distributed across multiple lanes, comparing (a) a single-lane pipeline with (b) a multi-lane implementation.]
SIMD
[Figure: classic SIMD organization. A control unit (CU) broadcasts a single instruction stream (IS) to processing units PU1..PUn; each PU exchanges its own data stream (DS1..DSn) with memory modules MM1..MMm through shared memory (SM).]
SIMD
- Many statements in the textbook are incorrect in the general sense; SIMD machines did have vector-mask-register-style support and the ability to perform non-unit-stride data access.
- Most of the discussion is limited to the x86 SIMD capabilities; you need to read this section with a heavy dose of skepticism.
- Oddities in style abound in this space; the Connection Machines were built with 1-bit arithmetic units and were probably among the most successful SIMD platforms ever built.
GPU/GPGPU
- GPUs provide multiple types of parallelism, originally developed for processing the vectors and vector operations commonly found in graphics.
- A system that processes with GPUs is often considered a heterogeneous processing platform.
- CUDA/OpenCL: two models for programming heterogeneous systems (CUDA is specifically for GPGPU; OpenCL is more general).
- The real challenge is planning the migration of data into/out of the GPGPU. Often the GPGPU has limited memory space, and feeding the beast becomes an issue.
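The staging pattern behind that data-migration challenge can be sketched in plain C. The `device_alloc`/`copy_to_device`/`copy_from_device` helpers below are hypothetical stand-ins for device-memory API calls (such as cudaMalloc/cudaMemcpy in CUDA); here they just use host memory so the sketch runs:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for device-memory API calls; real code would
 * call the platform's allocation/transfer functions instead. */
static void *device_alloc(size_t n) { return malloc(n); }
static void copy_to_device(void *d, const void *h, size_t n) { memcpy(d, h, n); }
static void copy_from_device(void *h, const void *d, size_t n) { memcpy(h, d, n); }

/* Stand-in for a kernel launch: square each element "on the device". */
static void launch_square_kernel(int *dev, int n) {
    for (int i = 0; i < n; i++)
        dev[i] *= dev[i];
}

/* Typical staging: allocate on the device, copy in, compute, copy out.
 * The transfers often dominate, so real code batches enough work on the
 * device to amortize them. */
void square_on_device(int *host, int n) {
    size_t bytes = (size_t)n * sizeof(int);
    int *dev = device_alloc(bytes);
    copy_to_device(dev, host, bytes);
    launch_square_kernel(dev, n);
    copy_from_device(host, dev, bytes);
    free(dev);
}
```

The point of the sketch is the shape of the pattern, not the API: every round trip through the copy-in/copy-out steps is overhead that the computation must outweigh.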