Data Parallelism

▶ Data parallel: performing a repeated operation (or chain of operations) over vectors of data.

▶ Conventionally expressed as a loop, but implementations can be constructed to perform the loop operations as a single operation.

▶ Operations can be conditional on elements (see the a2 assignment).

▶ Non-unit strides are often used (second example).

Examples of data parallelism:

    ∀i ∈ 0..n
        a1(i) = b1(i) + c1(i)
        if (b2(i) ≠ 0) → a2(i) = b2(i) + 4
        a3(i) = b3(i) + c3(i + 1)

    ∀i ∈ 0, 2, 4, ···, n
        a4(i) = b4(i) + c4(i)
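As a concrete rendering, here is a minimal C sketch of the loops above (the function and array names are illustrative assumptions, not from the slides):

    /* Plain-C rendering of the data-parallel examples above. A vector
       implementation could execute each loop body as a single vector
       operation rather than n scalar iterations. */
    void examples(int n, double *a1, const double *b1, const double *c1,
                  double *a2, const double *b2,
                  double *a3, const double *b3, const double *c3,
                  double *a4, const double *b4, const double *c4) {
        for (int i = 0; i < n; i++) {
            a1[i] = b1[i] + c1[i];
            if (b2[i] != 0.0)          /* conditional on each element */
                a2[i] = b2[i] + 4.0;
            a3[i] = b3[i] + c3[i + 1]; /* c3 needs n + 1 elements */
        }
        for (int i = 0; i < n; i += 2) /* non-unit (stride-2) access */
            a4[i] = b4[i] + c4[i];
    }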

Supporting Data Parallelism

Three basic solutions:

▶ Vector processors

▶ SIMD processors

▶ GPU processors

Contrary to the textbook prose, vector processors do not predate SIMD processors by 30 years; the ILLIAC IV project began in 1966, predating vector processors! Of course, the embedding of SIMD extensions in mainstream microprocessors occurred much later.

Interestingly, the Burroughs BSP was effectively the successor to the ILLIAC IV. It had 16 arithmetic units and 17 memory units; because 17 is prime, parallel access stays conflict-free across a broader range of stride lengths.
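A small C sketch of the bank-conflict arithmetic behind that design choice (the strides and access counts here are illustrative, not BSP specifics):

    #include <stdio.h>

    /* With B memory banks, element i of a stride-s access stream lands
       in bank (i * s) % B. A prime bank count such as 17 avoids
       conflicts for any stride that is not a multiple of 17. */
    static int distinct_banks(int banks, int stride, int accesses) {
        int used[32] = {0}, count = 0;
        for (int i = 0; i < accesses; i++) {
            int b = (i * stride) % banks;
            if (!used[b]) { used[b] = 1; count++; }
        }
        return count;
    }

    int main(void) {
        /* 16 stride-2 accesses hit only 8 of 16 banks,
           but 16 distinct banks when there are 17. */
        printf("16 banks, stride 2: %d distinct banks\n",
               distinct_banks(16, 2, 16));
        printf("17 banks, stride 2: %d distinct banks\n",
               distinct_banks(17, 2, 16));
        return 0;
    }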

Hardware Support for Data Parallelism

▶ Incorporate vector operations and registers into the architecture.

▶ Define vector load/store operations.

▶ Masking can occur at load/store or during execution (see the sketch below).
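A minimal C sketch of what a vector mask provides, as a scalar emulation (the name vmask is illustrative, not from any particular ISA):

    /* Scalar emulation of masked vector execution: the mask selects
       which elements an operation updates. Real hardware may apply
       the mask during execution or at load/store time. */
    void masked_add(int n, const unsigned char *vmask,
                    double *a, const double *b, const double *c) {
        for (int i = 0; i < n; i++)
            if (vmask[i])        /* element enabled by the mask */
                a[i] = b[i] + c[i];
    }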

Example of Vector Processing

[Figure: block diagram of a vector processor. Main memory feeds a vector load/store unit; vector registers connect to pipelined functional units for FP add/subtract, FP multiply, FP divide, integer, and logical operations; scalar registers sit alongside the vector registers.]

Vector Processing

▶ Vector registers: high port count to allow multiple concurrent reads/writes each cycle

▶ Vector functional units: usually heavily pipelined to permit the processing of a new operation each cycle

▶ Vector load/store unit: to feed the beast, these are also pipelined to get data into/out of the core

▶ Scalar registers: integer and floating point

Vector Processing

▶ Have the functional units iterate over the vector registers

▶ Often a new vector operation can be dispatched each clock, provided data dependencies can be satisfied; chaining can be used to forward intermediate results between adjacent (and dependent) vector operations

▶ Convoy: a set of vector operations that can potentially be executed together (no structural hazards within a convoy)

▶ Vector mask register to conditionally control operations on each vector element

▶ Gather-scatter: load/store mechanisms that use index vectors to read/write sparse (e.g., the non-zero) elements of memory (see the sketch after this list)

▶ Often the hardware provides multiple parallel datapaths (lanes), and may also support subdividing vector elements so each element can be treated as sub-parts to be operated over (for example, working on 8, 16, or 32 bits within a 64-bit element)
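A minimal C sketch of gather-scatter semantics as a scalar emulation (the index-vector name idx is illustrative); vector hardware performs each loop as a single indexed vector load or store:

    /* Scalar emulation of gather and scatter: an index vector idx
       names the (possibly sparse, non-contiguous) locations touched. */
    void gather(int n, double *dst, const double *src, const int *idx) {
        for (int i = 0; i < n; i++)
            dst[i] = src[idx[i]];   /* read from scattered locations */
    }

    void scatter(int n, double *dst, const double *src, const int *idx) {
        for (int i = 0; i < n; i++)
            dst[idx[i]] = src[i];   /* write to scattered locations */
    }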

Multiple Lanes

[Figure: element groups in a vector unit; (a) a single pipeline completes one element per cycle, while (b) multiple lanes complete one element per lane per cycle.]
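One possible scalar emulation of the multi-lane idea (the four-lane count is an illustrative assumption):

    #define NLANES 4   /* illustrative lane count */

    /* Each outer iteration models one cycle, in which an element
       group of NLANES elements (one per lane) completes in parallel. */
    void vadd_lanes(int n, double *a, const double *b, const double *c) {
        for (int base = 0; base < n; base += NLANES)
            for (int lane = 0; lane < NLANES && base + lane < n; lane++)
                a[base + lane] = b[base + lane] + c[base + lane];
    }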

SIMD

[Figure: classic SIMD organization. A control unit (CU) broadcasts a single instruction stream (IS) to processing units PU_1 .. PU_n; each PU_i operates on its own data stream DS_i against memory modules MM_1 .. MM_m through shared memory (SM).]

SIMD

▶ Many statements in the textbook are incorrect in the general sense; SIMD machines did have vector-mask-register-style support and the ability to support non-unit-stride data access

▶ Most of the textbook's discussion is limited to the x86 SIMD capabilities; you need to read this section with a heavy dose of skepticism

▶ Oddities in style abound in this space; the Connection Machines were built with 1-bit arithmetic units and were probably among the most successful SIMD platforms ever built

GPU/GPGPU

▶ GPUs provide multiple types of parallelism that were originally developed for processing the vectors and vector operations commonly found in graphics processing.

▶ A system that processes with GPUs is often considered a heterogeneous processing platform.

▶ CUDA/OpenCL: two models for programming heterogeneous systems (CUDA is specifically for GPGPU; OpenCL is more general).

▶ The real challenge is planning the migration of data into/out of the GPGPU. Often the GPGPU has limited memory space, and feeding the beast becomes an issue (see the sketch below).
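A minimal CUDA sketch of that data-migration pattern (the kernel, block size, and function names are illustrative assumptions, not from the slides):

    #include <cuda_runtime.h>

    /* Illustrative kernel: one thread per element. */
    __global__ void vadd(int n, const float *b, const float *c, float *a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = b[i] + c[i];
    }

    void run(int n, const float *hb, const float *hc, float *ha) {
        size_t bytes = n * sizeof(float);
        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);

        /* Migrate inputs host -> device: often the dominant cost. */
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dc, hc, bytes, cudaMemcpyHostToDevice);

        vadd<<<(n + 255) / 256, 256>>>(n, db, dc, da);

        /* Migrate the result device -> host. */
        cudaMemcpy(ha, da, bytes, cudaMemcpyDeviceToHost);
        cudaFree(da); cudaFree(db); cudaFree(dc);
    }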
