
Parallel Architectures

Part 1:

The rise of parallel machines

Intel Core i7

4 CPU cores

2 hardware threads per core

(8 “cores”)

Lab Cluster

Intel Xeon

4/10/16/18 CPU cores

2 hardware threads per core

(8/20/32/36 “cores”)

+ multi-socket boards

SUN UltraSPARC T3

16 CPU cores

8 hardware threads per core

(128 “cores”)

IBM Power 8

GPUs

• 2,000+ cores on one chip

NVIDIA TITAN Z

Top500.org
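As a quick sanity check of the counts above, a minimal C sketch (assuming a POSIX system such as the lab cluster nodes) can ask the OS how many logical processors it sees; on a 4-core chip with 2 hardware threads per core this typically reports 8 "cores".

```c
/* Query the number of logical processors the OS exposes.
 * Logical count = physical cores x hardware threads per core.
 * POSIX-only sketch; compile with any C compiler on Linux. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", logical);
    return 0;
}
```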

Part 2:

Taxonomies for Parallel Architectures

• Flynn's Taxonomy - program control and memory access
• Taxonomy Based on Memory Organization
• Taxonomy Based on Granularity
• Taxonomy Based on Processor Synchronization
• Taxonomy Based on Interconnection Architecture

Flynn's Taxonomy

• Four classes of architectures:
  – SISD
  – MISD
  – SIMD
  – MIMD
• Based on method of program control and memory access

SISD

• Standard sequential computer.
• A single processing unit receives a single stream of instructions that operates on a single stream of data.

MISD Computers

• p processors, each with its own control unit, share a common memory.
• There are p streams of instructions and one stream of data.

SIMD Computers

• All p identical processors operate under the control of a single instruction stream issued by a central control unit.
• There are p data streams, one per processor, so different data can be used in each processor.
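A minimal sketch of the SIMD idea in C, assuming an OpenMP-capable compiler (the `#pragma omp simd` directive; compile with -fopenmp or -fopenmp-simd): one instruction stream, with the same add applied across many data elements.

```c
/* One instruction stream, many data elements: the same add is
 * applied to every lane. The pragma asks the compiler to vectorize
 * the loop; without OpenMP support it is simply ignored. */
#include <stdio.h>

#define N 16

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];      /* identical operation, different data */

    printf("c[0]=%g c[%d]=%g\n", c[0], N - 1, c[N - 1]);
    return 0;
}
```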

MIMD Computers

• p processors • p streams of instructions • p streams of data
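A minimal MIMD-style sketch using OpenMP threads (assumes compilation with -fopenmp): unlike SIMD, each thread follows its own instruction stream over its own data.

```c
/* p threads, each with its own instruction stream and its own data.
 * Compile with -fopenmp. The two "work" branches stand in for
 * genuinely different code paths per processor. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        if (id % 2 == 0)
            printf("thread %d: summing its partition\n", id);   /* one code path */
        else
            printf("thread %d: sorting its partition\n", id);   /* another code path */
    }
    return 0;
}
```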

Taxonomy Based on Memory Organization

• Shared Memory
  – UMA
  – NUMA
• Distributed Memory

Distributed Memory

• Each processor has its own memory
• Communication is usually performed by message passing (see the sketch below)
• Each processor can access
  – its own memory, directly
  – memory of another processor, via message passing

(Figure: processors with local memories connected by an interconnect)
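A minimal message-passing sketch using MPI (assumes an MPI installation and the usual mpicc/mpirun tools; the value 42 and the two-rank setup are arbitrary illustrations): data moves between the separate address spaces only through an explicit send and receive.

```c
/* Distributed memory: each rank owns its data; a value crosses
 * address spaces only via an explicit message.
 * Build: mpicc msg.c   Run: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;                                   /* exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```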

Shared Memory

• Provides hardware support for read/write to a shared memory space
• Has a single address space shared by all processors

(Figure: processors and memory modules connected through an interconnect, with I/O controllers and I/O devices)
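A minimal shared-memory sketch using OpenMP (assumes -fopenmp; the array size N is arbitrary): all threads read and write the same array through a single address space, with no messages.

```c
/* Shared memory: every thread reads and writes the same array in
 * one address space; the reduction combines the partial sums.
 * Compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double x[N];      /* one shared array, visible to all threads */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;          /* all threads write into the same shared array */
        sum += x[i];
    }

    printf("sum = %g\n", sum);
    return 0;
}
```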

Scaling Up…

– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
  • latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  • Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?

Taxonomy Based on Processor Granularity

• Coarse Grained: Few powerful processors
• Fine Grained: Many small processors
• Medium Grained: …between the two…

Taxonomy Based on Processor Synchronization

• Asynchronous: Processors run on independent clocks. The user has to synchronize via message passing or shared variables.
• Fully Synchronous: Processors run in sync on one global clock.
• Bulk-synchronous: Hybrid. Processors have independent clocks. Support is provided for global synchronization, invoked by the user's application program.
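A minimal bulk-synchronous-style sketch in OpenMP (assumes -fopenmp): each thread computes on its own, then the program explicitly requests a global synchronization point before the next phase.

```c
/* Bulk-synchronous style: independent local computation, followed by
 * a user-requested global synchronization, then the next superstep.
 * Compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();

        printf("thread %d: local computation phase\n", id);

        #pragma omp barrier   /* global synchronization requested by the program */

        printf("thread %d: next superstep\n", id);
    }
    return 0;
}
```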

Taxonomy Based on Interconnection Architectures

• Static – Point to point connections

• Dynamic – Network with switches
  – Crossbars
  – Buses

Static Interconnection Topologies

Linear Array

Ring

• Diameter (max distance between any two processors)
• Bisection Width (min number of links that must be cut to break the network into two equal halves)
• Cost (number of links)
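A small worked example of these three metrics for the linear array and the ring, using the standard formulas (p = 8 is an arbitrary example size): a linear array has diameter p-1, bisection width 1, and cost p-1; a ring has diameter floor(p/2), bisection width 2, and cost p.

```c
/* Diameter, bisection width, and cost (link count) for a linear
 * array and a ring of p processors. */
#include <stdio.h>

int main(void) {
    int p = 8;   /* example size; any p >= 2 */

    /* Linear array: longest path spans all p-1 links; one cut splits it. */
    printf("linear array: diameter=%d  bisection=%d  cost=%d\n",
           p - 1, 1, p - 1);

    /* Ring: worst case is halfway around; two cuts split it. */
    printf("ring:         diameter=%d  bisection=%d  cost=%d\n",
           p / 2, 2, p);

    return 0;
}
```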

Static Interconnection Topologies

• Mesh
• Torus

Diameter? Bisection Width? Cost?

Static Interconnection Topologies • Tree

Diameter? Bisection Width? Cost?

Static Interconnection Topologies • Complete Network

Diameter? Bisection Width? Cost?

Static Interconnection Topologies • d-dimensional Hypercube

A d-dimensional hypercube has 2^d processors (figure shows hypercubes for d = 0 through 5)

Diameter? Bisection Width? Cost?
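For the d-dimensional hypercube the standard answers are diameter d, bisection width 2^(d-1), and cost d * 2^(d-1) links; a small C sketch tabulating them for the d = 0..5 cases shown in the figure:

```c
/* d-dimensional hypercube with P = 2^d processors:
 * diameter = d, bisection width = 2^(d-1), cost = d * 2^(d-1) links. */
#include <stdio.h>

int main(void) {
    for (int d = 0; d <= 5; d++) {
        long P = 1L << d;                      /* 2^d processors */
        long bisection = (d == 0) ? 0 : P / 2; /* 2^(d-1); degenerate for d = 0 */
        long cost = d * (P / 2);               /* d * 2^(d-1) links */
        printf("d=%d  P=%ld  diameter=%d  bisection=%ld  cost=%ld\n",
               d, P, d, bisection, cost);
    }
    return 0;
}
```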

Switch-Based Interconnection Network

Summary

Taxonomy of parallel machines

(Figure: machine classes arranged along the MIMD/SIMD, shared/distributed memory, and fine/coarse granularity axes: GPUs, shared-memory multi-cores, coarse-grained clusters, and massively parallel distributed-memory clusters)

• Massively parallel cluster (MIMD, distributed memory, fine grained)
• Coarse grained cluster (MIMD, distributed memory, coarse grained)
• Multi-core processor (MIMD, shared memory, coarse grained)
• GPU (SIMD, shared memory, fine grained)