Parallel Architectures

Part 1: The rise of parallel machines

Intel Core i7
• 4 CPU cores
• 2 hardware threads per core (8 “cores”)

Lab Cluster: Intel Xeon
• 4/10/16/18 CPU cores
• 2 hardware threads per core (8/20/32/36 “cores”)
• Multi-socket boards

SUN UltraSPARC T3
• 16 CPU cores
• 8 hardware threads per core (128 “cores”)

IBM Power 8

GPUs
• 2,000+ cores on one chip
• NVIDIA TITAN Z

Top500.org

Part 2: Taxonomies for Parallel Architectures

• Flynn’s Taxonomy: program control and memory access
• Taxonomy Based on Memory Organization
• Taxonomy Based on Processor Granularity
• Taxonomy Based on Processor Synchronization
• Taxonomy Based on Interconnection Architecture

Flynn’s Taxonomy
• Based on method of program control and memory access
• Computer architectures:
  – SISD (single instruction, single data)
  – MISD (multiple instruction, single data)
  – SIMD (single instruction, multiple data)
  – MIMD (multiple instruction, multiple data)

SISD Computers
• Standard sequential computer.
• A single processing unit receives a single stream of instructions that operates on a single stream of data.

MISD Computers
• p processors, each with its own control unit, share a common memory.

SIMD Computers
• All p identical processors operate under the control of a single instruction stream issued by a central control unit.
• There are p data streams, one per processor, so each processor can work on different data.
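To make the difference between one instruction stream and p instruction streams concrete, here is a minimal Python sketch. It is a sequential simulation for illustration only, not real parallel hardware, and the names `simd_execute` and `mimd_execute` as well as the (operation, operand) instruction format are assumptions made for this example.

```python
import operator

def simd_execute(program, data_streams):
    """Simulate SIMD: one control unit broadcasts each instruction to all lanes.

    program      -- a single list of (operation, operand) instructions
    data_streams -- one data item per processor (p data streams)
    """
    lanes = list(data_streams)
    for op, operand in program:
        # Same instruction for every processor, different data in each one.
        lanes = [op(x, operand) for x in lanes]
    return lanes

def mimd_execute(programs, data_streams):
    """Simulate MIMD: every processor runs its own instruction stream."""
    results = []
    for program, x in zip(programs, data_streams):
        for op, operand in program:
            x = op(x, operand)
        results.append(x)
    return results

if __name__ == "__main__":
    double_then_inc = [(operator.mul, 2), (operator.add, 1)]
    print(simd_execute(double_then_inc, [1, 2, 3, 4]))
    print(mimd_execute([[(operator.mul, 2)], [(operator.add, 1)]], [10, 10]))
```

In the SIMD function the loop over instructions is the outer loop, which mirrors the central control unit; in the MIMD function each processor walks through its own program independently.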

MIMD Computers
• p processors
• p streams of instructions
• p streams of data

Taxonomy Based on Memory Organization
• Distributed memory
• Shared memory
  – UMA (uniform memory access)
  – NUMA (non-uniform memory access)

Distributed Memory
• Each processor has its own memory
• Communication is usually performed by message passing
• Each processor can access
  – its own memory, directly
  – the memory of another processor, via message passing
[Figure label: Interconnect]

Shared Memory
• Provides hardware support for reads and writes to a shared memory space
• Has a single address space shared by all processors
[Figure: processors, memory modules (Mem), I/O controllers and I/O devices connected by an interconnect]

Scaling Up…
– Problem is the interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
  • latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  • Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?

Taxonomy Based on Processor Granularity
• Coarse grained: few powerful processors
• Fine grained: many small processors (massively parallel)
• Medium grained: between the two

Taxonomy Based on Processor Synchronization
• Asynchronous: Processors run on independent clocks. The user has to synchronize via message passing or shared variables.
• Fully synchronous: Processors run in sync on one global clock.
• Bulk-synchronous: Hybrid. Processors have independent clocks. Support is provided for global synchronization, which is called by the user’s application program.
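The bulk-synchronous model can be illustrated with a small Python sketch: threads run independently and only meet at barriers that the application calls explicitly. The example computes an inclusive prefix sum by recursive doubling; the function name `bsp_prefix_sum` and the choice of algorithm are assumptions made for illustration, and Python threads stand in for processors here.

```python
import threading

def bsp_prefix_sum(values):
    """Inclusive prefix sum with one thread per element, in supersteps.

    Each superstep has a read phase and a write phase, separated by
    barriers so that no thread overwrites a value another still needs.
    """
    p = len(values)
    if p == 0:
        return []
    x = list(values)                 # shared variable used for communication
    barrier = threading.Barrier(p)   # global synchronization point

    def worker(i):
        stride = 1
        while stride < p:            # same number of supersteps for every thread
            incoming = x[i - stride] if i >= stride else 0
            barrier.wait()           # all reads finished before any write
            x[i] += incoming
            barrier.wait()           # all writes finished before the next superstep
            stride *= 2

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

if __name__ == "__main__":
    print(bsp_prefix_sum([1, 2, 3, 4, 5]))
```

Between barriers the threads are free to run at their own pace, which corresponds to the independent clocks on the slide. Removing either barrier would turn this into an asynchronous program with a data race on `x`.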

Taxonomy Based on Interconnection Architectures
• Static
  – Point-to-point connections
• Dynamic
  – Network with switches
  – Crossbars
  – Buses
[Figure label: Interconnect Network]

Static Interconnection Topologies
• Linear Array
• Ring
Metrics:
• Diameter (maximum distance between two processors)
• Bisection width (minimum number of links that must be cut to split the network into two equal halves)
• Cost (number of links)

Static Interconnection Topologies
• Mesh
• Torus
Diameter? Bisection width? Cost?

Static Interconnection Topologies
• Tree
Diameter? Bisection width? Cost?

Static Interconnection Topologies
• Complete Network
Diameter? Bisection width? Cost?

Static Interconnection Topologies
• d-dimensional Hypercube
  – 2^d processors
[Figure: hypercubes for d = 0, 1, 2, 3, 4, 5]
Diameter? Bisection width? Cost?

Static Interconnection Topologies
• Fat Tree
Diameter? Bisection width? Cost?
[Figure: switch-based interconnection network]

Summary

Taxonomy of parallel machines
[Figure: machines placed by memory organization (distributed, shared) and granularity (fine grained, coarse grained), each labeled MIMD or SIMD]
• Massively parallel cluster (MIMD, distributed memory, fine grained)
• Coarse grained cluster (MIMD, distributed memory, coarse grained)
• Multi-core processor (MIMD, shared memory, coarse grained)
• GPU (SIMD, shared memory, fine grained)
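
The three metrics from the static topology slides (diameter, bisection width, cost) can be checked mechanically on small networks. The following Python sketch is one way to do that, under stated assumptions: the function names are illustrative, only four of the topologies are built, and the bisection width is found by brute force over all balanced splits, which is only feasible for very small processor counts.

```python
from collections import deque
from itertools import combinations

def edge(a, b):
    """Store an undirected link as an ordered pair."""
    return (min(a, b), max(a, b))

def linear_array(p):
    return {edge(i, i + 1) for i in range(p - 1)}

def ring(p):
    return {edge(i, (i + 1) % p) for i in range(p)}

def hypercube(d):
    # Two nodes are linked when their labels differ in exactly one bit.
    return {edge(v, v ^ (1 << k)) for v in range(1 << d) for k in range(d)}

def complete(p):
    return set(combinations(range(p), 2))

def diameter(p, links):
    """Maximum over all pairs of the shortest-path distance (BFS from each node)."""
    adj = {v: [] for v in range(p)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(p):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        worst = max(worst, max(dist.values()))
    return worst

def bisection_width(p, links):
    """Brute force: try every split into two halves and count the cut links."""
    best = len(links)
    for half in combinations(range(p), p // 2):
        side = set(half)
        cut = sum((a in side) != (b in side) for a, b in links)
        best = min(best, cut)
    return best

def cost(links):
    return len(links)

if __name__ == "__main__":
    p = 8  # small enough for the brute-force bisection search
    networks = {
        "linear array": linear_array(p),
        "ring": ring(p),
        "hypercube (d=3)": hypercube(3),
        "complete": complete(p),
    }
    for name, links in networks.items():
        print(name, diameter(p, links), bisection_width(p, links), cost(links))
```

Running the script for a few values of p is a quick way to test a conjectured closed-form answer to the "Diameter? Bisection width? Cost?" questions before proving it.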
