
MIMD & Shared-Nothing Systems
Programmierung Paralleler und Verteilter Systeme (PPV)

Summer 2015

Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze

Multiple Instruction Multiple Data (MIMD)

■ Most common parallel hardware architecture today
□ Example: all many-core processors, clusters, distributed systems
■ From a software perspective [Pfister]:
■ SPMD – Single Program Multiple Data (see the MPI sketch below)
□ Sometimes denoted as 'application cluster'
□ Examples: load-balancing cluster or failover cluster for web servers, application servers, ...
■ MPMD – Multiple Program Multiple Data
□ Multiple implementations work together on one parallel computation
□ Example: master / worker cluster, map / reduce framework
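The SPMD style can be made concrete with a small MPI program. This is a minimal sketch, assuming a standard MPI installation (compiled with mpicc, launched with mpirun); the rank-based master/worker split is the point being illustrated, not any particular application.

```c
/* Hedged sketch of the SPMD style with MPI: one program, many data.
   Assumes an MPI installation (compile with mpicc, run with mpirun). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same binary everywhere; the rank selects the role (master / worker). */
    if (rank == 0) {
        printf("master coordinating %d workers\n", size - 1);
    } else {
        printf("worker %d processing its share of the data\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```

Launching, e.g., mpirun -np 4 ./spmd starts four instances of the same binary, one of which takes the master role.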

2 MIMD Classification

3 Memory Architectures

■ Uniform Memory Access (UMA)
■ Non-Uniform Memory Access (NUMA)
■ Distributed Memory
■ Hybrid

4 Shared Memory vs. Distributed Memory Systems

Shared memory (SM) systems
■ SM-SIMD: single-CPU vector processors
■ SM-MIMD: multi-CPU vector processors, OpenMP (see the OpenMP sketch below)
■ Variant: clustered shared-memory systems (NEC SX-6, Cray SV1ex)
Distributed memory (DM) systems
■ DM-SIMD: processor-array machines; lock-step approach; front-end processor and control processor
■ DM-MIMD: large variety of interconnection networks
Distributed (virtual) shared-memory systems
■ High Performance Fortran, TreadMarks
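For the SM-MIMD case, OpenMP is the typical programming interface: all threads work on the same data in a shared address space. A minimal sketch, assuming a compiler with OpenMP support (e.g., gcc -fopenmp); the array size and the reduction are illustrative only.

```c
/* Hedged sketch: shared-memory MIMD with OpenMP (compile with -fopenmp).
   All threads operate on the same array in the shared address space. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double data[N];
    double sum = 0.0;

    /* Threads divide the iteration space; 'data' is shared, 'i' is private. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        data[i] = (double)i;
        sum += data[i];
    }

    printf("max threads: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}
```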

5 Shared Memory Architectures

■ All processors act independently and access the same global memory
■ Changes in one memory location are visible to all others
■ Uniform memory access (UMA) system
□ Equal load and store access for all processors to all memory
□ Default approach for the majority of SMP systems in the past
■ Non-uniform memory access (NUMA) system
□ Delay on memory access depends on the accessed region
□ Typically realized by a processor interconnection network and local memories
□ Cache-coherent NUMA (CC-NUMA), completely implemented in hardware
□ About to become the standard approach with recent chips

6 NUMA Classification

7 MIMD Systems

Sequent Balance

8 Sequent Symmetry

Sequent was bought by IBM in 1999. IBM produced several Intel-based servers based on Sequent's later NUMA architecture…

9 Caches – managing contention

Effect of write-through and write-back cache coherency protocols on Sequent Symmetry

10 Intel Paragon XP/S
■ i860 RISC processor (64 bit, 50 MHz, 75 MFlops)
■ Standard OS (OSF/1) on each node
■ Cluster in a box

11 Intel Paragon XP/S – interconnection network

12 Intel Paragon XP/S – partitioning

13 IBM SP/2

14 Example: Intel Nehalem SMP System

[Figure: Intel Nehalem SMP system – multiple processor sockets, each with four cores, a shared L3 cache, and integrated memory controllers with local memory, interconnected by QPI links with each other and with the I/O hubs]

15 An Intel Nehalem Cluster: SMP + NUMA

[Figure: several Nehalem SMP nodes connected by a network]

16 CC-NUMA

■ Still the SMP programming model, but non-NUMA-aware software scales poorly (see the allocation sketch below)
■ Different implementations lead to a diffuse understanding of "node"; typical:
□ Region of memory where every byte has the same distance from each processor
■ Tackles the problems of pure SMP architectures, while keeping the location-independence promise
■ Recent research tendency towards non-cache-coherent NUMA approaches (Intel Single Chip Cloud Computer)
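On Linux, software can be made NUMA-aware by placing memory explicitly with the libnuma API instead of relying on first-touch placement. A minimal sketch, assuming libnuma and its headers are installed (link with -lnuma); the buffer size and the chosen node are arbitrary examples.

```c
/* Hedged sketch: NUMA-aware allocation with Linux libnuma
   (assumes libnuma headers/library are installed; link with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }

    int nodes = numa_max_node() + 1;
    printf("NUMA nodes: %d\n", nodes);

    /* Place a buffer on node 0 explicitly instead of relying on first-touch. */
    size_t size = 64UL * 1024 * 1024;
    double *buf = numa_alloc_onnode(size, 0);
    if (!buf) {
        fprintf(stderr, "allocation on node 0 failed\n");
        return EXIT_FAILURE;
    }

    buf[0] = 42.0;                 /* touch the memory: pages live on node 0 */
    printf("allocated %zu bytes on node 0\n", size);

    numa_free(buf, size);
    return 0;
}
```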

[Figure: CC-NUMA – processors A-D, each with its own cache and local memory, connected by a high-speed interconnect]

17 Scalable Coherent Interface

■ ANSI / IEEE standard for a NUMA interconnect, used in the HPC world
■ 64-bit global address space, translation by the SCI bus adapter (I/O window)
■ Used as 2D / 3D torus

[Figure: processors A-D with caches; local memories attached via SCI caches and SCI bridges, connected in a ring / torus]

18 Experimental Approaches

■ Systolic arrays: data flow architectures
■ Problem: common clock – maximum signal path restricted by frequency
■ Fault contention: a single faulty processing element will break the entire machine

19 Parallel Processing

20
■ Inside the processor
□ Instruction-level parallelism (ILP)
□ Multicore
□ Shared memory
■ With multiple processing elements in one machine
□ Multiprocessor
□ Shared memory
■ With multiple processing elements in many machines
□ Multicomputer
□ Shared nothing (in terms of a globally accessible memory)

Clusters

21
■ Collection of stand-alone machines connected by a local network
□ Cost-effective technique for building a large-scale parallel computer
□ Low cost of both hardware and software
□ Users are builders, have control over their own system (hardware and software); low cost as major issue
■ Distributed processing as extension of DM-MIMD
□ Communication orders of magnitude slower than with SM
□ Only feasible for coarse-grained parallel activities (see the latency sketch below)
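The communication-cost argument can be illustrated with a simple MPI ping-pong between two processes. A minimal sketch, assuming an MPI installation and at least two processes; inter-node round trips typically take microseconds, several orders of magnitude more than a shared-memory access, which is why only coarse-grained decompositions pay off.

```c
/* Hedged sketch: MPI ping-pong to illustrate why cluster communication
   favours coarse-grained parallelism (assumes an MPI installation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    const int rounds = 1000;
    char byte = 0;
    double start = MPI_Wtime();

    for (int i = 0; i < rounds; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("average round-trip latency: %.2f us\n",
               (MPI_Wtime() - start) / rounds * 1e6);

    MPI_Finalize();
    return 0;
}
```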

[Figure: load balancer distributing requests across multiple web servers]

22 History of Clusters

23
■ 1977: ARCnet (Datapoint)
□ LAN protocol, DATABUS programming language
□ Single computer with terminals
□ Addition of 'compute resource' and 'data resource' transparent for the application
■ May 1983: VAXCluster (DEC)
□ Cluster of VAX computers, no single point of failure
□ Every component duplicated
□ High-speed messaging interconnect
□ Distributed version of the VMS OS
□ Distributed lock manager for shared resources

History of Clusters – NOW

24
■ Berkeley Network Of Workstations (NOW) – 1995
■ Building a large-scale system with COTS hardware
■ GLUnix
□ Transparent remote execution, network PIDs
□ Load balancing
□ Virtual node numbers (for communication)
■ Network RAM – idle machines as paging device
■ Collection of low-latency, parallel communication primitives – 'active messages'
■ Berkeley sockets, shared address space parallel C, MPI

Cluster System Classes

25
■ High-availability (HA) clusters – improvement of cluster availability
□ Linux-HA project (multi-protocol heartbeat, resource grouping)
■ Load-balancing clusters – server farm for increased performance / availability
□ Linux Virtual Server (IP load & application-level balancing)
■ High-performance computing (HPC) clusters – increased performance by splitting tasks among different nodes
□ Speed up the computation of one distributed job (FLOPS)
■ High-throughput computing (HTC) clusters – maximize the number of finished jobs
□ All kinds of simulations, especially parameter sweeps
□ Special case: idle time computing for cycle harvesting

Massively Parallel Processing (MPP)

26
■ Hierarchical SIMD / MIMD architecture with a lot of processors
□ Still standard components (in contrast to mainframes)
□ Specialized setup of these components
□ Host nodes responsible for loading program and data to the PEs
□ High-performance interconnect (bus, ring, 2D mesh, hypercube, tree, ...)
□ Mostly for simulation applications (atom bomb, climate, earthquake, airplane, car crash, ...)
■ Examples
□ Distributed Array Processor (1979), 64x64 single-bit PEs
□ BlueGene/L (2007), 106,496 nodes x 2 PowerPC (700 MHz)
□ IBM Sequoia (2012), 16.3 PFlops, 1.6 PB memory, 98,304 compute nodes, 1.6 million cores, 7890 kW power

1.1 View from the outside

The Blue Gene/P system has the familiar, slanted profile that was introduced with the Blue Gene/L system. However, the increased compute power requires an increase in airflow, resulting in a larger footprint. Each of the air plenums on the Blue Gene/P system is just over ten inches wider than the plenums of the previous model. Additionally, each Blue Gene/P rack is approximately four inches wider. There are two additional Bulk Power Modules mounted in the Bulk Power enclosure on the top of the rack. Rather than a circuit-breaker-style switch, there is an on/off toggle switch to power on the machine.

1.1.1 Packaging

Figure 1-1 illustrates the packaging of the Blue Gene/L system.

27 Blue Gene/L packaging (Figure 1-1):
■ Chip: 2 processors – 2.8/5.6 GF/s, 4 MB
■ Compute card: 2 chips (1x2x1) – 5.6/11.2 GF/s, 1.0 GB
■ Node card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2) – 90/180 GF/s, 16 GB
■ Rack: 32 node cards – 2.8/5.6 TF/s, 512 GB
■ System: 64 racks (64x32x32) – 180/360 TF/s, 32 TB

Figure 1-2 on page 3 shows how the Blue Gene/P system is packaged. The changes start at the lowest point of the chain. Each chip is made up of four processors rather than just two processors like the Blue Gene/L system supports.

At the next level, only one chip is on each of the compute (processor) cards. This design is easier to maintain with less waste. On the Blue Gene/L system, the replacement of a compute node because of a single failed processor requires the discard of one usable chip because the chips are packaged with two per card. The design of the Blue Gene/P system has only one chip per processor card, eliminating the disposal of a good chip when a compute card is replaced.

Each node card still has 32 chips, but now the maximum number of I/O nodes per node card is two, so that only two ports are on the front of each node card. Like the Blue Gene/L system, there are two midplanes per rack. The lower midplane is considered to be the …

Blue Gene/P

28 Blue Gene/P packaging:
■ Chip: 4 processors – 13.6 GF/s, 8 MB EDRAM
■ Compute card: 1 chip, 20 DRAMs – 13.6 GF/s, 2.0 GB DDR2 (4.0 GB 6/30/08)
■ Node card: 32 compute cards, 0-1 I/O cards (32 chips, 4x4x2) – 435 GF/s, 64 (128) GB
■ Rack: 32 node cards, cabled 8x8x16 – 13.9 TF/s, 2 (4) TB
■ System: 72 racks (72x32x32) – 1 PF/s, 144 (288) TB

Blue Gene/Q

29 Blue Gene/Q packaging:
1. Chip: 16+2 processor cores
2. Single chip module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

■ Sustained single-node performance: 10x P, 20x L
■ MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
■ Software and hardware support for programming models for exploitation of node hardware concurrency

Blue Gene/Q

30 BlueGene/Q compute chip
■ 360 mm² Cu-45 technology (SOI), 11 metal layers
■ System-on-a-chip design: integrates processors, memory and networking logic into a single chip
■ 16 user + 1 service PPC processors, plus 1 redundant processor; all processors are symmetric
□ each 4-way multi-threaded, 64 bit, 1.6 GHz
□ L1 I/D cache = 16 kB / 16 kB, L1 prefetch engines
□ each processor has a quad FPU (4-wide double precision, SIMD)
□ peak performance 204.8 GFLOPS @ 55 W
■ Central shared L2 cache: 32 MB
□ eDRAM, multiversioned cache
□ supports transactional memory and speculative execution
□ supports scalable atomic operations
■ Dual memory controller
□ 16 GB external DDR3 memory
□ 42.6 GB/s DDR3 (1.333 GHz DDR3, 2 channels, each with chip-kill protection)
■ Chip-to-chip networking
□ 5D torus topology + external link: 5 x 2 + 1 high-speed serial links
□ each 2 GB/s send + 2 GB/s receive
□ DMA, remote put/get, collective operations
■ External (file) I/O – when used as I/O chip
□ PCIe Gen2 x8 interface (4 GB/s Tx + 4 GB/s Rx)
□ re-uses 2 serial links
□ interface to Ethernet or InfiniBand cards

Blue Gene/Q System Architecture

31 [Figure: Blue Gene/Q system architecture – compute nodes (CNK) and I/O nodes (Linux with ciod and fs client) connected by the collective / torus networks; a 10 Gb functional network links I/O nodes to front-end nodes and file system servers; a 1 Gb control Ethernet (FPGA / JTAG) links the racks to the service node running MMCS, DB2, LoadLeveler, and the console]

Blue Gene/Q Software Stack Openness

32 [Figure: Blue Gene/Q software stack on compute / I/O nodes and service / login nodes – application layer (MPI, Charm++, MPI-IO, ESSL, XL and GNU compilers/runtimes, schedulers, debuggers, HPC toolkit), system layer (Compute Node Kernel, CIO services, converged messaging stack PAMI, GPFS, high-level control system MMCS with DB2, LoadLeveler, runjob / scheduler APIs), and firmware layer (bootloader, diagnostics, RAS, low-level control system for power on/off, hardware probe and init, parallel boot, mailbox); components are colour-coded by openness, ranging from new open source reference implementations under the CPL licence, through existing open source communities under various licences, to closed components with no source provided]

MPP Properties

33
■ Standard components (processors, hard disks, ...)
■ Specific non-standardized interconnection network
□ Low latency, high speed; distributed file system
■ Specific packaging of components and nodes for cooling and upgradeability
□ Whole system provided by one vendor (IBM, HP)
□ Extensibility as major issue, in order to save investment
□ Distributed processing as extension of DM-MIMD
■ Scalability only limited by the application, not by the hardware
■ Proprietary wiring of standard components
■ Demands custom operating system and aligned applications
■ No major consideration of availability
■ Power consumption, cooling

Distributed System

34
■ Tanenbaum (Distributed Operating Systems): "A distributed system is a collection of independent computers that appear to the users of the system as a single computer."
■ Coulouris et al.: "... [a system] in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages."
■ Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
■ Consequences: concurrency, no global clock, independent failures
■ Challenges: heterogeneity, openness, security, scalability, failure handling, concurrency, need for transparency

SMP vs. Cluster vs. Distributed System

35
■ Clusters are composed of computers, SMPs are composed of processors
□ High availability is cheaper with clusters, but demands additional software components
□ Scalability is easier with a cluster
□ SMPs are easier to maintain from the administrator's point of view
□ Software licensing becomes more expensive with a cluster
■ Clusters for capability computing, integrated machines for capacity computing
■ Cluster vs. distributed system
□ Both consist of multiple nodes for parallel processing
□ Nodes in a distributed system have their own identity
□ Physical vs. virtual organization

Comparison

36
■ Number of nodes: MPP O(100)-O(1000); SMP O(10)-O(100); Cluster O(100) or less; Distributed O(10)-O(1000)
■ Node complexity: MPP fine grain; SMP medium or coarse grained; Cluster medium grain; Distributed wide range
■ Internode communication: MPP message passing / shared variables (SM); SMP centralized and distributed shared memory; Cluster message passing; Distributed shared files, RPC, message passing, IPC
■ Job scheduling: MPP single run queue on host; SMP single run queue mostly; Cluster multiple queues but coordinated; Distributed independent queues
■ SSI support: MPP partially; SMP always in SMP; Cluster desired; Distributed no
■ Address space: MPP multiple; SMP single; Cluster multiple or single; Distributed multiple
■ Internode security: MPP irrelevant; SMP irrelevant; Cluster required if exposed; Distributed required
■ Ownership: MPP one organization; SMP one organization; Cluster one or many organizations; Distributed many organizations

K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming; WCB/McGraw-Hill, 1998

Interconnection Networks

37
■ Shared-nothing systems demand structured connectivity
□ Processor-to-processor interaction
□ Processor-to-memory interaction
■ Static network
□ Point-to-point links, fixed route
■ Dynamic network
□ Consists of links and switching elements
□ Flexible configuration of processor interaction

Interconnection Networks

38 Interconnection Networks

39 ■ Dynamic networks are built from a graph of configurable switching elements ■ General packet switching network counts as irregular static network [Peter Newman] Interconnection Networks

■ Network interfaces
□ Processors talk to the network via a network interface controller (NIC)
□ Network interfaces attached to the interconnect
◊ Cluster vs. tightly-coupled multicomputer
□ Next-generation hardware will include the NIC on the processor die
■ Switching elements map a fixed number of inputs to outputs
□ The total number of ports is the degree of the switch
□ The cost of a switch grows as the square of the degree
□ The peripheral hardware grows linearly with the degree

Interconnection Networks

■ A variety of network topologies have been proposed and implemented
■ Each topology has a performance / cost tradeoff
■ Commercial machines often implement hybrids
□ Optimize packaging and costs
■ Metrics for an interconnection network graph
□ Diameter: maximum distance between any two nodes
□ Connectivity: minimum number of edges that must be removed to get two independent graphs
□ Link width / weight: transfer capacity of an edge
□ Bisection width: minimum transfer capacity between any two halves of the graph
□ Costs: number of edges in the network
■ Often optimization for the connectivity metric

Bus Systems

42
■ Static interconnect technology
■ Shared communication path, broadcasting of information
□ Diameter: O(1)
□ Connectivity: O(1)
□ Bisection width: O(1)
□ Costs: O(p)


Crossbar Switch (Kreuzschienenverteiler)

43
■ Dynamic switch-based network
■ Non-blocking: supports an arbitrary number of permutations, multiple connections without collisions
■ Diameter: O(1)
■ Connectivity: O(1)
■ Bisection width: O(n)
■ Costs: O(n²)
□ n·(n-1) connection points – high cost, quadratic growth, bad scalability


44 Crossbar Switch

45 Multistage Interconnection Networks

46
■ Connection by switching elements
■ Typical solution to connect processing and memory elements
■ Can implement sorting or shuffling in the network

Omega Network

47
■ Inputs are crossed or not, depending on the routing logic
□ Destination-tag routing: use the positional bit for the switch decision
□ XOR-tag routing: use the positional bit of the XOR result for the decision

■ For N PE’s, N/2 switches per stage, log2N stages ■ Decrease bottleneck probability on parallel communication Delta Networks

Delta Networks

48
■ Stage n checks bit k of the destination tag
■ Only (n/2 · log n) delta switches needed
■ Limited cost
■ Not all possible permutations operational in parallel
■ Possible effects of 'output port contention' and 'path contention'

Clos Coupling – Delta Networks and Crossbar

49
■ Clos coupling networks: combination of delta network and crossbar
■ C. Clos, "A Study of Non-Blocking Switching Networks", Bell System Technical Journal, vol. 32, no. 2, 1953, pp. 406-424

Fat-Tree Networks
■ PEs arranged as leaves of a binary tree
■ Capacity of the tree (links) doubles on each layer

Bitonic Mergesort

50 Completely Connected / Star Connected Networks

51 Cartesian Topology Network

52 [Figure: linear arrays, 2D and 3D meshes]

Point-to-Point Networks: Ring and Fully Connected Graph
■ Ring has only two connections per PE (almost optimal)
■ Fully connected graph – optimal connectivity (but high cost)

Cartesian Topology Network

53
■ Linear array: each node has two neighbours
■ 1D torus / ring: linear array with connected endings
■ 2D mesh / torus: each node has four neighbours
■ Mesh and torus are a compromise between cost and connectivity (see the MPI topology sketch below)
■ d-dimensional mesh: nodes with 2d neighbours
■ Hypercube
□ d-dimensional mesh where d = log n (# processors)
□ Constructed from lower-dimensional hypercubes

[Figure: 4-way 2D mesh, 4-way 2D torus, 8-way 2D mesh]
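MPI exposes such Cartesian topologies directly, so a program can be mapped onto a mesh or torus without hand-coding the neighbour arithmetic. A minimal sketch, assuming an MPI installation; the 2D periodic grid and the printed neighbour ranks are for illustration only.

```c
/* Hedged sketch: mapping processes onto a 2D torus with MPI's Cartesian
   topology support (assumes an MPI installation and mpicc). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0};              /* let MPI factor the process count */
    MPI_Dims_create(size, 2, dims);

    int periods[2] = {1, 1};           /* wrap-around links -> 2D torus    */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);
    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);

    int left, right;                   /* neighbours along dimension 1     */
    MPI_Cart_shift(cart, 1, 1, &left, &right);

    printf("rank %d at (%d,%d), left=%d right=%d\n",
           rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```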


Cubic Mesh
■ PEs are arranged in a cubic fashion
■ Each PE has 6 links to its neighbours

Hypercube

54 ■ Dimensions 0-4, recursive definition

Hypercubes

55
■ Diameter: at most log(n)
■ Each node has log(n) neighbours
■ Distance between two nodes: number of bit positions in which their addresses differ (see the hypercube sketch below)
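The bit-position view translates directly into code: the distance is the Hamming distance of the node labels, and dimension-order (e-cube) routing corrects one differing bit per hop. A minimal sketch; the node labels and the dimension d = 4 are arbitrary, and __builtin_popcount assumes GCC or Clang.

```c
/* Hedged sketch: distance and dimension-order routing in a hypercube.
   Node labels and the d value are illustrative. */
#include <stdio.h>

/* Hypercube distance = number of differing address bits (Hamming distance). */
static int hypercube_distance(unsigned a, unsigned b) {
    return __builtin_popcount(a ^ b);   /* GCC/Clang builtin */
}

int main(void) {
    const int d = 4;                    /* 4-dimensional hypercube, 16 nodes */
    unsigned src = 0x3, dst = 0xC;      /* 0011 -> 1100 */

    printf("distance = %d hops\n", hypercube_distance(src, dst));

    /* E-cube (dimension-order) routing: correct one differing bit per hop. */
    unsigned node = src;
    for (int bit = 0; bit < d; bit++) {
        if (((node ^ dst) >> bit) & 1u) {
            node ^= 1u << bit;          /* traverse the link in dimension 'bit' */
            printf("hop across dimension %d -> node 0x%X\n", bit, node);
        }
    }
    return 0;
}
```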

56 Fat Trees

57

■ Tree structure:
□ The distance between any two nodes is no more than 2 log p
□ Links higher up potentially carry more traffic – bottleneck at the root node
□ Can be laid out in 2D with no wire crossings
■ Fat tree:
□ Fattens the links as we go up the tree


Systolic Arrays

58
■ Data flow architecture
■ Common clock – maximum signal path restricted by frequency
■ Fault contention: a single faulty processing element breaks the complete array

Comparison

Static network topologies compared by diameter, bisection width, arc connectivity, and cost (number of links); typical values are sketched after the list:

Completely-connected

Star

Complete binary tree

Linear array

2-D mesh, no wraparound

2-D wraparound mesh

Hypercube

Wraparound k-ary d-cube
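For reference, the standard values for these static topologies with p nodes, as tabulated in the common textbook treatment (e.g., Grama et al., Introduction to Parallel Computing), are sketched below; k and d refer to the wraparound k-ary d-cube.

```latex
% Standard static-network metrics for p nodes (textbook values, cf. Grama et al.)
\documentclass{article}
\begin{document}
\begin{tabular}{lllll}
Network & Diameter & Bisection width & Arc connectivity & Cost (links) \\
\hline
Completely-connected        & $1$                         & $p^2/4$     & $p-1$    & $p(p-1)/2$      \\
Star                        & $2$                         & $1$         & $1$      & $p-1$           \\
Complete binary tree        & $2\log\frac{p+1}{2}$        & $1$         & $1$      & $p-1$           \\
Linear array                & $p-1$                       & $1$         & $1$      & $p-1$           \\
2-D mesh, no wraparound     & $2(\sqrt{p}-1)$             & $\sqrt{p}$  & $2$      & $2(p-\sqrt{p})$ \\
2-D wraparound mesh         & $2\lfloor\sqrt{p}/2\rfloor$ & $2\sqrt{p}$ & $4$      & $2p$            \\
Hypercube                   & $\log p$                    & $p/2$       & $\log p$ & $(p\log p)/2$   \\
Wraparound $k$-ary $d$-cube & $d\lfloor k/2\rfloor$       & $2k^{d-1}$  & $2d$     & $dp$            \\
\end{tabular}
\end{document}
```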

Comparison

Dynamic network topologies compared by diameter, bisection width, arc connectivity, and cost (number of links):

Crossbar

Omega Network

Dynamic Tree

Example: Cray T3E

Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

Example: SGI Origin 3000

Architecture of the SGI Origin 3000 family of servers. Example: Sun HPC Systems

Architecture of the Sun Enterprise family of servers. Example: Blue Gene/Q 5D Torus

64