
MIMD & Shared-Nothing Systems
Programmierung Paralleler und Verteilter Systeme (PPV)

Summer 2015

Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze

Multiple Instruction Multiple Data (MIMD)

■ Most common parallel hardware architecture today
□ Example: all many-core processors, clusters, distributed systems
■ From a software perspective [Pfister]:
■ SPMD – Single Program Multiple Data (see the MPI sketch below)
□ Sometimes denoted as 'application cluster'
□ Examples: load-balancing cluster or failover cluster for web servers, application servers, ...
■ MPMD – Multiple Program Multiple Data
□ Multiple implementations work together on one parallel computation
□ Example: master / worker cluster, map / reduce framework
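The SPMD style can be made concrete with a small MPI program. This is a minimal sketch, assuming a standard MPI installation (compiled with mpicc, launched with mpirun); the rank-based master/worker split is the point being illustrated, not any particular application.

```c
/* Hedged sketch of the SPMD style with MPI: one program, many data.
   Assumes an MPI installation (compile with mpicc, run with mpirun). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same binary everywhere; the rank selects the role (master / worker). */
    if (rank == 0) {
        printf("master coordinating %d workers\n", size - 1);
    } else {
        printf("worker %d processing its share of the data\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```

Launching, e.g., mpirun -np 4 ./spmd starts four instances of the same binary, one of which takes the master role.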

2 MIMD Classification

3 Memory Architectures

■ Uniform Memory Access (UMA)
■ Non-Uniform Memory Access (NUMA)
■ Distributed Memory
■ Hybrid

4 Shared Memory vs. Distributed Memory Systems

Shared memory (SM) systems
■ SM-SIMD: single-CPU vector processors
■ SM-MIMD: multi-CPU vector processors, OpenMP (see the OpenMP sketch below)
■ Variant: clustered shared-memory systems (NEC SX-6, Cray SV1ex)
Distributed memory (DM) systems
■ DM-SIMD: processor-array machines; lock-step approach; front-end processor and control processor
■ DM-MIMD: large variety of interconnection networks
Distributed (virtual) shared-memory systems
■ High Performance Fortran, TreadMarks
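For the SM-MIMD case, OpenMP is the typical programming interface: all threads work on the same data in a shared address space. A minimal sketch, assuming a compiler with OpenMP support (e.g., gcc -fopenmp); the array size and the reduction are illustrative only.

```c
/* Hedged sketch: shared-memory MIMD with OpenMP (compile with -fopenmp).
   All threads operate on the same array in the shared address space. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double data[N];
    double sum = 0.0;

    /* Threads divide the iteration space; 'data' is shared, 'i' is private. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        data[i] = (double)i;
        sum += data[i];
    }

    printf("max threads: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}
```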

5 Shared Memory Architectures

■ All processors act independently and access the same global memory
■ Changes in one memory location are visible to all others
■ Uniform memory access (UMA) system
□ Equal load and store access for all processors to all memory
□ Default approach for the majority of SMP systems in the past
■ Non-uniform memory access (NUMA) system
□ Delay on memory access depends on the accessed region
□ Typically realized by a processor interconnection network and local memories
□ Cache-coherent NUMA (CC-NUMA), completely implemented in hardware
□ About to become the standard approach with recent chips

6 NUMA Classification

7 MIMD Systems

Sequent Balance

8 Sequent Symmetry

Sequent was bought by IBM in 1999. IBM produced several Intel-based servers based on Sequent's later NUMA architecture…

9 Caches – managing contention

Effect of write-through and write-back cache coherency protocols on Sequent Symmetry

10 Intel Paragon XP/S
■ i860 RISC processor (64 bit, 50 MHz, 75 MFlops)
■ Standard OS (OSF/1) on each node
■ Cluster in a box

11 Intel Paragon XP/S – interconnection network

12 Intel Paragon XP/S – partitioning

13 IBM SP/2

14 Example: Intel Nehalem SMP System

[Figure: Intel Nehalem SMP system – multiple processor sockets, each with four cores, a shared L3 cache, and integrated memory controllers with local memory, interconnected by QPI links with each other and with the I/O hubs]

15 An Intel Nehalem Cluster: SMP + NUMA

[Figure: several Nehalem SMP nodes connected by a network]

16 CC-NUMA

■ Still the SMP programming model, but non-NUMA-aware software scales poorly (see the allocation sketch below)
■ Different implementations lead to a diffuse understanding of "node"; typical:
□ Region of memory where every byte has the same distance from each processor
■ Tackles the problems of pure SMP architectures, while keeping the location-independence promise
■ Recent research tendency towards non-cache-coherent NUMA approaches (Intel Single Chip Cloud Computer)
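On Linux, software can be made NUMA-aware by placing memory explicitly with the libnuma API instead of relying on first-touch placement. A minimal sketch, assuming libnuma and its headers are installed (link with -lnuma); the buffer size and the chosen node are arbitrary examples.

```c
/* Hedged sketch: NUMA-aware allocation with Linux libnuma
   (assumes libnuma headers/library are installed; link with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }

    int nodes = numa_max_node() + 1;
    printf("NUMA nodes: %d\n", nodes);

    /* Place a buffer on node 0 explicitly instead of relying on first-touch. */
    size_t size = 64UL * 1024 * 1024;
    double *buf = numa_alloc_onnode(size, 0);
    if (!buf) {
        fprintf(stderr, "allocation on node 0 failed\n");
        return EXIT_FAILURE;
    }

    buf[0] = 42.0;                 /* touch the memory: pages live on node 0 */
    printf("allocated %zu bytes on node 0\n", size);

    numa_free(buf, size);
    return 0;
}
```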

[Figure: CC-NUMA – processors A-D, each with its own cache and local memory, connected by a high-speed interconnect]

17 Scalable Coherent Interface

■ ANSI / IEEE standard for a NUMA interconnect, used in the HPC world
■ 64-bit global address space, translation by the SCI bus adapter (I/O window)
■ Used as 2D / 3D torus

[Figure: processors A-D with caches; local memories attached via SCI caches and SCI bridges, connected in a ring / torus]

18 Experimental Approaches

■ Systolic arrays: data flow architectures
■ Problem: common clock – maximum signal path restricted by frequency
■ Fault contention: a single faulty processing element will break the entire machine

19 Parallel Processing

20
■ Inside the processor
□ Instruction-level parallelism (ILP)
□ Multicore
□ Shared memory
■ With multiple processing elements in one machine
□ Multiprocessor
□ Shared memory
■ With multiple processing elements in many machines
□ Multicomputer
□ Shared nothing (in terms of a globally accessible memory)

Clusters

21
■ Collection of stand-alone machines connected by a local network
□ Cost-effective technique for building a large-scale parallel computer
□ Low cost of both hardware and software
□ Users are builders, have control over their own system (hardware and software); low cost as major issue
■ Distributed processing as extension of DM-MIMD
□ Communication orders of magnitude slower than with SM
□ Only feasible for coarse-grained parallel activities (see the latency sketch below)
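The communication-cost argument can be illustrated with a simple MPI ping-pong between two processes. A minimal sketch, assuming an MPI installation and at least two processes; inter-node round trips typically take microseconds, several orders of magnitude more than a shared-memory access, which is why only coarse-grained decompositions pay off.

```c
/* Hedged sketch: MPI ping-pong to illustrate why cluster communication
   favours coarse-grained parallelism (assumes an MPI installation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    const int rounds = 1000;
    char byte = 0;
    double start = MPI_Wtime();

    for (int i = 0; i < rounds; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("average round-trip latency: %.2f us\n",
               (MPI_Wtime() - start) / rounds * 1e6);

    MPI_Finalize();
    return 0;
}
```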

[Figure: load balancer distributing requests across multiple web servers]

22 History of Clusters

23
■ 1977: ARCnet (Datapoint)
□ LAN protocol, DATABUS programming language
□ Single computer with terminals
□ Addition of 'compute resource' and 'data resource' transparent for the application
■ May 1983: VAXCluster (DEC)
□ Cluster of VAX computers, no single point of failure
□ Every component duplicated
□ High-speed messaging interconnect
□ Distributed version of the VMS OS
□ Distributed lock manager for shared resources

History of Clusters – NOW

24
■ Berkeley Network Of Workstations (NOW) – 1995
■ Building a large-scale system with COTS hardware
■ GLUnix
□ Transparent remote execution, network PIDs
□ Load balancing
□ Virtual node numbers (for communication)
■ Network RAM – idle machines as paging device
■ Collection of low-latency, parallel communication primitives – 'active messages'
■ Berkeley sockets, shared address space parallel C, MPI

Cluster System Classes

25
■ High-availability (HA) clusters – improvement of cluster availability
□ Linux-HA project (multi-protocol heartbeat, resource grouping)
■ Load-balancing clusters – server farm for increased performance / availability
□ Linux Virtual Server (IP load & application-level balancing)
■ High-performance computing (HPC) clusters – increased performance by splitting tasks among different nodes
□ Speed up the computation of one distributed job (FLOPS)
■ High-throughput computing (HTC) clusters – maximize the number of finished jobs
□ All kinds of simulations, especially parameter sweeps
□ Special case: idle time computing for cycle harvesting

Massively Parallel Processing (MPP)

26
■ Hierarchical SIMD / MIMD architecture with a lot of processors
□ Still standard components (in contrast to mainframes)
□ Specialized setup of these components
□ Host nodes responsible for loading program and data to the PEs
□ High-performance interconnect (bus, ring, 2D mesh, hypercube, tree, ...)
□ Mostly for simulation applications (atom bomb, climate, earthquake, airplane, car crash, ...)
■ Examples
□ Distributed Array Processor (1979), 64x64 single-bit PEs
□ BlueGene/L (2007), 106,496 nodes x 2 PowerPC (700 MHz)
□ IBM Sequoia (2012), 16.3 PFlops, 1.6 PB memory, 98,304 compute nodes, 1.6 million cores, 7890 kW power

1.1 View from the outside

The Blue Gene/P system has the familiar, slanted profile that was introduced with the Blue Gene/L system. However, the increased compute power requires an increase in airflow, resulting in a larger footprint. Each of the air plenums on the Blue Gene/P system is just over ten inches wider than the plenums of the previous model. Additionally, each Blue Gene/P rack is approximately four inches wider. There are two additional Bulk Power Modules mounted in the Bulk Power enclosure on the top of the rack. Rather than a circuit-breaker-style switch, there is an on/off toggle switch to power on the machine.

1.1.1 Packaging

Figure 1-1 illustrates the packaging of the Blue Gene/L system.

27 Blue Gene/L packaging (Figure 1-1):
■ Chip: 2 processors – 2.8/5.6 GF/s, 4 MB
■ Compute card: 2 chips (1x2x1) – 5.6/11.2 GF/s, 1.0 GB
■ Node card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2) – 90/180 GF/s, 16 GB
■ Rack: 32 node cards – 2.8/5.6 TF/s, 512 GB
■ System: 64 racks (64x32x32) – 180/360 TF/s, 32 TB

Figure 1-2 on page 3 shows how the Blue Gene/P system is packaged. The changes start at the lowest point of the chain. Each chip is made up of four processors rather than just two processors like the Blue Gene/L system supports.

At the next level, only one chip is on each of the compute (processor) cards. This design is easier to maintain with less waste. On the Blue Gene/L system, the replacement of a compute node because of a single failed processor requires the discard of one usable chip because the chips are packaged with two per card. The design of the Blue Gene/P system has only one chip per processor card, eliminating the disposal of a good chip when a compute card is replaced.

Each node card still has 32 chips, but now the maximum number of I/O nodes per node card is two, so that only two ports are on the front of each node card. Like the Blue Gene/L system, there are two midplanes per rack. The lower midplane is considered to be the …

Blue Gene/P

28 Blue Gene/P packaging:
■ Chip: 4 processors – 13.6 GF/s, 8 MB EDRAM
■ Compute card: 1 chip, 20 DRAMs – 13.6 GF/s, 2.0 GB DDR2 (4.0 GB 6/30/08)
■ Node card: 32 compute cards, 0-1 I/O cards (32 chips, 4x4x2) – 435 GF/s, 64 (128) GB
■ Rack: 32 node cards, cabled 8x8x16 – 13.9 TF/s, 2 (4) TB
■ System: 72 racks (72x32x32) – 1 PF/s, 144 (288) TB

Blue Gene/Q

29 Blue Gene/Q packaging:
1. Chip: 16+2 processor cores
2. Single chip module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

■ Sustained single-node performance: 10x P, 20x L
■ MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
■ Software and hardware support for programming models for exploitation of node hardware concurrency

Blue Gene/Q

30 BlueGene/Q compute chip
■ 360 mm² Cu-45 technology (SOI), 11 metal layers
■ System-on-a-chip design: integrates processors, memory and networking logic into a single chip
■ 16 user + 1 service PPC processors, plus 1 redundant processor; all processors are symmetric
□ each 4-way multi-threaded, 64 bit, 1.6 GHz
□ L1 I/D cache = 16 kB / 16 kB, L1 prefetch engines
□ each processor has a quad FPU (4-wide double precision, SIMD)
□ peak performance 204.8 GFLOPS @ 55 W
■ Central shared L2 cache: 32 MB
□ eDRAM, multiversioned cache
□ supports transactional memory and speculative execution
□ supports scalable atomic operations
■ Dual memory controller
□ 16 GB external DDR3 memory
□ 42.6 GB/s DDR3 (1.333 GHz DDR3, 2 channels, each with chip-kill protection)
■ Chip-to-chip networking
□ 5D torus topology + external link: 5 x 2 + 1 high-speed serial links
□ each 2 GB/s send + 2 GB/s receive
□ DMA, remote put/get, collective operations
■ External (file) I/O – when used as I/O chip
□ PCIe Gen2 x8 interface (4 GB/s Tx + 4 GB/s Rx)
□ re-uses 2 serial links
□ interface to Ethernet or InfiniBand cards

Blue Gene/Q System Architecture

31 [Figure: Blue Gene/Q system architecture – compute nodes (CNK) and I/O nodes (Linux with ciod and fs client) connected by the collective / torus networks; a 10 Gb functional network links I/O nodes to front-end nodes and file system servers; a 1 Gb control Ethernet (FPGA / JTAG) links the racks to the service node running MMCS, DB2, LoadLeveler, and the console]

Blue Gene/Q Software Stack Openness

32 [Figure: Blue Gene/Q software stack on compute / I/O nodes and service / login nodes – application layer (MPI, Charm++, MPI-IO, ESSL, XL and GNU compilers/runtimes, schedulers, debuggers, HPC toolkit), system layer (Compute Node Kernel, CIO services, converged messaging stack PAMI, GPFS, high-level control system MMCS with DB2, LoadLeveler, runjob / scheduler APIs), and firmware layer (bootloader, diagnostics, RAS, low-level control system for power on/off, hardware probe and init, parallel boot, mailbox); components are colour-coded by openness, ranging from new open source reference implementations under the CPL licence, through existing open source communities under various licences, to closed components with no source provided]

MPP Properties

33
■ Standard components (processors, hard disks, ...)
■ Specific non-standardized interconnection network
□ Low latency, high speed; distributed file system
■ Specific packaging of components and nodes for cooling and upgradeability
□ Whole system provided by one vendor (IBM, HP)
□ Extensibility as major issue, in order to save investment
□ Distributed processing as extension of DM-MIMD
■ Scalability only limited by the application, not by the hardware
■ Proprietary wiring of standard components
■ Demands custom operating system and aligned applications
■ No major consideration of availability
■ Power consumption, cooling

Distributed System

34
■ Tanenbaum (Distributed Operating Systems): "A distributed system is a collection of independent computers that appear to the users of the system as a single computer."
■ Coulouris et al.: "... [a system] in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages."
■ Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
■ Consequences: concurrency, no global clock, independent failures
■ Challenges: heterogeneity, openness, security, scalability, failure handling, concurrency, need for transparency

SMP vs. Cluster vs. Distributed System

35
■ Clusters are composed of computers, SMPs are composed of processors
□ High availability is cheaper with clusters, but demands additional software components
□ Scalability is easier with a cluster
□ SMPs are easier to maintain from the administrator's point of view
□ Software licensing becomes more expensive with a cluster
■ Clusters for capability computing, integrated machines for capacity computing
■ Cluster vs. distributed system
□ Both consist of multiple nodes for parallel processing
□ Nodes in a distributed system have their own identity
□ Physical vs. virtual organization

Comparison

36
■ Number of nodes: MPP O(100)-O(1000); SMP O(10)-O(100); Cluster O(100) or less; Distributed O(10)-O(1000)
■ Node complexity: MPP fine grain; SMP medium or coarse grained; Cluster medium grain; Distributed wide range
■ Internode communication: MPP message passing / shared variables (SM); SMP centralized and distributed shared memory; Cluster message passing; Distributed shared files, RPC, message passing, IPC
■ Job scheduling: MPP single run queue on host; SMP single run queue mostly; Cluster multiple queues but coordinated; Distributed independent queues
■ SSI support: MPP partially; SMP always in SMP; Cluster desired; Distributed no
■ Address space: MPP multiple; SMP single; Cluster multiple or single; Distributed multiple
■ Internode security: MPP irrelevant; SMP irrelevant; Cluster required if exposed; Distributed required
■ Ownership: MPP one organization; SMP one organization; Cluster one or many organizations; Distributed many organizations

K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming; WCB/McGraw-Hill, 1998

Interconnection Networks

37
■ Shared-nothing systems demand structured connectivity
□ Processor-to-processor interaction
□ Processor-to-memory interaction
■ Static network
□ Point-to-point links, fixed route
■ Dynamic network
□ Consists of links and switching elements
□ Flexible configuration of processor interaction

Interconnection Networks

38 Interconnection Networks

39 ■ Dynamic networks are built from a graph of configurable switching elements ■ General packet switching network counts as irregular static network [Peter Newman] Interconnection Networks

■ Network interfaces
□ Processors talk to the network via a network interface controller (NIC)
□ Network interfaces attached to the interconnect
◊ Cluster vs. tightly-coupled multicomputer
□ Next-generation hardware will include the NIC on the processor die
■ Switching elements map a fixed number of inputs to outputs
□ The total number of ports is the degree of the switch
□ The cost of a switch grows as the square of the degree
□ The peripheral hardware grows linearly with the degree

Interconnection Networks

■ A variety of network topologies have been proposed and implemented
■ Each topology has a performance / cost tradeoff
■ Commercial machines often implement hybrids
□ Optimize packaging and costs
■ Metrics for an interconnection network graph
□ Diameter: maximum distance between any two nodes
□ Connectivity: minimum number of edges that must be removed to get two independent graphs
□ Link width / weight: transfer capacity of an edge
□ Bisection width: minimum transfer capacity between any two halves of the graph
□ Costs: number of edges in the network
■ Often optimization for the connectivity metric

Bus Systems

42
■ Static interconnect technology
■ Shared communication path, broadcasting of information
□ Diameter: O(1)
□ Connectivity: O(1)
□ Bisection width: O(1)
□ Costs: O(p)


Crossbar Switch (Kreuzschienenverteiler)

43
■ Dynamic switch-based network
■ Non-blocking: supports an arbitrary number of permutations, multiple connections without collisions
■ Diameter: O(1)
■ Connectivity: O(1)
■ Bisection width: O(n)
■ Costs: O(n²)
□ n·(n-1) connection points – high cost, quadratic growth, bad scalability


44 Crossbar Switch

45 Multistage Interconnection Networks

46
■ Connection by switching elements
■ Typical solution to connect processing and memory elements
■ Can implement sorting or shuffling in the network

Omega Network

47
■ Inputs are crossed or not, depending on the routing logic
□ Destination-tag routing: use the positional bit for the switch decision
□ XOR-tag routing: use the positional bit of the XOR result for the decision

■ For N PE’s, N/2 switches per stage, log2N stages ■ Decrease bottleneck probability on parallel communication Delta Networks

Delta Networks

48
■ Stage n checks bit k of the destination tag
■ Only (n/2 · log n) delta switches needed
■ Limited cost
■ Not all possible permutations operational in parallel
■ Possible effects of 'output port contention' and 'path contention'

Clos Coupling – Delta Networks and Crossbar

49
■ Clos coupling networks: combination of delta network and crossbar
■ C. Clos, "A Study of Non-Blocking Switching Networks", Bell System Technical Journal, vol. 32, no. 2, 1953, pp. 406-424

Fat-Tree Networks
■ PEs arranged as leaves of a binary tree
■ Capacity of the tree (links) doubles on each layer

Bitonic Mergesort

50 Completely Connected / Star Connected Networks

51 Cartesian Topology Network

52 [Figure: linear arrays, 2D and 3D meshes]

Point-to-Point Networks: Ring and Fully Connected Graph
■ Ring has only two connections per PE (almost optimal)
■ Fully connected graph – optimal connectivity (but high cost)

Cartesian Topology Network

53
■ Linear array: each node has two neighbours
■ 1D torus / ring: linear array with connected endings
■ 2D mesh / torus: each node has four neighbours
■ Mesh and torus are a compromise between cost and connectivity (see the MPI topology sketch below)
■ d-dimensional mesh: nodes with 2d neighbours
■ Hypercube
□ d-dimensional mesh where d = log n (# processors)
□ Constructed from lower-dimensional hypercubes

[Figure: 4-way 2D mesh, 4-way 2D torus, 8-way 2D mesh]
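MPI exposes such Cartesian topologies directly, so a program can be mapped onto a mesh or torus without hand-coding the neighbour arithmetic. A minimal sketch, assuming an MPI installation; the 2D periodic grid and the printed neighbour ranks are for illustration only.

```c
/* Hedged sketch: mapping processes onto a 2D torus with MPI's Cartesian
   topology support (assumes an MPI installation and mpicc). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0};              /* let MPI factor the process count */
    MPI_Dims_create(size, 2, dims);

    int periods[2] = {1, 1};           /* wrap-around links -> 2D torus    */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);
    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);

    int left, right;                   /* neighbours along dimension 1     */
    MPI_Cart_shift(cart, 1, 1, &left, &right);

    printf("rank %d at (%d,%d), left=%d right=%d\n",
           rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```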


Cubic Mesh
■ PEs are arranged in a cubic fashion
■ Each PE has 6 links to its neighbours

Hypercube

54 ■ Dimensions 0-4, recursive definition

Hypercubes

55
■ Diameter: at most log(n)
■ Each node has log(n) neighbours
■ Distance between two nodes: number of bit positions in which their addresses differ (see the hypercube sketch below)
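The bit-position view translates directly into code: the distance is the Hamming distance of the node labels, and dimension-order (e-cube) routing corrects one differing bit per hop. A minimal sketch; the node labels and the dimension d = 4 are arbitrary, and __builtin_popcount assumes GCC or Clang.

```c
/* Hedged sketch: distance and dimension-order routing in a hypercube.
   Node labels and the d value are illustrative. */
#include <stdio.h>

/* Hypercube distance = number of differing address bits (Hamming distance). */
static int hypercube_distance(unsigned a, unsigned b) {
    return __builtin_popcount(a ^ b);   /* GCC/Clang builtin */
}

int main(void) {
    const int d = 4;                    /* 4-dimensional hypercube, 16 nodes */
    unsigned src = 0x3, dst = 0xC;      /* 0011 -> 1100 */

    printf("distance = %d hops\n", hypercube_distance(src, dst));

    /* E-cube (dimension-order) routing: correct one differing bit per hop. */
    unsigned node = src;
    for (int bit = 0; bit < d; bit++) {
        if (((node ^ dst) >> bit) & 1u) {
            node ^= 1u << bit;          /* traverse the link in dimension 'bit' */
            printf("hop across dimension %d -> node 0x%X\n", bit, node);
        }
    }
    return 0;
}
```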

56 Fat Trees

57

■ Tree structure:
□ The distance between any two nodes is no more than 2 log p
□ Links higher up potentially carry more traffic – bottleneck at the root node
□ Can be laid out in 2D with no wire crossings
■ Fat tree:
□ Fattens the links as we go up the tree


Systolic Arrays

58
■ Data flow architecture
■ Common clock – maximum signal path restricted by frequency
■ Fault contention: a single faulty processing element breaks the complete array

Comparison

Static network topologies compared by diameter, bisection width, arc connectivity, and cost (number of links); typical values are sketched after the list:

Completely-connected

Star

Complete binary tree

Linear array

2-D mesh, no wraparound

2-D wraparound mesh

Hypercube

Wraparound k-ary d-cube
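For reference, the standard values for these static topologies with p nodes, as tabulated in the common textbook treatment (e.g., Grama et al., Introduction to Parallel Computing), are sketched below; k and d refer to the wraparound k-ary d-cube.

```latex
% Standard static-network metrics for p nodes (textbook values, cf. Grama et al.)
\documentclass{article}
\begin{document}
\begin{tabular}{lllll}
Network & Diameter & Bisection width & Arc connectivity & Cost (links) \\
\hline
Completely-connected        & $1$                         & $p^2/4$     & $p-1$    & $p(p-1)/2$      \\
Star                        & $2$                         & $1$         & $1$      & $p-1$           \\
Complete binary tree        & $2\log\frac{p+1}{2}$        & $1$         & $1$      & $p-1$           \\
Linear array                & $p-1$                       & $1$         & $1$      & $p-1$           \\
2-D mesh, no wraparound     & $2(\sqrt{p}-1)$             & $\sqrt{p}$  & $2$      & $2(p-\sqrt{p})$ \\
2-D wraparound mesh         & $2\lfloor\sqrt{p}/2\rfloor$ & $2\sqrt{p}$ & $4$      & $2p$            \\
Hypercube                   & $\log p$                    & $p/2$       & $\log p$ & $(p\log p)/2$   \\
Wraparound $k$-ary $d$-cube & $d\lfloor k/2\rfloor$       & $2k^{d-1}$  & $2d$     & $dp$            \\
\end{tabular}
\end{document}
```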

Comparison

Dynamic network topologies compared by diameter, bisection width, arc connectivity, and cost (number of links):

Crossbar

Omega Network

Dynamic Tree

Example: Cray T3E

Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

Example: SGI Origin 3000

Architecture of the SGI Origin 3000 family of servers. Example: Sun HPC Systems

Architecture of the Sun Enterprise family of servers. Example: Blue Gene/Q 5D Torus

64