CS267 Lecture 7: Distributed Memory Machines and Programming
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr15
Slides from Kathy Yelick

Recap of Lecture 6
• Shared memory multiprocessors
  • Caches may be either shared or distributed.
    • Multicore chips are likely to have shared caches.
    • Cache hit performance is better if caches are distributed (each cache is smaller/closer), but they must then be kept coherent -- multiple cached copies of the same location must be kept equal.
    • Requires clever hardware (see CS258, CS252).
    • Distant memory is much more expensive to access.
  • Machines scale to 10s or 100s of processors.
• Shared memory programming
  • Starting, stopping threads.
  • Communication by reading/writing shared variables.
  • Synchronization with locks, barriers.

Outline
• Distributed Memory Architectures
  • Properties of communication networks
  • Topologies
  • Performance models
• Programming Distributed Memory Machines using Message Passing
  • Overview of MPI
  • Basic send/receive use
  • Non-blocking communication
  • Collectives

Architectures in Top 500, Nov 2014
[Chart: number of Top 500 systems by architecture class -- Cluster, MPP, Constellations, SMP, SIMD, Single Processor -- for each list from 1993 to 2013.]

Historical Perspective
• Early distributed memory machines were:
  • Collections of microprocessors.
  • Communication was performed using bi-directional queues between nearest neighbors.
• Messages were forwarded by processors on the path.
  • "Store and forward" networking
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops = minimize time.

Network Analogy
• To have a large number of different transfers occurring at once, you need a large number of distinct wires
  • Not just a bus, as in shared memory
• Networks are like streets:
  • Link = street.
  • Switch = intersection.
  • Distances (hops) = number of blocks traveled.
  • Routing algorithm = travel plan.
• Properties:
  • Latency: how long to get between nodes in the network.
    • Street: time for one car = dist (miles) / speed (miles/hr)
  • Bandwidth: how much data can be moved per unit time.
    • Street: cars/hour = density (cars/mile) * speed (miles/hr) * #lanes
  • Network bandwidth is limited by the bit rate per wire and #wires


Design Characteristics of a Network
• Topology (how things are connected)
  • Crossbar; ring; 2-D, 3-D, or higher-D mesh or torus; hypercube; tree; butterfly; perfect shuffle; dragonfly; …
• Routing algorithm:
  • Example in a 2D torus: route all east-west, then all north-south (avoids deadlock).
• Switching strategy:
  • Circuit switching: full path reserved for the entire message, like the telephone.
  • Packet switching: message broken into separately-routed packets, like the post office, or internet.
• Flow control (what if there is congestion):
  • Stall, store data temporarily in buffers, re-route data to other nodes, tell source node to temporarily halt, discard, etc.

Performance Properties of a Network: Latency
• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes.
• Latency: delay between send and receive times
  • Latency tends to vary widely across architectures
  • Vendors often report hardware latencies (wire time)
  • Application programmers care about software latencies (user program to user program)
• Observations:
  • Latencies differ by 1-2 orders of magnitude across network designs
  • Software/hardware overhead at source/destination dominates cost (1s-10s usecs)
  • Hardware latency varies with distance (10s-100s nsec per hop) but is small compared to overheads
• Latency is key for programs with many small messages

Latency on Some Machines/Networks
• Latencies shown are from a ping-pong test using MPI
• These are roundtrip numbers: many people use 1/2 of the roundtrip time to approximate 1-way latency (which can't easily be measured)
[Chart: 8-byte roundtrip latency, in usec, for T3E, Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, SP/Fed. Data from Kathy Yelick, UCB and NERSC.]

End to End Latency (1/2 roundtrip) Over Time
• Latency has not improved significantly, unlike Moore's Law
• T3E was the lowest point, in 1997
[Chart: 1/2-roundtrip MPI ping-pong latency, in usec, vs. year (approximate), 1990-2010, for machines including nCube/2, CM5, SP1, SP2, Paragon, T3D, T3E, KSR, Cenju3, SP-Power3, Myrinet, SPP, Quadrics.]

Performance Properties of a Network: Bandwidth
• The bandwidth of a link = # wires / time-per-bit
• Bandwidth is typically measured in Gigabytes/sec (GB/s), i.e., 8*2^30 bits per second
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead: each packet carries a routing and control header, the data payload, an error code, and a trailer.
• Bandwidth is important for applications with mostly large messages

Bandwidth on Existing Networks
• Flood bandwidth (throughput of back-to-back 2MB messages)
[Chart: MPI flood bandwidth for 2MB messages, shown as a percent of HW peak with the peak BW in MB labeled per bar, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, SP/Fed.]

Bandwidth Chart
• Note: bandwidth depends on SW, not just HW
[Chart: achieved bandwidth (MB/sec) vs. message size (2048 to 131072 bytes) for T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, SysKonnect. Data from Mike Welcome, NERSC.]

Performance Properties of a Network: Bisection Bandwidth
• Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the "narrowest" part of the network
[Figure: a bisection cut vs. a cut that is not a bisection cut; for a linear array, bisection bw = link bw; for a sqrt(p) x sqrt(p) 2D mesh, bisection bw = sqrt(p) * link bw.]
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others

Network Topology
• In the past, there was considerable research in network topology and in mapping algorithms to topology.
  • Key cost to be minimized: number of "hops" between nodes (e.g. "store and forward")
• Modern networks hide hop cost (i.e., "wormhole routing"), so topology is less of a factor in the performance of many algorithms
  • Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
• Need some background in network topology
  • Algorithms may have a communication topology
  • Example later of big performance impact

Linear and Ring Topologies
• Linear array
  • Diameter = n-1; average distance ~ n/3.
  • Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or Ring
  • Diameter = n/2; average distance ~ n/4.
  • Bisection bandwidth = 2.
  • Natural for algorithms that work with 1D arrays.


Meshes and Tori – used in Hopper
• Two dimensional mesh
  • Diameter = 2 * (sqrt(n) – 1)
  • Bisection bandwidth = sqrt(n)
• Two dimensional torus
  • Diameter = sqrt(n)
  • Bisection bandwidth = 2 * sqrt(n)
• Generalizes to higher dimensions
  • Cray XT (e.g. Hopper@NERSC) uses a 3D torus
• Natural for algorithms that work with 2D and/or 3D arrays (matmul)

Hypercubes
• Number of nodes n = 2^d for dimension d.
  • Diameter = d.
  • Bisection bandwidth = n/2.
• [Figure: 0d, 1d, 2d, 3d, and 4d hypercubes.]
• Popular in early machines (Intel iPSC, NCUBE).
  • Lots of clever algorithms.
  • See 1996 online CS267 notes.
• Greycode addressing: each node is connected to d others whose addresses differ in exactly 1 bit (see the neighbor sketch below).
  • [Figure: 3d hypercube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111.]
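The addressing rule above is easy to express in code: in a d-dimensional hypercube, a node's d neighbors are obtained by flipping one address bit at a time. A minimal sketch in C (not from the original slides; the function name is just illustrative):

  #include <stdio.h>

  /* In a d-dimensional hypercube, node addresses are d-bit integers and
     two nodes are connected iff their addresses differ in exactly one bit. */
  int hypercube_neighbor(int rank, int dim)   /* flip bit 'dim' of 'rank' */
  {
      return rank ^ (1 << dim);
  }

  int main(void)
  {
      int d = 3;        /* 2^3 = 8 nodes, as in the 3d figure */
      int rank = 5;     /* node 101 in binary */
      for (int dim = 0; dim < d; dim++)
          printf("neighbor of %d across dimension %d is %d\n",
                 rank, dim, hypercube_neighbor(rank, dim));
      return 0;
  }

For node 101 this prints neighbors 100, 111, and 001, matching the figure.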

Trees
• Diameter = log n.
• Bisection bandwidth = 1.
• Easy layout as a planar graph.
• Many tree algorithms (e.g., summation).
• Fat trees avoid the bisection bandwidth problem:
  • More (or wider) links near the top.
  • Example: Thinking Machines CM-5.

Butterflies
• Diameter = log n.
• Bisection bandwidth = n.
• Cost: lots of wires.
• Used in the BBN Butterfly.
• Natural for FFT.
• Ex: to get from proc 101 to 110, compare bit-by-bit and switch if the bits disagree, else not.
• [Figure: a butterfly switch and a multistage butterfly network with 0/1 routing labels.]

Does Topology Matter?
[Chart: bandwidth (MB/sec) of a 1 MB multicast on BG/P, Cray XT5, and Cray XE6, vs. number of nodes from 8 to 4096.]
• See EECS Tech Report UCB/EECS-2011-92, August 2011

Dragonflies – used in Edison
• Motivation: exploit the gap in cost and performance between optical interconnects (which go between cabinets in a machine room) and electrical networks (inside a cabinet)
  • Optical is more expensive but has higher bandwidth when long
  • Electrical networks are cheaper, faster when short
• Combine them in a hierarchy
  • One-to-many via electrical networks inside a cabinet
  • Just a few long optical interconnects between cabinets
• Clever routing algorithm to avoid bottlenecks:
  • Route from source to a randomly chosen intermediate cabinet
  • Route from the intermediate cabinet to the destination
• Outcome: programmer can (usually) ignore topology and get good performance
  • Important in a virtualized, dynamic environment
  • Programmer can still create serial bottlenecks
• Drawback: variable performance
• Details in "Technology-Driven, Highly-Scalable Dragonfly Topology," J. Kim, W. Dally, S. Scott, D. Abts, ISCA 2008

Evolution of Distributed Memory Machines
• Special queue connections are being replaced by direct memory access (DMA):
  • Network Interface (NI) processor packs or copies messages.
  • CPU initiates the transfer, then goes on computing.
• Wormhole routing in hardware:
  • NIs do not interrupt CPUs along the path.
  • Long message sends are pipelined.
  • NIs don't wait for the complete message before forwarding.
• Message passing libraries provide a store-and-forward abstraction:
  • Can send/receive between any pair of nodes, not just along one wire.
  • Time depends on distance since each NI along the path must participate.

Performance Models


Shared Memory Performance Models
• Parallel Random Access Memory (PRAM)
  • All memory access operations complete in one clock period -- no concept of memory hierarchy ("too good to be true").
  • OK for understanding whether an algorithm has enough parallelism at all (see CS273).
  • Design strategy: first do a PRAM algorithm, then worry about memory/communication time (sometimes works).
• Slightly more realistic versions exist
  • E.g., Concurrent Read Exclusive Write (CREW) PRAM.
  • Still missing the memory hierarchy.

Latency and Bandwidth Model
• Time to send a message of length n is roughly
    Time = latency + n*cost_per_word
         = latency + n/bandwidth
• Topology is assumed irrelevant.
• Often called the "α-β model" and written
    Time = α + n*β
• Usually α >> β >> time per flop.
  • One long message is cheaper than many short ones (see the sketch below):
      α + n*β << n*(α + 1*β)
  • Can do hundreds or thousands of flops for the cost of one message.
• Lesson: need a large computation-to-communication ratio to be efficient.
• LogP – a more detailed model (Latency/overhead/gap/Proc.)
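To make the "one long message vs. many short ones" point concrete, here is a small C sketch that just plugs numbers into the α-β formula; the α, β, and n values are illustrative placeholders, not measurements from the lecture.

  #include <stdio.h>

  /* Alpha-beta model: time to send n words is alpha + n*beta. */
  double message_time(double alpha, double beta, double n)
  {
      return alpha + n * beta;
  }

  int main(void)
  {
      double alpha = 10e-6;  /* 10 usec latency (illustrative) */
      double beta  = 1e-9;   /* 1 nsec per word (illustrative) */
      double n     = 1e6;    /* one million words */

      double one_long   = message_time(alpha, beta, n);        /* alpha + n*beta   */
      double many_short = n * message_time(alpha, beta, 1.0);  /* n*(alpha + beta) */

      printf("one message of %g words: %g s\n", n, one_long);
      printf("%g messages of 1 word:   %g s\n", n, many_short);
      return 0;
  }

With these numbers the single large message takes about a millisecond while the million one-word messages take about ten seconds, which is the point of the inequality above.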


Alpha-Beta Parameters on Current Machines
• These numbers were obtained empirically
• α is latency in usecs; β is BW in usecs per Byte

  machine        α        β
  T3E/Shm        1.2      0.003
  T3E/MPI        6.7      0.003
  IBM/LAPI       9.4      0.003
  IBM/MPI        7.6      0.004
  Quadrics/Get   3.267    0.00498
  Quadrics/Shm   1.3      0.005
  Quadrics/MPI   7.3      0.005
  Myrinet/GM     7.7      0.005
  Myrinet/MPI    7.2      0.006
  Dolphin/MPI    7.767    0.00529
  Giganet/VIPL   3.0      0.010
  GigE/VIPL      4.6      0.008
  GigE/MPI       5.854    0.00872

• How well does the model Time = α + n*β predict actual performance?

Model Time Varying Message Size & Machines
[Chart: modeled time (Time = α + n*β) vs. message size from 8 to 131072 bytes for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, GigE/MPI.]

Measured Message Time
[Chart: measured message time vs. message size from 8 to 131072 bytes for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, GigE/MPI.]

Programming Distributed Memory Machines with Message Passing

Slides from Jonathan Carter ([email protected]), Katherine Yelick ([email protected]), Bill Gropp ([email protected])


Message Passing Libraries (1)
• Many "message passing libraries" were once available
  • Chameleon, from ANL.
  • CMMD, from Thinking Machines.
  • Express, commercial.
  • MPL, native library on IBM SP-2.
  • NX, native library on Intel Paragon.
  • Zipcode, from LLL.
  • PVM, Parallel Virtual Machine, public, from ORNL/UTK.
  • Others...
  • MPI, Message Passing Interface, now the industry standard.
• Need standards to write portable code.

Message Passing Libraries (2)
• All communication and synchronization require subroutine calls
  • No shared variables
  • Program runs on a single processor just like any uniprocessor program, except for calls to the message passing library
• Subroutines for
  • Communication
    • Pairwise or point-to-point: Send and Receive
    • Collectives: all processors get together to
      – Move data: Broadcast, Scatter/gather
      – Compute and move: sum, product, max, … of data on many processors
  • Synchronization
    • Barrier
    • No locks because there are no shared variables to protect
  • Enquiries
    • How many processes? Which one am I? Any messages waiting?


Novel Features of MPI
• Communicators encapsulate communication spaces for library safety
• Datatypes reduce copying costs and permit heterogeneity
• Multiple communication modes allow precise buffer management
• Extensive collective operations for scalable global communication
• Process topologies permit efficient process placement and user views of process layout
• Profiling interface encourages portable tools

MPI References
• The Standard itself:
  • at http://www.mpi-forum.org
  • All MPI official releases, in both postscript and HTML
  • Latest version MPI 3.0, released Sept 2012
• Other information on the Web:
  • at http://www.mcs.anl.gov/mpi
  • pointers to lots of stuff, including other talks and tutorials, a FAQ, other MPI pages


Books on MPI
• Using MPI: Portable Parallel Programming with the Message-Passing Interface (2nd edition), by Gropp, Lusk, and Skjellum, MIT Press, 1999.
• Using MPI-2: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Thakur, MIT Press, 1999.
• MPI: The Complete Reference - Vol 1, The MPI Core, by Snir, Otto, Huss-Lederman, Walker, and Dongarra, MIT Press, 1998.
• MPI: The Complete Reference - Vol 2, The MPI Extensions, by Gropp, Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir, and Snir, MIT Press, 1998.
• Designing and Building Parallel Programs, by Ian Foster, Addison-Wesley, 1995.
• Parallel Programming with MPI, by Peter Pacheco, Morgan-Kaufmann, 1997.

Finding Out About the Environment
• Two important questions that arise early in a parallel program are:
  • How many processes are participating in this computation?
  • Which one am I?
• MPI provides functions to answer these questions:
  • MPI_Comm_size reports the number of processes.
  • MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process.

Hello (C)

  #include "mpi.h"
  #include <stdio.h>

  int main( int argc, char *argv[] )
  {
      int rank, size;
      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      MPI_Comm_size( MPI_COMM_WORLD, &size );
      printf( "I am %d of %d\n", rank, size );
      MPI_Finalize();
      return 0;
  }

Hello (Fortran)

      program main
      include 'mpif.h'
      integer ierr, rank, size

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'I am ', rank, ' of ', size
      call MPI_FINALIZE( ierr )
      end

Note: hidden slides show Fortran and C++ versions of each example.

Hello (C++)

  #include "mpi.h"
  #include <iostream>

  int main( int argc, char *argv[] )
  {
      int rank, size;
      MPI::Init(argc, argv);
      rank = MPI::COMM_WORLD.Get_rank();
      size = MPI::COMM_WORLD.Get_size();
      std::cout << "I am " << rank << " of " << size << "\n";
      MPI::Finalize();
      return 0;
  }

Notes on Hello World
• All MPI programs begin with MPI_Init and end with MPI_Finalize
• MPI_COMM_WORLD is defined by mpi.h (in C) or mpif.h (in Fortran) and designates all processes in the MPI "job"
• Each statement executes independently in each process
  • including the printf/print statements
• The MPI-1 Standard does not specify how to run an MPI program, but many implementations provide
    mpirun -np 4 a.out

MPI Basic Send/Receive
• We need to fill in the details in
    Process 0:  Send(data)   --->   Process 1:  Receive(data)
• Things that need specifying:
  • How will "data" be described?
  • How will processes be identified?
  • How will the receiver recognize/screen messages?
  • What will it mean for these operations to complete?

Some Basic Concepts
• Processes can be collected into groups
• Each message is sent in a context, and must be received in the same context
  • Provides necessary support for libraries
• A group and context together form a communicator
• A process is identified by its rank in the group associated with a communicator
• There is a default communicator whose group contains all initial processes, called MPI_COMM_WORLD

MPI Datatypes
• The data in a message to send or receive is described by a triple (address, count, datatype), where
• An MPI datatype is recursively defined as:
  • predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE)
  • a contiguous array of MPI datatypes
  • a strided block of datatypes
  • an indexed array of blocks of datatypes
  • an arbitrary structure of datatypes
• There are MPI functions to construct custom datatypes, in particular ones for subarrays (see the sketch below)
• May hurt performance if datatypes are complex

MPI Tags
• Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message
• Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive
• Some non-MPI message-passing systems have called tags "message types". MPI calls them tags to avoid confusion with datatypes
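As an example of a derived datatype, the sketch below builds a strided "column of a row-major matrix" type with MPI_Type_vector and sends it as a single message. This is an illustrative sketch, not code from the lecture; NROWS, NCOLS, and the function name are made-up names.

  #include <mpi.h>

  #define NROWS 4
  #define NCOLS 5

  /* Send column 'col' of a row-major NROWS x NCOLS matrix to rank 'dest'.
     The column is a strided block: NROWS blocks of 1 double, NCOLS apart. */
  void send_column(double a[NROWS][NCOLS], int col, int dest)
  {
      MPI_Datatype column;
      MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);
      MPI_Send(&a[0][col], 1, column, dest, 0, MPI_COMM_WORLD);
      MPI_Type_free(&column);
  }

The matching receive can use a plain contiguous buffer of NROWS doubles; the type signatures match even though the memory layouts differ.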

MPI Basic (Blocking) Send

  A(10)                                     B(20)
  MPI_Send( A, 10, MPI_DOUBLE, 1, … )       MPI_Recv( B, 20, MPI_DOUBLE, 0, … )

MPI_SEND(start, count, datatype, dest, tag, comm)
• The message buffer is described by (start, count, datatype).
• The target process is specified by dest, which is the rank of the target process in the communicator specified by comm.
• When this function returns, the data has been delivered to the system and the buffer can be reused. The message may not have been received by the target process.

MPI Basic (Blocking) Receive

MPI_RECV(start, count, datatype, source, tag, comm, status)
• Waits until a matching (both source and tag) message is received from the system, and then the buffer can be used.
• source is a rank in the communicator specified by comm, or MPI_ANY_SOURCE.
• tag is a tag to be matched or MPI_ANY_TAG.
• Receiving fewer than count occurrences of datatype is OK, but receiving more is an error.
• status contains further information (e.g. size of message).
(A round-trip timing sketch built from these two calls follows below.)
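The ping-pong latency numbers quoted earlier in the lecture come from exactly this pair of calls bounced back and forth. Here is a minimal sketch of such a test, assuming at least two ranks; NTRIPS is an illustrative choice, not a value from the slides.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, i;
      double buf = 0.0, t0, t1;
      const int NTRIPS = 1000;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < NTRIPS; i++) {
          if (rank == 0) {           /* ping */
              MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {    /* pong */
              MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();

      if (rank == 0)
          printf("average roundtrip = %g usec (1-way is roughly half)\n",
                 (t1 - t0) / NTRIPS * 1e6);

      MPI_Finalize();
      return 0;
  }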

A Simple MPI Program

  #include "mpi.h"
  #include <stdio.h>

  int main( int argc, char *argv[])
  {
      int rank, buf;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );

      /* Process 0 sends and Process 1 receives */
      if (rank == 0) {
          buf = 123456;
          MPI_Send( &buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      }
      else if (rank == 1) {
          MPI_Recv( &buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
          printf( "Received %d\n", buf );
      }
      MPI_Finalize();
      return 0;
  }

A Simple MPI Program (Fortran)

      program main
      include 'mpif.h'
      integer rank, buf, ierr, status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )
C     Process 0 sends and Process 1 receives
      if (rank .eq. 0) then
          buf = 123456
          call MPI_Send( buf, 1, MPI_INTEGER, 1, 0,
     *                   MPI_COMM_WORLD, ierr )
      else if (rank .eq. 1) then
          call MPI_Recv( buf, 1, MPI_INTEGER, 0, 0,
     *                   MPI_COMM_WORLD, status, ierr )
          print *, 'Received ', buf
      endif
      call MPI_Finalize(ierr)
      end

A Simple MPI Program (C++)

  #include "mpi.h"
  #include <iostream>

  int main( int argc, char *argv[])
  {
      int rank, buf;
      MPI::Init(argc, argv);
      rank = MPI::COMM_WORLD.Get_rank();

      // Process 0 sends and Process 1 receives
      if (rank == 0) {
          buf = 123456;
          MPI::COMM_WORLD.Send( &buf, 1, MPI::INT, 1, 0 );
      }
      else if (rank == 1) {
          MPI::COMM_WORLD.Recv( &buf, 1, MPI::INT, 0, 0 );
          std::cout << "Received " << buf << "\n";
      }
      MPI::Finalize();
      return 0;
  }

Retrieving Further Information
• Status is a data structure allocated in the user's program.
• In C:

    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status )
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &recvd_count );

MPI is Simple
• Many parallel programs can be written using just these six functions, only two of which are non-trivial:
  • MPI_INIT
  • MPI_FINALIZE
  • MPI_COMM_SIZE
  • MPI_COMM_RANK
  • MPI_SEND
  • MPI_RECV

Another Approach to Parallelism
• Collective routines provide a higher-level way to organize a parallel program
• Each process executes the same communication operations
• MPI provides a rich set of collective operations…

Collective Operations in MPI
• Collective operations are called by all processes in a communicator
• MPI_BCAST distributes data from one process (the root) to all others in a communicator
• MPI_REDUCE combines data from all processes in a communicator and returns it to one process
• In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency

Alternative Set of 6 Functions
• Claim: most MPI applications can be written with only 6 functions (although which 6 may differ)
• Using point-to-point: MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECEIVE
• Using collectives: MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_BCAST, MPI_REDUCE
• You may use more for convenience or performance

Example: Calculating Pi
• A simple program written in a data parallel style in MPI
  • E.g., for a reduction (recall the "tricks with trees" lecture), each process will first reduce (sum) its own values, then call a collective to combine them
• Estimates pi by approximating the area of the quadrant of a unit circle
• Each process gets 1/p of the intervals (mapped round robin, i.e., a cyclic mapping)
  • E.g., in a 4-process run, each process gets every 4th interval; process 0's slices are shown in red in the accompanying figure.

Example: PI in C – 1/2

  #include "mpi.h"
  #include <stdio.h>
  #include <math.h>

  int main(int argc, char *argv[])
  {
      int done = 0, n, myid, numprocs, i, rc;
      double PI25DT = 3.141592653589793238462643;
      double mypi, pi, h, sum, x, a;
      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);
      while (!done) {
          if (myid == 0) {
              printf("Enter the number of intervals: (0 quits) ");
              scanf("%d",&n);
          }
          MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
          if (n == 0) break;

Example: PI in C – 2/2

          h   = 1.0 / (double) n;
          sum = 0.0;
          for (i = myid + 1; i <= n; i += numprocs) {
              x = h * ((double)i - 0.5);
              sum += 4.0 * sqrt(1.0 - x*x);
          }
          mypi = h * sum;
          MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                     MPI_COMM_WORLD);
          if (myid == 0)
              printf("pi is approximately %.16f, Error is %.16f\n",
                     pi, fabs(pi - PI25DT));
      }
      MPI_Finalize();
      return 0;
  }

Example: PI in Fortran – 1/2

      program main
      include 'mpif.h'
      logical done
      integer n, myid, numprocs, i, ierr
      double precision PI25DT, mypi, pi, h, sum, x
      data done/.false./
      data PI25DT/3.141592653589793238462643/
      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, numprocs, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
      do while (.not. done)
         if (myid .eq. 0) then
            print *, 'Enter the number of intervals: (0 quits)'
            read *, n
         endif
         call MPI_Bcast(n, 1, MPI_INTEGER, 0,
     *                  MPI_COMM_WORLD, ierr)
         if (n .eq. 0) goto 10

Example: PI in Fortran – 2/2

         h = 1.0 / n
         sum = 0.0
         do i = myid+1, n, numprocs
            x = h * (i - 0.5)
            sum = sum + 4.0 / (1.0 + x*x)
         enddo
         mypi = h * sum
         call MPI_Reduce(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     *                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)
         if (myid .eq. 0) then
            print *, 'pi is approximately ', pi,
     *               ', Error is ', abs(pi - PI25DT)
         endif
      enddo
 10   continue
      call MPI_Finalize( ierr )
      end

Example: PI in C++ - 1/2

  #include "mpi.h"
  #include <iostream>
  #include <cmath>

  int main(int argc, char *argv[])
  {
      int done = 0, n, myid, numprocs, i, rc;
      double PI25DT = 3.141592653589793238462643;
      double mypi, pi, h, sum, x, a;
      MPI::Init(argc, argv);
      numprocs = MPI::COMM_WORLD.Get_size();
      myid = MPI::COMM_WORLD.Get_rank();
      while (!done) {
          if (myid == 0) {
              std::cout << "Enter the number of intervals: (0 quits) ";
              std::cin >> n;
          }
          MPI::COMM_WORLD.Bcast(&n, 1, MPI::INT, 0 );
          if (n == 0) break;

Example: PI in C++ - 2/2

          h   = 1.0 / (double) n;
          sum = 0.0;
          for (i = myid + 1; i <= n; i += numprocs) {
              x = h * ((double)i - 0.5);
              sum += 4.0 / (1.0 + x*x);
          }
          mypi = h * sum;
          MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE,
                                 MPI::SUM, 0);
          if (myid == 0)
              std::cout << "pi is approximately " << pi
                        << ", Error is " << fabs(pi - PI25DT) << "\n";
      }
      MPI::Finalize();
      return 0;
  }

Synchronization
• MPI_Barrier( comm )
• Blocks until all processes in the group of the communicator comm call it.
• Almost never required in a parallel program
• Occasionally useful in measuring performance and load balancing (see the timing sketch after the Fortran/C++ variants below)

Synchronization (Fortran)
• MPI_Barrier( comm, ierr )
• Blocks until all processes in the group of the communicator comm call it.

Synchronization (C++)
• comm.Barrier();
• Blocks until all processes in the group of the communicator comm call it.
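A minimal sketch of the "barrier before and after a timed region" pattern mentioned above; the usleep call (POSIX) is a deliberately imbalanced stand-in for real work, so the reported time is that of the slowest rank.

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);            /* start everyone together */
      double t0 = MPI_Wtime();
      usleep(1000 * (rank + 1));              /* imbalanced "work" (illustrative) */
      MPI_Barrier(MPI_COMM_WORLD);            /* wait for the slowest rank */
      double t1 = MPI_Wtime();

      if (rank == 0)
          printf("time including load imbalance: %g s\n", t1 - t0);

      MPI_Finalize();
      return 0;
  }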


Collective Data Movement
[Figure: Broadcast -- P0 starts with A; after the broadcast P0, P1, P2, P3 all have A. Scatter -- P0 starts with A B C D; afterwards P0 has A, P1 has B, P2 has C, P3 has D. Gather is the inverse of Scatter.]

Comments on Broadcast, other Collectives
• All collective operations must be called by all processes in the communicator
• MPI_Bcast is called by both the sender (called the root process) and the processes that are to receive the broadcast
• The "root" argument is the rank of the sender; this tells MPI which process originates the broadcast and which receive
(A small sketch using Bcast, Scatter, and Gather follows below.)
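Here is a minimal sketch of the three movements pictured above, using the standard C bindings; the particular values (42, 100+i) are illustrative only.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[])
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Broadcast: the root (rank 0) supplies a; afterwards every rank has it. */
      int a = (rank == 0) ? 42 : -1;
      MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);

      /* Scatter: the root holds one int per process; each rank gets its piece. */
      int *all = NULL, mine;
      if (rank == 0) {
          all = malloc(size * sizeof(int));
          for (int i = 0; i < size; i++) all[i] = 100 + i;
      }
      MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

      /* Gather: the reverse -- every rank contributes mine, the root collects. */
      MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

      printf("rank %d: broadcast value %d, my scattered piece %d\n", rank, a, mine);
      if (rank == 0) free(all);
      MPI_Finalize();
      return 0;
  }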

More Collective Data Movement
[Figure: Allgather -- each Pi contributes one item (A, B, C, D); afterwards every process has A B C D. Alltoall -- P0 starts with A0 A1 A2 A3, P1 with B0 B1 B2 B3, etc.; afterwards P0 has A0 B0 C0 D0, P1 has A1 B1 C1 D1, etc., i.e., process i receives item i from every process (a transpose of the data).]

Collective Computation
[Figure: Reduce -- P0..P3 contribute A, B, C, D; the root receives the combined result A op B op C op D (written ABCD). Scan -- P0..P3 contribute A, B, C, D; Pi receives the prefix result over the first i+1 contributions (P0: A, P1: AB, P2: ABC, P3: ABCD).]
(A small sketch using Allreduce and Scan follows below.)
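A small sketch of the two computation patterns pictured above, using the C bindings; each rank contributes the value rank+1, which is an illustrative choice.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int mine = rank + 1;      /* each rank contributes one value */
      int total, prefix;

      /* Allreduce: every rank receives the full sum 1 + 2 + ... + size. */
      MPI_Allreduce(&mine, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

      /* Scan: rank i receives the prefix sum 1 + 2 + ... + (i+1). */
      MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

      printf("rank %d: total = %d, prefix = %d\n", rank, total, prefix);
      MPI_Finalize();
      return 0;
  }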

MPI Collective Routines
• Many routines: Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast, Gather, Gatherv, Reduce, Reduce_scatter, Scan, Scatter, Scatterv
• "All" versions deliver results to all participating processes, not just the root.
• "V" versions allow the chunks to have variable sizes.
• Allreduce, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.
• MPI-2 adds Alltoallw, Exscan, and intercommunicator versions of most routines

MPI Built-in Collective Computation Operations
• MPI_MAX      Maximum
• MPI_MIN      Minimum
• MPI_PROD     Product
• MPI_SUM      Sum
• MPI_LAND     Logical and
• MPI_LOR      Logical or
• MPI_LXOR     Logical exclusive or
• MPI_BAND     Binary (bitwise) and
• MPI_BOR      Binary (bitwise) or
• MPI_BXOR     Binary (bitwise) exclusive or
• MPI_MAXLOC   Maximum and location
• MPI_MINLOC   Minimum and location
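The "and location" operations reduce (value, index) pairs; a common use is finding a global maximum together with the rank that owns it. A minimal sketch (the local value here is made up for illustration):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* MPI_MAXLOC with MPI_DOUBLE_INT reduces (double value, int index) pairs. */
      struct { double value; int index; } local, global;
      local.value = (rank * 37) % 11;   /* stand-in for a locally computed quantity */
      local.index = rank;               /* "location" = owning rank */

      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

      if (rank == 0)
          printf("max value %g is on rank %d\n", global.value, global.index);

      MPI_Finalize();
      return 0;
  }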


EXTRA SLIDES

