
CSE 560 Systems Architecture
Multiprocessors

Flynn’s Taxonomy
• Proposed by Michael Flynn in 1966
• SISD – single instruction, single data
  • Traditional uniprocessor
• SIMD – single instruction, multiple data
  • Execute the same instruction on many data elements
  • Vector machines, graphics engines
• MIMD – multiple instruction, multiple data
  • Each processor executes its own instructions
  • Multicores are all built this way
• SPMD – single program, multiple data (extension proposed by Frederica Darema; see the sketch after this list)
  • MIMD machine, each node executing the same code
• MISD – multiple instruction, single data
  • Systolic array
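A minimal sketch of the SPMD style, assuming MPI as the runtime (hypothetical example, not from the slides): every rank runs the identical program, and only the rank-dependent data differs.

    /* SPMD sketch with MPI: every rank runs this same program,
     * but operates on its own slice of the data.
     * Compile with mpicc, launch with mpirun. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which node am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many nodes?  */

        /* Same code everywhere; the data each node touches differs. */
        int chunk = 1000000 / size;
        long local_sum = 0;
        for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
            local_sum += i;

        printf("rank %d of %d: local_sum = %ld\n", rank, size, local_sum);
        MPI_Finalize();
        return 0;
    }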


Shared-Memory Multiprocessors

Conceptual model
• The shared-memory abstraction
  • Common address space and coherence (previously covered)
• Familiar and feels natural to programmers (see the threaded sketch after these slides)
• Scales to about 10s to 100 processors
• Life would be easy if systems actually looked like this…

[Figure: P0–P3 all connected directly to one shared Memory]

Distributed-Memory Multiprocessors

…but systems actually look more like this
• Memory is physically distributed
• When we want to scale up to 1000s (or millions) of cores
• Separate address spaces
• Arbitrary interconnect – custom, LAN, WAN

[Figure: P0–P3, each with its own cache ($) and local memory (M0–M3), attached via a router/interface to the Interconnect]
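To make the contrast concrete, here is a minimal shared-memory sketch using POSIX threads (hypothetical example, not from the slides): threads communicate through ordinary loads and stores to one address space.

    /* Shared-memory sketch: threads share one address space, so
     * communication is just loads/stores; a mutex provides the
     * needed synchronization. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                 /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;                       /* an ordinary store */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* 4 * 100000 */
        return 0;
    }

On a distributed-memory machine there is no such shared counter: the equivalent program must move the data explicitly across the interconnect, as in the MPI examples later.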

Connect Processors via Network

Cluster approach
• The interconnect is a Local-Area Network (LAN)
• Off-the-shelf processors (each of which is a multicore)
• Connect using off-the-shelf networking technology
• Leverages existing components → inexpensive to design
• Cloud service providers do this a lot!
  • Amazon Web Services (AWS)
  • Azure
• Scales up very easily
  • 1000s of nodes
• Long latency to move data
  • Traverse the network for one cache line? Nope!

Programming Models

• TCP/IP message delivery
  • IP addresses
  • Network handles routing, etc.
• Socket-based programming (see the sketch after this list)
• Higher-level abstractions
  • Distributed shared memory
    • Works but performs poorly – latency again
  • Map-Reduce
    • Hadoop, etc.
  • Streaming data, etc.
• Explicit message passing (more later)
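A minimal sketch of socket-based programming (hypothetical client; the port 9000 and address 10.0.0.2 are assumptions for illustration). Note how much per-message work the kernel does here (TCP, IP, driver): this protocol processing is the latency the later slides worry about.

    /* TCP client sketch: connect to a peer and send one message. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);     /* TCP socket */

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(9000);                /* assumed port */
        inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr); /* assumed IP */

        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
            perror("connect");
            return 1;
        }
        const char *msg = "hello, cluster";
        send(fd, msg, strlen(msg), 0);
        close(fd);
        return 0;
    }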

Virtualization

Sharing the processor cores
• VM technology allows multiple virtual machines to run on a single physical machine
• Hypervisor schedules VMs onto physical cores

Cluster Interconnect

Ethernet
• 1st tier are top-of-rack (ToR) switches
• Additional tiers connect racks; top tier talks to the outside world
• Lots of redundant paths

Can We Fix the Latency Issue?

Cluster approach
• TCP/IP network technology is dominant
• But is it needed? Or just readily available?

[Figure: cluster still connected via TCP/IP vs. a custom interconnect]

Custom Interconnect

Known topology, trusted environment
• Routing is easier
• Security is easier

Interconnect Topologies

• Mesh
• Torus (wraparound mesh)
• Routing is straightforward (see the sketch after this list)
  • Move along the row to the destination column
  • Move along the column to the destination
• Forwarding can be fast
  • Old-school: store-and-forward
  • Modern: cut-through

Dragonfly

Custom design for
• Big applications with lots of parallelism
• Low-overhead message delivery
• All tiers in one (Aries)
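A sketch of that dimension-order routing rule on a 2-D mesh (hypothetical helper, not from the slides): correct the X coordinate first, then the Y coordinate. On a torus, each phase would additionally pick the shorter of the two directions around the ring.

    /* Dimension-order (XY) routing sketch for a 2-D mesh:
     * first travel along the row to the destination column,
     * then along the column to the destination row. */
    #include <stdio.h>

    /* Print each hop from (sx,sy) to (dx,dy). */
    static void xy_route(int sx, int sy, int dx, int dy) {
        int x = sx, y = sy;
        while (x != dx) {              /* phase 1: fix X (along row)    */
            x += (dx > x) ? 1 : -1;
            printf(" -> (%d,%d)", x, y);
        }
        while (y != dy) {              /* phase 2: fix Y (along column) */
            y += (dy > y) ? 1 : -1;
            printf(" -> (%d,%d)", x, y);
        }
        printf("\n");
    }

    int main(void) {
        printf("(0,0)");
        xy_route(0, 0, 3, 2);          /* (0,0) -> ... -> (3,2) */
        return 0;
    }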

Cray Dragonfly Network

Mesh with additional links

[Figure: Dragonfly topology diagram]

Back to Standardized Interconnect

Issue with Ethernet is latency
• Protocol processing at endpoints
• Store-and-forward routing (see the arithmetic sketch below)

[Figure: cluster still using TCP/IP vs. an Infiniband interconnect]
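To see why store-and-forward hurts, here is a back-of-the-envelope sketch using the standard first-order latency model (the link speed, packet size, and per-hop header time are illustrative assumptions, not from the slides): store-and-forward pays the full packet serialization delay at every hop, while cut-through pays it once plus a small per-hop header delay.

    /* First-order latency model (illustrative numbers assumed):
     * store-and-forward: hops * (packet_bits / bandwidth)
     * cut-through:       packet_bits / bandwidth + hops * header_time */
    #include <stdio.h>

    int main(void) {
        double bw_bps   = 10e9;        /* 10 Gb/s link (assumed)       */
        double pkt_bits = 1500 * 8;    /* 1500-byte packet             */
        double hdr_time = 50e-9;       /* 50 ns per-hop header latency */
        int    hops     = 5;

        double serialize = pkt_bits / bw_bps;              /* 1.2 us   */
        double saf = hops * serialize;
        double ct  = serialize + hops * hdr_time;

        printf("store-and-forward: %.2f us\n", saf * 1e6); /* 6.00 us  */
        printf("cut-through:       %.2f us\n", ct  * 1e6); /* 1.45 us  */
        return 0;
    }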

Infiniband Network

• Standardized technology
  • Multiple vendors
    • Equipment works together
    • Competition
• Not trying to be the “Internet”
  • E.g., easier routing, simpler security model
• Focus on low-latency interconnect needs
  • Minimize protocol processing
  • Fast forwarding
    • Cut-through packet delivery
  • Remote Direct Memory Access (RDMA)
    • Supports single-ended messaging

Message Passing

• MPI (Message Passing Interface) is the de facto standard (see the sketch below)
• Used by almost all supercomputing applications
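A minimal two-sided MPI sketch (hypothetical example, not from the slides): both endpoints must participate, in contrast to the RDMA-style one-sided operations on the next slide.

    /* Two-sided MPI messaging: rank 0 sends, rank 1 receives. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double payload = 3.14;
        if (rank == 0) {
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %.2f\n", payload);
        }
        MPI_Finalize();
        return 0;
    }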

More MPI

MPI capabilities beyond just send() and rcve()
• One-sided communication: get() and put()
• Collective operations (see the combined sketch below)
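A minimal sketch of both capabilities (hypothetical example; the slide’s get()/put() correspond to MPI_Get/MPI_Put, and MPI_Allreduce stands in as a representative collective):

    /* One-sided + collective MPI sketch. Each rank exposes a
     * window; rank 0 puts a value into rank 1's window without
     * rank 1 posting a receive, then all ranks join a
     * collective sum. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One-sided: expose one int per rank as an RMA window. */
        int window_buf = -1;
        MPI_Win win;
        MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open access epoch  */
        if (rank == 0 && size > 1) {
            int val = 42;
            MPI_Put(&val, 1, MPI_INT, /*target=*/1, /*disp=*/0,
                    1, MPI_INT, win);          /* single-ended: only */
        }                                      /* the origin calls   */
        MPI_Win_fence(0, win);                 /* close epoch        */

        if (rank == 1)
            printf("rank 1's window now holds %d\n", window_buf);

        /* Collective: every rank contributes, every rank gets sum. */
        int mine = rank, total = 0;
        MPI_Allreduce(&mine, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks = %d\n", total);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }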

SIMD Instructions

[Figure: SIMD illustration (see the intrinsics sketch below). Image credit: Decora at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30547549]

Graphics Engines

Heterogeneous multiprocessor
• Many processing elements (PE), many threads per PE
• Collections of threads execute in lock-step (SIMD-like)
• Hide latency to memory by switching threads
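A concrete SIMD instance, assuming x86 SSE intrinsics (hypothetical example, not from the slides): one instruction operates on four data elements at once.

    /* SIMD sketch with x86 SSE intrinsics: one instruction adds
     * four floats -- single instruction, multiple data. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats          */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, 1 instr   */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }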

Systolic Arrays

H.T. Kung, “Why Systolic Architectures?,” Computer, 1982

Purpose-built design for a specific problem
• Custom PE, replicated many times
• E.g., array of MAC (multiply-accumulate) units for an FIR filter (see the sketch below)
• RNA folding [Jacob et al. 2010]
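A software sketch of the FIR idea (hypothetical, not from the slides): a linear array of identical MAC cells, where each clock tick pumps samples one cell down the pipe while every cell multiplies and accumulates in lock-step. The inner loops below simulate sequentially what the hardware cells do in parallel.

    /* Systolic FIR sketch: NTAPS identical MAC cells. Each "tick"
     * shifts the samples one cell to the right, then every cell
     * fires its multiply-accumulate in the same cycle. */
    #include <stdio.h>

    #define NTAPS 3

    int main(void) {
        double h[NTAPS] = {0.5, 0.3, 0.2};   /* per-cell coefficient */
        double x_pipe[NTAPS] = {0};          /* sample held per cell */
        double input[6] = {1, 2, 3, 4, 5, 6};

        for (int t = 0; t < 6; t++) {
            /* shift samples one cell to the right (systolic pulse) */
            for (int k = NTAPS - 1; k > 0; k--)
                x_pipe[k] = x_pipe[k - 1];
            x_pipe[0] = input[t];

            /* every MAC cell fires in the same cycle */
            double y = 0;
            for (int k = 0; k < NTAPS; k++)
                y += h[k] * x_pipe[k];       /* multiply-accumulate */

            printf("t=%d  y=%.2f\n", t, y);
        }
        return 0;
    }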

Tensor Processing Unit

[Figure: Tensor Processing Unit diagram]