CSE 560 – Computer Systems Architecture
Multiprocessors

Flynn’s Taxonomy
• Proposed by Michael Flynn in 1966
• SISD – single instruction, single data
  • Traditional uniprocessor
• SIMD – single instruction, multiple data
  • Execute the same instruction on many data elements
  • Vector machines, graphics engines
• MIMD – multiple instruction, multiple data
  • Each processor executes its own instructions
  • Multicores are all built this way
• SPMD – single program, multiple data (extension proposed by Frederica Darema)
  • MIMD machine, but each node executes the same code
• MISD – multiple instruction, single data
  • Systolic array
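The SPMD point can be made concrete with a small sketch: every worker runs the identical program, and only its rank decides which slice of the data it touches. This is a local simulation using Python processes, not a real multi-node run; the function names and the rank-strided slicing are illustrative choices, not part of any standard.

```python
# Hypothetical SPMD sketch: every worker runs the same function ("single
# program"), while its rank selects its own portion of the data
# ("multiple data").
from multiprocessing import Process, Queue

def program(rank, nranks, data, results):
    # Same code on every node; only `rank` differs between workers.
    chunk = data[rank::nranks]
    results.put((rank, sum(x * x for x in chunk)))

def run_spmd(data, nranks):
    results = Queue()
    workers = [Process(target=program, args=(r, nranks, data, results))
               for r in range(nranks)]
    for w in workers:
        w.start()
    partial = [results.get() for _ in range(nranks)]  # drain before join
    for w in workers:
        w.join()
    return sum(value for _, value in partial)
```

For example, `run_spmd(list(range(16)), 4)` splits the sum of squares of 0..15 across four ranks and combines the partial results into 1240.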
Shared-Memory Multiprocessors
• Conceptual model
  • The shared-memory abstraction: common address space and cache coherence (previously covered)
  • Familiar and feels natural to programmers
  • Scales to about 10s to 100 processors
• Life would be easy if systems actually looked like this…
  [Figure: P0–P3 all attached to a single shared Memory]

Distributed-Memory Multiprocessors
• …but systems actually look more like this
• Memory is physically distributed
• When we want to scale up to 1000s (or millions) of cores
  • Separate address spaces
  • Arbitrary interconnect – custom, LAN, WAN
  [Figure: P0–P3, each with its own cache ($) and memory (M0–M3), joined by router/interfaces to an interconnect]
Connect Processors via Network
• Cluster approach
  • The interconnect is a Local-Area Network (LAN)
  • Off-the-shelf processors (each of which is a multicore)
  • Connect using off-the-shelf networking technology
  • Leverages existing components – inexpensive to design
• Cloud service providers do this a lot!
  • Amazon Web Services (AWS)
  • Microsoft Azure
• Scales up very easily
  • 1000s of nodes
• Long latency to move data
  • Traverse the network for one cache line? Nope!

Programming Models
• TCP/IP message delivery
  • IP addresses
  • Network handles routing, etc.
  • Socket-based programming
• Higher-level abstractions
  • Distributed shared memory
    • Works but performs poorly – latency again
  • Map-Reduce
    • Hadoop, etc.
  • Streaming data
    • Apache Storm, etc.
• Explicit message passing (more later)
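Socket-based programming, as named above, can be sketched in a few lines: one node listens, another connects and sends bytes, and the network stack handles addressing and routing. Here the loopback interface stands in for the LAN, and the helper name is made up for illustration.

```python
# Minimal sketch of socket-based communication between two "nodes",
# using loopback in one process; on a real cluster, each side would run
# on a different machine and use that machine's IP address.
import socket
import threading

def run_pair(payload):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    received = []

    def server():
        conn, _ = srv.accept()
        with conn:
            chunks = []
            while True:                  # read until the sender closes
                data = conn.recv(1024)
                if not data:
                    break
                chunks.append(data)
            received.append(b"".join(chunks))

    t = threading.Thread(target=server)
    t.start()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as c:
        c.connect(("127.0.0.1", port))   # "IP address" names the peer
        c.sendall(payload)               # the stack routes and delivers it
    t.join()
    srv.close()
    return received[0]
```

Every round trip through this stack pays protocol-processing costs at both endpoints, which is exactly the latency concern the following slides return to.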
Virtualization
• Sharing the processor cores
• VM technology allows multiple virtual machines to run on a single physical machine
• Hypervisor schedules VMs onto physical cores

Cluster Interconnect
• Ethernet switches
  • 1st tier are top-of-rack (ToR) switches
  • Additional tiers connect racks; the top tier talks to the outside world
  • Lots of redundant paths
Can We Fix the Latency Issue?
• Cluster approach
  • TCP/IP network technology is dominant
  • But is it needed? Or just readily available?
  [Figure: cluster still connected via TCP/IP]

Custom Interconnect
• Known topology, trusted environment
  • Routing is easier
  • Security is easier
  [Figure: cluster connected via a custom interconnect]
Interconnect Topologies
• Mesh
• Torus (wraparound mesh)
• Routing is straightforward
  • Move along the row to the destination column
  • Move along the column to the destination
• Forwarding can be fast
  • Old-school: store-and-forward
  • Modern: cut-through

Cray Dragonfly
• Custom design for supercomputers
  • Big applications with lots of parallelism
• Low-overhead message delivery
• All tiers in one switch (Aries)
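The "row, then column" routing rule above is dimension-ordered routing. A small sketch makes the idea concrete; the coordinates and function name are illustrative, not any particular machine's convention.

```python
# Dimension-ordered routing on a 2-D mesh: correct one coordinate fully,
# then the other. Deterministic, deadlock-free on a mesh, and trivial for
# a router to implement -- each hop needs only a coordinate comparison.
def xy_route(src, dst):
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # first move within the row/X dimension
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then move within the column/Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

The hop count is just the Manhattan distance between source and destination, e.g. `xy_route((0, 0), (2, 1))` visits `(0,0), (1,0), (2,0), (2,1)`.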
Cray Dragonfly Network
• Mesh with additional links

Back to Standardized Interconnect
• The issue with Ethernet is latency
  • Protocol processing at endpoints
  • Store-and-forward routing
  [Figure: cluster moves from TCP/IP to an Infiniband interconnect]
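The store-and-forward cost mentioned above can be put in rough numbers. With store-and-forward, the full serialization delay of the packet is paid at every hop; with cut-through, it is paid once, plus a small per-hop header-inspection delay. The figures below (1500-byte packet, 40-byte header, 10 Gb/s links) are illustrative assumptions, not measurements of any particular switch.

```python
# Back-of-the-envelope latency comparison (assumed numbers).
def store_and_forward_us(packet_bytes, hops, gbps):
    # The whole packet is buffered at each hop, so the serialization
    # delay (packet bits / link rate) is paid once per hop.
    serialize = packet_bytes * 8 / (gbps * 1e3)   # microseconds
    return hops * serialize

def cut_through_us(packet_bytes, header_bytes, hops, gbps):
    # Forwarding starts as soon as the header is seen, so the full
    # serialization delay is paid only once, plus a header delay per hop.
    serialize = packet_bytes * 8 / (gbps * 1e3)
    header = header_bytes * 8 / (gbps * 1e3)
    return serialize + hops * header
```

For a 1500-byte packet over 5 hops at 10 Gb/s, store-and-forward costs 6.0 µs of serialization alone versus about 1.36 µs for cut-through with a 40-byte header, ignoring switch processing and propagation delay in both cases.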
Infiniband Network
• Standardized technology
  • Multiple vendors
    • Equipment works together
    • Competition
• Not trying to be the “Internet”
  • Focus on low-latency interconnect needs
  • Minimize protocol processing
    • E.g., easier routing, simpler security model
  • Fast forwarding
    • Cut-through packet delivery
• Remote Direct Memory Access (RDMA)
  • Supports single-ended messaging

Programming Paradigm
• Message passing
  • MPI (Message Passing Interface) is the de facto standard
  • Used by almost all supercomputing applications
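The "single-ended" point deserves illustration: in one-sided messaging, the initiator deposits data directly into a buffer the target has exposed, and the target never posts a matching receive. The sketch below models that shape with a shared-memory region standing in for an RDMA-registered buffer; real RDMA is done by the NIC hardware through verbs APIs, not like this, and the function names here are made up.

```python
# One-sided "put" in the spirit of RDMA, modeled with shared memory.
from multiprocessing import shared_memory

def one_sided_put(region_name, offset, payload):
    """Initiator side: write directly into the target's exposed buffer.
    The target's CPU does nothing -- no recv() is ever posted."""
    peer = shared_memory.SharedMemory(name=region_name)
    peer.buf[offset:offset + len(payload)] = payload
    peer.close()

def demo():
    # Target side: expose a buffer, loosely analogous to registering
    # memory with an RDMA NIC so remote peers can address it.
    region = shared_memory.SharedMemory(create=True, size=16)
    try:
        one_sided_put(region.name, 0, b"hello")
        # The target simply reads its own memory later; the data has
        # already landed without any receive-side protocol processing.
        return bytes(region.buf[:5])
    finally:
        region.close()
        region.unlink()
```

Eliminating the receive-side software path is a large part of how RDMA avoids the endpoint protocol-processing latency attributed to Ethernet above.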
More MPI
• MPI capabilities beyond just send() and recv()
  • One-sided communication: get() and put()
  • Collective operations
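Collective operations involve every rank at once. The sketch below models the *semantics* of two common MPI collectives with plain functions over a list of per-rank contributions; it is not an MPI implementation, and real libraries use algorithms such as ring or tree reductions to do this scalably.

```python
# Local models of MPI collective semantics (not real MPI).
def allreduce_sum(contributions):
    """MPI_Allreduce with MPI_SUM: every rank contributes one value,
    and every rank receives the same combined result."""
    total = sum(contributions)
    return [total] * len(contributions)

def reduce_sum(contributions, root=0):
    """MPI_Reduce with MPI_SUM: only the root rank receives the result;
    the other ranks get nothing back."""
    total = sum(contributions)
    return [total if r == root else None
            for r in range(len(contributions))]
```

For four ranks contributing 1, 2, 3, 4, an allreduce delivers 10 to all four ranks, while a reduce delivers 10 only to the root.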
SIMD Instructions
[Figure: By Decora at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30547549]

Graphics Engines
• Heterogeneous multiprocessor
• Many processing elements (PEs), many threads per PE
• Collections of threads execute in lock-step (SIMD-like)
• Hide latency to memory by switching threads
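The contrast between SISD and SIMD execution can be sketched directly: SISD applies an instruction to one data element per step, while SIMD applies the same instruction to a whole vector of elements at once. The list comprehension below only models the lock-step semantics; real SIMD hardware executes the lanes in parallel in one instruction.

```python
# SISD vs. SIMD semantics, modeled in plain Python.
def sisd_add(a, b):
    out = []
    for i in range(len(a)):         # one instruction, one data element per step
        out.append(a[i] + b[i])
    return out

def simd_add(a, b):
    # One (conceptual) instruction applied to many data elements in
    # lock-step -- the same "+" for every lane of the vector.
    return [x + y for x, y in zip(a, b)]
```

Both produce identical results; the difference is that the SIMD form exposes the element-wise independence a vector unit (or a GPU warp executing in lock-step) can exploit.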
Systolic Arrays
• H.T. Kung, “Why Systolic Architectures?,” Computer, 1982
Systolic Arrays
• Purpose-built design for a specific problem
  • Custom PE, replicated many times
  • E.g., an array of MAC (multiply-accumulate) units for an FIR filter
  • RNA folding [Jacob et al. 2010]

Tensor Processing Unit
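The FIR example above can be sketched as a chain of MAC units with pipeline registers carrying partial sums between them (the transposed-form FIR structure). This is in the spirit of a systolic array, each cycle every PE fires the same multiply-accumulate in lock-step, though it simplifies a fully systolic design by broadcasting the new sample to all PEs rather than pumping it stage by stage.

```python
# Array of MAC units computing y[n] = sum_k w[k] * x[n-k], one output per
# cycle, with partial sums flowing through pipeline registers between PEs.
def fir_mac_array(weights, samples):
    K = len(weights)
    regs = [0.0] * K              # partial sums held between PEs
    out = []
    for x in samples:             # one new sample enters per cycle
        # Each PE k computes w[k]*x plus the partial sum arriving from
        # its right-hand neighbor (0 past the end of the array).
        nxt = [weights[k] * x + (regs[k + 1] if k + 1 < K else 0.0)
               for k in range(K)]
        out.append(nxt[0])        # PE 0 emits the finished y[n]
        regs = nxt
    return out
```

For weights [1, 2, 3] and a constant input of 1, the outputs ramp up as the pipeline fills (1, 3, 6) and then hold at the steady-state sum of the weights, matching the direct convolution.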