Flynn's Taxonomy

CSE 560 Computer Systems Architecture: Multiprocessors

• Proposed by Michael Flynn in 1966
• SISD – single instruction, single data
  • Traditional uniprocessor
• SIMD – single instruction, multiple data
  • Execute the same instruction on many data elements
  • Vector machines, graphics engines
• MIMD – multiple instruction, multiple data
  • Each processor executes its own instructions
  • Multicores are all built this way
  • SPMD – single program, multiple data (extension proposed by Frederica Darema)
    • MIMD machine, each node is executing the same code
• MISD – multiple instruction, single data
  • Systolic array

Shared-Memory Multiprocessors
Conceptual model
• The shared-memory abstraction
  • Familiar and feels natural to programmers
• Life would be easy if systems actually looked like this…
[Diagram: P0, P1, P2, P3 connected to a single Memory]

Distributed-Memory Multiprocessors
…but systems actually look more like this
• Memory is physically distributed
• Previously covered common address space and cache coherence
  • Scales to about 10s to 100 processors
• When we want to scale up to 1000s (or millions) of cores
  • Separate address spaces
  • Arbitrary interconnect – custom, LAN, WAN
[Diagram: P0 to P3, each with a cache ($), a local memory (M0 to M3), and a router/interface attached to the Interconnect]

Connect Processors via Network
Cluster approach
• Off-the-shelf processors (each of which is a multicore)
• Connect using off-the-shelf networking technology
• Leverages existing components: inexpensive to design
• Cloud service providers do this a lot!
  • Amazon Web Services (AWS)
  • Microsoft Azure
• Scales up very easily
  • 1000s of nodes
• Long latency to move data
  • Traverse network for one cache line? Nope!

Programming Models
• The interconnect is a Local-Area Network (LAN)
• TCP/IP message delivery
  • IP addresses
  • Network handles routing, etc.
  • Socket-based programming
• Higher-level abstractions
  • Distributed shared memory
    • Works but performs poorly – latency again
  • Map-Reduce
    • Hadoop, etc.
  • Streaming data
    • Apache Storm, etc.
  • Explicit message passing (more later)

Virtualization
Sharing the processor cores
• VM technology allows multiple virtual machines to run on a single physical machine
• Hypervisor schedules VMs onto physical cores

Cluster Interconnect
Ethernet Switches
• 1st tier are top-of-rack (ToR) switches
• Additional tiers connect racks, top tier talks to outside world
• Lots of redundant paths

Can we fix latency issue?
Cluster approach
• TCP/IP network technology is dominant
• But is it needed? Or just readily available?

Custom Interconnect
Known topology, trusted environment
• Routing is easier
• Security is easier
[Diagram labels: Still TCP/IP; Custom interconnect]

Interconnect Topologies
• Mesh
• Torus (wraparound mesh)
• Low-overhead message delivery
• Routing is straightforward
  • Move along row to destination column
  • Move along column to destination
• Forwarding can be fast
  • Old-school: store-and-forward
  • Modern: cut-through

Cray Dragonfly
Custom Design for Supercomputers
• Big applications with lots of parallelism
• All tiers in one switch (Aries)

Cray Dragonfly Network
Mesh with additional links

Back to Standardized Interconnect
Issue with Ethernet is latency
• Protocol processing at endpoints
• Store-and-forward routing
[Diagram labels: Still TCP/IP; InfiniBand interconnect]

InfiniBand Network
• Standardized technology
  • Multiple vendors
  • Equipment works together
  • Competition
• Not trying to be the “Internet”
  • Focus on low latency interconnect needs
• Minimize protocol processing
  • E.g., easier routing, simpler security model
• Fast forwarding
  • Cut-through packet delivery
• Remote Direct Memory Access (RDMA)
  • Supports single-ended messaging

Programming Paradigm
Message Passing
• MPI (Message Passing Interface) is de facto standard
• Used by almost all supercomputing applications

More MPI
MPI capabilities beyond just send() and recv()
• One-sided communication: get() and put()
• Collective operations

SIMD Instructions

Graphics Engines
Heterogeneous Multiprocessor
• Many processing elements (PE), many threads per PE
• Collections of threads execute in lock-step (SIMD-like)
• Hide latency to memory by switching threads
Image credit: By Decora at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30547549

Systolic Arrays
H.T. Kung, “Why Systolic Architectures?,” Computer, 1982
Purpose-built design for specific problem
• Custom PE, replicated many times
• E.g., array of MA (multiply-accumulate) units for FIR filter
• RNA folding [Jacob et al. 2010]

Tensor Processing Unit
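The routing rule on the Interconnect Topologies slide (move along the row to the destination column, then along the column to the destination) can be sketched in a few lines. This is only an illustration for a plain 2D mesh without wraparound links; the function name `xy_route` and the `(row, col)` node naming are assumptions made for the example, not something taken from the slides.

```python
def xy_route(src, dst):
    """Return the (row, col) nodes visited from src to dst, inclusive.

    Hypothetical helper: nodes are named by their (row, col) position
    in a 2D mesh with no wraparound links.
    """
    row, col = src
    dst_row, dst_col = dst
    path = [(row, col)]

    # Phase 1: move along the current row until the destination column.
    while col != dst_col:
        col += 1 if dst_col > col else -1
        path.append((row, col))

    # Phase 2: move along that column until the destination row.
    while row != dst_row:
        row += 1 if dst_row > row else -1
        path.append((row, col))

    return path


if __name__ == "__main__":
    # Route from the top-left corner to row 2, column 3.
    print(xy_route((0, 0), (2, 3)))
```

Each router only has to compare its own coordinates with the destination, which is one reason routing in a known topology can be simpler than general-purpose network routing. A torus would additionally pick the shorter direction around each ring.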
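The Systolic Arrays slide mentions an array of multiply-accumulate (MA) units for an FIR filter. The following cycle-by-cycle simulation is a rough sketch of one possible arrangement, not necessarily the one drawn on the slide: each PE keeps one weight, input samples and partial sums both flow left to right between neighbors, and inputs are delayed two cycles per PE while partial sums are delayed one. The names `systolic_fir` and `direct_fir` are hypothetical and exist only for this example.

```python
def systolic_fir(weights, samples):
    """Simulate a linear array of multiply-accumulate PEs computing
    y[n] = sum_k weights[k] * samples[n - k] (samples before n=0 are 0)."""
    num_pe = len(weights)
    if num_pe == 0:
        raise ValueError("need at least one weight")

    # Registers on the link between PE i and PE i+1 (assumed layout):
    x_stage1 = [0] * (num_pe - 1)  # first of two input delay registers
    x_stage2 = [0] * (num_pe - 1)  # second input delay register
    y_reg = [0] * (num_pe - 1)     # single partial-sum register

    outputs = []
    total_cycles = len(samples) + num_pe - 1  # extra cycles drain the pipeline
    for cycle in range(total_cycles):
        x_first = samples[cycle] if cycle < len(samples) else 0
        x_in = [x_first] + x_stage2  # input seen by each PE this cycle
        y_in = [0] + y_reg           # partial sum entering each PE

        # Every PE performs one multiply-accumulate in the same cycle.
        y_out = [y + w * x for y, w, x in zip(y_in, weights, x_in)]

        # The last PE emits y[cycle - (num_pe - 1)] once the pipeline fills.
        if cycle >= num_pe - 1:
            outputs.append(y_out[-1])

        # Clock edge: shift registers toward the next PE.
        x_stage2 = x_stage1
        x_stage1 = x_in[:-1]
        y_reg = y_out[:-1]
    return outputs


def direct_fir(weights, samples):
    """Reference FIR computed directly from the definition."""
    return [
        sum(w * samples[n - k] for k, w in enumerate(weights) if n - k >= 0)
        for n in range(len(samples))
    ]


if __name__ == "__main__":
    w, x = [1, 2, 3], [1, 1, 1, 1]
    print(systolic_fir(w, x), direct_fir(w, x))
```

Because every PE talks only to its neighbor, the array can grow to more taps by replicating the same PE, which matches the "custom PE, replicated many times" point on the slide. The price is pipeline latency: the first result appears after as many cycles as there are PEs.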
