Advanced Architectures


15A. Distributed Computing
15B. Multi-Processor Systems
15C. Tightly Coupled Distributed Systems
15D. Loosely Coupled Distributed Systems
15E. Cloud Models
15F. Distributed Middle-Ware

Goals of Distributed Computing
• better services
  – scalability
    • apps too big to run on a single computer
    • grow system capacity to meet growing demand
  – improved reliability and availability
  – improved ease of use, reduced CapEx/OpEx
• new services
  – applications that span multiple system boundaries
  – global resource domains, services (vs. systems)
  – complete location transparency

Major Classes of Distributed Systems
• Symmetric Multi-Processors (SMP)
  – multiple CPUs sharing memory and I/O devices
• Single-System Image (SSI) & Cluster Computing
  – a group of computers acting like a single computer
• loosely coupled, horizontally scalable systems
  – coordinated, but relatively independent systems
• application-level distributed computing
  – peer-to-peer, application-level protocols
  – distributed middle-ware platforms

Evaluating Distributed Systems
• Performance
  – overhead, scalability, availability
• Functionality
  – adequacy and abstraction for target applications
• Transparency
  – compatibility with previous platforms
  – scope and degree of location independence
• Degree of Coupling
  – on how many things must distinct systems agree
  – how is that agreement achieved

SMP systems and goals
• Characterization:
  – multiple CPUs sharing memory and devices
• Motivations:
  – price/performance (lower price per MIP)
  – scalability (an economical way to build huge systems)
  – perfect application transparency
• Examples:
  – multi-core Intel CPUs
  – multi-socket motherboards

Symmetric Multi-Processors
[Figure: four CPUs (CPU 1–CPU 4), each with a private cache, plus an interrupt controller, all attached to shared memory & device busses along with memory and several device controllers.]

SMP Price/Performance
• a computer is much more than a CPU
  – motherboard, disks, controllers, power supplies, case
  – the CPU might be only 10-15% of the cost of the computer
• adding CPUs to a computer is very cost-effective
  – a second CPU yields cost 1.1x, performance 1.9x
  – a third CPU yields cost 1.2x, performance 2.7x
• the same argument also applies at the chip level
  – making a machine twice as fast is ever more difficult
  – adding more cores to the chip gets ever easier
• massive multi-processors are the obvious direction

SMP Operating System Design
• one processor boots with power on
  – it controls the starting of all other processors
• the same OS code runs on all processors
  – one physical copy in memory, shared by all CPUs
• each CPU has its own registers, cache, and MMU
  – they must cooperatively share memory and devices
• ALL kernel operations must be multi-thread-safe
  – protected by appropriate locks/semaphores
  – very fine-grained locking to avoid contention

SMP Parallelism
• scheduling and load sharing
  – each CPU can be running a different process
  – just take the next ready process off the run-queue
  – processes run in parallel
  – most processes don't interact (other than in the kernel)
• serialization
  – mutual exclusion achieved by locks in shared memory
  – locks can be maintained with atomic instructions
  – spin locks are acceptable for VERY short critical sections
  – if a process blocks, that CPU finds the next ready process
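The serialization bullets compress a lot of mechanism into a few lines. As a concrete illustration (a minimal sketch using C11 atomics, not any particular kernel's lock implementation), a spin lock is just a word in shared memory maintained with an atomic read-modify-write instruction:

```c
#include <stdatomic.h>

/* Minimal test-and-set spin lock: mutual exclusion via a lock word in
 * shared memory, maintained with an atomic instruction. Suitable only
 * for VERY short critical sections, because waiters burn CPU cycles. */
typedef struct {
    atomic_flag held;
} spinlock_t;

static void spin_init(spinlock_t *l) {
    atomic_flag_clear(&l->held);
}

static void spin_lock(spinlock_t *l) {
    /* test-and-set returns the previous value: we own the lock only
     * when we are the one who flipped it from clear to set */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A production kernel adds back-off while spinning and falls back to a blocking lock when the wait may be long; a spinning CPU does no useful work, which is exactly the lock-contention cost the next slide quantifies.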
The Challenge of SMP Performance
• scalability depends on memory contention
  – memory bandwidth is limited, and can't handle all CPUs
  – most references must be satisfied from per-core caches
  – if too many requests go to memory, CPUs slow down
• scalability depends on lock contention
  – waiting for spin-locks wastes time
  – context switches while waiting for kernel locks waste time
• contention wastes cycles, reduces throughput
  – 2 CPUs might deliver only 1.9x performance
  – 3 CPUs might deliver only 2.7x performance

Managing Memory Contention
• fast n-way memory is very expensive
  – without it, memory contention taxes performance
  – cost/complexity limits how many CPUs we can add
• Non-Uniform Memory Architectures (NUMA)
  – each CPU has its own memory
    • each CPU has a fast path to its own memory
  – memories are connected by a Scalable Coherent Interconnect (SCI)
    • a very fast, very local network between memories
    • accessing memory over the SCI may be 3-20x slower
  – these interconnects can be highly scalable

Non-Uniform Memory Architecture
[Figure: two NUMA nodes (CPU n and CPU n+1), each with a private cache, local memory, PCI bridge and bus, device controllers, and a CC-NUMA interface; the nodes are joined by a Scalable Coherent Interconnect (e.g. Intel QuickPath Interconnect).]

OS design for NUMA systems
• it is all about local memory hit rates
  – every outside reference costs us 3-20x performance
  – we need a 75-95% hit rate just to break even
• how can the OS ensure high hit rates?
  – replicate shared code pages in each CPU's local memory
  – assign processes to CPUs, and allocate all of their memory there
  – migrate processes to achieve load balancing
  – spread kernel resources among all the CPUs
  – attempt to preferentially allocate local resources
  – migrate resource ownership to the CPU that is using it
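The kernel-internal mechanisms behind these policies are not shown here, but the "allocate it locally" idea can be sketched from user space. A minimal illustration, assuming Linux with libnuma (link with -lnuma); the choice of node 0 and the buffer size are arbitrary:

```c
#include <numa.h>    /* Linux libnuma; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "this machine has no NUMA support\n");
        return 1;
    }

    int node = 0;            /* arbitrary choice of home node */
    numa_run_on_node(node);  /* run only on that node's CPUs */

    /* allocate the working set on the same node, so references
     * avoid the 3-20x interconnect penalty described above */
    size_t sz = 1 << 20;
    char *buf = numa_alloc_onnode(sz, node);
    if (buf == NULL)
        return 1;

    for (size_t i = 0; i < sz; i++)  /* every touch is node-local */
        buf[i] = 0;

    numa_free(buf, sz);
    return 0;
}
```

The Linux kernel's default "first touch" page placement applies the same policy automatically: a page is allocated on the node of the CPU that first references it.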
Single System Image (SSI) Clusters
• Characterization:
  – a group of seemingly independent computers
  – collaborating to provide SMP-like transparency
• Motivation:
  – higher reliability and availability than SMP/NUMA
  – more scalable than SMP/NUMA
  – excellent application transparency
• Examples:
  – Locus, Microsoft Wolfpack, OpenSSI
  – Oracle Parallel Server

Modern Clustered Architecture
[Figure: four SMP systems on an ethernet behind redundant switches, replicating requests to one another; dual-ported RAID at the primary site, with synchronous fibre-channel replication to an optional back-up site for geographic fail-over.]
Active systems service independent requests in parallel. They cooperate to maintain shared global locks, and are prepared to take over a partner's work in case of failure. State replication to a back-up site is handled by external mechanisms.

OS design for SSI clustering
• all nodes agree on the state of all OS resources
  – file systems, processes, devices, locks, IPC ports
  – any process can operate on any object, transparently
• they achieve this by exchanging messages
  – advising one another of all changes to resources
  – requesting execution of node-specific operations
• each OS's internal state mirrors the global state
  – node-specific requests are forwarded to the owning node
• the implementation is large, complex, and difficult
• the exchange of messages can be very expensive

SSI Clustered Performance
• clever implementation can minimize overhead
  – 10-20% overall is not uncommon, and it can be much worse
• complete transparency
  – even very complex applications "just work"
  – they do not have to be made "network aware"
• good robustness
  – when one node fails, the others notice and take over
  – often, applications won't even notice the failure
• nice for application developers and customers
  – but these systems are complex, and not particularly scalable

Lessons Learned
• consensus protocols are expensive
  – they converge slowly and scale poorly
• systems have a great many resources
  – resource change notifications are expensive
• location transparency encouraged non-locality
  – remote resource use is much more expensive
• a greatly complicated operating system
  – distributed objects are more complex to manage
  – complex optimizations to reduce the added overheads
  – new modes of failure, with complex recovery procedures
• Bottom line: Deutsch (of the "fallacies of distributed computing") was right!

Loosely Coupled Systems
• Characterization:
  – a parallel group of independent computers
  – serving similar but independent requests
  – minimal coordination and cooperation required
• Motivation:
  – scalability and price/performance
  – availability, if the protocol permits stateless servers
  – ease of management, reconfigurable capacity
• Examples:
  – web servers, the Google search farm, Hadoop

Horizontal Scalability w/HA
[Figure: a load-balancing switch with fail-over connects WAN clients to a farm of web servers and app servers, backed by a content-distribution server and a highly available database pair (D1, D2).]

Horizontal Scalability w/HA (elements of architecture)
• a farm of independent servers
  – servers run the same software, but serve different requests
  – they may share a common back-end database
• a front-ending switch
  – distributes incoming requests among the available servers
  – can do both load balancing and fail-over
• a service protocol
  – stateless servers and idempotent operations
  – successive requests may be sent to different servers

Horizontally scaled performance
• individual servers are very inexpensive
  – blade servers may be only $100-$200 each
• scalability is excellent
  – 100 servers deliver approximately 100x performance
• service availability is excellent
  – the front-end automatically bypasses failed servers
  – stateless servers and client retries make fail-over easy
• the challenge is managing thousands of servers
  – automated installation, global configuration services
  – self-monitoring, self-healing systems
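The "stateless servers and idempotent operations" requirement is what makes retry-based fail-over safe: if a reply is lost, the client simply resends, and the switch may deliver the retry to a different server. A toy sketch (hypothetical names, not any real service's API) of why idempotence matters:

```c
#include <stdio.h>

/* A toy shared back-end "database"; the servers themselves hold no
 * per-client state, so any server in the farm can handle any request. */
static long db[16];

/* Idempotent: set slot := value. Duplicate delivery (a client retry,
 * possibly handled by a different server) leaves the state unchanged. */
void handle_put(int slot, long value) { db[slot] = value; }

/* NOT idempotent: slot += value. A retried request is applied twice,
 * so a protocol built on operations like this cannot fail-over by
 * blind retry; it would need duplicate detection instead. */
void handle_add(int slot, long value) { db[slot] += value; }

int main(void) {
    handle_put(3, 42);
    handle_put(3, 42);   /* duplicate: harmless, db[3] is still 42 */
    printf("after duplicate put: %ld\n", db[3]);

    handle_add(3, 1);
    handle_add(3, 1);    /* duplicate: wrong, db[3] is now 44 */
    printf("after duplicate add: %ld\n", db[3]);
    return 0;
}
```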
Limited Transparency Clusters

Limited Location Transparency