6/7/2018
Advanced Architectures
15A. Distributed Computing
15B. Multi-Processor Systems
15C. Tightly Coupled Distributed Systems
15D. Loosely Coupled Distributed Systems
15E. Cloud Models
15F. Distributed Middle-Ware

Goals of Distributed Computing
• better services
– scalability
• apps too big to run on a single computer
• grow system capacity to meet growing demand
– improved reliability and availability
– improved ease of use, reduced CapEx/OpEx
• new services
– applications that span multiple system boundaries
– global resource domains, services (vs. systems)
– complete location transparency
Advanced Architectures 1 Advanced Architectures 2
Major Classes of Distributed Systems
• Symmetric Multi-Processors (SMP)
– multiple CPUs, sharing memory and I/O devices
• Single-System Image (SSI) & Cluster Computing
– a group of computers, acting like a single computer
• loosely coupled, horizontally scalable systems
– coordinated, but relatively independent systems
• application level distributed computing
– peer-to-peer, application level protocols
– distributed middle-ware platforms

Evaluating Distributed Systems
• Performance
– overhead, scalability, availability
• Functionality
– adequacy and abstraction for target applications
• Transparency
– compatibility with previous platforms
– scope and degree of location independence
• Degree of Coupling
– on how many things do distinct systems agree
– how is that agreement achieved
SMP systems and goals
• Characterization:
– multiple CPUs sharing memory and devices
• Motivations:
– price performance (lower price per MIP)
– scalability (economical way to build huge systems)
– perfect application transparency
• Example:
– multi-core Intel CPUs
– multi-socket mother boards

Symmetric Multi-Processors
[Diagram: CPU 1-4, each with its own cache, plus an interrupt controller, all attached to shared memory & device busses; memory and several device controllers hang off the same busses.]
SMP Price/Performance
• a computer is much more than a CPU
– mother-board, disks, controllers, power supplies, case
– CPU might cost 10-15% of the cost of the computer
• adding CPUs to a computer is very cost-effective
– a second CPU yields cost of 1.1x, performance 1.9x
– a third CPU yields cost of 1.2x, performance 2.7x
• same argument also applies at the chip level
– making a machine twice as fast is ever more difficult
– adding more cores to the chip gets ever easier
• massive multi-processors are an obvious direction

SMP Operating System Design
• one processor boots with power on
– it controls the starting of all other processors
• same OS code runs in all processors
– one physical copy in memory, shared by all CPUs
• each CPU has its own registers, cache, MMU
– they must cooperatively share memory and devices
• ALL kernel operations must be Multi-Thread-Safe
– protected by appropriate locks/semaphores
– very fine grained locking to avoid contention
SMP Parallelism
• scheduling and load sharing
– each CPU can be running a different process
– just take the next ready process off the run-queue
– processes run in parallel
– most processes don't interact (other than in kernel)
• serialization
– mutual exclusion achieved by locks in shared memory
– locks can be maintained with atomic instructions
– spin locks acceptable for VERY short critical sections
– if a process blocks, that CPU finds the next ready process

The Challenge of SMP Performance
• scalability depends on memory contention
– memory bandwidth is limited, can't handle all CPUs
– most references are satisfied from per-core cache
– if too many requests go to memory, CPUs slow down
• scalability depends on lock contention
– waiting for spin-locks wastes time
– context switches waiting for kernel locks waste time
• contention wastes cycles, reduces throughput
– 2 CPUs might deliver only 1.9x performance
– 3 CPUs might deliver only 2.7x performance
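The spin locks described above can be sketched in a few lines. This is a toy illustration in Python, where `threading.Lock.acquire(blocking=False)` stands in for an atomic test-and-set instruction; a real SMP kernel would use a hardware CPU primitive instead:

```python
import threading

class SpinLock:
    """Toy spin lock: acquire(blocking=False) plays the role of test-and-set."""
    def __init__(self):
        self._flag = threading.Lock()       # holds the "lock bit"

    def acquire(self):
        # spin until the test-and-set succeeds
        while not self._flag.acquire(blocking=False):
            pass                            # busy-wait: only OK for short critical sections

    def release(self):
        self._flag.release()

counter = 0
lock = SpinLock()

def worker(n):
    global counter
    for _ in range(n):
        lock.acquire()
        counter += 1                        # very short critical section
        lock.release()

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                              # 40000: no updates lost
```

Without the lock, concurrent increments could be lost; with it, all 40,000 updates survive, at the cost of busy-waiting whenever the lock is contended.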
Managing Memory Contention
• Fast n-way memory is very expensive
– without it, memory contention taxes performance
– cost/complexity limits how many CPUs we can add
• Non-Uniform Memory Architectures (NUMA)
– each CPU has its own local memory
• each CPU has a fast path to its own memory
– CPUs are connected by a Scalable Coherent Interconnect
• a very fast, very local network between memories
• accessing memory over the SCI may be 3-20x slower
– these interconnects can be highly scalable (e.g. Intel QuickPath Interconnect)

Non-Uniform Memory Architecture
[Diagram: each node pairs a CPU and cache with its own local memory and a PCI bridge (PCI bus, device controllers); a CC-NUMA interface on each node ties the nodes together over a Scalable Coherent Interconnect.]
OS design for NUMA systems
• it is all about local memory hit rates
– every outside reference costs us 3-20x performance
– we need 75-95% hit rate just to break even
• How can the OS ensure high hit-rates?
– replicate shared code pages in each CPU's memory
– assign processes to CPUs, allocate all memory there
– migrate processes to achieve load balancing
– spread kernel resources among all the CPUs
– attempt to preferentially allocate local resources
– migrate resource ownership to the CPU that is using it

Single System Image (SSI) Clusters
• Characterization:
– a group of seemingly independent computers collaborating to provide SMP-like transparency
• Motivation:
– higher reliability, availability than SMP/NUMA
– more scalable than SMP/NUMA
– excellent application transparency
• Examples:
– Locus, Microsoft Wolfpack, OpenSSI
– Oracle Parallel Server
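The break-even arithmetic behind those hit-rate numbers can be sketched directly; `remote_penalty` here is a made-up stand-in for the 3-20x interconnect cost:

```python
# Average memory access cost on a NUMA node, in units of one local reference.
#   hit_rate: fraction of references satisfied from local memory
#   remote_penalty: cost multiplier for going over the interconnect (3-20x)
def avg_access_cost(hit_rate, remote_penalty):
    return hit_rate * 1.0 + (1.0 - hit_rate) * remote_penalty

# Even with a modest 5x remote penalty, an 80% local hit rate still nearly
# doubles the average cost of a memory reference:
print(round(avg_access_cost(0.80, 5), 2))   # 1.8
print(round(avg_access_cost(0.95, 5), 2))   # 1.2
```

This is why every technique on the slide (page replication, local allocation, process and resource migration) is aimed at pushing the hit rate as high as possible.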
Modern Clustered Architecture
[Diagram: four SMP systems on a shared ethernet behind redundant switches, each dual-ported to shared RAID storage at the primary site; synchronous FC replication copies state to an optional back-up site for geographic fail-over.]
Active systems service independent requests in parallel. They cooperate to maintain shared global locks, and are prepared to take over a partner's work in case of failure. State replication to a back-up site is handled by external mechanisms.

OS design for SSI clustering
• all nodes agree on the state of all OS resources
– file systems, processes, devices, locks, IPC ports
– any process can operate on any object, transparently
• they achieve this by exchanging messages
– advising one another of all changes to resources
– requesting execution of node-specific operations
• each OS's internal state mirrors the global state
• node-specific requests are forwarded to the owning node
• implementation is large, complex, difficult
• the exchange of messages can be very expensive
SSI Clustered Performance
• clever implementation can minimize overhead
– 10-20% overall is not uncommon, and can be much worse
• complete transparency
– even very complex applications "just work"
– they do not have to be made "network aware"
• good robustness
– when one node fails, others notice and take over
– often, applications won't even notice the failure
• nice for application developers and customers
– but they are complex, and not particularly scalable

Lessons Learned
• consensus protocols are expensive
– they converge slowly and scale poorly
• systems have a great many resources
– resource change notifications are expensive
• location transparency encouraged non-locality
– remote resource use is much more expensive
• a greatly complicated operating system
– distributed objects are more complex to manage
– complex optimizations to reduce the added overheads
– new modes of failure w/complex recovery procedures
• Bottom Line: Deutsch was right!
Loosely Coupled Systems
• Characterization:
– a parallel group of independent computers
– serving similar but independent requests
– minimal coordination and cooperation required
• Motivation:
– scalability and price performance
– availability – if the protocol permits stateless servers
– ease of management, reconfigurable capacity
• Examples:
– web servers, Google search farm, Hadoop

Horizontal Scalability w/HA
[Diagram: a load balancing switch with fail-over connects WAN clients to a farm of web servers and app servers; the app servers share an HA database pair (D1/D2) and a content distribution tier.]
(elements of architecture)
• farm of independent servers
– servers run the same software, serve different requests
– may share a common back-end database
• front-ending switch
– distributes incoming requests among available servers
– can do both load balancing and fail-over
• service protocol
– stateless servers and idempotent operations
– successive requests may be sent to different servers

Horizontally scaled performance
• individual servers are very inexpensive
– blade servers may be only $100-$200 each
• scalability is excellent
– 100 servers deliver approximately 100x performance
• service availability is excellent
– the front-end automatically bypasses failed servers
– stateless servers and client retries fail over easily
• the challenge is managing thousands of servers
– automated installation, global configuration services
– self-monitoring, self-healing systems
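The front-ending switch logic can be sketched minimally, under the stated assumption of stateless servers and idempotent requests; all server names and handlers here are hypothetical:

```python
import itertools

class Farm:
    """Toy front-end switch: round-robin dispatch with fail-over."""
    def __init__(self, servers):
        self.servers = servers              # name -> handler function
        self.up = set(servers)              # currently healthy servers
        self._rr = itertools.cycle(sorted(servers))

    def dispatch(self, request):
        # stateless servers + idempotent requests => a retry on a
        # different server is always safe
        for _ in range(2 * len(self.servers)):
            name = next(self._rr)
            if name not in self.up:
                continue                    # bypass a known-failed server
            try:
                return self.servers[name](request)
            except ConnectionError:
                self.up.discard(name)       # mark failed, retry elsewhere
        raise RuntimeError("no servers available")

def ok(req):   return f"served {req}"
def dead(req): raise ConnectionError

farm = Farm({"web1": ok, "web2": dead, "web3": ok})
print(farm.dispatch("GET /a"))   # served GET /a
print(farm.dispatch("GET /b"))   # web2 fails, request retried on web3
```

Because no request depends on server-side session state, the retry after `web2` fails is invisible to the client, which is exactly why the slide stresses stateless servers and idempotent operations.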
Limited Transparency Clusters
• Single System Image clusters had problems
– all nodes had to agree on the state of all objects
– lots of messages, lots of complexity, poor scalability
• What if they only had to agree on a few objects?
– like cluster membership and global locks
– fewer objects, fewer operations, much less traffic
– objects could be designed for distributed use
• leases, commitment transactions, dynamic server binding
• simpler, better performance, better scalability
– combines the best features of SSI and horizontally scaled systems

Limited Location Transparency
• what things look the same as local?
– remote file systems
– remote terminal sessions, X sessions
– remote procedure calls
• what things don't look the same as local?
– primitive synchronization (e.g. mutexes)
– basic Inter-Process Communication (e.g. signals)
– process create, destroy, status, authorization
– accessing devices (e.g. tape drives)
Clouds: Applied Horizontal Scalability
• Many servers, continuous change
– dramatic fluctuations in load volume and types
– continuous node additions for increased load
– nodes and devices are failing continuously
– continuous and progressive s/w updates
• Most services delivered via switched HTTP
– client/server communication is over WAN links
– large (whole file) transfers to optimize throughput
– switches route requests to appropriate servers
– heavy reliance on edge caching

Geographic Disaster Recovery
• Cloud reliability/availability are key
– one data center serves many (10^3-10^7) clients
• Local redundancy can only provide 4-5 nines
– fires, power and communications disruptions
– regional scale (e.g. flood, earthquake) disasters
• Data Centers in distant Availability Zones
– may be running active/active or active/stand-by
– key data is replicated to multiple data centers
– traffic can be redirected if a primary site fails
WAN-Scale Replication
• WAN-scale mirroring is slow and expensive
– much slower than local RAID or network mirroring
• Synchronous Mirroring
– each write must be ACKed by remote servers
• Asynchronous Mirroring
– write locally, queue for remote replication
• Mirrored Snapshots
– writes are local, snapshots are mirrored
• Fundamental tradeoff: reliability vs. latency

WAN-Scale Consistency
• CAP theorem – it is not possible to assure all three of:
– Consistency (all readers see the same result)
– Availability (bounded response time)
– Partition Tolerance (survive node failures)
• ACID databases sacrifice partition tolerance
• BASE semantics sacrifice consistency
– Basic Availability (most services, most of the time)
– Soft state (there is no global consistent state)
– Eventual consistency (changes propagate, slowly)
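The asynchronous mirroring scheme can be sketched as a local write path plus a replication queue; this toy version uses in-memory dictionaries in place of real local and remote storage, and a thread in place of a WAN replication daemon:

```python
import queue, threading

local_store, remote_store = {}, {}
replication_q = queue.Queue()

def write(key, value):
    local_store[key] = value            # ACK to caller immediately (low latency)
    replication_q.put((key, value))     # remote copy lags behind (reliability risk)

def replicator():
    while True:
        item = replication_q.get()
        if item is None:
            break                       # shutdown sentinel
        key, value = item
        remote_store[key] = value       # in reality: a slow WAN transfer

t = threading.Thread(target=replicator)
t.start()
write("/etc/passwd", b"v1")
write("/etc/passwd", b"v2")
replication_q.put(None)
t.join()
assert local_store == remote_store      # consistent only after the queue drains
```

The fundamental tradeoff is visible here: the writer never waits for the remote copy, so a site failure before the queue drains loses the queued updates; synchronous mirroring would flip that, waiting on every write.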
Dealing with Eventual Consistency
• a distributed system has no single, global state
– state updates are not globally serialized events
– different nodes may have different opinions
• expose the inconsistencies to the applications
– ask the cloud, receive multiple answers
– let each application reconcile the inconsistencies
• BASE semantics are neither simple nor pretty
– they embrace parallelism and independence
– they reflect the complexity of distributed systems

Distributed Computing Reformation
• systems must be more loosely coupled
– tight coupling is complex, slow, and error-prone
– move towards coordinated independent systems
• move away from old single-system APIs
– local objects and services don't generalize
– services are obtained through messages (or RPCs)
– in-memory objects, local calls are a special case
• embrace the brave new (distributed) world
– topology and partnerships are ever-changing
– failure-aware services (commits, leases, rebinds)
– accept distributed (e.g. BASE) semantics
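One way an application can reconcile the multiple answers it receives is last-writer-wins, sketched below. The timestamps and values are made up, and this is only one possible policy; real systems often use richer schemes (version vectors, application-specific merges):

```python
# "Ask the cloud, receive multiple answers": each replica replies with a
# (timestamp, value) pair, and the replies may disagree.
def reconcile(replies):
    # last-writer-wins: pick the value with the newest timestamp; this is a
    # deliberate application-level choice, not a guarantee from the store
    return max(replies)[1]

replies = [(105, "v2"), (103, "v1"), (105, "v2")]   # replicas disagree
print(reconcile(replies))   # v2
```

The key point of the slide survives in the sketch: the storage system does not hide the disagreement, and it is the application that decides which answer counts.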
How to Exploit a Cloud
• Replace physical machines w/virtual machines
– cloud provides inexpensive elastic resources
• Run massively parallel applications
– requiring huge numbers of computers
• Massively distributed systems are very difficult
– to design, build, maintain and manage
• How can we make exploiting parallelism easy?
– new, tool supported, programming models
– encapsulate the complexity of distributed systems

Distributed Middle-Ware
• API adapters
– e.g. HIVE (SQL bridge)
• complexity hiding
– e.g. remote procedure calls, distributed objects
• restricted programming platforms
– e.g. Java Applets, Erlang, state machines
• new programming models
– e.g. publish-subscribe, MapReduce, Key-Value stores
• powerful distributed applications
– e.g. search engines, Watson, Deep Learning
Distributed Systems – Summary
• different degrees of transparency
– do applications see a network or a single system image?
• different degrees of coupling
– making multiple computers cooperate is difficult
– doing it without shared memory is even worse
• performance vs. independence vs. robustness
– cooperating redundant nodes offer higher availability
– communication and coordination are expensive
– mutual dependency creates more modes of failure

Changing Architectural Paradigms
• a "System" is a collection of services
– interacting via stable and standardized protocols
– implemented by app software deployed on nodes
• Operating Systems
– manage the hardware on which the apps run
– implement the services/ABIs the apps need
• The operating system is a platform
– upon which higher level software can be built
– goodness is measured by how well it does that job
What Operating Systems Do
• Originally (and at the start of this course)
– abstract heterogeneous hardware into useful services
– manage system resources for user-mode processes
– ensure resource integrity and trusted resource sharing
– provide a powerful platform for developers
• None of this has changed, but …
– the notion of a self-contained system is becoming obsolete
– hardware and OS heterogeneity is a given
– the most important interfaces are higher level protocols
• Operating Systems continue to evolve as
– new applications demand new services
– new hardware must be integrated and exploited

Final Exam
• Wed 6/13, Boelter 3400
• Part I: 08:00-09:30
– 10 questions, similar to mid-term
– closed book
– covering reading and lectures since mid-term
• Part II: 09:30-11:00
– 3/6 questions, similar to mid-term XC
– closed book
– covering the entire course
The Dream
Programs don't run on hardware, they run atop operating systems. All the resources that processes see are already virtualized. Instead of merely virtualizing all the resources in a single system, virtualize all the resources in a cluster of systems. Applications that run in such a cluster are (automatically and transparently) distributed.
[Diagram: four physical systems, each with its own processes, locks, devices, and disks, merge into one virtual HA computer with 4x the MIPS and memory: one combined set of processes and locks, one global pool of devices, and one large HA virtual file system whose primary copies and secondary replicas are spread across the nodes' disks.]

Supplementary Slides
Structure of a Modern OS
[Diagram: a layered OS architecture, from the system call interfaces and user-visible models (file namespace, authorization, file I/O, IPC, process/thread, exception, and synchronization models) down through higher level services (file systems, transport protocols, fault management, quality of service, logging and tracing, swapping, paging), device class drivers and device drivers, the core execution engine (thread dispatching, scheduling, memory allocation, synchronization, fault handling), and the hardware abstraction layer (interrupts, traps, timers, DMA, atomic operations, memory mapping, cache management).]

Transparency
• Ideally, a distributed system would be just like a single machine system
• But better
– More resources
– More reliable
– Faster
• Transparent distributed systems look as much like single machine systems as possible
Loosely Coupled Scalability (Beowulf High Performance Computing Cluster)
[Diagram: a client submits jobs to a Beowulf Head Node, which exports programs and data over NFS and starts MPI sub-tasks on a row of Beowulf Slave Nodes; the Message Passing Interface (MPI) exchanges information between the cooperating sub-tasks.]
There is no effort at transparency here. Applications are specifically written for a parallel execution platform and use a Message Passing Interface to mediate exchanges between cooperating computations.

Example: A Hadoop Cluster
[Diagram: a client submits a job to a job tracker and data to a name node, which coordinates tasks across task nodes; each task node runs a task tracker and a data node, and map or reduce tasks read from and write to the data nodes of the HDFS cluster, which replicate blocks among themselves.]
(Hadoop Distributed Middleware)
• Client
– uploads data into an HDFS cluster
– creates a map/reduce data analysis program
– submits the job to a Hadoop Job Tracker
• Job Tracker
– sends sub-tasks to task trackers on many nodes
• Task Trackers
– spawn/monitor map/reduce tasks, collect status
• Map/Reduce Tasks
– run the analysis program on a defined data sub-set
– reading data from/writing results to the HDFS cluster

Embarrassingly Parallel Jobs
• Problems where it's really, really easy to parallelize them
• Probably because the data sets are easily divisible
• And exactly the same things are done on each piece
• So you just parcel them out among the nodes and let each go independently
• Everyone finishes at more or less the same time
MapReduce
• Perhaps the most common cloud computing software tool/technique
• A method of dividing large problems into compartmentalized pieces
• Each of which can be performed on a separate node
• With an eventual combined set of results

The Idea Behind MapReduce
• There is a single function you want to perform on a lot of data
• Divide the data into disjoint pieces
• Perform the function on each piece on a separate node (map)
• Combine the results to obtain the output (reduce)
An Example
• We have 64 megabytes of text data
• Divide it into 4 chunks of 16 Mbytes
• Assign each chunk to one processor
• Perform the map function of "count words" on each

The Example Continued
• Count how many times each word occurs in the text
• Each of the four map nodes produces its own word counts:

Chunk 1: Foo 1, Bar 4, Baz 3, Zoo 6, Yes 12, Too 5
Chunk 2: Foo 7, Bar 3, Baz 9, Zoo 1, Yes 17, Too 8
Chunk 3: Foo 2, Bar 6, Baz 2, Zoo 2, Yes 10, Too 4
Chunk 4: Foo 4, Bar 7, Baz 5, Zoo 9, Yes 3, Too 7

• That's the map stage
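The word-count pipeline above can be sketched end-to-end in a few lines; the chunk contents here are tiny made-up stand-ins for the 16 MB chunks, not the slide's actual data:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Per-chunk word counts: what each of the four map nodes computes."""
    return Counter(chunk.split())

def reduce_phase(counts_list):
    """Combine the per-chunk counts: what a reduce node computes."""
    return reduce(lambda a, b: a + b, counts_list, Counter())

chunks = ["foo bar foo", "bar baz", "foo baz baz", "bar"]  # stand-in data
mapped = [map_phase(c) for c in chunks]     # would run on 4 separate nodes
totals = reduce_phase(mapped)               # would run on a reduce node
print(dict(totals))   # {'foo': 3, 'bar': 3, 'baz': 3}
```

In a real deployment each `map_phase` call runs on its own node, and the combined counts would themselves be partitioned across multiple reduce nodes, as the next slides describe.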
On To Reduce
• We might have two more nodes assigned to doing the reduce operation
• They will each receive a share of the data from each map node
• The reduce node performs a reduce operation to "combine" the shares
• Outputting its own result

Continuing the Example
[Diagram: the four map outputs (Chunk 1: Foo 1, Bar 4, Baz 3, Zoo 6, Yes 12, Too 5; Chunk 2: Foo 7, Bar 3, Baz 9, Zoo 1, Yes 17, Too 8; Chunk 3: Foo 2, Bar 6, Baz 2, Zoo 2, Yes 10, Too 4; Chunk 4: Foo 4, Bar 7, Baz 5, Zoo 9, Yes 3, Too 7) are partitioned between the two reduce nodes, each word going to exactly one reducer.]
The Reduce Nodes Do Their Job
• Write out the results to files
• And MapReduce is done!

Final results:
Foo 14, Bar 20, Baz 19, Zoo 18, Yes 42, Too 24

But I Wanted A Combined List
• No problem
• Run another (slightly different) MapReduce on the outputs
• Have one reduce node that combines everything
Synchronization in MapReduce
• Each map node produces an output file for each reduce node
• It is produced atomically
• The reduce node can't work on this data until the whole file is written
• Forcing a synchronization point between the map and reduce phases
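The atomic file production described above is commonly achieved by writing to a temporary name and then renaming it into place: readers see either no file or a complete file, never a partial one. A sketch (the output path name is hypothetical):

```python
import os, tempfile

def atomic_write(path, data):
    """Publish a complete output file atomically via write-to-temp + rename."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())              # data durable before the rename
        os.replace(tmp, path)                 # the rename is atomic on POSIX
    except BaseException:
        os.unlink(tmp)                        # never leave a partial file behind
        raise

atomic_write("map-output-part-0", b"foo 3\nbar 3\n")
with open("map-output-part-0", "rb") as f:
    print(f.read())   # b'foo 3\nbar 3\n'
```

This is the synchronization point the slide describes: a reduce task that polls for `map-output-part-0` cannot observe it until the map task's entire output is in place.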