

Advanced Architectures

15A. Goals of Distributed Systems
15B. Multi-Processor Systems
15C. Tightly Coupled Distributed Systems
15D. Loosely Coupled Distributed Systems
15E. Cloud Models
15F. Distributed Middle-Ware

Goals of Distributed Systems

• better services
  – apps too big to run on a single computer
  – grow system capacity to meet growing demand
  – improved reliability and availability
  – improved ease of use, reduced CapEx/OpEx
• new services
  – applications that span multiple system boundaries
  – global resource domains, services (vs. systems)
  – complete location transparency


Major Classes of Distributed Systems

• Symmetric Multi-Processors (SMP)
  – multiple CPUs, sharing memory and I/O devices
• Single-System Image (SSI) & Cluster Computing
  – a group of computers, acting like a single computer
• loosely coupled, horizontally scalable systems
  – coordinated, but relatively independent systems
• application level distributed computing
  – peer-to-peer, application level protocols
  – distributed middle-ware platforms

Evaluating Distributed Systems

• Performance
  – overhead, scalability, availability
• Functionality
  – adequacy and abstraction for target applications
• Transparency
  – compatibility with previous platforms
  – scope and degree of location independence
• Degree of Coupling
  – on how many things do distinct systems agree
  – how is that agreement achieved


SMP systems and goals

• Characterization:
  – multiple CPUs sharing memory and devices
• Motivations:
  – price performance (lower price per MIP)
  – scalability (economical way to build huge systems)
  – perfect application transparency
• Examples:
  – multi-core Intel CPUs
  – multi-socket mother boards

Symmetric Multi-Processors

[Figure: four CPUs, each with its own cache, share an interrupt controller, the device busses, device controllers, and a single main memory.]



SMP Price/Performance

• a computer is much more than a CPU
  – mother-board, disks, controllers, power supplies, case
  – CPU might cost 10-15% of the cost of the computer
• adding CPUs to a computer is very cost-effective
  – a second CPU yields cost of 1.1x, performance 1.9x
  – a third CPU yields cost of 1.2x, performance 2.7x
• same argument also applies at the chip level
  – making a machine twice as fast is ever more difficult
  – adding more cores to the chip gets ever easier
• massive multi-processors are the obvious direction

SMP Design

• one processor boots with power on
  – it controls the starting of all other processors
• same OS code runs in all processors
  – one physical copy in memory, shared by all CPUs
• each CPU has its own registers, cache, MMU
  – they must cooperatively share memory and devices
• ALL kernel operations must be multi-processor safe
  – protected by appropriate locks/semaphores
  – very fine grained locking to avoid contention


SMP Parallelism and Load Sharing

– each CPU can be running a different process
– just take the next ready process off the run-queue
– processes run in parallel
– most processes don't interact (other than in kernel)
• serialization
  – achieved by locks in shared memory
  – locks can be maintained with atomic instructions (see the sketch below)
  – spin locks acceptable for VERY short critical sections
  – if a process blocks, that CPU finds next ready process

The Challenge of SMP Performance

• scalability depends on memory contention
  – memory bandwidth is limited, can't handle all CPUs
  – most references satisfied from per-core cache
  – if too many requests go to memory, CPUs slow down
• scalability depends on lock contention
  – waiting for spin-locks wastes time
  – context switches waiting for kernel locks waste time
• contention wastes cycles, reduces throughput
  – 2 CPUs might deliver only 1.9x performance
  – 3 CPUs might deliver only 2.7x performance
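As an illustration of the serialization bullets above (locks maintained with atomic instructions, spinning only for very short critical sections), here is a minimal user-level sketch in Go. The SpinLock type and the counter workload are invented for illustration; a kernel would use the hardware compare-and-swap or test-and-set instruction directly rather than a language runtime.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// SpinLock is an illustrative lock built on an atomic compare-and-swap,
// the same primitive an SMP kernel uses for very short critical sections.
type SpinLock struct{ state int32 }

func (l *SpinLock) Lock() {
	// Busy-wait until we atomically flip state from 0 (free) to 1 (held).
	for !atomic.CompareAndSwapInt32(&l.state, 0, 1) {
		runtime.Gosched() // let another goroutine run instead of burning the CPU
	}
}

func (l *SpinLock) Unlock() {
	atomic.StoreInt32(&l.state, 0)
}

func main() {
	var lock SpinLock
	var wg sync.WaitGroup
	counter := 0

	// Several "CPUs" increment a shared counter; the critical section is tiny,
	// which is exactly when spinning is acceptable.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				lock.Lock()
				counter++
				lock.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println("counter =", counter) // expect 400000
}
```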


Managing Memory Contention

• Fast n-way memory is very expensive
  – without it, memory contention taxes performance
  – cost/complexity limits how many CPUs we can add
• Non-Uniform Memory Architectures (NUMA)
  – each CPU has its own memory
    • each CPU has a fast path to its own memory
  – connected by a Scalable Coherent Interconnect
    • a very fast, very local network between memories
    • accessing memory over the SCI may be 3-20x slower
  – these interconnects can be highly scalable

Non-Uniform Memory Architecture

[Figure: a NUMA symmetric multi-processor. Each CPU has its own cache, local memory, PCI bridge/bus, and device controllers, with a CC-NUMA interface connecting every node to a Scalable Coherent Interconnect (e.g. Intel Quick Path Interconnect).]



OS design for NUMA systems

• it is all about local memory hit rates
  – every outside reference costs us 3-20x performance
  – we need 75-95% hit rate just to break even
• How can the OS ensure high hit-rates? (see the allocation sketch below)
  – replicate shared code pages in each CPU's memory
  – assign processes to CPUs, allocate all memory there
  – migrate processes to achieve load balancing
  – spread kernel resources among all the CPUs
  – attempt to preferentially allocate local resources
  – migrate resource ownership to the CPU that is using it

Single System Image (SSI) Clusters

• Characterization:
  – a group of seemingly independent computers, collaborating to provide SMP-like transparency
• Motivation:
  – higher reliability, availability than SMP/NUMA
  – more scalable than SMP/NUMA
  – excellent application transparency
• Examples:
  – Locus, MicroSoft Wolf-Pack, OpenSSI
  – Oracle Parallel Server
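The "assign processes to CPUs, allocate all memory there" and "preferentially allocate local resources" policies from the NUMA slide above can be sketched as a toy allocator. This is a hedged illustration only: the numaNode/allocPage names and the page pools are invented, and a real kernel allocator is far more involved.

```go
package main

import "fmt"

// Toy model of a NUMA machine: each node keeps its own pool of free pages.
// The types and pool sizes are invented for illustration.
type page struct{ node int }

type numaNode struct {
	id   int
	free []page
}

// allocPage prefers the node the process is bound to ("allocate all memory
// there") and only falls back to a remote node when local memory is exhausted,
// since every remote reference may be 3-20x slower.
func allocPage(nodes []*numaNode, home int) (page, bool) {
	if len(nodes[home].free) > 0 {
		p := nodes[home].free[0]
		nodes[home].free = nodes[home].free[1:]
		return p, true
	}
	for _, n := range nodes { // fallback: any node with free memory
		if len(n.free) > 0 {
			p := n.free[0]
			n.free = n.free[1:]
			return p, true
		}
	}
	return page{}, false
}

func main() {
	nodes := []*numaNode{
		{id: 0, free: []page{{0}, {0}}},
		{id: 1, free: []page{{1}, {1}, {1}}},
	}
	for i := 0; i < 4; i++ {
		p, ok := allocPage(nodes, 0) // process bound to node 0
		fmt.Println("alloc", i, "-> node", p.node, "ok:", ok)
	}
}
```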


Modern Clustered Architecture

[Figure: client requests arrive at redundant switches in front of four SMP systems, each dual-ported to shared RAID at the primary site; synchronous FC replication copies data to an optional geographic fail-over (back-up) site.]

Active systems service independent requests in parallel. They cooperate to maintain shared global locks, and are prepared to take over a partner's work in case of failure. State replication to a back-up site is handled by external mechanisms.

OS design for SSI clustering

• all nodes agree on the state of all OS resources
  – file systems, processes, devices, locks, IPC ports
  – any process can operate on any object, transparently
• they achieve this by exchanging messages (see the sketch below)
  – advising one-another of all changes to resources
  – each OS's internal state mirrors the global state
  – requesting execution of node-specific requests
    • node-specific requests are forwarded to the owning node
• implementation is large, complex, difficult
• the exchange of messages can be very expensive
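A minimal sketch of the message exchange described above, in Go: each node applies its peers' resource-change notifications so that its internal state mirrors the global state. The update format and resource names are invented; the point is only to show why every change generates traffic to every node.

```go
package main

import (
	"fmt"
	"sync"
)

// A resource-change notification, as an SSI node might broadcast to its peers.
// The message format is invented for illustration.
type update struct {
	resource string
	state    string
}

type node struct {
	id    int
	mu    sync.Mutex
	view  map[string]string // this node's mirror of the global resource state
	inbox chan update
}

func (n *node) run(wg *sync.WaitGroup) {
	defer wg.Done()
	for u := range n.inbox {
		n.mu.Lock()
		n.view[u.resource] = u.state // apply the peer's change to our mirror
		n.mu.Unlock()
	}
}

func main() {
	var wg sync.WaitGroup
	nodes := make([]*node, 3)
	for i := range nodes {
		nodes[i] = &node{id: i, view: map[string]string{}, inbox: make(chan update, 16)}
		wg.Add(1)
		go nodes[i].run(&wg)
	}

	// Node 0 changes a resource and must advise every peer -- this is the
	// message traffic that makes full SSI transparency so expensive.
	change := update{resource: "lock-1A", state: "held-by-node-0"}
	for _, n := range nodes {
		n.inbox <- change
	}
	for _, n := range nodes {
		close(n.inbox)
	}
	wg.Wait()
	for _, n := range nodes {
		fmt.Printf("node %d sees %s = %s\n", n.id, change.resource, n.view[change.resource])
	}
}
```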


SSI Clustered Performance

• clever implementation can minimize overhead
  – 10-20% overall is not uncommon, can be much worse
• complete transparency
  – even very complex applications "just work"
  – they do not have to be made "network aware"
• good robustness
  – when one node fails, others notice and take over
  – often, applications won't even notice the failure
• nice for application developers and customers
  – but they are complex, and not particularly scalable

Lessons Learned

• consensus protocols are expensive
  – they converge slowly and scale poorly
• systems have a great many resources
  – resource change notifications are expensive
• location transparency encouraged non-locality
  – remote resource use is much more expensive
• a greatly complicated operating system
  – distributed objects are more complex to manage
  – complex optimizations to reduce the added overheads
  – new modes of failure w/complex recovery procedures
• Bottom Line: Deutsch was right!



Loosely Coupled Systems

• Characterization:
  – a parallel group of independent computers
  – serving similar but independent requests
  – minimal coordination and cooperation required
• Motivation:
  – scalability and price performance
  – availability – if the protocol permits stateless servers
  – ease of management, reconfigurable capacity
• Examples:
  – web servers, Google search farm, Hadoop

Horizontal Scalability w/HA

[Figure: clients reach a load balancing switch (with fail-over) over the WAN; the switch feeds a row of web servers backed by a row of app servers, which share HA database servers (D1, D2) and content distribution servers.]


Horizontal Scalability (elements of architecture)

• farm of independent servers
  – servers run same software, serve different requests
  – may share a common back-end database
• front-ending switch (see the sketch below)
  – distributes incoming requests among available servers
  – can do both load balancing and fail-over
• service protocol
  – stateless servers and idempotent operations
  – successive requests may be sent to different servers

Horizontally scaled performance

• individual servers are very inexpensive
  – blade servers may be only $100-$200 each
• scalability is excellent
  – 100 servers deliver approximately 100x performance
• service availability is excellent
  – front-end automatically bypasses failed servers
  – stateless servers and client retries fail over easily
• the challenge is managing thousands of servers
  – automated installation, global configuration services
  – self monitoring, self-healing systems
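The front-ending switch described above can be sketched as a round-robin picker that bypasses failed servers. A minimal Go sketch; the backend/balancer names and the health flags are invented stand-ins for a real load balancer's health checks.

```go
package main

import "fmt"

// backend models one server in the farm; "alive" would really be driven by
// periodic health checks.
type backend struct {
	addr  string
	alive bool
}

// balancer hands successive requests to different servers (round robin)
// and automatically skips servers that have been marked as failed.
type balancer struct {
	servers []*backend
	next    int
}

func (b *balancer) pick() (*backend, bool) {
	for i := 0; i < len(b.servers); i++ {
		s := b.servers[b.next]
		b.next = (b.next + 1) % len(b.servers)
		if s.alive {
			return s, true
		}
	}
	return nil, false // every server is down
}

func main() {
	lb := &balancer{servers: []*backend{
		{addr: "web1:80", alive: true},
		{addr: "web2:80", alive: false}, // failed -- should be bypassed
		{addr: "web3:80", alive: true},
	}}
	for req := 1; req <= 5; req++ {
		if s, ok := lb.pick(); ok {
			fmt.Printf("request %d -> %s\n", req, s.addr)
		}
	}
}
```

Because the servers are stateless and the operations idempotent, a client retry after a fail-over can safely land on a different server.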


Limited Transparency Clusters

• Single System Image Clusters had problems
  – all nodes had to agree on the state of all objects
  – lots of messages, lots of complexity, poor scalability
• What if they only had to agree on a few objects
  – like cluster membership and global locks
  – fewer objects, fewer operations, much less traffic
  – objects could be designed for distributed use
    • leases, commitment transactions, dynamic server binding (see the lease sketch below)
• Simpler, better performance, better scalability
  – combines the best features of SSI and Horizontally scaled

Limited Location Transparency

• what things look the same as local?
  – remote file systems
  – remote terminal sessions, X sessions
  – remote procedure calls
• what things don't look the same as local?
  – primitive synchronization (e.g. mutexes)
  – basic Inter-Process Communication (e.g. signals)
  – process create, destroy, status, authorization
  – accessing devices (e.g. tape drives)
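One of the "objects designed for distributed use" mentioned above is the lease: a time-limited grant that nodes can agree on cheaply, because expiry resolves what happens when the holder disappears. A minimal Go sketch with invented names and durations.

```go
package main

import (
	"fmt"
	"time"
)

// lease grants a holder exclusive use of a resource for a bounded time, so
// the cluster only has to agree on a few such objects rather than on all state.
type lease struct {
	holder  string
	expires time.Time
}

func (l *lease) valid(now time.Time) bool {
	return now.Before(l.expires)
}

// renew extends the lease only if the caller still holds it and it has not
// already expired.
func (l *lease) renew(holder string, now time.Time, d time.Duration) bool {
	if l.holder == holder && l.valid(now) {
		l.expires = now.Add(d)
		return true
	}
	return false
}

func main() {
	now := time.Now()
	l := &lease{holder: "node-2", expires: now.Add(50 * time.Millisecond)}

	fmt.Println("valid now:", l.valid(now))
	fmt.Println("renewed:", l.renew("node-2", now, 100*time.Millisecond))

	// After expiry the resource is safe to grant elsewhere, even if node-2
	// crashed without releasing it.
	later := now.Add(200 * time.Millisecond)
	fmt.Println("valid later:", l.valid(later))
}
```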



Clouds: Applied Horizontal Scalability

• Many servers, continuous change
  – dramatic fluctuations in load volume and types
  – continuous node additions for increased load
  – nodes and devices are failing continuously
  – continuous and progressive s/w updates
• Most services delivered via switched HTTP
  – client/server communication is over WAN links
  – large (whole file) transfers to optimize throughput
  – switches route requests to appropriate servers
  – heavy reliance on edge caching

Geographic Disaster Recovery

• Cloud reliability/availability are key
  – one data center serves many (10³-10⁷) clients
• Local redundancy can only provide 4-5 nines
  – fires, power and communications disruptions
  – regional scale (e.g. flood, earthquake) disasters
• Data Centers in distant Availability Zones
  – may be running active/active or active/stand-by
  – key data is replicated to multiple data centers
  – traffic can be redirected if a primary site fails


WAN-Scale Replication

• WAN-scale mirroring is slow and expensive
  – much slower than local RAID or network mirroring
• Synchronous Mirroring
  – each write must be ACKed by remote servers
• Asynchronous Mirroring
  – write locally, queue for remote replication
• Mirrored Snapshots
  – writes are local, snapshots are mirrored
• Fundamental tradeoff: reliability vs. latency (see the sketch below)

WAN-Scale Consistency

• CAP theorem - it is not possible to assure all of:
  – Consistency (all readers see the same result)
  – Availability (bounded response time)
  – Partition Tolerance (survive node failures)
• ACID semantics sacrifice partition tolerance
• BASE semantics sacrifice consistency
  – Basic Availability (most services, most of the time)
  – Soft state (there is no global consistent state)
  – Eventual consistency (changes propagate, slowly)
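The reliability-vs-latency tradeoff above can be made concrete: a synchronous mirror does not complete a write until every remote replica has ACKed it, while an asynchronous mirror returns as soon as the update has been queued locally. A minimal Go sketch; the delays, region names, and queue size are invented.

```go
package main

import (
	"fmt"
	"time"
)

// remoteWrite stands in for shipping a write to a distant replica; the fixed
// delay is a stand-in for a WAN round trip.
func remoteWrite(replica string, data string, done chan<- string) {
	time.Sleep(30 * time.Millisecond)
	done <- replica
}

// syncMirror: the write does not complete until every replica has ACKed,
// so the client pays the full WAN latency but the data is durable remotely.
func syncMirror(replicas []string, data string) {
	done := make(chan string, len(replicas))
	for _, r := range replicas {
		go remoteWrite(r, data, done)
	}
	for range replicas {
		fmt.Println("ACK from", <-done)
	}
}

// asyncMirror: write locally, queue the update for later replication, return
// immediately -- fast, but a site failure can lose queued updates.
func asyncMirror(queue chan<- string, data string) {
	select {
	case queue <- data:
	default:
		fmt.Println("replication queue full -- falling behind")
	}
}

func main() {
	start := time.Now()
	syncMirror([]string{"us-east", "eu-west"}, "block 42")
	fmt.Println("synchronous write took", time.Since(start))

	queue := make(chan string, 100)
	start = time.Now()
	asyncMirror(queue, "block 43")
	fmt.Println("asynchronous write took", time.Since(start), "(not yet durable remotely)")
}
```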


Dealing with Eventual Consistency

• a distributed system has no single, global state
  – state updates are not globally serialized events
  – different nodes may have different opinions
• expose the inconsistencies to the applications
  – ask the cloud, receive multiple answers
  – let each application reconcile the inconsistencies (see the sketch below)
• BASE semantics are neither simple nor pretty
  – they embrace parallelism and independence
  – they reflect the complexity of distributed systems

Distributed Computing Reformation

• systems must be more loosely coupled
  – tight coupling is complex, slow, and error-prone
  – move towards coordinated independent systems
• move away from the old single-system model
  – local objects and services don't generalize
  – services are obtained through messages (or RPCs)
  – in-memory objects, local calls are a special case
• embrace the brave new (distributed) world
  – topology and partnerships are ever-changing
  – failure-aware services (commits, leases, rebinds)
  – accept distributed (e.g. BASE) semantics
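A minimal sketch of "ask the cloud, receive multiple answers, let the application reconcile": here the application resolves divergent replica answers with a simple highest-version-wins rule. The answer type, versions, and values are invented; real systems may use vector clocks or application-specific merge logic.

```go
package main

import "fmt"

// A replica's answer to a read: a value plus a version (a timestamp or vector
// clock in a real system).
type answer struct {
	value   string
	version int
}

// reconcile is the application-level policy the slide describes: given
// several (possibly different) answers, resolve them here. This sketch uses
// "highest version wins"; an application might instead merge the values.
func reconcile(answers []answer) answer {
	best := answers[0]
	for _, a := range answers[1:] {
		if a.version > best.version {
			best = a
		}
	}
	return best
}

func main() {
	// Three replicas disagree because an update has not propagated everywhere.
	replies := []answer{
		{value: "cart: 2 items", version: 7},
		{value: "cart: 3 items", version: 8},
		{value: "cart: 2 items", version: 7},
	}
	fmt.Println("application sees:", reconcile(replies).value)
}
```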



How to Exploit a Cloud

• Replace physical machines w/virtual machines
  – cloud provides inexpensive elastic resources
• Run applications
  – requiring huge numbers of computers
• Massively distributed systems are very difficult
  – to design, build, maintain and manage
• How can we make exploiting parallelism easy?
  – new, tool supported, programming models
  – encapsulate complexity of distributed systems

Distributed Middle-Ware

• API adapters
  – e.g. HIVE (SQL bridge)
• complexity hiding
  – e.g. remote procedure calls, distributed objects
• restricted programming platforms
  – e.g. Applets, Erlang, state machines
• new programming models
  – e.g. publish-subscribe, MapReduce, Key-Value stores (see the sketch below)
• powerful distributed applications
  – e.g. search engines, Watson, Deep Learning
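As one example of the new programming models listed above, here is a minimal publish-subscribe sketch in Go. The broker, topic, and messages are invented; real middleware adds persistence, fan-out across machines, and delivery guarantees.

```go
package main

import "fmt"

// A tiny in-process publish-subscribe broker: middleware like this hides who
// the receivers are and how many of them there are.
type broker struct {
	subscribers map[string][]chan string
}

func newBroker() *broker {
	return &broker{subscribers: map[string][]chan string{}}
}

// subscribe registers interest in a topic and returns a channel of messages.
func (b *broker) subscribe(topic string) <-chan string {
	ch := make(chan string, 8)
	b.subscribers[topic] = append(b.subscribers[topic], ch)
	return ch
}

// publish delivers a message to every subscriber of the topic.
func (b *broker) publish(topic, msg string) {
	for _, ch := range b.subscribers[topic] {
		ch <- msg
	}
}

func main() {
	b := newBroker()
	alerts := b.subscribe("alerts")
	audit := b.subscribe("alerts")

	b.publish("alerts", "disk 3B failed")

	fmt.Println("dashboard got:", <-alerts)
	fmt.Println("audit log got:", <-audit)
}
```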


Distributed Systems – Summary

• different degrees of transparency
  – do applications see a network or a single system image
• different degrees of coupling
  – making multiple computers cooperate is difficult
  – doing it without shared memory is even worse
• performance vs. independence vs. robustness
  – cooperating redundant nodes offer higher availability
  – communication and coordination are expensive
  – mutual dependency creates more modes of failure

Changing Architectural Paradigms

• a "System" is a collection of services
  – interacting via stable and standardized protocols
  – implemented by app software deployed on nodes
• Operating Systems
  – manage the hardware on which the apps run
  – implement the services/ABIs the apps need
• The operating system is a platform
  – upon which higher level software can be built
  – goodness is measured by how well it does that job


What Operating Systems Do

• Originally (and at the start of this course)
  – abstract heterogeneous hardware into useful services
  – manage system resources for user-mode processes
  – ensure resource integrity and trusted resource sharing
  – provide a powerful platform for developers
• None of this has changed, but …
  – the notion of a self-contained system is becoming obsolete
  – hardware and OS heterogeneity is a given
  – the most important interfaces are higher level protocols
• Operating Systems continue to evolve as
  – new applications demand new services
  – new hardware must be integrated and exploited

Final Exam

• Wed 6/13, Boelter 3400
• Part I: 08:00-09:30
  – 10 questions, similar to mid-term
  – closed book
  – covering reading and lectures since mid-term
• Part II: 09:30-11:00
  – 3/6 questions, similar to mid-term XC
  – closed book
  – covering entire course



Supplementary Slides

The Dream

Programs don't run on hardware, they run atop operating systems. All the resources that processes see are already virtualized. Instead of merely virtualizing all the resources in a single system, virtualize all the resources in a cluster of systems. Applications that run in such a cluster are (automatically and transparently) distributed.

[Figure: several physical systems, each with its own processes, locks, devices, and disks, are combined into one virtual HA computer with 4x the MIPS and memory, one global pool of devices and processes, and one large HA virtual file system built from primary copies and secondary replicas of the disks.]

Structure of a Modern OS

[Figure: the layered structure of a modern OS, from the system call interfaces and user-visible OS model (file namespace, authorization, file I/O, IPC, process/thread, exception, and synchronization models), down through higher level services (file systems, run-time loader, configuration, fault management, quality of service, transport protocols, volume management, hot-plug, logging and tracing), I/O class drivers and device drivers, kernel services (thread dispatching, memory allocation, segments, synchronization, context switching, resource allocation, kernel debugger), to the hardware abstraction layer (interrupts, traps, timers, atomic operations, processor mode, memory mapping, cache management, DMA).]

Transparency

• Ideally, a distributed system would be just like a single machine system
• But better
  – More resources
  – More reliable
  – Faster
• Transparent distributed systems look as much like single machine systems as possible


Loosely Coupled Scalability
(Beowulf High Performance Computing Cluster)

[Figure: a Beowulf Head Node (with an NFS server and MPI interface) distributes programs and data over NFS to a row of Beowulf Slave Nodes, whose MPI sub-tasks exchange information with one another through the Message Passing Interface.]

There is no effort at transparency here. Applications are specifically written for a parallel execution platform and use a Message Passing Interface to mediate exchanges between cooperating computations.

Example: A Hadoop Cluster

[Figure: a client submits jobs to and receives results from a name node/job tracker, which coordinates task trackers on many task nodes; each task node is also an HDFS data node, data blocks are replicated among the data nodes, and reduce tasks collect the map outputs from the HDFS cluster.]



Hadoop (Hadoop Distributed File System)

• Client
  – uploads data into an HDFS cluster
  – creates a map/reduce data analysis program
  – submits the job to a Hadoop Job Tracker
• Job Tracker
  – sends sub-tasks to task trackers on many nodes
• Task Trackers
  – spawn/monitor map/reduce tasks, collect status
• Map/Reduce Tasks
  – run the analysis program on a defined data sub-set
  – reading data from/writing results to the HDFS cluster

Embarrassingly Parallel Jobs

• Problems where it's really, really easy to parallelize them
• Probably because the data sets are easily divisible
• And exactly the same things are done on each piece
• So you just parcel them out among the nodes and let each go independently
• Everyone finishes at more or less the same time


MapReduce

• Perhaps the most common cloud computing software tool/technique
• A method of dividing large problems into compartmentalized pieces
• Each of which can be performed on a separate node
• With an eventual combined set of results

The Idea Behind MapReduce

• There is a single function you want to perform on a lot of data
  – Such as searching it for a string
• Divide the data into disjoint pieces
• Perform the function on each piece on a separate node (map)
• Combine the results to obtain output (reduce)

An Example

• We have 64 megabytes of text data
• Divide it into 4 chunks of 16 Mbytes
• Assign each chunk to one processor
• Perform the map function of "count words" on each

The Example Continued

• Count how many times each word occurs in the text

[Figure: each of the 4 chunks produces its own word counts.
Chunk 1: Foo 1, Bar 4, Baz 3, Zoo 6, Yes 12, Too 5
Chunk 2: Foo 7, Bar 3, Baz 9, Zoo 1, Yes 17, Too 8
Chunk 3: Foo 2, Bar 6, Baz 2, Zoo 2, Yes 10, Too 4
Chunk 4: Foo 4, Bar 7, Baz 5, Zoo 9, Yes 3, Too 7]

• That's the map stage
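The word-count example can be sketched end to end in code. A minimal Go sketch, assuming nothing about Hadoop itself: the tiny chunk strings and the single reducer are invented stand-ins for the 64 MB / 4-chunk example on the slide.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// mapPhase counts words in one chunk -- the "count words" map function the
// example runs on each 16 MB piece (the chunks here are tiny stand-ins).
func mapPhase(chunk string) map[string]int {
	counts := map[string]int{}
	for _, w := range strings.Fields(chunk) {
		counts[w]++
	}
	return counts
}

// reducePhase combines the per-chunk counts it was given into one table.
func reducePhase(shares []map[string]int) map[string]int {
	out := map[string]int{}
	for _, share := range shares {
		for w, c := range share {
			out[w] += c
		}
	}
	return out
}

func main() {
	chunks := []string{
		"foo bar bar baz foo",
		"baz baz foo bar",
		"foo foo bar baz",
		"bar foo baz baz",
	}

	// Map stage: one worker per chunk, running in parallel.
	partial := make([]map[string]int, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			partial[i] = mapPhase(c)
		}(i, c)
	}
	wg.Wait()

	// Reduce stage: here a single reducer merges every map output; with two
	// reducers, each would receive a disjoint share of the words.
	fmt.Println(reducePhase(partial))
}
```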


On To Reduce

• We might have two more nodes assigned to doing the reduce operation
• They will each receive a share of data from a map node
• The reduce node performs a reduce operation to "combine" the shares
• Outputting its own result

Continuing the Example

[Figure: the four per-chunk count tables from the map stage (the Foo/Bar/Baz and Zoo/Yes/Too counts above) are partitioned by word, and each reduce node receives its share from every map node.]

The Reduce Nodes Do Their Job

• Write out the results to files
• And MapReduce is done!

[Figure: the reduce outputs. One reduce node produces Foo 14, Bar 20, Baz 19; the other produces Zoo 16, Yes 42, Too 24.]

But I Wanted A Combined List

• No problem
• Run another (slightly different) MapReduce on the outputs
• Have one reduce node that combines everything

Synchronization in MapReduce

• Each map node produces an output file for each reduce node
• It is produced atomically
• The reduce node can't work on this data until the whole file is written
• Forcing a synchronization point between the map and reduce phases
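One common way to get the "whole file appears atomically" behaviour described above is to write to a temporary name and then rename the file into place. A minimal Go sketch; this is a generic technique, not Hadoop's actual implementation, and the directory and file names are invented.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAtomically writes to a temporary name and then renames it into place,
// so a reducer never observes a half-written output file.
func writeAtomically(dir, name string, data []byte) error {
	tmp := filepath.Join(dir, name+".tmp")
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	// On POSIX systems, rename within one file system is atomic.
	return os.Rename(tmp, filepath.Join(dir, name))
}

func main() {
	dir, err := os.MkdirTemp("", "mapout")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	if err := writeAtomically(dir, "map-0-for-reducer-1", []byte("foo 14\nbar 20\n")); err != nil {
		panic(err)
	}
	entries, _ := os.ReadDir(dir)
	for _, e := range entries {
		fmt.Println("reducer sees:", e.Name()) // only the complete, renamed file
	}
}
```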
