Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica
Systems for Resource Management
Course: Sistemi e Architetture per Big Data, A.Y. 2016/17
Valeria Cardellini
The reference Big Data stack
• High-level Interfaces
• Data Processing
• Data Storage
• Resource Management
(plus a cross-cutting Support / Integration layer)
Matteo Nardelli - SABD 2016/17

Outline
• Cluster management systems
  – Mesos
  – YARN
  – Borg
  – Omega
  – Kubernetes
• Resource management policies
  – DRF
Motivations
• Rapid innovation in cloud computing
• No single framework optimal for all applications
• Running each framework on its own dedicated cluster is:
  – Expensive
  – Makes it hard to share data

A possible solution
• Run multiple frameworks on a single cluster
• How to share the (virtual) cluster resources among multiple, non-homogeneous frameworks executed in virtual machines/containers?
• The classical solution: static partitioning
• Is it efficient?
What we need
• The "Datacenter as a Computer" idea (D. Patterson)
  – Share resources to maximize their utilization
  – Share data among frameworks
  – Provide a unified API to the outside
  – Hide the internal complexity of the infrastructure from applications
• The solution: a cluster-scale resource manager that employs dynamic partitioning

Apache Mesos
• Cluster manager that provides a common resource-sharing layer over which diverse frameworks can run
• Abstracts the entire datacenter into a single pool of computing resources, simplifying the task of running distributed systems at scale
• A distributed system to run distributed systems on top of it
• Dynamic partitioning

Apache Mesos (2)
• Designed and developed at the University of California, Berkeley; now a top-level open-source Apache project: mesos.apache.org
• Twitter and Airbnb were among the first users; today it supports some of the largest applications in the world
• Cluster: a dynamically shared pool of resources (static vs dynamic partitioning)

Mesos goals
• High utilization of resources
• Supports diverse frameworks (current and future)
• Scalability to 10,000s of nodes
• Reliability in the face of failures
Mesos in the data center
• Where does Mesos fit as an abstraction layer in the datacenter?
Computation model
• A framework (e.g., Hadoop, Spark) manages and runs one or more jobs
• A job consists of one or more tasks
• A task (e.g., map, reduce) consists of one or more processes running on the same machine
What Mesos does
• Enables fine-grained sharing of resources (CPU, RAM, …) across frameworks, at the level of tasks within a job
• Provides common functionality:
  – Failure detection
  – Task distribution
  – Task starting
  – Task monitoring
  – Task killing
  – Task cleanup

Fine-grained sharing
• Allocation at the level of tasks within a job
• Improves utilization, latency, and data locality
• Coarse-grained vs fine-grained sharing
Frameworks on Mesos
• Frameworks must be aware that they are running on Mesos
  – DevOps tooling: Vamp (deployment and workflow tool for container orchestration)
  – Long-running services: Aurora (service scheduler), …
  – Big Data processing: Hadoop, Spark, Storm, MPI, …
  – Batch scheduling: Chronos, …
  – Data storage: Alluxio, Cassandra, Elasticsearch, …
  – Machine learning: TFMesos (TensorFlow in Docker on Mesos)
• See the full list at mesos.apache.org/documentation/latest/frameworks/

Mesos: architecture
• Master-slave architecture
• Slaves publish their available resources to the master
• Master sends resource offers to frameworks
• Master election and service discovery via ZooKeeper
Source: Mesos: a platform for fine-grained resource sharing in the data center, NSDI'11

Mesos component: Apache ZooKeeper
• Coordination service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
• Used in many distributed systems, including Mesos, Storm, and Kafka
• Allows distributed processes to coordinate with each other through a shared hierarchical namespace of data nodes (znodes)
  – File-system-like API
  – Namespace similar to that of a standard file system
  – Limited amount of data stored in each znode
  – It is not really a file system, database, key-value store, or lock service
• Provides high-throughput, low-latency, highly available, strictly ordered access to the znodes

Mesos component: ZooKeeper (2)
• Replicated over a set of machines that maintain an in-memory image of the data tree
  – Read requests are processed locally by each ZooKeeper server
  – Write requests are forwarded to the other ZooKeeper servers, which reach consensus before a response is generated (primary-backup system)
  – Uses a Paxos-like protocol (Zab) for leader election to determine which server is the leader
• Implements atomic broadcast
  – Processes deliver the same messages (agreement) and deliver them in the same order (total order)
  – Message = state update
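The hierarchical znode namespace can be used for leader election: each contender creates an ephemeral sequential znode, and whoever holds the lowest sequence number leads. The following toy in-memory model sketches this idea for intuition only; the class and method names are illustrative, and a real client would use a library such as kazoo against a live ZooKeeper ensemble (ephemerality and watches are omitted here).

```python
# Toy model of ZooKeeper's hierarchical znode namespace, illustrating
# leader election via sequential znodes. Not the real client API.

class ZNodeTree:
    def __init__(self):
        self.nodes = {}   # path -> data
        self.seq = 0      # global sequence counter (simplified)

    def create(self, path, data=b"", sequential=False):
        # Sequential nodes get a monotonically increasing suffix
        if sequential:
            path = f"{path}{self.seq:010d}"
            self.seq += 1
        self.nodes[path] = data
        return path

    def children(self, path):
        # Direct children of a path, in sorted (i.e., sequence) order
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

def elect_leader(tree, contenders):
    # Each contender creates a sequential znode under /election;
    # the one holding the lowest sequence number is the leader.
    paths = {c: tree.create("/election/n-", sequential=True) for c in contenders}
    lowest = min(tree.children("/election"))
    return next(c for c, p in paths.items() if p.endswith(lowest))

tree = ZNodeTree()
leader = elect_leader(tree, ["master-a", "master-b", "master-c"])
print(leader)  # master-a: it created the lowest-numbered znode
```

This is the same pattern the Mesos masters rely on (via ZooKeeper) to agree on a single leading master.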
Mesos and framework components
• Mesos components
  – Master
  – Slaves (agents)
• Framework components
  – Scheduler: registers with the master to be offered resources
  – Executors: launched on agent nodes to run the framework's tasks

Scheduling in Mesos
• Scheduling mechanism based on resource offers
  – Mesos offers the available resources to the frameworks
• Each resource offer contains a list of free resources on multiple slaves
Mesos: resource offers
• Resource allocation is based on Dominant Resource Fairness (DRF)

Mesos: resource offers in detail
• Slaves continuously send status updates about their resources to the master
• The framework scheduler can reject the offers it does not need
• The framework scheduler selects resources and provides tasks; the master sends the tasks to the slaves
• Framework executors launch the tasks
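The offer cycle above can be sketched as a toy master/framework interaction: the master turns each slave's free resources into an offer, and the framework scheduler accepts what fits its tasks and implicitly rejects the rest. All names and structures here are illustrative, not the actual Mesos API.

```python
# Toy sketch of Mesos's offer cycle: offers carry a slave's free
# resources; the framework packs as many of its tasks as fit.

slaves = {"s1": {"cpus": 4, "mem_gb": 8}, "s2": {"cpus": 2, "mem_gb": 16}}

def make_offers(slaves):
    # One offer per slave: (slave_id, free resources)
    return [(sid, dict(res)) for sid, res in slaves.items()]

class Framework:
    def __init__(self, cpus_per_task, mem_per_task):
        self.need = {"cpus": cpus_per_task, "mem_gb": mem_per_task}
        self.launched = []

    def resource_offers(self, offers):
        # Accept offers while a task still fits; leftover resources
        # are effectively declined back to the master.
        tasks = []
        for slave_id, free in offers:
            while all(free[r] >= self.need[r] for r in self.need):
                for r in self.need:
                    free[r] -= self.need[r]
                tasks.append(slave_id)
        self.launched.extend(tasks)
        return tasks

fw = Framework(cpus_per_task=2, mem_per_task=4)
tasks = fw.resource_offers(make_offers(slaves))
print(tasks)  # ['s1', 's1', 's2']: s1 fits two such tasks, s2 one
```

Note how the placement decision (which offer to use for which task) lives entirely in the framework, which is exactly the two-level design discussed later.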
Mesos fault tolerance
• Task failure
• Slave failure
• Host or network failure
• Master failure
• Framework scheduler failure
Fault tolerance: task failure

Fault tolerance: slave failure

Fault tolerance: host or network failure

Fault tolerance: master failure
• When the leading master fails, the surviving masters use ZooKeeper to elect a new leader
• The slaves and frameworks use ZooKeeper to detect the new leader and re-register

Fault tolerance: framework scheduler failure
• When a framework scheduler fails, another instance can re-register with the master without interrupting any of the running tasks

Resource allocation
1. How to assign the cluster resources to the tasks?
   – Design alternatives:
     • Global (monolithic) scheduler
     • Two-level scheduler
2. How to allocate resources of different types?
Global (monolithic) scheduler
• Inputs:
  – Job requirements: response time, throughput, availability
  – Job execution plan: task directed acyclic graph (DAG), inputs/outputs
  – Estimates: task duration, input sizes, transfer sizes
• Pros:
  – Can achieve an optimal schedule (global knowledge)
• Cons:
  – Complexity: hard to scale and to ensure resilience
  – Hard to anticipate the requirements of future frameworks
  – Need to refactor existing frameworks

Two-level scheduling in Mesos
• Resource offer
  – Vector of the available resources on a node
  – E.g., node1: <1CPU, 1GB>, node2: <4CPU, 16GB>
• Master sends resource offers to the frameworks
• Frameworks select which offers to accept and which tasks to run
  – Pushes task placement to the frameworks
• Pros:
  – Simple: easier to scale and make resilient
  – Easy to port existing frameworks and support new ones
• Cons:
  – Distributed scheduling decision: not optimal
Mesos: resource allocation
• Based on Dominant Resource Fairness (DRF) algorithm
DRF: background on fair sharing
• Consider a single resource: fair sharing
  – n users want to share a resource, e.g., CPU
  – Solution: allocate each user 1/n of the shared resource
• Generalized by max-min fairness
  – Handles the case in which a user wants less than its fair share
  – E.g., user 1 wants no more than 20%
• Generalized by weighted max-min fairness
  – Gives weights to users according to their importance
  – E.g., user 1 gets weight 1, user 2 weight 2
Max-min fairness
• 1 resource type: CPU
• Total resources: 20 CPU
• User 1 has x tasks and wants <1CPU> per task
• User 2 has y tasks and wants <2CPU> per task

max(x, y)        (maximize allocation)
subject to:
  x + 2y ≤ 20    (CPU constraint)
  x = 2y         (fairness: equal CPU shares)

Solution: x = 10, y = 5
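The same solution can be reached without solving the program explicitly, via progressive filling: repeatedly grant one task's worth of capacity to the user whose current allocation is smallest. A minimal sketch (the function name is ours, not from any scheduler API):

```python
# Progressive-filling sketch of max-min fairness for a single resource.
# With user 1 needing <1 CPU>/task and user 2 needing <2 CPU>/task on
# 20 CPUs, it converges to x = 10 and y = 5 tasks, as in the slide.

def max_min_tasks(capacity, demands):
    """demands: per-task resource need of each user; returns task counts."""
    tasks = [0] * len(demands)
    used = 0
    while True:
        # Users whose next task still fits in the remaining capacity
        candidates = [u for u in range(len(demands))
                      if used + demands[u] <= capacity]
        if not candidates:
            return tasks
        # Grant a task to the user with the smallest current allocation
        u = min(candidates, key=lambda u: tasks[u] * demands[u])
        tasks[u] += 1
        used += demands[u]

print(max_min_tasks(20, [1, 2]))  # [10, 5]
```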
Why is fair sharing useful?
• Proportional allocation
  – User 1 gets weight 2, user 2 weight 1
• Priorities
  – Give user 1 weight 1000, user 2 weight 1
• Reservations
  – Ensure user 1 gets 10% of a resource: give user 1 weight 10, with weights summing to 100
• Isolation policy
  – Users cannot affect others beyond their fair share
Why is fair sharing useful? (2)
• Share guarantee
  – Each user can get at least 1/n of the resource
  – But will get less if its demand is smaller
• Strategy-proofness
  – Users are not better off asking for more than they need
  – Users have no reason to lie
• Max-min fairness is the only reasonable mechanism with these two properties
• Many schedulers use max-min fairness
  – OS schedulers, networking, datacenters (e.g., YARN), …

Max-min fairness drawback
• When is max-min fairness not enough?
• When we need to schedule multiple, heterogeneous resources (CPU, memory, disk, I/O)
• Single-resource example
  – 1 resource: CPU
  – User 1 wants <1CPU> per task
  – User 2 wants <2CPU> per task
• Multi-resource example
  – 2 resources: CPU and memory
  – User 1 wants <1CPU, 4GB> per task
  – User 2 wants <3CPU, 1GB> per task
• What is a fair allocation?
A first (wrong) solution
• Asset fairness: give weights to resources and equalize the total value allocated to each user
• Total resources: 28 CPU and 56 GB RAM (so let 1 CPU = 2 GB)
  – User 1 has x tasks and wants <1CPU, 2GB> per task (value 4 per task)
  – User 2 has y tasks and wants <1CPU, 4GB> per task (value 6 per task)
• Asset fairness yields:
  max(x, y)
  x + y ≤ 28      (CPU constraint)
  2x + 4y ≤ 56    (RAM constraint)
  4x = 6y         (equal value)
• User 1: x = 12, i.e., <43% CPU, 43% RAM> (sum = 86%)
• User 2: y = 8, i.e., <29% CPU, 57% RAM> (sum = 86%)
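The numbers can be checked by brute force: search the feasible integer task counts for the largest allocation in which both users receive equal total asset value. The variable names below are ours for illustration.

```python
# Numerical check of the asset-fairness example: 28 CPUs, 56 GB RAM,
# with 1 CPU valued as 2 GB. The search reproduces x = 12, y = 8
# and shows user 1 ends up below 50% of both resources.

CPUS, RAM = 28, 56
D1, D2 = (1, 2), (1, 4)                 # per-task <CPU, GB> demands
value = lambda cpu, gb: 2 * cpu + gb    # asset value: 1 CPU = 2 GB

best = (0, 0)
for x in range(CPUS + 1):
    for y in range(CPUS + 1):
        fits = (x * D1[0] + y * D2[0] <= CPUS and
                x * D1[1] + y * D2[1] <= RAM)
        equal = value(x * D1[0], x * D1[1]) == value(y * D2[0], y * D2[1])
        if fits and equal and x + y > sum(best):
            best = (x, y)

x, y = best
print(x, y)                                 # 12 8
print(x * D1[0] / CPUS, x * D1[1] / RAM)    # user 1: ~0.43 of CPU and of RAM
```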
A first (wrong) solution (2)
• Problem: asset fairness violates the share guarantee
  – User 1 gets less than 50% of both CPU and RAM
  – User 1 would be better off in a separate cluster with half the resources
What Mesos needs
• A fair sharing policy that provides: – Share guarantee – Strategy-proofness
• Challenge: can we generalize max-min fairness to multiple resources?
• Solution: Dominant Resource Fairness (DRF)
Source: Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI'11
DRF
• Dominant resource of a user: the resource of which the user has the biggest share
  – Example:
    • Total resources: <8CPU, 5GB>
    • User 1 allocation: <2CPU, 1GB>
    • 2/8 = 25% CPU and 1/5 = 20% RAM
    • The dominant resource of user 1 is CPU (25% > 20%)
• Dominant share of a user: the fraction of the dominant resource allocated to the user
  – User 1's dominant share is 25%
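The two definitions translate directly into a few lines of code. This sketch reproduces the example above (total <8CPU, 5GB>, user 1 holding <2CPU, 1GB>); the function name is ours.

```python
# Computing a user's dominant resource and dominant share.

def dominant_share(allocation, total):
    """Return (resource, share) for the user's largest fractional share."""
    shares = {r: allocation[r] / total[r] for r in total}
    resource = max(shares, key=shares.get)
    return resource, shares[resource]

total = {"cpu": 8, "ram_gb": 5}
user1 = {"cpu": 2, "ram_gb": 1}
print(dominant_share(user1, total))  # ('cpu', 0.25): CPU, since 25% > 20%
```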
DRF (2)
• Apply max-min fairness to dominant shares: equalize the dominant share of the users
  – Total resources: <9CPU, 18GB>
  – User 1 wants <1CPU, 4GB> per task; dominant resource: RAM (1/9 < 4/18)
  – User 2 wants <3CPU, 1GB> per task; dominant resource: CPU (3/9 > 1/18)
  max(x, y)
  x + 3y ≤ 9          (CPU constraint)
  4x + y ≤ 18         (RAM constraint)
  (4/18)x = (3/9)y    (equal dominant shares)
• User 1: x = 3 tasks, i.e., <33% CPU, 66% RAM>
• User 2: y = 2 tasks, i.e., <66% CPU, 11% RAM>

Online DRF
• Whenever there are available resources and tasks to run:
  – Present resource offers to the framework with the smallest dominant share among all frameworks
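The online rule above can be simulated directly: repeatedly launch one task for the user with the smallest dominant share until nothing more fits. On the slide's example it reproduces the x = 3, y = 2 allocation. A sketch with illustrative names:

```python
# Online DRF: always serve the user with the smallest dominant share.
# Example: total <9CPU, 18GB>; user 1 needs <1CPU, 4GB> per task,
# user 2 needs <3CPU, 1GB> per task.

def drf(total, demands):
    used = [0.0] * len(total)
    tasks = [0] * len(demands)
    alloc = [[0.0] * len(total) for _ in demands]
    while True:
        # Users whose next task still fits in the cluster
        fits = [u for u, d in enumerate(demands)
                if all(used[r] + d[r] <= total[r] for r in range(len(total)))]
        if not fits:
            return tasks
        # Pick the user with the smallest dominant share
        u = min(fits, key=lambda u: max(alloc[u][r] / total[r]
                                        for r in range(len(total))))
        for r, need in enumerate(demands[u]):
            used[r] += need
            alloc[u][r] += need
        tasks[u] += 1

print(drf(total=(9, 18), demands=[(1, 4), (3, 1)]))  # [3, 2]
```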
DRF: efficiency-fairness trade-off
• DRF can leave resources under-utilized
• DRF schedules at the level of tasks, which can lead to sub-optimal job completion times
• Fairness is fundamentally at odds with overall efficiency: how to trade them off?

Mesos and containers
• Mesos plays well with existing container technologies (e.g., Docker) and also provides its own container technology
  – Docker containers: tasks run inside a Docker container
  – Mesos containers: tasks run with an array of pluggable isolators provided by Mesos
• Also supports composing different container technologies (e.g., Docker and Mesos)
• At what scale? Up to 50,000 live containers
Mesos deployment and evolution
• Mesos clusters can be deployed
  – On nearly every IaaS cloud provider's infrastructure
  – In your own physical datacenter
• The Mesos platform is evolving towards microservices
• Mesosphere's Datacenter Operating System (DC/OS)
  – Open-source operating system and distributed system built upon Mesos
• Marathon
  – Container orchestration platform for Mesos and DC/OS
Apache Hadoop YARN
• YARN: Yet Another Resource Negotiator
  – A framework for job scheduling and cluster resource management
• Turns Hadoop into an analytics platform in which resource management functions are separated from the programming model
  – Many scheduling-related functions are delegated to per-job components
• Why this separation?
  – Provides flexibility in the choice of programming framework
  – The platform can support not only MapReduce but also other models (e.g., Spark, Storm)
Source: Apache Hadoop YARN: Yet Another Resource Negotiator, SoCC'13
YARN architecture
• A global ResourceManager (RM)
• A set of per-application ApplicationMasters (AMs)
• A set of NodeManagers (NMs)
YARN architecture: ResourceManager
• One RM per cluster
  – Central: global view
  – Enables global properties: fairness, capacity, locality
• Job requests are submitted to the RM
  – Request-based approach
  – To start a job (application), the RM finds a container in which to spawn the AM
• Container
  – Logical bundle of resources (CPU, memory)
• The RM only handles an overall resource profile for each application
  – Local optimization is up to the application
• Preemption
  – The RM can request resources back from an application
  – Checkpoint snapshots instead of explicitly killing jobs, or migration of the computation to other containers
YARN architecture: ResourceManager (2)
• No static resource partitioning
• The RM is the resource scheduler of the system
  – Organizes the global assignment of computing resources to application instances
• Composed of a Scheduler and an ApplicationsManager
  – Scheduler: responsible for allocating resources to the various running applications
    • Pluggable scheduling policy to partition the cluster resources among the applications
    • Available pluggable schedulers: CapacityScheduler and FairScheduler
  – ApplicationsManager: accepts job submissions and negotiates the first set of resources for job execution
YARN architecture: ApplicationMaster
• One AM per application
  – The "head" of a job
  – Itself runs as a container
• Issues resource requests to the RM to obtain containers
  – Number of containers, resources per container, locality preferences, …
• Dynamically changes its resource consumption based on the containers it receives from the RM
• Requests are late-binding
  – The process spawned is not bound to the request, but to the lease
  – The conditions that caused the AM to issue the request may no longer hold when it receives its resources
• Can run any user code, e.g., MapReduce, Spark, etc.
• The AM determines the semantics of the success or failure of a container
YARN architecture: NodeManager
• One NM per node
  – The worker daemon
  – Runs as a local agent on the execution node
• Registers with the RM, heartbeats its status, and receives instructions
• Reports its resources to the RM: memory, CPU, …
  – Responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting and killing containers)
• Containers are described by a container launch context (CLC)
  – The command necessary to create the process
  – Environment variables, security tokens, …

YARN architecture: NodeManager (2)
• Configures the environment for task execution
• Garbage collection
• Auxiliary services
  – A process may produce data that persists beyond the life of the container
  – E.g., intermediate data output between map and reduce tasks
YARN: application startup
• Submitting the application: the client passes a CLC for the AM to the RM
• The RM launches the AM, which registers with the RM
• The AM periodically advertises its liveness and requirements over the heartbeat protocol
• Once the RM allocates a container, the AM can construct a CLC to launch the container on the corresponding NM
  – The AM monitors the status of the running container and stops it when the resource should be reclaimed
• Once the AM is done with its work, it unregisters from the RM and exits cleanly
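The startup sequence above can be traced with a toy model in which RM, AM, and NM append to a shared event log. The classes and messages here are illustrative only, not the Hadoop API; the point is the order of interactions.

```python
# Toy walkthrough of the YARN application startup sequence.

log = []

class ResourceManager:
    def submit(self, am_clc):
        # Client submits a CLC describing the AM
        log.append("client: submit CLC for AM")
        return ApplicationMaster(self)

    def allocate(self, request):
        log.append(f"RM: allocate container for {request}")
        return "container-1"

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm
        log.append("RM: launch AM; AM registers")

    def run(self):
        log.append("AM: heartbeat (liveness + requirements)")
        container = self.rm.allocate("<1 CPU, 2GB>")
        NodeManager().launch(container)
        log.append("AM: unregister and exit")

class NodeManager:
    def launch(self, container):
        # The AM's CLC tells the NM how to start the process
        log.append(f"NM: launch {container} from CLC")

ResourceManager().submit("clc").run()
print(log)
```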
Comparing Mesos and YARN
Mesos                                                    | YARN
Two-level scheduler                                      | Two-level scheduler (limited)
Offer-based approach                                     | Request-based approach
Pluggable scheduling policy (DRF as default)             | Pluggable scheduling policy
Framework gets a resource offer to choose from           | Framework asks for a container with a
(minimal information)                                    | specification + preferences (lots of information)
General-purpose scheduler (including stateful services)  | Designed and optimized for Hadoop jobs (stateless batch jobs)
Multi-dimensional resource granularity                   | RAM/CPU slot resource granularity
Multiple masters                                         | Single ResourceManager
Written in C++                                           | Written in Java
Linux cgroups (performance isolation)                    | Simple Linux processes
Cluster resource management: Hot research topic
• Borg @ EuroSys 2015
  – Cluster management system used by Google for the last decade, ancestor of Kubernetes
• Firmament @ OSDI 2016 (firmament.io)
  – Scheduler (with a Kubernetes plugin called Poseidon) that balances the high-quality placement decisions of a centralized scheduler with the speed of a distributed scheduler
• Slicer @ OSDI 2016
  – Google's auto-sharding service that separates out the responsibility for sharding, balancing, and re-balancing from individual application frameworks
Container orchestration
• Container orchestration: the set of operations that allows cloud and application providers to define how to select, deploy, monitor, and dynamically control the configuration of multi-container packaged applications in the cloud
  – The tool used by organizations adopting containers for enterprise production to integrate and manage those containers at scale
• Some examples
  – Cloudify
  – Kubernetes
  – Docker Swarm
Container-management systems at Google
• Application-oriented shift: "Containerization transforms the data center from being machine-oriented to being application-oriented"
• Goal: allow container technology to operate at Google scale
  – Everything at Google runs as a container
• Borg -> Omega -> Kubernetes
  – Borg and Omega: purely Google-internal systems
  – Kubernetes: open source
Borg
Cluster management system at Google that achieves high utilization by:
• Admission control
• Efficient task-packing
• Over-commitment
• Machine sharing
The user perspective
• Users: mainly Google developers and system administrators
• Heterogeneous workload: mainly production and batch jobs
• Cell: a subset of the cluster machines
  – Around 10K nodes
  – Machines in a cell are heterogeneous in many dimensions
• Jobs and tasks
  – A job is a group of tasks
  – Each task maps to a set of Linux processes running in a container on a machine
  – Task = container

The user perspective (2)
• Alloc (allocation)
  – Reserved pool of resources on a machine in which one or more tasks can run
  – The smallest deployable unit that can be created, scheduled, and managed
  – Task = alloc
  – Job = alloc set
• Priority, quota, and admission control
  – Each job has a priority (used for preemption)
  – Quota is used to decide which jobs to admit for scheduling
• Naming and monitoring
  – A service's clients and other systems need to be able to find its tasks, even after relocation to a new machine
  – BNS example: 50.jfoo.ubar.cc.borg.google.com
  – Monitoring of task health and thousands of performance metrics
Scheduling a job
job hello_world = {
  runtime = { cell = "ic" }            // which cell should run it?
  binary = '../hello_world_webserver'  // what program to run?
  args = { port = '%port%' }
  requirements = {
    RAM = 100M
    disk = 100M
    CPU = 0.1
  }
  replicas = 10000
}
Borg architecture
• Borgmaster
  – Main Borgmaster process (five replicas)
  – Scheduler
• Borglets
  – Manage and monitor tasks and resources
  – The Borgmaster polls the Borglets every few seconds

Borg architecture (2)
• Fauxmaster: a high-fidelity Borgmaster simulator
  – Simulates previous runs from checkpoints
  – Contains the full Borg code
  – Used for debugging, capacity planning, and evaluating new policies and algorithms

Scheduling in Borg
• Two-phase scheduling algorithm
  1. Feasibility checking: find machines that can run a given job
  2. Scoring: pick one machine
     – Based on user preferences and built-in criteria:
       • Minimize the number and priority of preempted tasks
       • Prefer machines that already have a copy of the task's packages
       • Spread tasks across power and failure domains
       • Pack by mixing high- and low-priority tasks
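The two phases can be sketched as filter-then-rank. The scoring criterion below (prefer machines that already cache the task's packages, then pick the tightest fit) is a simplified stand-in for Borg's real built-in criteria; all names and fields are illustrative.

```python
# Sketch of Borg-style two-phase scheduling: feasibility check, then scoring.

machines = [
    {"name": "m1", "free_cpu": 1.0, "cached_pkgs": set()},
    {"name": "m2", "free_cpu": 4.0, "cached_pkgs": {"hello_world"}},
    {"name": "m3", "free_cpu": 2.0, "cached_pkgs": {"hello_world"}},
]

def schedule(task_cpu, package, machines):
    # Phase 1: feasibility checking -- which machines can host the task?
    feasible = [m for m in machines if m["free_cpu"] >= task_cpu]
    if not feasible:
        return None
    # Phase 2: scoring -- package locality first, then tightest fit
    def score(m):
        return (package in m["cached_pkgs"], -(m["free_cpu"] - task_cpu))
    return max(feasible, key=score)["name"]

print(schedule(1.5, "hello_world", machines))  # m3: package cached, tighter fit
```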
Borg’s allocation policies
• Advanced bin-packing algorithms
  – Avoid stranding of resources
  – Bin packing problem: objects of different volumes v_i must be packed into a finite number of bins, each of volume V, in a way that minimizes the number of bins used
  – Bin packing is NP-hard -> many heuristics, including best fit, first fit, and worst fit
• Evaluation metric: cell compaction
  – Find the smallest cell that the workload can be packed into…
  – Remove machines randomly from a cell to maintain cell heterogeneity
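The first-fit and best-fit heuristics mentioned above can be sketched for one-dimensional "volumes" (real task packing in Borg is multi-dimensional); both place each item into an existing bin when possible, opening a new bin otherwise, and differ only in which candidate bin they choose.

```python
# First-fit vs best-fit bin packing (1-D sketch with integer volumes).

def pack(items, capacity, choose):
    bins = []  # each entry is a bin's remaining free capacity
    for item in items:
        candidates = [i for i, free in enumerate(bins) if free >= item]
        if candidates:
            i = choose(candidates, bins)
            bins[i] -= item
        else:
            bins.append(capacity - item)  # open a new bin
    return len(bins)

first_fit = lambda cands, bins: cands[0]                       # earliest bin
best_fit = lambda cands, bins: min(cands, key=lambda i: bins[i])  # tightest bin

items = [5, 7, 5, 2, 4, 2, 5, 1, 6]
print(pack(items, 10, first_fit), pack(items, 10, best_fit))  # 5 5
```

On this particular workload both heuristics happen to need five bins; in general the two can diverge, which is why Borg evaluates packing policies empirically via cell compaction.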
Valeria Cardellini - SABD 2016/17 79 Design choices for scalability in Borg
• Separate scheduler
• Separate threads to poll the Borglets
• Functions partitioned across the five replicas
• Score caching
  – Evaluating feasibility and scoring a machine is expensive, so scores are cached
• Equivalence classes
  – Group tasks with identical requirements
• Relaxed randomization
  – It is wasteful to calculate feasibility and scores for all the machines in a large cell
Borg ecosystem
• Many different systems have been built in, on, and around Borg to improve upon its basic container-management services
  – Naming and service discovery (the Borg Name Service, or BNS)
  – Master election, using Chubby
  – Application-aware load balancing
  – Horizontal (instance count) and vertical (instance size) auto-scaling
  – Rollout tools for deployment of new binaries and configuration data
  – Workflow tools
  – Monitoring tools
Omega
• Offspring of Borg
• The state of the cluster is stored in a centralized, Paxos-based, transaction-oriented store
  – Accessed by the different parts of the cluster control plane (such as schedulers), using optimistic concurrency control to handle the occasional conflicts
Kubernetes (K8s)
• Google's open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure
  – kubernetes.io
  – kubernetes.io/docs/tutorials/kubernetes-basics/
• Features:
  – Portable: public, private, hybrid, and multi-cloud
  – Extensible: modular, pluggable, hookable, composable
  – Self-healing: auto-placement, auto-restart, auto-replication, auto-scaling
Borg and Kubernetes
• Loosely inspired by Borg
Directly derived from Borg:
• Borglet => Kubelet
• alloc => pod
• Borg containers => Docker
• Declarative specifications

Improved over Borg:
• Job => labels
• Managed ports => IP per pod
• Monolithic master => microservices
Pod
• Pod: the basic unit in Kubernetes
  – Group of containers that are co-scheduled
• A resource envelope in which one or more containers run
  – Containers that are part of the same pod are guaranteed to be scheduled together onto the same machine, and can share state via local volumes
• Users organize pods using labels
  – Label: arbitrary key/value pair attached to a pod
  – E.g., role=frontend and stage=production
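Label-based selection is simple to sketch: a selector matches every pod whose labels contain all of the selector's key/value pairs (the behaviour of Kubernetes equality-based selectors, modeled here with plain dicts rather than the real API objects).

```python
# Toy model of label-based pod selection.

pods = {
    "web-1": {"role": "frontend", "stage": "production"},
    "web-2": {"role": "frontend", "stage": "test"},
    "db-1":  {"role": "backend",  "stage": "production"},
}

def select(selector, pods):
    # A pod matches if every selector key/value pair appears in its labels
    return sorted(name for name, labels in pods.items()
                  if all(labels.get(k) == v for k, v in selector.items()))

print(select({"role": "frontend"}, pods))                         # ['web-1', 'web-2']
print(select({"role": "frontend", "stage": "production"}, pods))  # ['web-1']
```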
IP per pod
Borg:
• Containers running on a machine share the host's IP address, so Borg assigns the containers unique port numbers
  – Burdens infrastructure and application developers, e.g., it requires replacing the DNS service

Kubernetes:
• One IP address per pod, thus aligning network identity (IP address) with application identity
  – Easier to run off-the-shelf software on Kubernetes
Omega and Kubernetes
• Like Omega, a shared persistent store
  – Components watch for changes to relevant objects
• In contrast to Omega, state is accessed through a domain-specific REST API that applies higher-level versioning, validation, semantics, and policy, in support of a more diverse array of clients
  – Stronger focus on the experience of developers writing cluster applications
Kubernetes architecture
• Master: the cluster control plane
• Kubelets: node agents
• Cluster state backed by a distributed storage system (etcd, a highly available distributed key-value store)
Kubernetes on Mesos
• Mesos allows dynamic sharing of cluster resources between Kubernetes and other first-class Mesos frameworks such as HDFS, Spark, and Chronos
  kubernetes.io/docs/getting-started-guides/mesos/
References
• B. Burns et al., Borg, Omega, and Kubernetes. Commun. ACM, 2016.
• A. Ghodsi et al., Dominant resource fairness: fair allocation of multiple resource types. NSDI 2011.
• B. Hindman et al., Mesos: a platform for fine-grained resource sharing in the data center. NSDI 2011.
• M. Schwarzkopf et al., Omega: flexible, scalable schedulers for large compute clusters. EuroSys 2013.
• V. K. Vavilapalli et al., Apache Hadoop YARN: yet another resource negotiator. SoCC 2013.
• A. Verma et al., Large-scale cluster management at Google with Borg. EuroSys 2015.