Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

Systems for Resource Management

Corso di Sistemi e Architetture per Big Data A.A. 2016/17

Valeria Cardellini

The reference Big Data stack

High-level Interfaces Support / Integration

Data Processing

Data Storage

Resource Management

Outline

• Cluster management systems
  – Mesos
  – YARN
  – Borg
  – Omega
  – Kubernetes

• Resource management policies
  – DRF


Motivations

• Rapid innovation in cloud computing

• No single framework optimal for all applications

• Running each framework on its dedicated cluster: – Expensive – Hard to share data

A possible solution

• Run multiple frameworks on a single cluster
• How to share the (virtual) cluster resources among multiple, non-homogeneous frameworks executed in virtual machines/containers?
• The classical solution: static partitioning
• Is it efficient?


What we need

• The "Datacenter as a Computer" idea by D. Patterson
  – Share resources to maximize their utilization
  – Share data among frameworks
  – Provide a unified API to the outside
  – Hide the internal complexity of the infrastructure from applications
• The solution: a cluster-scale resource manager that employs dynamic partitioning

Apache Mesos

• Cluster manager that provides a common resource-sharing layer over which diverse frameworks can run
• Abstracts the entire datacenter into a single pool of computing resources, simplifying the task of running distributed systems at scale
• A distributed system to run distributed systems on top of it

Dynamic partitioning


Apache Mesos (2)
• Designed and developed at the University of California, Berkeley
  – Top open-source project by Apache: mesos.apache.org
• Airbnb among the first users; now supports some of the largest applications in the world
• Cluster: a dynamically shared pool of resources (figure: static partitioning vs. dynamic partitioning)

Mesos goals

• High utilization of resources

• Supports diverse frameworks (current and future)

• Scalability to 10,000s of nodes

• Reliability in the face of failures


Mesos in the data center

• Where does Mesos fit as an abstraction layer in the datacenter?

Computation model

• A framework (e.g., Hadoop, Spark) manages and runs one or more jobs
• A job consists of one or more tasks
• A task (e.g., map, reduce) consists of one or more processes running on the same machine
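To make the framework / job / task hierarchy concrete, here is a minimal sketch in plain Python (illustrative data structures only, not part of the Mesos API):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    """Smallest unit of work: one or more processes on a single machine."""
    task_id: str
    resources: Dict[str, float]        # e.g., {"cpus": 1.0, "mem": 512.0}

@dataclass
class Job:
    """A job groups the tasks of one computation."""
    job_id: str
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Framework:
    """A framework (e.g., Hadoop, Spark) manages and runs one or more jobs."""
    name: str
    jobs: List[Job] = field(default_factory=list)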


What Mesos does

• Enables fine-grained sharing of resources (CPU, RAM, …) across frameworks, at the level of tasks within a job
• Provides common functionalities:
  – Failure detection
  – Task distribution
  – Task starting
  – Task monitoring
  – Task killing
  – Task cleanup

Fine-grained sharing

• Allocation at the level of tasks within a job • Improves utilization, latency, and data locality

(figure: coarse-grained sharing vs. fine-grained sharing)


Frameworks on Mesos

• Frameworks must be aware of running on Mesos
  – DevOps tooling: Vamp (deployment and workflow tool for container orchestration)
  – Long-running services: Aurora (service scheduler), …
  – Big Data processing: Hadoop, Spark, Storm, MPI, …
  – Batch scheduling: Chronos, …
  – Data storage: Cassandra, ElasticSearch, …
  – Machine learning: TFMesos (TensorFlow on Mesos)
• See the full list at mesos.apache.org/documentation/latest/frameworks/

Mesos: architecture

• Master-slave architecture • Slaves publish available resources to master • Master sends resource offers to frameworks • Master election and service discovery via ZooKeeper

Source: Mesos: a platform for fine-grained resource sharing in the data center, NSDI'11


Mesos component: Apache ZooKeeper
• Coordination service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
• Used in many distributed systems, among which Mesos, Storm, and Kafka
• Allows distributed processes to coordinate with each other through a shared hierarchical name space of data (znodes)
  – File-system-like API
  – Name space similar to that of a standard file system
  – Limited amount of data in znodes
  – It is not really a file system, database, key-value store, or lock service
• Provides high-throughput, low-latency, highly available, strictly ordered access to the znodes

Mesos component: ZooKeeper (2)
• Replicated over a set of machines that maintain an in-memory image of the data tree
  – Read requests are processed locally by the ZooKeeper server
  – Write requests are forwarded to the other ZooKeeper servers and go through consensus before a response is generated (primary-backup system)
  – Relies on ZAB, a Paxos-like protocol, whose leader election phase determines which server acts as the master
• Implements atomic broadcast
  – Processes deliver the same messages (agreement) and deliver them in the same order (total order)
  – Message = state update
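As an illustration of how systems build on ZooKeeper's znodes, below is a minimal leader-election sketch using the third-party kazoo Python client (not part of Mesos or ZooKeeper itself); the ensemble address and znode paths are placeholders:

# Leader election with ephemeral sequential znodes (kazoo client, pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")      # ZooKeeper ensemble address (assumed)
zk.start()

ELECTION_PATH = "/demo/leader-election"
zk.ensure_path(ELECTION_PATH)

# Each candidate creates an ephemeral, sequential znode: it disappears if the
# candidate dies, and the sequence number gives a total order among candidates.
my_znode = zk.create(ELECTION_PATH + "/candidate-",
                     value=b"host-a", ephemeral=True, sequence=True)

# The candidate owning the znode with the smallest sequence number is the leader.
candidates = sorted(zk.get_children(ELECTION_PATH))
am_leader = my_znode.endswith(candidates[0])
print("I am the leader" if am_leader else "Following " + candidates[0])

zk.stop()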


Mesos and framework components

• Mesos components
  – Master
  – Slaves or agents
• Framework components
  – Scheduler: registers with the master to be offered resources
  – Executors: launched on agent nodes to run the framework's tasks

Scheduling in Mesos
• Scheduling mechanism based on resource offers
  – Mesos offers available resources to frameworks
  – Each resource offer contains a list of free resources available on the slaves
  – Each framework chooses which resources to use and which tasks to launch
• Two-level scheduler architecture
  – Mesos delegates the actual scheduling of tasks to frameworks
  – Improves scalability: the master does not have to know the scheduling intricacies of every type of application that it supports
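A conceptual sketch of this offer cycle is shown below; it is a toy simulation, not the real Mesos API, and the greedy framework policy is purely illustrative:

def launch(tasks):
    # Stand-in for the master forwarding each task to the chosen slave's executor.
    print("launching", tasks)

def master_offer_cycle(free_resources, frameworks):
    """free_resources: {slave_id: {"cpus": c, "mem": m}} currently unallocated."""
    for fw in frameworks:                         # allocation module picks the order (e.g., DRF)
        offer = {s: dict(r) for s, r in free_resources.items()}   # offer what is still free
        accepted, tasks = fw.resource_offers(offer)               # framework-side decision
        for slave_id, used in accepted.items():                   # subtract accepted resources
            for res, amount in used.items():
                free_resources[slave_id][res] -= amount
        launch(tasks)

class GreedyFramework:
    """Toy framework scheduler: takes <1 CPU, 1 GB> on every slave that has it free."""
    def resource_offers(self, offer):
        accepted, tasks = {}, []
        for slave_id, res in offer.items():
            if res["cpus"] >= 1 and res["mem"] >= 1024:
                accepted[slave_id] = {"cpus": 1, "mem": 1024}
                tasks.append(("task-on-" + slave_id, slave_id))
        return accepted, tasks

cluster = {"slave1": {"cpus": 4, "mem": 8192}, "slave2": {"cpus": 2, "mem": 2048}}
master_offer_cycle(cluster, [GreedyFramework(), GreedyFramework()])
print(cluster)   # resources left after both frameworks have been offered the cluster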


Mesos: resource offers

• Resource allocation is based on Dominant Resource Fairness (DRF)

Mesos: resource offers in detail

• Slaves continuously send status updates about resources to the Master


Mesos: resource offers in detail (2)

Mesos: resource offers in detail (3)

• Framework scheduler can reject offers


Mesos: resource offers in detail (4)

• Framework scheduler selects resources and provides tasks • Master sends tasks to slaves

Mesos: resource offers in detail (5)

• Framework executors launch tasks


Mesos: resource offers in detail (6)

Mesos: resource offers in detail (7)


Mesos fault tolerance

• Task failure • Slave failure • Host or network failure • Master failure • Framework scheduler failure

Fault tolerance: task failure


Fault tolerance: task failure (2)

Fault tolerance: slave failure


Fault tolerance: slave failure (2)

Fault tolerance: host or network failure


Fault tolerance: host or network failure (2)

Fault tolerance: host or network failure (3)


Fault tolerance: master failure

Fault tolerance: master failure (2)

• When the leading Master fails, the surviving masters use ZooKeeper to elect a new leader


Fault tolerance: master failure (3)

• The slaves and frameworks use ZooKeeper to detect the new leader and reregister

Fault tolerance: framework scheduler failure


Fault tolerance: framework scheduler failure (2)

• When a framework scheduler fails, another instance can reregister to the Master without interrupting any of the running tasks

Fault tolerance: framework scheduler failure (3)


Fault tolerance: framework scheduler failure (4)

Resource allocation

1. How to assign the cluster resources to the tasks? – Design alternatives • Global (monolithic) scheduler • Two-level scheduler

2. How to allocate resources of different types?


Global (monolithic) scheduler

• Job requirements
  – Response time
  – Throughput
  – Availability
• Job execution plan
  – Task directed acyclic graph (DAG)
  – Inputs/outputs
• Estimates
  – Task duration
  – Input sizes
  – Transfer sizes
• Pros
  – Can achieve an optimal schedule (global knowledge)
• Cons
  – Complexity: hard to scale and to ensure resilience
  – Hard to anticipate the requirements of future frameworks
  – Need to refactor existing frameworks

Two-level scheduling in Mesos

• Resource offer
  – Vector of available resources on a node
  – E.g., node1: <1CPU, 1GB>, node2: <4CPU, 16GB>
• Master sends resource offers to frameworks
• Frameworks select which offers to accept and which tasks to run
  -> Push task placement to frameworks
• Pros
  – Simple: easier to scale and make resilient
  – Easy to port existing frameworks and support new ones
• Cons
  – Distributed scheduling decision: not optimal

Mesos: resource allocation

• Based on Dominant Resource Fairness (DRF) algorithm

DRF: background on fair sharing
• Consider a single resource: fair sharing
  – n users want to share a resource, e.g., CPU
  – Solution: allocate each user 1/n of the shared resource
• Generalized by max-min fairness
  – Handles the case in which a user wants less than its fair share
  – E.g., user 1 wants no more than 20%
• Generalized by weighted max-min fairness
  – Gives weights to users according to their importance

  – E.g., user 1 gets weight 1, user 2 weight 2

Max-min fairness

• 1 resource type: CPU
• Total resources: 20 CPU
• User 1 has x tasks and wants <1CPU> per task
• User 2 has y tasks and wants <2CPU> per task

  max(x, y)          (maximize allocation)
  subject to
    x + 2y ≤ 20      (CPU constraint)
    x = 2y           (fairness)

• Solution: x = 10, y = 5
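A minimal sketch of max-min fair allocation for a single resource (progressive filling; the demand values below are illustrative):

def max_min_fair(capacity, demands):
    """Max-min fair shares of a single resource.
    demands: {user: maximum amount the user can use}."""
    alloc = {u: 0.0 for u in demands}
    active = set(demands)
    remaining = float(capacity)
    while active:
        share = remaining / len(active)
        # Users whose residual demand fits within the equal share are fully satisfied.
        satisfied = {u for u in active if demands[u] - alloc[u] <= share}
        if not satisfied:                       # everyone still wants more: split equally
            for u in active:
                alloc[u] += share
            return alloc
        for u in satisfied:                     # satisfy them, redistribute the leftover
            remaining -= demands[u] - alloc[u]
            alloc[u] = demands[u]
        active -= satisfied
    return alloc

# The example above: 20 CPUs, both users could use the whole cluster -> 10 CPUs each,
# i.e., x = 10 one-CPU tasks and y = 5 two-CPU tasks.
print(max_min_fair(20, {"user1": 20, "user2": 20}))   # {'user1': 10.0, 'user2': 10.0}
# A user asking less than its fair share gets it all; the rest goes to the others.
print(max_min_fair(20, {"user1": 4, "user2": 20}))    # {'user1': 4.0, 'user2': 16.0}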

Valeria Cardellini - SABD 2016/17 47 Why is fair sharing useful?

• Proportional allocation
  – User 1 gets weight 2, user 2 weight 1
• Priorities
  – Give user 1 weight 1000, user 2 weight 1
• Reservations
  – Ensure user 1 gets 10% of a resource: give user 1 weight 10 and let the weights sum to 100
• Isolation policy
  – Users cannot affect others beyond their fair share


Why is fair sharing useful? (2)

• Share guarantee
  – Each user can get at least 1/n of the resource
  – But will get less if its demand is less
• Strategy-proofness
  – Users are not better off asking for more than they need
  – Users have no reason to lie

• Max-min fairness is the only reasonable mechanism with these two properties
• Many schedulers use max-min fairness
  – OS, networking, datacenters (e.g., YARN), …

Max-min fairness drawback
• When is max-min fairness not enough?
• Need to schedule multiple, heterogeneous resources (CPU, memory, disk, I/O)
• Single-resource example
  – 1 resource: CPU
  – User 1 wants <1CPU> per task
  – User 2 wants <2CPU> per task
• Multi-resource example
  – 2 resources: CPUs and memory
  – User 1 wants <1CPU, 4GB> per task
  – User 2 wants <3CPU, 1GB> per task

• What is a fair allocation?

A first (wrong) solution
• Asset fairness: give weights to resources (e.g., 1 CPU = 1 GB) and equalize the total value allocated to each user
• Total resources: 28 CPU and 56 GB RAM (here 1 CPU = 2 GB)
  – User 1 has x tasks and wants <1CPU, 2GB> per task
  – User 2 has y tasks and wants <1CPU, 4GB> per task
• Asset fairness yields:
    max(x, y)
    x + y ≤ 28       (CPU constraint)
    2x + 4y ≤ 56     (RAM constraint)
    4x = 6y          (equal asset value)
• User 1: x = 12, i.e., <43% CPU, 43% GB> (sum = 86%)
• User 2: y = 8, i.e., <29% CPU, 57% GB> (sum = 86%)

A first (wrong) solution (2)

• Problem: violates share guarantee – User 1 gets less than 50% of both CPU and RAM – Better off in a separate cluster with half the resources
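A short script that brute-forces the example above and confirms why asset fairness is problematic (the equal-value constraint 4x = 6y follows the slide's formulation with 1 CPU = 2 GB):

CPUS, MEM = 28, 56
task1, task2 = (1, 2), (1, 4)                  # <CPU, GB> per task for user 1 and user 2
value = lambda cpu, gb: 2 * cpu + gb           # asset value: 1 CPU is worth 2 GB

best = (0, 0)
for x in range(CPUS + 1):
    for y in range(CPUS + 1):
        fits = x * task1[0] + y * task2[0] <= CPUS and x * task1[1] + y * task2[1] <= MEM
        equal_value = value(x * task1[0], x * task1[1]) == value(y * task2[0], y * task2[1])
        if fits and equal_value and x + y > sum(best):
            best = (x, y)

x, y = best                                    # -> (12, 8), as on the slide
print("user 1:", x, "tasks ->", round(x / CPUS, 2), "CPU share,", round(2 * x / MEM, 2), "GB share")
print("user 2:", y, "tasks ->", round(y / CPUS, 2), "CPU share,", round(4 * y / MEM, 2), "GB share")
# Both of user 1's shares are ~0.43 < 0.5: the share guarantee is violated.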


What Mesos needs

• A fair sharing policy that provides: – Share guarantee – Strategy-proofness

• Challenge: can we generalize max-min fairness to multiple resources?

• Solution: Dominant Resource Fairness (DRF)

Source: Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI'11

DRF

• Dominant resource of a user: the resource that user has the biggest share of – Example: • Total resources: <8CPU, 5GB> • User 1 allocation: <2CPU, 1GB> • 2/8 = 25%CPU and 1/5 = 20%RAM • Dominant resource of user 1 is CPU (25% > 20%)

• Dominant share of a user: the fraction of the dominant resource allocated to the user – User 1 dominant share is 25%


DRF (2)
• Apply max-min fairness to dominant shares: give every user an equal share of its dominant resource
• Equalize the dominant shares of the users
  – Total resources: <9CPU, 18GB>
  – User 1 wants <1CPU, 4GB> per task; dominant resource: RAM (1/9 < 4/18)
  – User 2 wants <3CPU, 1GB> per task; dominant resource: CPU (3/9 > 1/18)
    max(x, y)
    x + 3y ≤ 9          (CPU constraint)
    4x + y ≤ 18         (RAM constraint)
    (4/18)x = (3/9)y    (equal dominant shares)
• User 1: x = 3, i.e., <33% CPU, 66% GB>

• User 2: y = 2, i.e., <66% CPU, 11% GB>

Online DRF

• Whenever there are available resources and tasks to run: – Presents resource offers to the framework with the smallest dominant share among all frameworks
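A sketch of this online allocation loop, following the pseudo-code of the DRF paper (simplified: the chosen framework launches one task per step, and the loop stops as soon as the next task of the poorest framework no longer fits):

def drf(total, demands, rounds=1000):
    """total: {resource: capacity}; demands: {user: {resource: per-task demand}}."""
    consumed = {r: 0.0 for r in total}
    alloc = {u: {r: 0.0 for r in total} for u in demands}
    tasks = {u: 0 for u in demands}
    dom_share = {u: 0.0 for u in demands}

    for _ in range(rounds):
        u = min(demands, key=lambda user: dom_share[user])   # smallest dominant share first
        d = demands[u]
        if any(consumed[r] + d[r] > total[r] for r in total):
            break                                            # next task of u does not fit: stop
        for r in total:                                      # launch one task of user u
            consumed[r] += d[r]
            alloc[u][r] += d[r]
        tasks[u] += 1
        dom_share[u] = max(alloc[u][r] / total[r] for r in total)
    return tasks, alloc

# The example above: <9 CPU, 18 GB>, user 1 <1 CPU, 4 GB>, user 2 <3 CPU, 1 GB>
tasks, alloc = drf({"cpu": 9, "mem": 18},
                   {"user1": {"cpu": 1, "mem": 4}, "user2": {"cpu": 3, "mem": 1}})
print(tasks)   # {'user1': 3, 'user2': 2}, matching the allocation computed above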


DRF: efficiency-fairness trade-off

• DRF can leave resources under-utilized

• DRF schedules at the level of tasks (which can lead to sub-optimal job completion times)

• Fairness is fundamentally at odds with overall efficiency (how to trade-off?)

Mesos and containers

• Mesos plays well with existing container technologies (e.g., Docker) and also provides its own container technology
  – Docker containers: tasks run inside a Docker container
  – Mesos containers: tasks run with an array of pluggable isolators provided by Mesos
• Also supports composing different container technologies (e.g., Docker and Mesos)
• At which scale? 50,000 live containers


Mesos deployment and evolution

• Mesos clusters can be deployed – On nearly every IaaS cloud provider infrastructure – In your own physical datacenter

• Mesos platform evolution towards microservices • Mesosphere’s Datacenter Operating System (DC/OS) – Open source operating system and distributed system built upon Mesos • Marathon – Container orchestration platform for Mesos and DC/OS

YARN
• YARN: Yet Another Resource Negotiator
  – A framework for job scheduling and cluster resource management

• Turns Hadoop into an analytics platform in which resource management functions are separated from the programming model
  – Many scheduling-related functions are delegated to per-job components

• Why separation? – Provides flexibility in the choice of programming framework – Platform can support not only MapReduce, but also other models (e.g., Spark, Storm)

Source: Apache Hadoop YARN: Yet Another Resource Negotiator, SoCC'13


YARN architecture • Global ResourceManager (RM) • A set of per-application ApplicationMasters (AMs) • A set of NodeManagers (NMs)

YARN architecture: ResourceManager
• One RM per cluster
  – Central: global view
  – Enables global properties: fairness, capacity, locality
• Job requests are submitted to the RM
  – Request-based
  – To start a job (application), the RM finds a container in which to spawn the AM
• Container
  – Logical bundle of resources (CPU, memory)
• The RM only handles an overall resource profile for each application
  – Local optimization is up to the application
• Preemption
  – Request resources back from an application
  – Checkpoint/snapshot the application state or migrate its computation to other containers, instead of explicitly killing jobs


YARN architecture: ResourceManager (2)

• No static resource partitioning
• The RM is the resource scheduler in the system
  – Organizes the global assignment of computing resources to application instances
• Composed of the Scheduler and the ApplicationsManager
  – Scheduler: responsible for allocating resources to the various running applications
    • Pluggable scheduling policy to partition the cluster resources among the various applications
    • Available pluggable schedulers: CapacityScheduler and FairScheduler
  – ApplicationsManager: accepts job submissions and negotiates the first set of resources for job execution

YARN architecture: ApplicationMaster
• One AM per application
• The head of a job
• Runs as a container
• Issues resource requests to the RM to obtain containers
  – Number of containers, resources per container, locality preferences, ...
• Dynamically changes its resource consumption, based on the containers it receives from the RM
• Requests are late-binding
  – The process spawned is not bound to the request, but to the lease
  – The conditions that caused the AM to issue the request may no longer hold when it receives its resources
• Can run any user code, e.g., MapReduce, Spark, etc.
• The AM determines the semantics of the success or failure of the container

YARN architecture: NodeManager

• One NM per node
  – The worker daemon
  – Runs as a local agent on the execution node
• Registers with the RM, heartbeats its status, and receives instructions
• Reports resources to the RM: memory, CPU, ...
  – Responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing)
• Containers are described by a container launch context (CLC)
  – The command necessary to create the process
  – Environment variables, security tokens, …

YARN architecture: NodeManager (2)

• Configures the environment for task execution
• Garbage collection
• Auxiliary services
  – A process may produce data that persist beyond the life of the container
  – E.g., intermediate data between map and reduce tasks


YARN: application startup

• Submitting the application: pass a CLC for the AM to the RM
• The RM launches the AM, which registers with the RM
• The AM periodically advertises its liveness and requirements over the heartbeat protocol
• Once the RM allocates a container, the AM constructs a CLC to launch the container on the corresponding NM
  – It monitors the status of the running container and stops it when the resource should be reclaimed
• Once the AM is done with its work, it unregisters from the RM and exits cleanly
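The sequence above can be summarized with the toy sketch below; it mimics the protocol only and is NOT the Hadoop YARN API (which is Java), so all class and method names are stand-ins:

class ToyRM:
    """Toy ResourceManager that grants every request immediately."""
    def register_application_master(self): print("AM registered")
    def allocate(self, n):                           # heartbeat + resource request in one call
        print(f"heartbeat: asking for {n} containers")
        return [f"container-{i}" for i in range(n)]
    def unregister_application_master(self): print("AM unregistered")

class ToyNM:
    """Toy NodeManager that just acknowledges container launches."""
    def start_container(self, container, clc):
        print(f"{container}: launching with CLC {clc}")

def application_master(rm, nm, work_items):
    rm.register_application_master()                 # register with the RM
    while work_items:
        granted = rm.allocate(len(work_items))       # periodic heartbeat / resource request
        for container in granted:
            clc = {"command": work_items.pop(), "env": {}}   # container launch context
            nm.start_container(container, clc)       # launch on the chosen node
    rm.unregister_application_master()               # clean exit once the work is done

application_master(ToyRM(), ToyNM(), ["map-0", "map-1", "reduce-0"])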

Comparing Mesos and YARN

Mesos                                                                | YARN
Two-level scheduler                                                  | Two-level scheduler (limited)
Offer-based approach                                                 | Request-based approach
Pluggable scheduling policy (DRF as default)                         | Pluggable scheduling policy
Framework gets resource offers to choose from (minimal information)  | Framework asks for a container with a specification + preferences (lots of information)
General-purpose scheduler (including stateful services)              | Designed and optimized for Hadoop jobs (stateless batch jobs)
Multi-dimensional resource granularity                               | RAM/CPU slots resource granularity
Multiple masters                                                     | Single ResourceManager
Written in C++ (performance)                                         | Written in Java; simple Linux processes isolation


Cluster resource management: Hot research topic

• Borg @ EuroSys 2015
  – Cluster management system used by Google for the last decade, ancestor of Kubernetes
• Firmament @ OSDI 2016, firmament.io
  – Scheduler (with a Kubernetes plugin called Poseidon) that balances the high-quality placement decisions of a centralized scheduler with the speed of a distributed scheduler
• Slicer @ OSDI 2016
  – Google's auto-sharding service that separates out the responsibility for sharding, balancing, and re-balancing from individual application frameworks

Container orchestration
• Container orchestration: the set of operations that allows cloud and application providers to define how to select, deploy, monitor, and dynamically control the configuration of multi-container packaged applications in the cloud
  – The tool used by organizations adopting containers for enterprise production to integrate and manage those containers at scale
• Some examples
  – Cloudify
  – Kubernetes
  – Docker Swarm


Container-management systems at Google

• Application-oriented shift: "Containerization transforms the data center from being machine-oriented to being application-oriented"

• Goal: to allow container technology to operate at Google scale – Everything at Google runs as a container

• Borg -> Omega -> Kubernetes – Borg and Omega: purely Google-internal systems – Kubernetes: open-source

Borg

Cluster management system at Google that achieves high utilization by: • Admission control • Efficient task-packing • Over-commitment • Machine sharing


The user perspective
• Users: mainly Google developers and system administrators
• Heterogeneous workload: mainly production and batch jobs
• Cell: a subset of cluster machines
  – Around 10K nodes
  – Machines in a cell are heterogeneous in many dimensions
• Jobs and tasks
  – A job is a group of tasks
  – Each task maps to a set of Linux processes running in a container on a machine

  – task = container

The user perspective (2)
• Alloc (allocation)
  – Reserved pool of resources on a machine in which one or more tasks can run
  – The smallest deployable unit that can be created, scheduled, and managed
  – Task = alloc, job = alloc set
• Priority, quota, and admission control
  – A job has a priority (preemption)
  – Quota is used to decide which jobs to admit for scheduling
• Naming and monitoring
  – A service's clients and other systems need to be able to find tasks, even after their relocation to a new machine
  – BNS example: 50.jfoo.ubar.cc.borg.google.com
  – Monitoring of task health and of thousands of performance metrics

Scheduling a job

job hello_world = {
  runtime = { cell = "ic" }              // what cell should run it in?
  binary = '../hello_world_webserver'    // what program to run?
  args = { port = '%port%' }
  requirements = {
    RAM = 100M
    disk = 100M
    CPU = 0.1
  }
  replicas = 10000
}

Borg architecture

• Borgmaster
  – Main Borgmaster process (five replicas)
  – Scheduler
• Borglets
  – Manage and monitor tasks and resources
  – The Borgmaster polls the Borglets every few seconds


Borg architecture

• Fauxmaster: high-fidelity Borgmaster simulator
  – Simulates previous runs from checkpoints
  – Contains the full Borg code
• Used for debugging, capacity planning, and evaluating new policies and algorithms

Scheduling in Borg

• Two-phase scheduling algorithm
  1. Feasibility checking: find machines for a given job
  2. Scoring: pick one machine, based on user preferences and built-in criteria:
     • Minimizing the number and priority of preempted tasks
     • Picking machines that already have a copy of the task's packages
     • Spreading tasks across power and failure domains
     • Packing by mixing high- and low-priority tasks
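A simplified sketch of the feasibility + scoring structure (the scoring terms below are stand-ins for Borg's built-in criteria, not the actual formula):

def schedule(task, machines):
    # Phase 1 - feasibility checking: keep only machines that can host the task.
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_ram"] >= task["ram"]]
    if not feasible:
        return None                      # in Borg this may trigger preemption instead

    # Phase 2 - scoring: prefer machines that already cache the task's packages
    # and that leave the least leftover capacity (a best-fit-like criterion).
    def score(m):
        locality = 1.0 if task["package"] in m["cached_packages"] else 0.0
        leftover = (m["free_cpu"] - task["cpu"]) + (m["free_ram"] - task["ram"])
        return (locality, -leftover)     # higher tuple wins

    return max(feasible, key=score)

machines = [
    {"name": "m1", "free_cpu": 4,  "free_ram": 8,  "cached_packages": {"hello_world"}},
    {"name": "m2", "free_cpu": 16, "free_ram": 64, "cached_packages": set()},
]
print(schedule({"cpu": 2, "ram": 4, "package": "hello_world"}, machines)["name"])  # m1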


Borg’s allocation policies

• Advanced bin-packing algorithms
  – Avoid stranding of resources
  – Bin packing problem: objects of different volumes v_i must be packed into a finite number of bins, each of volume V, in a way that minimizes the number of bins used
  – Bin packing is NP-hard -> many heuristics, among which: best fit, first fit, worst fit
• Evaluation metric: cell compaction
  – Find the smallest cell into which the workload can be packed…
  – Remove machines randomly from a cell to maintain cell heterogeneity
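For reference, here are the classic first-fit and best-fit heuristics for one-dimensional bin packing (illustrative only; Borg's real packing works over multiple resource dimensions):

def first_fit(volumes, bin_capacity):
    bins = []                                  # each bin is the list of items it holds
    for v in volumes:
        for b in bins:
            if sum(b) + v <= bin_capacity:     # place in the first bin where it fits
                b.append(v)
                break
        else:
            bins.append([v])                   # no existing bin fits: open a new one
    return bins

def best_fit(volumes, bin_capacity):
    bins = []
    for v in volumes:
        fitting = [b for b in bins if sum(b) + v <= bin_capacity]
        if fitting:                            # tightest bin: least space left after placing
            min(fitting, key=lambda b: bin_capacity - sum(b) - v).append(v)
        else:
            bins.append([v])
    return bins

items = [0.5, 0.7, 0.5, 0.2, 0.4, 0.2, 0.5]
print(len(first_fit(items, 1.0)), len(best_fit(items, 1.0)))   # bins used by each heuristic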

Design choices for scalability in Borg

• Separate scheduler • Separate threads to poll the Borglets • Partition functions across the five replicas • Score caching – Evaluating feasibility and scoring a machine is expensive, so cache them • Equivalence classes – Group tasks with identical requirements • Relaxed randomization – Wasteful to calculate feasibility and scores for all the machines in a large cell


Borg ecosystem

• Many different systems built in, on, and around Borg to improve upon the basic container-management services – Naming and service discovery (the Borg Name Service, or BNS) – Master election, using Chubby – Application-aware load balancing – Horizontal (instance number) and vertical (instance size) auto-scaling – Rollout tools for deployment of new binaries and configuration data – Workflow tools – Monitoring tools

Omega

• Offspring of Borg
• State of the cluster stored in a centralized Paxos-based transaction-oriented store
  – Accessed by the different parts of the cluster control plane (such as schedulers), using optimistic concurrency control to handle the occasional conflicts


Kubernetes (K8s)

• Google's open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure
  – kubernetes.io, kubernetes.io/docs/tutorials/kubernetes-basics/
• Features:
  – Portable: public, private, hybrid, multi-cloud
  – Extensible: modular, pluggable, hookable, composable
  – Self-healing: auto-placement, auto-restart, auto-replication, auto-scaling

Borg and Kubernetes

• Loosely inspired by Borg

Directly derived:
• Borglet => Kubelet
• alloc => pod
• Borg containers => Docker
• Declarative specifications

Improved:
• Job => labels
• Managed ports => IP per pod
• Monolithic master => micro-services


Pod

• Pod: the basic unit in Kubernetes
  – Group of containers that are co-scheduled
• A resource envelope in which one or more containers run
  – Containers that are part of the same pod are guaranteed to be scheduled together onto the same machine and can share state via local volumes
• Users organize pods using labels
  – Label: arbitrary key/value pair attached to a pod
  – E.g., role=frontend and stage=production
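A tiny sketch of how label-based selection works conceptually (a controller or service selects the pods whose labels contain its selector; the names below are illustrative):

pods = [
    {"name": "web-1", "labels": {"role": "frontend", "stage": "production"}},
    {"name": "web-2", "labels": {"role": "frontend", "stage": "test"}},
    {"name": "db-1",  "labels": {"role": "backend",  "stage": "production"}},
]

def select(pods, selector):
    """Return the pods whose labels contain every key/value pair of the selector."""
    return [p for p in pods if selector.items() <= p["labels"].items()]

print([p["name"] for p in select(pods, {"role": "frontend", "stage": "production"})])
# ['web-1']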

IP per pod

Borg
• Containers running on a machine share the host's IP address, so Borg assigns the containers unique port numbers
  – Burdens infrastructure and application developers, e.g., it requires replacing the DNS service

Kubernetes
• IP address per pod, thus aligning network identity (IP address) with application identity
  – Easier to run off-the-shelf software on Kubernetes


Omega and Kubernetes

• Like Omega, shared persistent store – Components watch for changes to relevant objects • In contrast to Omega, state accessed through a domain-specific REST API that applies higher-level versioning, validation, semantics, and policy, in support of a more diverse array of clients – Stronger focus on experience of developers writing cluster applications

Kubernetes architecture

• Master: cluster control plane
• Kubelets: node agents
• Cluster state backed by a distributed storage system (etcd, a highly available distributed key-value store)


Kubernetes on Mesos

• Mesos allows dynamic sharing of cluster resources between Kubernetes and other first-class Mesos frameworks such as HDFS, Spark, and Chronos kubernetes.io/docs/getting-started-guides/mesos/

References

• B. Burns et al., Borg, Omega, and Kubernetes. Commun. ACM, 2016.
• A. Ghodsi et al., Dominant resource fairness: fair allocation of multiple resource types. NSDI 2011.
• B. Hindman et al., Mesos: a platform for fine-grained resource sharing in the data center. NSDI 2011.
• M. Schwarzkopf et al., Omega: flexible, scalable schedulers for large compute clusters. EuroSys 2013.
• V. K. Vavilapalli et al., Apache Hadoop YARN: yet another resource negotiator. SoCC 2013.
• A. Verma et al., Large-scale cluster management at Google with Borg. EuroSys 2015.