Micro-Service Orchestration Deep Dive

© 202 D2iQ, Inc. All Rights Reserved.

Ken Sipe
Distributed Application Engineer

Apache Mesos Contributor
Apache Committer: Myriad, Open DCOS
Developer: Embedded, C++, Java, Groovy, Grails, C#, Go
@KenSipe
[email protected]

Agenda

● Introduction to Container Orchestration
● Mesos, DC/OS and Kubernetes
● Cluster Architecture
● Pods
● Scheduling
● Resource Selection
● Reconciliation
● Service Discovery

© 2017 Mesosphere, Inc. All Rights Reserved.

Today’s Legacy Datacenter

Provision VMs in the cloud or on physical servers

Installing an Application with Static Partitioning

Install Hadoop on a static set of machines

Installing an Application with Static Partitioning

Install Web Server on a static set of machines

Resizing an Application with Static Partitioning

Scale up Hadoop manually

Challenge: Known IP and Port for Resource

What if your Laptop was operated like your Data Center?

Challenge: Resource Utilization

From Static Partitioning to Elastic Sharing

[Figure: cluster utilization on a 0 to 100% scale. Static Partitioning: WEB, CACHE and HADOOP each own a fixed partition, and the unused part of each partition is WASTED. Elastic Sharing: WEB, CACHE and HADOOP share one pool, and the unused capacity is FREE.]

The Borg

●The beginning of container orchestration

specific

http://www.wired.com/wiredenterprise/2013/03/google-borg--mesos/all/

Mesos lets you treat a cluster of nodes…

As one big computer

Mesos Overview

● Mesos Master
● Framework Scheduler (Driver)
● Mesos Agent
● Isolation, Reporting
● Framework Executor

Mesos Framework Overview

[Figure: three ZooKeeper nodes (ZK 1, ZK 2, ZK 3), three Mesos masters (Master 1, Master 2, Master 3) and a pool of slaves; two framework drivers are shown, Marathon and Elastic Search.]

The UNIX Operating System Stack

Applications: Apache, MySQL, Memcached, SSHd
Init System: Init, Upstart, Systemd
Kernel: Linux, BSD

The Mesosphere Operating System Stack

Applications: Rails, Redis, Elasticsearch, Memcached
Init System: Marathon
Kernel: Mesos

Mesos Framework Components

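[Figure: message flow between the Scheduler, the Mesos Master, a Mesos Slave, and an Executor running two Tasks]

Between Scheduler and Mesos Master:
1. resourceOffers()
2. launchTasks()
5. statusUpdate()

Between Mesos Slave and Executor:
3. launchTask()
4. statusUpdate()

DC/OS brings it all together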

● Service Discovery
● Load Balancing
● Security
● Ease of installation
● Comprehensive tooling for operations
● Built-in frameworks for long running and scheduled jobs
● Catalog of pre-configured apps (including , …), browse at http://universe.dcos.io/
● And much more

https://dcos.io/

DC/OS is …

● 100% open source (ASL2.0)
  + A big, diverse community
● An umbrella for ~30 OSS projects
  + Roadmap and designs
  + Documentation and tutorials
● Not limited in any way

DC/OS Architecture Overview

[Figure: layered architecture, top to bottom]
Services & Containers: HDFS, Marathon, Cassandra, Flink, Spark, Kafka, MongoDB, +30 more...
DC/OS: Security & Governance, Container Orchestration, Monitoring & Operations, User Interface & Command Line
ANY INFRASTRUCTURE

KUBERNETES

What is Kubernetes?

●“Kubernetes is an open source system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.”

●Optimized for microservices-based web applications.

What is K8S

● A clustering technology which provides a uniform platform for application deployment.

KUBERNETES ARCHITECTURE

Kubernetes Components and Analogies

Kubernetes Component | Role | DC/OS Component
etcd | Distributed key/value store | Zookeeper
kube-apiserver | Central API server to interact with the cluster components |
kube-controller-manager | Reconciles scale events for fault tolerant objects (e.g. Replication Controllers) |
kube-scheduler | Schedules containers on the various worker nodes | Marathon
kubelet | Agent running on each worker. Serves as an init daemon to start containers | Mesos Agent
kube-proxy | Process to configure Netfilter (iptables) rules to route traffic across the cluster | Minuteman
kubectl | CLI to interact with the cluster, and deploy containers | dcos CLI

Master Node

● etcd
● API Server
● Controller Manager Service
● Scheduler Service

etcd

● Stores configuration data that can be used by each of the nodes in the cluster
● Simple HTTP/JSON API, the interface for setting or retrieving values
● Configured on a single master server or, in production scenarios, distributed among a number of machines
● The only requirement is that it be network accessible to each of the Kubernetes machines

API Server

● Main management point of the entire cluster
● Makes sure that the etcd store and the service details of deployed containers are in agreement
● Bridge between various components to maintain cluster health
● Disseminates information and commands

Controller Manager

● Manages all the controllers
● Reads config changes and acts on them
● Ex. the replication controller ensures that the number of replicas defined for a service matches the number actually running
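
The replication controller behavior described above amounts to a reconcile loop: compare the desired replica count with what is running and emit the actions that close the gap. The following is a minimal sketch under stated assumptions, not the Kubernetes implementation. The `reconcile` function and the `replica-N` names are hypothetical, and actions are plain strings instead of API calls.

```python
from typing import List


def reconcile(desired_replicas: int, running: List[str]) -> List[str]:
    """Return the actions that move `running` toward `desired_replicas`.

    Hypothetical helper: a real controller watches the API server and
    creates or deletes pods through it.
    """
    diff = desired_replicas - len(running)
    if diff > 0:
        # Too few replicas: create the missing ones.
        return [f"create replica-{len(running) + i}" for i in range(diff)]
    if diff < 0:
        # Too many replicas: delete the surplus from the end of the list.
        return [f"delete {name}" for name in running[diff:]]
    # Desired state already matches observed state.
    return []
```

Calling `reconcile` repeatedly with the latest observed state is what keeps the replica count stable after a pod or node is lost.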

Scheduler Service

● The process that actually assigns workloads to specific nodes in the cluster
● Tracks resource utilization on each host
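
Assigning a workload while tracking utilization can be pictured as a filter step followed by a score step. This is an illustrative sketch only: the `Node` class, `pick_node`, and the "most CPU left" scoring rule are assumptions made for the example, not how kube-scheduler scores nodes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_free: float  # free CPU cores
    mem_free: int    # free memory in MiB


def pick_node(nodes: List[Node], cpu: float, mem: int) -> Optional[str]:
    # Filter: keep only nodes with enough free CPU and memory.
    feasible = [n for n in nodes if n.cpu_free >= cpu and n.mem_free >= mem]
    if not feasible:
        return None  # nothing fits, so the workload stays unscheduled
    # Score: prefer the node with the most CPU (then memory) left afterwards.
    # This rule is an assumption chosen for simplicity.
    best = max(feasible, key=lambda n: (n.cpu_free - cpu, n.mem_free - mem))
    # Track utilization so the next decision sees the updated state.
    best.cpu_free -= cpu
    best.mem_free -= mem
    return best.name
```

For example, with nodes `a` (2 CPU, 4096 MiB) and `b` (4 CPU, 2048 MiB), a request for 1 CPU and 1024 MiB lands on `b`, and a following request for 2048 MiB lands on `a` because `b` no longer has enough memory.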

Node

● Docker
● Kubelet Service
● Proxy Service

Kubernetes Work Units

● Pods
  ● Main object that encapsulates containers
  ● Can contain many containers
  ● One IP per pod
  ● Containers can communicate over localhost
● Services
  ● Provide load balancing for pods
  ● Named, addressed, and load balanced group of pods
● Replication Controllers
  ● Responsible for maintaining a desired number of copies
  ● Pod lifecycle manager
  ● Analogous to Auto Scaling groups in AWS

Scheduling

Two level scheduling

Mesos Master and Agents

● Abstract resources into a single pool
● Offers and tracks resources
● Guarantees isolation
● Handles workload restart on failure

Mesos Framework

● Consumes resources
● Deploys tasks
● Provides application specific logic for deployment, recovery, upgrade
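
The split of responsibilities above can be sketched as a toy offer cycle: the master pools and offers resources, and the framework decides which offers to use. This is a simplified model, not the Mesos API. The `Master`, `Framework` and `Offer` classes are hypothetical, and `resource_offers` and `launch` only loosely mirror the `resourceOffers()` and `launchTasks()` callbacks.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Offer:
    agent: str
    cpus: float
    mem: int  # MiB


class Master:
    """First level: pools agent resources and offers them to a framework."""

    def __init__(self, agents: Dict[str, Tuple[float, int]]):
        self.free = dict(agents)  # agent -> (cpus, mem) not yet allocated

    def make_offers(self) -> List[Offer]:
        return [Offer(a, c, m) for a, (c, m) in self.free.items() if c > 0 and m > 0]

    def launch(self, agent: str, cpus: float, mem: int) -> None:
        c, m = self.free[agent]
        self.free[agent] = (c - cpus, m - mem)


class Framework:
    """Second level: application specific logic picks offers for its tasks."""

    def __init__(self, pending: List[Tuple[str, float, int]]):
        self.pending = list(pending)  # (task name, cpus, mem)

    def resource_offers(self, master: Master, offers: List[Offer]) -> List[Tuple[str, str]]:
        launched = []
        for offer in offers:
            cpus, mem = offer.cpus, offer.mem
            for task in list(self.pending):
                name, t_cpus, t_mem = task
                if t_cpus <= cpus and t_mem <= mem:
                    # Accept part of the offer and launch the task on that agent.
                    master.launch(offer.agent, t_cpus, t_mem)
                    cpus, mem = cpus - t_cpus, mem - t_mem
                    self.pending.remove(task)
                    launched.append((name, offer.agent))
        return launched
```

A task that fits no offer simply stays in `pending` until a later offer arrives; in this model that is a waiting state, not an error.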

Mesos Scheduling

● Many Schedulers
● DRF (Dominant Resource Fairness) works for high demand schedulers
● Random better for undisciplined frameworks
● Often competing Frameworks
● Every scheduler is potentially different…
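
DRF stands for Dominant Resource Fairness: each framework's dominant share is the largest fraction it holds of any single resource, and the next offer goes to the framework with the smallest dominant share. A minimal sketch of that rule follows. The function names and the example numbers are assumptions for illustration, and the real Mesos allocator also handles roles, weights and quotas.

```python
from typing import Dict


def dominant_share(allocated: Dict[str, float], total: Dict[str, float]) -> float:
    """Largest fraction of any single cluster resource held by a framework."""
    return max(allocated.get(r, 0.0) / total[r] for r in total)


def next_framework(allocations: Dict[str, Dict[str, float]], total: Dict[str, float]) -> str:
    """Pick the framework with the smallest dominant share for the next offer."""
    return min(allocations, key=lambda f: dominant_share(allocations[f], total))
```

With a cluster of 9 CPUs and 18 GB, a framework holding 2 CPUs and 8 GB is memory dominant (8/18), one holding 6 CPUs and 2 GB is CPU dominant (6/9), so the first one is offered resources next.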

K8s Scheduling

● One Scheduler

https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/

● Except in 1.16, which introduced multiple schedulers

https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/

● Controllers / Operators augment the scheduler

“A Kubernetes Operator helps extend the types of applications that can run on Kubernetes by allowing developers to provide additional knowledge to applications that need to maintain state.” –Jonathan S. Katz

https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/

https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

Failure Modes

● Mesos
  ● Not scheduling (isn’t a failure)
  ● No matching option
  ● Loose coupling
● K8s
  ● Failure based on inability to schedule

Pods

Containers

NAMESPACES VS. CONTROL GROUPS

Namespaces provide isolation:

• pid (processes)
• net (network interfaces, routing...)
• ipc (System V IPC)
• mnt (mount points, filesystems)
• uts (hostname)
• user (UIDs)

Control groups control resources:

• cpu (CPU shares)
• cpuset (limit processes to a CPU)
• memory (swap, dirty pages)
• blkio (throttle reads/writes)
• devices
• net_cls, net_prio: control packet class and priority

Mesos Container Runtime

● Docker
● Mesos / Universal Containerizer (UCR)

● Legacy Support (no docker image)

In-Common Mesos & K8s Pods

● Seccomp
● Linux capabilities support

Mesos vs K8s Pods

● Set of Containers
● Init containers
● Pre and post hooks

● Vs TaskGroups
● Independent

Mesos vs K8s Pods Ownership

● Mesos Pod / Container is “Owned” by a Framework
● No concept in k8s like this

● Result
  ● Specialized rules based on different “controller” vs consistent behavior
  ● More control or type of containers

K8S Pod Life cycle

Pending: The Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created. This includes time before being scheduled as well as time spent downloading images over the network, which could take a while.

Running: The Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting.

Succeeded: All Containers in the Pod have terminated in success, and will not be restarted.

Failed: All Containers in the Pod have terminated, and at least one Container has terminated in failure. That is, the Container either exited with non-zero status or was terminated by the system.

Unknown: For some reason the state of the Pod could not be obtained, typically due to an error in communicating with the host of the Pod.
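
The phase definitions above can be read as a small decision function over container states. The sketch below is a deliberate simplification and not kubelet logic: `pod_phase` and the state encoding ("waiting", "running", or an integer exit code) are assumptions, and it ignores restart policy and the Unknown phase.

```python
from typing import List, Union

# Assumed encoding: "waiting" (not yet created), "running", or an exit code.
ContainerState = Union[str, int]


def pod_phase(bound_to_node: bool, containers: List[ContainerState]) -> str:
    # Pending: not yet scheduled, or some container has not been created.
    if not bound_to_node or any(c == "waiting" for c in containers):
        return "Pending"
    # Running: at least one container is still running.
    if any(c == "running" for c in containers):
        return "Running"
    # All containers have terminated: success only if every exit code is 0.
    if all(c == 0 for c in containers):
        return "Succeeded"
    return "Failed"
```

For instance, a bound pod whose containers exited with codes 0 and 137 maps to Failed, while one with a container still running maps to Running.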

Mesos Pod Lifecycle

TASK_DROPPED: The task failed to launch because of a transient error.
TASK_ERROR: TERMINAL: The task description contains an error.
TASK_FAILED: TERMINAL: The task failed to finish successfully.
TASK_FINISHED: The task finished successfully on its own without external interference.
TASK_GONE: The task is no longer running.
TASK_GONE_BY_OPERATOR: The task was running on an agent that the master cannot contact; the operator has asserted that the agent has been shutdown, but this has not been directly confirmed by the master.
TASK_KILLED: TERMINAL: The task was killed by the executor.
TASK_KILLING: NOTE: This should only be sent when the framework has the TASK_KILLING_STATE capability.
TASK_LOST: In Mesos 1.3, this will only be sent when the framework does NOT opt-in to the PARTITION_AWARE capability.
TASK_RUNNING: The task is running.
TASK_STAGING: Initial state.
TASK_STARTING: The task is being launched by the executor.
TASK_UNKNOWN: The master has no knowledge of the task.
TASK_UNREACHABLE: The task was running on an agent that has lost contact with the master, typically due to a network failure or partition.

Pod Edge Cases

●Side car containers

●K8s “job” pod with side car containers

Resource Selection

Mesos

● Resources
● CPU
● GPU
● vs Attributes
● SSD
● Persistent Volume

Mesos Container Affinity

● Framework based
● Easy node affinity, attribute affinity
● Easy to have resource anti-affinity

● Harder to have container A affinity to container B (up to the scheduler)
● Harder still when container A and container B have different framework schedulers
● Hard to have daemon sets (which is super easy in k8s)

K8s Container Affinity

● Node affinity / anti-affinity
● Pod affinity / anti-affinity
  requiredDuringSchedulingIgnoredDuringExecution
  preferredDuringSchedulingIgnoredDuringExecution
● Taints and tolerations

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity

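# apps/v1 replaces extensions/v1beta1, which stopped serving Deployments in Kubernetes 1.16
apiVersion: apps/v1
kind: Deployment
metadata:
  name: say-deployment
spec:
  replicas: 3
  # apps/v1 requires an explicit selector that matches the pod template labels
  selector:
    matchLabels:
      app: say
  template:
    metadata:
      labels:
        app: say
    spec:
      containers:
        - name: say
          image: gcr.io/hazel-champion-200108/say
          ports:
            - containerPort: 8080
      affinity:
        podAntiAffinity:
          # Never place two pods labeled app=say on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - say
              topologyKey: "kubernetes.io/hostname"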
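
The effect of the required pod anti-affinity in the manifest above can be illustrated with a toy placement function: with a hostname topology key, each replica needs its own node, and a replica with no free node is left unplaced. This is a sketch of the constraint only, not scheduler code. `place_replicas` and the node names are hypothetical.

```python
from typing import Dict, List, Optional


def place_replicas(replicas: int, nodes: List[str]) -> Dict[str, Optional[str]]:
    """Place replicas so that no two share a node (hostname topology)."""
    used = set()
    placement: Dict[str, Optional[str]] = {}
    for i in range(replicas):
        # Required anti-affinity: only nodes without a matching pod qualify.
        free = [n for n in nodes if n not in used]
        node = free[0] if free else None  # None: the replica stays Pending
        if node is not None:
            used.add(node)
        placement[f"say-{i}"] = node
    return placement
```

With three replicas and only two nodes, the third replica has nowhere to go, which is why a `required` rule should be sized against the number of nodes available.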

Scaling

Scale - What do you mean?

● Number of Nodes / Agents?
● Number of schedulers / operators?
● Number of a specific type of app (persistent volumes)?
● Number of Pods in cluster?
● Number of Pods on node?
● Scaling Applications?

Mesos

● Multiple orgs running 10,000s of nodes / agents; 2 with 100,000+ agents
● Most state is on Agents
● Master handles very little state… it is all loosely coupled

● 300+ pods on an agent (UCR only)

Kubernetes

● No more than 5,000 nodes
● No more than 150,000 total pods
● No more than 300,000 total containers
● No more than 100 pods per node
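
The limits listed above are easy to turn into a capacity planning check. The snippet below is only a sketch: the `LIMITS` table copies the four numbers from this slide, while `exceeded_limits` and its parameter names are hypothetical.

```python
from typing import List

# Numbers taken from the slide above; the keys are made up for this example.
LIMITS = {
    "nodes": 5000,
    "pods": 150000,
    "containers": 300000,
    "pods_per_node": 100,
}


def exceeded_limits(nodes: int, pods: int, containers: int,
                    max_pods_on_a_node: int) -> List[str]:
    """Return the names of the listed limits that a planned cluster exceeds."""
    actual = {
        "nodes": nodes,
        "pods": pods,
        "containers": containers,
        "pods_per_node": max_pods_on_a_node,
    }
    return [name for name, limit in LIMITS.items() if actual[name] > limit]
```

A plan with 6,000 nodes and up to 110 pods on one node would be flagged for `nodes` and `pods_per_node`.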

https://kubernetes.io/docs/setup/best-practices/cluster-large/

Other

Kubernetes

●The events / log is fantastic!!!

●Resources can be extended with CRD

https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

Thank you very much!