Introduction to Kubeflow
aronchick@

Machine learning is a way of solving problems without explicitly knowing how to create the solution

DC Ops
PUE == Power Usage Effectiveness

But...
Magical AI Goodness | LOTS OF PAIN | Most Folks

Why the Gap?
Composability
Portability
Scalability

Composability

Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting
Building a Model, Model Validation, Training At Scale, Trainer
Roll-out, Serving, Monitoring, Logging

Portability
Each ML Stage is an Independent System
[Diagram: the same pipeline stages, labeled System 1 through System 6]

Portability

Laptop        Training Rig  Cloud
Model         Model         Model
UX            UX            UX
Tooling       Tooling       Tooling
Framework     Framework     Framework
Storage       Storage       Storage
Runtime       Runtime       Runtime
Drivers       Drivers       Drivers
OS            OS            OS
Accelerator   Accelerator   Accelerator
HW            HW            HW

Scalability

● Machine specific HW (GPU)
● Limited (or unlimited) compute
● Network & storage constraints
  ○ Rack, Server Locality
  ○ Bandwidth constraints
● Heterogeneous hardware
● HW & SW lifecycle management
● Scale isn’t JUST about adding new machines!
  ○ Intern vs Researcher
  ○ Scale to 1000s of experiments

You Know What’s Really Good at Composability, Portability, and Scalability?

Containers and Kubernetes for ML

Namespace, Spark, Jupyter, Airflow, Quota, Logging, NFS, Cassandra, TensorFlow, TF-Serving, Monitoring, RBAC, Ceph, MySQL, Caffe, Flask+Scikit

Kubernetes

Operating system (Linux, Windows)

CPU, Memory, SSD, Disk, GPU, FPGA, ASIC, NIC

GCP, AWS, Azure, On-prem

Kubernetes for ML

● Supports accelerators in an extensible manner
  ○ GPUs already in progress
  ○ Support for FPGAs, high perf NICs under discussion
● Existing Controllers simplify devops challenges
  ○ K8S Jobs for Training
  ○ K8S Deployments for Serving
● Handles 1000s of nodes
● Container base images for ML workloads

But Wait, There’s More!

● Kubernetes native scaling objects
  ○ Autoscaling cluster based on workload metrics
  ○ Priority eviction for removal of low priority jobs
  ○ Scaled to large number of pods (experiments)
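
The priority eviction bullet above can be sketched with a Kubernetes PriorityClass. This is a minimal sketch and not part of the original deck: the helper `render_priority_class`, the class names, and the numeric values are all hypothetical. The script only prints manifests; applying them to a cluster is left as a commented step.

```shell
#!/bin/sh
# Sketch: print PriorityClass manifests for low- and high-priority ML work.
# The helper name, class names, and values are illustrative assumptions.

render_priority_class() {
  # $1 = class name, $2 = integer priority (higher wins), $3 = description
  cat <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: $1
value: $2
globalDefault: false
description: "$3"
EOF
}

render_priority_class experiment-low 1000 "Exploratory runs that may be preempted"
echo "---"
render_priority_class training-high 100000 "Long training jobs that take precedence over experiments"

# To create the classes on a cluster, pipe the output to: kubectl apply -f -
```

A pod opts in by setting `priorityClassName` in its spec. When the cluster is full, the scheduler may preempt lower-priority pods to make room for higher-priority ones.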

● Passes through cluster specs for specific needs
  ○ Schedule jobs where the data they need is located
  ○ Node labels for heterogeneous HW (more in the future)
  ○ Manage SW drivers and HW health via addons

But...

Oh, you want to use ML on K8s?

Before that, can you become an expert in:
● Containers
● Packaging
● Kubernetes service endpoints
● Persistent volumes
● Scaling
● Immutable deployments
● GPUs, Drivers & the GPL
● Cloud APIs
● DevOps
● ...

Kubeflow
Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)

Kubernetes + ML = Kubeflow = Win
● Composability
  ○ Choose from existing popular tools
  ○ Uses ksonnet packaging for easy setup
● Portability
  ○ Build using cloud native, portable Kubernetes APIs
  ○ Let K8s community solve for your deployment
● Scalability
  ○ TF already supports CPU/GPU/distributed
  ○ K8s scales to 5k nodes with same stack

Portability

Laptop            Training Rig      Cloud
Model             Model             Model
UX                UX                UX
Kubeflow Tooling  Kubeflow Tooling  Kubeflow Tooling
Framework         Framework         Framework
Storage           Storage           Storage
                  Kubeflow
Runtime           Runtime           Runtime
Drivers           Drivers           Drivers
OS                OS                OS
Accelerator       Accelerator       Accelerator
HW                HW                HW

What’s in the Box?
● JupyterHub - for collaborative & interactive training
● A TensorFlow Training Controller
● A TensorFlow Serving Deployment
● Argo for workflows
● Seldon Core for complex inference and non-TF models
● Reverse Proxy (Ambassador)
● Wiring to make it work on any Kubernetes anywhere

What’s in the Box?

Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting
Building a Model, Model Validation, Training At Scale, Trainer
Roll-out, Serving, Monitoring, Logging

Using Kubeflow
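
The commands in this walkthrough assume that the ksonnet CLI (`ks`) and `kubectl` are already installed. A minimal pre-flight check might look like the sketch below; `missing_tools` is a hypothetical helper written for illustration, not something shipped with ksonnet or Kubernetes.

```shell
#!/bin/sh
# Sketch: report which required command-line tools are missing from PATH.
# missing_tools is a hypothetical helper, not part of ksonnet or kubectl.

missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the tool cannot be found
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools ks kubectl)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "ks and kubectl found"
fi
```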

# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init "${APP_NAME}"
cd "${APP_NAME}"

# Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job

# Deploy Kubeflow
NAMESPACE=kubeflow
kubectl create namespace "${NAMESPACE}"
ks generate core kubeflow-core --name=kubeflow-core --namespace="${NAMESPACE}"
ks apply default -c kubeflow-core

Don’t Like TensorFlow?

# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init "${APP_NAME}"
cd "${APP_NAME}"

# Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job
ks pkg install kubeflow/sklearn-job # Soon

# Deploy Kubeflow
NAMESPACE=kubeflow
kubectl create namespace "${NAMESPACE}"
ks generate core kubeflow-core --name=kubeflow-core --namespace="${NAMESPACE}"
ks apply default -c kubeflow-core

Don’t Like TF Serving?

# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init "${APP_NAME}"
cd "${APP_NAME}"

# Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/-core # Soon
ks pkg install kubeflow/tf-job

# Deploy Kubeflow
NAMESPACE=kubeflow
kubectl create namespace "${NAMESPACE}"
ks generate core kubeflow-core --name=kubeflow-core --namespace="${NAMESPACE}"
ks apply default -c kubeflow-core

That’s It?

Yes… (For Now)

We’re Just Getting Started!

● Who’s helping?
  ○ Red Hat, Weave, CaiCloud, Canonical, many more

● What’s next...
  ○ Easy to use accelerator integration
  ○ Support for other popular tools like Spark ML, XGBoost, sklearn
  ○ Autoscaled TF Serving
  ○ tf.transform (programmatic data transforms)

● You tell us! (Or better yet, help!)

Kubeflow is Open
- open community
- open design
- open source
- open to ideas

https://github.com/kubeflow/kubeflow
slack: kubeflow (http://kubeflow.slack.com)
twitter: @kubeflow
@aronchick
@jeremylewi

Portability

● As a data scientist, you want to use the right HW for the job
● Every variation is an opportunity for pain
  ○ GPUs/FPGAs, ASICs, NICs
  ○ Kernel drivers, libraries, performance
● Even within an ML framework, dependencies cause chaos
  ○ Package management
  ○ ML compilation

[Diagram, top to bottom: Container; App; App Library; Kernel Drivers; GPU, FPGA, InfiniBand]
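
One way to express "the right HW for the job" on Kubernetes is a Job that requests an accelerator and is pinned to suitably labeled nodes. The sketch below only prints such a manifest. It rests on several assumptions that are not in the deck: the cluster exposes GPUs as the `nvidia.com/gpu` resource (which requires the NVIDIA device plugin), nodes carry a hypothetical `accelerator: nvidia-gpu` label, and the image, job name, and `train.py` entrypoint are placeholders. The helper `render_training_job` is likewise hypothetical.

```shell
#!/bin/sh
# Sketch: print a Kubernetes Job manifest that requests GPUs for training.
# Job name, image, entrypoint, and node label are illustrative assumptions.

render_training_job() {
  # $1 = job name, $2 = container image, $3 = number of GPUs
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: $1
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        accelerator: nvidia-gpu   # hypothetical node label
      containers:
      - name: trainer
        image: $2
        command: ["python", "train.py"]   # hypothetical entrypoint
        resources:
          limits:
            nvidia.com/gpu: $3   # assumes the NVIDIA device plugin
EOF
}

render_training_job my-training tensorflow/tensorflow:latest-gpu 1

# To submit the job to a cluster, pipe the output to: kubectl apply -f -
```

Keeping the driver and accelerator details in the manifest, rather than in the training code, is what lets the same container move between a laptop, a training rig, and a cloud cluster.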