Introduction to Kubeflow
aronchick@

Machine Learning is a way of solving problems without explicitly knowing how to create the solution

Google DC Ops
PUE == Power Usage Effectiveness

But...
[Diagram: Most Folks, LOTS OF PAIN, Magical AI Goodness]

Why the Gap?
Composability
Portability
Scalability

Composability
[Diagram: ML pipeline stages]
Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting
Trainer, Building a Model, Model Validation, Training At Scale
Roll-out, Serving, Monitoring, Logging

Portability
Each ML Stage is an Independent System
[Diagram: the same pipeline stages, grouped into System 1 through System 6]

Portability
Laptop
Model
UX
Tooling
Framework
Storage
Runtime
Drivers
OS
Accelerator
HW

Portability
Laptop        Training Rig   Cloud
Model         Model          Model
UX            UX             UX
Tooling       Tooling        Tooling
Framework     Framework      Framework
Storage       Storage        Storage
Runtime       Runtime        Runtime
Drivers       Drivers        Drivers
OS            OS             OS
Accelerator   Accelerator    Accelerator
HW            HW             HW

Scalability
● Machine specific HW (GPU)
● Limited (or unlimited) compute
● Network & storage constraints
  ○ Rack, Server Locality
  ○ Bandwidth constraints
● Heterogeneous hardware
● HW & SW lifecycle management
● Scale isn’t JUST about adding new machines!
  ○ Intern vs Researcher
  ○ Scale to 1000s of experiments

You Know What’s Really Good at Composability, Portability, and Scalability?
Containers and Kubernetes

Kubernetes for ML
[Diagram: layered stack, top to bottom]
Namespace, Spark, Jupyter, Airflow, Quota, Logging, NFS, Cassandra, TensorFlow, TF-Serving, Monitoring, RBAC, Ceph, MySQL, Caffe, Flask+Scikit
Kubernetes
Operating system (Linux, Windows)
CPU, Memory, SSD, Disk, GPU, FPGA, ASIC, NIC
GCP, AWS, Azure, On-prem

Kubernetes for ML
● Supports accelerators in an extensible manner
  ○ GPUs already in progress
  ○ Support for FPGAs, high perf NICs under discussion
● Existing Controllers simplify devops challenges
  ○ K8S Jobs for Training
  ○ K8S Deployments for Serving
● Handles 1000s of nodes
● Container base images for ML workloads

But Wait, There’s More!
● Kubernetes native scaling objects
  ○ Autoscaling cluster based on workload metrics
  ○ Priority eviction for removal of low priority jobs
  ○ Scales to a large number of pods (experiments)
● Passes through cluster specs for specific needs
  ○ Scheduling jobs where the data needed to run them is located
  ○ Node labels for Heterogeneous HW (more in the future)
  ○ Manage SW drivers and HW health via addons

But... Oh, you want to use ML on K8s?
Before that, can you become an expert in:
● Containers
● Packaging
● Kubernetes service endpoints
● Persistent volumes
● Scaling
● Immutable deployments
● GPUs, Drivers & the GPL
● Cloud APIs
● DevOps
● ...

Kubeflow
Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)

Kubernetes + ML = Kubeflow = Win
● Composability
  ○ Choose from existing popular tools
  ○ Uses ksonnet packaging for easy setup
● Portability
  ○ Build using cloud native, portable Kubernetes APIs
  ○ Let K8s community solve for your deployment
● Scalability
  ○ TF already supports CPU/GPU/distributed
  ○ K8s scales to 5k nodes with same stack

Portability
Laptop Training Rig Cloud
Model Model Model
UX UX UX
Tooling Tooling Tooling
Framework Framework Framework
Storage Storage Storage
Runtime Runtime Runtime
Drivers Drivers Drivers
OS OS OS
Accelerator Accelerator Accelerator
HW HW HW

Portability
[Diagram, built up over several slides: the Kubeflow label appears first between the Storage and Runtime rows, then on the Tooling row of the Laptop, Training Rig, and Cloud stacks in turn]
Laptop             Training Rig       Cloud
Model              Model              Model
UX                 UX                 UX
Kubeflow Tooling   Kubeflow Tooling   Kubeflow Tooling
Framework          Framework          Framework
Storage            Storage            Storage
Kubeflow
Runtime            Runtime            Runtime
Drivers            Drivers            Drivers
OS                 OS                 OS
Accelerator        Accelerator        Accelerator
HW                 HW                 HW

What’s in the Box?
● JupyterHub - for collaborative & interactive training
● A TensorFlow Training Controller
● A TensorFlow Serving Deployment
● Argo for workflows
● SeldonCore for complex inference and non-TF models
● Reverse Proxy (Ambassador)
● Wiring to make it work on any Kubernetes anywhere

What’s in the Box?
[Diagram: ML pipeline stages]
Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting
Trainer, Building a Model, Model Validation, Training At Scale
Roll-out, Serving, Monitoring, Logging

Using Kubeflow
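The ksonnet commands that follow assume both the `ks` and `kubectl` CLIs are already installed. As a small sketch (this check is an addition, not part of the deck), the snippet below only reports whether each tool is on the PATH; it does not contact a cluster.

```shell
#!/bin/sh
# Report whether the two CLIs used by the following commands are on PATH.
# This is an illustrative preflight check, not part of Kubeflow itself.
missing=0
for tool in ks kubectl; do
  if command -v "${tool}" >/dev/null 2>&1; then
    echo "found: ${tool}"
  else
    echo "missing: ${tool}"
    missing=$((missing + 1))
  fi
done
echo "${missing} tool(s) missing"
```

If either tool is reported missing, install it before continuing.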
# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}

# Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job

# Deploy Kubeflow
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks apply default -c kubeflow-core

Don’t Like TensorFlow?
# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}

# Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job
ks pkg install kubeflow/sklearn-job # Soon

# Deploy Kubeflow
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks apply default -c kubeflow-core

Don’t Like TF Serving?
# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}

# Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/seldon-core # Soon
ks pkg install kubeflow/tf-job

# Deploy Kubeflow
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks apply default -c kubeflow-core

That’s It?
Yes… (For Now)

We’re Just Getting Started!
● Who’s helping?
  ○ Red Hat, Weave, CaiCloud, Canonical, many more
● What’s next...
  ○ Easy to use accelerator integration
  ○ Support for other popular tools like Spark ML, XGBoost, sklearn
  ○ Autoscaled TF Serving
  ○ tf.transform (programmatic data transforms)
● You tell us! (Or better yet, help!)

Kubeflow is Open
- open community
- open design
- open source
- open to ideas

https://github.com/kubeflow/kubeflow
slack: kubeflow (http://kubeflow.slack.com)
twitter: @kubeflow
@aronchick ([email protected])
@jeremylewi ([email protected])

Portability
● As a data scientist, you want to use the right HW for the job
● Every variation is an opportunity for pain
  ○ GPUs/FPGAs, ASICs, NICs
  ○ Kernel drivers, libraries, performance
● Even within an ML framework, dependencies cause chaos
  ○ Package management
  ○ ML compilation
[Diagram: Container (App, App Library), Kernel Drivers, GPU, FPGA, Infiniband]
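One way Kubernetes can express "the right HW for the job" is a Job whose container requests an accelerator and whose pod is pinned to labeled nodes, tying together the "K8S Jobs for Training" and "Node labels for Heterogeneous HW" points from earlier slides. The snippet below is a minimal sketch, not something from the deck: the Job name, image name, the `accelerator: gpu` node label, and the file name are all hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster. It only writes the manifest to disk; submitting it is left commented out.

```shell
#!/bin/sh
# Write a sketch of a GPU training Job manifest to a local file.
# Hypothetical values: Job name, image name, node label, and file name.
MANIFEST=train-job.yaml

cat > "${MANIFEST}" <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: train-example
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      # Node label is an assumption; label your GPU nodes to match.
      nodeSelector:
        accelerator: gpu
      containers:
      - name: trainer
        image: example.com/my-trainer:latest
        resources:
          limits:
            # Assumes the NVIDIA device plugin exposes this resource.
            nvidia.com/gpu: 1
EOF

echo "wrote ${MANIFEST}"
# To submit it to a cluster (not run here):
# kubectl apply -f "${MANIFEST}"
```

Keeping the hardware requirement in the manifest, rather than in the container image, means the same image can be scheduled onto different node types by changing only the selector and the resource limit.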