Rokesh Jankie rokesh@.com

11th October 2018 | World AI Summit 2018

Introduction to Kubeflow

What is Kubeflow?

First, Start with a Question: How Much Is My House Worth?
[Slide build: a scatter plot of House Price vs. Square Footage, with a line fitted through the points.]

Now, Answer Your Question
[Reading the fitted line: a 2,100 sq. ft. house comes out at about $339K.]

Congrats, You're a Machine Learning Expert!

But... Things Can Get Complicated
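Before the complications: reading a price off the fitted line is just least-squares linear regression. A minimal sketch with NumPy and made-up data (only the 2,100 sq. ft. ≈ $339K point comes from the slides; every other number is invented for illustration):

```python
import numpy as np

# Illustrative data: square footage vs. sale price in $K (made-up numbers).
sqft = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price_k = np.array([180.0, 250.0, 325.0, 400.0, 470.0])

# Fit price = a * sqft + b by least squares.
A = np.vstack([sqft, np.ones_like(sqft)]).T
(a, b), *_ = np.linalg.lstsq(A, price_k, rcond=None)

def predict(square_feet: float) -> float:
    """Read the fitted line at the given square footage."""
    return a * square_feet + b

estimate = predict(2100.0)  # roughly $339K, as on the slide
```

That is the whole "expert" move: fit a line, read it off.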

[Slide build: House Price vs. Square Footage plots where a single straight line no longer fits: Non-Linear, Multi-Dimensional, Groupings, Changes over Time.]

Machine learning is a way of solving problems without explicitly knowing how to create the solution.

Google DC Ops: PUE == Power Usage Effectiveness

But... for Most Folks, the Magical AI Goodness Comes with LOTS OF PAIN

CLOUD AI PLATFORM'S GOAL
Enable generalist software engineers to easily build and run custom AI applications anywhere.

AI Problems Today
● Talent: lack of expertise in ML
● Ecosystem: difficult to find and leverage existing solutions
● Flexibility: brittle, opinionated infrastructure that breaks between cloud and on-prem

Solutions

1. Reusable Pipelines
2. Marketplace
3. Kubeflow

ML Requirements
● Composability
● Portability
● Scalability

Composability
[Pipeline diagram:] Data Ingestion → Data Analysis → Data Validation → Data Splitting → Data Transformation → Trainer → Building a Model → Model Validation → Training at Scale → Roll-out → Serving → Monitoring → Logging

Portability: Each ML Stage Is an Independent System
[The same pipeline, with the stages grouped into six independent systems (System 1 through System 6), each of which has to move with you.]
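One way to read the composability requirement above: every stage in the diagram is an independent unit with a plain data-in/data-out contract, so any stage can be swapped without touching the rest. A toy sketch (the stage names come from the slide; the bodies are placeholders I invented):

```python
from typing import Callable, List

# Each stage: take a list of records, return a transformed list.
Stage = Callable[[list], list]

def ingest(_: list) -> list:
    # Placeholder for "Data Ingestion": normally reads from storage.
    return [{"sqft": 1000, "price": 180}, {"sqft": 2000, "price": 325}]

def validate(records: list) -> list:
    # Placeholder for "Data Validation": drop malformed rows.
    return [r for r in records if r.get("sqft") and r.get("price")]

def transform(records: list) -> list:
    # Placeholder for "Data Transformation": derive a feature.
    return [{**r, "price_per_sqft": r["price"] / r["sqft"]} for r in records]

def run_pipeline(stages: List[Stage]) -> list:
    data: list = []
    for stage in stages:
        data = stage(data)
    return data

result = run_pipeline([ingest, validate, transform])
```

Swapping the trainer or the validator is then just passing a different function into `run_pipeline`.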

[Stack slide: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW. The identical stack is then shown on a Laptop, on a Training Rig, and in the Cloud: every layer must be reproduced in each environment.]

Portability

● As a data scientist, you want to use the right HW for the job
● Every variation is an opportunity for pain
  ○ GPUs, FPGAs, ASICs, NICs
  ○ Kernel drivers, libraries, performance
● Even within an ML framework, dependencies cause chaos
  ○ Package management
  ○ ML compilation
[Diagram: Container → App → App Library → Drivers → Kernel → GPU / FPGA / Infiniband]

Scalability
● Machine-specific HW (GPU, TPU)
● Limited (or unlimited) compute
● Network & storage constraints
  ○ Rack and server locality
  ○ Bandwidth constraints
● Heterogeneous hardware
● HW & SW lifecycle management
● Scale isn't JUST about adding new machines!
  ○ Intern vs. researcher
  ○ Scale to 1000s of experiments

You Know What's Really Good at Composability, Portability, and Scalability? Containers and Kubernetes.

ICYMI (in case you missed it :))

Seven cloud products with one billion users. Google's mission: organize the world's information and make it universally accessible and useful.

Kubernetes for ML

[Stack diagram:]
Workloads: Spark · Jupyter · Airflow · NFS · Cassandra · TensorFlow · TF-Serving · Ceph · MySQL · Caffe · Flask+Scikit
Kubernetes primitives: Namespace · Quota · Logging · Monitoring · RBAC
Kubernetes
Operating system (Linux, Windows)
Hardware: CPU · Memory · SSD · Disk · GPU · FPGA · ASIC · NIC
Runs on: GCP · AWS · Azure · On-prem

Kubernetes for ML
● Supports accelerators in an extensible manner
  ○ GPUs already in progress
  ○ Support for custom accelerators, high-perf NICs
● Existing controllers simplify devops challenges
  ○ K8s Jobs for training
  ○ K8s Deployments for serving
● Handles 1000s of nodes
● Container base images for ML workloads

But Wait, There's More!
● Kubernetes-native scaling objects
  ○ Autoscaling the cluster based on workload metrics
  ○ Priority eviction for removal of low-priority jobs
  ○ Scaled to large numbers of pods (experiments)
● Passes through cluster specs for specific needs
  ○ Scheduling jobs where the data needed to run them is
  ○ Node labels for heterogeneous HW (more in the future)
  ○ Manage SW drivers and HW health via addons

But... Oh, you want to use ML on K8s?
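To ground the "K8s Jobs for training" bullet above: a one-shot training run maps onto a minimal Kubernetes Job. A sketch of such a manifest as a Python dict (the name, image, command, and GPU limit are placeholder choices I made up; in practice you would serialize this to YAML and `kubectl apply` it):

```python
# A minimal Kubernetes Job manifest for a one-shot training run,
# expressed as the Python dict you would serialize to YAML.
training_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "train-house-model"},  # hypothetical name
    "spec": {
        "backoffLimit": 2,  # retry a failed training run twice at most
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": "example.com/trainer:latest",  # placeholder image
                    "command": ["python", "train.py"],      # placeholder command
                    # Request one GPU via the standard device-plugin resource.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }],
            }
        },
    },
}
```

The point of the bullet is that this is all stock Kubernetes: no ML-specific controller is required for the simple case.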

Before that, can you become an expert in:
● Containers
● Packaging
● Kubernetes service endpoints
● Persistent volumes
● Scaling
● Immutable deployments
● GPUs, drivers & the GPL
● Cloud APIs
● DevOps
● ...

Kubeflow
Make it easy for everyone to develop, deploy and manage portable, distributed ML on Kubernetes (everywhere).

Kubernetes + ML = Kubeflow = Win
● Composability
  ○ Choose from existing popular tools
  ○ Uses ksonnet packaging for easy setup
● Portability
  ○ Build using cloud-native, portable Kubernetes APIs
  ○ Let the K8s community solve deployment for you
● Scalability
  ○ TF already supports CPU/GPU/distributed
  ○ K8s scales to 5k nodes with the same stack

Portability

[The stack slide, revisited: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW across Laptop, Training Rig, and Cloud, with Kubeflow progressively drawn in as a single consistent layer spanning all three environments.]

What's in the Box?
● Jupyter Hub: for collaborative & interactive training
● A TensorFlow Training Controller
● A TensorFlow Serving deployment
● Argo for workflows
● SeldonCore for complex inference and non-TF models
● Reverse proxy (Ambassador)
● Wiring to make it work on any Kubernetes, anywhere
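The TensorFlow Serving deployment in the box serves models over a REST predict endpoint (`/v1/models/<name>:predict`, taking a JSON body with an `instances` list). A sketch that only builds the request payload; the model name, port, and one-feature input layout are illustrative assumptions, not from the talk:

```python
import json

MODEL_NAME = "house-price"  # hypothetical model name
PREDICT_URL = f"http://localhost:8501/v1/models/{MODEL_NAME}:predict"

def make_predict_request(sqft_values):
    """Build a TF Serving REST predict payload: one instance per input."""
    return json.dumps({"instances": [[float(v)] for v in sqft_values]})

body = make_predict_request([2100])
# You would POST `body` to PREDICT_URL, e.g. with requests.post(PREDICT_URL, data=body).
```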

What's in the Box, as a Pipeline
[Data Ingestion → Data Analysis → Data Validation → Data Splitting → Data Transformation → Trainer → Building a Model → Model Validation → Training at Scale → Roll-out → Serving → Monitoring → Logging]

Some Concepts: Using Kubeflow

  # Initialize a ksonnet app
  APP_NAME=my-kubeflow
  ks init ${APP_NAME}
  cd ${APP_NAME}

  # Install Kubeflow components
  ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
  ks pkg install kubeflow/core
  ks pkg install kubeflow/tf-serving
  ks pkg install kubeflow/tf-job

  # Deploy Kubeflow
  NAMESPACE=kubeflow
  kubectl create namespace ${NAMESPACE}
  ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
  ks apply default -c kubeflow-core

Don't Like TensorFlow?
The same flow, with one extra component:

  ks pkg install kubeflow/sklearn-job # Soon

Don't Like TF Serving?
Again the same flow, with a different serving component:

  ks pkg install kubeflow/-core # Soon

That's It? Yes... Yes... (for now)

Aren't You Giving All The "Good Stuff" Away?

Reality of Open Source ML
1. TensorFlow is already OSS; we're just helping adoption.
2. Our focus is companies not doing ANY ML (90%+ of the market); we need to help them adopt.
3. Customers will naturally associate GCP with "ML Leadership" regardless of Kubeflow's explicitly cloud-neutral stance.
4. We have a cloud built around high-performance ML. We just need more people using ML to "win".

FAQ

● Why are you open sourcing it at all?
  ○ Because we want to win developers everywhere. We can only do this if we run everywhere, including not on GCP.

● Ok, but why move to support other clouds?
  ○ Because we must prevent developers from locking themselves into native solutions, by making Kubeflow on OTHER clouds equal to or better than the native solutions.

● How does a developer running Kubeflow on AWS help us?
  ○ A developer building on SageMaker is lost O(forever).
  ○ A developer building on Kubeflow is one we at least have a chance of migrating to Kubeflow on GCP.

Common Scenarios
● Multiple audiences with one stack
  ○ Laptop => Minikube
  ○ Experimentation => Single node
  ○ Production => Distributed
● Migrate workloads based on specific requirements (e.g. TPU-specific models, new HW availability)
● IoT: build in one place, deploy your models everywhere (including low bandwidth)

Customer Framing

Just need an API? → Use an API!
Want a managed solution? → Cloud Machine Learning Engine!
On-prem/hybrid, or need custom components? → Kubeflow

Analogy

[Analogy slide: images, circa 2008.]

Kubeflow: Ok, But How Is Kubeflow on GCP DIFFERENT and BETTER?

Now, Differentiate...

TL;DR: Kubeflow = easy, portable ML.

Kubeflow on GCP = Best Kubeflow Anywhere.

Kubeflow on GCP
● "I want Kubeflow to be easy to manage"
● "GCP data center Secret Sauce™ to make Kubeflow great"
● "Give me unique services on top of Kubeflow"
Kubeflow++

Time for Coding!
http://bit.ly/wais-kubeflow-codelab
There is a codelab available so you can feel what it's like to use Kubeflow. She was here yesterday! There is more to come...

For more info, check https://kubeflow.org
Cloud Summit Amsterdam: 28th November 2018
Thanks!

"I want Kubeflow to be easy to manage"

Customer Problem: Data scientists/developers want to focus on building solutions to problems, not on provisioning and management of services.

How Kubeflow on GCP Addresses the Need

● GCP SRE/services observing and optimizing your cluster and Kubeflow
● First-class GCP service integration for Kubeflow backend requirements
  ○ Logging -> BigQuery
  ○ Monitoring -> Stackdriver
  ○ Events -> Pub/Sub
  ○ Block store -> GCS
  ○ Filer -> GCP Filer
● Dataflow for data pre-processing and supporting iterative learning on a data stream
● Integration with other GCP services for end-to-end scenarios with Open Service Broker (e.g. Cloud IoT, CMLE)

"GCP data center Secret Sauce™ to make Kubeflow great"

Customer Problem: Enterprises want to be able to turn the speed vs. cost dial based on a per scenario basis, and have an option to use the best infra anywhere.

How Kubeflow on GCP Addresses the Need

● TPUs and TPU Pods, including job scheduling with awareness of TPU pods/cliques
● ML-aware infra provisioning
  ○ Autosizing clusters based on workloads: model size/training time/$$$
  ○ Custom machine types based on model
  ○ Pre-emptible VMs for batch
● "Fancy" scheduling with autopilot/etc.
● Scale-to-zero based on workload requirements

"Give me unique services on top of Kubeflow"

Customer Need: ML is still too hard; enterprises need higher level services based on Google’s person-centuries of expertise.

How Kubeflow on GCP Addresses the Need

● AI Hub delivers pre-packaged models that reduce time to solution
● Embracing hybrid workloads and deployments
  ○ Multi-tenant clusters and data
  ○ Multi-cluster management in GKE with distributed models
  ○ First-class data migration/syncing between clusters
● GCP-managed AutoML running inside Kubeflow
● Hyperparameter search/tuning (Tuner/Vizier)
● Integrated data (Google Analytics, DoubleClick logs, user tracking)
● Bias busting as a service (test your models for bias using our data)
● Enterprise requirements, particularly around security, e.g. BinAuthZ, BCID

That's All Fine, But I Want MORE
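Picking up the hyperparameter search/tuning bullet above: the simplest possible tuner is random search over a log-scaled range. This is a generic sketch I wrote for illustration, not the Tuner/Vizier API; the objective is a stand-in for a real validation loss:

```python
import random

def objective(lr: float) -> float:
    # Stand-in for validation loss: pretend the best learning rate is 0.1.
    return (lr - 0.1) ** 2

def random_search(trials: int, seed: int = 0) -> float:
    """Return the best learning rate found by random search."""
    rng = random.Random(seed)
    best_lr, best_loss = None, float("inf")
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, 0)  # sample on a log scale over [1e-4, 1]
        loss = objective(lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr

best = random_search(200)
```

A real tuner replaces the loop body with launching a training job per trial and uses smarter proposal strategies (e.g. Bayesian optimization), but the contract is the same: propose, evaluate, keep the best.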

REUSABLE PIPELINES

Today: a software engineer (SWE) hand-builds every step: connect to database, create features & labels, create datasets, train a scikit-learn linear regression model (SWE + data scientist), HP tuning, publish model, plus the business logic.

With reusable pipelines: the SWE writes only the business logic; the remaining steps (connect to database, create features & labels, the model itself) come from the AI Hub, and a component can be swapped in place, e.g. the scikit-learn linear regression for a TensorFlow neural network.

AI Hub == Google Play for ML
Audiences: direct customers, research, private AI Hub, partners, SI / ISV, hardware

Kubeflow: Hybrid Deployments
[The same reusable pipeline (connect to database, create features & labels, TensorFlow neural network, business logic) stretched across CLOUD and ON-PREM.]

Kubeflow: Enable ML Everywhere

GCP

[Market context: cloud computing 9%, colocation or multi-tenant 20%, on-prem 71% of compute.]

Hybrid Deployments Provide Workload Mobility

Direct customer and partner workloads move between on-prem, GCP, and other clouds; DATA GRAVITY pulls them toward where the data lives.

What draws those mobile workloads to GCP:
● TPUs
● AI Hub
● Reusable pipelines
● ML research
● Ease of ML deployment
● Cluster management (GKE)

BETTER ON GCP

Current Progress

2018 Kubeflow Goals
Objective: Establish Kubeflow as the number-one hybrid, open-source project designed to make building and running ML stacks easy everywhere.
Reason: Because we want all developers, everywhere, building on the Kubeflow stack.

Key 2018 Goals:
● A better getting-started ML UX anywhere that Kubernetes runs
● 2-3 "industry standard" pipelines
● Better performance/services/scalability/$TBD on GCP (including, but not restricted to, hardware support)
● "Buy-in" from 5-10 key customers/partners/ISVs
● KF deployment on GKE is at least x users, y CPU hours, z GPU hours

2018 Top-Level Roadmap

Phase 0 (0-6 months): Table stakes
● Deliver a polished experience, pre-built models, multiple frameworks
● Support for 2+ ML frameworks
● Swappable serving and UI (TensorBoard, TFJobs UI, K8s dashboard)
● Enterprise support for RBAC/IAM; uses GCP IAM when run on GCP
● Logging and monitoring integration with GCP on GKE

Phase 1 (6-12 months): Achieve escape velocity
● 20 production use cases for Kubeflow
● Simplify rolling out models to production and/or migrating a model to a higher-perf cluster
● Pipeline orchestration that supports the TFX metadata standard
● Tooling/solution for building containers on cluster and packaging models/data
● IAP is easy to set up (can be fully automated by GKE ingress)

Phase 2 (12 months+): GCP differentiators shine
● GCP management/monitoring/visibility (including hosted pipelines/Dataflow/CMLE integration)
● Unique use of GCP underlying services/infra (scheduling intelligence for *Fish Donuts/Pods, autosizing clusters based on model size/training time/$$$, multi-cluster deployments for latency/data-gravity/etc. reasons)
● New/connected GCP services that offer differential value (hyperparameter search/tuning, integrated Google data (Google Analytics, DoubleClick logs, user tracking), Teleport support for moving/copying data)

H1 2018 Roadmap