Kubernetes & AI with Run:AI, Red Hat & Excelero

AI WEBINAR
Date/Time: Tuesday, June 9 | 9 am PST
What’s next in technology and innovation? Kubernetes & AI with Run:AI, Red Hat & Excelero

Your host: Tom Leyden, VP Marketing
Presenters: Omri Geller, CEO & Co-Founder, Run:AI; Gil Vitzinger, Software Developer; William Benton, Engineering Manager, Red Hat

Kubernetes for AI Workloads
Omri Geller, CEO and co-founder, Run:AI

A Bit of History

Bare Metal → Virtual Machines → Containers

Moving from bare metal to virtual machines gave the flexibility and better utilization that was needed; moving to containers added reproducibility and portability.

Containers scale easily; they’re lightweight and efficient, they can run any workload, they’re flexible, and they can be isolated. …But they need orchestration.

Enter Kubernetes

• Track, Schedule and Operationalize
• Create Efficient Cluster Utilization
• Execute Across Different Hardware

Today, 60% of Those Who Deploy Containers Use K8s for Orchestration*

*CNCF

Now let’s talk about AI

Computing Power Fuels Development of AI

[Chart: the computing power required for AI has grown from the era of manual engineering, through classical machine learning, to deep learning.]

Deep Learning is a Completely Different Ballgame

• New accelerators
• Distributed computing
• Experimentation R&D

Data Science Workflows and Hardware Accelerators are Highly Coupled

Data scientists and hardware accelerators:

• Constant hassles
• Workflow limitations
• Under-utilized GPUs

This Leads to Frustration on Both Sides

• Data Scientists are frustrated – speed and productivity are low
• IT leaders are frustrated – GPU utilization is low

AI Workloads are Also Built on Containers

• NGC – Nvidia pre-trained models for AI experimentation on Docker containers
• The container ecosystem for Data Science is growing

How Can We Bridge The Divide?

Kubernetes, the “De-facto” Standard for Container Orchestration

Lacks the following capabilities:
• Multiple queues
• Automatic queueing/de-queueing
• Advanced priorities & policies
• Advanced scheduling algorithms
• Affinity-aware scheduling
• Efficient management of distributed workloads
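For context, the sketch below shows all that stock Kubernetes offers a data scientist who needs a GPU: a per-pod resource request. It is a minimal, illustrative example (the image and names are not from the webinar, and nvidia.com/gpu assumes the NVIDIA device plugin is installed); if no GPU is free the pod simply stays Pending, with no queueing, priorities, or fairness between teams.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-experiment              # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1         # whole GPUs only; served first-come, first-served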

How is Experimentation Different?

Build vs. Training

Distinguishing Between Build and Training Workflows

Build:
• Development & debugging
• Interactive sessions
• Short cycles
• Performance is less important
• Low GPU utilization

Training (see the Job sketch below):
• Training & HPO
• Remote execution
• Long workloads
• Throughput is highly important
• High GPU utilization
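To make the training column concrete, a long, non-interactive training run typically maps onto a Kubernetes Job that is submitted for remote execution and left to run to completion. A minimal sketch, with illustrative names, image, and command:

apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-training             # illustrative name
spec:
  backoffLimit: 2                   # retry a failed run a couple of times
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest        # illustrative training image
          command: ["python", "train.py", "--epochs", "90"] # illustrative entrypoint
          resources:
            limits:
              nvidia.com/gpu: 2     # multi-GPU training; throughput matters here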

How to Solve? Guaranteed Quotas

Fixed quotas:
• Fits build workloads
• GPUs are always available

Guaranteed quotas:
• Fits training workflows
• Users can go over quota
• More concurrent experiments
• More multi-GPU training
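For comparison, the fixed-quota model is what Kubernetes provides out of the box with a ResourceQuota: a hard, static cap per namespace (a minimal sketch; the namespace is illustrative and nvidia.com/gpu assumes the NVIDIA device plugin). Guaranteed quotas, where idle GPUs can be borrowed beyond the quota and reclaimed when their owners need them, are what a scheduler such as Run:AI adds on top of this.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu-quota            # illustrative name
  namespace: team-a                 # illustrative per-team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"    # hard cap: pods beyond 8 GPUs are rejected even if GPUs sit idle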

Queueing Management Mechanism

Run:AI - Stitching it All Together

Run:AI - Applying HPC Concepts to Kubernetes

With the advantages of K8s, plus some concepts from the world of HPC & distributed computing, we can bridge the gap

• Data Science teams gain productivity and speed
• IT teams gain visibility and maximal GPU utilization

Run:AI - Kubernetes-Based Abstraction Layer

INTEGRABLE Easily integrates with IT and Data Science platforms

MULTI-CLOUD Runs on any public, private, or hybrid cloud environment

IT GOVERNANCE Policy-based orchestration and queueing management

Run:AI

Utilize Kubernetes across IT to improve resource utilization

Speed up experimentation process and time to market

Easily scale infrastructure to meet needs of the business

From 28% to 73% utilization, 2X speed, and $1M savings

Challenge:
• 28% average GPU utilization – inefficient and underutilized resources

After implementing Run:AI’s platform:
• 73% average GPU utilization
• Enabled 2x more experiments to run
• Saved $1M in additional GPU expenditures for 2020

Run:AI at-a-Glance

• Founded in 2018
• Venture funded, backed by top VCs
• Offices in Tel Aviv, New York, and Boston
• Fortune 500 customers
• Top cloud and virtualization engineers

Thank you!

NVMesh in Kubernetes
Gil Vitzinger, Software Developer

What is the NVMesh CSI Driver?

● CSI - Container Storage Interface
● NVMesh as a storage backend in Kubernetes

● Main Features
  ○ Static Provisioning
  ○ Dynamic Provisioning
  ○ Block and File System volumes
  ○ Access Modes (ReadWriteOnce, ReadWriteMany, ReadOnlyMany)
  ○ Extend volumes (see the sketch below)
  ○ Using NVMesh VPGs
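As a sketch of how “Extend volumes” surfaces in practice, assuming the NVMesh StorageClass opts in to Kubernetes volume expansion (the class name mirrors the usage example later in the deck):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvmesh-raid10
provisioner: nvmesh-csi.excelero.com
allowVolumeExpansion: true          # lets bound PVCs be resized after creation

# To grow an existing claim, raise spec.resources.requests.storage on the PVC
# (for example from 15Gi to 30Gi); the CSI driver then extends the underlying NVMesh volume.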

CSI Driver Components

[Diagram: the NVMesh CSI Controller is deployed alongside the Kubernetes controller and talks to NVMesh Management over its REST API; an NVMesh CSI Node Driver runs on every node next to the NVMesh Client, and the clients connect to the NVMesh Targets.]

Dynamic Provisioning & Attach Flow

1. The user creates a Persistent Volume Claim (PVC).
2. The Kubernetes controller passes the request to the NVMesh CSI Controller.
3. The CSI Controller sends a Create Volume request to NVMesh Management, which provisions the volume on the NVMesh Targets.
4. The user creates a Pod that uses the PVC.
5. The CSI Controller handles attach/detach for the node the Pod is scheduled on; the NVMesh Client on that node exposes the volume as /dev/nvmesh/v1.
6. The NVMesh CSI Node Driver performs the OS mount, Kubernetes adds its internal mount, and the volume is finally mounted into the Pod; data then flows between the NVMesh Client and the NVMesh Targets.

Exposing an NVMesh Volume in a Pod

1. The NVMesh Client attaches the volume and exposes it on the node as /dev/nvmesh/v1.
2. CSI Stage Volume runs once per volume on the node: for file-system volumes the driver runs mkfs and mounts the device at the kubelet staging path (kubelet/volume/mount); block volumes are staged without a file system.
3. CSI Publish Volume runs for each Pod: the staged volume is bind-mounted into each Pod’s volume path (kubelet/pod1/volumes/v1, kubelet/pod2/volumes/v1), so User App Pod 1 and User App Pod 2 each see the same volume.

Usage Examples

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: block-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 15Gi
  storageClassName: nvmesh-raid10
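A minimal sketch of a Pod consuming this claim; because the claim requests volumeMode: Block, the volume is handed to the container as a raw device through volumeDevices (pod and image names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: block-consumer              # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # illustrative image
      volumeDevices:
        - name: data
          devicePath: /dev/xvda     # raw block device as seen inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: block-pvc        # the claim defined above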

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvmesh-custom-vpg
provisioner: nvmesh-csi.excelero.com
parameters:
  vpg: your_custom_vpg

Summary

NVMesh Benefits for Kubernetes:

● Persistent storage that scales for stateful applications

● Predictable application performance – ensure that storage is not a bottleneck

● Scale your performance and capacity linearly

● Containers in a pod can access persistent storage presented to that pod, but with the freedom to restart the pod on an alternate physical node

● Choice of Kubernetes PVC access mode to match the storage to the application and file system requirements

Machine learning discovery, workflows, and systems on Kubernetes

William Benton, Engineering Manager and Senior Principal Engineer, Red Hat, Inc.

The machine learning workflow: codifying the problem and metrics, data collection and cleaning, feature engineering, model training and tuning, model validation, model deployment, and monitoring and validation.

The ML code itself is only a small part of a production system; around it sit configuration, data collection, data verification, feature extraction, machine resource management, process management, serving infrastructure, monitoring, and analysis tools. (Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015)

[Architecture diagram: events, databases, and file/object storage feed transform, federate, and archive steps; data scientists train models from the transformed data through a developer UI; application developers consume the served models in web and mobile apps, reporting, and management; data engineers own the pipelines in between.]

How Kubernetes can help

Immutable images

[Diagram: a container image is built from content-addressed layers – a base image, configuration and installation recipes, and user application code, each identified by a hash – so the exact model that was in production on 16 July 2019 can be traced back to, and rebuilt from, its image.]

Stateless microservices

Declarative app configuration

The deployments, services, and routes that expose an application (for example at https://route.my-awesome-app.ai) are declared as configuration rather than configured by hand.
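A minimal sketch of what that declarative configuration might look like on OpenShift; the names, image tag, and port are illustrative, and in practice the image would be pinned by an immutable digest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-awesome-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-awesome-app
  template:
    metadata:
      labels:
        app: my-awesome-app
    spec:
      containers:
        - name: model-server
          image: registry.example.com/my-awesome-app:2019-07-16   # illustrative; pin by @sha256 digest for immutability
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-awesome-app
spec:
  selector:
    app: my-awesome-app
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: my-awesome-app
spec:
  host: route.my-awesome-app.ai     # the route shown on the slide
  to:
    kind: Service
    name: my-awesome-app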

Integration and deployment

[Diagram: a pipeline rebuilds the image from the base image, the configuration and installation recipes, and the application code, and promotes the result only once its checks pass (“OK!”).]

Data drift

On-demand discovery with the Open Data Hub

[Figure: a matrix computation standing in for a workload that has outgrown a single workstation – data scientists need more CPUs, better GPUs, more storage, and access to sensitive data on demand.]

https://opendatahub.io

The Open Data Hub runs on OpenShift and bundles, among other components: JupyterHub, Argo, Pipelines, Apache Superset, PostgreSQL, MariaDB, Apache Spark SQL, Apache Kafka (via Strimzi), Red Hat Ceph Storage, Prometheus, Grafana, TensorFlow Serving, PyTorch Serving, Spark, Seldon, Katib, TFJob, and PyTorch.

OpenShift Pipelines

OpenShift Pipelines can drive the middle of the workflow: feature engineering, model training and tuning, and model validation.

The deployed model is then exposed as a REST endpoint with OpenShift Serverless.
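A minimal sketch of such an endpoint as a Knative Service (OpenShift Serverless is built on Knative Serving); the name and image are illustrative:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-endpoint              # illustrative name
spec:
  template:
    spec:
      containers:
        - name: predictor
          image: registry.example.com/model-server:latest   # illustrative model-serving image
          ports:
            - containerPort: 8080   # HTTP port the REST endpoint listens on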

Further resources

Open Data Hub web site: https://opendatahub.io
Contribute: https://github.com/opendatahub-io
Get involved: https://gitlab.com/opendatahub/opendatahub-community
ML workflows on OpenShift and Open Data Hub: https://bit.ly/ml-workflows-ocp

Thank you!