Kubernetes & AI with Run:AI, Red Hat & Excelero
AI Webinar | Tuesday, June 9 | 9 am PST
What’s next in technology and innovation?

Your Host: Tom Leyden, VP Marketing
Presenter: Omri Geller, CEO & Co-Founder
Presenter: Gil Vitzinger, Software Developer
Presenter: William Benton, Engineering Manager

Kubernetes for AI Workloads
Omri Geller, CEO and co-founder, Run:AI

A Bit of History
Bare Metal → Virtual Machines → Containers
The move to virtual machines was driven by the need for flexibility and better utilization; the move to containers, by reproducibility and portability.
Containers scale easily: they’re lightweight and efficient, they can run any workload, they’re flexible, and they can be isolated. But they need orchestration.
Enter Kubernetes
• Track, schedule, and operationalize
• Create efficient cluster utilization
• Execute across different hardware
Today, 60% of Those Who Deploy Containers Use K8s for Orchestration*
*CNCF
Now let’s talk about AI

Computing Power Fuels Development of AI
Manual Engineering → Classical Machine Learning → Deep Learning
Artificial Intelligence is a Completely Different Ballgame
• New accelerators
• Distributed computing
• Experimentation R&D
Data Science Workflows and Hardware Accelerators are Highly Coupled
Data scientists ↔ hardware accelerators:
• Workflow limitations
• Constant hassles
• Under-utilized GPUs
This Leads to Frustration on Both Sides
• Data scientists are frustrated – speed and productivity are low
• IT leaders are frustrated – GPU utilization is low
AI Workloads are Also Built on Containers
• NGC – Nvidia pre-trained models for AI experimentation on docker containers
• The container ecosystem for Data Science is growing
How Can We Bridge The Divide?
Kubernetes, the “De-facto” Standard for Container Orchestration

Lacks the following capabilities:
• Multiple queues
• Automatic queueing/de-queueing
• Advanced priorities & policies
• Advanced scheduling algorithms
• Affinity-aware scheduling
• Efficient management of distributed workloads
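To make the gap concrete: stock Kubernetes lets a pod request a GPU declaratively, but if no GPU is free the pod simply sits Pending — there is no queue, priority policy, or gang scheduling behind it. A minimal sketch (the pod name and image tag are illustrative placeholders, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job                 # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # illustrative NGC image tag
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1         # request one GPU; with none free, the pod stays Pending
```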
How is Experimentation Different?
Build vs. Training
Distinguishing Between Build and Training Workflows

Build:
• Development & debugging
• Interactive sessions
• Short cycles
• Performance is less important
• Low GPU utilization

Training:
• Training & HPO
• Remote execution
• Long workloads
• Throughput is highly important
• High GPU utilization
How to Solve? Guaranteed Quotas

Fixed quotas:
• Fit build workloads
• GPUs are always available

Guaranteed quotas:
• Fit training workflows
• Users can go over quota
• More concurrent experiments
• More multi-GPU training
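For contrast, the fixed-quota model maps naturally onto Kubernetes’ built-in ResourceQuota, which hard-caps a namespace and never lends idle GPUs elsewhere — exactly the rigidity that guaranteed quotas relax. A sketch (the namespace and quota names are assumptions; extended-resource quotas use the `requests.<resource>` key):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu-quota        # hypothetical name
  namespace: team-a             # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # hard cap: team-a can never use more than 4 GPUs,
                                   # even when the rest of the cluster sits idle
```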
Queueing Management Mechanism
Run:AI – Stitching it All Together: Applying HPC Concepts to Kubernetes

With the advantages of K8s, plus some concepts from the world of HPC & distributed computing, we can bridge the gap:
• Data Science teams gain productivity and speed
• IT teams gain visibility and maximal GPU utilization
Run:AI – Kubernetes-Based Abstraction Layer

INTEGRABLE – integrates easily with IT and Data Science platforms
MULTI-CLOUD – runs on any public, private, or hybrid cloud environment
IT GOVERNANCE – policy-based orchestration and queueing management
Run:AI
• Utilize Kubernetes across IT to improve resource utilization
• Speed up the experimentation process and time to market
• Easily scale infrastructure to meet the needs of the business
From 28% to 73% utilization, 2X speed, and $1M savings

Challenge: 28% average GPU utilization – inefficient and underutilized resources

After implementing Run:AI’s platform:
• 73% average GPU utilization
• Enabled 2x more experiments to run
• Saved $1M in additional GPU expenditures for 2020
Run:AI at-a-Glance
• Founded in 2018
• Venture funded, backed by top VCs
• Offices in Tel Aviv, New York, and Boston
• Fortune 500 customers
• Top cloud and virtualization engineers
Thank you!

NVMesh in Kubernetes

What is the NVMesh CSI Driver?
● What is the NVMesh CSI Driver?
  ○ CSI – Container Storage Interface
  ○ NVMesh as a storage backend in Kubernetes
● Main Features
  ○ Static provisioning
  ○ Dynamic provisioning
  ○ Block and File System volumes
  ○ Access modes (ReadWriteOnce, ReadWriteMany, ReadOnlyMany)
  ○ Volume extension
  ○ Using NVMesh VPGs

CSI Driver Components
• Kubernetes Controller → NVMesh CSI Controller: the CSI Controller receives requests from Kubernetes and talks to NVMesh Management over its REST API
• NVMesh CSI Node Driver: runs on every node, alongside the NVMesh Client
• NVMesh Management: provisions and manages volumes backed by the NVMesh Targets

Dynamic Provisioning & Attach Flow
Provisioning:
1. The user creates a Persistent Volume Claim (PVC).
2. The Kubernetes Controller passes the request to the NVMesh CSI Controller.
3. The CSI Controller issues a Create Volume request to NVMesh Management, which provisions the volume on the NVMesh Targets.

Attach:
1. The user creates a POD that uses the PVC.
2. On the scheduled node, the NVMesh CSI Node Driver attaches/detaches the volume through the local NVMesh Client and NVMesh Management; the volume appears as /dev/nvmesh/v1.
3. The node driver performs the OS mount, Kubernetes adds its internal mount, and the volume is mounted into the user application POD; data then flows between the NVMesh Client and the NVMesh Targets.

Exposing an NVMesh volume in a Pod
1. NVMesh attach – the NVMesh Client attaches the volume, exposing it as /dev/nvmesh/v1 on the node.
2. CSI Stage Volume – once per volume on each node: for a FileSystem volume, mkfs (on first use) and mount to the node staging path (kubelet/volume/mount); a Block volume is staged without a file system.
3. CSI Publish Volume – for each POD that uses the volume: a bind mount from the staging path into the POD’s volume path (e.g. kubelet/pod1/volumes/v1, kubelet/pod2/volumes/v1).

Usage Examples
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: block-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 15Gi
  storageClassName: nvmesh-raid10
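To round out the example, a pod consuming the block-mode PVC above would use `volumeDevices` rather than `volumeMounts`, since no file system is staged for a Block volume. A sketch (the pod name, container image, and device path are placeholders, not from the talk):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer          # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:latest   # placeholder image
    volumeDevices:
    - name: data
      devicePath: /dev/xvda     # raw block device path inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc      # the PVC defined above
```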
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvmesh-custom-vpg
provisioner: nvmesh-csi.excelero.com
parameters:
  vpg: your_custom_vpg

Summary
NVMesh Benefits for Kubernetes:
● Persistent storage that scales for stateful applications
● Predictable application performance – ensure that storage is not a bottleneck
● Scale your performance and capacity linearly
● Containers in a pod can access persistent storage presented to that pod, but with the freedom to restart the pod on an alternate physical node
● Choice of Kubernetes PVC access mode to match the storage to the application and file system requirements

Machine learning discovery, workflows, and systems on Kubernetes
William Benton, Engineering Manager and Senior Principal Engineer, Red Hat, Inc.

The machine learning lifecycle: codifying the problem and metrics → data collection and cleaning → feature engineering → model training and tuning → model validation → model deployment → monitoring and validation.

The learning code itself sits inside a much larger system: data collection, data verification, feature extraction, configuration, machine resource management, serving infrastructure, monitoring, analysis tools, and process management. (Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015)
A typical data platform: events, databases, and file/object storage feed transform, federate, and archive stages, and curated data is used to train models. Data engineers build the pipelines, data scientists work through a developer UI, and application developers consume the models from web and mobile apps, with reporting and management around the whole system.

How Kubernetes can help

Immutable images
A container image is immutable and built from content-addressed layers: a base image (979229b9), configuration and installation recipes (33721112, e8cae4f6, 2bb6ab16, a8296f7e), and user application code (a6afd91e, 6b8cad3e). Those layer hashes let you say precisely which model was in production on 16 July 2019.

Stateless microservices

Declarative app configuration
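The two ideas above — stateless microservices and declarative app configuration — come together in a standard Kubernetes Deployment, where the desired state (image, replica count) is written down rather than scripted. A minimal sketch (all names and the image reference are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server            # placeholder name
spec:
  replicas: 3                   # stateless replicas behind a single Service
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: server
        # pinning the immutable image by digest makes "what was in production" unambiguous
        image: registry.example.com/model-server@sha256:deadbeef   # placeholder digest
        ports:
        - containerPort: 8080
```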
https://route.my-awesome-app.ai

Integration and deployment: a change to the user application code, the configuration and installation recipes, or the base image triggers a rebuild; once the integration checks pass, the new image is deployed.

Data drift

On-demand discovery with the Open Data Hub
[Figure: a sparse binary matrix multiplied by a dense matrix — slide illustration]

more CPUs · better GPUs
more storage · sensitive data

https://opendatahub.io

Open Data Hub components: Argo, JupyterHub, Kubeflow Pipelines, Apache Superset, PostgreSQL, MariaDB, Apache Spark SQL, Apache Kafka (via Strimzi), Red Hat Ceph Storage, Grafana, and Prometheus — all running on OpenShift.

Serving and training components: TensorFlow Serving, Spark, PyTorch Serving, Katib, Seldon, TFJob, and PyTorch.
OpenShift Pipelines automates the middle stages of the lifecycle — feature engineering, model training and tuning, and model validation — within the larger flow from codifying the problem and collecting data through deployment and monitoring. The validated model is then deployed behind a REST endpoint with OpenShift Serverless.

Further resources
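As an illustration of the serving step, a model deployed as a scale-to-zero REST endpoint on OpenShift Serverless (Knative Serving) can be declared like this — the service name and image are placeholders, not from the talk:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-endpoint          # placeholder name
spec:
  template:
    spec:
      containers:
      - image: registry.example.com/model-server:latest   # placeholder image
        ports:
        - containerPort: 8080   # port of the REST endpoint served by the container
```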
Open Data Hub web site: https://opendatahub.io
Contribute: https://github.com/opendatahub-io
Get involved: https://gitlab.com/opendatahub/opendatahub-community
ML workflows on OpenShift and Open Data Hub: https://bit.ly/ml-workflows-ocp
Thank you!