Kubernetes & AI with Run:AI, Red Hat & Excelero
AI WEBINAR
Date/Time: Tuesday, June 9 | 9 am PST
What’s next in technology and innovation?

Your Host: Tom Leyden, VP Marketing
Presenter: Omri Geller, CEO & Co-Founder
Presenter: Gil Vitzinger, Software Developer
Presenter: William Benton, Engineering Manager

Kubernetes for AI Workloads
Omri Geller, CEO and co-founder, Run:AI

A Bit of History
Bare Metal, then Virtual Machines, then Containers
• Bare Metal to Virtual Machines: needed flexibility and better utilization
• Virtual Machines to Containers: reproducibility and portability
Containers scale easily, they’re lightweight and efficient, they can run any workload, are flexible and can be isolated
…But they need orchestration

Enter Kubernetes
• Track, Schedule and Operationalize
• Create Efficient Cluster Utilization
• Execute Across Different Hardware

Today, 60% of Those Who Deploy Containers Use K8s for Orchestration*
*CNCF

Now let’s talk about AI

Computing Power Fuels Development of AI
• Deep Learning
• Classical Machine Learning
• Manual Engineering

Artificial Intelligence is a Completely Different Ballgame
• New accelerators
• Distributed computing
• Experimentation R&D

Data Science Workflows and Hardware Accelerators are Highly Coupled
Data scientists, hardware accelerators
• Constant hassles
• Workflow limitations
• Under-utilized GPUs

This Leads to Frustration on Both Sides
• Data Scientists are frustrated – speed and productivity are low
• IT leaders are frustrated – GPU utilization is low

AI Workloads are Also Built on Containers
• NGC – Nvidia pre-trained models for AI experimentation on docker containers
• Container ecosystem for Data Science is growing

How Can We Bridge The Divide?

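To make the coupling between containers and accelerators described above concrete, here is a minimal sketch of a Kubernetes Pod that runs an NGC container on one GPU. It is not taken from the webinar. It assumes the cluster runs the NVIDIA device plugin, which is what exposes the nvidia.com/gpu resource. The Pod name and the image tag are placeholders.

```yaml
# Hypothetical Pod: one NGC container holding one whole GPU.
apiVersion: v1
kind: Pod
metadata:
  name: ngc-experiment                      # illustrative name, not from the slides
spec:
  restartPolicy: Never                      # run once, like a single experiment
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:<tag>   # NGC registry image; <tag> is a placeholder
      command: ["nvidia-smi"]               # just lists the GPU visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1                 # assumes the NVIDIA device plugin is installed
```

By default Kubernetes allocates nvidia.com/gpu in whole units and does not overcommit them, so an interactive session that holds a GPU while idle keeps it away from other jobs. That is the under-utilization the slides describe.
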
Kubernetes, the “De-facto” Standard for Container Orchestration
Lacks the following capabilities:
• Multiple queues
• Automatic queueing/de-queueing
• Advanced priorities & policies
• Advanced scheduling algorithms
• Affinity-aware scheduling
• Efficient management of distributed workloads

How is Experimentation Different?
Build / Training

Distinguishing Between Build and Training Workflows
Build
• Development & debugging
• Interactive sessions
• Short cycles
• Performance is less important
• Low GPU utilization
Training
• Training & HPO
• Remote execution
• Long workloads
• Throughput is highly important
• High GPU utilization

How to Solve? Guaranteed Quotas
Fixed quotas
• Fits build workloads
• GPUs are always available
Guaranteed quotas
• Fits training workflows
• Users can go over quota

Solution: Guaranteed Quotas
• More concurrent experiments
• More multi-GPU training

Queueing Management Mechanism

Run:AI - Stitching it All Together

Run:AI - Applying HPC Concepts to Kubernetes
With the advantages of K8s, plus some concepts from the world of HPC & distributed computing, we can bridge the gap
• Data Science teams gain productivity and speed
• IT teams gain visibility and maximal GPU utilization

Run:AI - Kubernetes-Based Abstraction Layer
• INTEGRABLE: Easily integrates with IT and Data Science platforms
• MULTI-CLOUD: Run on any public, private and hybrid cloud environment
• IT GOVERNANCE: Policy based orchestration and queuing management

Run:AI
• Utilize Kubernetes across IT to improve resource utilization
• Speed up experimentation process and time to market
• Easily scale infrastructure to meet needs of the business

From 28% to 73% utilization, 2X
speed, and $1M savings
Challenge
• 28% AVERAGE GPU UTILIZATION - inefficient and underutilized resources
Solution: after implementing Run:AI’s platform
• 73% AVERAGE GPU UTILIZATION
• Enabled 2x more experiments to run
• Saved $1M in additional GPU expenditures for 2020

Run:AI at-a-Glance
• Founded in 2018
• Backed by top VCs
• Offices in Tel Aviv, New York, and Boston
• Fortune 500 customers
• Top cloud and virtualization engineers
Venture Funded

Thank you

NVMesh in Kubernetes

What is NVMesh CSI Driver
● What is NVMesh CSI Driver?
  ○ CSI - Container Storage Interface
  ○ NVMesh as a storage backend in Kubernetes
● Main Features
  ○ Static Provisioning
  ○ Dynamic Provisioning
  ○ Block and File System volumes
  ○ Access Modes (ReadWriteOnce, ReadWriteMany, ReadOnlyMany)
  ○ Extend volumes
  ○ Using NVMesh VPGs

CSI Driver Components
[Diagram: Kubernetes Controller; NVMesh CSI Controller; REST API; NVMesh Management; three nodes, each with an NVMesh CSI Node Driver and an NVMesh Client; NVMesh Targets]

Dynamic Provisioning & Attach Flow
User creates a Persistent Volume Claim (PVC)
[Diagram: Kubernetes Controller; NVMesh CSI Controller; Create Volume; NVMesh Management; NVMesh Targets]

Dynamic Provisioning & Attach Flow
User creates a POD that uses the PVC
[Diagram: Kubernetes Controller; NVMesh CSI Controller; Node with NVMesh CSI Node Driver, User App PODs and NVMesh Client; Attach / Detach; POD mount; K8s internal mount; OS mount; /dev/nvmesh/v1; NVMesh Management; Data; NVMesh Targets]

Exposing NVMesh volume in a Pod
● User App POD 1: kubelet/pod1/volumes/v1
● User App POD 2: kubelet/pod2/volumes/v1
● CSI Publish Volume (for each volume, for each POD): bind mount from kubelet/volume/mount
● CSI Stage Volume (once for each Volume on the Node): Block Volume, or FileSystem Volume with mkfs and mount
● NVMesh Client: NVMesh attach, /dev/nvmesh/v1

Usage Examples
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: block-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 15Gi
  storageClassName: nvmesh-raid10

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvmesh-custom-vpg
provisioner: nvmesh-csi.excelero.com
parameters:
  vpg: your_custom_vpg

Summary
NVMesh Benefits for Kubernetes:
● Persistent storage that scales for stateful applications
● Predictable application performance – ensure that storage is not a bottleneck
● Scale your performance and capacity linearly
● Containers in a pod can access persistent storage presented to that pod, but with the freedom to restart the pod on an alternate physical node
● Choice of Kubernetes PVC access mode to match the storage to the application and file system requirements

Machine learning discovery, workflows, and systems on Kubernetes
William Benton
Engineering Manager and Senior Principal Engineer
Red Hat, Inc.

codifying problem and metrics; data collection and cleaning; feature engineering; model training and tuning; model validation; model deployment; monitoring, validation

data verification; machine resource management; monitoring; configuration; data collection; serving infrastructure; analysis tools; feature extraction; process management
(Adapted from Sculley et al., “Hidden Technical Debt
in Machine Learning Systems.” NIPS 2015)

[Diagram: data pipeline. Sources: events; databases; file, object storage. Steps: transform, federate, archive, train models. Outputs: developer UI, web and mobile, reporting, management. Roles: data engineers, data scientists, application developers.]

codifying problem and metrics; data collection and cleaning; feature engineering; model training and tuning; model validation; model deployment; monitoring, validation

How Kubernetes can help

Immutable images
Hashes, top to bottom: a6afd91e, 6b8cad3e, 33721112, e8cae4f6, 2bb6ab16, a8296f7e, 979229b9
Labels, top to bottom: user application code; configuration and installation recipes; base image

Immutable images
model in production on 16 July 2019
a6afd91e