TR-4798: Netapp AI Control Plane
Total Page:16
File Type:pdf, Size:1020Kb
Technical Report NetApp AI Control Plane Pairing Popular Open-Source Tools with NetApp to Enable AI, ML, and DL Data and Experiment Management Mike Oglesby, NetApp October 2020 | TR-4798 Abstract As organizations increase their use of artificial intelligence (AI), they face many challenges, including workload scalability and data availability. This document demonstrates how to address these challenges through the use of NetApp® AI Control Plane, a solution that pairs NetApp data management capabilities with popular open-source tools and frameworks that are used by data scientists and data engineers. In this document, we show you how to rapidly clone a data namespace just as you would a Git repo. We demonstrate how to define and implement AI training workflows that incorporate the near-instant creation of data and model baselines for traceability and versioning. We also show how to seamlessly replicate data across sites and regions and swiftly provision Jupyter Notebook workspaces with access to massive datasets. TABLE OF CONTENTS 1 Introduction ........................................................................................................................................... 5 2 Concepts and Components ................................................................................................................. 6 2.1 Artificial Intelligence ........................................................................................................................................ 6 2.2 Containers ....................................................................................................................................................... 6 2.3 Kubernetes ...................................................................................................................................................... 7 2.4 NetApp Trident ................................................................................................................................................ 7 2.5 NVIDIA DeepOps ............................................................................................................................................ 7 2.6 Kubeflow ......................................................................................................................................................... 7 2.7 Apache Airflow ................................................................................................................................................ 8 2.8 NetApp ONTAP 9 ............................................................................................................................................ 8 2.9 NetApp Snapshot Copies ................................................................................................................................ 9 2.10 NetApp FlexClone Technology ..................................................................................................................... 10 2.11 NetApp SnapMirror Data Replication Technology ......................................................................................... 11 2.12 NetApp Cloud Sync ....................................................................................................................................... 12 2.13 NetApp XCP .................................................................................................................................................. 12 2.14 NetApp ONTAP FlexGroup Volumes ............................................................................................................ 12 3 Hardware and Software Requirements............................................................................................. 13 4 Support ................................................................................................................................................ 14 5 Kubernetes Deployment .................................................................................................................... 14 5.1 Prerequisites ................................................................................................................................................. 14 5.2 Use NVIDIA DeepOps to Install and Configure Kubernetes ......................................................................... 15 6 NetApp Trident Deployment and Configuration .............................................................................. 15 6.1 Prerequisites ................................................................................................................................................. 15 6.2 Install Trident ................................................................................................................................................ 15 6.3 Example Trident Backends for ONTAP AI Deployments .............................................................................. 16 6.4 Example Kubernetes StorageClasses for ONTAP AI Deployments .............................................................. 18 7 Kubeflow Deployment ........................................................................................................................ 19 7.1 Prerequisites ................................................................................................................................................. 19 7.2 Set Default Kubernetes StorageClass .......................................................................................................... 20 7.3 Use NVIDIA DeepOps to Deploy Kubeflow ................................................................................................... 20 8 Example Kubeflow Operations and Tasks ....................................................................................... 24 8.1 Provision a Jupyter Notebook Workspace for Data Scientist or Developer Use ........................................... 25 2 NetApp AI Control Plane © 2020 NetApp, Inc. All Rights Reserved. 8.2 Create a Snapshot of an ONTAP Volume from Within a Jupyter Notebook .................................................. 31 8.3 Trigger a Cloud Sync Replication Update from Within a Jupyter Notebook .................................................. 35 8.4 Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning ..................................................................................................................................................... 40 8.5 Create a Kubeflow Pipeline to Rapidly Clone a Dataset for a Data Scientist Workspace ............................. 54 8.6 Create a Kubeflow Pipeline to Trigger a SnapMirror Volume Replication Update ......................................... 61 8.7 Create a Kubeflow Pipeline to Trigger a Cloud Sync Replication Update ..................................................... 62 9 Apache Airflow Deployment .............................................................................................................. 64 9.1 Prerequisites ................................................................................................................................................. 64 9.2 Install Helm ................................................................................................................................................... 64 9.3 Set Default Kubernetes StorageClass .......................................................................................................... 64 9.4 Use Helm to Deploy Airflow .......................................................................................................................... 64 10 Example Apache Airflow Workflows ................................................................................................ 67 10.1 Implement an End-to-End AI Training Workflow with Built-in Traceability and Versioning ............................ 67 10.2 Rapidly Clone a Dataset to create a Data Scientist Workspace .................................................................... 72 10.3 Trigger a SnapMirror Volume Replication Update ......................................................................................... 76 10.4 Trigger a Cloud Sync Replication Update ..................................................................................................... 79 10.5 Trigger an XCP Copy or Sync Operation ...................................................................................................... 84 11 Example Basic Trident Operations ................................................................................................... 86 11.1 Import an Existing Volume ............................................................................................................................ 86 11.2 Provision a New Volume ............................................................................................................................... 88 12 Example High-performance Jobs for ONTAP AI Deployments ..................................................... 88 12.1 Execute a Single-Node AI Workload ............................................................................................................. 88 12.2 Execute a Synchronous Distributed AI Workload .......................................................................................... 91 13 Performance Testing .......................................................................................................................... 95 14 Conclusion .........................................................................................................................................