Technical Report

NetApp AI Control Plane
Pairing Popular Open-Source Tools with NetApp to Enable AI, ML, and DL Data and Experiment Management

Mike Oglesby, NetApp
October 2020 | TR-4798

Abstract

As organizations increase their use of artificial intelligence (AI), they face many challenges, including workload scalability and data availability. This document demonstrates how to address these challenges through the use of NetApp® AI Control Plane, a solution that pairs NetApp data management capabilities with popular open-source tools and frameworks that are used by data scientists and data engineers. In this document, we show you how to rapidly clone a data namespace just as you would a Git repo. We demonstrate how to define and implement AI training workflows that incorporate the near-instant creation of data and model baselines for traceability and versioning. We also show how to seamlessly replicate data across sites and regions and swiftly provision Jupyter Notebook workspaces with access to massive datasets.

TABLE OF CONTENTS

1 Introduction
2 Concepts and Components
  2.1 Artificial Intelligence
  2.2 Containers
  2.3 Kubernetes
  2.4 NetApp Trident
  2.5 NVIDIA DeepOps
  2.6 Kubeflow
  2.7 Apache Airflow
  2.8 NetApp ONTAP 9
  2.9 NetApp Snapshot Copies
  2.10 NetApp FlexClone Technology
  2.11 NetApp SnapMirror Data Replication Technology
  2.12 NetApp Cloud Sync
  2.13 NetApp XCP
  2.14 NetApp ONTAP FlexGroup Volumes
3 Hardware and Software Requirements
4 Support
5 Kubernetes Deployment
  5.1 Prerequisites
  5.2 Use NVIDIA DeepOps to Install and Configure Kubernetes
6 NetApp Trident Deployment and Configuration
  6.1 Prerequisites
  6.2 Install Trident
  6.3 Example Trident Backends for ONTAP AI Deployments
  6.4 Example Kubernetes StorageClasses for ONTAP AI Deployments
7 Kubeflow Deployment
  7.1 Prerequisites
  7.2 Set Default Kubernetes StorageClass
  7.3 Use NVIDIA DeepOps to Deploy Kubeflow
8 Example Kubeflow Operations and Tasks
  8.1 Provision a Jupyter Notebook Workspace for Data Scientist or Developer Use
  8.2 Create a Snapshot of an ONTAP Volume from Within a Jupyter Notebook
  8.3 Trigger a Cloud Sync Replication Update from Within a Jupyter Notebook
  8.4 Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning
  8.5 Create a Kubeflow Pipeline to Rapidly Clone a Dataset for a Data Scientist Workspace
  8.6 Create a Kubeflow Pipeline to Trigger a SnapMirror Volume Replication Update
  8.7 Create a Kubeflow Pipeline to Trigger a Cloud Sync Replication Update
9 Apache Airflow Deployment
  9.1 Prerequisites
  9.2 Install Helm
  9.3 Set Default Kubernetes StorageClass
  9.4 Use Helm to Deploy Airflow
10 Example Apache Airflow Workflows
  10.1 Implement an End-to-End AI Training Workflow with Built-in Traceability and Versioning
  10.2 Rapidly Clone a Dataset to Create a Data Scientist Workspace
  10.3 Trigger a SnapMirror Volume Replication Update
  10.4 Trigger a Cloud Sync Replication Update
  10.5 Trigger an XCP Copy or Sync Operation
11 Example Basic Trident Operations
  11.1 Import an Existing Volume
  11.2 Provision a New Volume
12 Example High-performance Jobs for ONTAP AI Deployments
  12.1 Execute a Single-Node AI Workload
  12.2 Execute a Synchronous Distributed AI Workload
13 Performance Testing
14 Conclusion
Acknowledgments
Where to Find Additional Information
Version History

LIST OF TABLES

Table 1) Validation environment infrastructure details.
Table 2) Validation environment software version details.
Table 3) Performance comparison results.

LIST OF FIGURES

Figure 1) Solution visualization.
Figure 2) Virtual machines versus containers.
Figure 3) Kubeflow visualization.
Figure 4) NetApp Snapshot copies.
Figure 5) NetApp FlexClone technology.
Figure 6) NetApp SnapMirror example.
Figure 7) Cloud Sync.
Figure 8) NetApp FlexGroup volumes.
Figure 9) Synchronous distributed AI job.


Introduction

Companies and organizations of all sizes and across many industries are turning to artificial intelligence (AI), machine learning (ML), and deep learning (DL) to solve real-world problems, deliver innovative products and services, and gain an edge in an increasingly competitive marketplace. As organizations increase their use of AI, ML, and DL, they face many challenges, including workload scalability and data availability. This document demonstrates how you can address these challenges by using the NetApp AI Control Plane, a solution that pairs NetApp data management capabilities with popular open-source tools and frameworks.

This report shows you how to rapidly clone a data namespace just as you would a Git repo. It also shows you how to seamlessly replicate data across sites and regions to create a cohesive and unified AI/ML/DL data pipeline. Additionally, it walks you through defining and implementing AI, ML, and DL training workflows that incorporate the near-instant creation of data and model baselines for traceability and versioning. With this solution, you can trace every model training run back to the exact dataset that was used to train and/or validate the model. Lastly, this document shows you how to swiftly provision Jupyter Notebook workspaces with access to massive datasets.

The NetApp AI Control Plane is targeted toward data scientists and data engineers, so minimal NetApp or NetApp ONTAP® expertise is required. With this solution, data management functions can be executed using simple and familiar tools and interfaces. If you already have NetApp storage in your environment, you can test drive the NetApp AI Control Plane today. If you want to test drive the solution but you do not already have NetApp storage, visit cloud.netapp.com, and you can be up and running with a cloud-based NetApp storage solution in minutes.

Figure 1) Solution visualization. [Figure shows data scientists and a data engineer using web-based workspaces and automated pipelines, all provided by the NetApp® AI Control Plane, which runs on top of the NetApp Data Fabric spanning edge, core, and cloud.]


Concepts and Components

Artificial Intelligence

AI is a computer science discipline in which computers are trained to mimic the cognitive functions of the human mind. AI developers train computers to learn and to solve problems in a manner that is similar to, or even superior to, humans. Deep learning and machine learning are subfields of AI. Organizations are increasingly adopting AI, ML, and DL to support their critical business needs. Some examples are as follows:
• Analyzing large amounts of data to unearth previously unknown business insights
• Interacting directly with customers by using natural language processing
• Automating various business processes and functions
Modern AI training and inference workloads require massively parallel computing capabilities. Therefore, GPUs are increasingly being used to execute AI operations because the parallel processing capabilities of GPUs are vastly superior to those of general-purpose CPUs.

Containers

Containers are isolated user-space instances that run on top of a shared host kernel. The adoption of containers is increasing rapidly. Containers offer many of the same application sandboxing benefits that virtual machines (VMs) offer. However, because the hypervisor and guest operating system layers that VMs rely on have been eliminated, containers are far more lightweight. See Figure 2 for a visualization. Containers also allow the efficient packaging of application dependencies, run times, and so on, directly with an application. The most commonly used container packaging format is the Docker container. An application that has been containerized in the Docker container format can be executed on any machine that can run Docker containers. This is true even if the application's dependencies are not present on the machine because all dependencies are packaged in the container itself. For more information, visit the Docker website.

Figure 2) Virtual machines versus containers. [Figure contrasts the two stacks: VMs run applications and their dependencies inside guest operating systems on a hypervisor, whereas containers run applications and their dependencies directly on a container runtime; both stacks sit on a host operating system and physical infrastructure.]
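To make the packaging behavior concrete, the following is a minimal, hypothetical sketch (the Dockerfile contents, image tag, and application files are illustrative placeholders, not part of the validated solution) that builds a container image with its dependencies baked in and then runs it on any Docker-capable host:

$ cat << EOF > ./Dockerfile
# Hypothetical example: package a Python application and its dependencies into a portable image
FROM python:3.8-slim
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt
COPY train.py /app/train.py
CMD ["python", "/app/train.py"]
EOF
$ docker build -t example-train:latest .
$ docker run --rm example-train:latest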


Kubernetes

Kubernetes is an open source, distributed, container orchestration platform that was originally designed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF). Kubernetes enables the automation of deployment, management, and scaling functions for containerized applications. In recent years, Kubernetes has emerged as the dominant container orchestration platform. Although other container packaging formats and run times are supported, Kubernetes is most often used as an orchestration system for Docker containers. For more information, visit the Kubernetes website.

NetApp Trident

Trident is an open source storage orchestrator developed and maintained by NetApp that greatly simplifies the creation, management, and consumption of persistent storage for Kubernetes workloads. Trident, itself a Kubernetes-native application, runs directly within a Kubernetes cluster. With Trident, Kubernetes users (developers, data scientists, Kubernetes administrators, and so on) can create, manage, and interact with persistent storage volumes in the standard Kubernetes format that they are already familiar with. At the same time, they can take advantage of NetApp advanced data management capabilities and a data fabric that is powered by NetApp technology. Trident abstracts away the complexities of persistent storage and makes it simple to consume. For more information, visit the Trident website.

NVIDIA DeepOps

DeepOps is an open source project from NVIDIA that, by using Ansible, automates the deployment of GPU server clusters according to best practices. DeepOps is modular and can be used for various deployment tasks. For this document and the validation exercise that it describes, DeepOps is used to deploy a Kubernetes cluster that consists of GPU server worker nodes. For more information, visit the DeepOps website.

Kubeflow

Kubeflow is an open source AI and ML toolkit for Kubernetes that was originally developed by Google. The Kubeflow project makes deployments of AI and ML workflows on Kubernetes simple, portable, and scalable. Kubeflow abstracts away the intricacies of Kubernetes, allowing data scientists to focus on what they know best―data science. See Figure 3 for a visualization. Kubeflow has been gaining significant traction as enterprise IT departments have increasingly standardized on Kubernetes. For more information, visit the Kubeflow website.

Kubeflow Pipelines

Kubeflow Pipelines are a key component of Kubeflow. Kubeflow Pipelines are a platform and standard for defining and deploying portable and scalable AI and ML workflows. For more information, see the official Kubeflow documentation.
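For illustration only, the following sketch shows what a trivial pipeline definition can look like with the Kubeflow Pipelines Python SDK (kfp); the pipeline name, step, and image are hypothetical placeholders, and the syntax shown is based on the kfp v1 API that shipped in the Kubeflow 1.0 timeframe, so check the Kubeflow documentation for your release. The compiled package can be uploaded through the Kubeflow Pipelines UI.

$ cat << EOF > ./hello-pipeline.py
# Hypothetical example: define and compile a one-step Kubeflow pipeline
from kfp import dsl, compiler

@dsl.pipeline(
    name="hello-pipeline",
    description="Minimal example pipeline with a single containerized step."
)
def hello_pipeline():
    # Each step runs as a container in the Kubernetes cluster
    dsl.ContainerOp(
        name="say-hello",
        image="busybox",
        command=["echo", "Hello from Kubeflow Pipelines"]
    )

if __name__ == "__main__":
    # Produces a package that can be uploaded via the Kubeflow Pipelines UI
    compiler.Compiler().compile(hello_pipeline, "hello-pipeline.tar.gz")
EOF
$ python3 ./hello-pipeline.py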

Jupyter Notebook Server

A Jupyter Notebook Server is an open source web application that allows data scientists to create wiki-like documents called Jupyter Notebooks that contain live code as well as descriptive text. Jupyter Notebooks are widely used in the AI and ML community as a means of documenting, storing, and sharing AI and ML projects. Kubeflow simplifies the provisioning and deployment of Jupyter Notebook Servers on Kubernetes. For more information on Jupyter Notebooks, visit the Jupyter website. For more information about Jupyter Notebooks within the context of Kubeflow, see the official Kubeflow documentation.


Figure 3) Kubeflow visualization. [Figure shows Kubeflow orchestrating AI/ML workloads on compute and cloud resources, with persistent storage and automation provided by the NetApp Data Fabric.]

Apache Airflow

Apache Airflow is an open-source workflow management platform that enables programmatic authoring, scheduling, and monitoring for complex enterprise workflows. It is often used to automate ETL and data pipeline workflows, but it is not limited to these types of workflows. The Airflow project was started by Airbnb but has since become very popular in the industry and now falls under the auspices of The Apache Software Foundation. Airflow is written in Python, Airflow workflows are created via Python scripts, and Airflow is designed under the principle of "configuration as code." Many enterprise Airflow users now run Airflow on top of Kubernetes.

Directed Acyclic Graphs (DAGs)

In Airflow, workflows are called Directed Acyclic Graphs (DAGs). DAGs are made up of tasks that are executed in sequence, in parallel, or a combination of the two, depending on the DAG definition. The Airflow scheduler executes individual tasks on an array of workers, adhering to the task-level dependencies that are specified in the DAG definition. DAGs are defined and created via Python scripts.
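As an illustrative sketch only (the DAG ID, schedule, and tasks are placeholders and are not part of the validated solution), a minimal two-task DAG for Airflow 1.10 might look like the following; Airflow picks up DAG files that are placed in its configured DAGs folder:

$ cat << EOF > ./example_dag.py
# Hypothetical example: a two-task DAG in which "extract" must finish before "train" starts
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="example_training_workflow",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    train = BashOperator(task_id="train", bash_command="echo 'train model'")

    extract >> train  # task-level dependency: train runs only after extract succeeds
EOF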

NetApp ONTAP 9

NetApp ONTAP 9 is the latest generation of storage management software from NetApp that enables businesses like yours to modernize infrastructure and to transition to a cloud-ready data center. With industry-leading data management capabilities, ONTAP enables you to manage and protect your data with a single set of tools regardless of where that data resides. You can also move data freely to wherever you need it: the edge, the core, or the cloud. ONTAP 9 includes numerous features that simplify data management, accelerate and protect your critical data, and future-proof your infrastructure across hybrid cloud architectures.


Simplify Data Management

Data management is crucial for your enterprise IT operations so that you can use appropriate resources for your applications and datasets. ONTAP includes the following features to streamline and simplify your operations and reduce your total cost of operation:
• Inline data compaction and expanded deduplication. Data compaction reduces wasted space inside storage blocks, and deduplication significantly increases effective capacity.
• Minimum, maximum, and adaptive quality of service (QoS). Granular QoS controls help maintain performance levels for critical applications in highly shared environments.
• ONTAP FabricPool. This feature provides automatic tiering of cold data to public and private cloud storage options, including Amazon Web Services (AWS), Azure, and NetApp StorageGRID® object-based storage.

Accelerate and Protect Data

ONTAP delivers superior levels of performance and data protection and extends these capabilities with the following features:
• High performance and low latency. ONTAP offers the highest possible throughput at the lowest possible latency.
• NetApp ONTAP FlexGroup technology. A FlexGroup volume is a high-performance data container that can scale linearly to up to 20PB and 400 billion files, providing a single namespace that simplifies data management.
• Data protection. ONTAP provides built-in data protection capabilities with common management across all platforms.
• NetApp Volume Encryption. ONTAP offers native volume-level encryption with both onboard and external key management support.

Future-Proof Infrastructure

ONTAP 9 helps meet your demanding and constantly changing business needs:
• Seamless scaling and nondisruptive operations. ONTAP supports the nondisruptive addition of capacity to existing controllers and to scale-out clusters. You can upgrade to the latest technologies, such as NVMe and 32Gb FC, without costly data migrations or outages.
• Cloud connection. ONTAP is among the most cloud-connected storage management software available, with options for software-defined storage (ONTAP Select) and cloud-native instances (NetApp Cloud Volumes Service) in all public clouds.
• Integration with emerging applications. By using the same infrastructure that supports existing enterprise apps, ONTAP offers enterprise-grade data services for next-generation platforms and applications such as OpenStack, Hadoop, and MongoDB.

NetApp Snapshot Copies

A NetApp Snapshot™ copy is a read-only, point-in-time image of a volume. The image consumes minimal storage space and incurs negligible performance overhead because it records only changes made to files since the last Snapshot copy was made. Snapshot copies owe their efficiency to the core ONTAP storage virtualization technology, the Write Anywhere File Layout (WAFL). Like a database, WAFL uses metadata to point to actual data blocks on disk. But, unlike a database, WAFL does not overwrite existing blocks. It writes updated data to a new block and changes the metadata. Snapshot copies are so efficient because ONTAP references metadata when it creates a Snapshot copy rather than copying data blocks. Doing so eliminates the "seek time" that other systems incur in locating the blocks to copy, as well as the cost of making the copy itself.


You can use a Snapshot copy to recover individual files or LUNs or to restore the entire contents of a volume. ONTAP compares pointer information in the Snapshot copy with data on disk to reconstruct the missing or damaged object, without downtime or a significant performance cost.
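As a hedged illustration (the SVM, volume, and Snapshot copy names below are placeholders, and the restore operation requires the appropriate ONTAP license), a Snapshot copy can be created, listed, and restored from the ONTAP CLI as follows:

cluster1::> volume snapshot create -vserver ontapai_nfs -volume pb_fg_all -snapshot dataset_baseline_001
cluster1::> volume snapshot show -vserver ontapai_nfs -volume pb_fg_all
cluster1::> volume snapshot restore -vserver ontapai_nfs -volume pb_fg_all -snapshot dataset_baseline_001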

Figure 4) NetApp Snapshot copies.

NetApp FlexClone Technology

NetApp FlexClone® technology references Snapshot metadata to create writable, point-in-time copies of a volume. Copies share data blocks with their parents, consuming no storage except what is required for metadata, until changes are written to the copy. Where traditional copies can take minutes or even hours to create, FlexClone software lets you copy even the largest datasets almost instantaneously. That makes it ideal for situations in which you need multiple copies of identical datasets (a development workspace, for example) or temporary copies of a dataset (testing an application against a production dataset).
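For illustration, with placeholder names that mirror the examples elsewhere in this document, a writable FlexClone volume can be created from an existing Snapshot copy using the ONTAP CLI; the clone can then be mounted and used like any other volume:

cluster1::> volume clone create -vserver ontapai_nfs -flexclone pb_fg_all_clone -parent-volume pb_fg_all -parent-snapshot dataset_baseline_001
cluster1::> volume clone show -vserver ontapai_nfs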


Figure 5) NetApp FlexClone technology.

NetApp SnapMirror Data Replication Technology

NetApp SnapMirror® software is a cost-effective, easy-to-use unified replication solution across the data fabric. It replicates data at high speeds over LAN or WAN. It gives you high data availability and fast data replication for applications of all types, including business-critical applications in both virtual and traditional environments. When you replicate data to one or more NetApp storage systems and continually update the secondary data, your data is kept current and is available whenever you need it. No external replication servers are required. See Figure 6 for an example of an architecture that leverages SnapMirror technology.
SnapMirror software leverages NetApp ONTAP storage efficiencies by sending only changed blocks over the network. SnapMirror software also uses built-in network compression to accelerate data transfers and reduce network bandwidth utilization by up to 70%. With SnapMirror technology, you can leverage one thin replication data stream to create a single repository that maintains both the active mirror and prior point-in-time copies, reducing network traffic by up to 50%.
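As a hedged sketch with placeholder cluster, SVM, and volume names (the destination volume must already exist as a data protection volume), a SnapMirror relationship is typically created and initialized from the destination cluster and then updated on demand or on a schedule:

destination::> snapmirror create -source-path svm_source:pb_fg_all -destination-path svm_dest:pb_fg_all_dst -type XDP -policy MirrorAllSnapshots
destination::> snapmirror initialize -destination-path svm_dest:pb_fg_all_dst
destination::> snapmirror update -destination-path svm_dest:pb_fg_all_dst
destination::> snapmirror show -destination-path svm_dest:pb_fg_all_dst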

Figure 6) NetApp SnapMirror example. [Figure shows a data fabric spanning edge, core, and cloud: SnapMirror replicates Snapshot copies from the edge to the core for infrastructure provisioning, dev/test automation, infrastructure as code, and dataset/model versioning, while Cloud Sync and SnapMirror replicate data to object storage and the cloud for dev/test or burst compute, cold-data tiering, data lakes, dataset/model/code versioning, and A/B testing.]


NetApp Cloud Sync

Cloud Sync is a NetApp service for rapid and secure data synchronization. Whether you need to transfer files between on-premises NFS or SMB file shares, NetApp StorageGRID, NetApp ONTAP S3, NetApp Cloud Volumes Service, Azure NetApp Files, AWS S3, AWS EFS, Azure Blob, Google Cloud Storage, or IBM Cloud Object Storage, Cloud Sync moves the files where you need them quickly and securely.
After your data is transferred, it is fully available for use on both source and target. Cloud Sync can sync data on demand when an update is triggered or continuously sync data based on a predefined schedule. Regardless, Cloud Sync only moves the deltas, so time and money spent on data replication is minimized. Cloud Sync is a software-as-a-service (SaaS) tool that is extremely simple to set up and use. Data transfers that are triggered by Cloud Sync are carried out by data brokers. Cloud Sync data brokers can be deployed in AWS, Azure, Google Cloud Platform, or on-premises.

Figure 7) Cloud Sync.

NetApp XCP

NetApp XCP is client-based software for any-to-NetApp and NetApp-to-NetApp data migrations and insights. XCP is designed to scale and achieve maximum performance by utilizing all available system resources to handle high-volume datasets and high-performance migrations. XCP helps you to gain complete visibility into the file system with the option to generate reports. NetApp XCP is available in a single package that supports NFS and SMB protocols. XCP includes a binary for NFS datasets and a Windows executable for SMB datasets.
NetApp XCP File Analytics is host-based software that detects file shares, runs scans on the file system, and provides a dashboard for file analytics. XCP File Analytics is compatible with both NetApp and non-NetApp systems and runs on Linux or Windows hosts to provide analytics for NFS and SMB-exported file systems.
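For illustration only (the export paths and catalog index name are hypothetical, and exact options can vary between XCP releases, so consult the XCP documentation), an initial NFS copy followed by an incremental sync might look like the following:

$ xcp copy -newid migration1 source-server:/export/dataset dest-netapp:/export/dataset
$ xcp sync -id migration1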

NetApp ONTAP FlexGroup Volumes

A training dataset can be a collection of potentially billions of files. Files can include text, audio, video, and other forms of unstructured data that must be stored and processed to be read in parallel. The storage system must store large numbers of small files and must read those files in parallel for sequential and random I/O.
A FlexGroup volume (Figure 8) is a single namespace that comprises multiple constituent member volumes. From a storage administrator viewpoint, a FlexGroup volume is managed and acts like a NetApp FlexVol® volume. Files in a FlexGroup volume are allocated to individual member volumes and are not striped across volumes or nodes. FlexGroup volumes enable the following capabilities:
• They provide multiple petabytes of capacity and predictable low latency for high-metadata workloads.
• They support up to 400 billion files in the same namespace.
• They support parallelized operations in NAS workloads across CPUs, nodes, aggregates, and constituent FlexVol volumes.
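As a hedged illustration with placeholder names (sizing, aggregate selection, and export policies should follow NetApp best practices for your system), a FlexGroup volume can be provisioned from the ONTAP CLI, and ONTAP lays out the constituent member volumes automatically:

cluster1::> volume create -vserver ontapai_nfs -volume imagenet_fg -auto-provision-as flexgroup -size 10TB -junction-path /imagenet_fg -security-style unix
cluster1::> volume show -vserver ontapai_nfs -volume imagenet_fg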

Figure 8) NetApp FlexGroup volumes.

Hardware and Software Requirements

All procedures outlined in this document were validated on the NetApp ONTAP AI converged infrastructure solution described in NVA-1121. This verified architecture pairs a NetApp AFF A800 all-flash storage system with the NVIDIA DGX-1 Deep Learning System using Cisco Nexus networking. For this validation exercise, two bare-metal NVIDIA DGX-1 systems, each featuring eight NVIDIA V100 GPUs, were used as Kubernetes worker nodes. A NetApp AFF A800 all-flash storage system provided a single persistent storage namespace across nodes, and two Cisco Nexus 3232C switches were used to provide network connectivity. Three virtual machines (VMs) that ran on a separate physical server outside of the ONTAP AI pod were used as Kubernetes master nodes. See Table 1 for validation environment infrastructure details. See Table 2 for validation environment software version details.
Note, however, that the NetApp AI Control Plane solution that is outlined in this document is not dependent on this specific hardware. The solution is compatible with any NetApp physical storage appliance, software-defined instance, or cloud service that supports the NFS protocol. Examples include a NetApp AFF storage system, Azure NetApp Files, NetApp Cloud Volumes Service, a NetApp ONTAP Select software-defined storage instance, or a NetApp Cloud Volumes ONTAP instance. Additionally, the solution can be implemented on any Kubernetes cluster as long as the Kubernetes version used is supported by Kubeflow and NetApp Trident. For a list of Kubernetes versions that are supported by Kubeflow, see the official Kubeflow documentation. For a list of Kubernetes versions that are supported by Trident, see the Trident documentation.


Table 1) Validation environment infrastructure details.

Component                 Quantity    Details                      Operating System
Deployment jump host      1           VM                           Ubuntu 18.04.5 LTS
Kubernetes master nodes   3           VM                           Ubuntu 18.04.5 LTS
Kubernetes worker nodes   2           NVIDIA DGX-1 (bare-metal)    NVIDIA DGX OS 4.0.5 (based on Ubuntu 18.04.2 LTS)
Storage                   1 HA Pair   NetApp AFF A800              NetApp ONTAP 9.6 P1
Network connectivity      2           Cisco Nexus 3232C            Cisco NX-OS 7.0(3)I6(1)

Table 2) Validation environment software version details.

Component                   Version
Apache Airflow              1.10.12
Apache Airflow Helm Chart   7.10.1
Cisco NX-OS                 7.0(3)I6(1)
Docker                      18.09.7
Kubeflow                    1.0
Kubernetes                  1.17.9
NetApp ONTAP                9.6 P1
NetApp Trident              20.07
NVIDIA DeepOps              20.08.1
NVIDIA DGX OS               4.0.5 (based on Ubuntu 18.04.2 LTS)
Ubuntu                      18.04.5 LTS

Support

NetApp does not offer enterprise support for Apache Airflow, Docker, Kubeflow, Kubernetes, or NVIDIA DeepOps. If you are interested in a fully supported solution with capabilities similar to the NetApp AI Control Plane solution, contact NetApp about fully supported AI/ML solutions that NetApp offers jointly with partners.

Kubernetes Deployment

This section describes the tasks that you must complete to deploy a Kubernetes cluster in which to implement the NetApp AI Control Plane solution. If you already have a Kubernetes cluster, then you can skip this section as long as you are running a version of Kubernetes that is supported by Kubeflow and NetApp Trident. For a list of Kubernetes versions that are supported by Kubeflow, see the official Kubeflow documentation. For a list of Kubernetes versions that are supported by Trident, see the Trident documentation.
For on-premises Kubernetes deployments that incorporate bare-metal nodes featuring NVIDIA GPU(s), NetApp recommends using NVIDIA's DeepOps Kubernetes deployment tool. This section outlines the deployment of a Kubernetes cluster using DeepOps.

Prerequisites

Before you perform the deployment exercise that is outlined in this section, we assume that you have already performed the following tasks:


1. You have already configured any bare-metal Kubernetes nodes (for example, an NVIDIA DGX system that is part of an ONTAP AI pod) according to standard configuration instructions.
2. You have installed a supported operating system on all Kubernetes master and worker nodes and on a deployment jump host. For a list of operating systems that are supported by DeepOps, see the DeepOps GitHub site.

Use NVIDIA DeepOps to Install and Configure Kubernetes

To deploy and configure your Kubernetes cluster with NVIDIA DeepOps, perform the following tasks from a deployment jump host:
1. Download NVIDIA DeepOps by following the instructions on the Getting Started page on the NVIDIA DeepOps GitHub site.
2. Deploy Kubernetes in your cluster by following the instructions on the Kubernetes Deployment Guide page on the NVIDIA DeepOps GitHub site.
Note: For the DeepOps Kubernetes deployment to work, the same user must exist on all Kubernetes master and worker nodes.
If the deployment fails, change the value of kubectl_localhost to false in deepops/config/group_vars/k8s-cluster.yml and repeat step 2. The Copy kubectl binary to ansible host task, which executes only when the value of kubectl_localhost is true, relies on the fetch Ansible module, which has known memory usage issues. These memory usage issues can sometimes cause the task to fail. If the task fails because of a memory issue, then the remainder of the deployment operation does not complete successfully. If the deployment completes successfully after you have changed the value of kubectl_localhost to false, then you must manually copy the kubectl binary from a Kubernetes master node to the deployment jump host. You can find the location of the kubectl binary on a specific master node by executing the command which kubectl directly on that node.
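For reference, the high-level flow resembles the following hedged sketch; the repository URL is NVIDIA's public DeepOps repository, but the script and playbook paths shown are assumptions that can differ between DeepOps releases, and the inventory host entries are placeholders, so always follow the DeepOps documentation for the release you are using:

$ git clone https://github.com/NVIDIA/deepops.git
$ cd deepops
$ ./scripts/setup.sh                                  # install Ansible and other jump host dependencies
$ vi config/inventory                                 # define Kubernetes master and worker nodes (placeholders)
$ ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
$ kubectl get nodes                                   # verify that all nodes reach the Ready state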

NetApp Trident Deployment and Configuration

This section describes the tasks that you must complete to install and configure NetApp Trident in your Kubernetes cluster.

Prerequisites

Before you perform the deployment exercise that is outlined in this section, we assume that you have already performed the following tasks:
1. You already have a working Kubernetes cluster, and you are running a version of Kubernetes that is supported by Trident. For a list of supported versions, see the Trident documentation.
2. You already have a working NetApp storage appliance, software-defined instance, or cloud storage service that supports the NFS protocol.

Install Trident

To install and configure NetApp Trident in your Kubernetes cluster, perform the following tasks from the deployment jump host:
1. Deploy Trident using one of the following methods:
   a. If you used NVIDIA DeepOps to deploy your Kubernetes cluster, you can also use NVIDIA DeepOps to deploy Trident in your Kubernetes cluster. To deploy Trident with DeepOps, follow the Trident deployment instructions on the NVIDIA DeepOps GitHub site.


   b. If you did not use NVIDIA DeepOps to deploy your Kubernetes cluster or if you simply prefer to deploy Trident manually, you can deploy Trident by following the deployment instructions in the Trident documentation.
2. Be sure to create at least one Trident backend and at least one Kubernetes StorageClass. For more information about backends and StorageClasses, see the Trident documentation. If you are deploying the NetApp AI Control Plane solution on an ONTAP AI pod, see the section "Example Trident Backends for ONTAP AI Deployments" for some examples of different Trident backends that you might want to create and the section "Example Kubernetes StorageClasses for ONTAP AI Deployments" for some examples of different Kubernetes StorageClasses that you might want to create.
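As a hedged sketch of the manual deployment path described in option b (the download URL pattern and file names are assumptions based on the Trident 20.07 release listed in Table 2, so confirm them against the Trident documentation for your version):

$ wget https://github.com/NetApp/trident/releases/download/v20.07.0/trident-installer-20.07.0.tar.gz
$ tar -xf trident-installer-20.07.0.tar.gz
$ cd trident-installer
$ ./tridentctl install -n trident      # install Trident in the "trident" namespace
$ ./tridentctl version -n trident      # confirm that the Trident server and client versions match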

Example Trident Backends for ONTAP AI Deployments

Before you can use Trident to dynamically provision storage resources within your Kubernetes cluster, you must create one or more Trident backends. The examples that follow represent different types of backends that you might want to create if you are deploying the NetApp AI Control Plane solution on an ONTAP AI pod. For more information about backends, see the Trident documentation.
1. NetApp recommends creating a FlexGroup-enabled Trident backend for each data LIF (logical network interface that provides data access) that you want to use on your NetApp AFF system. Due to NFS protocol limitations, a single NFS mount can provide only 1.5GBps to 2GBps of bandwidth. If you need more bandwidth for a job, Trident enables you to add multiple NFS mounts (mounting the same NFS volume multiple times) quickly and easily when you create a Kubernetes pod. For maximum performance, these multiple mounts should be distributed across different data LIFs. You must create a Trident backend for each data LIF that you want to use for these mounts.
The example commands that follow show the creation of two FlexGroup-enabled Trident backends for two different data LIFs that are associated with the same ONTAP storage virtual machine (SVM). These backends use the ontap-nas-flexgroup storage driver. ONTAP supports two main data volume types: FlexVol and FlexGroup. FlexVol volumes are size-limited (as of this writing, the maximum size depends on the specific deployment). FlexGroup volumes, on the other hand, can scale linearly to up to 20PB and 400 billion files, providing a single namespace that greatly simplifies data management. Therefore, FlexGroup volumes are optimal for AI and ML workloads that rely on large amounts of data. If you are working with a small amount of data and want to use FlexVol volumes instead of FlexGroup volumes, you can create Trident backends that use the ontap-nas storage driver instead of the ontap-nas-flexgroup storage driver.
$ cat << EOF > ./trident-backend-ontap-ai-flexgroups-iface1.json
{
    "version": 1,
    "storageDriverName": "ontap-nas-flexgroup",
    "backendName": "ontap-ai-flexgroups-iface1",
    "managementLIF": "10.61.218.100",
    "dataLIF": "192.168.11.11",
    "svm": "ontapai_nfs",
    "username": "admin",
    "password": "ontapai"
}
EOF
$ tridentctl create backend -f ./trident-backend-ontap-ai-flexgroups-iface1.json -n trident
+----------------------------+---------------------+--------------------------------------+--------+---------+
|            NAME            |   STORAGE DRIVER    |                 UUID                 | STATE  | VOLUMES |
+----------------------------+---------------------+--------------------------------------+--------+---------+
| ontap-ai-flexgroups-iface1 | ontap-nas-flexgroup | b74cbddb-e0b8-40b7-b263-b6da6dec0bdd | online |       0 |
+----------------------------+---------------------+--------------------------------------+--------+---------+
$ cat << EOF > ./trident-backend-ontap-ai-flexgroups-iface2.json
{
    "version": 1,
    "storageDriverName": "ontap-nas-flexgroup",
    "backendName": "ontap-ai-flexgroups-iface2",
    "managementLIF": "10.61.218.100",
    "dataLIF": "192.168.12.12",
    "svm": "ontapai_nfs",
    "username": "admin",
    "password": "ontapai"
}
EOF
$ tridentctl create backend -f ./trident-backend-ontap-ai-flexgroups-iface2.json -n trident
+----------------------------+---------------------+--------------------------------------+--------+---------+
|            NAME            |   STORAGE DRIVER    |                 UUID                 | STATE  | VOLUMES |
+----------------------------+---------------------+--------------------------------------+--------+---------+
| ontap-ai-flexgroups-iface2 | ontap-nas-flexgroup | 61814d48-c770-436b-9cb4-cf7ee661274d | online |       0 |
+----------------------------+---------------------+--------------------------------------+--------+---------+
$ tridentctl get backend -n trident
+----------------------------+---------------------+--------------------------------------+--------+---------+
|            NAME            |   STORAGE DRIVER    |                 UUID                 | STATE  | VOLUMES |
+----------------------------+---------------------+--------------------------------------+--------+---------+
| ontap-ai-flexgroups-iface1 | ontap-nas-flexgroup | b74cbddb-e0b8-40b7-b263-b6da6dec0bdd | online |       0 |
| ontap-ai-flexgroups-iface2 | ontap-nas-flexgroup | 61814d48-c770-436b-9cb4-cf7ee661274d | online |       0 |
+----------------------------+---------------------+--------------------------------------+--------+---------+
2. NetApp also recommends creating one or more FlexVol-enabled Trident backends. If you use FlexGroup volumes for training dataset storage, you might want to use FlexVol volumes for storing results, output, debug information, and so on. If you want to use FlexVol volumes, you must create one or more FlexVol-enabled Trident backends. The example commands that follow show the creation of a single FlexVol-enabled Trident backend that uses a single data LIF.
$ cat << EOF > ./trident-backend-ontap-ai-flexvols.json
{
    "version": 1,
    "storageDriverName": "ontap-nas",
    "backendName": "ontap-ai-flexvols",
    "managementLIF": "10.61.218.100",
    "dataLIF": "192.168.11.11",
    "svm": "ontapai_nfs",
    "username": "admin",
    "password": "ontapai"
}
EOF
$ tridentctl create backend -f ./trident-backend-ontap-ai-flexvols.json -n trident
+-------------------+----------------+--------------------------------------+--------+---------+
|       NAME        | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+-------------------+----------------+--------------------------------------+--------+---------+
| ontap-ai-flexvols | ontap-nas      | 52bdb3b1-13a5-4513-a9c1-52a69657fabe | online |       0 |
+-------------------+----------------+--------------------------------------+--------+---------+
$ tridentctl get backend -n trident
+----------------------------+---------------------+--------------------------------------+--------+---------+
|            NAME            |   STORAGE DRIVER    |                 UUID                 | STATE  | VOLUMES |
+----------------------------+---------------------+--------------------------------------+--------+---------+
| ontap-ai-flexvols          | ontap-nas           | 52bdb3b1-13a5-4513-a9c1-52a69657fabe | online |       0 |
| ontap-ai-flexgroups-iface1 | ontap-nas-flexgroup | b74cbddb-e0b8-40b7-b263-b6da6dec0bdd | online |       0 |
| ontap-ai-flexgroups-iface2 | ontap-nas-flexgroup | 61814d48-c770-436b-9cb4-cf7ee661274d | online |       0 |
+----------------------------+---------------------+--------------------------------------+--------+---------+

Example Kubernetes StorageClasses for ONTAP AI Deployments

Before you can use Trident to dynamically provision storage resources within your Kubernetes cluster, you must create one or more Kubernetes StorageClasses. The examples that follow represent different types of StorageClasses that you might want to create if you are deploying the NetApp AI Control Plane solution on an ONTAP AI pod. For more information about StorageClasses, see the Trident documentation.
1. NetApp recommends creating a separate StorageClass for each FlexGroup-enabled Trident backend that you created in the section "Example Trident Backends for ONTAP AI Deployments," step 1. These granular StorageClasses enable you to add NFS mounts that correspond to specific LIFs (the LIFs that you specified when you created the Trident backends) as a particular backend that is specified in the StorageClass spec file. The example commands that follow show the creation of two StorageClasses that correspond to the two example backends that were created in the section "Example Trident Backends for ONTAP AI Deployments," step 1. The storagePools parameter is where the Trident backend is specified in the StorageClass definition file. For more information about StorageClasses, see the Trident documentation.
So that a persistent volume isn't deleted when the corresponding PersistentVolumeClaim (PVC) is deleted, the following example uses a reclaimPolicy value of Retain. For more information about the reclaimPolicy field, see the official Kubernetes documentation.
$ cat << EOF > ./storage-class-ontap-ai-flexgroups-retain-iface1.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-ai-flexgroups-retain-iface1
provisioner: netapp.io/trident
parameters:
  backendType: "ontap-nas-flexgroup"
  storagePools: "ontap-ai-flexgroups-iface1:.*"
reclaimPolicy: Retain
EOF
$ kubectl create -f ./storage-class-ontap-ai-flexgroups-retain-iface1.yaml
storageclass.storage.k8s.io/ontap-ai-flexgroups-retain-iface1 created
$ cat << EOF > ./storage-class-ontap-ai-flexgroups-retain-iface2.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-ai-flexgroups-retain-iface2
provisioner: netapp.io/trident
parameters:
  backendType: "ontap-nas-flexgroup"
  storagePools: "ontap-ai-flexgroups-iface2:.*"
reclaimPolicy: Retain
EOF
$ kubectl create -f ./storage-class-ontap-ai-flexgroups-retain-iface2.yaml
storageclass.storage.k8s.io/ontap-ai-flexgroups-retain-iface2 created
$ kubectl get storageclass
NAME                                PROVISIONER         AGE
ontap-ai-flexgroups-retain-iface1   netapp.io/trident   0m
ontap-ai-flexgroups-retain-iface2   netapp.io/trident   0m
2. NetApp also recommends creating a StorageClass that corresponds to the FlexVol-enabled Trident backend that you created in the section "Example Trident Backends for ONTAP AI Deployments," step 2. The example commands that follow show the creation of a single StorageClass for FlexVol volumes. In the following example, a particular backend is not specified in the StorageClass definition file because only one FlexVol-enabled Trident backend was created in the section "Example Trident Backends for ONTAP AI Deployments," step 2. When you use Kubernetes to administer volumes that use this StorageClass, Trident attempts to use any available backend that uses the ontap-nas driver.
$ cat << EOF > ./storage-class-ontap-ai-flexvols-retain.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-ai-flexvols-retain
provisioner: netapp.io/trident
parameters:
  backendType: "ontap-nas"
reclaimPolicy: Retain
EOF
$ kubectl create -f ./storage-class-ontap-ai-flexvols-retain.yaml
storageclass.storage.k8s.io/ontap-ai-flexvols-retain created
$ kubectl get storageclass
NAME                                PROVISIONER         AGE
ontap-ai-flexgroups-retain-iface1   netapp.io/trident   1m
ontap-ai-flexgroups-retain-iface2   netapp.io/trident   1m
ontap-ai-flexvols-retain            netapp.io/trident   0m
3. NetApp also recommends creating a generic StorageClass for FlexGroup volumes. The following example commands show the creation of a single generic StorageClass for FlexGroup volumes. Note that a particular backend is not specified in the StorageClass definition file. Therefore, when you use Kubernetes to administer volumes that use this StorageClass, Trident attempts to use any available backend that uses the ontap-nas-flexgroup driver.
$ cat << EOF > ./storage-class-ontap-ai-flexgroups-retain.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-ai-flexgroups-retain
provisioner: netapp.io/trident
parameters:
  backendType: "ontap-nas-flexgroup"
reclaimPolicy: Retain
EOF
$ kubectl create -f ./storage-class-ontap-ai-flexgroups-retain.yaml
storageclass.storage.k8s.io/ontap-ai-flexgroups-retain created
$ kubectl get storageclass
NAME                                PROVISIONER         AGE
ontap-ai-flexgroups-retain          netapp.io/trident   0m
ontap-ai-flexgroups-retain-iface1   netapp.io/trident   2m
ontap-ai-flexgroups-retain-iface2   netapp.io/trident   2m
ontap-ai-flexvols-retain            netapp.io/trident   1m

Kubeflow Deployment

This section describes the tasks that you must complete to deploy Kubeflow in your Kubernetes cluster.

Prerequisites

Before you perform the deployment exercise that is outlined in this section, we assume that you have already performed the following tasks:
1. You already have a working Kubernetes cluster, and you are running a version of Kubernetes that is supported by Kubeflow. For a list of supported versions, see the official Kubeflow documentation.
2. You have already installed and configured NetApp Trident in your Kubernetes cluster as outlined in the section "NetApp Trident Deployment and Configuration."


Set Default Kubernetes StorageClass

Before you deploy Kubeflow, you must designate a default StorageClass within your Kubernetes cluster. The Kubeflow deployment process attempts to provision new persistent volumes using the default StorageClass. If no StorageClass is designated as the default StorageClass, then the deployment fails. To designate a default StorageClass within your cluster, perform the following task from the deployment jump host. If you have already designated a default StorageClass within your cluster, then you can skip this step.
1. Designate one of your existing StorageClasses as the default StorageClass. The example commands that follow show the designation of a StorageClass named ontap-ai-flexvols-retain as the default StorageClass.
Note: The ontap-nas-flexgroup Trident backend type has a minimum PVC size of 800GB. By default, Kubeflow attempts to provision PVCs that are smaller than 800GB. Therefore, you should not designate a StorageClass that utilizes the ontap-nas-flexgroup backend type as the default StorageClass for the purposes of Kubeflow deployment.

$ kubectl get sc
NAME                                PROVISIONER             AGE
ontap-ai-flexgroups-retain          csi.trident.netapp.io   25h
ontap-ai-flexgroups-retain-iface1   csi.trident.netapp.io   25h
ontap-ai-flexgroups-retain-iface2   csi.trident.netapp.io   25h
ontap-ai-flexvols-retain            csi.trident.netapp.io   3s
$ kubectl patch storageclass ontap-ai-flexvols-retain -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/ontap-ai-flexvols-retain patched
$ kubectl get sc
NAME                                 PROVISIONER             AGE
ontap-ai-flexgroups-retain           csi.trident.netapp.io   25h
ontap-ai-flexgroups-retain-iface1    csi.trident.netapp.io   25h
ontap-ai-flexgroups-retain-iface2    csi.trident.netapp.io   25h
ontap-ai-flexvols-retain (default)   csi.trident.netapp.io   54s

Use NVIDIA DeepOps to Deploy Kubeflow

NetApp recommends using the Kubeflow deployment tool that is provided by NVIDIA DeepOps. To deploy Kubeflow in your Kubernetes cluster using the DeepOps deployment tool, perform the following tasks from the deployment jump host.
Note: Alternatively, you can deploy Kubeflow manually by following the installation instructions in the official Kubeflow documentation.
1. Deploy Kubeflow in your cluster by following the Kubeflow deployment instructions on the NVIDIA DeepOps GitHub site.
2. Note down the Kubeflow Dashboard URL that the DeepOps Kubeflow deployment tool outputs.
$ ./scripts/k8s_deploy_kubeflow.sh
…
INFO[0007] Applied the configuration Successfully!       filename="cmd/apply.go:72"

Kubeflow app installed to: /home/ai/kubeflow

It may take several minutes for all services to start. Run 'kubectl get pods -n kubeflow' to verify

To remove (excluding CRDs, istio, auth, and cert-manager), run: ./scripts/k8s_deploy_kubeflow.sh -d

To perform a full uninstall : ./scripts/k8s_deploy_kubeflow.sh -D

Kubeflow Dashboard (HTTP NodePort): http://10.61.188.111:31380
3. Confirm that all pods deployed within the Kubeflow namespace show a STATUS of Running and confirm that no components deployed within the namespace are in an error state.


$ kubectl get all -n kubeflow NAME READY STATUS RESTARTS AGE pod/admission-webhook-bootstrap-stateful-set-0 1/1 Running 0 95s pod/admission-webhook-deployment-6b89c84c98-vrtbh 1/1 Running 0 91s pod/application-controller-stateful-set-0 1/1 Running 0 98s pod/argo-ui-5dcf5d8b4f-m2wn4 1/1 Running 0 97s pod/centraldashboard-cf4874ddc-7hcr8 1/1 Running 0 97s pod/jupyter-web-app-deployment-685b455447-gjhh7 1/1 Running 0 96s pod/katib-controller-88c97d85c-kgq66 1/1 Running 1 95s pod/katib-db-8598468fd8-5jw2c 1/1 Running 0 95s pod/katib-manager-574c8c67f9-wtrf5 1/1 Running 1 95s pod/katib-manager-rest-778857c989-fjbzn 1/1 Running 0 95s pod/katib-suggestion-bayesianoptimization-65df4d7455-qthmw 1/1 Running 0 94s pod/katib-suggestion-grid-56bf69f597-98vwn 1/1 Running 0 94s pod/katib-suggestion-hyperband-7777b76cb9-9v6dq 1/1 Running 0 93s pod/katib-suggestion-nasrl-77f6f9458c-2qzxq 1/1 Running 0 93s pod/katib-suggestion-random-77b88b5c79-l64j9 1/1 Running 0 93s pod/katib-ui-7587c5b967-nd629 1/1 Running 0 95s pod/metacontroller-0 1/1 Running 0 96s pod/metadata-db-5dd459cc-swzkm 1/1 Running 0 94s pod/metadata-deployment-6cf77db994-69fk7 1/1 Running 3 93s pod/metadata-deployment-6cf77db994-mpbjt 1/1 Running 3 93s pod/metadata-deployment-6cf77db994-xg7tz 1/1 Running 3 94s pod/metadata-ui-78f5b59b56-qb6kr 1/1 Running 0 94s pod/minio-758b769d67-llvdr 1/1 Running 0 91s pod/ml-pipeline-5875b9db95-g8t2k 1/1 Running 0 91s pod/ml-pipeline-persistenceagent-9b69ddd46-bt9r9 1/1 Running 0 90s pod/ml-pipeline-scheduledworkflow-7b8d756c76-7x56s 1/1 Running 0 90s pod/ml-pipeline-ui-79ffd9c76-fcwpd 1/1 Running 0 90s pod/ml-pipeline-viewer-controller-deployment-5fdc87f58-b2t9r 1/1 Running 0 90s pod/mysql-657f87857d-l5k9z 1/1 Running 0 91s pod/notebook-controller-deployment-56b4f59bbf-8bvnr 1/1 Running 0 92s pod/profiles-deployment-6bc745947-mrdkh 2/2 Running 0 90s pod/pytorch-operator-77c97f4879-hmlrv 1/1 Running 0 92s pod/seldon-operator-controller-manager-0 1/1 Running 1 91s pod/spartakus-volunteer-5fdfddb779-l7qkm 1/1 Running 0 92s pod/tensorboard-6544748d94-nh8b2 1/1 Running 0 92s pod/tf-job-dashboard-56f79c59dd-6w59t 1/1 Running 0 92s pod/tf-job-operator-79cbfd6dbc-rb58c 1/1 Running 0 91s pod/workflow-controller-db644d554-cwrnb 1/1 Running 0 97s

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/admission-webhook-service ClusterIP 10.233.51.169 443/TCP 97s service/application-controller-service ClusterIP 10.233.4.54 443/TCP 98s service/argo-ui NodePort 10.233.47.191 80:31799/TCP 97s service/centraldashboard ClusterIP 10.233.8.36 80/TCP 97s service/jupyter-web-app-service ClusterIP 10.233.1.42 80/TCP 97s service/katib-controller ClusterIP 10.233.25.226 443/TCP 96s service/katib-db ClusterIP 10.233.33.151 3306/TCP 97s service/katib-manager ClusterIP 10.233.46.239 6789/TCP 96s service/katib-manager-rest ClusterIP 10.233.55.32 80/TCP 96s service/katib-suggestion-bayesianoptimization ClusterIP 10.233.49.191 6789/TCP 95s service/katib-suggestion-grid ClusterIP 10.233.9.105 6789/TCP 95s service/katib-suggestion-hyperband ClusterIP 10.233.22.2 6789/TCP 95s service/katib-suggestion-nasrl ClusterIP 10.233.63.73 6789/TCP 95s

service/katib-suggestion-random ClusterIP 10.233.57.210 6789/TCP 95s service/katib-ui ClusterIP 10.233.6.116 80/TCP 96s service/metadata-db ClusterIP 10.233.31.2 3306/TCP 96s service/metadata-service ClusterIP 10.233.27.104 8080/TCP 96s service/metadata-ui ClusterIP 10.233.57.177 80/TCP 96s service/minio-service ClusterIP 10.233.44.90 9000/TCP 94s service/ml-pipeline ClusterIP 10.233.41.201 8888/TCP,8887/TCP 94s service/ml-pipeline-tensorboard-ui ClusterIP 10.233.36.207 80/TCP 93s service/ml-pipeline-ui ClusterIP 10.233.61.150 80/TCP 93s service/mysql ClusterIP 10.233.55.117 3306/TCP 94s service/notebook-controller-service ClusterIP 10.233.10.166 443/TCP 95s service/profiles-kfam ClusterIP 10.233.33.79 8081/TCP 92s service/pytorch-operator ClusterIP 10.233.37.112 8443/TCP 95s service/seldon-operator-controller-manager-service ClusterIP 10.233.30.178 443/TCP 92s service/tensorboard ClusterIP 10.233.58.151 9000/TCP 94s service/tf-job-dashboard ClusterIP 10.233.4.17 80/TCP 94s service/tf-job-operator ClusterIP 10.233.60.32 8443/TCP 94s service/webhook-server-service ClusterIP 10.233.32.167 443/TCP 87s

NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/admission-webhook-deployment 1/1 1 1 97s deployment.apps/argo-ui 1/1 1 1 97s deployment.apps/centraldashboard 1/1 1 1 97s deployment.apps/jupyter-web-app-deployment 1/1 1 1 97s deployment.apps/katib-controller 1/1 1 1 96s deployment.apps/katib-db 1/1 1 1 97s deployment.apps/katib-manager 1/1 1 1 96s deployment.apps/katib-manager-rest 1/1 1 1 96s deployment.apps/katib-suggestion-bayesianoptimization 1/1 1 1 95s deployment.apps/katib-suggestion-grid 1/1 1 1 95s deployment.apps/katib-suggestion-hyperband 1/1 1 1 95s deployment.apps/katib-suggestion-nasrl 1/1 1 1 95s deployment.apps/katib-suggestion-random 1/1 1 1 95s deployment.apps/katib-ui 1/1 1 1 96s deployment.apps/metadata-db 1/1 1 1 96s deployment.apps/metadata-deployment 3/3 3 3 96s deployment.apps/metadata-ui 1/1 1 1 96s deployment.apps/minio 1/1 1 1 94s deployment.apps/ml-pipeline 1/1 1 1 94s deployment.apps/ml-pipeline-persistenceagent 1/1 1 1 93s deployment.apps/ml-pipeline-scheduledworkflow 1/1 1 1 93s deployment.apps/ml-pipeline-ui 1/1 1 1 93s deployment.apps/ml-pipeline-viewer-controller-deployment 1/1 1 1 93s deployment.apps/mysql 1/1 1 1 94s deployment.apps/notebook-controller-deployment 1/1 1 1 95s deployment.apps/profiles-deployment 1/1 1 1 92s deployment.apps/pytorch-operator 1/1 1 1 95s deployment.apps/spartakus-volunteer 1/1 1 1 94s deployment.apps/tensorboard 1/1 1 1 94s deployment.apps/tf-job-dashboard 1/1 1 1 94s deployment.apps/tf-job-operator 1/1 1 1 94s deployment.apps/workflow-controller 1/1 1 1 97s


NAME DESIRED CURRENT READY AGE replicaset.apps/admission-webhook-deployment-6b89c84c98 1 1 1 97s replicaset.apps/argo-ui-5dcf5d8b4f 1 1 1 97s replicaset.apps/centraldashboard-cf4874ddc 1 1 1 97s replicaset.apps/jupyter-web-app-deployment-685b455447 1 1 1 97s replicaset.apps/katib-controller-88c97d85c 1 1 1 96s replicaset.apps/katib-db-8598468fd8 1 1 1 97s replicaset.apps/katib-manager-574c8c67f9 1 1 1 96s replicaset.apps/katib-manager-rest-778857c989 1 1 1 96s replicaset.apps/katib-suggestion-bayesianoptimization-65df4d7455 1 1 1 95s replicaset.apps/katib-suggestion-grid-56bf69f597 1 1 1 95s replicaset.apps/katib-suggestion-hyperband-7777b76cb9 1 1 1 95s replicaset.apps/katib-suggestion-nasrl-77f6f9458c 1 1 1 95s replicaset.apps/katib-suggestion-random-77b88b5c79 1 1 1 95s replicaset.apps/katib-ui-7587c5b967 1 1 1 96s replicaset.apps/metadata-db-5dd459cc 1 1 1 96s replicaset.apps/metadata-deployment-6cf77db994 3 3 3 96s replicaset.apps/metadata-ui-78f5b59b56 1 1 1 96s replicaset.apps/minio-758b769d67 1 1 1 93s replicaset.apps/ml-pipeline-5875b9db95 1 1 1 93s replicaset.apps/ml-pipeline-persistenceagent-9b69ddd46 1 1 1 92s replicaset.apps/ml-pipeline-scheduledworkflow-7b8d756c76 1 1 1 91s replicaset.apps/ml-pipeline-ui-79ffd9c76 1 1 1 91s replicaset.apps/ml-pipeline-viewer-controller-deployment-5fdc87f58 1 1 1 91s replicaset.apps/mysql-657f87857d 1 1 1 92s replicaset.apps/notebook-controller-deployment-56b4f59bbf 1 1 1 94s replicaset.apps/profiles-deployment-6bc745947 1 1 1 91s replicaset.apps/pytorch-operator-77c97f4879 1 1 1 94s replicaset.apps/spartakus-volunteer-5fdfddb779 1 1 1 94s replicaset.apps/tensorboard-6544748d94 1 1 1 93s replicaset.apps/tf-job-dashboard-56f79c59dd 1 1 1 93s replicaset.apps/tf-job-operator-79cbfd6dbc 1 1 1 93s replicaset.apps/workflow-controller-db644d554 1 1 1 97s

NAME READY AGE statefulset.apps/admission-webhook-bootstrap-stateful-set 1/1 97s statefulset.apps/application-controller-stateful-set 1/1 98s

statefulset.apps/metacontroller 1/1 98s statefulset.apps/seldon-operator-controller-manager 1/1 92s

$ kubectl get pvc -n kubeflow
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
katib-mysql      Bound    pvc-b07f293e-d028-11e9-9b9d-00505681a82d   10Gi       RWO            ontap-ai-flexvols-retain   27m
metadata-mysql   Bound    pvc-b0f3f032-d028-11e9-9b9d-00505681a82d   10Gi       RWO            ontap-ai-flexvols-retain   27m
minio-pv-claim   Bound    pvc-b22727ee-d028-11e9-9b9d-00505681a82d   20Gi       RWO            ontap-ai-flexvols-retain   27m
mysql-pv-claim   Bound    pvc-b2429afd-d028-11e9-9b9d-00505681a82d   20Gi       RWO            ontap-ai-flexvols-retain   27m
4. In your web browser, access the Kubeflow central dashboard by navigating to the URL that you noted down in step 2. The default username is admin@kubeflow.org, and the default password is 12341234. To create additional users, follow the instructions in the official Kubeflow documentation.

Example Kubeflow Operations and Tasks

This section includes examples of various operations and tasks that you may want to perform using Kubeflow.


Provision a Jupyter Notebook Workspace for Data Scientist or Developer Use

Kubeflow is capable of rapidly provisioning new Jupyter Notebook servers to act as data scientist workspaces. To provision a new Jupyter Notebook server with Kubeflow, perform the following tasks. For more information about Jupyter Notebooks within the Kubeflow context, see the official Kubeflow documentation.
1. Optional: If there are existing volumes on your NetApp storage system that you want to mount on the new Jupyter Notebook server, but that are not tied to PersistentVolumeClaims (PVCs) in the namespace that the new server is going to be created in (see step 4 below), then you must import these volumes into that namespace. Use the Trident volume import functionality to import these volumes. The example commands that follow show the importing of an existing volume named pb_fg_all into the kubeflow-anonymous namespace. These commands create a PVC in the kubeflow-anonymous namespace that is tied to the volume on the NetApp storage system. For more information about PVCs, see the official Kubernetes documentation. For more information about the volume import functionality, see the Trident documentation. For a detailed example showing the importing of a volume using Trident, see the section "Import an Existing Volume." The volume is imported into the kubeflow-anonymous namespace because that is the namespace in which the new Jupyter Notebook server is created in step 4. To mount this existing volume on the new Jupyter Notebook server using Kubeflow, a PVC must exist for the volume in the same namespace.

$ cat << EOF > ./pvc-import-pb_fg_all-kubeflow-anonymous.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pb-fg-all
  namespace: kubeflow-anonymous
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ontap-ai-flexgroups-retain
EOF
$ tridentctl import volume ontap-ai-flexgroups-iface1 pb_fg_all -f ./pvc-import-pb_fg_all-kubeflow-anonymous.yaml -n trident
+------------------------------------------+--------+----------------------------+----------+--------------------------------------+--------+---------+
|                   NAME                   |  SIZE  |       STORAGE CLASS        | PROTOCOL |             BACKEND UUID             | STATE  | MANAGED |
+------------------------------------------+--------+----------------------------+----------+--------------------------------------+--------+---------+
| pvc-1ed071be-d5a6-11e9-8278-00505681feb6 | 10 TiB | ontap-ai-flexgroups-retain | file     | 12f4f8fa-0500-4710-a023-d9b47e86a2ec | online | true    |
+------------------------------------------+--------+----------------------------+----------+--------------------------------------+--------+---------+
$ kubectl get pvc -n kubeflow-anonymous
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                 AGE
pb-fg-all   Bound    pvc-1ed071be-d5a6-11e9-8278-00505681feb6   10Ti       ROX            ontap-ai-flexgroups-retain   14s
2. From the Kubeflow central dashboard, click Notebook Servers in the main menu to navigate to the Jupyter Notebook server administration page.


3. Click New Server to provision a new Jupyter Notebook server.

4. Give your new server a name, choose the Docker image that you want your server to be based on, and specify the amount of CPU and RAM to be reserved by your server. If the Namespace field is blank, use the Select Namespace menu in the page header to choose a namespace. The Namespace field is then auto-populated with the chosen namespace. In the following example, the kubeflow-anonymous namespace is chosen. In addition, the default values for Docker image, CPU, and RAM are accepted.


5. Specify the workspace volume details. If you choose to create a new volume, then that volume or PVC is provisioned using the default StorageClass. Because a StorageClass utilizing Trident was designated as the default StorageClass in the section “Set Default Kubernetes StorageClass,” the volume or PVC is provisioned with Trident. This volume is automatically mounted as the default workspace within the Jupyter Notebook Server container. Any notebooks that a user creates on the server that are not saved to a separate data volume are automatically saved to this workspace volume. Therefore, the notebooks are persistent across reboots.

6. Add data volumes. The following example specifies the existing volume that was imported by the example commands in step 1 and accepts the default mount point.


7. Optional: Request that the desired number of GPUs be allocated to your notebook server. In the following example, one GPU is requested.

8. Click Launch to provision your new notebook server.
9. Wait for your notebook server to be fully provisioned. This can take several minutes if you have never provisioned a server using the Docker image that you specified in step 4 because the image needs to be downloaded. When your server has been fully provisioned, you see a green checkmark graphic in the Status column on the Jupyter Notebook server administration page.

10. Click Connect to connect to your new server's web interface.
11. Confirm that the dataset volume that was specified in step 6 is mounted on the server. Note that this volume is mounted within the default workspace by default. From the perspective of the user, this is just another folder within the workspace. The user, who is likely a data scientist and not an infrastructure expert, does not need to possess any storage expertise in order to use this volume.


12. Open a terminal and, assuming that a new volume was requested in step 5, execute df -h to confirm that a new Trident-provisioned persistent volume is mounted as the default workspace. The default workspace directory is the base directory that you are presented with when you first access the server’s web interface. Therefore, any artifacts that the user creates using the web interface are stored on this Trident-provisioned persistent volume.


13. Using the terminal, run nvidia-smi to confirm that the correct number of GPUs were allocated to the notebook server. In the following example, one GPU has been allocated to the notebook server as requested in step 7.
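If you prefer to verify the workspace from outside of the notebook as well, the following commands are a minimal sketch of how an administrator might confirm, from the deployment jump host, that the workspace PVC was provisioned by Trident and that the requested number of GPUs was allocated to the notebook server pod. The namespace and the placeholder pod name are assumptions; substitute the namespace and server name that you chose in step 4.
$ kubectl -n kubeflow-anonymous get pvc                                                    # workspace and data volume PVCs should show a status of Bound
$ kubectl -n kubeflow-anonymous get pods                                                   # identify the notebook server pod
$ kubectl -n kubeflow-anonymous describe pod <notebook-server-pod-name> | grep nvidia.com/gpu   # should show the requested GPU count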


Create a Snapshot of an ONTAP Volume from Within a Jupyter Notebook
To trigger the creation of a snapshot of a NetApp ONTAP volume that is mounted in the Jupyter Notebook server's workspace, directly from within a Jupyter Notebook, perform the following tasks. This operation takes advantage of the NetApp ONTAP REST APIs and the NetApp ONTAP Python module. For more information about the REST APIs and the Python module, see the NetApp Support site. Note that the tasks in this section only work for volumes that reside on ONTAP storage systems or software-defined instances.
1. Connect to a Jupyter Notebook server's web interface. See the section "Provision a Jupyter Notebook Workspace for Data Scientist or Developer Use" for instructions on how to provision a Jupyter Notebook server.
2. Open an existing Python 3 notebook or create a new Python 3 notebook. The following example shows the creation of a new Python 3 notebook.

3. Add the following content to the notebook, update the variable values as stated in the comments, and then run all cells. Alternatively, you can download an example Jupyter Notebook containing this content from NetApp's Kubeflow and Jupyter Examples GitHub repository. A minimal sketch of this content follows this step.
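The following Python cells are a minimal sketch of what such a notebook can contain. They follow the same netapp_ontap pattern that is used by the example Airflow DAG later in this document; the host name, credentials, and volume name shown are placeholders that you must replace with values from your environment.
# Cell 1: Install the NetApp ONTAP Python module
import sys, subprocess
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--user', 'netapp-ontap'])

# Cell 2: Create a snapshot of the mounted ONTAP volume
from netapp_ontap import config
from netapp_ontap.host_connection import HostConnection
from netapp_ontap.resources import Volume, Snapshot
from datetime import datetime

## Replace these placeholder values with values from your environment
ontapClusterMgmtHostname = 'cluster1.example.com'   # ONTAP cluster management LIF
ontapClusterAdminUsername = 'admin'                 # Cluster admin username
ontapClusterAdminPassword = 'password'              # Cluster admin password
volumeName = 'project1'                             # Name of the ONTAP volume that is mounted in your workspace
verifySSLCert = False                               # Denotes whether to verify the cluster's SSL certificate

# Configure the connection to the ONTAP cluster or software-defined instance
config.CONNECTION = HostConnection(
    host = ontapClusterMgmtHostname,
    username = ontapClusterAdminUsername,
    password = ontapClusterAdminPassword,
    verify = verifySSLCert
)

# Create a snapshot of the volume and print the API response
volume = Volume.find(name = volumeName)
snapshot = Snapshot.from_dict({
    'name': 'jupyter_%s' % datetime.today().strftime("%Y%m%d_%H%M%S"),
    'volume': volume.to_dict()
})
response = snapshot.post()
print(response.http_response.text)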


Trigger a Cloud Sync Replication Update from Within a Jupyter Notebook
From directly within a Jupyter Notebook, you can trigger the replication of data to and from a variety of file and object storage platforms by using NetApp Cloud Sync replication technology. Potential use cases include:
• Replicating newly acquired sensor data gathered at the edge back to the core data center or to the cloud to be used for AI/ML model training or retraining.
• Replicating a newly trained or newly updated model from the core data center to the edge or to the cloud to be deployed as part of an inferencing application.
• Copying data from an S3 data lake to a high-performance AI/ML training environment for use in the training of an AI/ML model.
• Copying data from a Hadoop data lake (through the Hadoop NFS Gateway) to a high-performance AI/ML training environment for use in the training of an AI/ML model.
• Saving a new version of a trained model to an S3 or Hadoop data lake for permanent storage.
• Copying NFS-accessible data from a legacy or non-NetApp system of record to a high-performance AI/ML training environment for use in the training of an AI/ML model.
To trigger a Cloud Sync replication update from within a Jupyter Notebook, perform the following tasks:
Note: Before you perform the exercises that are outlined in this section, we assume that you have already initiated the Cloud Sync relationship that you wish to trigger an update for. To initiate a relationship, visit cloudsync.netapp.com.
1. Connect to a Jupyter Notebook server's web interface. For instructions on how to provision a Jupyter Notebook server, see the section "Provision a Jupyter Notebook Workspace for Data Scientist or Developer Use."
2. Open an existing Python 3 notebook or create a new Python 3 notebook. The following example shows the creation of a new Python 3 notebook.


3. Add the following content to the notebook, update the variable values as stated in the instructions, and then run all cells. Alternatively, you can download an example Jupyter Notebook containing this content from NetApp's Kubeflow and Jupyter Examples GitHub repository. A minimal sketch of this content follows this step.
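The following Python cells are a minimal sketch of the kind of logic such a notebook can contain, assuming that the Cloud Sync REST API is called directly. The token endpoint, the client ID, the sync endpoint path, and all placeholder values shown here are assumptions; consult the Cloud Sync API documentation (available from cloudsync.netapp.com) and the example notebook in the GitHub repository for the authoritative calls.
import requests

## Replace these placeholder values with values from your environment
refreshToken = '<your refresh token>'          # Cloud Sync API refresh token (see https://services.cloud.netapp.com/refresh-token)
relationshipId = '<your relationship id>'      # ID of the Cloud Sync relationship that you want to update

# Exchange the refresh token for an access token (token endpoint and client ID are assumptions; see the Cloud Sync API docs)
tokenResponse = requests.post(
    'https://netapp-cloud-account.auth0.com/oauth/token',
    json={'grant_type': 'refresh_token', 'refresh_token': refreshToken, 'client_id': '<cloud central client id>'}
)
accessToken = tokenResponse.json()['access_token']
headers = {'Authorization': 'Bearer %s' % accessToken, 'Content-Type': 'application/json'}

# Trigger a sync for the relationship (endpoint path is an assumption; see the Cloud Sync API docs)
syncResponse = requests.put(
    'https://api.cloudsync.netapp.com/api/relationships/%s/sync' % relationshipId,
    headers=headers
)
print(syncResponse.status_code)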


Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning
To define and execute a new Kubeflow pipeline that takes advantage of NetApp Snapshot technology in order to integrate rapid and efficient dataset and model versioning and traceability into an end-to-end AI/ML model training workflow, perform the following tasks. For more information about Kubeflow pipelines, see the official Kubeflow documentation. Note that the example pipeline that is shown in this section only works with volumes that reside on ONTAP storage systems or software-defined instances.
1. Create a Kubernetes secret containing the username and password of the cluster admin account for the ONTAP cluster on which your volumes reside. This secret must be created in the kubeflow namespace because this is the namespace that pipelines are executed in. Note that you must replace username and password with your username and password when executing these commands, and you must use the output of the base64 commands in your secret definition accordingly.
$ echo -n 'username' | base64
dXNlcm5hbWU=
$ echo -n 'password' | base64
cGFzc3dvcmQ=
$ cat << EOF > ./secret-ontap-cluster-mgmt-account.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ontap-cluster-mgmt-account
  namespace: kubeflow
data:
  username: dXNlcm5hbWU=
  password: cGFzc3dvcmQ=
EOF
$ kubectl create -f ./secret-ontap-cluster-mgmt-account.yaml
secret/ontap-cluster-mgmt-account created
2. If the volume containing the data that you plan to use to train your model is not tied to a PVC in the kubeflow namespace, then you must import this volume into that namespace. Use the Trident volume import functionality to import this volume. The volume must be imported into the kubeflow namespace because this is the namespace that pipelines are executed in. If your dataset volume is already tied to a PVC in the kubeflow namespace, then you can skip this step. If you do not yet have a dataset volume, then you must provision one and then transfer your data to it. See the section "Provision a New Volume" for an example showing how to provision a new volume with Trident. The example commands that follow show the importing of an existing FlexVol volume, named dataset_vol, into the kubeflow namespace. For more information about PVCs, see the official Kubernetes documentation. For more information about the volume import functionality, see the Trident documentation. For a detailed example showing the importing of a volume using Trident, see the section "Import an Existing Volume."
$ cat << EOF > ./pvc-import-dataset-vol-kubeflow.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dataset-vol
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ontap-ai-flexvols-retain
EOF
$ tridentctl import volume ontap-ai-flexvols dataset_vol -f ./pvc-import-dataset-vol-kubeflow.yaml -n trident
+------------------------------------------+--------+--------------------------+----------+--------------------------------------+--------+---------+
|                   NAME                   |  SIZE  |      STORAGE CLASS       | PROTOCOL |             BACKEND UUID             | STATE  | MANAGED |
+------------------------------------------+--------+--------------------------+----------+--------------------------------------+--------+---------+
| pvc-3c70ad14-d88f-11e9-b5e2-00505681f3d9 | 10 TiB | ontap-ai-flexvols-retain | file     | 2942d386-afcf-462e-bf89-1d2aa3376a7b | online | true    |
+------------------------------------------+--------+--------------------------+----------+--------------------------------------+--------+---------+
$ kubectl get pvc -n kubeflow


NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
imagenet-benchmark-job-gblgq-kfpresults    Bound    pvc-a4e32212-d65c-11e9-a043-00505681a82d   1Gi        RWX            ontap-ai-flexvols-retain   2d19h
katib-mysql                                Bound    pvc-b07f293e-d028-11e9-9b9d-00505681a82d   10Gi       RWO            ontap-ai-flexvols-retain   10d
dataset-vol                                Bound    pvc-43b12235-f32e-4dc4-a7b8-88e90d935a12   10Ti       ROX            ontap-ai-flexvols-retain   8s
metadata-mysql                             Bound    pvc-b0f3f032-d028-11e9-9b9d-00505681a82d   10Gi       RWO            ontap-ai-flexvols-retain   10d
minio-pv-claim                             Bound    pvc-b22727ee-d028-11e9-9b9d-00505681a82d   20Gi       RWO            ontap-ai-flexvols-retain   10d
mysql-pv-claim                             Bound    pvc-b2429afd-d028-11e9-9b9d-00505681a82d   20Gi       RWO            ontap-ai-flexvols-retain   10d
3. If the volume on which you wish to store your trained model is not tied to a PVC in the kubeflow namespace, then you must import this volume into that namespace. Use the Trident volume import functionality to import this volume. The volume must be imported into the kubeflow namespace because this is the namespace that pipelines are executed in. If your trained model volume is already tied to a PVC in the kubeflow namespace, then you can skip this step. If you do not yet have a trained model volume, then you must provision one. See the section "Provision a New Volume" for an example showing how to provision a new volume with Trident. The example commands that follow show the importing of an existing FlexVol volume, named kfp_model_vol, into the kubeflow namespace. For more information about PVCs, see the official Kubernetes documentation. For more information about the volume import functionality, see the Trident documentation. For a detailed example showing the importing of a volume using Trident, see the section "Import an Existing Volume."
$ cat << EOF > ./pvc-import-kfp-model-vol-kubeflow.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: kfp-model-vol
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ontap-ai-flexvols-retain
EOF
$ tridentctl import volume ontap-ai-flexvols kfp_model_vol -f ./pvc-import-kfp-model-vol-kubeflow.yaml -n trident
+------------------------------------------+--------+--------------------------+----------+--------------------------------------+--------+---------+
|                   NAME                   |  SIZE  |      STORAGE CLASS       | PROTOCOL |             BACKEND UUID             | STATE  | MANAGED |
+------------------------------------------+--------+--------------------------+----------+--------------------------------------+--------+---------+
| pvc-3c70ad14-d88f-11e9-b5e2-00505681f3d9 | 10 TiB | ontap-ai-flexvols-retain | file     | 2942d386-afcf-462e-bf89-1d2aa3376a7b | online | true    |
+------------------------------------------+--------+--------------------------+----------+--------------------------------------+--------+---------+
$ kubectl get pvc -n kubeflow
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
imagenet-benchmark-job-gblgq-kfpresults    Bound    pvc-a4e32212-d65c-11e9-a043-00505681a82d   1Gi        RWX            ontap-ai-flexvols-retain   2d19h
katib-mysql                                Bound    pvc-b07f293e-d028-11e9-9b9d-00505681a82d   10Gi       RWO            ontap-ai-flexvols-retain   10d
kfp-model-vol                              Bound    pvc-236e893b-63b4-40d3-963b-e709b9b2816b   10Ti       ROX            ontap-ai-flexvols-retain   8s
metadata-mysql                             Bound    pvc-b0f3f032-d028-11e9-9b9d-00505681a82d   10Gi       RWO            ontap-ai-flexvols-retain   10d
minio-pv-claim                             Bound    pvc-b22727ee-d028-11e9-9b9d-00505681a82d   20Gi       RWO            ontap-ai-flexvols-retain   10d

mysql-pv-claim                             Bound    pvc-b2429afd-d028-11e9-9b9d-00505681a82d   20Gi       RWO            ontap-ai-flexvols-retain   10d
4. If you have not already done so, you must install the Kubeflow Pipelines SDK. See the official Kubeflow documentation for installation instructions.
5. Define your Kubeflow pipeline in Python using the Kubeflow Pipelines SDK. The example commands that follow show the creation of a pipeline definition script for a pipeline that accepts the following parameters at run-time and then executes the following steps. Modify the pipeline definition script as needed depending on your specific process. (A minimal illustrative sketch of such a pipeline definition appears after this procedure.)
Run-time parameters:
− ontap_cluster_mgmt_hostname: The host name or IP address of the ONTAP cluster on which your dataset and model volumes are stored.
− ontap_cluster_admin_acct_k8s_secret: The name of the Kubernetes secret that was created in step 1.
− ontap_verify_ssl_cert: Denotes whether to verify your cluster's SSL certificate when communicating with the ONTAP API (true/false).
− dataset_volume_pvc_existing: The name of the Kubernetes PersistentVolumeClaim (PVC) in the kubeflow namespace that is tied to the volume that contains the data that you want to use to train your model.
− dataset_volume_pv_existing: The name of the Kubernetes PersistentVolume (PV) object that corresponds to the dataset volume PVC. To get the name of the PV, you can run kubectl -n kubeflow get pvc. The name of the PV that corresponds to a given PVC can be found in the VOLUME column.
− trained_model_volume_pvc_existing: The name of the Kubernetes PersistentVolumeClaim (PVC) in the kubeflow namespace that is tied to the volume on which you want to store your trained model.
− trained_model_volume_pv_existing: The name of the Kubernetes PersistentVolume (PV) object that corresponds to the trained model volume PVC. To get the name of the PV, you can run kubectl -n kubeflow get pvc. The name of the PV that corresponds to a given PVC can be found in the VOLUME column.
− execute_data_prep_step__yes_or_no: Denotes whether you wish to execute a data prep step as part of this particular pipeline execution (yes/no).
− data_prep_step_container_image: The container image in which you wish to execute your data prep step.
− data_prep_step_command: The command that you want to execute as your data prep step.
− data_prep_step_dataset_volume_mountpoint: The mountpoint at which you want to mount your dataset volume for your data prep step.
− train_step_container_image: The container image in which you wish to execute your training step.
− train_step_command: The command that you want to execute as your training step.
− train_step_dataset_volume_mountpoint: The mountpoint at which you want to mount your dataset volume for your training step.
− train_step_model_volume_mountpoint: The mountpoint at which you want to mount your model volume for your training step.
− validation_step_container_image: The container image in which you wish to execute your validation step.
− validation_step_command: The command that you want to execute as your validation step.


− validation_step_dataset_volume_mountpoint: The mountpoint at which you want to mount your dataset volume for your validation step.
− validation_step_model_volume_mountpoint: The mountpoint at which you want to mount your model volume for your validation step.
Pipeline steps:
a. Optional: Execute a data prep step.
b. Trigger the creation of a Snapshot copy, using NetApp Snapshot technology, of your dataset volume. This Snapshot copy is created for traceability purposes. Each time that this pipeline workflow is executed, a Snapshot copy is created. Therefore, as long as the Snapshot copy is not deleted, it is always possible to trace a specific training run back to the exact training dataset that was used for that run.
c. Execute a training step.
d. Trigger the creation of a Snapshot copy, using NetApp Snapshot technology, of your trained model volume. This Snapshot copy is created for versioning purposes. Each time that this pipeline workflow is executed, a Snapshot copy is created. Therefore, for each individual training run, a read-only versioned copy of the resulting trained model is automatically saved.
e. Execute a validation step.
$ git clone https://github.com/NetApp/kubeflow_jupyter_pipeline.git
$ cd kubeflow_jupyter_pipeline/Pipelines/
$ vi ai-training-run.py
6. Execute the pipeline definition script that you created in step 5 to create a .yaml manifest for your pipeline.
$ python3 ai-training-run.py
$ ls ai-training-run.py.yaml
ai-training-run.py.yaml
7. From the Kubeflow central dashboard, click Pipelines in the main menu to navigate to the Kubeflow Pipelines administration page.

8. Click Upload Pipeline to upload your pipeline definition.


9. Choose the .yaml manifest for your pipeline that you created in step 6, give your pipeline a name, and click Upload.

10. You should now see your new pipeline in the list of pipelines on the pipeline administration page. Click your pipeline’s name to view it.


11. Review your pipeline to confirm that it looks correct.


12. Click Create run to run your pipeline.

13. You are now presented with a screen from which you can start a pipeline run. Create a name for the run, enter a description, choose an experiment to file the run under, and choose whether you want to initiate a one-off run or schedule a recurring run.


14. Define parameters for the run, and then click Start. In the following example, the default values are accepted for most parameters. Details for the volume that was imported into the kubeflow namespace in step 2 are entered for dataset_volume_pvc_existing and dataset_volume_pv_existing. Details for the volume that was imported into the kubeflow namespace in step 3 are entered for trained_model_volume_pvc_existing and trained_model_volume_pv_existing. Non-AI-related commands are entered for the data_prep_step_command, train_step_command, and validation_step_command parameters in order to plainly demonstrate the functionality of the pipeline. Note that you defined the default values for the parameters within your pipeline definition (see step 5).


15. You are now presented with a screen listing all runs that fall under the specific experiment. Click the name of the run that you just started to view it.


16. At this point, the run is likely still in progress.

17. Confirm that the run completed successfully. When the run is complete, every stage of the pipeline shows a green check-mark icon.


18. Click a specific stage, and then click Logs to view output for that stage.
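As promised in step 5, the following Python excerpt is a minimal, illustrative sketch of how a pipeline of this general shape can be expressed with the Kubeflow Pipelines SDK (kfp). It is not the NetApp pipeline definition itself; it simply shows how an existing Trident-managed PVC can be mounted into a pipeline step and how the compiled .yaml manifest is produced. The PVC name, image, command, and mountpoint are assumptions that you should adjust to your environment.
import kfp.dsl as dsl
import kfp.compiler as compiler

# Name of the existing PVC (in the kubeflow namespace) that is tied to your dataset volume.
# This value is an assumption; replace it to match your environment.
DATASET_VOLUME_PVC_EXISTING = 'dataset-vol'

@dsl.pipeline(name='ai-training-run-sketch', description='Illustrative sketch of a training pipeline step')
def ai_training_run_sketch(
    train_step_container_image: str = 'nvcr.io/nvidia/tensorflow:20.07-tf1-py3',
    train_step_command: str = 'echo "No training command entered"'
):
    # Wrap the existing Trident-managed PVC so that it can be mounted by a pipeline step
    dataset_volume = dsl.PipelineVolume(pvc=DATASET_VOLUME_PVC_EXISTING)

    # Training step: run the supplied command in the supplied container image with the dataset volume mounted
    train = dsl.ContainerOp(
        name='train-model',
        image=train_step_container_image,
        command=['sh', '-c'],
        arguments=[train_step_command],
        pvolumes={'/mnt/dataset': dataset_volume}
    )

if __name__ == '__main__':
    # Produce the .yaml manifest that is uploaded to Kubeflow in the procedure above
    compiler.Compiler().compile(ai_training_run_sketch, __file__ + '.yaml')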


Create a Kubeflow Pipeline to Rapidly Clone a Dataset for a Data Scientist Workspace
Perform the following tasks to define and execute a new Kubeflow pipeline that takes advantage of NetApp FlexClone technology to clone a dataset volume rapidly and efficiently and create a data scientist or developer workspace. For more information about Kubeflow pipelines, see the official Kubeflow documentation.
Note: The example Kubeflow pipeline that is detailed in this section is not compatible with FlexGroup volumes. At the time of this writing, FlexGroup volumes must be cloned by using ONTAP System Manager, the ONTAP CLI, or the ONTAP API, and then imported into the Kubernetes cluster. (An illustrative ONTAP CLI clone command appears after this procedure.) For details about importing a volume using Trident, see the section "Import an Existing Volume."
1. If you have not already done so, you must install the Kubeflow Pipelines SDK. For installation instructions, see the official Kubeflow documentation.
2. Define your Kubeflow pipeline in Python using the Kubeflow Pipelines SDK. The example commands that follow show the creation of a pipeline definition script for a pipeline that accepts the following parameters at run-time and then executes the following steps. Modify the pipeline definition script as needed depending on your specific process.


Run-time parameters:
− workspace_name: The name that you want to give to your new workspace.
− dataset_volume_pvc_existing: The name of the Kubernetes PersistentVolumeClaim (PVC) that corresponds to the dataset volume that you wish to clone.
− dataset_volume_pvc_existing_size: The size of the dataset volume that you wish to clone; for example, 10Gi, 100Gi, or 2Ti.
− trident_storage_class: The Kubernetes StorageClass that the dataset volume you wish to clone is associated with.
− jupyter_namespace: The namespace in which you intend to create a Jupyter Notebook workspace. For details about creating a Jupyter Notebook workspace, see the section "Provision a Jupyter Notebook Workspace for Data Scientist or Developer Use." The dataset clone that this pipeline creates is mountable in the Jupyter Notebook workspace. The existing dataset volume PVC that you wish to clone from (the value of the dataset_volume_pvc_existing parameter) must be in this same namespace.
Pipeline steps:
a. Trigger the creation of a clone, using NetApp FlexClone technology, of your dataset volume.
b. Print instructions for deploying an interactive Jupyter Notebook workspace that has access to the dataset clone.
$ git clone https://github.com/NetApp/kubeflow_jupyter_pipeline.git
$ cd kubeflow_jupyter_pipeline/Pipelines/
$ vi create-data-scientist-workspace.py
3. Execute the pipeline definition script that you created in step 2 to create a .yaml manifest for your pipeline.
$ python3 create-data-scientist-workspace.py
$ ls create-data-scientist-workspace.py.yaml
create-data-scientist-workspace.py.yaml
4. From the Kubeflow central dashboard, click Pipelines in the main menu to navigate to the Kubeflow Pipelines administration page.

5. Click Upload Pipeline to upload your pipeline definition.


6. Choose the .yaml file containing your pipeline definition that you created in step 3, give your pipeline a name, and click Upload.

7. You should now see your new pipeline in the list of pipelines on the pipeline administration page. Click your pipeline’s name to view it.


8. Review your pipeline to confirm that it looks correct.

9. Click Create run to run your pipeline.

10. You are now presented with a screen from which you can start a pipeline run. Create a name for the run, enter a description, select an experiment to file the run under, and select whether you want to initiate a one-off run or schedule a recurring run.


11. Define parameters for the run, and then click Start. Reference step 2 for details on the individual parameters.

12. You are now presented with a screen listing all runs that fall under the specific experiment. Click the name of the run that you just started to view it.


13. At this point, the run is likely still in progress.

14. Confirm that the run completed successfully. When the run is complete, every stage of the pipeline shows a green check-mark icon.


15. Click the dataset-clone-for-workspace stage, and then click Logs to view output for that stage.


16. Click the print-instructions stage, and then click Logs to view the outputted instructions. See the section “Provision a Jupyter Notebook Workspace for Data Scientist or Developer Use” for details on creating a Jupyter Notebook workspace.
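As noted at the beginning of this section, FlexGroup volumes must be cloned outside of this pipeline. The following ONTAP CLI command is a minimal sketch of such a clone operation; the SVM, volume, and junction path names are placeholders that you must replace with values from your environment. After the clone is created, import it into your Kubernetes cluster as described in the section "Import an Existing Volume."
cluster1::> volume clone create -vserver ontapai_svm -flexclone imagenet_clone -parent-volume imagenet_dataset -junction-path /imagenet_clone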

Create a Kubeflow Pipeline to Trigger a SnapMirror Volume Replication Update
You can define and execute a new Kubeflow pipeline that takes advantage of NetApp SnapMirror data replication technology to replicate the contents of a volume between different ONTAP clusters. This pipeline can be used to replicate data of any type between ONTAP clusters that might or might not be located at different sites or in different regions. Potential use cases include:
• Replicating newly acquired sensor data gathered at the edge back to the core data center or to the cloud to be used for AI/ML model training or retraining.
• Replicating a newly trained or newly updated model from the core data center to the edge or to the cloud to be deployed as part of an inferencing application.
For more information about Kubeflow pipelines, see the official Kubeflow documentation. Note that the example pipeline that is shown in this section only works with volumes that reside on ONTAP storage systems or software-defined instances.
To create a new Kubeflow pipeline to trigger a SnapMirror volume replication update, perform the following steps:
Note: Before you perform the exercises that are outlined in this section, we assume that you have already initiated an asynchronous SnapMirror relationship between the source and the destination volume according to standard configuration instructions. For details, refer to the official NetApp documentation.
1. If you have not already done so, create a Kubernetes secret containing the username and password of the cluster admin account for the ONTAP cluster on which your destination volume resides. This secret must be created in the kubeflow namespace because this is the namespace that pipelines are executed in. Replace username and password with your username and password when executing these commands, and use the output of the base64 commands in your secret definition accordingly.
$ echo -n 'username' | base64
dXNlcm5hbWU=
$ echo -n 'password' | base64
cGFzc3dvcmQ=
$ cat << EOF > ./secret-ontap-cluster-mgmt-account.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ontap-cluster-mgmt-account
  namespace: kubeflow
data:
  username: dXNlcm5hbWU=
  password: cGFzc3dvcmQ=
EOF
$ kubectl create -f ./secret-ontap-cluster-mgmt-account.yaml
secret/ontap-cluster-mgmt-account created
2. If you have not already done so, install the Kubeflow Pipelines SDK. See the official Kubeflow documentation for installation instructions.


3. Define your Kubeflow pipeline in Python using the Kubeflow Pipelines SDK. The example commands that follow show the creation of a pipeline definition script for a pipeline that accepts the following parameters at run-time and then executes the following steps. Modify the pipeline definition script as needed depending on your specific process. (A minimal sketch of the underlying ONTAP REST API call appears after this procedure.)
Pipeline steps:
a. Trigger a replication update for the specified asynchronous SnapMirror relationship.
Run-time parameters:
− ontap_cluster_mgmt_hostname: The host name or IP address of the ONTAP cluster on which the destination volume resides.
− ontap_cluster_admin_acct_k8s_secret: The name of the Kubernetes secret that was created in step 1.
− ontap_api_verify_ssl_cert: Denotes whether to verify your cluster's SSL certificate when communicating with the ONTAP API (yes/no).
− source_svm: The name of the SVM on which the source volume resides.
− source_volume: The name of the source volume (the volume that you are replicating from) on the source cluster.
− destination_svm: The name of the SVM on which the destination volume resides.
− destination_volume: The name of the destination volume (the volume that you are replicating to) on the destination cluster.
$ git clone https://github.com/NetApp/kubeflow_jupyter_pipeline.git
$ cd kubeflow_jupyter_pipeline/Pipelines/
$ vi replicate-data-snapmirror.py
4. Execute the pipeline definition script that you created in step 3 to create a .yaml manifest for your pipeline.
$ python3 replicate-data-snapmirror.py
$ ls replicate-data-snapmirror.py.yaml
replicate-data-snapmirror.py.yaml
5. Follow steps 7 through 18 from the section "Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning." Be sure to use the .yaml manifest that was created in the previous step (step 4) of this section instead of the manifest that was created in the section "Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning."
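As referenced in step 3 above, the pipeline step in this example ultimately triggers the update through the ONTAP REST API. The following Python excerpt is a minimal sketch of that call, assuming ONTAP 9.6 or later; the host name, credentials, and SVM/volume names are placeholders, and the pipeline definition in the NetApp repository remains the authoritative implementation.
import requests
requests.packages.urllib3.disable_warnings()

## Placeholder values; replace to match your environment
destClusterMgmtHostname = 'dest-cluster.example.com'
auth = ('admin', 'password')
destinationPath = 'dest_svm:dest_vol'   # <destination_svm>:<destination_volume>

# Look up the UUID of the SnapMirror relationship by its destination path
relationships = requests.get(
    'https://%s/api/snapmirror/relationships' % destClusterMgmtHostname,
    params={'destination.path': destinationPath},
    auth=auth, verify=False
).json()
relationshipUuid = relationships['records'][0]['uuid']

# Trigger a transfer (replication update) for the relationship
response = requests.post(
    'https://%s/api/snapmirror/relationships/%s/transfers' % (destClusterMgmtHostname, relationshipUuid),
    json={}, auth=auth, verify=False
)
print(response.status_code, response.text)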

Create a Kubeflow Pipeline to Trigger a Cloud Sync Replication Update
You can define and execute a new Kubeflow pipeline that takes advantage of NetApp Cloud Sync replication technology to replicate data to and from a variety of file and object storage platforms. Potential use cases include:
• Replicating newly acquired sensor data gathered at the edge back to the core data center or to the cloud to be used for AI/ML model training or retraining.
• Replicating a newly trained or newly updated model from the core data center to the edge or to the cloud to be deployed as part of an inferencing application.
• Copying data from an S3 data lake to a high-performance AI/ML training environment for use in the training of an AI/ML model.
• Copying data from a Hadoop data lake (through the Hadoop NFS Gateway) to a high-performance AI/ML training environment for use in the training of an AI/ML model.
• Saving a new version of a trained model to an S3 or Hadoop data lake for permanent storage.


• Copying NFS-accessible data from a legacy or non-NetApp system of record to a high-performance AI/ML training environment for use in the training of an AI/ML model.
For more information about Kubeflow pipelines, see the official Kubeflow documentation.
Note: The example pipeline that is shown in this section only works with volumes that reside on ONTAP storage systems or software-defined instances.
To create a new Kubeflow pipeline to trigger a Cloud Sync replication update, perform the following steps:
Note: Before you perform the exercises that are outlined in this section, we assume that you have already initiated the Cloud Sync relationship that you wish to trigger an update for. To initiate a relationship, visit cloudsync.netapp.com.
1. If you do not yet have a Cloud Sync API refresh token, access the following URL using your web browser to create one: https://services.cloud.netapp.com/refresh-token.
2. If you have not already done so, create a Kubernetes secret containing your Cloud Sync API refresh token. This secret must be created in the kubeflow namespace because this is the namespace that pipelines are executed in. Replace <your refresh token> with your refresh token when executing these commands, and use the output of the base64 command in your secret definition accordingly.
$ echo -n '<your refresh token>' | base64
PHlvdXIgcmVmcmVzaCB0b2tlbj4=
$ cat << EOF > ./secret-cloud-sync-refresh-token.yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-sync-refresh-token
  namespace: kubeflow
data:
  refreshToken: PHlvdXIgcmVmcmVzaCB0b2tlbj4=
EOF
$ kubectl create -f ./secret-cloud-sync-refresh-token.yaml
secret/cloud-sync-refresh-token created
3. If you have not already done so, install the Kubeflow Pipelines SDK. For installation instructions, see the official Kubeflow documentation.
4. Define your Kubeflow pipeline in Python using the Kubeflow Pipelines SDK. The example commands that follow show the creation of a pipeline definition script for a pipeline that accepts the following parameters at run-time and then executes the following steps. Modify the pipeline definition script as needed depending on your specific process.
Pipeline steps:
a. Trigger a replication update for the specified Cloud Sync relationship.
Run-time parameters:
− cloud_sync_relationship_id: The relationship ID of the Cloud Sync relationship for which you want to trigger an update. If you do not know the relationship ID, you can retrieve it by using the Jupyter Notebook that is included in the section "Trigger a Cloud Sync Replication Update from Within a Jupyter Notebook" or by directly calling the Relationships-v2 API.
− cloud_sync_refresh_token_k8s_secret: The name of the Kubernetes secret that was created in step 2.
$ git clone https://github.com/NetApp/kubeflow_jupyter_pipeline.git
$ cd kubeflow_jupyter_pipeline/Pipelines/
$ vi replicate-data-cloud-sync.py
5. Execute the pipeline definition script that you created in step 4 to create a .yaml manifest for your pipeline.
$ python3 replicate-data-cloud-sync.py
$ ls replicate-data-cloud-sync.py.yaml

replicate-data-cloud-sync.py.yaml
6. Follow steps 7 through 18 from the section "Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning." Be sure to use the .yaml manifest that was created in the previous step (step 5) of this section instead of the manifest that was created in the section "Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning."

Apache Airflow Deployment

NetApp recommends running Apache Airflow on top of Kubernetes. This section describes the tasks that you must complete to deploy Airflow in your Kubernetes cluster.
Note: It is possible to deploy Airflow on platforms other than Kubernetes, but doing so is outside the scope of this document.

Prerequisites
Before you perform the deployment exercise that is outlined in this section, we assume that you have already performed the following tasks:
1. You already have a working Kubernetes cluster.
2. You have already installed and configured NetApp Trident in your Kubernetes cluster as outlined in the section "NetApp Trident Deployment and Configuration."

Install Helm
Airflow is deployed using Helm, a popular package manager for Kubernetes. Before you deploy Airflow, you must install Helm on the deployment jump host by following the installation instructions in the official Helm documentation.
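For convenience, the following commands are one possible way to install Helm 3 on a Linux jump host, using the installer script published by the Helm project; refer to the official Helm documentation for the currently recommended procedure.
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
$ helm version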

Set Default Kubernetes StorageClass
Before you deploy Airflow, you must designate a default StorageClass within your Kubernetes cluster. The Airflow deployment process attempts to provision new persistent volumes using the default StorageClass. If no StorageClass is designated as the default StorageClass, then the deployment fails. To designate a default StorageClass within your cluster, follow the instructions outlined in the section "Set Default Kubernetes StorageClass." If you have already designated a default StorageClass within your cluster, then you can skip this step.
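For reference, if a Trident StorageClass already exists in your cluster, a command along the following lines marks it as the default. The StorageClass name ontap-ai-flexvols-retain is only an example taken from this document; substitute the name of a StorageClass that exists in your environment.
$ kubectl patch storageclass ontap-ai-flexvols-retain -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
$ kubectl get storageclass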

Use Helm to Deploy Airflow
To deploy Airflow in your Kubernetes cluster using Helm, perform the following tasks from the deployment jump host:
1. Deploy Airflow using Helm by following the deployment instructions for the official Airflow chart on the Helm Hub. The example commands that follow show the deployment of Airflow using Helm. Modify, add, or remove values in the custom-values.yaml file as needed depending on your environment and desired configuration.
$ cat << EOF > custom-values.yaml
###################################
# Airflow - Common Configs
###################################
airflow:
  ## the airflow executor type to use
  ##
  executor: "KubernetesExecutor"
  ## environment variables for the web/scheduler/worker Pods (for airflow configs)
  ##
  config:
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "False"
    AIRFLOW__KUBERNETES__GIT_REPO: "[email protected]:mboglesby/airflow-dev.git"
    AIRFLOW__KUBERNETES__GIT_BRANCH: master
    AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT: "/opt/airflow/dags"
    AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH: "repo/"
    AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME: "airflow-git-key"
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: "apache/airflow"
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: "1.10.12"
    AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"
    AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM: "airflow-k8s-exec-logs"

workers:
  enabled: false  # Celery workers

###################################
# Airflow - WebUI Configs
###################################
web:
  ## configs for the Service of the web Pods
  ##
  service:
    type: NodePort

###################################
# Airflow - Logs Configs
###################################
logs:
  persistence:
    enabled: true

###################################
# Airflow - DAGs Configs
###################################
dags:
  ## configs for the DAG git repository & sync container
  ##
  git:
    ## url of the git repository
    ##
    url: "[email protected]:mboglesby/airflow-dev.git"

    ## the branch/tag/sha1 which we clone
    ##
    ref: master

    ## the name of a pre-created secret containing files for ~/.ssh/
    ##
    ## NOTE:
    ## - this is ONLY RELEVANT for SSH git repos
    ## - the secret commonly includes files: id_rsa, id_rsa.pub, known_hosts
    ## - known_hosts is NOT NEEDED if `git.sshKeyscan` is true
    ##
    secret: "airflow-git-key-files"
    sshKeyscan: true

    ## the name of the private key file in your `git.secret`
    ##
    ## NOTE:
    ## - this is ONLY RELEVANT for PRIVATE SSH git repos
    ##
    privateKeyName: id_rsa

    ## the host name of the git repo
    ##
    ## NOTE:
    ## - this is ONLY REQUIRED for SSH git repos
    ##
    ## EXAMPLE:
    ##   repoHost: "github.com"
    ##
    repoHost: "github.com"

    ## the port of the git repo
    ##
    ## NOTE:
    ## - this is ONLY REQUIRED for SSH git repos
    ##
    repoPort: 22

    ## configs for the git-sync container
    ##
    gitSync:
      ## enable the git-sync sidecar container
      ##
      enabled: true

      ## the git sync interval in seconds
      ##
      refreshTime: 60
EOF
$ helm install "airflow" stable/airflow --version "7.10.1" --namespace "airflow" --values ./custom-values.yaml
NAME: airflow
LAST DEPLOYED: Mon Oct  5 18:32:11 2020
NAMESPACE: airflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Congratulations. You have just deployed Apache Airflow!

1. Get the Airflow Service URL by running these commands:
   export NODE_PORT=$(kubectl get --namespace airflow -o jsonpath="{.spec.ports[0].nodePort}" services airflow-web)
   export NODE_IP=$(kubectl get nodes --namespace airflow -o jsonpath="{.items[0].status.addresses[0].address}")
   echo http://$NODE_IP:$NODE_PORT/

2. Open Airflow in your web browser
2. Confirm that all Airflow pods are up and running.
$ kubectl -n airflow get pod
NAME                                 READY   STATUS    RESTARTS   AGE
airflow-postgresql-0                 1/1     Running   0          38m
airflow-redis-master-0               1/1     Running   0          38m
airflow-scheduler-7fb4bf56cc-g88z4   2/2     Running   2          38m
airflow-web-8f4bdf5fb-hhxr7          2/2     Running   1          38m
airflow-worker-0                     2/2     Running   0          38m
3. Obtain the Airflow web service URL by following the instructions that were printed to the console when you deployed Airflow using Helm in step 1.
$ export NODE_PORT=$(kubectl get --namespace airflow -o jsonpath="{.spec.ports[0].nodePort}" services airflow-web)
$ export NODE_IP=$(kubectl get nodes --namespace airflow -o jsonpath="{.items[0].status.addresses[0].address}")
$ echo http://$NODE_IP:$NODE_PORT/
http://10.61.188.112:30366/
4. Confirm that you can access the Airflow web service.


Example Apache Airflow Workflows

This section includes example Apache Airflow DAGs that highlight various NetApp data management capabilities and demonstrate how they can be implemented as part of an Airflow workflow. For more information on DAGs and for detailed instructions regarding how to define and execute them, refer to the official Airflow documentation.
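Because the example Airflow deployment in the preceding section syncs DAGs from a Git repository by using the git-sync sidecar, deploying one of the following example DAGs is a matter of committing the DAG definition file to that repository. The repository URL and file locations below are illustrative assumptions; use the repository and directory layout that you configured in your custom-values.yaml file.
$ git clone [email protected]:mboglesby/airflow-dev.git && cd airflow-dev
$ cp /path/to/ai-training-run.py .    # the location within the repo depends on your chart values
$ git add ai-training-run.py
$ git commit -m "Add AI training run DAG"
$ git push
# The git-sync sidecar picks up the new DAG within the configured refresh interval (60 seconds in the example custom-values.yaml).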

Implement an End-to-End AI Training Workflow with Built-in Traceability and Versioning
The example DAG outlined in this section implements a workflow that takes advantage of NetApp Snapshot technology to integrate rapid and efficient dataset and model versioning and traceability into an end-to-end AI/ML model training workflow.

Prerequisites
For this DAG to function correctly, you must complete the following prerequisites:
1. You must have created a connection in Airflow for your ONTAP system.


To manage connections in Airflow, navigate to Admin > Connections in the Airflow web service UI. The example screenshot that follows shows the creation of a connection for a specific ONTAP system. The following values are required:
− Conn ID. Unique name for the connection.
− Host. The host name or IP address of the ONTAP cluster on which your dataset and model volumes are stored.
− Login. Username of the cluster admin account for the ONTAP cluster on which your volumes reside.
− Password. Password of the cluster admin account for the ONTAP cluster on which your volumes reside.

2. There must be an existing PersistentVolumeClaim (PVC) in the airflow namespace that is tied to the volume that contains the data that you want to use to train your model.
3. There must be an existing PersistentVolumeClaim (PVC) in the airflow namespace that is tied to the volume on which you want to store your trained model. (A minimal example PVC manifest follows this list.)
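If these PVCs do not exist yet, you can either import existing volumes with Trident (see the section "Import an Existing Volume") or provision new ones. The following manifest is a minimal sketch of a new dataset PVC in the airflow namespace; the PVC name, size, and StorageClass are assumptions that you should adjust to your environment.
$ cat << EOF | kubectl create -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dataset-vol
  namespace: airflow
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: ontap-ai-flexvols-retain
EOF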

DAG Definition
The Python code excerpt that follows contains the definition for the example DAG. Before executing this example DAG in your environment, you must modify the parameter values in the DEFINE PARAMETERS section to match your environment.


# Airflow DAG Definition: AI Training Run
#
# Steps:
#   1. Data prep job
#   2. Dataset snapshot (for traceability)
#   3. Training job
#   4. Model snapshot (for versioning/baselining)
#   5. Inference validation job

from airflow.utils.dates import days_ago
from airflow.secrets import get_connections
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.contrib.kubernetes.pod import Resources
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount

##### DEFINE PARAMETERS: Modify parameter values in this section to match your environment #####

## Define default args for DAG
ai_training_run_dag_default_args = {
    'owner': 'NetApp'
}

## Define DAG details
ai_training_run_dag = DAG(
    dag_id='ai_training_run',
    default_args=ai_training_run_dag_default_args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['training']
)

## Define volume details (change values as necessary to match your environment)

# ONTAP system details
airflowConnectionName = 'ontap_ai'  # Name of the Airflow connection that contains connection details for your ONTAP system's cluster admin account
verifySSLCert = False  # Denotes whether or not to verify the SSL cert when calling the ONTAP API

# Dataset volume
dataset_volume_mount = VolumeMount(
    'dataset-volume',
    mount_path='/mnt/dataset',
    sub_path=None,
    read_only=False
)
dataset_volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'dataset-vol'
    }
}
dataset_volume = Volume(name='dataset-volume', configs=dataset_volume_config)
dataset_volume_pv_name = 'pvc-79e0855a-30a1-4f63-b34c-1029b1df49f6'

# Model volume
model_volume_mount = VolumeMount(
    'model-volume',
    mount_path='/mnt/model',
    sub_path=None,
    read_only=False
)
model_volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'airflow-model-vol'
    }
}
model_volume = Volume(name='model-volume', configs=model_volume_config)
model_volume_pv_name = 'pvc-b3e7cb62-2694-45a3-a56d-9fad6b1262e4'

## Define job details (change values as needed)

# Data prep step
data_prep_step_container_image = "ubuntu:bionic"
data_prep_step_command = ["echo", "'No data prep command entered'"]  # Replace this echo command with the data prep command that you wish to execute
data_prep_step_resources = {}  # Hint: To request that 1 GPU be allocated to job pod, change to: {'limit_gpu': 1}

# Training step
train_step_container_image = "nvcr.io/nvidia/tensorflow:20.07-tf1-py3"
train_step_command = ["echo", "'No training command entered'"]  # Replace this echo command with the training command that you wish to execute
train_step_resources = {}  # Hint: To request that 1 GPU be allocated to job pod, change to: {'limit_gpu': 1}

# Inference validation step
validate_step_container_image = "nvcr.io/nvidia/tensorflow:20.07-tf1-py3"
validate_step_command = ["echo", "'No inference validation command entered'"]  # Replace this echo command with the inference validation command that you wish to execute
validate_step_resources = {}  # Hint: To request that 1 GPU be allocated to job pod, change to: {'limit_gpu': 1}

################################################################################################

# Define function that triggers the creation of a NetApp snapshot
def netappSnapshot(**kwargs) -> str:
    # Parse args
    for key, value in kwargs.items():
        if key == 'pvName':
            pvName = value
        elif key == 'verifySSLCert':
            verifySSLCert = value
        elif key == 'airflowConnectionName':
            airflowConnectionName = value

    # Install netapp_ontap package
    import sys, subprocess
    result = subprocess.check_output([sys.executable, '-m', 'pip', 'install', '--user', 'netapp-ontap'])
    print(str(result).replace('\\n', '\n'))

    # Import needed functions/classes
    from netapp_ontap import config as netappConfig
    from netapp_ontap.host_connection import HostConnection as NetAppHostConnection
    from netapp_ontap.resources import Volume, Snapshot
    from datetime import datetime
    import json

    # Retrieve ONTAP cluster admin account details from Airflow connection
    connections = get_connections(conn_id = airflowConnectionName)
    ontapConnection = connections[0]  # Assumes that you only have one connection with the specified conn_id configured in Airflow
    ontapClusterAdminUsername = ontapConnection.login
    ontapClusterAdminPassword = ontapConnection.password
    ontapClusterMgmtHostname = ontapConnection.host

    # Configure connection to ONTAP cluster/instance
    netappConfig.CONNECTION = NetAppHostConnection(
        host = ontapClusterMgmtHostname,
        username = ontapClusterAdminUsername,
        password = ontapClusterAdminPassword,
        verify = verifySSLCert
    )

    # Convert pv name to ONTAP volume name
    # The following will not work if you specified a custom storagePrefix when creating your
    # Trident backend. If you specified a custom storagePrefix, you will need to update this
    # code to match your prefix.
    volumeName = 'trident_%s' % pvName.replace("-", "_")
    print('\npv name: ', pvName)
    print('ONTAP volume name: ', volumeName)

    # Create snapshot; print API response
    volume = Volume.find(name = volumeName)
    timestamp = datetime.today().strftime("%Y%m%d_%H%M%S")
    snapshot = Snapshot.from_dict({
        'name': 'airflow_%s' % timestamp,
        'comment': 'Snapshot created by an Apache Airflow DAG',
        'volume': volume.to_dict()
    })
    response = snapshot.post()
    print("\nAPI Response:")
    print(response.http_response.text)

    # Retrieve snapshot details
    snapshot.get()

    # Convert snapshot details to JSON string and print
    snapshotDetails = snapshot.to_dict()
    print("\nSnapshot Details:")
    print(json.dumps(snapshotDetails, indent=2))

    # Return name of newly created snapshot
    return snapshotDetails['name']

# Define DAG steps/workflow
with ai_training_run_dag as dag:

    # Define data prep step using Kubernetes Pod operator
    # (https://airflow.apache.org/docs/stable/kubernetes.html#kubernetespodoperator)
    data_prep = KubernetesPodOperator(
        namespace='airflow',
        image=data_prep_step_container_image,
        cmds=data_prep_step_command,
        resources = data_prep_step_resources,
        volumes=[dataset_volume, model_volume],
        volume_mounts=[dataset_volume_mount, model_volume_mount],
        name="ai-training-run-data-prep",
        task_id="data-prep",
        is_delete_operator_pod=True,
        hostnetwork=False
    )

    # Define step to take a snapshot of the dataset volume for traceability
    dataset_snapshot = PythonOperator(
        task_id='dataset-snapshot',
        python_callable=netappSnapshot,
        op_kwargs={
            'airflowConnectionName': airflowConnectionName,
            'pvName': dataset_volume_pv_name,
            'verifySSLCert': verifySSLCert
        },
        dag=dag
    )

    # State that the dataset snapshot should be created after the data prep job completes
    data_prep >> dataset_snapshot

    # Define training step using Kubernetes Pod operator
    # (https://airflow.apache.org/docs/stable/kubernetes.html#kubernetespodoperator)
    train = KubernetesPodOperator(
        namespace='airflow',
        image=train_step_container_image,
        cmds=train_step_command,
        resources = train_step_resources,
        volumes=[dataset_volume, model_volume],
        volume_mounts=[dataset_volume_mount, model_volume_mount],
        name="ai-training-run-train",
        task_id="train",
        is_delete_operator_pod=True,
        hostnetwork=False
    )

    # State that training job should be executed after dataset volume snapshot is taken
    dataset_snapshot >> train

    # Define step to take a snapshot of the model volume for versioning/baselining
    model_snapshot = PythonOperator(
        task_id='model-snapshot',
        python_callable=netappSnapshot,
        op_kwargs={
            'airflowConnectionName': airflowConnectionName,
            'pvName': model_volume_pv_name,
            'verifySSLCert': verifySSLCert
        },
        dag=dag
    )

    # State that the model snapshot should be created after the training job completes
    train >> model_snapshot

    # Define inference validation step using Kubernetes Pod operator
    # (https://airflow.apache.org/docs/stable/kubernetes.html#kubernetespodoperator)
    validate = KubernetesPodOperator(
        namespace='airflow',
        image=validate_step_container_image,
        cmds=validate_step_command,
        resources = validate_step_resources,
        volumes=[dataset_volume, model_volume],
        volume_mounts=[dataset_volume_mount, model_volume_mount],
        name="ai-training-run-validate",
        task_id="validate",
        is_delete_operator_pod=True,
        hostnetwork=False
    )

    # State that inference validation job should be executed after model volume snapshot is taken
    model_snapshot >> validate

Rapidly Clone a Dataset to create a Data Scientist Workspace
The example DAG outlined in this section implements a workflow that takes advantage of NetApp FlexClone technology to clone a dataset volume rapidly and efficiently and create a data scientist or developer workspace.

Prerequisites
For this DAG to function correctly, you must complete the following prerequisites:
1. You must have created a connection in Airflow for your ONTAP system as outlined in Prerequisite #1 in the section "Implement an End-to-End AI Training Workflow with Built-in Traceability and Versioning."
2. You must have created a connection in Airflow for a host that is accessible via SSH and on which tridentctl, the NetApp Trident management utility, is installed and configured to point to your Kubernetes cluster. To manage connections in Airflow, navigate to Admin > Connections in the Airflow web service UI. The example screenshot that follows shows the creation of a connection for a specific host on which tridentctl is installed and configured. The following values are required:
− Conn ID. Unique name for the connection.
− Conn Type. Must be set to 'SSH'.


− Host. The host name or IP address of the host.
− Login. Username to use when accessing the host via SSH.
− Password. Password to use when accessing the host via SSH.

3. There must be an existing PersistentVolumeClaim (PVC) within your Kubernetes cluster that is tied to the volume that contains the dataset that you wish to clone.

DAG Definition
The Python code excerpt that follows contains the definition for the example DAG. Before executing this example DAG in your environment, you must modify the parameter values in the DEFINE PARAMETERS section to match your environment.

# Airflow DAG Definition: Create Data Scientist Workspace
#
# Steps:
#   1. Clone source volume
#   2. Import clone into Kubernetes using Trident

from airflow.utils.dates import days_ago
from airflow.secrets import get_connections
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.ssh_operator import SSHOperator
from datetime import datetime

##### DEFINE PARAMETERS: Modify parameter values in this section to match your environment #####

## Define default args for DAG
create_data_scientist_workspace_dag_default_args = {
    'owner': 'NetApp'
}

## Define DAG details
create_data_scientist_workspace_dag = DAG(
    dag_id='create_data_scientist_workspace',
    default_args=create_data_scientist_workspace_dag_default_args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['dev-workspace']
)

## Define volume details (change values as necessary to match your environment)

# ONTAP system details
ontapAirflowConnectionName = 'ontap_ai'  # Name of the Airflow connection that contains connection details for your ONTAP system's cluster admin account
verifySSLCert = False  # Denotes whether or not to verify the SSL cert when calling the ONTAP API

# Source volume details
sourcePvName = 'pvc-79e0855a-30a1-4f63-b34c-1029b1df49f6'  # Name of Kubernetes PV corresponding to source volume

# Clone volume details (details for the new clone that you will be creating)
timestampForVolumeName = datetime.today().strftime("%Y%m%d_%H%M%S")
cloneVolumeName = 'airflow_clone_%s' % timestampForVolumeName
clonePvcNamespace = 'airflow'  # Kubernetes namespace that you want the new clone volume to be imported into

## Define tridentctl jumphost details (change values as necessary to match your environment)
tridentctlAirflowConnectionName = 'tridentctl_jumphost'  # Name of the Airflow connection of type 'ssh' that contains connection details for a jumphost on which tridentctl is installed

## Define Trident details
tridentStorageClass = 'ontap-flexvol'  # Kubernetes StorageClass that you want to use when importing the new clone volume
tridentNamespace = 'trident'  # Namespace that Trident is installed in
tridentBackend = 'ontap-flexvol'  # Trident backend that you want to use when importing the new clone volume

################################################################################################

# Define function that clones a NetApp volume
def netappClone(task_instance, **kwargs) -> str:
    # Parse args
    for key, value in kwargs.items():
        if key == 'sourcePvName':
            sourcePvName = value
        elif key == 'verifySSLCert':
            verifySSLCert = value
        elif key == 'airflowConnectionName':
            airflowConnectionName = value
        elif key == 'cloneVolumeName':
            cloneVolumeName = value

    # Install netapp_ontap package
    import sys, subprocess
    result = subprocess.check_output([sys.executable, '-m', 'pip', 'install', '--user', 'netapp-ontap'])
    print(str(result).replace('\\n', '\n'))

    # Import needed functions/classes
    from netapp_ontap import config as netappConfig
    from netapp_ontap.host_connection import HostConnection as NetAppHostConnection
    from netapp_ontap.resources import Volume, Snapshot
    from datetime import datetime
    import json

    # Retrieve ONTAP cluster admin account details from Airflow connection
    connections = get_connections(conn_id = airflowConnectionName)
    ontapConnection = connections[0]  # Assumes that you only have one connection with the specified conn_id configured in Airflow
    ontapClusterAdminUsername = ontapConnection.login
    ontapClusterAdminPassword = ontapConnection.password
    ontapClusterMgmtHostname = ontapConnection.host

    # Configure connection to ONTAP cluster/instance
    netappConfig.CONNECTION = NetAppHostConnection(
        host = ontapClusterMgmtHostname,
        username = ontapClusterAdminUsername,
        password = ontapClusterAdminPassword,
        verify = verifySSLCert
    )

    # Convert pv name to ONTAP volume name
    # The following will not work if you specified a custom storagePrefix when creating your
    # Trident backend. If you specified a custom storagePrefix, you will need to update this
    # code to match your prefix.
    sourceVolumeName = 'trident_%s' % sourcePvName.replace("-", "_")
    print('\nSource pv name: ', sourcePvName)
    print('Source ONTAP volume name: ', sourceVolumeName)

    # Create clone
    sourceVolume = Volume.find(name = sourceVolumeName)
    cloneVolume = Volume.from_dict({
        'name': cloneVolumeName,
        'svm': sourceVolume.to_dict()['svm'],
        'clone': {
            'is_flexclone': 'true',
            'parent_volume': sourceVolume.to_dict()
        },
        'nas': {
            'path': '/%s' % cloneVolumeName
        }
    })
    response = cloneVolume.post()
    print("\nAPI Response:")
    print(response.http_response.text)

    # Retrieve clone volume details
    cloneVolume.get()

    # Convert clone volume details to JSON string
    cloneVolumeDetails = cloneVolume.to_dict()
    print("\nClone Volume Details:")
    print(json.dumps(cloneVolumeDetails, indent=2))

    # Create PVC name that resembles volume name and push as XCom for future use
    task_instance.xcom_push(key = 'clone_pvc_name', value = cloneVolumeDetails['name'].replace('_', '-'))

    # Return name of new clone volume
    return cloneVolumeDetails['name']

# Define DAG steps/workflow
with create_data_scientist_workspace_dag as dag :

    # Define step to clone source volume
    clone_source = PythonOperator(
        task_id='clone-source',
        provide_context=True,
        python_callable=netappClone,
        op_kwargs={
            'airflowConnectionName': ontapAirflowConnectionName,
            'sourcePvName': sourcePvName,
            'verifySSLCert': verifySSLCert,
            'cloneVolumeName': cloneVolumeName
        },
        dag=dag
    )

    # Define step to import clone into Kubernetes using Trident
    cloneVolumeName = "{{ task_instance.xcom_pull(task_ids='clone-source', key='return_value') }}"
    clonePvcName = "{{ task_instance.xcom_pull(task_ids='clone-source', key='clone_pvc_name') }}"
    import_command = '''cat << EOD > import-pvc-%s.yaml && tridentctl -n %s import volume %s %s -f ./import-pvc-%s.yaml && rm -f import-pvc-%s.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: %s
  namespace: %s
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: %s
EOD''' % (clonePvcName, tridentNamespace, tridentBackend, cloneVolumeName, clonePvcName, clonePvcName, clonePvcName, clonePvcNamespace, tridentStorageClass)
    import_clone = SSHOperator(
        task_id="import-clone",
        command=import_command,
        ssh_conn_id=tridentctlAirflowConnectionName
    )

    # State that the import step should be executed after the initial clone step completes
    clone_source >> import_clone
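Once the DAG file has been placed in your Airflow DAGs folder, you can trigger it from the Airflow web UI or from the CLI. The commands below are a minimal sketch; the exact CLI syntax depends on your Airflow version.

# Airflow 1.10.x
$ airflow trigger_dag create_data_scientist_workspace

# Airflow 2.x
$ airflow dags trigger create_data_scientist_workspace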

Trigger a SnapMirror Volume Replication Update
The example DAG outlined in this section implements a workflow that takes advantage of NetApp SnapMirror data replication technology to replicate the contents of a volume between different ONTAP clusters. This DAG can be used to replicate data of any type between ONTAP clusters that might or might not be located at different sites or in different regions. Potential use cases include the following:
• Replicating newly acquired sensor data gathered at the edge back to the core data center or to the cloud to be used for AI/ML model training or retraining.
• Replicating a newly trained or newly updated model from the core data center to the edge or to the cloud to be deployed as part of an inferencing application.

Prerequisites
For this DAG to function correctly, you must complete the following prerequisites.
• You must have created a connection in Airflow for your ONTAP system as outlined in Prerequisite #1 in the section “Implement an End-to-End AI Training Workflow with Built-in Traceability and Versioning.”
• You must have already initiated an asynchronous SnapMirror relationship between the source and the destination volume according to standard configuration instructions. For details, refer to official NetApp documentation.
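If you want to confirm that the second prerequisite is in place before running the DAG, you can check the relationship from the destination cluster's ONTAP CLI. This is a hedged example that reuses the destination SVM and volume names from the parameters in the DAG definition that follows; substitute the values for your own relationship.

cluster::> snapmirror show -destination-path ai221_data:sm_dest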


DAG Definition
The Python code excerpt that follows contains the definition for the example DAG. Before executing this example DAG in your environment, you must modify the parameter values in the DEFINE PARAMETERS section to match your environment.

# Airflow DAG Definition: Replicate Data - SnapMirror
#
# Steps:
#   1. Trigger NetApp SnapMirror update

from airflow.utils.dates import days_ago
from airflow.secrets import get_connections
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

##### DEFINE PARAMETERS: Modify parameter values in this section to match your environment #####

## Define default args for DAG
replicate_data_snapmirror_dag_default_args = {
    'owner': 'NetApp'
}

## Define DAG details
replicate_data_snapmirror_dag = DAG(
    dag_id='replicate_data_snapmirror',
    default_args=replicate_data_snapmirror_dag_default_args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['data-movement']
)

## Define SnapMirror details (change values as necessary to match your environment)

# Destination ONTAP system details
airflowConnectionName = 'ontap_ai'  # Name of the Airflow connection that contains connection details for the destination ONTAP system's cluster admin account
verifySSLCert = False  # Denotes whether or not to verify the SSL cert when calling the ONTAP API

# SnapMirror relationship details (existing SnapMirror relationship for which you want to trigger an update)
sourceSvm = "ailab"
sourceVolume = "sm"
destinationSvm = "ai221_data"
destinationVolume = "sm_dest"

################################################################################################

# Define function that triggers a NetApp SnapMirror update
def netappSnapMirrorUpdate(**kwargs) -> int :
    # Parse args
    for key, value in kwargs.items() :
        if key == 'sourceSvm' :
            sourceSvm = value
        elif key == 'sourceVolume' :
            sourceVolume = value
        elif key == 'destinationSvm' :
            destinationSvm = value
        elif key == 'destinationVolume' :
            destinationVolume = value
        elif key == 'verifySSLCert' :
            verifySSLCert = value
        elif key == 'airflowConnectionName' :
            airflowConnectionName = value

    # Install ansible package
    import sys, subprocess, os
    print("Installing required Python modules:\n")
    result = subprocess.check_output([sys.executable, '-m', 'pip', 'install', '--user', 'ansible', 'netapp-lib'])
    print(str(result).replace('\\n', '\n'))

    # Retrieve ONTAP cluster admin account details from Airflow connection
    connections = get_connections(conn_id = airflowConnectionName)
    ontapConnection = connections[0]  # Assumes that you only have one connection with the specified conn_id configured in Airflow
    ontapClusterAdminUsername = ontapConnection.login
    ontapClusterAdminPassword = ontapConnection.password
    ontapClusterMgmtHostname = ontapConnection.host

    # Define temporary Ansible playbook for triggering SnapMirror update
    snapMirrorPlaybookContent = """
---
- name: "Trigger SnapMirror Update"
  hosts: localhost
  tasks:
  - name: update snapmirror
    na_ontap_snapmirror:
      state: present
      source_path: '%s:%s'
      destination_path: '%s:%s'
      hostname: '%s'
      username: '%s'
      password: '%s'
      https: 'yes'
      validate_certs: '%s'""" % (sourceSvm, sourceVolume, destinationSvm, destinationVolume, ontapClusterMgmtHostname, ontapClusterAdminUsername, ontapClusterAdminPassword, str(verifySSLCert))
    print("Creating temporary Ansible playbook.\n")
    snapMirrorPlaybookFilepath = "/home/airflow/snapmirror-update.yaml"
    snapMirrorPlaybookFile = open(snapMirrorPlaybookFilepath, "w")
    snapMirrorPlaybookFile.write(snapMirrorPlaybookContent)
    snapMirrorPlaybookFile.close()

    # Trigger SnapMirror update
    print("Executing Ansible playbook to trigger SnapMirror update:\n")
    try :
        result = subprocess.check_output(['ansible-playbook', snapMirrorPlaybookFilepath])
        print(str(result).replace('\\n', '\n'))
    except Exception as e :
        print("Exception:", str(e).strip())
        print("Removing temporary Ansible playbook.")
        os.remove(snapMirrorPlaybookFilepath)  # Remove temporary Ansible playbook before exiting
        raise

    # Remove temporary Ansible playbook before exiting
    print("Removing temporary Ansible playbook.\n")
    os.remove(snapMirrorPlaybookFilepath)

    # Return success code
    return 0

# Define DAG steps/workflow
with replicate_data_snapmirror_dag as dag :

    # Define step to trigger a NetApp SnapMirror update
    trigger_snapmirror = PythonOperator(
        task_id='trigger-snapmirror',
        python_callable=netappSnapMirrorUpdate,
        op_kwargs={
            'airflowConnectionName': airflowConnectionName,
            'verifySSLCert': verifySSLCert,
            'sourceSvm': sourceSvm,
            'sourceVolume': sourceVolume,
            'destinationSvm': destinationSvm,
            'destinationVolume': destinationVolume
        },
        dag=dag
    )
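For reference, the temporary playbook that the netappSnapMirrorUpdate function writes to /home/airflow/snapmirror-update.yaml would look roughly like the following after parameter substitution (credentials shown as placeholders):

---
- name: "Trigger SnapMirror Update"
  hosts: localhost
  tasks:
  - name: update snapmirror
    na_ontap_snapmirror:
      state: present
      source_path: 'ailab:sm'
      destination_path: 'ai221_data:sm_dest'
      hostname: '<ontapClusterMgmtHostname>'
      username: '<ontapClusterAdminUsername>'
      password: '<ontapClusterAdminPassword>'
      https: 'yes'
      validate_certs: 'False'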

Trigger a Cloud Sync Replication Update
The example DAG outlined in this section implements a workflow that takes advantage of NetApp Cloud Sync replication technology to replicate data to and from a variety of file and object storage platforms. Potential use cases include the following:
• Replicating newly acquired sensor data gathered at the edge back to the core data center or to the cloud to be used for AI/ML model training or retraining.
• Replicating a newly trained or newly updated model from the core data center to the edge or to the cloud to be deployed as part of an inferencing application.
• Copying data from an S3 data lake to a high-performance AI/ML training environment for use in the training of an AI/ML model.
• Copying data from a Hadoop data lake (through Hadoop NFS Gateway) to a high-performance AI/ML training environment for use in the training of an AI/ML model.
• Saving a new version of a trained model to an S3 or Hadoop data lake for permanent storage.
• Copying NFS-accessible data from a legacy or non-NetApp system of record to a high-performance AI/ML training environment for use in the training of an AI/ML model.

Prerequisites
For this DAG to function correctly, you must complete the following prerequisites.
1. You must have created a connection in Airflow for the NetApp Cloud Sync API. To manage connections in Airflow, navigate to Admin > Connections in the Airflow web service UI. The example screenshot that follows shows the creation of a connection for the Cloud Sync API. The following values are required:
− Conn ID. Unique name for the connection.
− Password. Your Cloud Sync API refresh token.


2. You must have already initiated the Cloud Sync relationship that you wish to trigger an update for. To initiate a relationship, visit cloudsync.netapp.com.
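If you need to look up the ID of an existing relationship for the relationshipId parameter in the DAG that follows, you can call the Cloud Sync API directly with your refresh token. The sketch below reuses the endpoints and client ID that appear in the example DAG code; the relationship-listing call (GET /api/relationships-v2 without an ID) is an assumption, so confirm it against the Cloud Sync API documentation.

# Exchange your refresh token for a short-lived access token
$ curl -s -X POST https://netapp-cloud-account.auth0.com/oauth/token \
    -H "Content-Type: application/json" \
    -d '{"grant_type": "refresh_token", "refresh_token": "<your-refresh-token>", "client_id": "Mu0V1ywgYteI6w1MbD15fKfVIUrNXGWC"}'

# Retrieve your account ID
$ curl -s https://cloudsync.netapp.com/api/accounts \
    -H "Authorization: Bearer <access-token>"

# List relationships (assumed endpoint) to find the relationship ID
$ curl -s https://cloudsync.netapp.com/api/relationships-v2 \
    -H "x-account-id: <account-id>" \
    -H "Authorization: Bearer <access-token>"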

DAG Definition
The Python code excerpt that follows contains the definition for the example DAG. Before executing this example DAG in your environment, you must modify the parameter values in the DEFINE PARAMETERS section to match your environment.

# Airflow DAG Definition: Replicate Data - Cloud Sync
#
# Steps:
#   1. Trigger NetApp Cloud Sync update

from airflow.utils.dates import days_ago
from airflow.secrets import get_connections
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

##### DEFINE PARAMETERS: Modify parameter values in this section to match your environment #####

## Define default args for DAG
replicate_data_cloud_sync_dag_default_args = {
    'owner': 'NetApp'
}

## Define DAG details
replicate_data_cloud_sync_dag = DAG(
    dag_id='replicate_data_cloud_sync',
    default_args=replicate_data_cloud_sync_dag_default_args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['data-movement']
)

## Define Cloud Sync details (change values as necessary to match your environment)

# Cloud Sync refresh token details
airflowConnectionName = 'cloud_sync'  # Name of the Airflow connection that contains your Cloud Sync refresh token

# Cloud Sync relationship details (existing Cloud Sync relationship for which you want to trigger an update)
relationshipId = '5ed00996ca85650009a83db2'

################################################################################################

## Function for triggering an update for a specific Cloud Sync relationship
def netappCloudSyncUpdate(**kwargs) :
    # Parse args
    printResponse = False  # Default value
    keepCheckingUntilComplete = True  # Default value
    for key, value in kwargs.items() :
        if key == 'relationshipId' :
            relationshipId = value
        elif key == 'printResponse' :
            printResponse = value
        elif key == 'keepCheckingUntilComplete' :
            keepCheckingUntilComplete = value
        elif key == 'airflowConnectionName' :
            airflowConnectionName = value

    # Install requests module
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'requests'])

    # Import needed modules
    import requests, json, time

    ## API response error class; objects of this class will be raised when an API response is not as expected
    class APIResponseError(Exception) :
        '''Error that will be raised when an API response is not as expected'''
        pass

    ## Generic function for printing an API response
    def printAPIResponse(response: requests.Response) :
        print("API Response:")
        print("Status Code: ", response.status_code)
        print("Header: ", response.headers)
        if response.text :
            print("Body: ", response.text)

    ## Function for obtaining access token and account ID for calling Cloud Sync API
    def netappCloudSyncAuth(refreshToken: str) :
        ## Step 1: Obtain limited time access token using refresh token

        # Define parameters for API call
        url = "https://netapp-cloud-account.auth0.com/oauth/token"
        headers = {
            "Content-Type": "application/json"
        }
        data = {
            "grant_type": "refresh_token",
            "refresh_token": refreshToken,
            "client_id": "Mu0V1ywgYteI6w1MbD15fKfVIUrNXGWC"
        }

        # Call API to obtain access token
        response = requests.post(url = url, headers = headers, data = json.dumps(data))

        # Parse response to retrieve access token
        try :
            responseBody = json.loads(response.text)
            accessToken = responseBody["access_token"]
        except :
            errorMessage = "Error obtaining access token from Cloud Sync API"
            raise APIResponseError(errorMessage, response)

        ## Step 2: Obtain account ID

        # Define parameters for API call
        url = "https://cloudsync.netapp.com/api/accounts"
        headers = {
            "Content-Type": "application/json",
            "Authorization": "Bearer " + accessToken
        }

        # Call API to obtain account ID
        response = requests.get(url = url, headers = headers)

        # Parse response to retrieve account ID
        try :
            responseBody = json.loads(response.text)
            accountId = responseBody[0]["accountId"]
        except :
            errorMessage = "Error obtaining account ID from Cloud Sync API"
            raise APIResponseError(errorMessage, response)

        # Return access token and account ID
        return accessToken, accountId

    ## Function for monitoring the progress of the latest update for a specific Cloud Sync relationship
    def netappCloudSyncMonitor(refreshToken: str, relationshipId: str, keepCheckingUntilComplete: bool = True, printProgress: bool = True, printResponses: bool = False) :
        # Step 1: Obtain access token and account ID for accessing Cloud Sync API
        try :
            accessToken, accountId = netappCloudSyncAuth(refreshToken = refreshToken)
        except APIResponseError as err :
            if printResponse :
                errorMessage = err.args[0]
                response = err.args[1]
                print(errorMessage)
                printAPIResponse(response)
            raise

        # Step 2: Obtain status of the latest update; optionally, keep checking until the latest update has completed
        while True :
            # Define parameters for API call
            url = "https://cloudsync.netapp.com/api/relationships-v2/%s" % (relationshipId)
            headers = {
                "Accept": "application/json",
                "x-account-id": accountId,
                "Authorization": "Bearer " + accessToken
            }

            # Call API to obtain status of latest update
            response = requests.get(url = url, headers = headers)

            # Print API response
            if printResponses :
                printAPIResponse(response)

            # Parse response to retrieve status of latest update
            try :
                responseBody = json.loads(response.text)
                latestActivityType = responseBody["activity"]["type"]
                latestActivityStatus = responseBody["activity"]["status"]
            except :
                errorMessage = "Error retrieving status of latest update from Cloud Sync API"
                raise APIResponseError(errorMessage, response)

            # End execution if the latest update is complete
            if latestActivityType == "Sync" and latestActivityStatus == "DONE" :
                if printProgress :
                    print("Success: Cloud Sync update is complete.")
                break

            # Print message re: progress
            if printProgress :
                print("Cloud Sync update is not yet complete.")

            # End execution if calling program doesn't want to monitor until the latest update has completed
            if not keepCheckingUntilComplete :
                break

            # Sleep for 60 seconds before checking progress again
            print("Checking again in 60 seconds...")
            time.sleep(60)

    # Retrieve Cloud Sync refresh token from Airflow connection
    connections = get_connections(conn_id = airflowConnectionName)
    cloudSyncConnection = connections[0]  # Assumes that you only have one connection with the specified conn_id configured in Airflow
    refreshToken = cloudSyncConnection.password

    # Step 1: Obtain access token and account ID for accessing Cloud Sync API
    try :
        accessToken, accountId = netappCloudSyncAuth(refreshToken = refreshToken)
    except APIResponseError as err :
        errorMessage = err.args[0]
        response = err.args[1]
        print(errorMessage)
        if printResponse :
            printAPIResponse(response)
        raise

    # Step 2: Trigger Cloud Sync update

    # Define parameters for API call
    url = "https://cloudsync.netapp.com/api/relationships/%s/sync" % (relationshipId)
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "x-account-id": accountId,
        "Authorization": "Bearer " + accessToken
    }

    # Call API to trigger update
    print("Triggering Cloud Sync update.")
    response = requests.put(url = url, headers = headers)

    # Check for API response status code of 202; if not 202, raise error
    if response.status_code != 202 :
        errorMessage = "Error calling Cloud Sync API to trigger update."
        if printResponse :
            print(errorMessage)
            printAPIResponse(response)
        raise APIResponseError(errorMessage, response)

    # Print API response
    if printResponse :
        print("Note: Status Code 202 denotes that update was successfully triggered.")
        printAPIResponse(response)

    print("Checking progress.")
    netappCloudSyncMonitor(refreshToken = refreshToken, relationshipId = relationshipId, keepCheckingUntilComplete = keepCheckingUntilComplete, printResponses = printResponse)

# Define DAG steps/workflow
with replicate_data_cloud_sync_dag as dag :

    # Define step to trigger a NetApp Cloud Sync update
    trigger_cloud_sync = PythonOperator(
        task_id='trigger-cloud-sync',
        python_callable=netappCloudSyncUpdate,
        op_kwargs={
            'airflowConnectionName': airflowConnectionName,
            'relationshipId': relationshipId
        },
        dag=dag
    )
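Because schedule_interval is set to None in this example, the DAG runs only when it is triggered manually or through the Airflow CLI or REST API. If you want Airflow to run the replication update on a schedule instead, you can replace schedule_interval with a cron expression; the nightly schedule below is purely illustrative.

## Define DAG details (hypothetical scheduled variant)
replicate_data_cloud_sync_dag = DAG(
    dag_id='replicate_data_cloud_sync',
    default_args=replicate_data_cloud_sync_dag_default_args,
    schedule_interval='0 2 * * *',  # Run nightly at 02:00 instead of manual-only (None)
    start_date=days_ago(2),
    tags=['data-movement']
)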

Trigger an XCP Copy or Sync Operation
The example DAG outlined in this section implements a workflow that invokes NetApp XCP to quickly and reliably replicate data between NFS endpoints. Potential use cases include the following:
• Replicating newly acquired sensor data gathered at the edge back to the core data center or to the cloud to be used for AI/ML model training or retraining.
• Replicating a newly trained or newly updated model from the core data center to the edge or to the cloud to be deployed as part of an inferencing application.
• Copying data from a Hadoop data lake (through Hadoop NFS Gateway) to a high-performance AI/ML training environment for use in the training of an AI/ML model.
• Copying NFS-accessible data from a legacy or non-NetApp system of record to a high-performance AI/ML training environment for use in the training of an AI/ML model.

Prerequisites
For this DAG to function correctly, you must complete the following prerequisites.
1. You must have created a connection in Airflow for a host that is accessible via SSH and on which NetApp XCP is installed and configured. For details regarding how to install and configure NetApp XCP, refer to the NetApp XCP homepage and the official NetApp XCP documentation. To manage connections in Airflow, navigate to Admin > Connections in the Airflow web service UI. The example screenshot that follows shows the creation of a connection for a specific host on which NetApp XCP is installed and configured. The following values are required:
− Conn ID. Unique name for the connection.
− Conn Type. Must be set to SSH.
− Host. The host name or IP address of the host.
− Login. Username to use when accessing the host via SSH.
− Password. Password to use when accessing the host via SSH.


DAG Definition
The Python code excerpt that follows contains the definition for the example DAG. Before executing this example DAG in your environment, you must modify the parameter values in the DEFINE PARAMETERS section to match your environment.

# Airflow DAG Definition: Replicate Data - XCP
#
# Steps:
#   1. Invoke NetApp XCP copy or sync operation

from airflow.utils.dates import days_ago
from airflow.secrets import get_connections
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.ssh_operator import SSHOperator
from datetime import datetime

##### DEFINE PARAMETERS: Modify parameter values in this section to match your environment #####

## Define default args for DAG
replicate_data_xcp_dag_default_args = {
    'owner': 'NetApp'
}

## Define DAG details
replicate_data_xcp_dag = DAG(
    dag_id='replicate_data_xcp',
    default_args=replicate_data_xcp_dag_default_args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['data-movement']
)

## Define xcp operation details (change values as necessary to match your environment and desired operation)

# Define xcp operation to perform
xcpOperation = 'sync'  # Must be 'copy' or 'sync'

# Define source and destination for copy operation
xcpCopySource = '192.168.200.41:/trident_pvc_957318e1_9b73_4e16_b857_dca7819dd263'
xcpCopyDestination = '192.168.200.41:/trident_pvc_9e7607c2_29c8_4dbf_9b08_551ba72d0273'

# Define catalog id for sync operation
xcpSyncId = 'autoname_copy_2020-10-06_16.37.44.963391'

## Define xcp host details (change values as necessary to match your environment)
xcpAirflowConnectionName = 'xcp_host'  # Name of the Airflow connection of type 'ssh' that contains connection details for a host on which xcp is installed, configured, and accessible within $PATH

################################################################################################

# Construct xcp command
xcpCommand = 'xcp help'
if xcpOperation == 'copy' :
    xcpCommand = 'xcp copy ' + xcpCopySource + ' ' + xcpCopyDestination
elif xcpOperation == 'sync' :
    xcpCommand = 'xcp sync -id ' + xcpSyncId

# Define DAG steps/workflow
with replicate_data_xcp_dag as dag :

    # Define step to invoke a NetApp XCP copy or sync operation
    invoke_xcp = SSHOperator(
        task_id="invoke-xcp",
        command=xcpCommand,
        ssh_conn_id=xcpAirflowConnectionName
    )
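With the example parameter values shown above, the command that the SSHOperator executes on the XCP host resolves to one of the following, depending on the value of xcpOperation:

# xcpOperation = 'copy'
xcp copy 192.168.200.41:/trident_pvc_957318e1_9b73_4e16_b857_dca7819dd263 192.168.200.41:/trident_pvc_9e7607c2_29c8_4dbf_9b08_551ba72d0273

# xcpOperation = 'sync'
xcp sync -id autoname_copy_2020-10-06_16.37.44.963391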

Example Basic Trident Operations

This section includes examples of various operations that you may want to perform on your Kubernetes cluster.

Import an Existing Volume
If there are existing volumes on your NetApp storage system/platform that you want to mount on containers within your Kubernetes cluster, but that are not tied to PVCs in the cluster, then you must import these volumes. You can use the Trident volume import functionality to import these volumes. The example commands that follow show the importing of the same volume, named pb_fg_all, twice, once for each Trident backend that was created in the example in the section “Example Trident Backends for ONTAP AI Deployments,” step 1. Importing the same volume twice in this manner enables you to mount the volume (an existing FlexGroup volume) multiple times across different LIFs, as described in the section “Example Trident Backends for ONTAP AI Deployments,” step 1. For more information about PVCs, see the official Kubernetes documentation. For more information about the volume import functionality, see the Trident documentation.
Note: An accessModes value of ReadOnlyMany is specified in the example PVC spec files. This value means that multiple pods can mount these volumes at the same time and that access will be read-only. For more information about the accessMode field, see the official Kubernetes documentation.
Note: The backend names that are specified in the following example import commands are highlighted for reference. These names correspond to the backends that were created in the example in the section “Example Trident Backends for ONTAP AI Deployments,” step 1.
Note: The StorageClass names that are specified in the following example PVC definition files are highlighted for reference. These names correspond to the StorageClasses that were created in the example in the section “Example Kubernetes StorageClasses for ONTAP AI Deployments,” step 1.

$ cat << EOF > ./pvc-import-pb_fg_all-iface1.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pb-fg-all-iface1
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ontap-ai-flexgroups-retain-iface1
EOF
$ tridentctl import volume ontap-ai-flexgroups-iface1 pb_fg_all -f ./pvc-import-pb_fg_all-iface1.yaml -n trident
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
|              NAME              |  SIZE  |           STORAGE CLASS           | PROTOCOL |             BACKEND UUID             | STATE  | MANAGED |
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
| default-pb-fg-all-iface1-7d9f1 | 10 TiB | ontap-ai-flexgroups-retain-iface1 | file     | b74cbddb-e0b8-40b7-b263-b6da6dec0bdd | online | true    |
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
$ cat << EOF > ./pvc-import-pb_fg_all-iface2.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pb-fg-all-iface2
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ontap-ai-flexgroups-retain-iface2
EOF
$ tridentctl import volume ontap-ai-flexgroups-iface2 pb_fg_all -f ./pvc-import-pb_fg_all-iface2.yaml -n trident
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
|              NAME              |  SIZE  |           STORAGE CLASS           | PROTOCOL |             BACKEND UUID             | STATE  | MANAGED |
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
| default-pb-fg-all-iface2-85aee | 10 TiB | ontap-ai-flexgroups-retain-iface2 | file     | 61814d48-c770-436b-9cb4-cf7ee661274d | online | true    |
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
$ tridentctl get volume -n trident
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
|              NAME              |  SIZE  |           STORAGE CLASS           | PROTOCOL |             BACKEND UUID             | STATE  | MANAGED |
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
| default-pb-fg-all-iface1-7d9f1 | 10 TiB | ontap-ai-flexgroups-retain-iface1 | file     | b74cbddb-e0b8-40b7-b263-b6da6dec0bdd | online | true    |
| default-pb-fg-all-iface2-85aee | 10 TiB | ontap-ai-flexgroups-retain-iface2 | file     | 61814d48-c770-436b-9cb4-cf7ee661274d | online | true    |
+--------------------------------+--------+-----------------------------------+----------+--------------------------------------+--------+---------+
$ kubectl get pvc
NAME               STATUS   VOLUME                           CAPACITY         ACCESS MODES   STORAGECLASS                        AGE
pb-fg-all-iface1   Bound    default-pb-fg-all-iface1-7d9f1   10995116277760   ROX            ontap-ai-flexgroups-retain-iface1   25h
pb-fg-all-iface2   Bound    default-pb-fg-all-iface2-85aee   10995116277760   ROX            ontap-ai-flexgroups-retain-iface2   25h

Provision a New Volume
You can use Trident to provision a new volume on your NetApp storage system or platform. The following example commands show the provisioning of a new FlexVol volume. In this example, the volume is provisioned using the StorageClass that was created in the example in the section “Example Kubernetes StorageClasses for ONTAP AI Deployments,” step 2.
Note: An accessModes value of ReadWriteMany is specified in the following example PVC definition file. This value means that multiple containers can mount this PVC at the same time and that access is read-write. For more information about the accessMode field, see the official Kubernetes documentation.

$ cat << EOF > ./pvc-tensorflow-results.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: tensorflow-results
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: ontap-ai-flexvols-retain
EOF
$ kubectl create -f ./pvc-tensorflow-results.yaml
persistentvolumeclaim/tensorflow-results created
$ kubectl get pvc
NAME                 STATUS   VOLUME                             CAPACITY         ACCESS MODES   STORAGECLASS                        AGE
pb-fg-all-iface1     Bound    default-pb-fg-all-iface1-7d9f1     10995116277760   ROX            ontap-ai-flexgroups-retain-iface1   26h
pb-fg-all-iface2     Bound    default-pb-fg-all-iface2-85aee     10995116277760   ROX            ontap-ai-flexgroups-retain-iface2   26h
tensorflow-results   Bound    default-tensorflow-results-2fd60   1073741824       RWX            ontap-ai-flexvols-retain            25h
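To confirm that the newly provisioned volume is usable, you can mount the PVC in a simple test pod. The following pod definition is a minimal sketch; the pod name, image, and mount path are hypothetical, while the claimName matches the PVC created above.

$ cat << EOF > ./test-pod-tensorflow-results.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-test-pod
spec:
  containers:
  - name: busybox
    image: busybox:1.31
    command: ["sh", "-c", "ls /mnt/results && sleep 3600"]
    volumeMounts:
    - mountPath: /mnt/results
      name: results
  volumes:
  - name: results
    persistentVolumeClaim:
      claimName: tensorflow-results
EOF
$ kubectl create -f ./test-pod-tensorflow-results.yaml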

Example High-performance Jobs for ONTAP AI Deployments

This section includes examples of various high-performance jobs that can be executed when the NetApp AI Control Plane solution is deployed on an ONTAP AI pod.

Execute a Single-Node AI Workload
To execute a single-node AI and ML job in your Kubernetes cluster, perform the following tasks from the deployment jump host. With Trident, you can quickly and easily make a data volume, potentially containing petabytes of data, accessible to a Kubernetes workload. To make such a data volume

88 NetApp AI Control Plane © 2020 NetApp, Inc. All Rights Reserved. accessible from within a Kubernetes pod, simply specify a PVC, such as one of the PVCs that was created in the example in the section “Import an Existing Volume,” in the pod definition. This step is a Kubernetes-native operation; no NetApp expertise is required. Note: This section assumes that you have already containerized (in the Docker container format) the specific AI and ML workload that you are attempting to execute in your Kubernetes cluster. 1. The following example commands show the creation of a Kubernetes job for a TensorFlow benchmark workload that uses the ImageNet dataset. For more information about the ImageNet dataset, see the ImageNet website. This example job requests eight GPUs and therefore can run on a single GPU worker node that features eight or more GPUs. This example job could be submitted in a cluster for which a worker node featuring eight or more GPUs is not present or is currently occupied with another workload. If so, then the job remains in a pending state until such a worker node becomes available. Additionally, to provide the required amount of storage bandwidth, the volume that contains the needed training data (the volume that was imported in the example in the section “Import an Existing Volume”) is mounted twice within the pod that this job creates. See the highlighted lines in the following job definition. See the section “Example Trident Backends for ONTAP AI Deployments”, step 1, for details about why you might want to mount the same data volume multiple times. The number of mounts that you need depends on the amount of bandwidth that the specific job requires. The volume that was created in the example in the section “Provision a New Volume” is also mounted in the pod. These volumes are referenced in the job definition by using the names of the PVCs. For more information about Kubernetes jobs, see the official Kubernetes documentation. An emptyDir volume with a medium value of Memory is mounted to /dev/shm in the pod that this example job creates. The default size of the /dev/shm virtual volume that is automatically created by the Docker container runtime can sometimes be insufficient for TensorFlow’s needs. Mounting an emptyDir volume as in the following example provides a sufficiently large /dev/shm virtual volume. For more information about emptyDir volumes, see the official Kubernetes documentation. The single container that is specified in this example job definition is given a securityContext > privileged value of true. This value means that the container effectively has root access on the host. This annotation is used in this case because the specific workload that is being executed requires root access. Specifically, a clear cache operation that the workload performs requires root access. Whether or not this privileged: true annotation is necessary depends on the requirements of the specific workload that you are executing. $ cat << EOF > ./netapp-tensorflow-single-imagenet.yaml apiVersion: batch/v1 kind: Job metadata: name: netapp-tensorflow-single-imagenet spec: backoffLimit: 5 template: spec: volumes: - name: dshm emptyDir: medium: Memory - name: testdata-iface1 persistentVolumeClaim: claimName: pb-fg-all-iface1 - name: testdata-iface2 persistentVolumeClaim: claimName: pb-fg-all-iface2 - name: results persistentVolumeClaim: claimName: tensorflow-results containers: - name: netapp-tensorflow-py2 image: netapp/tensorflow-py2:19.03.0


command: ["python", "/netapp/scripts/run.py", "-- dataset_dir=/mnt/mount_0/dataset/imagenet", "--dgx_version=dgx1", "--num_devices=8"] resources: limits: nvidia.com/gpu: 8 volumeMounts: - mountPath: /dev/shm name: dshm - mountPath: /mnt/mount_0 name: testdata-iface1 - mountPath: /mnt/mount_1 name: testdata-iface2 - mountPath: /tmp name: results securityContext: privileged: true restartPolicy: Never EOF $ kubectl create -f ./netapp-tensorflow-single-imagenet.yaml job.batch/netapp-tensorflow-single-imagenet created $ kubectl get jobs NAME COMPLETIONS DURATION AGE netapp-tensorflow-single-imagenet 0/1 24s 24s 2. Confirm that the job that you created in step 1 is running correctly. The following example command confirms that a single pod was created for the job, as specified in the job definition, and that this pod is currently running on one of the GPU worker nodes. $ kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE netapp-tensorflow-single-imagenet-m7x92 1/1 Running 0 3m 10.233.68.61 10.61.218.154 3. Confirm that the job that you created in step 1 completes successfully. The following example commands confirm that the job completed successfully. $ kubectl get jobs NAME COMPLETIONS DURATION AGE netapp-tensorflow-single-imagenet 1/1 5m42s 10m $ kubectl get pods NAME READY STATUS RESTARTS AGE netapp-tensorflow-single-imagenet-m7x92 0/1 Completed 0 11m $ kubectl logs netapp-tensorflow-single-imagenet-m7x92 [netapp-tensorflow-single-imagenet-m7x92:00008] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 702 [netapp-tensorflow-single-imagenet-m7x92:00008] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 711 Total images/sec = 6530.59125 ======Clean Cache !!! ======mpirun -allow-run-as-root -np 1 -H localhost:1 bash -c 'sync; echo 1 > /proc/sys/vm/drop_caches' ======mpirun -allow-run-as-root -np 8 -H localhost:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH python /netapp/tensorflow/benchmarks_190205/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py -- model=resnet50 --batch_size=256 --device=gpu --force_gpu_compatible=True --num_intra_threads=1 -- num_inter_threads=48 --variable_update=horovod --batch_group_size=20 --num_batches=500 -- nodistortions --num_gpus=1 --data_format=NCHW --use_fp16=True --use_tf_layers=False -- data_name=imagenet --use_datasets=True --data_dir=/mnt/mount_0/dataset/imagenet -- datasets_parallel_interleave_cycle_length=10 --datasets_sloppy_parallel_interleave=False -- num_mounts=2 --mount_prefix=/mnt/mount_%d --datasets_prefetch_buffer_size=2000 -- datasets_use_prefetch=True --datasets_num_private_threads=4 --horovod_device=gpu > /tmp/20190814_105450_tensorflow_horovod_rdma_resnet50_gpu_8_256_b500_imagenet_nodistort_fp16_r10_ m2_nockpt.txt 2>&1 4. Optional: Clean up job artifacts. The following example commands show the deletion of the job object that was created in step 1. When you delete the job object, Kubernetes automatically deletes any associated pods.


$ kubectl get jobs
NAME                                COMPLETIONS   DURATION   AGE
netapp-tensorflow-single-imagenet   1/1           5m42s      10m
$ kubectl get pods
NAME                                      READY   STATUS      RESTARTS   AGE
netapp-tensorflow-single-imagenet-m7x92   0/1     Completed   0          11m
$ kubectl delete job netapp-tensorflow-single-imagenet
job.batch "netapp-tensorflow-single-imagenet" deleted
$ kubectl get jobs
No resources found.
$ kubectl get pods
No resources found.
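If you are scripting this workflow, you can block until the job finishes (step 3) instead of polling kubectl get jobs manually. The following command is a hedged example; adjust the timeout to suit your training run.

$ kubectl wait --for=condition=complete --timeout=24h job/netapp-tensorflow-single-imagenet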

Execute a Synchronous Distributed AI Workload
To execute a synchronous multinode AI and ML job in your Kubernetes cluster, perform the following tasks on the deployment jump host. This process enables you to take advantage of data that is stored on a NetApp volume and to use more GPUs than a single worker node can provide. See Figure 9 for a visualization.
Note: Synchronous distributed jobs can help increase performance and training accuracy compared with asynchronous distributed jobs. A discussion of the pros and cons of synchronous jobs versus asynchronous jobs is outside the scope of this document.

Figure 9) Synchronous distributed AI job.

[Diagram: a Kubernetes (k8s) cluster with a master node (Kube API), GPU Node 1 running a Trident-mounted TensorFlow master, and GPU Node 2 running a Trident-mounted TensorFlow worker, all accessing the same data volume.]
1. The following example commands show the creation of one worker that participates in the synchronous distributed execution of the same TensorFlow benchmark job that was executed on a single node in the example in the section “Execute a Single-Node AI Workload.” In this specific example, only a single worker is deployed because the job is executed across two worker nodes. This example worker deployment requests eight GPUs and thus can run on a single GPU worker node that features eight or more GPUs. If your GPU worker nodes feature more than eight GPUs, to maximize performance, you might want to increase this number to be equal to the number of GPUs that your worker nodes feature. For more information about Kubernetes deployments, see the official Kubernetes documentation. A Kubernetes deployment is created in this example because this specific containerized worker would never complete on its own. Therefore, it doesn’t make sense to deploy it by using the Kubernetes job construct. If your worker is designed or written to complete on its own, then it might make sense to use the job construct to deploy your worker. The pod that is specified in this example deployment specification is given a hostNetwork value of true. This value means that the pod uses the host worker node’s networking stack instead of the virtual networking stack that Kubernetes usually creates for each pod. This annotation is used in this case because the specific workload relies on Open MPI, NCCL, and Horovod to execute the workload in a synchronous distributed manner. Therefore, it requires access to the host networking stack. A


discussion about Open MPI, NCCL, and Horovod is outside the scope of this document. Whether or not this hostNetwork: true annotation is necessary depends on the requirements of the specific workload that you are executing. For more information about the hostNetwork field, see the official Kubernetes documentation. $ cat << EOF > ./netapp-tensorflow-multi-imagenet-worker.yaml apiVersion: apps/v1 kind: Deployment metadata: name: netapp-tensorflow-multi-imagenet-worker spec: replicas: 1 selector: matchLabels: app: netapp-tensorflow-multi-imagenet-worker template: metadata: labels: app: netapp-tensorflow-multi-imagenet-worker spec: hostNetwork: true volumes: - name: dshm emptyDir: medium: Memory - name: testdata-iface1 persistentVolumeClaim: claimName: pb-fg-all-iface1 - name: testdata-iface2 persistentVolumeClaim: claimName: pb-fg-all-iface2 - name: results persistentVolumeClaim: claimName: tensorflow-results containers: - name: netapp-tensorflow-py2 image: netapp/tensorflow-py2:19.03.0 command: ["bash", "/netapp/scripts/start-slave-multi.sh", "22122"] resources: limits: nvidia.com/gpu: 8 volumeMounts: - mountPath: /dev/shm name: dshm - mountPath: /mnt/mount_0 name: testdata-iface1 - mountPath: /mnt/mount_1 name: testdata-iface2 - mountPath: /tmp name: results securityContext: privileged: true EOF $ kubectl create -f ./netapp-tensorflow-multi-imagenet-worker.yaml deployment.apps/netapp-tensorflow-multi-imagenet-worker created $ kubectl get deployments NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE netapp-tensorflow-multi-imagenet-worker 1 1 1 1 4s 2. Confirm that the worker deployment that you created in step 1 launched successfully. The following example commands confirm that a single worker pod was created for the deployment, as indicated in the deployment definition, and that this pod is currently running on one of the GPU worker nodes. $ kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE netapp-tensorflow-multi-imagenet-worker-654fc7f486-v6725 1/1 Running 0 60s 10.61.218.154 10.61.218.154 $ kubectl logs netapp-tensorflow-multi-imagenet-worker-654fc7f486-v6725


22122 3. Create a Kubernetes job for a master that kicks off, participates in, and tracks the execution of the synchronous multinode job. The following example commands create one master that kicks off, participates in, and tracks the synchronous distributed execution of the same TensorFlow benchmark job that was executed on a single node in the example in the section “Execute a Single-Node AI Workload.” This example master job requests eight GPUs and thus can run on a single GPU worker node that features eight or more GPUs. If your GPU worker nodes feature more than eight GPUs, to maximize performance, you might want to increase this number to be equal to the number of GPUs that your worker nodes feature. The master pod that is specified in this example job definition is given a hostNetwork value of true, just as the worker pod was given a hostNetwork value of true in step 1. See step 1 for details about why this value is necessary.

$ cat << EOF > ./netapp-tensorflow-multi-imagenet-master.yaml apiVersion: batch/v1 kind: Job metadata: name: netapp-tensorflow-multi-imagenet-master spec: backoffLimit: 5 template: spec: hostNetwork: true volumes: - name: dshm emptyDir: medium: Memory - name: testdata-iface1 persistentVolumeClaim: claimName: pb-fg-all-iface1 - name: testdata-iface2 persistentVolumeClaim: claimName: pb-fg-all-iface2 - name: results persistentVolumeClaim: claimName: tensorflow-results containers: - name: netapp-tensorflow-py2 image: netapp/tensorflow-py2:19.03.0 command: ["python", "/netapp/scripts/run.py", "-- dataset_dir=/mnt/mount_0/dataset/imagenet", "--port=22122", "--num_devices=16", "-- dgx_version=dgx1", "--nodes=10.61.218.152,10.61.218.154"] resources: limits: nvidia.com/gpu: 8 volumeMounts: - mountPath: /dev/shm name: dshm - mountPath: /mnt/mount_0 name: testdata-iface1 - mountPath: /mnt/mount_1 name: testdata-iface2 - mountPath: /tmp name: results securityContext: privileged: true restartPolicy: Never EOF $ kubectl create -f ./netapp-tensorflow-multi-imagenet-master.yaml job.batch/netapp-tensorflow-multi-imagenet-master created $ kubectl get jobs NAME COMPLETIONS DURATION AGE netapp-tensorflow-multi-imagenet-master 0/1 25s 25s


4. Confirm that the master job that you created in step 3 is running correctly. The following example command confirms that a single master pod was created for the job, as indicated in the job definition, and that this pod is currently running on one of the GPU worker nodes. You should also see that the worker pod that you originally saw in step 1 is still running and that the master and worker pods are running on different nodes. $ kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE netapp-tensorflow-multi-imagenet-master-ppwwj 1/1 Running 0 45s 10.61.218.152 10.61.218.152 netapp-tensorflow-multi-imagenet-worker-654fc7f486-v6725 1/1 Running 0 26m 10.61.218.154 10.61.218.154 5. Confirm that the master job that you created in step 3 completes successfully. The following example commands confirm that the job completed successfully. $ kubectl get jobs NAME COMPLETIONS DURATION AGE netapp-tensorflow-multi-imagenet-master 1/1 5m50s 9m18s $ kubectl get pods NAME READY STATUS RESTARTS AGE netapp-tensorflow-multi-imagenet-master-ppwwj 0/1 Completed 0 9m38s netapp-tensorflow-multi-imagenet-worker-654fc7f486-v6725 1/1 Running 0 35m $ kubectl logs netapp-tensorflow-multi-imagenet-master-ppwwj [10.61.218.152:00008] WARNING: local probe returned unhandled shell:unknown assuming bash rm: cannot remove '/lib': Is a directory [10.61.218.154:00033] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 702 [10.61.218.154:00033] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 711 [10.61.218.152:00008] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 702 [10.61.218.152:00008] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 711 Total images/sec = 12881.33875 ======Clean Cache !!! ======mpirun -allow-run-as-root -np 2 -H 10.61.218.152:1,10.61.218.154:1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include enp1s0f0 -mca plm_rsh_agent ssh -mca plm_rsh_args "-p 22122" bash -c 'sync; echo 1 > /proc/sys/vm/drop_caches' ======mpirun -allow-run-as-root -np 16 -H 10.61.218.152:8,10.61.218.154:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include enp1s0f0 -x NCCL_IB_HCA=mlx5 -x NCCL_NET_GDR_READ=1 -x NCCL_IB_SL=3 -x NCCL_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME=enp5s0.3091,enp12s0.3092,enp132s0.3093,enp139s0.3094 -x NCCL_IB_CUDA_SUPPORT=1 -mca orte_base_help_aggregate 0 -mca plm_rsh_agent ssh -mca plm_rsh_args "-p 22122" python /netapp/tensorflow/benchmarks_190205/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py -- model=resnet50 --batch_size=256 --device=gpu --force_gpu_compatible=True --num_intra_threads=1 -- num_inter_threads=48 --variable_update=horovod --batch_group_size=20 --num_batches=500 -- nodistortions --num_gpus=1 --data_format=NCHW --use_fp16=True --use_tf_layers=False -- data_name=imagenet --use_datasets=True --data_dir=/mnt/mount_0/dataset/imagenet -- datasets_parallel_interleave_cycle_length=10 --datasets_sloppy_parallel_interleave=False -- num_mounts=2 --mount_prefix=/mnt/mount_%d --datasets_prefetch_buffer_size=2000 -- datasets_use_prefetch=True --datasets_num_private_threads=4 --horovod_device=gpu > /tmp/20190814_161609_tensorflow_horovod_rdma_resnet50_gpu_16_256_b500_imagenet_nodistort_fp16_r10 _m2_nockpt.txt 2>&1 6. Delete the worker deployment when you no longer need it. The following example commands show the deletion of the worker deployment object that was created in step 1. 
When you delete the worker deployment object, Kubernetes automatically deletes any associated worker pods.

$ kubectl get deployments
NAME                                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
netapp-tensorflow-multi-imagenet-worker   1         1         1            1           43m
$ kubectl get pods
NAME                                                       READY   STATUS      RESTARTS   AGE
netapp-tensorflow-multi-imagenet-master-ppwwj              0/1     Completed   0          17m
netapp-tensorflow-multi-imagenet-worker-654fc7f486-v6725   1/1     Running     0          43m
$ kubectl delete deployment netapp-tensorflow-multi-imagenet-worker
deployment.extensions "netapp-tensorflow-multi-imagenet-worker" deleted
$ kubectl get deployments
No resources found.
$ kubectl get pods
NAME                                            READY   STATUS      RESTARTS   AGE
netapp-tensorflow-multi-imagenet-master-ppwwj   0/1     Completed   0          18m
7. Optional: Clean up the master job artifacts. The following example commands show the deletion of the master job object that was created in step 3. When you delete the master job object, Kubernetes automatically deletes any associated master pods.

$ kubectl get jobs
NAME                                      COMPLETIONS   DURATION   AGE
netapp-tensorflow-multi-imagenet-master   1/1           5m50s      19m
$ kubectl get pods
NAME                                            READY   STATUS      RESTARTS   AGE
netapp-tensorflow-multi-imagenet-master-ppwwj   0/1     Completed   0          19m
$ kubectl delete job netapp-tensorflow-multi-imagenet-master
job.batch "netapp-tensorflow-multi-imagenet-master" deleted
$ kubectl get jobs
No resources found.
$ kubectl get pods
No resources found.

Performance Testing

We performed a simple performance comparison as part of the creation of this solution. We executed several standard NetApp benchmarking jobs by using Kubernetes, and we compared the benchmark results with executions that were performed by using a simple Docker run command. We did not see any noticeable differences in performance. Therefore, we concluded that the use of Kubernetes to orchestrate containerized jobs does not adversely affect performance. Table 3 lists the results of our performance comparison.
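For context, the docker run side of this comparison looked conceptually like the following sketch. It reuses the container image and run.py arguments from the Kubernetes job definitions earlier in this document, but the host paths, --gpus flag, and shared-memory size are assumptions that depend on your Docker version and host layout, so treat it as illustrative rather than as the exact benchmark command.

$ docker run --rm --gpus 8 --shm-size=1g \
    -v /mnt/imagenet:/mnt/mount_0 \
    -v /mnt/results:/tmp \
    netapp/tensorflow-py2:19.03.0 \
    python /netapp/scripts/run.py --dataset_dir=/mnt/mount_0/dataset/imagenet --dgx_version=dgx1 --num_devices=8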

Table 3) Performance comparison results.
Benchmark                                     Dataset          Docker Run (images/sec)   Kubernetes (images/sec)
Single-node TensorFlow                        Synthetic data   6,667.2475                6,661.93125
Single-node TensorFlow                        ImageNet         6,570.2025                6,530.59125
Synchronous distributed two-node TensorFlow   Synthetic data   13,213.70625              13,218.288125
Synchronous distributed two-node TensorFlow   ImageNet         12,941.69125              12,881.33875

Conclusion

Companies and organizations of all sizes and across all industries are turning to artificial intelligence (AI), machine learning (ML), and deep learning (DL) to solve real-world problems, deliver innovative products and services, and to get an edge in an increasingly competitive marketplace. As organizations increase their use of AI, ML, and DL, they face many challenges, including workload scalability and data availability. These challenges can be addressed through the use of the NetApp AI Control Plane, NetApp’s full stack AI data and experiment management solution. This solution enables you to rapidly clone a data namespace just as you would a Git repo. Additionally, it allows you to define and implement AI, ML, and DL training workflows that incorporate the near-instant creation of data and model baselines for traceability and versioning. With this solution, you can trace every single model training run back to the exact dataset(s) that the model was trained and/or validated with. Lastly, this solution enables you to swiftly provision Jupyter Notebook workspaces with access to massive datasets.


Because this solution is targeted towards data scientists and data engineers, no NetApp or NetApp ONTAP expertise is required. With this solution, data management functions can be executed using simple and familiar tools and interfaces. Furthermore, this solution utilizes fully open-source and free components. Therefore, if you already have NetApp storage in your environment, you can implement this solution today. If you want to test drive this solution but you do not already have NetApp storage, visit cloud.netapp.com, and you can be up and running with a cloud-based NetApp storage solution in no time.

Acknowledgments
• David Arnette, Technical Marketing Engineer, NetApp
• Sung-Han Lin, Performance Analyst, NetApp
• Steve Guhr, Solutions Engineer, NetApp
• Muneer Ahmad, Solutions Architect, NetApp
• Santosh Rao, Senior Technical Director, NetApp
• Bala Ramesh, Technical Marketing Engineer, NetApp
• George Tehrani, Product Manager, NetApp

Where to Find Additional Information

To learn more about the information that is described in this document, see the following resources: • NVIDIA DGX-1 servers: − NVIDIA DGX-1 servers https://www.nvidia.com/en-us/data-center/dgx-1/ − NVIDIA Tesla V100 Tensor Core GPU https://www.nvidia.com/en-us/data-center/tesla-v100/ − NVIDIA GPU Cloud (NGC) https://www.nvidia.com/en-us/gpu-cloud/ • NetApp AFF systems: − AFF datasheet https://www.netapp.com/us/media/ds-3582.pdf − NetApp FlashAdvantage for AFF https://www.netapp.com/us/media/ds-3733.pdf − ONTAP 9.x documentation http://mysupport.netapp.com/documentation/productlibrary/index.html?productID=62286 − NetApp FlexGroup technical report https://www.netapp.com/us/media/tr-4557.pdf • NetApp persistent storage for containers: − NetApp Trident https://netapp.io/persistent-storage-provisioner-for-kubernetes/ • NetApp Interoperability Matrix: − NetApp Interoperability Matrix Tool http://support.netapp.com/matrix • ONTAP AI networking: − Cisco Nexus 3232C Switches https://www.cisco.com/c/en/us/products/switches/nexus-3232c-switch/index.html


− Mellanox Spectrum 2000 series switches http://www.mellanox.com/page/products_dyn?product_family=251&mtag=sn2000 • ML framework and tools: − DALI https://github.com/NVIDIA/DALI − TensorFlow: An Open-Source Machine Learning Framework for Everyone https://www.tensorflow.org/ − Horovod: Uber’s Open-Source Distributed Deep Learning Framework for TensorFlow https://eng.uber.com/horovod/ − Enabling GPUs in the Container Runtime Ecosystem https://devblogs.nvidia.com/gpu-containers-runtime/ − Docker https://docs.docker.com − Kubernetes https://kubernetes.io/docs/home/ − NVIDIA DeepOps https://github.com/NVIDIA/deepops − Kubeflow http://www.kubeflow.org/ − Jupyter Notebook Server http://www.jupyter.org/ • Dataset and benchmarks: − ImageNet http://www.image-net.org/ − COCO http://cocodataset.org/ − Cityscapes https://www.cityscapes-dataset.com/ − nuScenes www.nuscenes.org − SECOND: Sparsely Embedded Convolutional Detection model https://pdfs.semanticscholar.org/5125/a16039cabc6320c908a4764f32596e018ad3.pdf − TensorFlow benchmarks https://github.com/tensorflow/benchmarks

Version History

Version 1.0 (September 2019): Initial release.
Version 2.0 (September 2019): Added sections on triggering Snapshot copies/FlexClone volumes using kubectl commands (removed from document in version 3.0); added sections on Kubeflow (“NVIDIA DeepOps” and “Kubeflow”); added Figure 9; and updated DeepOps troubleshooting instructions.
Version 3.0 (March 2020): Added section on creating a Snapshot from within a Jupyter Notebook (“Create a Snapshot of an ONTAP Volume from Within a Jupyter Notebook”); added example Kubeflow pipelines (“Create a Kubeflow Pipeline to Execute an End-to-End AI Training Workflow with Built-in Traceability and Versioning” and “Create a Kubeflow Pipeline to Rapidly Clone a Dataset for a Data Scientist Workspace”); added NetApp Snapshot copies and NetApp FlexClone technology descriptions to the “Concepts and Components” section; reordered sections within the document; and removed sections on triggering Snapshot copies/FlexClone volumes using kubectl commands (due to Kubernetes API changes).
Version 4.0 (May 2020): Added example Kubeflow pipeline (“Create a Kubeflow Pipeline to Trigger a SnapMirror Volume Replication Update”); added NetApp SnapMirror technology description (“NetApp SnapMirror Data Replication Technology”); and updated Abstract and Introduction.
Version 5.0 (June 2020): Added example Jupyter Notebook (“Trigger a Cloud Sync Replication Update from Within a Jupyter Notebook”); added example Kubeflow pipeline (“Create a Kubeflow Pipeline to Trigger a Cloud Sync Replication Update”); updated example Kubeflow pipeline to use the Trident-based annotation cloning method (“Create a Kubeflow Pipeline to Rapidly Clone a Dataset for a Data Scientist Workspace”); added NetApp Cloud Sync technology description (“NetApp Cloud Sync”); added DeepOps option for deploying Trident (“Install Trident”); fixed formatting error in the section “Create a Kubeflow Pipeline to Trigger a SnapMirror Volume Replication Update”; and removed all references to NKS.
Version 6.0 (October 2020): Added Apache Airflow sections (“Apache Airflow,” “Apache Airflow Deployment,” and “Example Apache Airflow Workflows”); added references to the Git repo containing example Kubeflow pipelines and Jupyter Notebooks (“Example Kubeflow Operations and Tasks”); added NetApp XCP to “Concepts and Components;” and reworded the introduction.


Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer’s installation in accordance with published specifications. Copyright Information Copyright © 2020 NetApp, Inc. All Rights Reserved. Printed in the U.S. No part of this document covered by copyright may be reproduced in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or storage in an electronic retrieval system—without prior written permission of the copyright owner. Software derived from copyrighted NetApp material is subject to the following license and disclaimer: THIS SOFTWARE IS PROVIDED BY NETAPP “AS IS” AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. NetApp reserves the right to change any products described herein at any time, and without notice. NetApp assumes no responsibility or liability arising from the use of products described herein, except as expressly agreed to in writing by NetApp. The use or purchase of this product does not convey a license under any patent rights, trademark rights, or any other intellectual property rights of NetApp. The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications. Data contained herein pertains to a commercial item (as defined in FAR 2.101) and is proprietary to NetApp, Inc. The U.S. Government has a non-exclusive, non-transferrable, non-sublicensable, worldwide, limited irrevocable license to use the Data only in connection with and in support of the U.S. Government contract under which the Data was delivered. Except as provided herein, the Data may not be used, disclosed, reproduced, modified, performed, or displayed without the prior written approval of NetApp, Inc. United States Government license rights for the Department of Defense are limited to those rights identified in DFARS clause 252.227-7015(b). Trademark Information NETAPP, the NETAPP logo, and the marks listed at http://www.netapp.com/TM are trademarks of NetApp, Inc. Other company and product names may be trademarks of their respective owners. TR-4798-0720
